“Why Are Data Scientists Switching from R to Python?”

2020-02-16

It's a question I've been asked quite a few times lately, in one form or another. I ended up writing at some length about my personal experiences for a private discussion group, and thought the result was worth sharing. It's a bit of a “rough cut”.


Oh, R. I can’t tell you why “data scientists” have switched, for the same reason I can’t tell you how Santa’s reindeer achieve lift: I simply don’t believe in them. But I can tell you why I no longer use R for new projects.

I learned R a few years after I switched from Perl to Python for my “scripting” projects (a meaningless distinction, but those of us who knew at least one “real programming language” soothed our egos with it back then). R, especially before the 2010s, had a lot going for it. You want a few things to do exploratory data analysis (EDA), statistics, and model fitting well.

Bastard Son of Lisp

First, you want an interactive, stateful environment. In short, you need to be able to run a line of code, possibly storing the result for later use, and then decide - based on the result - what to do next. The best way to get there is for your language to provide an interactive interpreter, or “REPL” (read-eval-print loop - the terminology comes directly from the Lisp program which implements its most basic incarnation, Lisp as usual having pioneered relevant techniques for interactive computation an age before the rest of the world caught up). Python did have a REPL, but it was a poor one. In particular, editing source code in files and “hot reloading” them was quite difficult (it’s easier now, but still not great). IPython (the ancestor of today’s Jupyter) existed but was immature, and matplotlib was the only game in town for visualization (and a reliable trigger of Matlab PTSD for UT mechanical engineers, at least).

R was designed from the ground up as an interactive environment first and a programming language second. This foreshadows some of its failings, but it’s undeniable how much more effortless the core EDA loop was in R at the time. You had built-in support for reading delimited files or ODBC connections into in-memory tabular structures (data frames), and plotting with a nice embedded domain-specific language. Consider: plot(rate ~ time, col='green', lty='dashed', data=permian.wells) vs. … well, I still don’t know! I think a plt.show() is involved somewhere.

R was, and is, a “secret Lisp” - it’s pretty explicitly derived from some of the key ideas of Lisp. Namely, interactive interpretation, and metaprogramming-in-the-language. In more modern Lisps, we use macros for the latter, but R is a living fossil: the last Lisp to use fexprs for non-standard evaluation. In fact, all function arguments in R are lazily evaluated! This is the trick behind the nice syntax above, and behind oddities like

with(some.data.frame, {
    a.column <- another.column * 17
    another.column <- a.column - 6
})
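For readers who think in Python, here is a rough analogue of the trick: take an expression that has not been evaluated yet, and evaluate it in an environment whose variables are the data frame's columns. This is only a sketch under loud assumptions. Python has no fexprs, so the expression is passed as a string; the “data frame” is a plain dict of lists; and with_rows is a name invented for illustration, not an API from any library.

```python
def with_rows(frame, expression):
    """Evaluate `expression` once per row, with column names bound as variables.

    `frame` is a hypothetical stand-in for a data frame: a dict mapping
    column names to equal-length lists.
    """
    # Compile once; the caller's expression stays unevaluated until here.
    code = compile(expression, "<with_rows>", "eval")
    names = list(frame)
    results = []
    for row in zip(*(frame[name] for name in names)):
        # Each row becomes the local namespace, so `another_column` resolves
        # to that row's value rather than to a variable in the caller's scope.
        env = dict(zip(names, row))
        results.append(eval(code, {"__builtins__": {}}, env))
    return results


some_data_frame = {"another_column": [1, 2, 3]}
a_column = with_rows(some_data_frame, "another_column * 17")
print(a_column)  # [17, 34, 51]
```

The R version needs no string quoting because the callee receives the argument unevaluated; that is the part Python cannot imitate directly.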

R also bundled a tremendous number of statistical and machine learning functions and algorithms in its standard library. The CRAN (copied from Perl’s greatest lasting contribution to programming - the CPAN!) augmented this with hot-off-the-presses algorithms and superior visualization tools like ggplot2.

All this came together to make R extraordinarily powerful for the itinerant analyst. I could sit in a conference room with engineers and geologists and perform data transformations, generate visuals, and even interact with 3D models in real time. I am still not sure how to achieve the same “man-machine fusion” with Python.

I Come Here Not To Praise R But To Bury Him

So why do I swear on everything holy I’ll never use R for a new project? Because R is a phenomenal interactive statistics and visualization environment, but an absolutely terrible programming language - and it turns out the latter is more important in the long term!

“Data science” seems to me to be increasingly mislabeled - as it’s practiced, it’s really a sub-discipline of software development. Data exploration and model construction are important, but to deploy these techniques successfully we must engage in the task of building robust software to load data, transform it, construct models, and apply them. This means interacting with the “programmer’s world” of network protocols, data structures, static analysis, and automated testing; the “statistician’s world” of hypothesis tests, model fitting algorithms, and so forth occupies a relatively small portion of our time.

R suffers from a few core problems which cripple it as a tool for robust software development. These flow mainly from one root cause: R lacks conceptual integrity. It’s a Frankenstein’s monster, fused together from whatever the mad statisticians who designed it had at hand - or could steal from the fresh graves of dead Lisps - and animated by unholy science. A point of evidence: R has not one, but three object models (S3, S4, and Reference Classes, informally “R5”) - each, like the nine cities of Troy, built upon the ruins of the last.

R provides a paucity of the data structures we need to write quality software. Out of the box, we get homogeneous arrays (vectors and matrices), heterogeneous lists (with a half-baked facility for indexing by “name” to approximate an associative array - in other words, a warmed-over Lisp alist), and data frames (which are really just lists of columns). In principle, we can use lists to build associative arrays, trees, records, and the other structures we’d like, but the syntax (it’s hideous and error-prone) and semantics (it’s slow and error-prone) of the language conspire to make this not quite worth the effort.

By contrast, Python brilliantly provides arrays and hash tables out of the box (the latter, while commonplace now, was a phenomenal trick for those of us coming from C or Perl - Perl had hash tables, but the syntax was a chore). Python also makes it relatively easy to define classes for use as “bags of data” (i.e. records).
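A minimal sketch of those three pieces working together, using only the standard library. The Well record and its fields are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Well:
    """A record ("bag of data"); the fields are invented for illustration."""
    name: str
    depth_ft: float


# A list is a growable array with constant-time indexing.
wells = [Well("A-1", 9500.0), Well("B-2", 11200.0)]
wells.append(Well("C-3", 10400.0))

# A dict is a hash table: build an index from well name to record.
by_name = {well.name: well for well in wells}

deepest = max(wells, key=lambda well: well.depth_ft)
print(by_name["B-2"].depth_ft, deepest.name)  # 11200.0 B-2
```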

In addition to data structures, we need some common utilities in our standard libraries to be efficient practitioners of the informatic arts. R provides built-in functions for loading delimited text files, but any I/O that doesn’t fit that mold is beyond a chore. The “web world” of dynamically-templated HTML and JavaScript strings (hey, I’m not saying it’s great, but it’s the hell we’ve built for ourselves) is a challenge to interact with when the best string-building tool you have is paste0 (not to be confused with its Satanic cousin paste). Python, for good or ill, has always been “batteries included” - HTTP clients, string templating, and file I/O are either built-in or an import away.
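To make “an import away” concrete, here is a small sketch that reads delimited text, templates an HTML fragment, and emits JSON with nothing but the standard library. The input data and the template are made up for the example, and io.StringIO stands in for a real file. An HTTP client lives in the standard urllib.request module, left out here so the example runs offline.

```python
import csv
import io
import json
from string import Template

# Hypothetical delimited input; io.StringIO stands in for a real file.
raw = io.StringIO("well,rate\nA-1,120.5\nB-2,98.0\n")
rows = list(csv.DictReader(raw))

# String templating without paste0-style concatenation.
row_template = Template("<tr><td>$well</td><td>$rate</td></tr>")
html_rows = [row_template.substitute(row) for row in rows]

# Non-tabular output is one call away as well.
payload = json.dumps({"wells": [row["well"] for row in rows]})

print(html_rows[0])  # <tr><td>A-1</td><td>120.5</td></tr>
print(payload)       # {"wells": ["A-1", "B-2"]}
```

Note that string.Template does no escaping; anything headed for a real web page should pass through html.escape first.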

R became very difficult to reason about for programs beyond, say, a hundred or so lines. Performance quirks abounded, and R programmers who’ve been around longer than a few years remember the Matlab-style wars over for loops (slow) vs. the un-memorizable suite of apply functions (apply, sapply, lapply, mapply, tapply - don’t ask me which one did what!). Dynamic “typing” cripples the programmer’s ability to reason about large programs in either language, but anecdotally R seems to have a higher likelihood of turning minor syntax errors into running programs which silently do the wrong thing; Python’s a little less afraid to crash, or at least has a higher edit distance between valid programs. (As an example of this, I had an R program produce wrong results in production for two months[!] without being detected, because of an is (is.null(x)) { ... } instead of if (is.null(x)) { ... }.) The wide variety of static analyzers, automatic test runners, and other tools which have appeared in recent years is helping Python paper over its own difficulties here.
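For comparison, the same slip of the finger in Python never gets as far as running. A small sketch: the snippet strings are invented for the example, and compile is used only to ask the parser for its verdict without executing anything. It says nothing about how R treats the equivalent typo beyond the anecdote above.

```python
typo = "is (x is None):\n    x = 0\n"
correct = "if (x is None):\n    x = 0\n"


def compiles(source):
    """Return True if `source` parses as Python, without running it."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(compiles(typo), compiles(correct))  # False True
```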

Finally, the network effects really clinch the case. Python has become a sort of “lingua franca” which different disciplines can use to communicate. My wife and I always joke when traveling that the true language of world commerce is “broken English” - when a table of drunken Russians wants to complain to a waiter in Barcelona that their “bucket of mussels” turned out to be, well, a bucket of mussels (what did they expect?), the only common language they have is extremely rough English! Python fills the same role today; to be crude, we’ve got bad programmers who call themselves data scientists, bad programmers who call themselves data engineers, bad programmers who call themselves “DevOps”, and bad programmers who call themselves bosses - and the only common language they share is (broken) Python.

And What Rough Beast, Its Hour Come Round At Last, JITs Toward Bethlehem To Be Born

As you all have guessed, I really don’t care for the state of tools and practice in “data science”. Python is a damn sight better than R for building robust software, but that’s a low bar to clear - and in a few key ways it’s actually a regression for interactive analysis. So what would I rather see, and what do I think we will actually see, in coming years?

First, I’d rather live in the universe where we built these tools around expressive static type systems. I find it hard verging on impossible to reason about even mid-sized systems (say, 10 to 30 thousand lines of code) when all the key logical invariants and specifications are “in my head”. Types give us a form of machine-checkable documentation of a programmer’s intent - they force us to honestly grapple with the problem of formalizing the informal, and provide a feedback loop which catches fundamental flaws in logic earlier rather than later. It’s true that the common-or-garden variety type systems programmers generally know (say, from suffering through a “Java school”) are poor fits for typical data science tasks. We’d like to be able to reason about things like array dimensions, heterogeneous records with structures only known at runtime, and data provenance (where did this value come from?) in the type system. I’ll just hand-wave here in the direction of dependent types, row polymorphism, and nominal types as solutions to these three problems - but I’d encourage any of you to go down the rabbit-hole.
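Python's optional type hints are a pale shadow of the systems named above, but they are enough to sketch the nominal-types idea for data provenance: two types with the same representation that a checker refuses to mix. The type names and the cleaning rule below are invented for illustration, and the guarantee only exists if a static checker such as mypy is run over the code.

```python
from typing import NewType

# Two nominal types over the same representation (float). The names are
# invented for illustration.
RawRate = NewType("RawRate", float)
CleanRate = NewType("CleanRate", float)


def clean(reading: RawRate) -> CleanRate:
    """Clamp negative sensor noise to zero and mark the value as cleaned."""
    return CleanRate(max(0.0, reading))


def mean_rate(readings: list[CleanRate]) -> float:
    """Only accepts values whose type says they passed through clean()."""
    return sum(readings) / len(readings)


raw = [RawRate(-0.5), RawRate(2.0), RawRate(4.0)]
cleaned = [clean(r) for r in raw]
print(mean_rate(cleaned))  # 2.0

# mean_rate(raw) would be flagged by a static checker, although Python
# itself would run it without complaint.
```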

So, in my dream world, perhaps we’d do most of our analysis in a language combining a powerful type system with an interactive REPL. Languages like Haskell or Idris are getting close, but the libraries aren’t there for data science, and things like dependent types do come with a cost - of both cognitive effort and CPU time.

Let’s grant - just for the sake of argument - that “data scientists” want dynamic languages - or, at least, that they’re unwilling to learn dependent type systems, and unlikely to derive much benefit from less expressive type disciplines. Perhaps the best we could do for them would be to provide a powerfully interactive environment, with strong runtime checking, a vast library ecosystem, and accessible metaprogramming tools. Something, perhaps, like Common Lisp! If “we all do data science in Idris 2.0” is the pie-in-the-sky world, the “Lisp is for data scientists” world was just an accident of history away. R itself is descended from Lisp-based statistical systems which boomed in the 80s and died in the “AI winter” of the 90s; as recently as 10 or 15 years ago a project called “Incanter” tried to resurrect Lisp-for-stats on the JVM.

Well, what about the world we do have? I see a lot of promise in another Lisp descendant - Julia. Julia takes from Lisp its metaprogramming system (macros with quasiquotation) and its approach to powerful type-directed multiple dispatch polymorphism (generic functions), couples them with a “traditional” math-language syntax to make R and Matlab users feel comfortable, and welds them to the LLVM compiler backend to perform type-driven compilation. The performance is great, the semantics are sensible even if “dynamic”, and the library ecosystem is growing. I can get over the 1-based array indices… eventually.
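For anyone who hasn't met multiple dispatch, here is a toy emulation in Python: the implementation is chosen from the runtime types of all the arguments, not just the first. Everything here (register, combine, the registry) is invented for the sketch. It takes the first registered match rather than the most specific one, and it does none of the type-driven compilation that makes the real thing fast.

```python
_implementations = []  # (argument types, function) pairs, in registration order


def register(*types):
    """Register an implementation of combine() for one combination of types."""
    def decorator(func):
        _implementations.append((types, func))
        return func
    return decorator


def combine(a, b):
    """Dispatch on the runtime types of both arguments, not just the first."""
    for (type_a, type_b), func in _implementations:
        if isinstance(a, type_a) and isinstance(b, type_b):
            return func(a, b)
    raise TypeError(f"no combine() for {type(a).__name__}, {type(b).__name__}")


@register(float, float)
def _add_numbers(a, b):
    return a + b


@register(list, float)
def _scale_list(a, b):
    return [x * b for x in a]


@register(list, list)
def _add_elementwise(a, b):
    return [x + y for x, y in zip(a, b)]


print(combine(1.5, 2.0), combine([1.0, 2.0], 3.0), combine([1.0], [2.0]))
# 3.5 [3.0, 6.0] [3.0]
```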

I think the “right answer” is to split these projects and systems into layers. We want a solid, robust core, designed to last years and provide reliable performance and guaranteed correctness. For this “infrastructure” I again assert that we must use a static type system - and as much additional static analysis as possible - or face years of debugging “it worked in testing”. You know my thoughts on modern typed “systems programming” languages - I think Rust is the obvious next big thing here, but you could do worse than “modern C++” as well. For systems with looser performance requirements, Haskell or OCaml may make sense - for example, when implementing I/O-bound network services with elaborate invariants.

On top of this, we need to provide tools for low-friction interactive use. I think we’ve room to grow in adapting static typing and JIT compilation to REPL environments, but for now tools like Julia and Python serve well in this niche. The metaprogramming tools in Julia seem to be an opportunity to build the kinds of embedded domain-specific languages which made R so nice for interactive use, in a more principled way. I’d like to see more of that happen in the future.