Which is the most popular language for data science in 2017?

While performing a market analysis for a business case, I looked at some trends in the field and ran a quick query to answer the above question "for fun". Since the result was both striking and rather unexpected, I decided to share it with you in this blog post. (Plus, it gave me an excuse to add support for Bokeh to my blog.)

Background

In the past, a lot of articles have been written comparing different programming languages and tools for data science. Depending on exactly what was studied and how the data was collected, different results were published. I think it is fair to say that SAS was the largest analytics program in the past, but it was overtaken by open source alternatives, most notably R. Historically, Python has been used for much the same tasks as R, but did not have quite as large a following in the data science field.

A trend that cannot have escaped anyone, at least not readers of this blog, is that the topic of data science has completely exploded in the last few years. It has gone from something that only geeks do for fun to a hot boardroom question. So which programming language is the lingua franca of data science today?

What do we mean by "popular" anyway?

Previous studies have looked at mentions in online job postings. That is a good measure of demand for skills. However, it is difficult to obtain an unbiased result, since Python is used by many people who have nothing to do with data science. Also, the words in a job description measure how popular some topics are in the minds of HR, but I have some doubts that HR really knows what the role requires, and postings are often based on outdated assumptions.

There are many other metrics we could use to measure popularity, e.g., activity on Stack Overflow, questions on Quora, number of blog posts, discussions on Twitter, etc. Ultimately it comes down to what we mean by popularity.

What is measured

With that said, I decided to extract data on what people have been searching for on Google. The data source was Google Trends, and I pulled all data since the beginning of time until 18 August 2017. To ensure that we only capture searches relevant to data science, the search terms were "r data science", "python data science" and "sas data science". This was the result:

The first thing we notice is that starting around 2013 there was a significant increase in searches for all three tools. That is the rise of data science. However, we already knew that there is significant hype around data science.

A more interesting thing to note is that Python has been growing faster than R during the rise of data science, and that Python became the most popular data science language in 2016.
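Once the series are exported from Google Trends, a crossover like this can be located mechanically. Here is a minimal sketch, assuming yearly averages of relative search interest. The numbers are invented placeholders, not the data behind the figure, and `first_crossover` is a hypothetical helper rather than anything used for this post:

```python
from typing import Optional, Sequence


def first_crossover(periods: Sequence[int],
                    challenger: Sequence[float],
                    incumbent: Sequence[float]) -> Optional[int]:
    """Return the first period where challenger exceeds incumbent
    after having been at or below it; None if that never happens."""
    was_behind = False
    for period, c, i in zip(periods, challenger, incumbent):
        if c <= i:
            was_behind = True
        elif was_behind:
            return period
    return None


# Placeholder values for illustration only (NOT real Google Trends data).
years = [2013, 2014, 2015, 2016, 2017]
r_interest = [10.0, 22.0, 38.0, 50.0, 60.0]
python_interest = [6.0, 15.0, 33.0, 58.0, 95.0]

crossover_year = first_crossover(years, python_interest, r_interest)
print(crossover_year)
```

Requiring the challenger to have been behind first avoids reporting a "crossover" for a series that simply led from the start.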

The most popular language for data science in 2017

Looking at the market shares of the three languages, the trend still holds. Python is now at about 59%, while R has a relative popularity of about 36%. Meanwhile, SAS has a loyal user base which shows a less rapid decline than R and is now at approximately 5%. However, these numbers do not account for the fact that data scientists often use more than one of these tools. I personally use all three, depending on the task at hand and external factors.
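One plausible way to turn relative search interest into "market shares" like these is to divide each term's interest by the sum over all three terms. The sketch below assumes that definition. The input values are made-up placeholders, not the actual Google Trends numbers, and `market_shares` is a hypothetical helper:

```python
from typing import Dict


def market_shares(interest: Dict[str, float]) -> Dict[str, float]:
    """Convert relative search interest per term into percentage shares."""
    total = sum(interest.values())
    if total <= 0:
        raise ValueError("total search interest must be positive")
    return {term: 100.0 * value / total for term, value in interest.items()}


# Placeholder values for illustration only (NOT real Google Trends data).
latest_interest = {"python": 80.0, "r": 50.0, "sas": 10.0}

shares = market_shares(latest_interest)
for term, share in shares.items():
    print(f"{term}: {share:.1f}%")
```

Because the shares are forced to sum to 100%, they say nothing about overlap: someone searching for both R and Python is counted once for each.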

The most popular language for machine learning

The trend in data science is very striking, but maybe it is just that R practitioners use different jargon? If Python is really overtaking R as the most popular language for data science, it would make sense for it also to be the most popular language for machine learning. This is what we observe when we change the search terms to "R machine learning" etc.:

The same trend is observed, but Python's dominance is even stronger.

Why this change now?

This is where things get tricky. The descriptive analysis above is clear and simple. A diagnostic analysis, answering the question of why rather than what, requires speculative hypotheses, which can be hard to test.

A parallel trend in this field is deep learning. Major companies like Facebook, Google and Microsoft are investing heavily in this area. Google released its deep learning framework TensorFlow a while back, and it interacts well with Python. After all, Python is one of the most used languages at Google alongside Java, C++ and Go. Furthermore, Keras, a deep learning library for Python which simplifies TensorFlow and Theano, is growing very popular. Until recently, there was no equivalent for R, so data scientists who wanted to develop deep learning models had to learn Python.

While this figure does not prove that deep learning is the reason for the increase in Python's popularity, the timing of the two trends suggests that there is some truth to this hypothesis. If deep learning and the popularity of Keras are a driver of this change, the recent Keras for R package might have an impact on the popularity of R. It will be interesting to see how this plays out!

Beyond the Horizon

It looks like the rise of Python in data science will continue. However, it is hard to make predictions, especially about the future! Perhaps something else will come along that will place both R and Python on the sidelines, next to SAS. A potential candidate is Julia, which has a number of features that are valuable for data science. Another "best of both worlds" language is Scala, which already fuels the very popular Spark. SAS, R and Python are very old and mature languages with well-tested features, and whether this is worth giving up for the fancy new languages depends very much on how important stability is for your use case.

I do not think that R or Python will die any time soon. While SAS is tiny in data science (all industries considered), it is the standard tool for my financial services clients, and that will most likely still be the case five years from now.

Final words

Rather than focussing on which tools you use, start from the problem. Once you know what you need to do to solve it, choose the right tool for the job. For this, it helps to know many languages, to avoid going to work with a single tool in your toolbox.

Please post questions, comments, corrections and suggestions in the comment field below.
