Why every data scientist should learn SQL

It’s been quite a long time since my last blog post, and there is a specific reason for that: I participated in the 2nd DFB Hackathon, which consumed a huge amount of the free time I would normally spend creating content for this blog. The Hackathon was once again a great experience, as all this deep data science stuff is still a challenge for me. But one big question remains on my side: Why do data scientists so often rely only on Python (or R) and not know how and when to use SQL?

The 2nd DFB Hackathon

To provide some context first, I want to share a bit of information about the DFB Hackathon. The 1st edition was a really small one that took place over just 2 days. The 2nd was built around three key points: more time, more data, more reusable results. We were provided with 2 seasons of Bundesliga positional data and different challenges to be solved over a period of 6 months, with an extension caused by all the COVID problems. My team had the task of creating a throw-in model to measure the risk and reward of throw-ins in football. For everybody already familiar with how such models are created: we used an approach similar to expected goals models. Based on a large number of features describing the specific match situation, a machine learning algorithm determines the expected risk or reward of a specific throw-in execution.

Throw-in process defined during the 2nd Hackathon

But explaining the detailed implementation is not the goal of this post. I want to make a suggestion from my point of view as a data engineer: every professional data scientist should know SQL!

The advantages of SQL

Already during the 1st Hackathon I noticed that the data scientists exchanged Python scripts and packages that were supposed to make their lives easier during the Hackathon. And I asked myself: why are they trying to achieve everything with Python? A single football match contains about 3.6 million data points of positional data. Why are they not using SQL? A data scientist I am currently working with couldn’t really answer this question either, although he admits it may be a good idea to take a look at SQL.

When I started working in data analytics more than 15 years ago, I learned SQL as my go-to tool for getting insights from data. And that’s for some simple reasons:

It’s a query language

Huge amounts of data are normally maintained in databases. And SQL, in contrast to Python, is a query language that is aimed directly at extracting data from relational databases. It’s not a general-purpose programming language, but it makes working with databases easy.

Small grammar

As a language, SQL is simpler than Python. It has a smaller grammar and fewer distinct concepts, so it’s easy to get started. But in my opinion it’s still not easier to master, because you also need an understanding of data, data modeling and how data is organized inside a relational database. A second advantage of a smaller grammar is that SQL is easier to understand, and SQL code is also easier to maintain.

Speed

SQL code is executed inside the database. And modern databases offer indexes and partitions for your data to provide faster access. So you will get your results much faster with SQL than with Python scripts. Manipulating data with Python requires multiple steps: the data is unloaded from the database or a file, the manipulation is done inside a script, and afterwards the result needs to be written back into the database or another file.
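
To make this concrete, here is a minimal sketch of a transformation that stays entirely inside the database – the table and column names are invented for this example:

-- Hypothetical tables: the aggregation runs where the data lives,
-- no unload into a script and no write-back step afterwards
CREATE TABLE player_distance AS
SELECT match_id, player_id, SUM(step_distance) AS total_distance
FROM positions
GROUP BY match_id, player_id;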

Filtering and aggregation

SQL definitely excels in one area: filtering and aggregating data. 4 lines of code are already enough for a basic aggregation with a filter. Some might argue otherwise, but such basic operations are most of the time already enough to fulfill the business requirements at hand.
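
A hedged sketch of what such a query could look like – the positions table and its columns are made up for illustration:

SELECT match_id, AVG(ball_speed) AS avg_ball_speed
FROM positions
WHERE ball_status = 'alive'
GROUP BY match_id;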

SQL vs. Pandas

With the introduction of Pandas, many SQL-like operations also became easily possible on data frames in Python. This small article provides a great comparison of some simple operations:

TowardsDataScience: SQL vs Python

So you might think: OK, then I can stick with Python. But there are some more advanced topics you should think about, where Pandas data frame operations don’t seem to be the best solution.

Filtering with SQL vs. Filtering a Pandas data frame
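
As a rough sketch of that comparison (table and column names are assumptions):

-- SQL: the filter is pushed down to the database
SELECT *
FROM positions
WHERE player_id = 17 AND half = 2;

-- Pandas equivalent on a data frame that is already loaded into memory:
-- df[(df["player_id"] == 17) & (df["half"] == 2)]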

Data movement

The data you are working with might not only be consumed by you, but also by multiple other colleagues. So the data is part of a broader workflow and may be consumed, for example, by a BI tool. And that’s where SQL is needed, as it is the most common and standardized way for visualization tools to access data.
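
As a small, hypothetical example, exposing a result as a view lets such a tool query it directly via SQL:

-- Hypothetical view a BI tool could read without any extra export step
CREATE VIEW v_team_xg_per_match AS
SELECT match_id, team_id, SUM(xg) AS total_xg
FROM shots
GROUP BY match_id, team_id;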

Analytical and window functions

One of my personal killer arguments for SQL is analytical and window functions. When working with time-series data, these functions make your life a lot easier. You want the start and end position of a player during a defined time frame? FIRST_VALUE and LAST_VALUE give you exactly that. You need to determine the ball status in the previous or next frame? LEAD and LAG are the answer. Window aggregations like a moving average are also fairly simple: you just need to add a window clause to your basic aggregation, and everything still fits within one line of code. In Python you would need to loop over the different partitions, store intermediate values and handle the results yourself.

Analytical function syntax for Exasol
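
A minimal sketch of how such a query could look – the positions table and its columns are invented, and the named WINDOW clause assumes a reasonably recent database version (otherwise the same can be written with inline OVER clauses):

-- Hypothetical positional data: positions(match_id, player_id, frame, x, ball_status)
SELECT
  match_id,
  player_id,
  frame,
  FIRST_VALUE(x) OVER w AS x_start,   -- start position within the window
  LAST_VALUE(x)  OVER w AS x_end,     -- end position within the window
  LAG(ball_status)  OVER (PARTITION BY match_id ORDER BY frame) AS prev_ball_status,
  LEAD(ball_status) OVER (PARTITION BY match_id ORDER BY frame) AS next_ball_status,
  AVG(x) OVER (PARTITION BY match_id, player_id ORDER BY frame
               ROWS BETWEEN 24 PRECEDING AND CURRENT ROW) AS x_moving_avg   -- moving average over 25 frames
FROM positions
WINDOW w AS (PARTITION BY match_id, player_id ORDER BY frame
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);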

Disadvantages of SQL

But of course there are two sides to every coin: SQL also has limitations and is not suitable for every use case.

Missing SQL functions

When using SQL extensively for transforming data, you will quickly notice some functional limitations. The SQL standard and the different vendor implementations often lack specialized functions. For general tasks, e.g. text transformation, you will find a huge selection of functions. But fortunately that’s no reason to throw SQL overboard. Every database vendor provides the possibility to extend its SQL dialect with user-defined functions. And on top of that, you often also have the option to use different programming languages (e.g. Python, JavaScript, R) and take advantage of their functionality.

SQL function for distance calculation between 2 points (JavaScript for Google BigQuery)
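
A sketch of what such a user-defined function might look like in BigQuery – the function name and signature are just an example:

-- Hypothetical scalar UDF: Euclidean distance between two points, implemented in JavaScript
CREATE TEMP FUNCTION distance(x1 FLOAT64, y1 FLOAT64, x2 FLOAT64, y2 FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS """
  return Math.sqrt(Math.pow(x2 - x1, 2) + Math.pow(y2 - y1, 2));
""";

-- Usage: distance between two points on the pitch
SELECT distance(0.0, 0.0, 3.0, 4.0) AS dist;   -- returns 5.0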

It’s a query language

What was the first advantage of SQL is also the last disadvantage in this list: SQL is a query language. So it’s limited to working with data. When thinking about one of the main tasks of a data scientist – creating models – that’s just not possible with SQL alone. In such cases you have to rely on the capabilities of a general-purpose programming language like Python.

Conclusion

An underestimated part of the daily work of a data scientist is handling data: loading, unloading, transforming, wrangling. I know it will not take long until I meet the next data scientist who is doing all this in a Python script. And it will again be time for me to give a small piece of advice from my perspective as a data engineer: take a look at SQL!

Migrating Exasol Community Edition

In one of my older posts I described the data architecture I am using for all my examples. As the database I use the Exasol Community Edition. From time to time it is necessary to update your software to the current version to get new features. This post describes how to migrate one Exasol Community Edition to another. These steps can also be used to migrate nearly any database to an Exasol.


xG data journey – the rise of M. Gladbach

After getting all this expected goals data, the obvious next step is to take a look at the insights such data can produce and at how xG can be interpreted. I have decided to look at the current development of Borussia Moenchengladbach in the Bundesliga. Even if RB Leipzig has now taken over first place, the development of Gladbach compared to the last seasons is impressive. And now I just want to know: does xG data reveal the secret of Marco Rose?


From Business Analytics to Sports Analytics

Before I started analyzing data for sports betting, I worked as a Business Intelligence (BI) consultant in different industries. During this time I learned how business analytics helps you improve your business performance by analyzing data. This also helped me understand what’s needed to improve the performance of a sports team, or the betting performance of a punter, with the help of data.


xG data journey – scraping dynamic webpages

In the first part of this data journey, I took a look at the general definition of expected goals (xG) and the usage of this metric. As the next step in the process of testing the predictive power of xG, I need to get some data. This part will focus on getting the team expected goals statistics. In one of the following parts, I will also take a look at getting the player expected goals statistics, as this of course offers even deeper insights.


xG data journey – What are Expected Goals?

After I realized my available data is definitely not enough to beat the bookie, I decided to start a new data journey and take a look at some more advanced statistics. And what could be better suited than Expected Goals (xG)? This statistic is used more and more to explain that specific luck / bad luck factor you feel when watching a football match. In the first part of this journey I will explain what xG are and what they tell you about a football match.

Retrospective for Bundesliga season 2018/19

Before the new season starts, I should take a look at the last one. Everybody following my pick history already knows: the last season was again very disappointing! But I have to point out once more that I of course did not expect to find the “holy grail” after just two seasons of model testing. So how bad do the numbers really look, and what are the most important lessons learned…


Overcome your confirmation bias (guest blog)

If you follow my Twitter account, you may have noticed that for several months I have also been writing blogs and articles for other platforms. Even though these are mostly not about sports betting, I thought it would be a good idea to share them via my blog as well, along with some thoughts on the topics, as the main message is often the same: get the most out of your data!


A data journey – market values (part 2)

In the last post I described how I collected the market value data as the first step of my journey. The second step is – in my opinion – one of the most important ones: get to know your data! Of course many predictive methods can be used as a black box, but that’s something I would not suggest. At least you should understand how your values are distributed, and it’s even better when you build up some domain knowledge. Knowing your data gives you the possibility to shorten the training process of your predictive models. And visualizations always help you better understand your data.