Exasol Python UDF web scraper for Bundesliga match day fixtures

The hardest part of sports analytics is getting the data! It is not for nothing that there are companies that earn their money solely with sports data. But if you cannot, or do not want to, pay that kind of money, you have just one option: scraping the data from the web. In an older post, I described an R web scraper. As that one no longer worked, I needed a new one, which brings us to this post. This time I will describe how to create a web scraper for static HTML sites with Python and how to implement such a web scraper as a User Defined Function (UDF) in Exasol.

Web scraping with Python

Python offers several different packages for web scraping. You basically have to distinguish between scraping static and dynamic websites. The following website offers a good overview of the most-used packages for both options:

Python web scraping package overview

This time I want to extract the current fixtures from the German sports website Transfermarkt.com. This website is static, so I decided to use BeautifulSoup for the HTML parsing. To extract specific elements from the HTML code, we have to know their attributes, such as type, name, or class. These can easily be identified with Google's Chrome DevTools.

chrome developer tools
Identify HTML elements with Chrome DevTools

Scraping current fixtures from transfermarkt.com

For scraping static HTML code, we need Python's urllib.request module and the BeautifulSoup package.

import urllib.request
from bs4 import BeautifulSoup as bs

For this test, I used the main page of the German Bundesliga to extract the fixtures of the current matchday.

v_url = 'http://www.transfermarkt.com/1-bundesliga/startseite/wettbewerb/L1'

The urllib.request module loads the HTML page. In the case of Transfermarkt.com, you also have to send an HTTP header containing a user agent; otherwise the server blocks the request.

#build header
v_user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
v_headers = {'User-Agent': v_user_agent}

#load web site
v_request = urllib.request.Request(url=v_url,headers=v_headers)
v_response = urllib.request.urlopen(v_request)

BeautifulSoup parses the HTML code and creates a Soup object.

soup = bs(v_response, 'html.parser')

This BeautifulSoup object can now be used to search for and extract the required data structures from the HTML page. Transfermarkt.com provides three tabs with the last, the current, and the next matchday. The current matchday tab has the ID “spieltagtabs-2”. The tab contains a table with one table row (class “begegnungZeile”) per fixture.

#get current matchday and each single match
v_cur_matchday = soup.find('div', attrs={'id':'spieltagtabs-2'})
v_match_info = v_cur_matchday.findAll('tr', attrs={'class':'begegnungZeile'})
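
The same lookup can be reproduced on a minimal hand-written snippet that mimics this structure (the HTML below is a toy stand-in, not the real Transfermarkt markup):

```python
from bs4 import BeautifulSoup as bs

# Toy HTML mimicking the tab/row structure described above (illustrative only)
v_html = '''
<div id="spieltagtabs-2">
  <table>
    <tr class="begegnungZeile">
      <td class="verein-heim">FC Bayern</td>
      <td class="verein-gast">Borussia Dortmund</td>
    </tr>
  </table>
</div>'''

soup = bs(v_html, 'html.parser')

# find() returns the first matching element, findAll() a list of all matches
v_cur_matchday = soup.find('div', attrs={'id': 'spieltagtabs-2'})
v_match_info = v_cur_matchday.findAll('tr', attrs={'class': 'begegnungZeile'})

print(len(v_match_info))  # 1
```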

We can now loop over these table rows. As the match date is not present in every row, the last seen date must be cached.

#loop through all match infos
last_match_date = None
i = 0
while i < len(v_match_info):
    #extract single values
    home_team = v_match_info[i].findAll('td', attrs={'class': 'verein-heim'})
    away_team = v_match_info[i].findAll('td', attrs={'class': 'verein-gast'})
    match_date = v_match_info[i].findAll('span', attrs={'class': 'spielzeitpunkt'})

    #cache the date, as only the first fixture of a day carries one
    if len(match_date) == 1:
        last_match_date = match_date
    else:
        match_date = last_match_date

    print(home_team[0].text.strip() + ' - ' + away_team[0].text.strip() + ' (' + match_date[0].text.strip() + ')')

    i += 1

The print command shows the following output:

python_print
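
The date-caching pattern used in the loop can be isolated on toy data (team names and dates below are made up for illustration):

```python
# Toy rows: only the first fixture of each day carries a date (made-up data)
rows = [('Bayern', 'Dortmund', 'Sat 3:30 PM'),
        ('Leipzig', 'Schalke', None),
        ('Bremen', 'Koeln', 'Sun 3:30 PM')]

fixtures = []
last_match_date = None
for home_team, away_team, match_date in rows:
    if match_date is not None:
        last_match_date = match_date  # cache the newly seen date
    fixtures.append((home_team, away_team, last_match_date))

print(fixtures[1])  # ('Leipzig', 'Schalke', 'Sat 3:30 PM')
```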

Web scraping Exasol UDF

The first step, building the scraping logic, is done. Now this logic needs to become a bit more generic and executable as a User Defined Function during normal data processing inside the database.
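
As a sketch of that refactoring (function names and the reduced user agent are my own, not from the final UDF), the steps can be split into a parsing function and a fetching function, so the parser works for any league URL:

```python
import urllib.request
from bs4 import BeautifulSoup as bs

def parse_fixtures(v_html):
    """Parse (match_date, home_team, away_team) tuples from a league page."""
    soup = bs(v_html, 'html.parser')
    v_cur_matchday = soup.find('div', attrs={'id': 'spieltagtabs-2'})
    fixtures = []
    last_match_date = None
    for row in v_cur_matchday.findAll('tr', attrs={'class': 'begegnungZeile'}):
        home_team = row.find('td', attrs={'class': 'verein-heim'}).text.strip()
        away_team = row.find('td', attrs={'class': 'verein-gast'}).text.strip()
        match_date = row.find('span', attrs={'class': 'spielzeitpunkt'})
        if match_date is not None:
            last_match_date = match_date.text.strip()  # cache for date-less rows
        fixtures.append((last_match_date, home_team, away_team))
    return fixtures

def scrape_fixtures(p_url):
    """Fetch a league page (with a user-agent header) and parse its fixtures."""
    v_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    v_request = urllib.request.Request(url=p_url, headers=v_headers)
    return parse_fixtures(urllib.request.urlopen(v_request))
```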

Configuration data

To scrape multiple leagues, we have to switch the URL for each execution. The different URLs are therefore stored in a separate configuration table.

create table meta.transfermarkt_cur_fixtures_config
	(
	division	varchar(50),
	url			varchar(200)
	);
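
An entry for the Bundesliga URL used above would, for example, look like this (a sketch; the actual inserts are part of the GitHub script below):

```sql
insert into meta.transfermarkt_cur_fixtures_config (division, url)
values ('L1', 'http://www.transfermarkt.com/1-bundesliga/startseite/wettbewerb/L1');
```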

The DDL script, including the URL inserts, can be found on GitHub:

GitHub: DDL transfermarkt current fixtures

UDF definition

The User Defined Function is based on the Script Language Container explained in the last post. This container already contains the BeautifulSoup package. The URL from the configuration table and the division identifier are passed as parameters. One execution of the web scraper UDF returns multiple records with the following structure:

  • Division
  • Match date
  • Home team
  • Away team

The corresponding UDF head looks like this:

create or replace PYTHON3 scalar script stage.webscr_transfermarkt_cur_fixtures (p_division varchar(3), p_url varchar(100))
  emits (division varchar(3), match_date varchar(10), home_team varchar(100), away_team varchar(100))
as

The Python code of the web scraper must be pasted into the function run(ctx).

def run(ctx):

The ctx object provides the passed parameters.

v_url = ctx.p_url

And the original print statement has to be replaced with ctx.emit, which returns the single rows.

ctx.emit(ctx.p_division, match_date[0].text.strip(), home_team[0].text.strip(), away_team[0].text.strip())

That’s all that needs to be done! Now we have a web scraper UDF that allows us to “query” the web page directly from SQL statements.

The UDF script can also be found at GitHub:

GitHub: Web scraper Python UDF

The following simple SELECT statement can be used to test the UDF:

select
stage.WEBSCR_TRANSFERMARKT_CUR_FIXTURES(division, url)
from
meta.TRANSFERMARKT_CUR_FIXTURES_CONFIG
;

The result looks good!

udf_result
Python UDF result

 

Conclusion

User Defined Functions offer the possibility to extend the functionality of our analytical system in every conceivable way, be it collecting data from the web, executing predictive models, or extending the statistical functionality. The ability to push any processing logic to your data, instead of moving the data, simplifies the overall process. Everything is done inside the database.

 

If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.

 
