The hardest part of sports analytics is getting data! It is not for nothing that there are companies earning their money just with sports data. But if you are not able or do not want to pay such amounts of money, you have only one option left: scraping the data from the web. In an older post, I described an R web scraper. As that one was no longer working, I needed a new one, which brings us to this post. This time I will describe how to create a web scraper for static HTML sites with Python and how to implement such a web scraper as a User Defined Function (UDF) in Exasol.
Web scraping with Python
Python offers several different packages for web scraping. You basically have to distinguish between scraping static and dynamic websites. The following website offers a good overview of the most used packages for both options:
Python web scraping package overview
This time I want to extract the current fixtures from the German sports website Transfermarkt.com. This website is static, so I decided to use BeautifulSoup for the HTML parsing. To extract specific elements from the HTML code, we have to know their attributes like type, name or class. These can easily be identified with the Chrome DevTools.

Scraping current fixtures from transfermarkt.com
For scraping static HTML code, we need the urllib.request module and the BeautifulSoup package.
import urllib.request
from bs4 import BeautifulSoup as bs
For this test, I used the main page of the German Bundesliga to extract the fixtures of the next matchday.
v_url = 'http://www.transfermarkt.com/1-bundesliga/startseite/wettbewerb/L1'
urllib.request loads the HTML page. In the case of Transfermarkt.com, you also have to create an HTTP header containing a user agent, otherwise the server blocks the request.
#build header
v_user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
v_headers = {'User-Agent': v_user_agent}
#load web site
v_request = urllib.request.Request(url=v_url, headers=v_headers)
v_response = urllib.request.urlopen(v_request)
BeautifulSoup parses the HTML code and creates a Soup object.
soup = bs(v_response, 'html.parser')
This BeautifulSoup object can now be used to search and extract the required data structures from the HTML page. Transfermarkt.com provides three different tabs with the last, the current and the next matchday. The current matchday tab has the ID “spieltagtabs-2”. The tab contains a table with one table row (class = “begegnungZeile”) per fixture.
#get current matchday and each single match
v_cur_matchday = soup.find('div', attrs={'id': 'spieltagtabs-2'})
v_match_info = v_cur_matchday.findAll('tr', attrs={'class': 'begegnungZeile'})
We can now loop over these table rows. As the match date is not present in every row, it has to be cached.
i = 0
#loop through all match infos
while i < len(v_match_info):
    #extract single values
    home_team = v_match_info[i].findAll('td', attrs={'class': 'verein-heim'})
    away_team = v_match_info[i].findAll('td', attrs={'class': 'verein-gast'})
    match_date = v_match_info[i].findAll('span', attrs={'class': 'spielzeitpunkt'})
    if len(match_date) == 1:
        last_match_date = match_date
    else:
        match_date = last_match_date
    print(home_team[0].text.strip() + ' - ' + away_team[0].text.strip() + ' (' + match_date[0].text.strip() + ')')
    i += 1
The print command shows the following output:

Web scraping Exasol UDF
The first step, building the scraping logic, is done. Now this function needs to become a bit more generic and should be executable as a User Defined Function during the normal data processing inside the database.
Configuration data
To scrape multiple leagues, we have to switch the URL for each execution. The different URLs are therefore stored in a separate configuration table.
create table meta.transfermarkt_cur_fixtures_config (
    division varchar(50),
    url varchar(200)
);
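A configuration row for the Bundesliga could then look like this (the division code 'D1' is just an illustrative value; the actual inserts are part of the GitHub script linked below):
insert into meta.transfermarkt_cur_fixtures_config (division, url)
values ('D1', 'http://www.transfermarkt.com/1-bundesliga/startseite/wettbewerb/L1');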
The DDL script including the URL inserts can be found on GitHub:
GitHub: DDL transfermarkt current fixtures
UDF definition
The User Defined Function has to be based on the Script Language Container added in the last post. This container already contains the BeautifulSoup package. The URL from the configuration table and the division identifier have to be passed as parameters. One execution of the web scraper UDF will return multiple records with the following structure:
- Division
- Match date
- Home team
- Away team
The corresponding UDF head looks like this:
create or replace PYTHON3 scalar script stage.webscr_transfermarkt_cur_fixtures (
    p_division varchar(3),
    p_url varchar(100)
)
emits (
    division varchar(3),
    match_date varchar(10),
    home_team varchar(100),
    away_team varchar(100)
) as
The Python code of the web scraper must be placed inside the function run(ctx).
def run(ctx):
The ctx object provides the passed parameters, for example the URL.
v_url = ctx.p_url
And the original print statement has to be replaced by ctx.emit, which returns the single result rows.
ctx.emit(ctx.p_division, match_date[0].text.strip(), home_team[0].text.strip(), away_team[0].text.strip())
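Putting the snippets together, the body of the UDF could look roughly like this (a sketch assembled from the code above; the complete script is in the GitHub repository linked below):
import urllib.request
from bs4 import BeautifulSoup as bs

def run(ctx):
    #build header to avoid getting blocked by the server
    v_user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
    v_headers = {'User-Agent': v_user_agent}
    #load the web site using the URL passed by the SQL statement
    v_url = ctx.p_url
    v_request = urllib.request.Request(url=v_url, headers=v_headers)
    v_response = urllib.request.urlopen(v_request)
    #parse the HTML code
    soup = bs(v_response, 'html.parser')
    #get current matchday tab and each single match row
    v_cur_matchday = soup.find('div', attrs={'id': 'spieltagtabs-2'})
    v_match_info = v_cur_matchday.findAll('tr', attrs={'class': 'begegnungZeile'})
    i = 0
    #loop through all match infos
    while i < len(v_match_info):
        home_team = v_match_info[i].findAll('td', attrs={'class': 'verein-heim'})
        away_team = v_match_info[i].findAll('td', attrs={'class': 'verein-gast'})
        match_date = v_match_info[i].findAll('span', attrs={'class': 'spielzeitpunkt'})
        #the match date is only present in the first row per date, so cache it
        if len(match_date) == 1:
            last_match_date = match_date
        else:
            match_date = last_match_date
        #emit one result row per fixture instead of printing it
        ctx.emit(ctx.p_division, match_date[0].text.strip(), home_team[0].text.strip(), away_team[0].text.strip())
        i += 1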
That’s all that needs to be done! Now we have a web scraper UDF which allows us to “query” the webpage directly in SQL statements.
The UDF script can also be found at GitHub:
GitHub: Web scraper Python UDF
The following simple SELECT statement can be used to test the UDF:
select stage.WEBSCR_TRANSFERMARKT_CUR_FIXTURES(division, url)
from meta.TRANSFERMARKT_CUR_FIXTURES_CONFIG;
The result looks good!
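If the fixtures should not only be queried ad hoc but also persisted, the UDF call can, for instance, be wrapped into an INSERT (the target table stage.transfermarkt_cur_fixtures is just a hypothetical example):
insert into stage.transfermarkt_cur_fixtures (division, match_date, home_team, away_team)
select stage.WEBSCR_TRANSFERMARKT_CUR_FIXTURES(division, url)
from meta.TRANSFERMARKT_CUR_FIXTURES_CONFIG;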

Conclusion
User Defined Functions offer the possibility to extend the functionality of our analytical system in every conceivable way, be it collecting data from the web, executing predictive models or extending the statistical functionality. The ability to push any processing logic to your data instead of moving the data simplifies the overall process. Everything is just done inside the database.
If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.