A data journey – market values (part1)

When a rich club in Germany goes through a bad performance phase or loses an important match, we like to use the phrase “Geld schießt eben keine Tore”. What means more or less, that big money doesn’t ensure goals. But the overall acceptance is of course,  that richer clubs are expected to win more often as they got the money to buy the best players. This inspired me to start a data journey about market values in the big 5 European leagues: What do the market values tell about the development in the different leagues? How do teams perform in relation to the money they spent? Does the market value of a team has a predictive significance?

Scraping market values

Before I am able to take a look at all these questions, I of course need the data. In an older post I already described, how to extract the current fixtures from the German football news site transfermarkt.com. This page will also be my source for the historic market values of all the teams.

Transfermarkt.com offers two different sites with market value data. The main league page contains the yearly values over the last decades. The detailed market value site provides the values 2 times a month, but only for the last years. For some basic analysis the yearly data should be enough. But the more detailed market values are of course way more interesting, as changes in the team performance also have an impact on the overall market value.

For scraping the values I used again the Python packages Request and BeautifulSoup.

import urllib.request
from bs4 import BeautifulSoup as bs

In this example I only explain scraping the yearly values. But the logic is the same for the more detailed marketed values

v_url = 'http://www.transfermarkt.com/1-bundesliga/startseite/wettbewerb/L1'

We again have to create a HTTP header to not get blocked by the website. After reading the code, we can create a parseable Soup object,

#build header
v_user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
v_headers = {'User-Agent': v_user_agent}

#load web site
v_request = urllib.request.Request(url=v_url,headers=v_headers)
v_response = urllib.request.urlopen(v_request)

#create Soup object
soup = bs(v_response, 'html.parser')

Both market value sites contain a drop down box, which provides a list with all available market value sub sites for the different seasons.

v_year_select = soup.find('div', attrs={'class':'inline-select'})
v_year_list = v_year_select.findAll('option')

To get all the data, you have to loop over all the sub sites. I added also a parameter for the number of looped entries, so that I do not have to parse all the data every time. For each sub site you again have to create a Soap object, which can be parsed for the single market values.

p_num_years = 2

#counter variable
i=0

#loop over year list
for v_year in v_year_list:
v_season_id = (v_year["value"])

#build sub site URL
v_url_year =  v_url + '/plus/?saison_id=' + v_season_id

#read sub site
v_request_year = urllib.request.Request(url=v_url_year, headers=v_headers)
v_response_year = urllib.request.urlopen(v_request_year)

# soup parsing
soup_year = bs(v_response_year, 'html.parser')

In the next step you determine the table, which contains the values

#read market value table
v_year_select = soup.find('div', attrs={'class':'inline-select'})
v_year_list = v_year_select.findAll('option')

and loop over all the rows to extract the values

#loop over table rows
for v_table_row in v_table_rows:
v_table_columns = v_table_row.findAll('td')

#extract values
v_team = v_table_columns[2].text.strip()
v_num_players = v_table_columns[3].text.strip()
v_team_market_value = v_table_columns[8].text.strip()

print(v_season_id,' ', v_team,' ', v_num_players,'  ', v_team_market_value)

And if my desired number of years is reached, I stop parsing the sub sites

#last x years break
i += 1
if i >= v_num_years:
break

The print command shows following output:

market_values_python_print
Print output

 

The Python code for the seasonal and monthly market values can be found at my GitHub repository:

 

Web scraping Exasol UDF

As for the current fixtures this scraping logic should also be put into a UDF (User Defined Function) so that the scraping is executed directly inside the database. The approach is again the same.

Configuration data

A configuration table contains the division URLs for which the market values should be extracted. This one can be used for seasonal and the detailed market values. Only the URLs need to be changed.

market_values_configuration_table
Configuration table

The corresponding DDL statement and the URLs for the Big-5 European leagues can be found at GitHub:

UDF defintion

The division and the URL of the configuration table are passed as UDF parameters together with the limitation for the number of years, which was already introduced in the Python script.

The corresponding UDF head looks like this:

create or replace PYTHON3 scalar script stage.webscr_transfermarkt_market_values_season
(p_division varchar(3), p_url varchar(100), p_num_years decimal(2,0))
emits
(division varchar(5), season varchar(4), team varchar(50), num_players varchar(100), team_market_value varchar(100))

Following values are returned:

  • Division
  • Season
  • Team
  • Number of Players
  • Team market value

Executing the UDF with the configuration data and a limitation of 1 season

select
stage.WEBSCR_TRANSFERMARKT_MARKET_VALUES_SEASON(division, url, 1)
from
META.TRANSFERMARKT_MARKET_VALUE_SEASON_CONFIG
;

provides the same result set as the original Python script:

UDF result
UDF result set

The complete UDF code can be found at GitHub:

Data integration

Now I got a list of many teams and their market values for several seasons. But to be able to  associate the market values with the team performance, this data needs to be integrated with the already existing historic football match results.

Team mapping

At first the team names need to be synchronized for both sources. Of course there are methods like fuzzy match and such things, but the easiest way is using a manual mapping file, which matches the team names:

This one should just be imported in a separate table, so that it is available for usage.

Data transformation

The market value has to be converted in a real number. Real numbers are of course better usable for calculation as shortcuts for millions and billions.

case
when instr(a.team_market_value,'Mio') > 0 then
to_number(replace(substr(a.team_market_value,1,instr(a.team_market_value,'Mio')-2),',','.')) * 1000000
when instr(a.team_market_value,'Mrd') > 0 then
to_number(replace(substr(a.team_market_value,1,instr(a.team_market_value,'Mrd')-2),',','.')) * 1000000000
else a.team_market_value
end

Staging View

Such transformation and adaptions should be part of the stage layer.  Following stage view executes the web scraper UDF and applies the transformations to the returned data.

Market value satellite

In the next step you load all the data into the Raw DV. At this level all data sources are integrated into one common Data Vault model. This simplifies the analysis of data of different data sources.

The market values are an additional attribute of a team. Attributes in a Data Vault model are stored in satellite tables. So we need to create a new satellite. On top the market value has a time dependency. It changes over time. This has to be modeled in the data. So you are able to analyse the data for any specific point in the past. That’s really import for the development of predictive models. All features for a model training should just contain data before the point of the simulated prediction.

I used following structure for the market value satellite:

market_value_satellite
Market value satellite table structure

The column FOOTBALL_TEAM_HID represents the reference to the team hub table. VALID_FROM indicates the time, since when the team had the specific market value. This results in a full history for each team.

Historic & current market values

When analyzing such a history, you need a range for the validity. A start date is not enough. You also need an end date. That’s why it’s best-practice to create to additional views for currently valid values and the complete history as time ranges.

The satellite view for the current values just filters the maximum valid date for each team.

select
FOOTBALL_TEAM_HID,
VALID_FROM,
to_date('31.12.9999','dd.mm.yyyy') VALID_TO,
NUM_PLAYERS,
TEAM_MARKET_VALUE,
LDTS
from
raw_dv.football_team_h_s_market_values_season
where
(FOOTBALL_TEAM_HID, VALID_FROM) in (
select
football_team_hid,
max(valid_from)
from
raw_dv.football_team_h_s_market_values_season
group by
football_team_hid
)
;

The historic view uses the analytical function LEAD to determine then VALID_FROM value of the next record in the team history. So ranges for the complete data set are created.

select
FOOTBALL_TEAM_HID,
VALID_FROM,
nvl(lead(valid_from-1) over (partition by football_team_hid order by valid_from), to_date('31.12.9999','dd.mm.yyyy')) VALID_TO,
NUM_PLAYERS,
TEAM_MARKET_VALUE,
LDTS
from
raw_dv.football_team_h_s_market_values_season
;

The corresponding DDL statements can be found at GitHub:

Loading satellite data

All data structures are now defined. So the data can be loaded from the stage table into the satellite table. The seasonal market values provided by transfermarkt.com apply to the date 1st, July of the season. This should be used as the VALID_FROM date. Every season, which was already loaded, does not need to be loaded again.

The loading script can also be found GitHub:

 

At this point the hard work is done and the fun, analyzing the data, can begin. But this will be part of the next blog post.

 

 

 

If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.

5 Replies to “A data journey – market values (part1)”

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: