Squawka.com web scraper for upcoming fixtures

 

 

Attention:

Squawka changed the design of their website. The website no longer uses HTML tables to list the fixtures for a specific league and also changed the corresponding URLs. That’s why the described web scraper does not work anylonger. As soon I found another data source for the upcoming fixtures, I will create a new blog.

 

 

For the implementation of the Poisson prediction model I needed a data source for the current fixtures. As a temporary solution I used a manual CSV file, which I updated and imported regularly. During my researches for new data sources, I found the website squawka.com. This website provides statistics and analysis based on Opta data. With this post, I will describe, how you can extract the current fixtures from this website and use them during the normal data processing, which replaces the manual CSV interface.

Reading HTML tables

Many football websites offer a list with the future fixtures for a variety of divisions. Why I have chosen to use Squawka? Squawka visualizes data in HTML tables, which is a huge benefit for creating a web scraper, as it makes it easy to extract data from the website. The R package rvest offers such functionality to read and extract data from HTML tables.

Before you can import the data, you have to locate the specific HTML table. Squawka offers different pages with the fixtures for different leagues. The fixtures for the German Bundesliga can be found at http://www.squawka.com/match-fixtures?ctl=22_s2017. You can locate the specific HTML table with the developer tools of the Google Chrome browser. After you have located the HTML table, you have to copy the XPath for the table. The XPath for the fixtures table is ‘//*[@id=”upcoming-fixtures”]/center/center/div/table‘.

At first you have to install the rvest package.

install.package("rvest")

The function read_html() reads the HTML data of the specific XPath.

#load libraries
library(rvest) 

#read html table
html_table_data %
 read_html() %>%
 html_nodes(xpath='//*[@id="upcoming-fixtures"]/center/center/div/table') %>%
 html_table(fill = TRUE)

After that, the HTML data can be pushed into a data frame.

fixtures_temp <- html_table_data[[1]]

So the data is available in a R data frame. Unfortunately the data frame contains some more data as needed. So the data frame needs to be arranged.

html_table_data_frame

For every fixture we need the name of the home team, the away team and when the match takes place. This information is stored in columns 2,4 and 7.

fixtures_temp <- fixtures_temp[,c(2,4,7)]

As every fixture consists of 2 lines, the useless lines without any information need to be dropped. The generic function na.omit() can be used to eliminate NA values in data frames.

fixtures_temp <- na.omit(fixtures_temp)

After there is just one line for each match, you can now limit the number of matches, which you want to extract. I have chosen to select all fixtures of the next matchday. For the Bundesliga this would be 9 games.

fixtures_temp <- fixtures_temp[1:9,]

Note: There is currently a problem with the upcoming fixtures of the Bundesliga at squawka.com, as it still displays a past match of December.

raw_fixtures_data_frame
upcoming matchday fixtures data frame

In a last step the date column should be changed to a more usable format. As I am located in Germany, I have to adjust the locale settings of R. With help of function gsub() you can eliminate the ordinal indicators, so that the conversion function as.Date() is able to convert the string into a date value.  The no longer needed original column should be dropped.

Sys.setlocale("LC_TIME", "C")

fixtures_temp$DATE <- gsub("(st|nd|rd|th) "," ",fixtures_temp[,c(3)])
fixtures_temp$DATE <- as.Date(fixtures_temp$DATE, format = "%H:%M on %d %B %Y")
fixtures_temp <- fixtures_temp[,-c(3)]

After these few steps you got a data frame with all upcoming fixtures for the Bundesliga. Due the usage of HTML tables at squawka.com, the data extraction is really simple.

formated_fixtures_data_frame

Process configuration

The code examples above showed, how to extract the upcoming fixtures for the Bundesliga. So you could easily copy the code for every other league. Or you could extend everything to a more generic configurable solution.

The data extraction needs 3 different parameters:

  • Name of the division
  • Squawka-URL with the upcoming fixtures
  • Number of games per matchday

These parameters can easily be stored in a database. Therefor I created a new schema Meta and created a configuration table. At first the table only contains the parameters for the German Bundesliga.

GitHub – Squawka current fixtures configuration table

This table can easily be read and stored into a data frame.

#Connect to exasol
con <- dbConnect("exa", exahost = "[IP-adress]",
 uid = "sys", pwd = "exasol", schema = "meta")

#read config
data_config <- exa.readData(con,
"
SELECT
 DIVISION,
 URL,
 NUM_MATCHES
FROM
 META.SQUAWKA_CUR_FIXTURES_CONFIG
")

The connection parameters need to be adapted. The shown parameters are the default ones, if you use the Exasol virtual machine.

The data frame can be used to loop throw all configured divisions.

#loop all config entries
for (i in 1:nrow(data_config)) {

#read parameter
url <- data_config[i,"URL"]
num_games <- data_config[i,"NUM_MATCHES"]

[…]
}

After the fixtures for one league were extracted, they must be stored in a data frame for all divisions.

#save in finale fixtures data frame
if (i==1) {
fixtures <- fixtures_temp
} else {
fixtures <- rbind(fixtures, fixtures_temp)
}

Data integration

At this point, you got a web scraper, which reads the upcoming fixtures for all configured leagues. Now this data must be integrated to the existing data model and to the existing data process.

As each other data source, the web scraper has to store the extracted data into the staging area of the database.  The stage table should have the same data structure as the stage table for the manuel data source for current fixtures:

GitHub – Squawka current fixtures interface table

Because both tables have the same data structure, the existing loading process can easily be reused by just switching the source table.

GitHub – Current matches link table loading process

Unfortunately the team names of football-data.co.uk and squawka.com differ a little bit. For example the team name for Bayern Munich at football-data.co.uk is “Bayern Munich” and “Bayern” at squawka.com. So the web scraper must be extended by a team mapping, which ensures, that the team names of the upcoming fixtures correspond to the team names of the historic data.

I solved this problem by creating a small mapping CSV file, which I import into the database and query during the web scraping.

GitHub – Team mapping CSV

The stage table and the import process for that CSV file look like this:

GitHub – Team mapping stage table

GitHub – Team mapping import process

The imported data can be read during the web scraping process and stored into a data frame.

#read team name mapping
team_mapping <- exa.readData(con,
"
SELECT
FOOTBALL_DATA,
SQUAWKA
FROM
STAGE.TEAM_MAPPING
")

This data frame can be used to replace the original Squawka team names with the football-data.co.uk team names. The merge() function joins two data frames based on a common column. Therfor the columns of the upcoming fixtures data frame must first be renamed and then deleted, after the new home and away team name were merged to the data frame. Additionally the division must be added.

#lookup home team name
colnames(fixtures_temp)[1] <- "SQUAWKA"
fixtures_temp <- merge(fixtures_temp,team_mapping,by=c("SQUAWKA"))
colnames(fixtures_temp)[5] <- "TEAM_HOME"
fixtures_temp <- fixtures_temp[,-c(1)]

#lookup away team name
colnames(fixtures_temp)[1] <- "SQUAWKA"
fixtures_temp <- merge(fixtures_temp,team_mapping,by=c("SQUAWKA"))
colnames(fixtures_temp)[5] <- "TEAM_AWAY"
fixtures_temp <- fixtures_temp[,-c(1)]

#add division
fixtures_temp$DIVISION <- data_config[i,"DIVISION"]

After all these steps the final data frame contains all upcoming fixtures for the configured divisions with adapted team names and the correct data structure.

final_fixtures_data_frame

 

The content of this data frame must be written into the interface table, so that it can be used during the standard data processing.

#write to stage table
EXAWriteTable(con = con, schema = 'STAGE', tbl_name = 'SQUAWKA_CUR_FIXTURES',
overwrite = 'TRUE', data = fixtures)

 

 

The complete R code for this web scraper can be found under following link:

GitHub – Squawka.com current fixtures web scraper

The disadvantage of this web scraper is of course, it must be executed in R studio. This does not comply with the overall architectural idea of the TripleA DWH. So this web scraper needs to be implemented as a User Defined Function inside the Exasol database. This currently fails as the R version inside the Exasol database is outdated and I do not know, how to update the Exasol R version. As soon as this problem is solved, I will write a second post, which will describe, how to implement this web scraper inside the database.

 

If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.

One Reply to “Squawka.com web scraper for upcoming fixtures”

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s