Prepare data: football-data.co.uk (part 2)

In the first part, Prepare data: football-data.co.uk (part 1), I described what the Data Vault model for the football-data.co.uk data looks like. In this second part I focus on loading data into the Data Vault model. With the overall analytical architecture in mind, this corresponds to the data integration process between the stage layer and the raw data layer.

Using a Data Vault model has a very nice benefit: each element type (hub, link, satellite) of such a model has a similar data structure and loading process. That is why I explain the whole data processing with just one example for each type. If you know how to load data into one hub, you know how to load data into every hub.

Data Vault model for football-data.co.uk

Load Hub Entities

The hub entities of a Data Vault model are the easiest ones. They just contain the Business Key of the corresponding object and some mandatory columns.

Data structure FOOTBALL_TEAM_H

The following SQL statement creates the data structure of the team hub table:

CREATE TABLE raw_dv.football_team_h
(
  football_team_hid CHAR(32),
  team VARCHAR(50),
  audit_insert_date TIMESTAMP,
  audit_record_source VARCHAR(100),
  CONSTRAINT pk_football_team_h primary key (football_team_hid) DISABLE,
  DISTRIBUTE BY football_team_hid
);

The primary key column of a hub table (and of every other table type) is the hash key. As I use MD5 to calculate the hash key, the data type is CHAR(32). I only create disabled constraints, because the loading process of a Data Vault 2.0 model does not enforce referential integrity. The DISTRIBUTE BY clause is a special feature of the Exasol database: Exasol is an MPP (massively parallel processing) system, so the data is split across a number of nodes. With the DISTRIBUTE BY clause you decide which attribute is used as the distribution key across these nodes. I normally suggest using the most frequently joined attribute as the distribution key.
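
For illustration, the result of Exasol's HASH_MD5 function is always a 32-character hex string, which is why CHAR(32) is used for every hash key column:

--returns the MD5 hash of a team name as a 32-character hex string
SELECT HASH_MD5('Arsenal') AS team_hid;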

Load data into FOOTBALL_TEAM_H

The following SQL statement loads data from the stage layer table into the team hub table:

INSERT INTO raw_dv.football_team_h
	(
	football_team_hid,
	team,
	audit_insert_date,
	audit_record_source
	)
--select all distinct teams
SELECT DISTINCT
	team_bk_hash,
	team,
	CURRENT_TIMESTAMP,
	'load FOOTBALL_TEAM_H'
FROM
	(
	SELECT
		HASH_MD5(hometeam) AS team_bk_hash,
		hometeam AS team
	FROM
		stage.football_his
	UNION
	SELECT
		HASH_MD5(awayteam),
		awayteam
	FROM
		stage.football_his
	) src
WHERE
	--only load teams that do not already exist
	NOT EXISTS (
				SELECT
					1
				FROM
					raw_dv.football_team_h tgt
				WHERE
					src.team_bk_hash = tgt.football_team_hid
				);

The business key of the team hub table is the team name. So the first step is to select a distinct list of all team names available in the source data. The hash key of a hub table is calculated from the business key, so here it is calculated from the team name. In the second step every selected team name is checked against the hub table, and only teams that do not already exist are inserted. That is all you have to do to load data into a hub table.

Load Link Entities

Link tables represent the relations between different hub tables. The Data Vault model for historic football data only contains one link table: the historic match link table.

Data structure FOOTBALL_MATCH_HIS_L

The following SQL statement creates the data structure of the football match link table:

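The statement below is only a sketch: based on the business keys that make up the link hash key further down, I assume hub references for the division, the season and the team hub (once in the home and once in the away role); all column names except football_match_his_lid are assumptions.

CREATE TABLE raw_dv.football_match_his_l
	(
	football_match_his_lid		CHAR(32),
	football_division_hid		CHAR(32),	--assumed reference to a division hub
	football_season_hid		CHAR(32),	--assumed reference to a season hub
	football_team_home_hid		CHAR(32),	--reference to FOOTBALL_TEAM_H (home role)
	football_team_away_hid		CHAR(32),	--reference to FOOTBALL_TEAM_H (away role)
	match_date			DATE,		--assumed dependent key of the match
	audit_insert_date		TIMESTAMP,
	audit_record_source		VARCHAR(100),
	CONSTRAINT pk_football_match_his_l PRIMARY KEY (football_match_his_lid) DISABLE,
	--foreign keys to the assumed division and season hubs would follow the same pattern
	CONSTRAINT fk_football_match_his_l_home FOREIGN KEY (football_team_home_hid) REFERENCES raw_dv.football_team_h (football_team_hid) DISABLE,
	CONSTRAINT fk_football_match_his_l_away FOREIGN KEY (football_team_away_hid) REFERENCES raw_dv.football_team_h (football_team_hid) DISABLE,
	DISTRIBUTE BY football_match_his_lid
	);
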
All link tables of a Data Vault model are designed as many-to-many relationship tables. Because of that, a link table contains one primary key and one foreign key for each hub table that is connected to the link table. And of course the mandatory columns of the Data Vault model have to be there as well. The primary and foreign keys of a link table are all hash keys; a Data Vault 2.0 model uses only hash keys to reference between tables.

Load data into FOOTBALL_MATCH_HIS_L

The following SQL statement loads data from the stage layer table into the football match link table:

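Again only a sketch under the same assumptions as the DDL above; the hash key expression is the same one used for the satellite table further down:

INSERT INTO raw_dv.football_match_his_l
	(
	football_match_his_lid,
	football_division_hid,
	football_season_hid,
	football_team_home_hid,
	football_team_away_hid,
	match_date,
	audit_insert_date,
	audit_record_source
	)
--select all distinct business key combinations
SELECT DISTINCT
	match_bk_hash,
	division_hid,
	season_hid,
	team_home_hid,
	team_away_hid,
	match_date,
	CURRENT_TIMESTAMP,
	'load FOOTBALL_MATCH_HIS_L'
FROM
	(
	SELECT
		--link hash key: all business keys concatenated with '#' as separator
		HASH_MD5(division || '#' || hometeam || '#' || awayteam || '#' || season || '#' || to_char(match_date,'yyyymmdd')) AS match_bk_hash,
		HASH_MD5(division) AS division_hid,
		HASH_MD5(season) AS season_hid,
		HASH_MD5(hometeam) AS team_home_hid,
		HASH_MD5(awayteam) AS team_away_hid,
		match_date
	FROM
		stage.football_his
	) src
WHERE
	--only load business key combinations that do not already exist
	NOT EXISTS (
				SELECT
					1
				FROM
					raw_dv.football_match_his_l tgt
				WHERE
					src.match_bk_hash = tgt.football_match_his_lid
				);
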
Loading a link table looks very similar to loading a hub table. The only difference is that you do not select a distinct list of a single business key, but a distinct list of combinations of business keys. It is similar to loading a hub table whose business key consists of multiple columns. The primary key is generated from the concatenation of all business keys. It is very important to use a separator between the business keys; otherwise the risk of getting a hash collision increases a lot, as the example below shows. And again, just like when loading a hub table, only new business key combinations are loaded into the link table.
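
A small, made-up illustration of the separator issue: without a separator, two different business key combinations can produce the same concatenated string and therefore the same hash key.

--'E0' || 'Arsenal' and 'E0A' || 'rsenal' both result in 'E0Arsenal' and get the same hash key
SELECT HASH_MD5('E0' || 'Arsenal'), HASH_MD5('E0A' || 'rsenal');
--with a separator the two combinations stay distinct
SELECT HASH_MD5('E0' || '#' || 'Arsenal'), HASH_MD5('E0A' || '#' || 'rsenal');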

Load Satellite Entities

Normally satellite tables have to include a load end date timestamp to provide a historization of all descriptive attributes. But as the statistics of a past football match do not change, I decided to ignore historization. This makes the loading process for the satellite tables a lot easier: I do not have to create multiple versions of records and I do not have to handle load end dates for changed records.

Data structure FOOTBALL_MATCH_HIS_L_S_STATISTIC

The following SQL statement creates the data structure of the statistic satellite table:

CREATE TABLE raw_dv.football_match_his_l_s_statistic
	(
	football_match_his_lid 		CHAR(32),
	load_dts					TIMESTAMP,
	full_time_home_goals		DECIMAL(3),
	full_time_away_goals		DECIMAL(3),
	full_time_result			VARCHAR(1),
	half_time_home_goals		DECIMAL(3),
	half_time_away_goals		DECIMAL(3),
	half_time_result			VARCHAR(1),
	home_shots					DECIMAL(3),
	away_shots					DECIMAL(3),
	home_shots_target			DECIMAL(3),
	away_shots_target			DECIMAL(3),
	home_fouls					DECIMAL(3),
	away_fouls					DECIMAL(3),
	home_corners				DECIMAL(3),
	away_corners				DECIMAL(3),
	home_yellow					DECIMAL(3),
	away_yellow					DECIMAL(3),
	home_red					DECIMAL(3),
	away_red					DECIMAL(3),
	audit_insert_date			TIMESTAMP,
	audit_record_source			VARCHAR(100),
	CONSTRAINT pk_football_match_his_l_s_statistic PRIMARY KEY (football_match_his_lid, load_dts) DISABLE,
	CONSTRAINT fk_football_match_his_l_s_statistic FOREIGN KEY (football_match_his_lid) REFERENCES raw_dv.football_match_his_l (football_match_his_lid) DISABLE,
	DISTRIBUTE BY football_match_his_lid
	);

The hash key of a satellite table has to be exactly the same as in the corresponding hub or link table. As you can see, the primary key also includes the load date timestamp. This is normally only needed if you use historization in your satellite, because then there can be multiple versions of a satellite record with the same hash key and different load date timestamps. The foreign key of the statistic satellite table references the football match link. Besides the typical mandatory columns there are just the descriptive attributes. It is also possible to add the business key of the corresponding link or hub table to a satellite. But this is optional and mainly a simplification for ad hoc querying when you are working with larger Data Vault models.

Load data into FOOTBALL_MATCH_HIS_L_S_STATISTIC

The following SQL statement loads all descriptive statistic attributes from the stage source table into the statistic satellite table:

MERGE INTO raw_dv.football_match_his_l_s_statistic tgt
USING
	(
	SELECT
		HASH_MD5(division || '#' || hometeam || '#' || awayteam || '#' || season || '#' || to_char(match_date,'yyyymmdd')) football_match_his_lid,
		CURRENT_TIMESTAMP load_dts,
		full_time_home_goals,
		full_time_away_goals,
		full_time_result,
		half_time_home_goals,
		half_time_away_goals,
		half_time_result,
		home_shots,
		away_shots,
		home_shots_target,
		away_shots_target,
		home_fouls,
		away_fouls,
		home_corners,
		away_corners,
		home_yellow,
		away_yellow,
		home_red,
		away_red,
		CURRENT_TIMESTAMP audit_insert_date,
		'load FOOTBALL_MATCH_HIS_L_S_STATISTIC' audit_record_source
	FROM
		stage.football_his
	) src
ON (src.football_match_his_lid = tgt.football_match_his_lid)
WHEN MATCHED THEN UPDATE
SET
	tgt.FULL_TIME_HOME_GOALS 	= src.FULL_TIME_HOME_GOALS,
	tgt.FULL_TIME_AWAY_GOALS 	= src.FULL_TIME_AWAY_GOALS,
	tgt.FULL_TIME_RESULT 		= src.FULL_TIME_RESULT,
	tgt.HALF_TIME_HOME_GOALS	= src.HALF_TIME_HOME_GOALS,
	tgt.HALF_TIME_AWAY_GOALS	= src.HALF_TIME_AWAY_GOALS,
	tgt.HALF_TIME_RESULT		= src.HALF_TIME_RESULT,
	tgt.HOME_SHOTS				= src.HOME_SHOTS,
	tgt.AWAY_SHOTS				= src.AWAY_SHOTS,
	tgt.HOME_SHOTS_TARGET		= src.HOME_SHOTS_TARGET,
	tgt.AWAY_SHOTS_TARGET		= src.AWAY_SHOTS_TARGET,
	tgt.HOME_FOULS				= src.HOME_FOULS,
	tgt.AWAY_FOULS				= src.AWAY_FOULS,
	tgt.HOME_CORNERS			= src.HOME_CORNERS,
	tgt.AWAY_CORNERS			= src.AWAY_CORNERS,
	tgt.HOME_YELLOW				= src.HOME_YELLOW,
	tgt.AWAY_YELLOW				= src.AWAY_YELLOW,
	tgt.HOME_RED				= src.HOME_RED,
	tgt.AWAY_RED				= src.AWAY_RED,
	tgt.audit_insert_date		= src.audit_insert_date,
	tgt.audit_record_source		= src.audit_record_source
WHERE
	--Full-time result has changed
	--sufficient check for changed records
	tgt.FULL_TIME_RESULT <> src.FULL_TIME_RESULT
WHEN NOT MATCHED THEN
INSERT VALUES
	(
	src.football_match_his_lid,
	src.load_dts,
	src.FULL_TIME_HOME_GOALS,
	src.FULL_TIME_AWAY_GOALS,
	src.FULL_TIME_RESULT,
	src.HALF_TIME_HOME_GOALS,
	src.HALF_TIME_AWAY_GOALS,
	src.HALF_TIME_RESULT,
	src.HOME_SHOTS,
	src.AWAY_SHOTS,
	src.HOME_SHOTS_TARGET,
	src.AWAY_SHOTS_TARGET,
	src.HOME_FOULS,
	src.AWAY_FOULS,
	src.HOME_CORNERS,
	src.AWAY_CORNERS,
	src.HOME_YELLOW,
	src.AWAY_YELLOW,
	src.HOME_RED,
	src.AWAY_RED,
	src.audit_insert_date,
	src.audit_record_source
	);

As I do not use any historization in the satellite tables, the loading process for such a table consists of two steps: loading new records into the satellite table and updating existing records if a descriptive value has changed. As Exasol supports MERGE statements, both steps can be done with one single statement. A subselect queries all descriptive attributes and calculates the hash key. After that you define what the MERGE statement should do when the record does or does not exist in the target table.

If you use a load end date timestamp in a satellite table, you have to update this timestamp on every record for which a new version is available. In that case the loading process for a satellite table consists of more than one SQL statement, for example an additional end-dating step like the sketch below.
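
Just as an illustration, such an end-dating step could look like the following sketch. It assumes an additional load_end_dts column (which the statistic satellite above does not have), open records marked by load_end_dts IS NULL, and the full-time result as change indicator like in the MERGE statement:

--close the current version of every record for which the source contains a change
UPDATE raw_dv.football_match_his_l_s_statistic tgt
SET
	load_end_dts = CURRENT_TIMESTAMP
WHERE
	load_end_dts IS NULL
	AND EXISTS (
				SELECT
					1
				FROM
					stage.football_his src
				WHERE
					HASH_MD5(src.division || '#' || src.hometeam || '#' || src.awayteam || '#' || src.season || '#' || to_char(src.match_date,'yyyymmdd')) = tgt.football_match_his_lid
					AND src.full_time_result <> tgt.full_time_result
				);
--afterwards the new record versions are inserted with an open load_end_dts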

Transfer model to Analytical Layer

Now that the Data Vault model and the data are available in the Raw Data Layer, I transferred the model to the Analytical Layer. This is not strictly required, but I wanted to separate both layers not only logically but also physically. Therefore I just created a view for each table; that is all I have done. This way, data that has just been loaded into the Raw Data Layer is automatically available in the Analytical Layer. Generally, I suggest using virtualization as much as possible from the Analytical Layer onward, so there is no need to develop additional data processes.
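
For example, the views are plain 1:1 selects on the Data Vault tables (the schema name analytical_layer is an assumption):

--one 1:1 view per Data Vault table, no additional logic
CREATE VIEW analytical_layer.football_team_h AS
SELECT * FROM raw_dv.football_team_h;

CREATE VIEW analytical_layer.football_match_his_l_s_statistic AS
SELECT * FROM raw_dv.football_match_his_l_s_statistic;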

Now that the first data source is imported and modeled, I will describe in the next posts which features/variables you can calculate based on this data to develop predictive models.

If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.
