In the first part Prepare data: football-data.co.uk (part 1) I described how the Data Vault model for the data of football-data.co.uk looks like. In the second part I will now focus on loading data into the Data Vault model. With the overall analytical architecture in mind this equates the data integration process between the stage layer and the raw data layer.
Using a Data Vault model has a very nice benefit: each element type (hub, link, satellite) of such a model has a similar data structure and loading process. That’s why, I will explain the whole data processing just with one example for each type. If you know, how to load data into one hub, you know how to load data in every hub.
Load hub entities
The hub entities of a Data Vault model are the easiest ones. They just contain the Business Key of the corresponding object and some mandatory columns.
Data structure FOOTBALL_TEAM_H
Following SQL statement creates the data structure of the team hub table:
CREATE TABLE raw_layer.football_team_h ( football_team_hid CHAR(32), team VARCHAR(50), audit_insert_date TIMESTAMP, audit_record_source VARCHAR(100), CONSTRAINT pk_football_team_h primary key (football_team_hid) DISABLE, DISTRIBUTE BY football_team_hid );
The primary key column of a hub table (and every other table type) is the hash key. As I use MD5 to calculate the hash key, I have to use CHAR(32) as the data type. I only create disabled constraints, because the loading process of a Data Vault 2.0 model does not ensure referential integrity. The DISTRIBUTE BY clause is a special feature of the Exasol database. The Exasol database is a MPP (massive parallel processing) system. So data gets split over an amount of different nodes. With the DISTRIBUTE BY clause you are able to decide, which attribute should be used as the distribution key over the different nodes. I normally suggest using the most used join attribute as the distribution key.
Load data into FOOTBALL_TEAM_H
Following SQL statement loads data from the stage layer table into the team hub table:
INSERT INTO raw_dv.football_team_h ( football_team_hid, team, audit_insert_date, audit_record_source ) --select all distinct teams SELECT DISTINCT team_bk_hash, team, CURRENT_TIMESTAMP, 'load FOOTBALL_TEAM_H' FROM ( SELECt HASH_MD5(hometeam) AS team_bk_hash, hometeam AS team from stage.football_his UNION select HASH_MD5(awayteam), awayteam from stage.football_his ) src where --only load teams, which not already exist not exists ( select 1 from raw_dv.football_team_h tgt where src.team_bk_hash = tgt.football_team_hid );
The business key for the team hub table is the team name. So in a first step you must select a distinct list of all team names available in the source data. The hash key of a hub table is calculated based on the business key. So the hash key gets calculated using the team name. In a second step all selected team names are checked, whether they already exist in the hub table. That’s all you have to do to load data into a hub table.
Load Link Entities
Link tables represent the relations between different hub tables. The Data Vault model for historic football data only contains one link table: the historic match link table.
Data structure FOOTBALL_MATCH_HIS_L
Following SQL statement creates the data structure of the football match link table:
CREATE TABLE raw_layer.football_match_his_l ( football_match_his_lid CHAR(32), football_division_hid CHAR(32), football_team_home_hid CHAR(32), football_team_away_hid CHAR(32), football_season_hid CHAR(32), match_date DATE, audit_insert_date TIMESTAMP, audit_record_source VARCHAR(100), CONSTRAINT pk_football_match_his_l PRIMARY KEY (football_match_his_lid) DISABLE, CONSTRAINT fk_football_match_his_l_football_division_h FOREIGN KEY (football_division_hid) references raw_dv.football_division_h (football_division_hid) disable, CONSTRAINT fk_football_match_his_l_football_team_home_h FOREIGN KEY (football_team_home_hid) references raw_dv.football_team_h (football_team_hid) disable, CONSTRAINT fk_football_match_his_l_football_team_away_h FOREIGN KEY (football_team_away_hid) references raw_dv.football_team_h (football_team_hid) disable, CONSTRAINT fk_football_match_his_l_football_team_season_h FOREIGN KEY (football_season_hid) references raw_dv.football_season_h (football_season_hid) disable, DISTRIBUTE BY football_match_his_lid );
All link tables of a Data Vault model are designed as many-to-many relationship tables. Because of that, a link table contains one primary key and a foreign key for each hub table, which is connected to the link table. And of course there have to be the mandatory columns of the Data Vault model. The primary and foreign keys in a link table are all hash keys. A Data Vault 2.0 model just uses hash keys to reference between tables.
Load data into FOOTBALL_MATCH_HIS_L
Following SQL statement loads data from the stage layer table into the football match link table:
INSERT INTO raw_layer.football_match_his_l ( football_match_his_lid, football_division_hid, football_team_home_hid, football_team_away_hid, football_season_hid, match_date, audit_insert_date, audit_record_source ) --select all matches from stage SELECT HASH_MD5(division || '#' || hometeam || '#' || awayteam || '#' || season || '#' || to_char(match_date,'yyyymmdd')) football_match_his_lid, HASH_MD5(division) AS division_bk_hash, HASH_MD5(hometeam) AS hometeam_bk_hash, HASH_MD5(awayteam) AS awayteam_bk_hash, HASH_MD5(season) AS season_bk_hash, match_date, CURRENT_TIMESTAMP, 'load FOOTBALL_MATCH_HIS_L' FROM stage_layer.football_his src WHERE --only load matches, which not already exist NOT EXISTS ( SELECT 1 FROM raw_layer.football_match_his_l tgt WHERE HASH_MD5(src.division || '#' || src.hometeam || '#' || src.awayteam || '#' || src.season || '#' || to_char(src.match_date,'yyyymmdd')) = tgt.football_match_his_lid );
Loading a link table looks very similar to loading a hub table. The only difference is, that you have not to select a distinct list of one business key, but a distinct list of combinations of business keys. It is similar to loading a hub table with a business key, which consists of multiple columns. The primary key has to be generated based on the concatenation of all business keys. But it is very important to use a separator between each business key. Otherwise, the risk of getting a hash collision increases a lot. And again, just like loading a hub table, only new business key combinations have to be loaded into the link table.
Load Satellite Entities
Normally satellite tables have to include a load end date timestamp to provide a historization of all descriptive attributes. But as the statistic of a past football match does not change, I decided to ignore historization. So the loading process for the satellite tables gets a lot easier. I do not have to create multiple versions of records and I do not have to handle the load end dates for changed records.
Data structure FOOTBALL_MATCH_HIS_L_S_STATISTIC
Following SQL statement creates the data structure of the statistic satellite table:
CREATE TABLE raw_layer.football_match_his_l_s_statistic ( football_match_his_lid CHAR(32), load_dts TIMESTAMP, full_time_home_goals DECIMAL(3), full_time_away_goals DECIMAL(3), full_time_result VARCHAR(1), half_time_home_goals DECIMAL(3), half_time_away_goals DECIMAL(3), half_time_result VARCHAR(1), home_shots DECIMAL(3), away_shots DECIMAL(3), home_shots_target DECIMAL(3), away_shots_target DECIMAL(3), home_fouls DECIMAL(3), away_fouls DECIMAL(3), home_corners DECIMAL(3), away_corners DECIMAL(3), home_yellow DECIMAL(3), away_yellow DECIMAL(3), home_red DECIMAL(3), away_red DECIMAL(3), audit_insert_date TIMESTAMP, audit_record_source VARCHAR(100), CONSTRAINT pk_football_match_his_l_s_statistic PRIMARY KEY (football_match_his_lid, load_dts) DISABLE, CONSTRAINT fk_football_match_his_l_s_statistic FOREIGN KEY (football_match_his_lid) REFERENCES raw_dv.football_match_his_l (football_match_his_lid) DISABLE, DISTRIBUTED BY football_match_his_lid );
The hash key of a satellite table has to be exactly the same, as in the corresponding hub or link table. As you can see the primary key also includes the load data timestamp. This is normally just needed, if you use a historization in you satellite, because there could be multiple version of a satellite record with same the same hash key and different load data timestamps. The foreign key of the statistic satellite table references to the football match link. Besides the typical mandatory columns there are just the descriptive attributes. It is possible to add also the business key of the corresponding link or hub table to a satellite. But this is just optional and would be a simplification for ad hoc querying, if you are working with larger Data Vault models.
Load data into FOOTBALL_MATCH_HIS_L_S_STATISTIC
Following SQL Statement loads all descriptive statistic attributes from the stage source table into the statistic Satellite table:
MERGE INTO raw_layer.football_match_his_l_s_statistic tgt USING ( SELECT HASH_MD5(division || '#' || hometeam || '#' || awayteam || '#' || season || '#' || to_char(match_date,'yyyymmdd')) football_match_his_lid, CURRENT_TIMESTAMP load_dts, full_time_home_goals, full_time_away_goals, full_time_result, half_time_home_goals, half_time_away_goals, half_time_result, home_shots, away_shots, home_shots_target, away_shots_target, home_fouls, away_fouls, home_corners, away_corners, home_yellow, away_yellow, home_red, away_red, CURRENT_TIMESTAMP audit_insert_date, 'load FOOTBALL_MATCH_HIS_L_S_STATISTIC' audit_record_source FROM stage_layer.football_his ) src ON (src.football_match_his_lid = tgt.football_match_his_lid) WHEN MATCHED THEN UPDATE SET tgt.FULL_TIME_HOME_GOALS = src.FULL_TIME_HOME_GOALS, tgt.FULL_TIME_AWAY_GOALS = src.FULL_TIME_AWAY_GOALS, tgt.FULL_TIME_RESULT = src.FULL_TIME_RESULT, tgt.HALF_TIME_HOME_GOALS = src.HALF_TIME_HOME_GOALS, tgt.HALF_TIME_AWAY_GOALS = src.HALF_TIME_AWAY_GOALS, tgt.HALF_TIME_RESULT = src.HALF_TIME_RESULT, tgt.HOME_SHOTS = src.HOME_SHOTS, tgt.AWAY_SHOTS = src.AWAY_SHOTS, tgt.HOME_SHOTS_TARGET = src.HOME_SHOTS_TARGET, tgt.AWAY_SHOTS_TARGET = src.AWAY_SHOTS_TARGET, tgt.HOME_FOULS = src.HOME_FOULS, tgt.AWAY_FOULS = src.AWAY_FOULS, tgt.HOME_CORNERS = src.HOME_CORNERS, tgt.AWAY_CORNERS = src.AWAY_CORNERS, tgt.HOME_YELLOW = src.HOME_YELLOW, tgt.AWAY_YELLOW = src.AWAY_YELLOW, tgt.HOME_RED = src.HOME_RED, tgt.AWAY_RED = src.AWAY_RED, tgt.audit_insert_date = src.audit_insert_date, tgt.audit_record_source = src.audit_record_source WHERE --Full-time result has changed --sufficient check for changed records tgt.FULL_TIME_RESULT <> src.FULL_TIME_RESULT WHEN NOT MATCHED THEN INSERT VALUES ( src.football_match_his_lid, src.load_dts, src.FULL_TIME_HOME_GOALS, src.FULL_TIME_AWAY_GOALS, src.FULL_TIME_RESULT, src.HALF_TIME_HOME_GOALS, src.HALF_TIME_AWAY_GOALS, src.HALF_TIME_RESULT, src.HOME_SHOTS, src.AWAY_SHOTS, src.HOME_SHOTS_TARGET, src.AWAY_SHOTS_TARGET, src.HOME_FOULS, src.AWAY_FOULS, src.HOME_CORNERS, src.AWAY_CORNERS, src.HOME_YELLOW, src.AWAY_YELLOW, src.HOME_RED, src.AWAY_RED, src.audit_insert_date, src.audit_record_source );
As I do not use any historization in the satellite tables, the loading process for such a table consists of 2 steps: Loading new records into the satellite table and update existing records, if a descriptive value has changed. As Exasol supports MERGE statements, all this can be done with one single statement. A subselect queries all descriptive attributes and calculates the hash key. After that you have to define, what the MERGE statement should do, if the record exists or not exists in the target table.
If you use a load end date timestamp in a satellite table, you have to update this timestamp at every record, for which a new version is available. So the loading process for a satellite table would consist of more than one SQL statement.
Transfer model to Analytical Layer
After the Data Vault model and the data is now available in the Raw Data Layer, I transferred the model to the Analytical Layer. This is not necessarily required, but I wanted to separate both layers not only logical but also physical. Therefore, I just created views for each table. That’s all I have done. By doing that, data, which is recently loaded to Raw Data Layer, is automatically available at the Analytical Layer. Generally, I would suggest to use virtualization as much as possible from the Analytical Layer forward. So there is no need to develop more data processes.
After the first data source is now imported and modeled, I will describe in the next posts, which features / variables you can calculate based on this data to develop predictive models.
If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.