Prepare data: football-data.co.uk (part 1)

In the post Gather data: football-data.co.uk I described, how you can load CSV data into the Exasol database. As the data is now available at the Stage Layer in the database, I must now prepare the data and persist it at the Raw Data Layer, so that I can easily use it for building predictive models.

With part 1 of this post I want to explain, what Data Vault modeling is and how the Data Vault model for the data structure of football-data.co.uk looks like. With part 2 I will explain, how you load data into the developed Data Vault model.

What is Data Vault?

Data Vault 1.0 was released by Dan Linstedt in 2000 as a data modeling method. In year 2013 Data Vault was extended by methodology and architecture.

Before Data Vault was introduced, there were only two methods for modeling the data layer in a Data Warehouse: the Conformed Dimension approach by Ralph Kimbell or  the normalized approach by Bill Inmon. But as I could recognize with some grown data warehouses, both approaches have some problems with changes. And this is really a problem as an agile development gets more and more important to keep up with the fast changing business requirements.

The Data Vault model itself combines the advantages of the other modeling approaches with one small trick: Separate the objects and the associations between the objects from the context.

If you want to learn something more about the basics of Data Vault, I recommend reading the article Data Vault Basics. I will now just take a look at the basic elements of the Data Vault model.

 

Data Vault modeling

A Data Vault model consists only of 3 elements:

  • Hubs
  • Links
  • Satellites

Hubs and Links build the structure, the skeleton of your Data Vault Model. The Satellites contain all descriptive attributes.

 

the_data_vault
Basic elements of a Data Vault model

 

Hubs

Hubs represent unique lists of different Business Objects. They contain only the Business Keys of these objects and no descriptive information. Descriptive information is stored in Satellites. A Hub must at least contain following columns:

  • Hub Hash Key – Hash Key generated from Business Key columns
  • Business Key – Unique identifier of a business object
  • Load Date Timestamp – Timestamp information, when the record was loaded to the Hub
  • Record Source – Audit information for the source of the record

Links

Links represent the associations between the business objects. Together with the Hub tables, they build the skeleton of your Data Vault model. Basically Link tables are always modeled as many-to-many relation tables. This supports the flexibility of the data model. A Link table must at least contain following columns:

  • Link Hash Key – Hash Key generated from all Hub Business Keys connected to the Link table
  • Hub Hash Keys – Hub Hash Keys of all Hub tables connected to the Link table
  • Load Date Timestamp – Timestamp information, when the record was loaded to the Hub
  • Record Source – Audit information for the source of the record

Satellites

As mention before descriptive data is stored in the Satellite tables. A Satellite table can contain descriptive data for a Hub or a Link. All descriptive attributes in a Satellite are historized like a SCD2 dimension. Different descriptive attributes can be split into different Satellite tables. They can be split e.g. by source system, change rate or business meaning.

A Satellite table must at least contain following columns:

  • Hub/Link Hash Key – Hash Key of the corresponding Hub / Link table
  • Load Date Timestamp – Timestamp information, when the record was loaded to the Hub
  • Load End Date Timestamp – Timestamp information, until the version of this record is valid (SCD2 functionality)
  • Record Source – Audit information for the source of the record

 

Data Vault model for football-data.co.uk

To build the Data Vault model for football-data.co.uk you first must identify your Business Objects and the corresponding Business Keys. Based on the given data structure following Business Objects can be determined:

  • Division
  • Season
  • Team

As you can see, I did not list the match itself as a Business Object. There is a simple reason for that: A match is played in a division, in a season, between two teams on a fixed date. All these attributes are Business Keys of the mentioned Business Objects. Because of this the match should be modeled as a Link table between all Business Objects.

All statistics and betting odds are descriptive attributes of a specific match. So these attributes must be modeled as a Satellite table for the Match Link. Because of personal preferences I decided to split the statistics and odds in two different Satellite tables.

The following picture shows my developed Data Vault model.

 

Raw_DV_initial (1)
Data Vault model for football-data.co.uk

 

In the next part I will describe, how you load data into this Data Vault model.

 

If you have further questions, feel free to leave a comment or contact me @Mo_Nbg.

5 Replies to “Prepare data: football-data.co.uk (part 1)”

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s