StatsBomb football data

image

This dataset comes from StatsBomb's free and open data and covers football competitions, matches, players, and events. It is stored as deeply nested JSON, which makes it a good dataset for practicing data engineering skills.


Raw JSON Data

Directory Structure

data/
├── competitions.json
│
├── matches/
│    └── <competition_id>/
│        └── <season_id>.json
│  
├── lineups/
│    └── <match_id>.json
│  
└── events/
     └── <match_id>.json

Files Explanation

  1. competitions.json

    • This file contains basic data about the competitions as well as their seasons
  2. matches/<competition_id>/<season_id>.json

    • This file contains basic data about all the matches in that season. Note that some of this data is truncated because we're using the free version of the data
  3. lineups/<match_id>.json

    • This file contains the lineups for that match ID
    • It lists every player who actually played in the match
    • It also shows each player's position, including any change from one position to another during the match
  4. events/<match_id>.json

    • This file contains the events for that match ID
    • This is the bread and butter of the dataset: it records every pass, every tackle, and every other event that happened in the match, and each event has related events
    • (PS: this may be my undoing because it's a LOT of data (JK), but let's see where this goes)
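The layout above can be navigated with a few small helpers. This is only a sketch, assuming the raw files sit under a local data/ directory laid out exactly as in the tree, and that each match document carries a match_id field. The function names (iter_match_files, match_file_paths, load_json) are made up for illustration and are not part of this repo.

```python
import json
from pathlib import Path


def iter_match_files(data_dir):
    """Yield (competition_id, season_id, path) for every matches file.

    Assumes the layout matches/<competition_id>/<season_id>.json.
    """
    for path in sorted(Path(data_dir).glob("matches/*/*.json")):
        # The parent folder is the competition, the file stem is the season.
        yield path.parent.name, path.stem, path


def match_file_paths(data_dir, match_id):
    """Return where the lineups and events files for one match should live."""
    root = Path(data_dir)
    return {
        "lineups": root / "lineups" / f"{match_id}.json",
        "events": root / "events" / f"{match_id}.json",
    }


def load_json(path):
    """Read one JSON file into Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A typical walk would call iter_match_files("data"), load each matches file with load_json, and pass every match's match_id to match_file_paths to find its lineups and events files.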

MongoDB Data

Database Structure

database 
├── competitions (collection)
├── matches (collection)
├── lineups (collection)
└── events (collection)

Migration Process

  1. competitions -> file is uploaded as is
  2. matches -> each document in each file is given a match_id field (based on the name of the file itself in the Raw Data)
  3. lineups -> each document in each file is given a match_id field (based on the name of the file itself in the Raw Data)
  4. events -> each document in each file is given a match_id field (based on the name of the file itself in the Raw Data)

PS: the _id field is automatically generated by MongoDB
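The tagging step for lineups and events could look roughly like the sketch below. It assumes each file is a JSON array of objects and that the file name (without .json) is an integer match ID. The names tag_with_match_id and migrate_directory are hypothetical, and collection stands for any object with an insert_many method, such as a pymongo collection; this is not necessarily how the actual migration script is written.

```python
import json
from pathlib import Path


def tag_with_match_id(path):
    """Load one <match_id>.json file and add match_id to every document."""
    path = Path(path)
    match_id = int(path.stem)  # assumption: the file name is the match ID
    with open(path, encoding="utf-8") as f:
        documents = json.load(f)
    # Build new dicts so the loaded data is not modified in place.
    return [{**doc, "match_id": match_id} for doc in documents]


def migrate_directory(directory, collection):
    """Tag and insert every JSON file in a directory, one file at a time.

    Returns the number of documents handed to the collection.
    """
    inserted = 0
    for path in sorted(Path(directory).glob("*.json")):
        documents = tag_with_match_id(path)
        if documents:  # pymongo's insert_many rejects an empty list
            collection.insert_many(documents)
            inserted += len(documents)
    return inserted
```

With pymongo installed, something like migrate_directory("data/events", db["events"]) would fill the events collection, and MongoDB would add the _id field to each document on insert.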


SQL Data

Please refer to relational database design.dbml or to the ERD diagram for the database design.

image

You can click on the image to go to the interactive version of the ERD.