StatsBomb Football Data

This data comes from the StatsBomb free open-data release and covers football competitions, matches, players, and events. It is stored as deeply nested JSON, which makes it a good dataset for practicing data engineering skills.

Raw JSON Data

Directory Structure

data/
├── competitions.json
│
├── matches/
│    └── <competition_id>/
│        └── <season_id>.json
│  
├── lineups/
│    └── <match_id>.json
│  
└── events/
     └── <match_id>.json
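
As a minimal sketch of how the layout above maps to file paths, the helpers below build the path for each kind of file. The function names and the IDs in the usage comment are illustrative assumptions, not part of the dataset or this repository.

```python
from pathlib import Path


def matches_path(root, competition_id, season_id):
    """Path of the matches file for one season of one competition."""
    return Path(root) / "matches" / str(competition_id) / f"{season_id}.json"


def lineups_path(root, match_id):
    """Path of the lineups file for one match."""
    return Path(root) / "lineups" / f"{match_id}.json"


def events_path(root, match_id):
    """Path of the events file for one match."""
    return Path(root) / "events" / f"{match_id}.json"


# Usage (hypothetical IDs, shown only to illustrate the shape of the paths):
# matches_path("data", 11, 90) points at data/matches/11/90.json
```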

Files Explanation

competitions.json
- This file contains basic data about the competitions as well as their seasons
matches/<competition_id>/<season_id>.json
- This file contains basic data about all the matches in that season. Note that some of this data is truncated because we're using the free version of the data
lineups/<match_id>.json
- This file contains the lineups for that match id
- This will contain data about all players that actually played in the match
- It will also show their position, and if they changed from one position to another during the match
events/<match_id>.json
- This file contains the events for that match id
- This is the bread and butter of the dataset: it records every pass, tackle, and other event that happened in the match, and each event links to its related events
- (PS: this may be my undoing because it's a LOT of data (JK), but let's see where this goes)
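
To get a feel for how much is in an events file, a small sketch like the one below can load it and count events by type. It assumes each event is a JSON object with a nested type object that has a name field; that field layout is an assumption here, so check it against a real file before relying on it. The helper names are hypothetical.

```python
import json
from collections import Counter


def load_json(path):
    """Read one of the raw JSON files into Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_event_types(events):
    """Count events by type name.

    Assumes each event looks like {"type": {"name": "Pass"}, ...};
    events without that field are counted under "unknown".
    """
    return Counter(
        event.get("type", {}).get("name", "unknown") for event in events
    )


# Usage, given a real events file:
# events = load_json("data/events/<match_id>.json")
# print(count_event_types(events).most_common(5))
```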

MongoDB Data

Database Structure

database 
├── competitions (collection)
├── matches (collection)
├── lineups (collection)
└── events (collection)

Migration Process

competitions -> file is uploaded as is
matches -> each document in each file is given a match_id field (based on the name of the file itself in the Raw Data)
lineups -> each document in each file is given a match_id field (based on the name of the file itself in the Raw Data)
events -> each document in each file is given a match_id field (based on the name of the file itself in the Raw Data)

PS: the _id field is automatically generated by MongoDB
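
The tagging step described above can be sketched roughly as follows. This is not the actual migration script: the function names are hypothetical, the file is assumed to hold a JSON array of documents, and the file name is assumed to be a numeric ID. The collection argument is expected to behave like a pymongo collection, of which only insert_many is used.

```python
import json
from pathlib import Path


def tag_with_match_id(path):
    """Load a JSON array and add a match_id taken from the file name.

    Assumes the file is named <match_id>.json with a numeric ID.
    """
    path = Path(path)
    match_id = int(path.stem)  # "12345.json" gives 12345
    with path.open(encoding="utf-8") as f:
        documents = json.load(f)
    # Copy each document so the loaded data is not mutated in place.
    return [{**doc, "match_id": match_id} for doc in documents]


def migrate_file(path, collection):
    """Insert the tagged documents and return how many were written.

    MongoDB generates the _id field for each inserted document.
    """
    documents = tag_with_match_id(path)
    if documents:  # insert_many rejects an empty list
        collection.insert_many(documents)
    return len(documents)
```

With pymongo, the collection would be something like client["database"]["events"], where the database name is whatever the migration uses.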

SQL Data

Please refer to relational database design.dbml or to the ERD diagram for the database design.

You can click on the image to go to the interactive version of the ERD.
