Pipeline for NHL Data

I just wanted to have some fun with some data pipelines and messing around with NestJS

Pipeline for NHL Data

Running locally

To run this application with all of its components you will need to have Docker installed. If you want to run any of the unit tests or run the services individually without the likes of PostgreSQL or RabbitMq you will need to have NodeJS installed. Links to those can be found in the resources section.

Running docker compose for the first time you'll need to build the docker images from the local docker files along pulling any missing docker images you do not have locally. This can be done with the command below

docker compose up --build --pull missing -d

Once you're done you can stop all services by running

docker compose down

Usage

After running the docker compose up command before using the API check the health check (GET http://localhost:3000/health) since RabbitMQ takes some time to setup even with the queue already being created and also for PostgreSQL to initialize itself for the first time.

Once everything is up and running you can check out the GraphQL schema and documentation by going to the playground at http://localhost:3000/graphql

Load Game/Season

Before you can query any player stats you first need to load some data you can do so by loading a season or a game

request

mutation LoadPlayersStats($loadPlayersStatsInput: LoadPlayersStatsInput!) {
  loadPlayersStats(loadPlayersStatsInput: $loadPlayersStatsInput) {
    message
  }
}

variables (loading a season)

{
  "loadPlayersStatsInput": {
    "seasonId": "20182019"
  }
}

variables (loading a game)

{
  "loadPlayersStatsInput": {
    "gameId": 2022010001
  }
}

Query player stats

Once player stats have been loaded you can begin to query player stats. The example below returns results where players have scored at least 3 goals in a game against the Ottawa Senators (opponentTeamId: 9) or the Los Angeles Kings (opponentTeamId: 26)

query PlayersStats($queryPlayerStatsInput: QueryPlayerStatsInput!) {
  playerStats(queryPlayerStatsInput: $queryPlayerStatsInput) {
    playerName
    teamName
    playerAge
    playerPosition
    points
    penaltyMinutes
    opponentTeamName
    playerNumber
    teamId
    teamName
    opponentTeamId
    opponentTeamName
  }
}

variables

{
  "queryPlayerStatsInput": {
    "points": {
      "operator": "gte",
      "value": 3
    },
    "opponentTeamIds": [9, 26],
    "options": {
      "limit": 100,
      "offset": 1
    }
  }
}

Components / Design Decisions

GraphQL API

I went with a GraphQL API because I thought about how a player's stats could/would be queried. If someone wanted to query for all game stats where a player scored more than 2 goals in a game, from a specific position against the New Jersey Devils they would be able to query that using the same endpoint as someone who would want to query all of the players with a number of hits for the Florida Panthers. GraphQL also provides a simple way to write/view documentation for those endpoints.

RabbitMQ

I knew that I wanted to use a microservice architecture and with that would be using message passing to tell the ETL service that there was another game or season's worth of games to load into the database.

Game Stats ETL

This service is what receives a request to either load a game or a season's worth of games into the database to to be queried later. It also checks game status (if a game is live or not) to determine if it needs to continue to pull game data for storage.

PostgreSQL

It's the database where all of the player's game stats are stored. I chose to to store all data in a single table since my focus was on player stats per game. If player info, stats, game info, team info were all in a separate table then I'd consider using a SQL View which would provide me a similar concept of this data all being in the same table

Flow Diagrams

Storing player stats

The diagram below goes over the flow when a seasonId is provided. The flow is the same when a gameId is sent except for the call to the NHL API to retrieve the season schedule

sequenceDiagram
  Client->>GraphQL: store stats from seasonId
  GraphQL->>ETL: emit `load_player_stats` for season (via RabbitMQ)
  ETL->>NHL API: request schedule for seasonId
  NHL API-->>ETL: season schedule returned with dates and games on those dates
  ETL->>NHL API: get games w/ player stats
  NHL API-->>ETL: games w/ player stats
  ETL->>PostgreSQL: store player stats

Querying stats

The diagram below goes over the flow for querying player stats. The client makes a GraphQL query which gets transformed into a SQL query and the results are returned.

sequenceDiagram
  Client->>GraphQL: make query request for data
  GraphQL->>PostgreSQL: queries PostgreSQL for data
  PostgreSQL-->>GraphQL: queried data returned
  GraphQL-->>Client: specific requested properties returned

What else I would do

Something else I would've considered doing if I didn't want to time box this would be to implement the streaming of game stats in the context of a live game scenario. To do this I would've added Kafka for the streaming from PostgreSQL in combination of using GraphQL's subscription to get real-time stat updates.

In regards to querying, for simplicity I made majority of queries operate in an OR fashion, but I would update that to allow the querier to determine if they want determine if the queries should use OR or AND similarly to how I've implemented the age or stats properties where the querier chooses the comparison operator and provides the value.

Another addition would be an events API. The purpose of this would be to allow for the user(s) to be able to query stats currently being processed based on a status identifier. So if player stats are currently being loaded based on a game or season id there would be an identifier or correlation id that can be used to query if there were errors in the ingestion process or if it's finished and the duration of the loading of player stats.

Also, more tests...

Name		Name	Last commit message	Last commit date
Latest commit History 114 Commits
api		api
etl		etl
http		http
resources		resources
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pipeline for NHL Data

Running locally

Usage

Load Game/Season

Query player stats

Components / Design Decisions

GraphQL API

RabbitMQ

Game Stats ETL

PostgreSQL

Flow Diagrams

Storing player stats

Querying stats

What else I would do

Resources

About

Releases

Packages

Languages

MorpheusNephew/improved-octo-waddle

Folders and files

Latest commit

History

Repository files navigation

Pipeline for NHL Data

Running locally

Usage

Load Game/Season

Query player stats

Components / Design Decisions

GraphQL API

RabbitMQ

Game Stats ETL

PostgreSQL

Flow Diagrams

Storing player stats

Querying stats

What else I would do

Resources

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages