nyt-books-elt

Overview

This documentation provides detailed instructions on setting up and running a Docker container to retrieve, store, and query data from the New York Times Books API, specifically focusing on the lists overview endpoint. The tasks include data retrieval for the years 2021 to 2023, dimensional modeling, and executing specific SQL queries to generate insights and export results.

Pre-requisites

Docker and Docker Compose installed on your machine.
A New York Times API key. You can request one here.

Docker Compose File

The Docker Compose file defines a single service nyt-elt which is responsible for the entire ELT (Extract, Load, Transform) process.

version: '3.8'
services:
  nyt-elt:
    build:
      context: .
      dockerfile: Dockerfile
    image: nyt-books-app
    container_name: nyt-elt
    volumes:
      - ./core/sql_queries/docker_results:/app/core/sql_queries/results
      - ./output_db:/app/output_db
    environment:
      - START_DATE=2021-01-01
      - END_DATE=2024-01-01
      - API_KEY=API
    command: /bin/bash /app/elt.sh

Setup

Clone the repository: Clone the repository to your local machine.
Add your API key: Replace the value of the API_KEY environment variable in the Docker Compose file with your New York Times API key.
Optional: Modify the START_DATE and END_DATE environment variables in the Docker Compose file to specify the date range for data retrieval. Have in mind that depending on the date range, the data retrieval process may take longer (up to 30 minutes for the data between 2021 and 2023). A set of pre-calculated results are available in the core/sql_queries/results directory for the years 2021 to 2023.
Build the Docker container: Run the following command to build the Docker container.
```
docker-compose build
```
Run the Docker container: Run the following command to start the Docker container and execute the ELT process.
```
docker-compose up
```
Check the results: After the ELT process is complete, you can find the results of the SQL queries in the core/sql_queries/results directory and the output database in the output_db directory. You can then connect to the output database to query the data and generate insights.

ELT Process

Data Retrieval

The ELT process starts with data retrieval from the New York Times Books API using the lists overview endpoint. The data is retrieved for the specified date range and stored in a Duckdb database. The data includes information about the best-selling books for each week, such as the title, author, description, rank, and weeks on the list. The data retrieval process is implemented using the dlt package in Python, which handles the API requests, data transformation, and database loading.

Dimensional Modeling

Using dbt, the data is transformed into a star schema consisting of six tables: authors_dimension, books_dimension, dates_dimension, lists_dimension, publishers_dimension and best_sellers_facts.

The following diagram illustrates the star schema:

SQL Queries

The ELT process includes the execution of specific SQL queries to generate insights from the data. The queries are stored in the core/sql_queries directory. A set of pre-calculated results are available in the core/sql_queries/results directory for the years 2021 to 2023.

Most weeks in the top 3 ranks: This query identifies the book that spent the most weeks in the top 3 ranks:

SELECT
    b.title,
    COUNT(*) AS weeks_in_top_3
FROM nyt_books.main_models.best_sellers_facts AS f
JOIN nyt_books.main_models.dates_dimension AS d
    ON f.published_date_id = d.date_id
JOIN nyt_books.main_models.books_dimension AS b
    ON f.book_id = b.id
WHERE f.rank <= 3
AND d.year = 2022
GROUP BY b.title
ORDER BY weeks_in_top_3 DESC
LIMIT 1

Top 3 lists with most unique books: This query identifies the top 3 lists with the least number of unique books in their rankings for the entirety of the data.

SELECT
    f.list_id,
    COUNT(DISTINCT f.book_id) AS distinct_books_count
FROM nyt_books.main_models.best_sellers_facts AS f
JOIN nyt_books.main_models.dates_dimension AS d
    ON f.published_date_id = d.date_id
GROUP BY f.list_id
ORDER BY COUNT(DISTINCT f.book_id) ASC
LIMIT 3

Quarterly ranking of publishers: This query calculates the quarterly ranking of publishers based on a point system where the rank of the book on the list determines the points awarded to the publisher.

WITH book_quarter_rank AS (
    SELECT
        f.book_id,
        p.publisher_name,
        d.quarter,
        d.year,
        MIN(f."rank") AS best_rank
    FROM nyt_books.main_models.best_sellers_facts AS f
    LEFT JOIN nyt_books.main_models.publishers_dimension  AS p
        ON f.publisher_id = p.id
    LEFT JOIN nyt_books.main_models.dates_dimension AS d
        ON f.published_date_id = d.date_id
    WHERE d."date" >= '2021-01-01' AND d."date" < '2024-01-01'
    GROUP BY f.book_id, p.publisher_name, d.quarter, d.year
),
book_quarter_rank_points AS (
    SELECT
        publisher_name,
        year,
        quarter,
        best_rank,
        CASE
            WHEN best_rank = 1 THEN 5
            WHEN best_rank = 2 THEN 4
            WHEN best_rank = 3 THEN 3
            WHEN best_rank = 4 THEN 2
            WHEN best_rank = 5 THEN 1
            ELSE 0
        END AS points
    FROM book_quarter_rank
),
best_publishers AS (
    SELECT
        publisher_name,
        year,
        quarter,
        SUM(points) AS total_points
    FROM book_quarter_rank_points
    GROUP BY publisher_name, year, quarter
    ORDER BY year ASC, quarter ASC, SUM(points) DESC
),
publishers_ranked AS (
    SELECT
        publisher_name,
        year,
        quarter,
        total_points,
        RANK() OVER(PARTITION BY year, quarter ORDER BY total_points DESC) AS publishers_rank
    FROM best_publishers
)
SELECT *
FROM publishers_ranked
WHERE publishers_rank <= 5
ORDER BY year ASC, quarter ASC, publishers_rank ASC

This final query answers this question: Two friends Jake and Pete have podcasts where they review books. Jake's team reviews the book ranked first on every list, while Pete’s team reviews the book ranked third. Both of them share books, if Jake’s team wants to review a book, they first check with Pete’s before buying and vice versa. Which team bought what book in 2023?

WITH ranked_books AS (
    SELECT
        bs.list_id,
        b.title, 
        bs.rank,
        d."date" 
    FROM nyt_books.main_models.best_sellers_facts AS bs
    JOIN nyt_books.main_models.books_dimension AS b ON bs.book_id = b.id
    JOIN nyt_books.main_models.dates_dimension AS d ON bs.published_date_id = d.date_id
    WHERE d.year = 2023 AND (bs.rank = 1 OR bs.rank = 3)
    ORDER BY d."date" ASC, bs.list_id ASC
),
flagged AS (
    SELECT
        title,
        rank,
        "date",
        CASE
            WHEN title IN (SELECT subquery.title FROM ranked_books AS subquery WHERE subquery."date" < s."date" AND subquery.rank != s.rank)
            THEN 1
            ELSE 0
        END AS flag
    FROM ranked_books AS s
), 
ranked AS (
    SELECT
        title,
        rank,
        "date",
        flag,
        ROW_NUMBER() OVER (PARTITION BY title ORDER BY date) AS rn
    FROM flagged
    WHERE flag = 0
),
first_date AS (
    SELECT
        title,
        rank,
        date,
        flag
    FROM ranked
    WHERE rn = 1
)
SELECT
    title,
    CASE
        WHEN rank = 1 THEN 'Jake´s team'
        ELSE 'Pete´s team'
    END AS bought_by
FROM first_date
;

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.github		.github
core		core
images		images
output_db		output_db
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
Makefile		Makefile
README.md		README.md
docker-compose.yml		docker-compose.yml
elt.sh		elt.sh
poetry.lock		poetry.lock
profiles.yml		profiles.yml
pyproject.toml		pyproject.toml
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

nyt-books-elt

Overview

Pre-requisites

Docker Compose File

Setup

ELT Process

Data Retrieval

Dimensional Modeling

SQL Queries

About

Releases

Packages

Languages

ricardo8aib/nyt-books-elt

Folders and files

Latest commit

History

Repository files navigation

nyt-books-elt

Overview

Pre-requisites

Docker Compose File

Setup

ELT Process

Data Retrieval

Dimensional Modeling

SQL Queries

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages