This documentation provides detailed instructions on setting up and running a Docker container to retrieve, store, and query data from the New York Times Books API, specifically focusing on the lists overview endpoint. The tasks include data retrieval for the years 2021 to 2023, dimensional modeling, and executing specific SQL queries to generate insights and export results.
- Docker and Docker Compose installed on your machine.
- A New York Times API key. You can request one here.
The Docker Compose file defines a single service nyt-elt which is responsible for the entire ELT (Extract, Load, Transform) process.
version: '3.8'
services:
nyt-elt:
build:
context: .
dockerfile: Dockerfile
image: nyt-books-app
container_name: nyt-elt
volumes:
- ./core/sql_queries/docker_results:/app/core/sql_queries/results
- ./output_db:/app/output_db
environment:
- START_DATE=2021-01-01
- END_DATE=2024-01-01
- API_KEY=API
command: /bin/bash /app/elt.sh
-
Clone the repository: Clone the repository to your local machine.
-
Add your API key: Replace the value of the
API_KEY
environment variable in the Docker Compose file with your New York Times API key. -
Optional: Modify the
START_DATE
andEND_DATE
environment variables in the Docker Compose file to specify the date range for data retrieval. Have in mind that depending on the date range, the data retrieval process may take longer (up to 30 minutes for the data between 2021 and 2023). A set of pre-calculated results are available in thecore/sql_queries/results
directory for the years 2021 to 2023. -
Build the Docker container: Run the following command to build the Docker container.
docker-compose build
-
Run the Docker container: Run the following command to start the Docker container and execute the ELT process.
docker-compose up
-
Check the results: After the ELT process is complete, you can find the results of the SQL queries in the
core/sql_queries/results
directory and the output database in theoutput_db
directory. You can then connect to the output database to query the data and generate insights.
The ELT process starts with data retrieval from the New York Times Books API using the lists overview endpoint. The data is retrieved for the specified date range and stored in a Duckdb database. The data includes information about the best-selling books for each week, such as the title, author, description, rank, and weeks on the list. The data retrieval process is implemented using the dlt package in Python, which handles the API requests, data transformation, and database loading.
Using dbt, the data is transformed into a star schema consisting of six tables: authors_dimension
, books_dimension
, dates_dimension
, lists_dimension
, publishers_dimension
and best_sellers_facts
.
The following diagram illustrates the star schema:
The ELT process includes the execution of specific SQL queries to generate insights from the data. The queries are stored in the core/sql_queries
directory. A set of pre-calculated results are available in the core/sql_queries/results
directory for the years 2021 to 2023.
-
Most weeks in the top 3 ranks: This query identifies the book that spent the most weeks in the top 3 ranks:
SELECT b.title, COUNT(*) AS weeks_in_top_3 FROM nyt_books.main_models.best_sellers_facts AS f JOIN nyt_books.main_models.dates_dimension AS d ON f.published_date_id = d.date_id JOIN nyt_books.main_models.books_dimension AS b ON f.book_id = b.id WHERE f.rank <= 3 AND d.year = 2022 GROUP BY b.title ORDER BY weeks_in_top_3 DESC LIMIT 1
-
Top 3 lists with most unique books: This query identifies the top 3 lists with the least number of unique books in their rankings for the entirety of the data.
SELECT f.list_id, COUNT(DISTINCT f.book_id) AS distinct_books_count FROM nyt_books.main_models.best_sellers_facts AS f JOIN nyt_books.main_models.dates_dimension AS d ON f.published_date_id = d.date_id GROUP BY f.list_id ORDER BY COUNT(DISTINCT f.book_id) ASC LIMIT 3
-
Quarterly ranking of publishers: This query calculates the quarterly ranking of publishers based on a point system where the rank of the book on the list determines the points awarded to the publisher.
WITH book_quarter_rank AS ( SELECT f.book_id, p.publisher_name, d.quarter, d.year, MIN(f."rank") AS best_rank FROM nyt_books.main_models.best_sellers_facts AS f LEFT JOIN nyt_books.main_models.publishers_dimension AS p ON f.publisher_id = p.id LEFT JOIN nyt_books.main_models.dates_dimension AS d ON f.published_date_id = d.date_id WHERE d."date" >= '2021-01-01' AND d."date" < '2024-01-01' GROUP BY f.book_id, p.publisher_name, d.quarter, d.year ), book_quarter_rank_points AS ( SELECT publisher_name, year, quarter, best_rank, CASE WHEN best_rank = 1 THEN 5 WHEN best_rank = 2 THEN 4 WHEN best_rank = 3 THEN 3 WHEN best_rank = 4 THEN 2 WHEN best_rank = 5 THEN 1 ELSE 0 END AS points FROM book_quarter_rank ), best_publishers AS ( SELECT publisher_name, year, quarter, SUM(points) AS total_points FROM book_quarter_rank_points GROUP BY publisher_name, year, quarter ORDER BY year ASC, quarter ASC, SUM(points) DESC ), publishers_ranked AS ( SELECT publisher_name, year, quarter, total_points, RANK() OVER(PARTITION BY year, quarter ORDER BY total_points DESC) AS publishers_rank FROM best_publishers ) SELECT * FROM publishers_ranked WHERE publishers_rank <= 5 ORDER BY year ASC, quarter ASC, publishers_rank ASC
-
This final query answers this question: Two friends Jake and Pete have podcasts where they review books. Jake's team reviews the book ranked first on every list, while Pete’s team reviews the book ranked third. Both of them share books, if Jake’s team wants to review a book, they first check with Pete’s before buying and vice versa. Which team bought what book in 2023?
WITH ranked_books AS ( SELECT bs.list_id, b.title, bs.rank, d."date" FROM nyt_books.main_models.best_sellers_facts AS bs JOIN nyt_books.main_models.books_dimension AS b ON bs.book_id = b.id JOIN nyt_books.main_models.dates_dimension AS d ON bs.published_date_id = d.date_id WHERE d.year = 2023 AND (bs.rank = 1 OR bs.rank = 3) ORDER BY d."date" ASC, bs.list_id ASC ), flagged AS ( SELECT title, rank, "date", CASE WHEN title IN (SELECT subquery.title FROM ranked_books AS subquery WHERE subquery."date" < s."date" AND subquery.rank != s.rank) THEN 1 ELSE 0 END AS flag FROM ranked_books AS s ), ranked AS ( SELECT title, rank, "date", flag, ROW_NUMBER() OVER (PARTITION BY title ORDER BY date) AS rn FROM flagged WHERE flag = 0 ), first_date AS ( SELECT title, rank, date, flag FROM ranked WHERE rn = 1 ) SELECT title, CASE WHEN rank = 1 THEN 'Jake´s team' ELSE 'Pete´s team' END AS bought_by FROM first_date ;