Merge pull request #64 from dataforgoodfr/update-series-data-loader
Update series data loader
kaaloo authored May 9, 2024
2 parents 6a7c34d + 8c41632 commit eb290d3
Showing 5 changed files with 44 additions and 323 deletions.
41 changes: 19 additions & 22 deletions README.md
@@ -29,15 +29,15 @@ pip install poetry
cd 12_observatoire_des_imaginaires/
```

3. **Configure** `poetry` to create a Virtual Environment inside the project:

Ensure that poetry creates the `.venv` directory inside the project with the command:

```bash
poetry config virtualenvs.in-project true
```

4. **Install Project Dependencies using** `poetry`:

Use `poetry` to install the project dependencies.

@@ -67,20 +67,16 @@ pip install poetry

This code base reads its configuration from a `.env` file in the root directory.

| Variable         | Description                                                          | Default Value |
| ---------------- | -------------------------------------------------------------------- | ------------- |
| HF_TOKEN         | Hugging Face API Token. You must have write access to the datasets.  | N/A           |
| TMDB_API_KEY     | TMDB API Token.                                                       | N/A           |
| TMDB_BATCH_SIZE  | Number of TMDB entries to download before updating a HF dataset.     | 10000         |
| TMDB_MAX_RETRIES | Maximum number of times to retry a failed TMDB API call.             | 500           |
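
For reference, a minimal `.env` might look like the following; every value below is a placeholder, not a working credential:

```bash
# .env — placeholder values only, substitute your own credentials
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
TMDB_API_KEY=your_tmdb_api_key
TMDB_BATCH_SIZE=10000
TMDB_MAX_RETRIES=500
```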

## Website to select a specific movie or TV show

The [observable](https://github.com/dataforgoodfr/12_observatoire_des_imaginaires/tree/main/site-observable) directory contains an Observable Framework site that collects film and TV data from the Hugging Face datasets above and filters them according to the following rules, in order to reduce the size of the data shipped with the generated web site. The site provides a search UI that allows a user to select a specific movie or TV show; clicking the link for a selection kicks off the questionnaire on Tally. The site is destined to be embedded in an iframe in the main Observatoire des Imaginaires web site.

Movies:

@@ -99,16 +95,19 @@ https://observatoire-des-imaginaires.observablehq.cloud/questionnaire

[Install pre-commit](https://pre-commit.com/)

```
pre-commit run --all-files
```

## Use Tox to test your code

```
tox -vv
```

## Tasks

This repo includes invoke for pythonic task execution. To see the list of available tasks you can run:

```bash
invoke -l
```

@@ -124,18 +123,16 @@ invoke dev

### Updating the Movie Dataset

The [French regional TMDB Movies Dataset](https://huggingface.co/datasets/DataForGood/observatoire_des_imaginaires_movies) on Hugging Face can be updated using the following command:

```bash
invoke update-movies-dataset
```

### Updating the Series Dataset

The [French regional TMDB Series Dataset](https://huggingface.co/datasets/DataForGood/observatoire_des_imaginaires_series) on Hugging Face can be updated using the following command:

```bash
invoke update-series-dataset
```
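
Both dataset tasks are defined with [invoke](https://www.pyinvoke.org/). As a rough sketch of how such tasks could be wired up — the task bodies and module paths below are illustrative assumptions, not this repo's actual `tasks.py`:

```python
# tasks.py — illustrative sketch only; the real task bodies live in this repo
from invoke import task


@task
def update_movies_dataset(c):
    """Rebuild the French regional TMDB movies dataset on Hugging Face."""
    # invoke maps underscores to dashes on the CLI, so this runs as
    # `invoke update-movies-dataset`. The module path is a placeholder.
    c.run("python -m observatoire.tmdb.movies", echo=True)


@task
def update_series_dataset(c):
    """Rebuild the French regional TMDB series dataset on Hugging Face."""
    c.run("python -m observatoire.tmdb.series", echo=True)
```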
180 changes: 0 additions & 180 deletions notebooks/create_tmdb_subsets.ipynb

This file was deleted.

55 changes: 24 additions & 31 deletions observable/src/data/series.sqlite.py
```diff
@@ -3,42 +3,35 @@
 import tempfile
 from datetime import datetime
 
-import pandas as pd
-
-with tempfile.TemporaryDirectory() as temp_dir:
-    os.chdir(temp_dir)
-
-    os.system(
-        "kaggle datasets download -d asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows >&2",
-    )
-    os.system("unzip full-tmdb-tv-shows-dataset-2023-150k-shows.zip >&2")
-
-    df = pd.read_csv("TMDB_tv_dataset_v3.csv", parse_dates=["first_air_date"])
-
-    # Remove adult movies
-    df = df[df["adult"] == False]  # noqa: E712
-
-    # Remove documentaries
-    df = df[df["genres"].str.contains("Documentary") == False]  # noqa: E712
-
-    # Remove shows with a future first air date or no first air date
-    now = datetime.now()
-    df = df[df["first_air_date"] < now]
-
-    # Select the columns we want
-    df = df[["id", "name", "original_name", "poster_path"]]
-
-    # Set original name to blank string if same as name
-    df["original_name"] = df["original_name"].where(
-        df["name"] != df["original_name"],
-        "",
-    )
-
-    # Save the dataframe to a SQLite database
-    with tempfile.NamedTemporaryFile(suffix=".sqlite", delete=False) as temp_file:
-        temp_filename = temp_file.name
-    with sqlite3.connect(temp_filename) as conn:
-        df.to_sql("series", conn, index=False)
-
-    # Print db file to stdout
-    os.system(f"cat {temp_filename}")
+from observatoire.tmdb.series.hf import load_series_dataset
+
+# Load the dataset
+df = load_series_dataset()
+
+# Remove adult movies
+df = df[df["adult"] == False]  # noqa: E712
+
+# Remove documentaries
+df = df[df["genres"].str.contains("Documentary") == False]  # noqa: E712
+
+# Remove shows with a future first air date or no first air date
+now = datetime.now().strftime("%Y-%m-%d")
+df = df[df["first_air_date"] < now]
+
+# Select the columns we want
+df = df[["id", "name", "original_name", "poster_path"]]
+
+# Set original name to blank string if same as name
+df["original_name"] = df["original_name"].where(
+    df["name"] != df["original_name"],
+    "",
+)
+
+# Save the dataframe to a SQLite database
+with tempfile.NamedTemporaryFile(suffix=".sqlite", delete=False) as temp_file:
+    temp_filename = temp_file.name
+with sqlite3.connect(temp_filename) as conn:
+    df.to_sql("series", conn, index=False)
+
+# Print db file to stdout
+os.system(f"cat {temp_filename}")
```