Merge pull request #64 from dataforgoodfr/update-series-data-loader
Update series data loader
kaaloo authored May 9, 2024
2 parents 6a7c34d + 8c41632 commit eb290d3
Showing 5 changed files with 44 additions and 323 deletions.
41 changes: 19 additions & 22 deletions README.md
@@ -29,15 +29,15 @@ pip install poetry
cd 12_observatoire_des_imaginaires/
```

3. **Configure** `poetry` to create a Virtual Environment inside the project:

Ensure that poetry creates the `.venv` directory inside the project with the command:

```bash
poetry config virtualenvs.in-project true
```

4. **Install Project Dependencies using** `poetry`:

Use `poetry` to install the project dependencies.

@@ -67,20 +67,16 @@ pip install poetry

This code base reads its configuration from a `.env` file in the root directory.

| Variable         | Description                                                          | Default Value |
| ---------------- | -------------------------------------------------------------------- | ------------- |
| HF_TOKEN         | Hugging Face API Token. You must have write access to the datasets.  | N/A           |
| TMDB_API_KEY     | TMDB API Token.                                                       | N/A           |
| TMDB_BATCH_SIZE  | Number of TMDB entries to download before updating a HF dataset.     | 10000         |
| TMDB_MAX_RETRIES | Maximum number of times to retry a failed TMDB API call.             | 500           |
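
For reference, a minimal `.env` might look like the following; every value below is a placeholder, not a working credential:

```bash
# .env — placeholder values only, substitute your own credentials
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
TMDB_API_KEY=your_tmdb_api_key
TMDB_BATCH_SIZE=10000
TMDB_MAX_RETRIES=500
```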

## Website to select a specific movie or TV show

The [observable](https://github.com/dataforgoodfr/12_observatoire_des_imaginaires/tree/main/site-observable) directory contains an Observable Framework site that collects film and TV data from the Hugging Face datasets above and filters them according to the following rules, in order to reduce the size of the data shipped with the generated web site. The site provides a search UI that allows a user to select a specific movie or TV show; clicking the link for a selection kicks off the questionnaire on Tally. The site is destined to be embedded in an iframe in the main Observatoire des Imaginaires web site.

Movies:

@@ -99,16 +95,19 @@ https://observatoire-des-imaginaires.observablehq.cloud/questionnaire

[Install pre-commit](https://pre-commit.com/)

```
pre-commit run --all-files
```

## Use Tox to test your code

```
tox -vv
```

## Tasks

This repo includes invoke for pythonic task execution. To see the list of available tasks you can run:

```bash
invoke -l
```

@@ -124,18 +123,16 @@ invoke dev

### Updating the Movie Dataset

The [French regional TMDB Movies Dataset](https://huggingface.co/datasets/DataForGood/observatoire_des_imaginaires_movies) on Hugging Face can be updated using the following command:

```bash
invoke update-movies-dataset
```

### Updating the Series Dataset

The [French regional TMDB Series Dataset](https://huggingface.co/datasets/DataForGood/observatoire_des_imaginaires_series) on Hugging Face can be updated using the following command:

```bash
invoke update-series-dataset
```
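
Both dataset tasks are defined with [invoke](https://www.pyinvoke.org/). As a rough sketch of how such tasks could be wired up — the task bodies and module paths below are illustrative assumptions, not this repo's actual `tasks.py`:

```python
# tasks.py — illustrative sketch only; the real task bodies live in this repo
from invoke import task


@task
def update_movies_dataset(c):
    """Rebuild the French regional TMDB movies dataset on Hugging Face."""
    # invoke maps underscores to dashes on the CLI, so this runs as
    # `invoke update-movies-dataset`. The module path is a placeholder.
    c.run("python -m observatoire.tmdb.movies", echo=True)


@task
def update_series_dataset(c):
    """Rebuild the French regional TMDB series dataset on Hugging Face."""
    c.run("python -m observatoire.tmdb.series", echo=True)
```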
180 changes: 0 additions & 180 deletions notebooks/create_tmdb_subsets.ipynb

This file was deleted.

55 changes: 24 additions & 31 deletions observable/src/data/series.sqlite.py
```diff
@@ -3,42 +3,35 @@
 import tempfile
 from datetime import datetime
 
-import pandas as pd
-
-with tempfile.TemporaryDirectory() as temp_dir:
-    os.chdir(temp_dir)
-
-    os.system(
-        "kaggle datasets download -d asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows >&2",
-    )
-    os.system("unzip full-tmdb-tv-shows-dataset-2023-150k-shows.zip >&2")
-
-    df = pd.read_csv("TMDB_tv_dataset_v3.csv", parse_dates=["first_air_date"])
-
-    # Remove adult movies
-    df = df[df["adult"] == False]  # noqa: E712
-
-    # Remove documentaries
-    df = df[df["genres"].str.contains("Documentary") == False]  # noqa: E712
-
-    # Remove shows with a future first air date or no first air date
-    now = datetime.now()
-    df = df[df["first_air_date"] < now]
-
-    # Select the columns we want
-    df = df[["id", "name", "original_name", "poster_path"]]
-
-    # Set original name to blank string if same as name
-    df["original_name"] = df["original_name"].where(
-        df["name"] != df["original_name"],
-        "",
-    )
-
-    # Save the dataframe to a SQLite database
-    with tempfile.NamedTemporaryFile(suffix=".sqlite", delete=False) as temp_file:
-        temp_filename = temp_file.name
-    with sqlite3.connect(temp_filename) as conn:
-        df.to_sql("series", conn, index=False)
-
-    # Print db file to stdout
-    os.system(f"cat {temp_filename}")
+from observatoire.tmdb.series.hf import load_series_dataset
+
+# Load the dataset
+df = load_series_dataset()
+
+# Remove adult movies
+df = df[df["adult"] == False]  # noqa: E712
+
+# Remove documentaries
+df = df[df["genres"].str.contains("Documentary") == False]  # noqa: E712
+
+# Remove shows with a future first air date or no first air date
+now = datetime.now().strftime("%Y-%m-%d")
+df = df[df["first_air_date"] < now]
+
+# Select the columns we want
+df = df[["id", "name", "original_name", "poster_path"]]
+
+# Set original name to blank string if same as name
+df["original_name"] = df["original_name"].where(
+    df["name"] != df["original_name"],
+    "",
+)
+
+# Save the dataframe to a SQLite database
+with tempfile.NamedTemporaryFile(suffix=".sqlite", delete=False) as temp_file:
+    temp_filename = temp_file.name
+with sqlite3.connect(temp_filename) as conn:
+    df.to_sql("series", conn, index=False)
+
+# Print db file to stdout
+os.system(f"cat {temp_filename}")
```