generated from dataforgoodfr/d4g-project-template
-
Notifications
You must be signed in to change notification settings - Fork 2
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'main' into site-observable
- Loading branch information
Showing
8 changed files
with
2,112 additions
and
126 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -160,4 +160,6 @@ dmypy.json | |
cython_debug/ | ||
|
||
# Precommit hooks: ruff cache | ||
.ruff_cache | ||
.ruff_cache | ||
|
||
data/ |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
download-tmdb-movies-dataset: | ||
mkdir data -p ; cd data ; kaggle datasets download -d asaniczka/tmdb-movies-dataset-2023-930k-movies | ||
|
||
download-full-tmdb-tv-shows-dataset: | ||
mkdir data -p ; cd data ; kaggle datasets download -d asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,53 +1,96 @@ | ||
Template DataForGood | ||
Observatoire des imagiaires | ||
================ | ||
|
||
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! --> | ||
|
||
This file will become your README and also the index of your | ||
documentation. | ||
## Installing with poetry | ||
|
||
# Contributing | ||
### Prerequisites: | ||
|
||
1. Python (≥ `3.10`) installed on your system. | ||
2. Ensure you have `poetry` installed. If not, you can install them using `pip`. | ||
|
||
## Use a venv | ||
```bash | ||
pip install poetry | ||
``` | ||
|
||
python3 -m venv name-of-your-venv | ||
### Steps: | ||
|
||
source name-of-your-venv/bin/activate | ||
1. **Clone the GitHub Repository:** | ||
|
||
Clone the GitHub repository you want to install locally using the `git clone` command. | ||
|
||
## Utiliser Poetry | ||
```bash | ||
git clone https://github.com/dataforgoodfr/12_observatoire_des_imaginaires.git | ||
``` | ||
|
||
[Installer Poetry](https://python-poetry.org/docs/): | ||
2. **Navigate to the Repository Directory:** | ||
|
||
python3 -m pip install "poetry==1.4.0" | ||
Use the `cd` command to navigate into the repository directory. | ||
|
||
Installer les dépendances: | ||
```bash | ||
cd 12_observatoire_des_imaginaires/ | ||
``` | ||
|
||
poetry install | ||
3. **Configure `poetry` to create a Virtual Environment inside the project:** | ||
|
||
Ajouter une dépendance: | ||
Ensure that poetry will create a `.venv` directory into the project with the command: | ||
|
||
poetry add pandas | ||
```bash | ||
poetry config virtualenvs.in-project true | ||
``` | ||
|
||
Mettre à jour les dépendances: | ||
4. **Install Project Dependencies using `poetry`:** | ||
|
||
poetry update | ||
Use `poetry` to install the project dependencies. | ||
|
||
## Utiliser Jupyter Notebook | ||
```bash | ||
poetry install | ||
``` | ||
|
||
jupyter notebook | ||
This will read the `pyproject.toml` file in the repository and install all the dependencies specified. | ||
|
||
and check your browser ! | ||
5. **Activate the Virtual Environment:** | ||
|
||
## Lancer les precommit-hook localement | ||
Activate the virtual environment to work within its isolated environment. | ||
|
||
[Installer les precommit](https://pre-commit.com/) | ||
On Unix or MacOS: | ||
|
||
```bash | ||
poetry shell | ||
``` | ||
|
||
6. **Run & edit notebooks**: | ||
|
||
```bash | ||
jupyter notebook | ||
``` | ||
|
||
## Download datasets from kaggle | ||
|
||
If you want to use kaggle to download datasets, please make sure to have api's credentials in ~/.kaggle/kaggle.json. | ||
|
||
How to get .json with kaggle api's credentials : [here](https://github.com/Kaggle/kaggle-api#api-credentials) | ||
|
||
Once you have setup your venv with poetry, you can use commands in the Makefile to download tmdb datasets from kaggle : | ||
|
||
```bash | ||
make download-tmdb-movies-dataset | ||
make download-full-tmdb-tv-shows-dataset | ||
``` | ||
|
||
|
||
Alternatively you can download directly the datasets from kaggle website : | ||
- [tmdb-movies-dataset](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies) | ||
- [full-tmdb-tv-shows-dataset](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows) | ||
|
||
## Run precommit-hook locally | ||
|
||
[Install precommits](https://pre-commit.com/) | ||
|
||
|
||
pre-commit run --all-files | ||
|
||
|
||
## Utiliser Tox pour tester votre code | ||
## Use Tox to test your code | ||
|
||
tox -vv |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,180 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "aa58c2c2", | ||
"metadata": {}, | ||
"source": [ | ||
"Before running this notebook, make sure to download datasets from kaggle using commands in Makefile :\n", | ||
"- download-tmdb-movies-dataset\n", | ||
"- download-full-tmdb-tv-shows-dataset\n", | ||
"\n", | ||
"You may need first to get your api's credentials for kaggle first : [here](https://github.com/Kaggle/kaggle-api#api-credentials)\n", | ||
"\n", | ||
"Alternatively you can download directly the datasets from kaggle website :\n", | ||
"- [tmdb-movies-dataset](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies)\n", | ||
"- [full-tmdb-tv-shows-dataset](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows)\n", | ||
"\n", | ||
"And put downloaded zip files into ./data folder and change KAGGLE_TMDB_MOVIES_DATASET_NAME & KAGGLE_TMDB_TVSHOWS_DATASET_NAME values if needed." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "22a2fc9f-6e22-40cc-aa37-e4039572e196", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"from zipfile import ZipFile\n", | ||
"from pathlib import Path\n", | ||
"\n", | ||
"import pandas as pd" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "f2b9b36b", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"KAGGLE_TMDB_MOVIES_DATASET_PATH = Path(\"../data/tmdb-movies-dataset-2023-930k-movies.zip\").resolve()\n", | ||
"KAGGLE_TMDB_TVSHOWS_DATASET_PATH = Path(\"../data/full-tmdb-tv-shows-dataset-2023-150k-shows.zip\").resolve()\n", | ||
"\n", | ||
"EXTRACT_MOVIES_ZIP_TO = Path(\"../data/tmdb_movies\").resolve()\n", | ||
"EXTRACT_TWSHOWS_ZIP_TO = Path(\"../data/tmdb_tvshows\").resolve()\n", | ||
"\n", | ||
"EXPORT_TMDB_SUBSETS_TO = Path(\"../data/tmdb_subsets\").resolve()\n", | ||
"\n", | ||
"MOVIES_COLUMNS_OF_INTEREST = ['title', 'original_title', 'release_date', 'production_countries', 'genres', 'production_companies']\n", | ||
"NB_MOVIES_SUBSET = 5000\n", | ||
"\n", | ||
"TVSHOWS_COLUMNS_OF_INTEREST = ['name', 'original_name', 'first_air_date', 'production_countries', 'genres', 'production_companies']\n", | ||
"NB_TVSHOWS_SUBSET = 5000" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "f35eec9c", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"EXTRACT_MOVIES_ZIP_TO.mkdir(exist_ok=True, parents=True)\n", | ||
"EXTRACT_TWSHOWS_ZIP_TO.mkdir(exist_ok=True, parents=True)\n", | ||
"EXPORT_TMDB_SUBSETS_TO.mkdir(exist_ok=True, parents=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "fc3989b0", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"with ZipFile(KAGGLE_TMDB_MOVIES_DATASET_PATH, 'r') as f:\n", | ||
" f.extractall(path=EXTRACT_MOVIES_ZIP_TO)\n", | ||
"\n", | ||
"df_movies = pd.read_csv(EXTRACT_MOVIES_ZIP_TO / os.listdir(EXTRACT_MOVIES_ZIP_TO)[0])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 13, | ||
"id": "bf19fe99", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"with ZipFile(KAGGLE_TMDB_TVSHOWS_DATASET_PATH, 'r') as f:\n", | ||
" f.extractall(path=EXTRACT_TWSHOWS_ZIP_TO)\n", | ||
"\n", | ||
"df_tvshows = pd.read_csv(EXTRACT_TWSHOWS_ZIP_TO / os.listdir(EXTRACT_TWSHOWS_ZIP_TO)[0])" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 26, | ||
"id": "ad152395", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"6497 movies are note uniquely identify by original_title & release_year on 797541 movies (0.81%)\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"df = df_movies[(df_movies['status'] == 'Released') & \n", | ||
" (~df_movies['adult']) &\n", | ||
" (~df_movies['release_date'].isna())].copy()\n", | ||
"\n", | ||
"df['release_year'] = df['release_date'].apply(lambda date : date[0:4]).astype(int)\n", | ||
"\n", | ||
"df_by_title_year = df.groupby(by=['original_title', 'release_year']).id.count()\n", | ||
"\n", | ||
"nb_duplicates_title_year = df_by_title_year[df_by_title_year > 1].shape[0]\n", | ||
"nb_total_movies = df.shape[0]\n", | ||
"print(f\"{nb_duplicates_title_year} movies are note uniquely identify by original_title & release_year on {nb_total_movies} movies ({(100 * nb_duplicates_title_year / nb_total_movies):.2f}%)\")\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "7429796d", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"df_movies_subset = df_movies.dropna(axis=0, how='any', subset=MOVIES_COLUMNS_OF_INTEREST)\n", | ||
"\n", | ||
"df_movies_subset = df_movies_subset[(df_movies_subset['status'] == 'Released') & \n", | ||
" (~df_movies_subset['adult']) &\n", | ||
" (df_movies_subset['release_date'] < '2024-03-01') &\n", | ||
" (df_movies_subset['original_language'].isin(['fr', 'en']))].sort_values(by='release_date', ascending=False).iloc[0:NB_MOVIES_SUBSET]\n", | ||
" \n", | ||
"df_movies_subset.to_csv(EXPORT_TMDB_SUBSETS_TO / \"tmdb_movies_subset.csv\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "c3a4ecc2", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"df_tvshows_subset = df_tvshows.dropna(axis=0, how='any', subset=TVSHOWS_COLUMNS_OF_INTEREST)\n", | ||
"\n", | ||
"df_tvshows_subset = df_tvshows_subset[(df_tvshows_subset['status'] == 'Ended') & \n", | ||
" (~df_tvshows_subset['adult']) &\n", | ||
" (df_tvshows_subset['last_air_date'] < '2024-03-01') &\n", | ||
" (df_tvshows_subset['original_language'].isin(['fr', 'en']))].sort_values(by='last_air_date', ascending=False).iloc[0:NB_TVSHOWS_SUBSET]\n", | ||
" \n", | ||
"df_tvshows_subset.to_csv(EXPORT_TMDB_SUBSETS_TO / \"tmdb_tvshows_subset.csv\")" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
File renamed without changes.
Oops, something went wrong.