Merge branch 'main' into site-observable

dataforgoodfr · Mar 21, 2024 · 9012829
2 parents fba081f + 237c43b
commit 9012829

Showing 8 changed files with 2,112 additions and 126 deletions.
diff --git a/.gitignore b/.gitignore
@@ -160,4 +160,6 @@ dmypy.json
 cython_debug/
 
 # Precommit hooks: ruff cache
-.ruff_cache
+.ruff_cache
+
+data/
diff --git a/Makefile b/Makefile
@@ -0,0 +1,5 @@
+download-tmdb-movies-dataset:
+	mkdir -p data && cd data && kaggle datasets download -d asaniczka/tmdb-movies-dataset-2023-930k-movies
+
+download-full-tmdb-tv-shows-dataset:
+	mkdir -p data && cd data && kaggle datasets download -d asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows
diff --git a/README.md b/README.md
@@ -1,53 +1,96 @@
-Template DataForGood
+Observatoire des imaginaires
 ================
 
-<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
 
-This file will become your README and also the index of your
-documentation.
+## Installing with poetry
 
-# Contributing
+### Prerequisites:
 
+1. Python (≥ `3.10`) installed on your system.
+2. Ensure you have `poetry` installed. If not, you can install it using `pip`:
 
-## Use a venv 
+```bash
+pip install poetry
+```
 
-    python3 -m venv name-of-your-venv
+### Steps:
 
-    source name-of-your-venv/bin/activate
+1. **Clone the GitHub Repository:**
 
+   Clone this repository to your machine using the `git clone` command.
 
-## Utiliser Poetry
+   ```bash
+   git clone https://github.com/dataforgoodfr/12_observatoire_des_imaginaires.git
+   ```
 
-[Installer Poetry](https://python-poetry.org/docs/):
+2. **Navigate to the Repository Directory:**
 
-    python3 -m pip install "poetry==1.4.0"
+   Use the `cd` command to navigate into the repository directory.
 
-Installer les dépendances:
+   ```bash
+   cd 12_observatoire_des_imaginaires/
+   ```
 
-    poetry install
+3. **Configure `poetry` to create a Virtual Environment inside the project:**
 
-Ajouter une dépendance:
+   Make `poetry` create the `.venv` directory inside the project with the command:
 
-    poetry add pandas
+   ```bash
+   poetry config virtualenvs.in-project true
+   ```
 
-Mettre à jour les dépendances:
+4. **Install Project Dependencies using `poetry`:**
 
-    poetry update
+   Use `poetry` to install the project dependencies.
 
-## Utiliser Jupyter Notebook
+   ```bash
+   poetry install
+   ```
 
-    jupyter notebook
+   This will read the `pyproject.toml` file in the repository and install all the dependencies specified.
 
-and check your browser !
+5. **Activate the Virtual Environment:**
 
-## Lancer les precommit-hook localement
+   Activate the virtual environment so that commands run isolated from your system Python.
 
-[Installer les precommit](https://pre-commit.com/)
+   On Unix or macOS:
+
+   ```bash
+   poetry shell
+   ```
+
+6. **Run & edit notebooks**:
+
+   ```bash
+   jupyter notebook
+   ```
+
+## Download datasets from Kaggle
+
+To download datasets with the Kaggle CLI, make sure your API credentials are stored in `~/.kaggle/kaggle.json`.
+
+How to get the `.json` file with your Kaggle API credentials: [here](https://github.com/Kaggle/kaggle-api#api-credentials)
+
+Once you have set up your venv with poetry, you can use the Makefile targets to download the TMDB datasets from Kaggle:
+
+```bash
+make download-tmdb-movies-dataset
+make download-full-tmdb-tv-shows-dataset
+```
+
+
+Alternatively, you can download the datasets directly from the Kaggle website:
+- [tmdb-movies-dataset](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies)
+- [full-tmdb-tv-shows-dataset](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows)
+
+## Run pre-commit hooks locally
+
+[Install pre-commit](https://pre-commit.com/)
 
 
     pre-commit run --all-files 
 
 
-## Utiliser Tox pour tester votre code
+## Use Tox to test your code
 
     tox -vv
diff --git a/notebooks/create_tmdb_subsets.ipynb b/notebooks/create_tmdb_subsets.ipynb
@@ -0,0 +1,180 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "aa58c2c2",
+   "metadata": {},
+   "source": [
+    "Before running this notebook, download the datasets from Kaggle using these Makefile targets:\n",
+    "- download-tmdb-movies-dataset\n",
+    "- download-full-tmdb-tv-shows-dataset\n",
+    "\n",
+    "You may first need to get your Kaggle API credentials: [here](https://github.com/Kaggle/kaggle-api#api-credentials)\n",
+    "\n",
+    "Alternatively, you can download the datasets directly from the Kaggle website:\n",
+    "- [tmdb-movies-dataset](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies)\n",
+    "- [full-tmdb-tv-shows-dataset](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows)\n",
+    "\n",
+    "Then put the downloaded zip files into the ./data folder and change the KAGGLE_TMDB_MOVIES_DATASET_PATH & KAGGLE_TMDB_TVSHOWS_DATASET_PATH values if needed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "22a2fc9f-6e22-40cc-aa37-e4039572e196",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from zipfile import ZipFile\n",
+    "from pathlib import Path\n",
+    "\n",
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "f2b9b36b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "KAGGLE_TMDB_MOVIES_DATASET_PATH = Path(\"../data/tmdb-movies-dataset-2023-930k-movies.zip\").resolve()\n",
+    "KAGGLE_TMDB_TVSHOWS_DATASET_PATH = Path(\"../data/full-tmdb-tv-shows-dataset-2023-150k-shows.zip\").resolve()\n",
+    "\n",
+    "EXTRACT_MOVIES_ZIP_TO = Path(\"../data/tmdb_movies\").resolve()\n",
+    "EXTRACT_TWSHOWS_ZIP_TO = Path(\"../data/tmdb_tvshows\").resolve()\n",
+    "\n",
+    "EXPORT_TMDB_SUBSETS_TO = Path(\"../data/tmdb_subsets\").resolve()\n",
+    "\n",
+    "MOVIES_COLUMNS_OF_INTEREST = ['title', 'original_title', 'release_date', 'production_countries', 'genres', 'production_companies']\n",
+    "NB_MOVIES_SUBSET = 5000\n",
+    "\n",
+    "TVSHOWS_COLUMNS_OF_INTEREST = ['name', 'original_name', 'first_air_date', 'production_countries', 'genres', 'production_companies']\n",
+    "NB_TVSHOWS_SUBSET = 5000"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "f35eec9c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "EXTRACT_MOVIES_ZIP_TO.mkdir(exist_ok=True, parents=True)\n",
+    "EXTRACT_TWSHOWS_ZIP_TO.mkdir(exist_ok=True, parents=True)\n",
+    "EXPORT_TMDB_SUBSETS_TO.mkdir(exist_ok=True, parents=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "fc3989b0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with ZipFile(KAGGLE_TMDB_MOVIES_DATASET_PATH, 'r') as f:\n",
+    "    f.extractall(path=EXTRACT_MOVIES_ZIP_TO)\n",
+    "\n",
+    "df_movies = pd.read_csv(EXTRACT_MOVIES_ZIP_TO / os.listdir(EXTRACT_MOVIES_ZIP_TO)[0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "bf19fe99",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with ZipFile(KAGGLE_TMDB_TVSHOWS_DATASET_PATH, 'r') as f:\n",
+    "    f.extractall(path=EXTRACT_TWSHOWS_ZIP_TO)\n",
+    "\n",
+    "df_tvshows = pd.read_csv(EXTRACT_TWSHOWS_ZIP_TO / os.listdir(EXTRACT_TWSHOWS_ZIP_TO)[0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "ad152395",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "6497 movies are not uniquely identified by original_title & release_year out of 797541 movies (0.81%)\n"
+     ]
+    }
+   ],
+   "source": [
+    "df = df_movies[(df_movies['status'] == 'Released') & \n",
+    "               (~df_movies['adult']) &\n",
+    "               (~df_movies['release_date'].isna())].copy()\n",
+    "\n",
+    "df['release_year'] = df['release_date'].apply(lambda date : date[0:4]).astype(int)\n",
+    "\n",
+    "df_by_title_year = df.groupby(by=['original_title', 'release_year']).id.count()\n",
+    "\n",
+    "nb_duplicates_title_year = df_by_title_year[df_by_title_year > 1].shape[0]\n",
+    "nb_total_movies = df.shape[0]\n",
+    "print(f\"{nb_duplicates_title_year} movies are not uniquely identified by original_title & release_year out of {nb_total_movies} movies ({(100 * nb_duplicates_title_year / nb_total_movies):.2f}%)\")\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "7429796d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_movies_subset = df_movies.dropna(axis=0, how='any', subset=MOVIES_COLUMNS_OF_INTEREST)\n",
+    "\n",
+    "df_movies_subset = df_movies_subset[(df_movies_subset['status'] == 'Released') & \n",
+    "                                    (~df_movies_subset['adult']) &\n",
+    "                                    (df_movies_subset['release_date'] < '2024-03-01') &\n",
+    "                                    (df_movies_subset['original_language'].isin(['fr', 'en']))].sort_values(by='release_date', ascending=False).iloc[0:NB_MOVIES_SUBSET]\n",
+    "          \n",
+    "df_movies_subset.to_csv(EXPORT_TMDB_SUBSETS_TO / \"tmdb_movies_subset.csv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "c3a4ecc2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_tvshows_subset = df_tvshows.dropna(axis=0, how='any', subset=TVSHOWS_COLUMNS_OF_INTEREST)\n",
+    "\n",
+    "df_tvshows_subset = df_tvshows_subset[(df_tvshows_subset['status'] == 'Ended') & \n",
+    "                                      (~df_tvshows_subset['adult']) &\n",
+    "                                      (df_tvshows_subset['last_air_date'] < '2024-03-01') &\n",
+    "                                      (df_tvshows_subset['original_language'].isin(['fr', 'en']))].sort_values(by='last_air_date', ascending=False).iloc[0:NB_TVSHOWS_SUBSET]\n",
+    "          \n",
+    "df_tvshows_subset.to_csv(EXPORT_TMDB_SUBSETS_TO / \"tmdb_tvshows_subset.csv\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
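
Note on `notebooks/create_tmdb_subsets.ipynb`: the notebook picks the CSV with `os.listdir(...)[0]`, which depends on directory listing order, and its duplicate check counts `(original_title, release_year)` groups rather than the movies inside them. The following is a minimal sketch of one possible alternative, not the author's method. It assumes each Kaggle archive contains exactly one CSV file (not verified here); the helper name `read_single_csv_from_zip` and the sample rows are hypothetical and only serve the illustration.

```python
import io
import zipfile

import pandas as pd


def read_single_csv_from_zip(zip_source):
    """Load the only .csv member of a zip archive (path or file-like object)."""
    with zipfile.ZipFile(zip_source) as archive:
        csv_members = [
            name for name in archive.namelist() if name.lower().endswith(".csv")
        ]
        # Fail loudly instead of silently picking whichever file comes first.
        if len(csv_members) != 1:
            raise ValueError(f"expected exactly one CSV, found {csv_members}")
        with archive.open(csv_members[0]) as handle:
            return pd.read_csv(handle)


# Tiny in-memory archive standing in for a Kaggle download (made-up rows).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("README.txt", "not a csv")
    archive.writestr(
        "movies.csv",
        "id,original_title,release_date\n"
        "1,Alpha,2023-05-01\n"
        "2,Alpha,2023-09-15\n"
        "3,Beta,2021-11-30\n",
    )
buffer.seek(0)

demo_df = read_single_csv_from_zip(buffer)

# Same year extraction as the notebook, written with the vectorised .str accessor.
demo_df["release_year"] = demo_df["release_date"].str[:4].astype(int)

# Count rows (movies), not groups, that share a (title, year) key.
rows_sharing_key = int(
    demo_df.duplicated(subset=["original_title", "release_year"], keep=False).sum()
)
```

Reading straight from the archive also avoids extracting it to disk; whether that is preferable for the full-size datasets depends on available memory.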
diff --git a/python_template/__init__.py → observatoire/__init__.py b/python_template/__init__.py → observatoire/__init__.py