Merge branch 'main' into site-observable
kaaloo committed Mar 21, 2024
2 parents fba081f + 237c43b commit 9012829
Showing 8 changed files with 2,112 additions and 126 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -160,4 +160,6 @@ dmypy.json
cython_debug/

# Precommit hooks: ruff cache
.ruff_cache

data/
5 changes: 5 additions & 0 deletions Makefile
@@ -0,0 +1,5 @@
download-tmdb-movies-dataset:
	mkdir -p data ; cd data ; kaggle datasets download -d asaniczka/tmdb-movies-dataset-2023-930k-movies

download-full-tmdb-tv-shows-dataset:
	mkdir -p data ; cd data ; kaggle datasets download -d asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows
89 changes: 66 additions & 23 deletions README.md
@@ -1,53 +1,96 @@
Observatoire des imaginaires
================

## Installing with poetry

### Prerequisites:

1. Python (≥ `3.10`) installed on your system.
2. `poetry` installed. If it is not, you can install it using `pip`:

```bash
pip install poetry
```

### Steps:

1. **Clone the GitHub repository:**

Clone the repository locally using the `git clone` command.

```bash
git clone https://github.com/dataforgoodfr/12_observatoire_des_imaginaires.git
```

2. **Navigate to the repository directory:**

Use the `cd` command to move into the repository directory.

```bash
cd 12_observatoire_des_imaginaires/
```

3. **Configure `poetry` to create the virtual environment inside the project:**

Make sure poetry creates a `.venv` directory inside the project:

```bash
poetry config virtualenvs.in-project true
```

4. **Install the project dependencies using `poetry`:**

```bash
poetry install
```

This reads the `pyproject.toml` file in the repository and installs all the dependencies it specifies.

5. **Activate the virtual environment:**

On Unix or macOS:

```bash
poetry shell
```

6. **Run & edit notebooks:**

```bash
jupyter notebook
```

## Download datasets from Kaggle

If you want to use Kaggle to download datasets, make sure your API credentials are in `~/.kaggle/kaggle.json`.

See the [Kaggle API documentation](https://github.com/Kaggle/kaggle-api#api-credentials) for how to obtain this credentials file.

Once your venv is set up with poetry, you can use the Makefile targets to download the TMDB datasets from Kaggle:

```bash
make download-tmdb-movies-dataset
make download-full-tmdb-tv-shows-dataset
```

Alternatively, you can download the datasets directly from the Kaggle website:
- [tmdb-movies-dataset](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies)
- [full-tmdb-tv-shows-dataset](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows)

## Run precommit hooks locally

[Install pre-commit](https://pre-commit.com/), then run:

```bash
pre-commit run --all-files
```

## Use Tox to test your code

```bash
tox -vv
```
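As a quick sanity check before running the Makefile download targets, the Kaggle credentials file described above can be validated with a short Python sketch. The `kaggle_credentials_ok` helper is illustrative and not part of this repo; the default path and the `username`/`key` fields follow the Kaggle API docs.

```python
import json
from pathlib import Path


def kaggle_credentials_ok(path: Path = Path.home() / ".kaggle" / "kaggle.json") -> bool:
    """Return True if a Kaggle API credentials file with the expected keys exists."""
    if not path.is_file():
        return False
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    # The Kaggle CLI expects a JSON object with "username" and "key" fields.
    return isinstance(creds, dict) and {"username", "key"} <= set(creds)
```

If this returns `False`, `make download-tmdb-movies-dataset` will fail with an authentication error, so it is worth checking first.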
180 changes: 180 additions & 0 deletions notebooks/create_tmdb_subsets.ipynb
@@ -0,0 +1,180 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "aa58c2c2",
"metadata": {},
"source": [
"Before running this notebook, make sure to download the datasets from Kaggle using the Makefile targets:\n",
"- download-tmdb-movies-dataset\n",
"- download-full-tmdb-tv-shows-dataset\n",
"\n",
"You may first need to get your Kaggle API credentials: [here](https://github.com/Kaggle/kaggle-api#api-credentials)\n",
"\n",
"Alternatively, you can download the datasets directly from the Kaggle website:\n",
"- [tmdb-movies-dataset](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies)\n",
"- [full-tmdb-tv-shows-dataset](https://www.kaggle.com/datasets/asaniczka/full-tmdb-tv-shows-dataset-2023-150k-shows)\n",
"\n",
"Then put the downloaded zip files into the ./data folder and adjust the KAGGLE_TMDB_MOVIES_DATASET_PATH & KAGGLE_TMDB_TVSHOWS_DATASET_PATH values if needed."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "22a2fc9f-6e22-40cc-aa37-e4039572e196",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from zipfile import ZipFile\n",
"from pathlib import Path\n",
"\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "f2b9b36b",
"metadata": {},
"outputs": [],
"source": [
"KAGGLE_TMDB_MOVIES_DATASET_PATH = Path(\"../data/tmdb-movies-dataset-2023-930k-movies.zip\").resolve()\n",
"KAGGLE_TMDB_TVSHOWS_DATASET_PATH = Path(\"../data/full-tmdb-tv-shows-dataset-2023-150k-shows.zip\").resolve()\n",
"\n",
"EXTRACT_MOVIES_ZIP_TO = Path(\"../data/tmdb_movies\").resolve()\n",
"EXTRACT_TWSHOWS_ZIP_TO = Path(\"../data/tmdb_tvshows\").resolve()\n",
"\n",
"EXPORT_TMDB_SUBSETS_TO = Path(\"../data/tmdb_subsets\").resolve()\n",
"\n",
"MOVIES_COLUMNS_OF_INTEREST = ['title', 'original_title', 'release_date', 'production_countries', 'genres', 'production_companies']\n",
"NB_MOVIES_SUBSET = 5000\n",
"\n",
"TVSHOWS_COLUMNS_OF_INTEREST = ['name', 'original_name', 'first_air_date', 'production_countries', 'genres', 'production_companies']\n",
"NB_TVSHOWS_SUBSET = 5000"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f35eec9c",
"metadata": {},
"outputs": [],
"source": [
"EXTRACT_MOVIES_ZIP_TO.mkdir(exist_ok=True, parents=True)\n",
"EXTRACT_TWSHOWS_ZIP_TO.mkdir(exist_ok=True, parents=True)\n",
"EXPORT_TMDB_SUBSETS_TO.mkdir(exist_ok=True, parents=True)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "fc3989b0",
"metadata": {},
"outputs": [],
"source": [
"with ZipFile(KAGGLE_TMDB_MOVIES_DATASET_PATH, 'r') as f:\n",
" f.extractall(path=EXTRACT_MOVIES_ZIP_TO)\n",
"\n",
"df_movies = pd.read_csv(EXTRACT_MOVIES_ZIP_TO / os.listdir(EXTRACT_MOVIES_ZIP_TO)[0])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "bf19fe99",
"metadata": {},
"outputs": [],
"source": [
"with ZipFile(KAGGLE_TMDB_TVSHOWS_DATASET_PATH, 'r') as f:\n",
" f.extractall(path=EXTRACT_TWSHOWS_ZIP_TO)\n",
"\n",
"df_tvshows = pd.read_csv(EXTRACT_TWSHOWS_ZIP_TO / os.listdir(EXTRACT_TWSHOWS_ZIP_TO)[0])"
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "ad152395",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"6497 movies are not uniquely identified by original_title & release_year out of 797541 movies (0.81%)\n"
]
}
],
"source": [
"df = df_movies[(df_movies['status'] == 'Released') & \n",
" (~df_movies['adult']) &\n",
" (~df_movies['release_date'].isna())].copy()\n",
"\n",
"df['release_year'] = df['release_date'].apply(lambda date : date[0:4]).astype(int)\n",
"\n",
"df_by_title_year = df.groupby(by=['original_title', 'release_year']).id.count()\n",
"\n",
"nb_duplicates_title_year = df_by_title_year[df_by_title_year > 1].shape[0]\n",
"nb_total_movies = df.shape[0]\n",
"print(f\"{nb_duplicates_title_year} movies are not uniquely identified by original_title & release_year out of {nb_total_movies} movies ({(100 * nb_duplicates_title_year / nb_total_movies):.2f}%)\")\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "7429796d",
"metadata": {},
"outputs": [],
"source": [
"df_movies_subset = df_movies.dropna(axis=0, how='any', subset=MOVIES_COLUMNS_OF_INTEREST)\n",
"\n",
"df_movies_subset = df_movies_subset[(df_movies_subset['status'] == 'Released') & \n",
" (~df_movies_subset['adult']) &\n",
" (df_movies_subset['release_date'] < '2024-03-01') &\n",
" (df_movies_subset['original_language'].isin(['fr', 'en']))].sort_values(by='release_date', ascending=False).iloc[0:NB_MOVIES_SUBSET]\n",
" \n",
"df_movies_subset.to_csv(EXPORT_TMDB_SUBSETS_TO / \"tmdb_movies_subset.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "c3a4ecc2",
"metadata": {},
"outputs": [],
"source": [
"df_tvshows_subset = df_tvshows.dropna(axis=0, how='any', subset=TVSHOWS_COLUMNS_OF_INTEREST)\n",
"\n",
"df_tvshows_subset = df_tvshows_subset[(df_tvshows_subset['status'] == 'Ended') & \n",
" (~df_tvshows_subset['adult']) &\n",
" (df_tvshows_subset['last_air_date'] < '2024-03-01') &\n",
" (df_tvshows_subset['original_language'].isin(['fr', 'en']))].sort_values(by='last_air_date', ascending=False).iloc[0:NB_TVSHOWS_SUBSET]\n",
" \n",
"df_tvshows_subset.to_csv(EXPORT_TMDB_SUBSETS_TO / \"tmdb_tvshows_subset.csv\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
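The duplicate-count logic in the notebook above (filter released, non-adult movies, then count `(original_title, release_year)` pairs shared by several rows) can be sketched on a toy DataFrame. The rows below are invented for illustration; only the column names match the TMDB dataset.

```python
import pandas as pd

# Toy stand-in for the TMDB movies dump (invented rows, real column names).
movies = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "original_title": ["Alpha", "Alpha", "Beta", "Gamma"],
    "release_date": ["2001-05-01", "2001-06-01", "1999-01-01", "2010-01-01"],
    "status": ["Released", "Released", "Released", "Released"],
    "adult": [False, False, False, True],
})

# Same filter as the notebook: released, non-adult, with a release date.
df = movies[(movies["status"] == "Released")
            & (~movies["adult"])
            & movies["release_date"].notna()].copy()
df["release_year"] = df["release_date"].str[:4].astype(int)

# Count how many (original_title, release_year) pairs cover more than one movie.
counts = df.groupby(["original_title", "release_year"])["id"].count()
nb_duplicates = int((counts > 1).sum())
print(f"{nb_duplicates} title/year pair(s) are shared by several movies")  # → 1
```

Here the two "Alpha" releases in 2001 collide, which is exactly the ~0.81% ambiguity the notebook measures on the full dataset.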
File renamed without changes.
