better dataset description

renecotyfanboy · Jun 16, 2024 · b8435c1
1 parent a49d209
Showing 2 changed files with 56 additions and 17 deletions.
diff --git a/docs/dataset/cookbook.md b/docs/dataset/cookbook.md
@@ -2,32 +2,68 @@
 
 Here I'll provide a few examples of how to use the dataset using [`polars`](https://docs.pola.rs/).
 
-### Load the dataset
+## Load the dataset
 
-```python
-import polars as pl
+Load the dataset from [`huggingface`](https://huggingface.co/datasets/renecotyfanboy/leagueData) and display all the available columns.
 
-df = pl.read_csv("league_dataframe.csv")
+```python
+from datasets import load_dataset
 
-print(df)
+df = load_dataset("renecotyfanboy/leagueData", split="train").to_polars()
+print(df.columns)
 ```
 
 ## Find the history of a player
 
-``` python
+```python
 puuid = 'your_puuid' # (1)!
 historic_of_random_player = df.filter(
     puuid=puuid, is_in_reference_sample=True # (2)!
-    ).sort(by='gameStartTimestamp') 
-
+    ).sort(by='gameStartTimestamp')
 ```
 
 1. `b3fhGxFuV-hCD3B5Vvj9nrD--8YwlFACxvAIox_sOq2aNUtmkcsmem8NFufjdZd79L49I9spnh7LQg` is a valid `puuid`.
 2. `is_in_reference_sample=True` keeps only the match history that was collected initially. A player can also appear
 in other players' matches, and without this filter the history would include games that were not initially selected.
 
-## Build the win/loss curve
+## Lowest number of games
+Remake games were removed from the dataset, so some players have fewer than 100 games. The following computes the lowest number of games for a single player, which is 85.
+
+```python
+from datasets import load_dataset
+
+columns = ['elo', 'puuid', 'gameStartTimestamp', 'is_in_reference_sample', 'win']
+df = load_dataset("renecotyfanboy/leagueData", split="train").select_columns(columns).to_polars()
+df = df.filter(is_in_reference_sample=True)
+
+number_of_games = []
+
+for puuid in df['puuid'].unique():
+    player = df.filter(puuid=puuid)
+    number_of_games.append(len(player.sort(by='gameStartTimestamp')['win'].to_numpy()))
+
+min(number_of_games)
+```
+
+## History of the Gold III players
+
+Display the history of Gold III players in the dataset as an image.
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+from datasets import load_dataset
+
+columns = ['elo', 'puuid', 'gameStartTimestamp', 'is_in_reference_sample', 'win']
+df = load_dataset("renecotyfanboy/leagueData", split="train").select_columns(columns).to_polars()
+df = df.filter(elo="GOLD_III", is_in_reference_sample=True)
+
+history = []
+
+for puuid in df['puuid'].unique():
+    player = df.filter(puuid=puuid)
+    history.append(player.sort(by='gameStartTimestamp')['win'].to_numpy()[-85:])
+
+plt.matshow(np.asarray(history))
+```
 
-``` python
-import matplotlib.pyplot as plt
-```
diff --git a/docs/dataset/introduction.md b/docs/dataset/introduction.md
@@ -19,9 +19,12 @@ in SoloQ for each of these players.
 
 1. ![La source](https://risibank.fr/cache/medias/0/14/1420/142061/full.png){ align=left }
 
-Let's explore a bit our dataset. In the next plot, I show the winrate of players in each division. The winrate is 
-computed using the history list of each player.
+The following plot shows the winrate of players in each division.
 
-```plotly
-{"file_path": "dataset/assets/winrate_over_division.json"}
-```
+<div class="grid cards" markdown>
+
+-   <p style='text-align: center;'> **Winrate per division** </p>
+    ``` plotly
+    {"file_path": "dataset/assets/winrate_over_division.json"}
+    ```
+</div>