This repository contains code and resources for my final thesis, "Application of Graph Neural Networks to Music Recommender Systems." Recommender Systems (RSs) play a crucial role in filtering vast amounts of data to deliver personalized content. Music Recommender Systems (MuRSs) enhance the user experience by predicting preferences and helping users navigate extensive music libraries. Recent advancements in Graph Neural Networks (GNNs) have set new standards in RSs, but their evaluation remains inconsistent across datasets and splitting strategies. This work applies traditional and GNN-based models to a new music industry dataset, using temporal data splitting for a realistic evaluation. To this end, the recent evaluation pipeline proposed by Malitesta et al. (2024) has been applied and extended to a broad set of models and beyond-accuracy metrics. Code and results are available in this repository.
MIDS (Music Industry Dataset)
- # of customers (users): 58.747
- # of records (items): 37.370
dataset | # rows (users, items) | sparsity | features |
---|---|---|---|
MIDS (filtered) | 17.665.904 (58.747, 37.370) | 99.1953 % | userID, itemID, timestamp |
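The reported sparsity follows directly from these numbers:

$$\mathrm{sparsity} = 1 - \frac{\#\text{rows}}{\#\text{users} \cdot \#\text{items}} = 1 - \frac{17\,665\,904}{58\,747 \cdot 37\,370} \approx 99.1953\,\%$$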
The evaluation pipeline consists of the following steps:
- Create data splits
- Calculate dataset characteristics (classical & topological)
- Apply traditional and GNN-based models to each split (for the random order (RO) and temporal order (TO) splits)
- Apply an explanatory model (linear regression)

All tests have been conducted using RecBole and RecBole-GNN.
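As a minimal sketch of how a single RecBole run can be launched (the same mechanism is exposed by `quick_start.py`, described below; the model, dataset, and config file names here are placeholders, not the exact values used in the thesis):

```python
from recbole.quick_start import run_recbole

# Minimal RecBole invocation; GNN-based models are provided by RecBole-GNN and
# are started analogously through this repository's quick_start.py.
run_recbole(
    model='BPR',                            # any model name supported by RecBole
    dataset='mids-100000',                  # dataset folder containing the atomic files
    config_file_list=['quick_start.yaml'],  # data, split, and evaluation settings
)
```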
After hyperparameter and epoch tuning, Top-10 recommendations were generated for all users on each dataset, using a random order split (70/10/20) and a temporal order split with leave-5-out (5/5) for the validation and test sets. The following tables report the mean performance across all datasets for each model, ranked in descending order by NDCG@10 (Pre = Precision, MRR = Mean Reciprocal Rank, NDCG = Normalized DCG, IC = Item Coverage, ARP = Average Recommendation Popularity, APLT = Average Percentage of Long-Tail items; all metrics at cutoff 10). The best values are bolded, while the second-highest values are underlined.
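How this protocol maps onto RecBole settings, as a hedged sketch (only the random order 70/10/20 split is shown; the metric names are RecBole's, where `ItemCoverage`, `AveragePopularity`, and `TailPercentage` are the counterparts of IC, ARP, and APLT; the exact values used in this work live in `quick_start.yaml` and the `config` directory):

```python
# Hypothetical RecBole settings mirroring the described protocol; they can be
# passed to run_recbole(...) via config_dict or set in quick_start.yaml.
eval_config = {
    'eval_args': {
        'split': {'RS': [0.7, 0.1, 0.2]},  # random 70/10/20 train/valid/test split
        'order': 'RO',                     # random ordering ('TO' for temporal order)
        'group_by': 'user',
        'mode': 'full',
    },
    'topk': 10,                            # Top-10 recommendation lists
    'metrics': ['Precision', 'MRR', 'NDCG',
                'ItemCoverage', 'AveragePopularity', 'TailPercentage'],
    'valid_metric': 'NDCG@10',
}
```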
Random order (RO) split:

Algorithm | Pre | MRR | NDCG | IC | ARP | APLT |
---|---|---|---|---|---|---|
ALS-MF | 0.152966 | 0.329359 | 0.198749 | 0.075893 | 57.758945 | 0.000234 |
XSimGCL | 0.150149 | 0.329086 | 0.194740 | 0.120137 | 51.662866 | 0.010322 |
AsymUserkNN | 0.145127 | 0.328943 | 0.190405 | 0.108999 | 74.511185 | 0.015634 |
SGL | 0.147338 | 0.319677 | 0.190309 | 0.144644 | 44.765344 | 0.011939 |
BPR | 0.138416 | 0.310411 | 0.178590 | 0.080508 | 73.938899 | 0.001084 |
LightGCN | 0.134315 | 0.302034 | 0.173993 | 0.115027 | 60.163477 | 0.004301 |
UltraGCN | 0.133322 | 0.292251 | 0.172359 | 0.084255 | 70.992891 | 0.001501 |
AsymItemkNN | 0.124047 | 0.193475 | 0.137902 | 0.130814 | 46.564064 | 0.051082 |
MostPop | 0.039420 | 0.096802 | 0.048425 | 0.001969 | 120.488178 | 0.000000 |
Temporal order (TO) split:

Algorithm | Pre | MRR | NDCG | IC | ARP | APLT |
---|---|---|---|---|---|---|
ALS-MF | 0.032978 | 0.079593 | 0.065587 | 0.089447 | 71.046930 | 0.000184 |
UltraGCN | 0.031105 | 0.079218 | 0.061834 | 0.068403 | 96.395474 | 0.000218 |
XSimGCL | 0.030875 | 0.077371 | 0.061724 | 0.098007 | 71.387480 | 0.006263 |
AsymUserkNN | 0.030218 | 0.074655 | 0.190405 | 0.105359 | 96.254631 | 0.009973 |
The explanatory model (a linear regression) relates different dataset characteristics to the observed performance, where:

$X_c \equiv \{SpaceSize,\ Shape,\ Density,\ Gini_U,\ Gini_I,\ AvgDeg_U,\ AvgDeg_I,\ AvgClustC_U,\ AvgClustC_I,\ Assort_U,\ Assort_I\}$

$y \equiv \{NDCG@10,\ IC@10,\ ARP@10\}$
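As an illustration of how some of these characteristics can be computed, here is a minimal sketch assuming the interactions are available as a pandas DataFrame with `userID` and `itemID` columns (the exact definitions used in the thesis are implemented in `DataPreparation`; the topological measures additionally require building the user-item graph):

```python
import numpy as np
import pandas as pd

def gini(counts: np.ndarray) -> float:
    """Gini coefficient of an interaction-count distribution (0 = uniform)."""
    x = np.sort(counts.astype(float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

def classical_characteristics(df: pd.DataFrame) -> dict:
    n_users = df['userID'].nunique()
    n_items = df['itemID'].nunique()
    n_inter = len(df)
    return {
        'SpaceSize': n_users * n_items,
        'Shape': n_users / n_items,               # one common definition of "shape"
        'Density': n_inter / (n_users * n_items),
        'Gini_U': gini(df['userID'].value_counts().to_numpy()),
        'Gini_I': gini(df['itemID'].value_counts().to_numpy()),
        'AvgDeg_U': n_inter / n_users,
        'AvgDeg_I': n_inter / n_items,
    }
```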
The model was tested under the following null hypothesis: the dataset characteristics in $X_c$ have no effect on the target metrics $y$, i.e. the corresponding regression coefficients are zero.
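A minimal sketch of such an explanatory regression and its significance tests, assuming the per-run characteristics and metrics have been collected in a DataFrame `runs` (statsmodels is used here purely for illustration; the actual analysis is implemented in `Evaluation`):

```python
import pandas as pd
import statsmodels.api as sm

CHARACTERISTICS = ['SpaceSize', 'Shape', 'Density', 'Gini_U', 'Gini_I',
                   'AvgDeg_U', 'AvgDeg_I', 'AvgClustC_U', 'AvgClustC_I',
                   'Assort_U', 'Assort_I']

def explain(runs: pd.DataFrame, target: str = 'NDCG@10'):
    """Fit y ~ X_c and return the fitted model plus per-coefficient p-values."""
    X = sm.add_constant(runs[CHARACTERISTICS])  # intercept + characteristics
    ols = sm.OLS(runs[target], X).fit()
    # Each p-value tests H0: the coefficient of that characteristic is zero,
    # i.e. the characteristic has no linear effect on the target metric.
    return ols, ols.pvalues
```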
Furthermore, for each run, the number of interactions, the average clustering coefficients, the assortativity, and the average popularity of the interacted items have been recorded.
Repository structure:

- `assets`: Stores material for the `README.md` files.
- `data`: Location for the recommendation dataset.
  - `mids-100000`: The first 100,000 rows of the `MIDS` dataset.
  - `mids-raw`: The raw dataset to be processed in `1-DataPreparation` to generate splits.
  - `mids-splits`: Storage for data splits used in the evaluation pipeline (output of `DataPreparation.ipynb`).
- `src`: Contains all steps performed as described in the thesis (see `README.md`).
  - `DataPreparation`: Creates dataset splits and calculates traditional & topological metrics.
  - `HyperParameterTuning`: Tunes and evaluates all model hyperparameters on the `mids-100000-1` split.
  - `EpochTuning`: Determines the optimal number of epochs on 10 randomly drawn datasets.
  - `TestRuns`: Conducts tests using random order and temporal order splits.
  - `Evaluation`: Builds evaluation files, performs evaluation, and conducts significance tests.
  - `AdditionalMaterial`: Contains additional plots referenced in the thesis.
  - `assets`: Stores generated plots and statistics.
  - `config`: Stores `config_files`, constants such as Colors and Paths, and methods used in many other directories.
  - `README.md`: Further information about the source code itself.
- `test`:
  - `hello.py`: Says hello.
- `.gitignore`: Specifies files to be excluded from the repository.
- `.python-version`: Defines the explicit Python version used (for uv).
- `pyproject.toml`: Lists dependencies required to run this project (for uv).
- `requirements.txt`: Lists dependencies required to run this project (for pip).
- `quick_start.py`: Provides a quick-start interface to access RecBole and RecBole-GNN for running models.
- `quick_start.yaml`: Configuration file for the quick-start setup in `quick_start.py`.
- `QuickEvaluation.ipynb`: Provides a quick-start interface to access the results.
- `uv.lock`: Contains locked versions of all dependencies in this project (for `uv sync --frozen`).
After a successful setup, `quick_start.py` and `QuickEvaluation.ipynb` provide a quick way to use RecBole and RecBole-GNN and to load the results of this work.
`QuickEvaluation.ipynb` offers a quick view of the results by loading the final evaluation dataset, creating various tables and plots, and performing the statistical analysis.
The `quick_start.py` script allows running any model provided through RecBole and RecBole-GNN on the datasets.
All configurations, including filtering, train/test splitting, and other settings, can be adjusted in `quick_start.yaml`.
For more details, refer to the RecBole configuration introduction.
The best model hyperparameter settings are listed at the bottom of the `quick_start.yaml` file.
Additionally, specific configuration files can be accessed through the `config` directory.
In `quick_start.py`, you can modify the following lines to select the desired dataset and models:
```python
model = '<Model>'
config_files = str(CONFIG_DIRECTORY.joinpath('<config_file>.yaml'))
dataset = '<Dataset>'
config_dict = {
    'data_path': PROJECT_DIRECTORY.joinpath('<Path_to_Dataset>')
}
```
Possible values:

- `Model`: `['AsymKNN', 'LightGCN', 'UltraGCN', 'ALS', 'BPR', 'SGL', 'XSimGCL', 'Pop']`
- `config_file`: `['quick_start', 'user_asym', 'item_asym', 'lightgcn', 'ultragcn', 'als', 'bpr', 'sgl', 'xsimgcl', 'mostpop']`
- `Dataset`: `['mids-100000', 'mids-raw', 'mids-splits-i']`
- `Path_to_Dataset`: `['', 'mids-splits']`
For dataset="mids-splits-i"
, where
Path_to_Dataset == 'mids-splits'
For Model="AsymKNN"
, one of the following must be set in quick_start.yaml
:
knn_method: ['item', 'user']
For config_file="quick_start"
, adjust the config_files
path:
config_files = str(PROJECT_DIRECTORY.joinpath('quick_start.yaml'))
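For illustration, a filled-in version of the snippet above, using one possible combination from the value lists (assuming split `1` exists under `mids-splits`):

```python
# Run LightGCN on the first pre-generated split.
model = 'LightGCN'
config_files = str(CONFIG_DIRECTORY.joinpath('lightgcn.yaml'))
dataset = 'mids-splits-1'
config_dict = {
    'data_path': PROJECT_DIRECTORY.joinpath('mids-splits')
}
```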
Setup with uv:

- Install uv (https://github.com/astral-sh/uv).
- Ensure the proper Python version `3.12.x`; if not:
  ```
  uv python install 3.12.5
  uv python pin 3.12.5
  ```
- Install packages:
  ```
  uv sync --frozen --extra build
  uv sync --frozen --extra build --extra compile
  ```
Setup with uv and `requirements.txt`:

- Install uv (https://github.com/astral-sh/uv).
- Ensure the proper Python version `3.12.x`; if not:
  ```
  uv python install 3.12.5
  uv python pin 3.12.5
  ```
- Create a virtual environment:
  ```
  uv venv
  ```
- Activate the virtual environment:
  - on Mac / Linux: `source .venv/bin/activate`
  - on Windows: `.venv\Scripts\activate`
- Install packages:
  ```
  uv pip install -r requirements.txt
  ```
Setup with pip:

- Use Python version `3.12.5`.
- Create a virtual environment in the project directory:
  ```
  python3 -m venv .venv
  ```
- Activate the virtual environment:
  - on Mac / Linux: `source .venv/bin/activate`
  - on Windows: `.venv\Scripts\activate`
- Upgrade pip:
  ```
  pip3 install --upgrade pip
  ```
- Install packages:
  ```
  pip install -r requirements.txt
  ```
Downloading the raw dataset is only necessary for creating the data splits; it is not required for the evaluation.
The dataset can be found on Google Drive. Store the contained files as follows:

- `mids_RAW_ANONYMIZED.txt` -> `data/mids-raw`
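If you want to feed the raw file to RecBole directly (the `mids-raw` dataset listed above), it first has to be converted into RecBole's atomic-file format. A minimal sketch, assuming the raw file is tab-separated without a header row and contains the userID, itemID, and timestamp columns listed earlier (the actual preprocessing is done in `DataPreparation`):

```python
import pandas as pd

# Assumption: tab-separated raw file with no header row.
raw = pd.read_csv('data/mids-raw/mids_RAW_ANONYMIZED.txt', sep='\t',
                  names=['userID', 'itemID', 'timestamp'])

# RecBole atomic files use typed column headers and the .inter suffix;
# the file name must match the dataset/folder name ('mids-raw').
raw.columns = ['user_id:token', 'item_id:token', 'timestamp:float']
raw.to_csv('data/mids-raw/mids-raw.inter', sep='\t', index=False)
```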