Thesis-GNN-Rec-2025

🕵️ Introduction

This repository contains code and resources for my final thesis titled "Application of Graph Neural Networks to Music Recommender Systems." Recommender Systems (RSs) play a crucial role in filtering vast amounts of data to deliver personalized content. Music Recommender Systems (MuRSs) enhance user experience by predicting preferences, helping users navigate extensive music libraries. Recent advancements in Graph Neural Networks (GNNs) have set new standards in RSs, but their evaluation remains inconsistent across datasets and splitting strategies. This work applies traditional and GNN-based models to a new music industry dataset, using temporal data splitting for a realistic evaluation. To this end, the evaluation pipeline recently proposed by Malitesta et al. (2024) has been applied and extended to a broad set of models and beyond-accuracy metrics. Code and results are available in this repository.

💾 Dataset

MIDS ... Music Industry Dataset

  • # of customers (users): 58,747
  • # of records (items): 37,370

| dataset | # rows | (users, items) | sparsity | features |
| --- | --- | --- | --- | --- |
| MIDS (filtered) | 17,665,904 | (58,747, 37,370) | 99.1953 % | userID, itemID, timestamp |
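
The reported sparsity follows directly from the counts above; a quick check in Python:

```python
# Reproduce the sparsity reported for the filtered MIDS dataset from the
# user, item, and interaction counts given above.
n_users = 58_747
n_items = 37_370
n_interactions = 17_665_904

# Sparsity: fraction of the user-item matrix without an interaction.
sparsity = 1 - n_interactions / (n_users * n_items)
print(f"{sparsity:.4%}")  # 99.1953%
```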

⚙️ Methodology

The evaluation pipeline comprises the following steps:

  1. Create data splits
  2. Calculate dataset characteristics (classical & topological)
  3. Apply traditional and GNN-based models to each split (for RO & TO split)
  4. Apply explanatory model (linear regression)
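
Step 2 includes distributional characteristics such as the Gini coefficients of the user and item interaction counts ($Gini_U$, $Gini_I$). A minimal sketch of such a computation, assuming the standard Gini formula (the thesis may normalize differently):

```python
def gini(counts):
    """Gini coefficient of non-negative interaction counts:
    0 = perfectly even, values near 1 = highly concentrated."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula for sorted non-negative values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

gini([10, 10, 10, 10])  # even distribution -> 0.0
```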

Distribution of the number of interactions.

All tests have been conducted using RecBole and RecBole-GNN.

💎 Results

📈 Performance

After hyperparameter and epoch tuning, Top-10 recommendation was evaluated for all users on each dataset, using a random order (RO) split (70/10/20) and a temporal order (TO) split with leave-5-out (5/5) for the validation and test sets. The following tables present the mean performance across all datasets for each model, ranked in descending order by NDCG@10. The best values are bolded, while the second-highest values are underlined.
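
The temporal leave-5-out protocol can be sketched as follows; the function name and tuple layout are illustrative, not the repository's code:

```python
# Illustrative leave-5-out temporal split: per user, the 5 most recent
# interactions form the test set and the 5 before those the validation set.
# Assumes users have more than n_valid + n_test interactions; shorter
# histories land entirely in the newer splits.
from collections import defaultdict

def leave_n_out_split(interactions, n_valid=5, n_test=5):
    """interactions: iterable of (user, item, timestamp) tuples."""
    by_user = defaultdict(list)
    for row in interactions:
        by_user[row[0]].append(row)

    train, valid, test = [], [], []
    for rows in by_user.values():
        rows.sort(key=lambda r: r[2])                 # oldest to newest
        test.extend(rows[-n_test:])                   # newest n_test
        valid.extend(rows[-(n_test + n_valid):-n_test])
        train.extend(rows[:-(n_test + n_valid)])      # everything older
    return train, valid, test
```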

RO (70/10/20)

| Algorithm | Pre | MRR | NDCG | IC | ARP | APLT |
| --- | --- | --- | --- | --- | --- | --- |
| ALS-MF | 0.152966 | 0.329359 | 0.198749 | 0.075893 | 57.758945 | 0.000234 |
| XSimGCL | 0.150149 | 0.329086 | 0.194740 | 0.120137 | 51.662866 | 0.010322 |
| AsymUserkNN | 0.145127 | 0.328943 | 0.190405 | 0.108999 | 74.511185 | 0.015634 |
| SGL | 0.147338 | 0.319677 | 0.190309 | 0.144644 | 44.765344 | 0.011939 |
| BPR | 0.138416 | 0.310411 | 0.178590 | 0.080508 | 73.938899 | 0.001084 |
| LightGCN | 0.134315 | 0.302034 | 0.173993 | 0.115027 | 60.163477 | 0.004301 |
| UltraGCN | 0.133322 | 0.292251 | 0.172359 | 0.084255 | 70.992891 | 0.001501 |
| AsymItemkNN | 0.124047 | 0.193475 | 0.137902 | 0.130814 | 46.564064 | 0.051082 |
| MostPop | 0.039420 | 0.096802 | 0.048425 | 0.001969 | 120.488178 | 0.000000 |
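
NDCG@10, by which the tables are ranked, is the usual binary-relevance form in top-k recommendation evaluation; a minimal sketch (not the repository's implementation):

```python
# Minimal binary-relevance NDCG@k: discounted gain of hits in the
# recommended ranking, normalized by the ideal ranking's gain.
import math

def ndcg_at_k(ranked_items, relevant_items, k=10):
    hits = [1 if item in relevant_items else 0 for item in ranked_items[:k]]
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    ideal_hits = min(len(relevant_items), k)
    idcg = sum(1 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```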

Boxplots of the performance of test runs with RO.
TO (5/5)

| Algorithm | Pre | MRR | NDCG | IC | ARP | APLT |
| --- | --- | --- | --- | --- | --- | --- |
| ALS-MF | 0.032978 | 0.079593 | 0.065587 | 0.089447 | 71.046930 | 0.000184 |
| UltraGCN | 0.031105 | 0.079218 | 0.061834 | 0.068403 | 96.395474 | 0.000218 |
| XSimGCL | 0.030875 | 0.077371 | 0.061724 | 0.098007 | 71.387480 | 0.006263 |
| AsymUserkNN | 0.030218 | 0.074655 | 0.190405 | 0.105359 | 96.254631 | 0.009973 |

Boxplots of the performance of test runs with TO.

💡 Influence Analysis

For all models, the influence $(\beta_c)$ of different dataset characteristics $(X_c)$ on selected target metrics $(y)$ has been investigated. To this end, a linear regression model was fitted as follows:

$$y=\beta_0 + \beta_c X_c + e$$

where:

  • $X_c \in \{SpaceSize, Shape, Density, Gini_U, Gini_I, AvgDeg_U, AvgDeg_I, AvgClustC_U, AvgClustC_I, Assort_U, Assort_I\}$
  • $y \in \{NDCG@10, IC@10, ARP@10\}$

The model was tested under the following null hypothesis:

$$H_0: \beta_c = 0, \quad H_1: \beta_c \neq 0$$

The values of $\beta_c$ are represented by the bar length, while the $p$-value is indicated by the color.
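
A minimal sketch of such a regression with coefficient $t$-tests, using NumPy and SciPy (the function name and library choice are assumptions; the thesis's estimation code is not shown in this README):

```python
# Ordinary least squares for y = b0 + b_c * X_c + e with a two-sided
# t-test of H0: b_c = 0 for each coefficient.
import numpy as np
from scipy import stats

def ols_with_pvalues(X, y):
    """X: (n, p) matrix of characteristics, y: (n,) target metric."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), np.asarray(X)])  # intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    dof = n - X1.shape[1]
    sigma2 = resid @ resid / dof                       # residual variance
    cov = sigma2 * np.linalg.inv(X1.T @ X1)            # est. covariance
    t_stats = beta / np.sqrt(np.diag(cov))
    p_values = 2 * stats.t.sf(np.abs(t_stats), dof)    # two-sided p-values
    return beta, p_values
```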

Influence of dataset characteristics on NDCG@10 for XSimGCL.

Furthermore, for each run, the number of interactions, average clustering coefficients, assortativity, and the average popularity of the interacted items have been recorded for the $10$ users who received the best recommendations and the $10$ who received the worst.

Characteristics of the $10$ users who received the best and worst recommendations for XSimGCL.

🔍 Project Structure

  • assets: Stores material for the README.md files.

  • data: Location for the recommendation dataset.

    • mids-100000: The first 100,000 rows of the MIDS dataset.
    • mids-raw: The raw dataset to be processed in 1-DataPreparation to generate splits.
    • mids-splits: Storage for data splits used in the evaluation pipeline (output of DataPreparation.ipynb).
  • src: Contains all steps performed as described in the thesis (see README.md)

    • DataPreparation: Creates dataset splits and calculates traditional & topological metrics.
    • HyperParameterTuning: Tunes and evaluates all model hyperparameters on the mids-100000-1 split.
    • EpochTuning: Determines the optimal number of epochs on 10 randomly drawn datasets.
    • TestRuns: Conducts tests using random order and temporal order splits.
    • Evaluation: Builds evaluation files, performs evaluation, and conducts significance tests.
    • AdditionalMaterial: Contains additional plots referenced in the thesis.
    • assets: Stores generated plots and statistics.
    • config: Stores config_files, constants such as Colors and Paths, and methods used in many other directories.
    • README.md: Further information about the source code itself.
  • test:

    • hello.py: says hello
  • .gitignore: Specifies files to be excluded from the repository.

  • .python-version: Defines the explicit Python version used (for uv).

  • pyproject.toml: Lists dependencies required to run this project (for uv).

  • requirements.txt: Lists dependencies required to run this project (for pip).

  • quick_start.py: Provides a quick-start interface to access RecBole and RecBole-GNN for running models.

  • quick_start.yaml: Configuration file for the quick-start setup in quick_start.py.

  • QuickEvaluation.ipynb: Provides a quick-start interface to access the results.

  • uv.lock: Contains locked versions of all dependencies in this project (for uv sync --frozen).

Quick Start

After a successful setup, quick_start.py and QuickEvaluation.ipynb provide a quick way to use RecBole and RecBole-GNN and to load the results of this work.

via notebook

The QuickEvaluation.ipynb notebook offers a quick view into the results of this work: it loads the final evaluation dataset, creates diverse tables and plots, and performs the statistical analysis.

via script

The quick_start.py script allows any model provided through RecBole and RecBole-GNN to be run on the datasets. All configurations, including filtering, train/test splitting, and other settings, can be adjusted in quick_start.yaml.
For more details, refer to the RecBole configuration introduction.

The best model hyperparameter settings are listed at the bottom of the quick_start.yaml file.
Additionally, specific configuration files can be accessed through the config directory.

In quick_start.py, you can modify the following lines to select the desired dataset and models:

```python
model = '<Model>'
config_files = str(CONFIG_DIRECTORY.joinpath('<config_file>.yaml'))
dataset = '<Dataset>'
config_dict = {
    'data_path': PROJECT_DIRECTORY.joinpath('<Path_to_Dataset>')
}
```

Possible values:

| Placeholder | Possible values |
| --- | --- |
| Model | 'AsymKNN', 'LightGCN', 'UltraGCN', 'ALS', 'BPR', 'SGL', 'XSimGCL', 'Pop' |
| config_file | 'quick_start', 'user_asym', 'item_asym', 'lightgcn', 'ultragcn', 'als', 'bpr', 'sgl', 'xsimgcl', 'mostpop' |
| Dataset | 'mids-100000', 'mids-raw', 'mids-splits-i' |
| Path_to_Dataset | '', 'mids-splits' |

For dataset="mids-splits-i", where $i \in \{1, \dots, 176\}$, the splits must be created, and the correct path to these datasets must be specified:

Path_to_Dataset == 'mids-splits'

For Model="AsymKNN", one of the following must be set in quick_start.yaml:

knn_method: ['item', 'user']

For config_file="quick_start", adjust the config_files path:

config_files = str(PROJECT_DIRECTORY.joinpath('quick_start.yaml'))

Setup

Setup with uv (recommended)

  1. Install uv (https://github.com/astral-sh/uv).

  2. Ensure the proper Python version (3.12.x) is active; if not:

     uv python install 3.12.5
     uv python pin 3.12.5
    
  3. Install Packages

    uv sync --frozen --extra build
    uv sync --frozen --extra build --extra compile
    

Setup with uv pip

  1. Install uv (https://github.com/astral-sh/uv)

  2. Ensure the proper Python version (3.12.x) is active; if not:

     uv python install 3.12.5
     uv python pin 3.12.5
    
  3. Create virtual environment:

     uv venv
    
  4. Activate virtual environment:

  • on Mac / Linux:

      source .venv/bin/activate
    
  • on Windows:

      .venv\Scripts\activate
    
  5. Install Packages

     uv pip install -r requirements.txt
    

Setup with pip

  1. Use python version 3.12.5

  2. Create virtual environment in project directory:

    python3 -m venv .venv
    
  3. Activate virtual environment:

  • on Mac / Linux:

      source .venv/bin/activate
    
  • on Windows:

      .venv\Scripts\activate
    
  4. Upgrade pip

    pip3 install --upgrade pip
    
  5. Install Packages

     pip install -r requirements.txt
    

Download and store the Dataset

Only necessary for creating the data splits for the evaluation.

The dataset can be found on Google Drive. Store the contained files as follows:

  • mids_RAW_ANONYMIZED.txt -> data/mids-raw
