RecTest Library: Data Schema and Benchmarking Guide

Overview

The RecTest library lets you benchmark your own datasets for recommender systems. This guide describes the required data schema and the steps to get started.

Data Schema

Initial Dataset

Your dataset should be in JSONL format (one JSON object per line), similar to the 2018 and 2023 versions of Amazon Reviews. Each line represents one interaction between a user and an item.

Required Fields:

user_id (str): ID of the user interacting with the product.
item_id (str): ID of the product.
timestamp (int): Time of the interaction in Unix time.

Optional Fields:

review_text (str): Review left by the user.
score (bool|float): Rating the user gave the item, either binary or a float.

Note: The names of the input columns can be set in the YAML configuration file.
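As an illustration of the schema above, here is a minimal sketch of checking one interaction line before preprocessing. The `validate_interaction` helper and the `COLUMN_MAP` dictionary are hypothetical and not part of RecTest; in the library itself the column names are set in the YAML configuration file, whose keys are not shown here. The source column names follow the style of Amazon Reviews 2018.

```python
import json

# Required fields and their expected types, as listed in the schema above.
REQUIRED_FIELDS = {"user_id": str, "item_id": str, "timestamp": int}

# Hypothetical mapping from source column names to the schema names.
COLUMN_MAP = {
    "reviewerID": "user_id",
    "asin": "item_id",
    "unixReviewTime": "timestamp",
    "overall": "score",
    "reviewText": "review_text",
}


def validate_interaction(line, column_map=None):
    """Parse one JSONL line, rename its columns, and check the required fields."""
    record = json.loads(line)
    if column_map:
        record = {column_map.get(key, key): value for key, value in record.items()}
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing required field: {name}")
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"field {name} should be of type {expected_type.__name__}")
    return record


raw_line = '{"reviewerID": "A1", "asin": "B00X", "unixReviewTime": 1514764800, "overall": 5.0}'
interaction = validate_interaction(raw_line, COLUMN_MAP)
```

Optional fields such as `score` and `review_text` are passed through unchecked in this sketch; only the three required fields are validated.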

Metadata-Items

You can optionally provide additional metadata for items.

Required Field:

item_id: ID of the product.

Optional Fields:

feature_1
feature_2
feature_3
...

Example (based on Amazon Reviews 2018):

item_id: ID of the product.
title: Name of the product.
description: Description of the product.
brand: Brand name.
categories: List of categories the product belongs to.
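A metadata record with these fields could look like the sketch below. It assumes the metadata is also stored as one JSON object per line, which this guide does not state, and the `missing_metadata` helper is hypothetical. It only shows one way to check that every interacted item has a metadata entry.

```python
import json

# Example item record using the Amazon Reviews 2018 style fields listed above.
item_metadata = {
    "item_id": "B00X",
    "title": "Example product",
    "description": "A placeholder description.",
    "brand": "ExampleBrand",
    "categories": ["Electronics", "Accessories"],
}


def missing_metadata(interaction_item_ids, metadata_records):
    """Return the item IDs that appear in interactions but have no metadata entry."""
    known = {record["item_id"] for record in metadata_records}
    return sorted(set(interaction_item_ids) - known)


# One line of a metadata file, under the one-object-per-line assumption.
metadata_line = json.dumps(item_metadata)

# Items seen in interactions but absent from the metadata.
uncovered = missing_metadata(["B00X", "B00Y"], [item_metadata])
```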

Output Dataset

The provided data is run through a custom preprocessing step. The resulting dataset folder has the following structure:

Example Directory Structure

Name_of_Dataset
- interactions: Dask DataFrame
- train (optional): Dask DataFrame
- val (optional): Dask DataFrame
- test (optional): Dask DataFrame
- metadata (optional): Dask DataFrame
- encoder_items.json (optional): Dictionary in the form {item_id: item_id_encoded}
- encoder_users.json (optional): Dictionary in the form {user_id: user_id_encoded}
- stats.json (optional): Information about the number of unique users, items, and interactions.
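To make the encoder and stats files more concrete, the following sketch builds dictionaries of the form {item_id: item_id_encoded} and {user_id: user_id_encoded} from a small in-memory list of interactions. It uses plain Python rather than Dask to stay self-contained. Encoding IDs as consecutive integers in order of first appearance and the key names inside `stats` are assumptions, not the library's documented behavior.

```python
import json


def build_encoder(ids):
    """Map each distinct ID to a consecutive integer, in order of first appearance."""
    encoder = {}
    for raw_id in ids:
        if raw_id not in encoder:
            encoder[raw_id] = len(encoder)
    return encoder


interactions = [
    {"user_id": "A1", "item_id": "B00X", "timestamp": 1514764800},
    {"user_id": "A2", "item_id": "B00X", "timestamp": 1514851200},
    {"user_id": "A1", "item_id": "B00Y", "timestamp": 1514937600},
]

encoder_users = build_encoder(row["user_id"] for row in interactions)
encoder_items = build_encoder(row["item_id"] for row in interactions)

# Hypothetical key names; the real stats.json layout may differ.
stats = {
    "n_users": len(encoder_users),
    "n_items": len(encoder_items),
    "n_interactions": len(interactions),
}

# Serialized forms, as they might be written to the JSON files.
encoder_items_json = json.dumps(encoder_items)
stats_json = json.dumps(stats)
```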

Benchmarking Your Dataset

To benchmark your dataset using the RecTest library, follow these steps:

1. Prepare Your Data: Ensure your data is in the required JSONL format with the necessary fields.
2. Create a YAML Configuration File: Define the column names and other settings in a YAML file.
3. Run the Preprocessing Script: Use the provided script to preprocess your data.
4. Train and Evaluate Models: Use the training scripts to benchmark your dataset.

For detailed instructions and examples, refer to notebooks/create_dataset.ipynb.

Using Recformer and LlamaRec

To avoid copying code from their GitHub repositories, we have added both Recformer and LlamaRec as Git submodules.

Initializing Submodules

To use Recformer and LlamaRec, initialize the Git submodules first:

git submodule update --init --recursive
