add a project about VN1 competition #179

Open · wants to merge 3 commits into `main`
5 changes: 5 additions & 0 deletions project/vn1_competition/Makefile
@@ -0,0 +1,5 @@
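# Download the three phases of VN1 sales data (Phase 0/1/2) into ./data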
download_data:
mkdir -p data
curl https://www.datasource.ai/attachments/eyJpZCI6Ijk4NDYxNjE2NmZmZjM0MGRmNmE4MTczOGMyMzI2ZWI2LmNzdiIsInN0b3JhZ2UiOiJzdG9yZSIsIm1ldGFkYXRhIjp7ImZpbGVuYW1lIjoiUGhhc2UgMCAtIFNhbGVzLmNzdiIsInNpemUiOjEwODA0NjU0LCJtaW1lX3R5cGUiOiJ0ZXh0L2NzdiJ9fQ -o data/phase_0_sales.csv
curl https://www.datasource.ai/attachments/eyJpZCI6ImM2OGQxNGNmNTJkZDQ1MTUyZTg0M2FkMDAyMjVlN2NlLmNzdiIsInN0b3JhZ2UiOiJzdG9yZSIsIm1ldGFkYXRhIjp7ImZpbGVuYW1lIjoiUGhhc2UgMSAtIFNhbGVzLmNzdiIsInNpemUiOjEwMTgzOTYsIm1pbWVfdHlwZSI6InRleHQvY3N2In19 -o data/phase_1_sales.csv
curl https://www.datasource.ai/attachments/eyJpZCI6IjhlNmJmNmU3ZTlhNWQ4NTcyNGVhNTI4YjAwNTk3OWE1LmNzdiIsInN0b3JhZ2UiOiJzdG9yZSIsIm1ldGFkYXRhIjp7ImZpbGVuYW1lIjoiUGhhc2UgMiAtIFNhbGVzLmNzdiIsInNpemUiOjEwMTI0MzcsIm1pbWVfdHlwZSI6InRleHQvY3N2In19 -o data/phase_2_sales.csv
70 changes: 70 additions & 0 deletions project/vn1_competition/README.md
@@ -0,0 +1,70 @@
# The Fine-Tuned Moirai-Base Model Achieves 1st Place in the VN1 Challenge

We present a reproducible experiment where **Salesforce's Moirai** pretrained model, after simple fine-tuning, achieves **first place** in the [VN1 Forecasting - Accuracy Challenge](https://www.datasource.ai/en/home/data-science-competitions-for-startups/phase-2-vn1-forecasting-accuracy-challenge/description).

One of the competition's key requirements was the use of open-source solutions, and the Moirai pretrained model qualifies: its code, training scripts, and data are all open-sourced. Fine-tuning follows the standard approach in the `uni2ts` codebase and requires only minor changes to the parameters of the fine-tuning scripts.

The table below shows the official competition results: Moirai-base outperformed all competitors to claim the top position. The final score was obtained by averaging five prediction runs (a sketch of this averaging follows the table).

| **Model** | **Score** |
| ----------- | ---------- |
| **Moirai-base** | **0.4629** |
| 1st | 0.4637 |
| 2nd | 0.4657 |
| 3rd | 0.4758 |
| 4th | 0.4774 |
| 5th | 0.4808 |
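
A rough sketch of the five-run averaging (the file names and column layout here are assumptions, not the submission's actual code):

```python
import pandas as pd

# Hypothetical inputs: five prediction CSVs from independent fine-tuning runs,
# each keyed by Client/Warehouse/Product with one column per forecast week.
paths = [f"predictions_run{i}.csv" for i in range(5)]
keys = ["Client", "Warehouse", "Product"]

frames = [pd.read_csv(p).set_index(keys) for p in paths]
ensemble = sum(frames) / len(frames)  # element-wise mean across the five runs
ensemble.reset_index().to_csv("submission.csv", index=False)
```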

---

### [**VN1 Forecasting**](https://www.datasource.ai/en/home/data-science-competitions-for-startups/phase-2-vn1-forecasting-accuracy-challenge/description)
Participants in this datathon are tasked with accurately forecasting future sales using the provided historical sales and pricing data. The goal is to develop robust predictive models that anticipate sales trends for various products across different clients and warehouses. Submissions are evaluated on their accuracy and bias against actual sales figures (an illustrative scoring sketch follows the phase descriptions). The competition is structured into two phases.

#### Phase 1
In this phase participants will use the provided Phase 0 sales data to predict sales for Phase 1. This phase will last three weeks, during which there will be live leaderboard updates to track the progress and provide feedback on the predictions. At the end of Phase 1, participants will receive the actual sales data for this phase.

#### Phase 2
Using both Phase 0 and Phase 1 data, participants will predict sales for Phase 2. This second phase will last two weeks, but unlike Phase 1, there will be no leaderboard updates until the competition ends.
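
As noted above, submissions are scored on both accuracy and bias. The sketch below assumes the commonly described VN1 formula (normalized absolute error plus normalized absolute bias); it is an interpretation of the competition description, not code from this submission:

```python
import numpy as np

def vn1_score(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Assumed VN1 metric: normalized absolute error plus normalized absolute bias.

    Both terms are normalized by total actual sales; lower is better.
    """
    abs_error = np.abs(actual - forecast).sum()
    abs_bias = np.abs((forecast - actual).sum())
    return float((abs_error + abs_bias) / actual.sum())
```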

---

### [**Data Overview**](https://www.datasource.ai/en/home/data-science-competitions-for-startups/phase-2-vn1-forecasting-accuracy-challenge/datasets)
#### The competition data consists of three phases:
- Phase 0: historical training data
- Phase 1: additional training data
- Phase 2: test data (used for evaluation)

#### Each data entry includes the following fields:
- Client: Client ID
- Warehouse: Warehouse ID
- Product: Product ID
- Weekly sales values (one column per weekly date)
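
A quick way to inspect the raw layout (assuming the CSVs fetched by `make download_data`): the first three columns are the IDs above, and every remaining column is a weekly date.

```python
import pandas as pd

df = pd.read_csv("data/phase_0_sales.csv")
print(df.columns[:3].tolist())  # ['Client', 'Warehouse', 'Product']

# Reshape to long format: one (series, week) observation per row.
long = df.melt(
    id_vars=["Client", "Warehouse", "Product"],
    var_name="week",
    value_name="sales",
)
```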

---

### **How to Run**
To reproduce the experimental results, refer to this [blog](https://zhuanlan.zhihu.com/p/20755649808).

#### Instructions
1. Follow the instructions from the `uni2ts` library to create a virtual environment and install dependencies.
2. Download the raw competition data using the provided `Makefile`:
```bash
make download_data
```
3. Update the path to the downloaded raw data in `prepare_data.py` if needed, then run it to generate the preprocessed train and validation datasets.
4. Add the directory containing the processed datasets to the `.env` file (replace `PATH_TO_SAVE` with that directory):
```bash
echo "CUSTOM_DATA_PATH=PATH_TO_SAVE" >> .env
```
5. Replace the variable `pretrained_model_name_or_path` in the configuration file with your own path, then run the following command to fine-tune the `Moirai-base` model:
```bash
python -m cli.train -cp ../project/vn1_competition/fine_tune run_name=run1
```
6. Replace the weight file path in `main.py` under the `src` directory with your fine-tuned checkpoint, then run `main.py` to produce the forecasts (a minimal inference sketch follows this list).
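
The actual inference code lives in `src/main.py`; the following is only a minimal sketch of loading fine-tuned weights and building a predictor with the `uni2ts` API. The checkpoint path and the Lightning `"module."` prefix handling are assumptions:

```python
import torch
from uni2ts.model.moirai import MoiraiForecast, MoiraiModule

# Hypothetical checkpoint path produced by step 5; replace with your own.
ckpt = "outputs/finetune/moirai_1.1_R_base/VN1/run1/checkpoints/best.ckpt"

model = MoiraiForecast(
    module=MoiraiModule.from_pretrained("Salesforce/moirai-1.1-R-base"),
    prediction_length=13,  # matches prediction_lengths in VN1_val.yaml
    context_length=65,     # matches context_lengths in VN1_val.yaml
    patch_size=16,
    num_samples=100,
    target_dim=1,
    feat_dynamic_real_dim=0,
    past_feat_dynamic_real_dim=0,
)

# Lightning checkpoints prefix module parameters with "module.";
# strip it before loading (an assumption about the checkpoint layout).
state = torch.load(ckpt, map_location="cpu")["state_dict"]
module_state = {
    k[len("module."):]: v for k, v in state.items() if k.startswith("module.")
}
model.module.load_state_dict(module_state)

predictor = model.create_predictor(batch_size=32)  # GluonTS-style predictor
```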

---

### **References**

- Vandeput, Nicolas. "VN1 Forecasting - Accuracy Challenge." DataSource.ai, 3 Oct. 2024, [https://www.datasource.ai/en/home/data-science-competitions-for-startups/phase-2-vn1-forecasting-accuracy-challenge/description](https://www.datasource.ai/en/home/data-science-competitions-for-startups/phase-2-vn1-forecasting-accuracy-challenge/description)
- Woo, Gerald, et al. "Unified Training of Universal Time Series Forecasting Transformers." arXiv:2402.02592 (2024), [https://arxiv.org/abs/2402.02592](https://arxiv.org/abs/2402.02592)
3 changes: 3 additions & 0 deletions project/vn1_competition/fine_tune/VN1.yaml
@@ -0,0 +1,3 @@
_target_: uni2ts.data.builder.simple.SimpleDatasetBuilder
dataset: train_dataset
weight: 1
10 changes: 10 additions & 0 deletions project/vn1_competition/fine_tune/VN1_val.yaml
@@ -0,0 +1,10 @@
_target_: uni2ts.data.builder.ConcatDatasetBuilder
_args_:
  _target_: uni2ts.data.builder.simple.generate_eval_builders
  dataset: val_dataset
  offset: 97
  eval_length: 16
  prediction_lengths: [13]
  context_lengths: [65]
  patch_sizes: [8, 16]

84 changes: 84 additions & 0 deletions project/vn1_competition/fine_tune/config.yaml
@@ -0,0 +1,84 @@
hydra:
  run:
    dir: outputs/finetune/${hydra:runtime.choices.model}/${hydra:runtime.choices.data}/${run_name}
defaults:
  - model: ../moirai_1.1_R_base
  - data: ../VN1
  - val_data: ../VN1_val
  - _self_
run_name: ???
seed: 0
tf32: true
compile: false  # set to mode: default, reduce-overhead, max-autotune
ckpt_path: null
trainer:
  _target_: lightning.Trainer
  accelerator: auto
  strategy: auto
  devices: [0,1,2,3]
  num_nodes: 1
  precision: 32
  logger:
    _target_: lightning.pytorch.loggers.TensorBoardLogger
    save_dir: ${hydra:runtime.output_dir}
    name: logs
  callbacks:
    - _target_: lightning.pytorch.callbacks.LearningRateMonitor
      logging_interval: epoch
    - _target_: lightning.pytorch.callbacks.ModelCheckpoint
      dirpath: ${hydra:runtime.output_dir}/checkpoints
      monitor: val/PackedNLLLoss
      save_weights_only: true
      mode: min
      save_top_k: 1
      every_n_epochs: 1
    - _target_: lightning.pytorch.callbacks.EarlyStopping
      monitor: val/PackedNLLLoss
      min_delta: 0.0
      patience: 5
      mode: min
      strict: false
      verbose: true
  max_epochs: 100
  enable_progress_bar: true
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  gradient_clip_algorithm: norm
train_dataloader:
  _target_: uni2ts.data.loader.DataLoader
  batch_size: 128
  batch_size_factor: 2.0
  cycle: true
  num_batches_per_epoch: 100
  shuffle: true
  num_workers: 11
  collate_fn:
    _target_: uni2ts.data.loader.PackCollate
    max_length: ${model.module_kwargs.max_seq_len}
    seq_fields: ${cls_getattr:${model._target_},seq_fields}
    pad_func_map: ${cls_getattr:${model._target_},pad_func_map}
  pin_memory: true
  drop_last: false
  fill_last: false
  worker_init_fn: null
  prefetch_factor: 2
  persistent_workers: true
val_dataloader:
  _target_: uni2ts.data.loader.DataLoader
  batch_size: 128
  batch_size_factor: 2.0
  cycle: false
  num_batches_per_epoch: null
  shuffle: false
  num_workers: 11
  collate_fn:
    _target_: uni2ts.data.loader.PackCollate
    max_length: ${model.module_kwargs.max_seq_len}
    seq_fields: ${cls_getattr:${model._target_},seq_fields}
    pad_func_map: ${cls_getattr:${model._target_},pad_func_map}
  pin_memory: false
  drop_last: false
  fill_last: true
  worker_init_fn: null
  prefetch_factor: 2
  persistent_workers: true
34 changes: 34 additions & 0 deletions project/vn1_competition/fine_tune/moirai_1.1_R_base.yaml
@@ -0,0 +1,34 @@
# load a pretrained checkpoint from huggingface hub
_target_: uni2ts.model.moirai.MoiraiFinetune
module:
  _target_: uni2ts.model.moirai.MoiraiModule.from_pretrained
  pretrained_model_name_or_path: Salesforce/moirai-1.1-R-base
module_kwargs:
  _target_: builtins.dict
  distr_output:
    _target_: uni2ts.distribution.MixtureOutput
    components:
      - _target_: uni2ts.distribution.StudentTOutput
      - _target_: uni2ts.distribution.NormalFixedScaleOutput
      - _target_: uni2ts.distribution.NegativeBinomialOutput
      - _target_: uni2ts.distribution.LogNormalOutput
  d_model: 768
  num_layers: 12
  patch_sizes: ${as_tuple:[8, 16, 32, 64, 128]}
  max_seq_len: 512
  attn_dropout_p: 0.0
  dropout_p: 0.0
  scaling: true
min_patches: 2
min_mask_ratio: 0.1
max_mask_ratio: 0.4
max_dim: 128
loss_func:
  _target_: uni2ts.loss.packed.PackedNLLLoss
lr: 5e-8
weight_decay: 1e-1
beta1: 0.9
beta2: 0.98
num_training_steps: 10000
num_warmup_steps: 0
81 changes: 81 additions & 0 deletions project/vn1_competition/prepare_data.py
@@ -0,0 +1,81 @@
import os
from collections.abc import Generator
from typing import Any

import datasets
import pandas as pd
from datasets import Features, Sequence, Value


def train_example_gen_func() -> Generator[dict[str, Any], None, None]:
    """Yield one training example per item series (column) of the wide frame."""
    for i, (product_id, df) in enumerate(train_df.items()):
        yield {
            "target": df.to_numpy(),
            "start": df.index[0],
            "freq": pd.infer_freq(df.index),
            "item_id": f"item_{i}",
        }


def val_example_gen_func() -> Generator[dict[str, Any], None, None]:
    """Yield one validation example per item series; uses the full history."""
    for i, (product_id, df) in enumerate(val_df.items()):
        yield {
            "target": df.to_numpy(),
            "start": df.index[0],
            "freq": pd.infer_freq(df.index),
            "item_id": f"item_{i}",
        }


def get_data(file_path1, file_path2):
    """Concatenate Phase 0 and Phase 1 sales and pivot to a wide, time-indexed frame."""
    df_sales_0 = pd.read_csv(file_path1)
    df_sales_1 = pd.read_csv(file_path2)
    # Phase 1 shares the same ID columns; keep only its weekly sales columns.
    df_sales = pd.concat([df_sales_0, df_sales_1.iloc[:, 3:]], axis=1)
    # Build a unique series ID from the Client/Warehouse/Product triple.
    df_sales["item_id"] = (
        df_sales["Client"].astype(str)
        + "-"
        + df_sales["Warehouse"].astype(str)
        + "-"
        + df_sales["Product"].astype(str)
    )
    df_sales.drop(columns=["Client", "Warehouse", "Product"], inplace=True)
    cols = ["item_id"] + [col for col in df_sales.columns if col != "item_id"]
    df_sales = df_sales[cols]
    # Transpose so rows are weekly timestamps and columns are item series.
    df_sales = df_sales.T
    df_sales.columns = df_sales.iloc[0]
    df_sales.drop(df_sales.index[0], inplace=True)
    df_sales.index = pd.to_datetime(df_sales.index)
    return df_sales


current_dir = os.path.dirname(os.path.abspath(__file__))
file_path1 = os.path.join(current_dir, "data/phase_0_sales.csv")
file_path2 = os.path.join(current_dir, "data/phase_1_sales.csv")

df_sales = get_data(file_path1, file_path2)
# Drop the first 70 weeks of history.
df_sales = df_sales.iloc[70:, :]
df_sales.index.name = "timestamp"
# Remove sparse series: drop items whose sales are zero more than half the time.
zero_ratios = (df_sales == 0).mean()
cols_to_drop = zero_ratios[zero_ratios > 0.5].index
df_sales = df_sales.drop(columns=cols_to_drop)

# Hold out the last 16 weeks from training; validation keeps the full history
# (the evaluation windows are carved out via offset/eval_length in VN1_val.yaml).
train_df = df_sales.iloc[:-16, :]
val_df = df_sales.iloc[:, :]

features = Features(
dict(
target=Sequence(Value("float32")),
start=Value("timestamp[s]"),
freq=Value("string"),
item_id=Value("string"),
)
)

train_dataset = datasets.Dataset.from_generator(
train_example_gen_func, features=features
)
val_dataset = datasets.Dataset.from_generator(val_example_gen_func, features=features)
train_dataset_path = os.path.join(current_dir, "train_dataset")
val_dataset_path = os.path.join(current_dir, "val_dataset")
train_dataset.save_to_disk(train_dataset_path)
val_dataset.save_to_disk(val_dataset_path)
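
Note that the saved directory names `train_dataset` and `val_dataset` match the `dataset:` fields in `VN1.yaml` and `VN1_val.yaml`, so `CUSTOM_DATA_PATH` in `.env` should point at the directory containing them. A quick sanity check with the standard `datasets` API (the path here is an assumption, adjust to your layout):

```python
import datasets

train = datasets.load_from_disk("project/vn1_competition/train_dataset")
print(train)             # num_rows = number of retained item series
print(train[0]["freq"])  # weekly frequency string, e.g. "W-MON"
```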