add a project about VN1 competition #179

Open · wants to merge 3 commits into `main`
5 changes: 5 additions & 0 deletions project/vn1_competition/Makefile
@@ -0,0 +1,5 @@
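# Download the three phases of VN1 sales data (Phase 0/1/2) into ./data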
download_data:
mkdir -p data
curl https://www.datasource.ai/attachments/eyJpZCI6Ijk4NDYxNjE2NmZmZjM0MGRmNmE4MTczOGMyMzI2ZWI2LmNzdiIsInN0b3JhZ2UiOiJzdG9yZSIsIm1ldGFkYXRhIjp7ImZpbGVuYW1lIjoiUGhhc2UgMCAtIFNhbGVzLmNzdiIsInNpemUiOjEwODA0NjU0LCJtaW1lX3R5cGUiOiJ0ZXh0L2NzdiJ9fQ -o data/phase_0_sales.csv
curl https://www.datasource.ai/attachments/eyJpZCI6ImM2OGQxNGNmNTJkZDQ1MTUyZTg0M2FkMDAyMjVlN2NlLmNzdiIsInN0b3JhZ2UiOiJzdG9yZSIsIm1ldGFkYXRhIjp7ImZpbGVuYW1lIjoiUGhhc2UgMSAtIFNhbGVzLmNzdiIsInNpemUiOjEwMTgzOTYsIm1pbWVfdHlwZSI6InRleHQvY3N2In19 -o data/phase_1_sales.csv
curl https://www.datasource.ai/attachments/eyJpZCI6IjhlNmJmNmU3ZTlhNWQ4NTcyNGVhNTI4YjAwNTk3OWE1LmNzdiIsInN0b3JhZ2UiOiJzdG9yZSIsIm1ldGFkYXRhIjp7ImZpbGVuYW1lIjoiUGhhc2UgMiAtIFNhbGVzLmNzdiIsInNpemUiOjEwMTI0MzcsIm1pbWVfdHlwZSI6InRleHQvY3N2In19 -o data/phase_2_sales.csv
70 changes: 70 additions & 0 deletions project/vn1_competition/README.md
@@ -0,0 +1,70 @@
# The Fine-Tuned Moirai-Base Model Achieves 1st Place in the VN1 Challenge

We present a reproducible experiment where **Salesforce's Moirai** pretrained model, after simple fine-tuning, achieves **first place** in the [VN1 Forecasting - Accuracy Challenge](https://www.datasource.ai/en/home/data-science-competitions-for-startups/phase-2-vn1-forecasting-accuracy-challenge/description).

One of the competition's key requirements was the use of open-source solutions, and the Moirai pretrained model qualifies: its code, training scripts, and data are all open-sourced. Fine-tuning follows the standard approach in the `uni2ts` codebase and requires only minor changes to the parameters of the fine-tuning scripts.

The table below shows the official competition results: Moirai-base outperformed all competitors to claim the top position. The final score was obtained by averaging five prediction runs (a sketch of this averaging follows the table).

| **Model** | **Score** |
| ----------- | ---------- |
| **Moirai-base** | **0.4629** |
| 1st | 0.4637 |
| 2nd | 0.4657 |
| 3rd | 0.4758 |
| 4th | 0.4774 |
| 5th | 0.4808 |
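
A rough sketch of the five-run averaging (the file names and column layout here are assumptions, not the submission's actual code):

```python
import pandas as pd

# Hypothetical inputs: five prediction CSVs from independent fine-tuning runs,
# each keyed by Client/Warehouse/Product with one column per forecast week.
paths = [f"predictions_run{i}.csv" for i in range(5)]
keys = ["Client", "Warehouse", "Product"]

frames = [pd.read_csv(p).set_index(keys) for p in paths]
ensemble = sum(frames) / len(frames)  # element-wise mean across the five runs
ensemble.reset_index().to_csv("submission.csv", index=False)
```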

---

### [**VN1 Forecasting**](https://www.datasource.ai/en/home/data-science-competitions-for-startups/phase-2-vn1-forecasting-accuracy-challenge/description)
Participants in this datathon are tasked with accurately forecasting future sales using the provided historical sales and pricing data. The goal is to develop robust predictive models that anticipate sales trends for various products across different clients and warehouses. Submissions are evaluated on their accuracy and bias against actual sales figures (an illustrative scoring sketch follows the phase descriptions). The competition is structured into two phases.

#### Phase 1
In this phase participants will use the provided Phase 0 sales data to predict sales for Phase 1. This phase will last three weeks, during which there will be live leaderboard updates to track the progress and provide feedback on the predictions. At the end of Phase 1, participants will receive the actual sales data for this phase.

#### Phase 2
Using both Phase 0 and Phase 1 data, participants will predict sales for Phase 2. This second phase will last two weeks, but unlike Phase 1, there will be no leaderboard updates until the competition ends.
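
As noted above, submissions are scored on both accuracy and bias. The sketch below assumes the commonly described VN1 formula (normalized absolute error plus normalized absolute bias); it is an interpretation of the competition description, not code from this submission:

```python
import numpy as np

def vn1_score(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Assumed VN1 metric: normalized absolute error plus normalized absolute bias.

    Both terms are normalized by total actual sales; lower is better.
    """
    abs_error = np.abs(actual - forecast).sum()
    abs_bias = np.abs((forecast - actual).sum())
    return float((abs_error + abs_bias) / actual.sum())
```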

---

### [**Data Overview**](https://www.datasource.ai/en/home/data-science-competitions-for-startups/phase-2-vn1-forecasting-accuracy-challenge/datasets)
#### The competition data consists of three phases:
- Phase 0: historical training data
- Phase 1: additional training data
- Phase 2: test data (used for evaluation)

#### Each data entry includes the following fields:
- Client: Client ID
- Warehouse: Warehouse ID
- Product: Product ID
- Weekly sales values (one column per weekly date)
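
A quick way to inspect the raw layout (assuming the CSVs fetched by `make download_data`): the first three columns are the IDs above, and every remaining column is a weekly date.

```python
import pandas as pd

df = pd.read_csv("data/phase_0_sales.csv")
print(df.columns[:3].tolist())  # ['Client', 'Warehouse', 'Product']

# Reshape to long format: one (series, week) observation per row.
long = df.melt(
    id_vars=["Client", "Warehouse", "Product"],
    var_name="week",
    value_name="sales",
)
```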

---

### **How to Run**
To reproduce the experimental results, refer to this [blog](https://zhuanlan.zhihu.com/p/20755649808).

#### Instructions
1. Follow the instructions from the `uni2ts` library to create a virtual environment and install dependencies.
2. Download the raw competition data using the provided `Makefile`:
```bash
make download_data
```
3. Update the path to the downloaded raw data in `prepare_data.py` if needed, then run it to generate the preprocessed train and validation datasets.
4. Add the directory containing the processed datasets to the `.env` file (replace `PATH_TO_SAVE` with that directory):
```bash
echo "CUSTOM_DATA_PATH=PATH_TO_SAVE" >> .env
```
5. Replace the variable `pretrained_model_name_or_path` in the configuration file with your own path, then run the following command to fine-tune the `Moirai-base` model:
```bash
python -m cli.train -cp ../project/vn1_competition/fine_tune run_name=run1
```
6. Replace the weight file path in `main.py` under the `src` directory with your fine-tuned checkpoint, then run `main.py` to produce the forecasts (a minimal inference sketch follows this list).
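
The actual inference code lives in `src/main.py`; the following is only a minimal sketch of loading fine-tuned weights and building a predictor with the `uni2ts` API. The checkpoint path and the Lightning `"module."` prefix handling are assumptions:

```python
import torch
from uni2ts.model.moirai import MoiraiForecast, MoiraiModule

# Hypothetical checkpoint path produced by step 5; replace with your own.
ckpt = "outputs/finetune/moirai_1.1_R_base/VN1/run1/checkpoints/best.ckpt"

model = MoiraiForecast(
    module=MoiraiModule.from_pretrained("Salesforce/moirai-1.1-R-base"),
    prediction_length=13,  # matches prediction_lengths in VN1_val.yaml
    context_length=65,     # matches context_lengths in VN1_val.yaml
    patch_size=16,
    num_samples=100,
    target_dim=1,
    feat_dynamic_real_dim=0,
    past_feat_dynamic_real_dim=0,
)

# Lightning checkpoints prefix module parameters with "module.";
# strip it before loading (an assumption about the checkpoint layout).
state = torch.load(ckpt, map_location="cpu")["state_dict"]
module_state = {
    k[len("module."):]: v for k, v in state.items() if k.startswith("module.")
}
model.module.load_state_dict(module_state)

predictor = model.create_predictor(batch_size=32)  # GluonTS-style predictor
```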

---

### **References**

- Vandeput, Nicolas. "VN1 Forecasting - Accuracy Challenge." DataSource.ai, 3 Oct. 2024, [https://www.datasource.ai/en/home/data-science-competitions-for-startups/phase-2-vn1-forecasting-accuracy-challenge/description](https://www.datasource.ai/en/home/data-science-competitions-for-startups/phase-2-vn1-forecasting-accuracy-challenge/description)
- Woo, Gerald, et al. "Unified Training of Universal Time Series Forecasting Transformers." arXiv:2402.02592 (2024), [https://arxiv.org/abs/2402.02592](https://arxiv.org/abs/2402.02592)
3 changes: 3 additions & 0 deletions project/vn1_competition/fine_tune/VN1.yaml
@@ -0,0 +1,3 @@
_target_: uni2ts.data.builder.simple.SimpleDatasetBuilder
dataset: train_dataset
weight: 1
10 changes: 10 additions & 0 deletions project/vn1_competition/fine_tune/VN1_val.yaml
@@ -0,0 +1,10 @@
_target_: uni2ts.data.builder.ConcatDatasetBuilder
_args_:
  _target_: uni2ts.data.builder.simple.generate_eval_builders
  dataset: val_dataset
  offset: 97
  eval_length: 16
  prediction_lengths: [13]
  context_lengths: [65]
  patch_sizes: [8, 16]

84 changes: 84 additions & 0 deletions project/vn1_competition/fine_tune/config.yaml
@@ -0,0 +1,84 @@
hydra:
  run:
    dir: outputs/finetune/${hydra:runtime.choices.model}/${hydra:runtime.choices.data}/${run_name}
defaults:
  - model: ../moirai_1.1_R_base
  - data: ../VN1
  - val_data: ../VN1_val
  - _self_
run_name: ???
seed: 0
tf32: true
compile: false  # set to mode: default, reduce-overhead, max-autotune
ckpt_path: null
trainer:
  _target_: lightning.Trainer
  accelerator: auto
  strategy: auto
  devices: [0,1,2,3]
  num_nodes: 1
  precision: 32
  logger:
    _target_: lightning.pytorch.loggers.TensorBoardLogger
    save_dir: ${hydra:runtime.output_dir}
    name: logs
  callbacks:
    - _target_: lightning.pytorch.callbacks.LearningRateMonitor
      logging_interval: epoch
    - _target_: lightning.pytorch.callbacks.ModelCheckpoint
      dirpath: ${hydra:runtime.output_dir}/checkpoints
      monitor: val/PackedNLLLoss
      save_weights_only: true
      mode: min
      save_top_k: 1
      every_n_epochs: 1
    - _target_: lightning.pytorch.callbacks.EarlyStopping
      monitor: val/PackedNLLLoss
      min_delta: 0.0
      patience: 5
      mode: min
      strict: false
      verbose: true
  max_epochs: 100
  enable_progress_bar: true
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  gradient_clip_algorithm: norm
train_dataloader:
  _target_: uni2ts.data.loader.DataLoader
  batch_size: 128
  batch_size_factor: 2.0
  cycle: true
  num_batches_per_epoch: 100
  shuffle: true
  num_workers: 11
  collate_fn:
    _target_: uni2ts.data.loader.PackCollate
    max_length: ${model.module_kwargs.max_seq_len}
    seq_fields: ${cls_getattr:${model._target_},seq_fields}
    pad_func_map: ${cls_getattr:${model._target_},pad_func_map}
  pin_memory: true
  drop_last: false
  fill_last: false
  worker_init_fn: null
  prefetch_factor: 2
  persistent_workers: true
val_dataloader:
  _target_: uni2ts.data.loader.DataLoader
  batch_size: 128
  batch_size_factor: 2.0
  cycle: false
  num_batches_per_epoch: null
  shuffle: false
  num_workers: 11
  collate_fn:
    _target_: uni2ts.data.loader.PackCollate
    max_length: ${model.module_kwargs.max_seq_len}
    seq_fields: ${cls_getattr:${model._target_},seq_fields}
    pad_func_map: ${cls_getattr:${model._target_},pad_func_map}
  pin_memory: false
  drop_last: false
  fill_last: true
  worker_init_fn: null
  prefetch_factor: 2
  persistent_workers: true
34 changes: 34 additions & 0 deletions project/vn1_competition/fine_tune/moirai_1.1_R_base.yaml
@@ -0,0 +1,34 @@
# load a pretrained checkpoint from huggingface hub
_target_: uni2ts.model.moirai.MoiraiFinetune
module:
  _target_: uni2ts.model.moirai.MoiraiModule.from_pretrained
  pretrained_model_name_or_path: Salesforce/moirai-1.1-R-base
module_kwargs:
  _target_: builtins.dict
  distr_output:
    _target_: uni2ts.distribution.MixtureOutput
    components:
      - _target_: uni2ts.distribution.StudentTOutput
      - _target_: uni2ts.distribution.NormalFixedScaleOutput
      - _target_: uni2ts.distribution.NegativeBinomialOutput
      - _target_: uni2ts.distribution.LogNormalOutput
  d_model: 768
  num_layers: 12
  patch_sizes: ${as_tuple:[8, 16, 32, 64, 128]}
  max_seq_len: 512
  attn_dropout_p: 0.0
  dropout_p: 0.0
  scaling: true
min_patches: 2
min_mask_ratio: 0.1
max_mask_ratio: 0.4
max_dim: 128
loss_func:
  _target_: uni2ts.loss.packed.PackedNLLLoss
lr: 5e-8
weight_decay: 1e-1
beta1: 0.9
beta2: 0.98
num_training_steps: 10000
num_warmup_steps: 0
81 changes: 81 additions & 0 deletions project/vn1_competition/prepare_data.py
@@ -0,0 +1,81 @@
import os
from collections.abc import Generator
from typing import Any

import datasets
import pandas as pd
from datasets import Features, Sequence, Value


def train_example_gen_func() -> Generator[dict[str, Any], None, None]:
    """Yield one training example per item series (column) of the wide frame."""
    for i, (product_id, df) in enumerate(train_df.items()):
        yield {
            "target": df.to_numpy(),
            "start": df.index[0],
            "freq": pd.infer_freq(df.index),
            "item_id": f"item_{i}",
        }


def val_example_gen_func() -> Generator[dict[str, Any], None, None]:
    """Yield one validation example per item series; uses the full history."""
    for i, (product_id, df) in enumerate(val_df.items()):
        yield {
            "target": df.to_numpy(),
            "start": df.index[0],
            "freq": pd.infer_freq(df.index),
            "item_id": f"item_{i}",
        }


def get_data(file_path1, file_path2):
    """Concatenate Phase 0 and Phase 1 sales and pivot to a wide, time-indexed frame."""
    df_sales_0 = pd.read_csv(file_path1)
    df_sales_1 = pd.read_csv(file_path2)
    # Phase 1 shares the same ID columns; keep only its weekly sales columns.
    df_sales = pd.concat([df_sales_0, df_sales_1.iloc[:, 3:]], axis=1)
    # Build a unique series ID from the Client/Warehouse/Product triple.
    df_sales["item_id"] = (
        df_sales["Client"].astype(str)
        + "-"
        + df_sales["Warehouse"].astype(str)
        + "-"
        + df_sales["Product"].astype(str)
    )
    df_sales.drop(columns=["Client", "Warehouse", "Product"], inplace=True)
    cols = ["item_id"] + [col for col in df_sales.columns if col != "item_id"]
    df_sales = df_sales[cols]
    # Transpose so rows are weekly timestamps and columns are item series.
    df_sales = df_sales.T
    df_sales.columns = df_sales.iloc[0]
    df_sales.drop(df_sales.index[0], inplace=True)
    df_sales.index = pd.to_datetime(df_sales.index)
    return df_sales


current_dir = os.path.dirname(os.path.abspath(__file__))
file_path1 = os.path.join(current_dir, "data/phase_0_sales.csv")
file_path2 = os.path.join(current_dir, "data/phase_1_sales.csv")

df_sales = get_data(file_path1, file_path2)
# Drop the first 70 weeks of history.
df_sales = df_sales.iloc[70:, :]
df_sales.index.name = "timestamp"
# Remove sparse series: drop items whose sales are zero more than half the time.
zero_ratios = (df_sales == 0).mean()
cols_to_drop = zero_ratios[zero_ratios > 0.5].index
df_sales = df_sales.drop(columns=cols_to_drop)

# Hold out the last 16 weeks from training; validation keeps the full history
# (the evaluation windows are carved out via offset/eval_length in VN1_val.yaml).
train_df = df_sales.iloc[:-16, :]
val_df = df_sales.iloc[:, :]

features = Features(
dict(
target=Sequence(Value("float32")),
start=Value("timestamp[s]"),
freq=Value("string"),
item_id=Value("string"),
)
)

train_dataset = datasets.Dataset.from_generator(
train_example_gen_func, features=features
)
val_dataset = datasets.Dataset.from_generator(val_example_gen_func, features=features)
train_dataset_path = os.path.join(current_dir, "train_dataset")
val_dataset_path = os.path.join(current_dir, "val_dataset")
train_dataset.save_to_disk(train_dataset_path)
val_dataset.save_to_disk(val_dataset_path)
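
Note that the saved directory names `train_dataset` and `val_dataset` match the `dataset:` fields in `VN1.yaml` and `VN1_val.yaml`, so `CUSTOM_DATA_PATH` in `.env` should point at the directory containing them. A quick sanity check with the standard `datasets` API (the path here is an assumption, adjust to your layout):

```python
import datasets

train = datasets.load_from_disk("project/vn1_competition/train_dataset")
print(train)             # num_rows = number of retained item series
print(train[0]["freq"])  # weekly frequency string, e.g. "W-MON"
```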