Update the repo #21

Merged 13 commits on Jun 22, 2024
4 changes: 2 additions & 2 deletions .github/workflows/greetings.yml
@@ -18,7 +18,7 @@ jobs:
steps:
- uses: actions/first-interaction@v1
with:
- repo-token: ${{ secrets.ACCESS_TOKEN }}
+ repo-token: ${{ secrets.GITHUB_TOKEN }}
issue-message: |
Hi there 👋,

@@ -34,7 +34,7 @@ jobs:
pr-message: |
Hi there 👋,

- We really really appreciate that you have taken the time to make this PR on PyPOTS' Awesome Imputation project!
+ We really appreciate that you have taken the time to make this PR on PyPOTS' Awesome Imputation project!

If you are trying to fix a bug, please reference the issue number in the description or give details about the bug.
If you are implementing a feature request, please check with the maintainers that the feature will be accepted first.
4 changes: 1 addition & 3 deletions .gitignore
@@ -1,3 +1 @@
- benchmark_code/data/physionet_2012/test.h5
- benchmark_code/data/physionet_2012/train.h5
- benchmark_code/data/physionet_2012/val.h5
+ *.h5
28 changes: 28 additions & 0 deletions LICENSE
@@ -0,0 +1,28 @@
Copyright (c) 2024-present, Wenjie Du
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
81 changes: 56 additions & 25 deletions README.md
@@ -1,43 +1,34 @@
<p align="center">
<a id="AwesomeImputation" href="#AwesomeImputation">
<img src="https://pypots.com/figs/pypots_logos/ImputationSurvey/banner.jpg"
alt="Time Series Imputation Survey" title="Time Series Imputation Survey" width="80%"
<img src="https://pypots.com/figs/pypots_logos/AwesomeImputation/banner.jpg"
alt="Time Series Imputation Survey and Benchmark"
title="Time Series Imputation Survey and Benchmark"
width="80%"
/>
</a>
</p>

- The open-resource repository for the paper [**Deep Learning for Multivariate Time Series Imputation: A Survey**](https://arxiv.org/abs/2402.04059)
+ The repository for the paper [**TSI-Bench: Benchmarking Time Series Imputation**](https://arxiv.org/abs/2406.12747)
from <a href="https://pypots.com" target="_blank"><img src="https://pypots.com/figs/pypots_logos/PyPOTS/logo_FFBG.svg" width="30px" align="center"/> PyPOTS Research</a>.
- The code and configurations for reproducing the experimental results in the paper are available under
- the folder `time_series_imputation_survey_code`.
-
- If you find this repository helpful to your work, please kindly star it and cite our survey paper (author profile links:
- [Jun Wang](https://github.com/AugustJW), [Wenjie Du](https://github.com/WenjieDu),
- [Wei Cao](https://weicao1990.github.io/), [Keli Zhang](https://github.com/kelizhang), [Wenjia Wang](https://www.wenjia-w.com/home),
- [Yuxuan Liang](https://yuxuanliang.com/), [Qingsong Wen](https://sites.google.com/site/qingsongwen8/)) as follows:
-
- ```bibtex
- @article{wang2024deep,
-   title={Deep Learning for Multivariate Time Series Imputation: A Survey},
-   author={Wang, Jun and Du, Wenjie and Cao, Wei and Zhang, Keli and Wang, Wenjia and Liang, Yuxuan and Wen, Qingsong},
-   journal={arXiv preprint arXiv:2402.04059},
-   year={2024}
- }
- ```
+ The code and configurations for reproducing the experimental results in the paper are available under the folder `benchmark_code`.
+ The README file here maintains a list of must-read papers on time-series imputation, and a collection of time-series imputation toolkits and resources.

🤗 Contributions adding new resources and articles are very welcome!

## ❖ Time-Series Imputation Toolkits
- ### Datasets
- [TSDB (Time Series Data Beans)](https://github.com/WenjieDu/TSDB): a Python toolkit can load 169 public time-series datasets with a single line of code.
+ ### `Datasets`
+ [TSDB (Time Series Data Beans)](https://github.com/WenjieDu/TSDB): a Python toolkit that can load 170 public time-series datasets with a single line of code.
<img src="https://img.shields.io/github/last-commit/WenjieDu/TSDB" align="center">

- ### Missingness
+ [BenchPOTS](https://github.com/WenjieDu/BenchPOTS): a Python suite that provides standard preprocessing pipelines for 170 public datasets for benchmarking machine learning on POTS (Partially-Observed Time Series).
+ <img src="https://img.shields.io/github/last-commit/WenjieDu/BenchPOTS" align="center">
+
+ ### `Missingness`
[PyGrinder](https://github.com/WenjieDu/PyGrinder): a Python library that grinds data beans into incomplete ones by introducing missing values with different missing patterns.
<img src="https://img.shields.io/github/last-commit/WenjieDu/PyGrinder" align="center">

- ### Algorithms
- [PyPOTS](https://github.com/WenjieDu/PyPOTS): a Python toolbox for data mining on Partially-Observed Time Series
+ ### `Algorithms`
+ [PyPOTS](https://github.com/WenjieDu/PyPOTS): a Python toolbox for data mining on POTS (Partially-Observed Time Series)
<img src="https://img.shields.io/github/last-commit/WenjieDu/PyPOTS" align="center">

[MICE](https://github.com/amices/mice): Multivariate Imputation by Chained Equations
Expand All @@ -64,6 +55,10 @@ researchers and practitioners who are interested in this field.
[[paper](https://openreview.net/pdf?id=K1mcPiDdOJ)]
[[official code](https://github.com/Chemgyu/TimeCIB)]

+ [AISTATS] **SADI: Similarity-Aware Diffusion Model-Based Imputation for Incomplete Temporal EHR Data**
+ [[paper](https://proceedings.mlr.press/v238/dai24c/dai24c.pdf)]
+ [[official code](https://github.com/bestadcarry/SADI-Similarity-Aware-Diffusion-Model-Based-Imputation-for-Incomplete-Temporal-EHR-Data)]


### `Year 2023`

@@ -215,13 +210,49 @@ researchers and practitioners who are interested in this field.


## ❖ Other Resources
- ### Repos about General Time Series
+ ### `Articles about General Missingness and Imputation`
+ [blog] [**Data Imputation: An essential yet overlooked problem in machine learning**](https://www.vanderschaar-lab.com/data-imputation-an-essential-yet-overlooked-problem-in-machine-learning/)
+
+ [Journal of Big Data] **A survey on missing data in machine learning**
+ [[paper](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00516-9)]
+
+
+ ### `Repos about General Time Series`
[Transformers in Time Series](https://github.com/qingsongedu/time-series-transformers-review)

[LLMs and Foundation Models for Time Series and Spatio-Temporal Data](https://github.com/qingsongedu/Awesome-TimeSeries-SpatioTemporal-LM-LLM)

[AI for Time Series (AI4TS) Papers, Tutorials, and Surveys](https://github.com/qingsongedu/awesome-AI-for-time-series-papers)

+ ## ❖ Citing This Work
+ If you find this repository helpful to your work, please kindly star it and cite our benchmark paper, survey paper, and PyPOTS as follows:
+
+ ```bibtex
+ @article{du2024tsibench,
+   title={TSI-Bench: Benchmarking Time Series Imputation},
+   author={Wenjie Du and Jun Wang and Linglong Qian and Yiyuan Yang and Fanxing Liu and Zepu Wang and Zina Ibrahim and Haoxin Liu and Zhiyuan Zhao and Yingjie Zhou and Wenjia Wang and Kaize Ding and Yuxuan Liang and B. Aditya Prakash and Qingsong Wen},
+   journal={arXiv preprint arXiv:2406.12747},
+   year={2024}
+ }
+ ```
+
+ ```bibtex
+ @article{wang2024deep,
+   title={Deep Learning for Multivariate Time Series Imputation: A Survey},
+   author={Jun Wang and Wenjie Du and Wei Cao and Keli Zhang and Wenjia Wang and Yuxuan Liang and Qingsong Wen},
+   journal={arXiv preprint arXiv:2402.04059},
+   year={2024}
+ }
+ ```
+
+ ```bibtex
+ @article{du2023pypots,
+   title={{PyPOTS: a Python toolbox for data mining on Partially-Observed Time Series}},
+   author={Wenjie Du},
+   journal={arXiv preprint arXiv:2305.18811},
+   year={2023}
+ }
+ ```

<details>
<summary>🏠 Visits</summary>
76 changes: 25 additions & 51 deletions benchmark_code/README.md
@@ -1,75 +1,49 @@
- # Code for the Time Series Imputation Survey
- The scripts and configurations used in the work are all put here.

+ # TSI-Bench
+ The code scripts, configurations, and logs here are for TSI-Bench,
+ the first comprehensive benchmark for time series imputation.

## ❖ Python Environment Creation
A proper Python environment is necessary to reproduce the results.
Please ensure that all the library requirements below are satisfied.

```yaml
- pypots >=0.4
- tsdb >=0.2
- pygrinder >=0.4
+ tsdb ==0.4
+ pygrinder ==0.6
+ benchpots ==0.1.1
+ pypots ==0.6
```

On Linux, you can create the environment with Conda by running `conda env create -f conda_env.yml`.
On other operating systems, the library version requirements can also be checked in `conda_env.yml`.
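
Once the environment is created, a quick sanity check can confirm that the pins resolved correctly — a minimal sketch, assuming each package exposes `__version__` (true for recent releases of these libraries, but not guaranteed for every version):

```python
# Sanity-check the pinned library versions (assumes each package exposes __version__).
import benchpots
import pygrinder
import pypots
import tsdb

expected = {"tsdb": "0.4", "pygrinder": "0.6", "benchpots": "0.1.1", "pypots": "0.6"}
for pkg in (tsdb, pygrinder, benchpots, pypots):
    installed = getattr(pkg, "__version__", "unknown")
    print(f"{pkg.__name__}: installed {installed}, expected {expected[pkg.__name__]}")
```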


- ## ❖ Datasets Introduction and Generation
- ### Introduction
- #### Air
- Air (Beijing Multi-Site Air-Quality) is collected from twelve Beijing monitoring sites hourly in forty-eight months.
- At each site, eleven air pollution variables (e.g. PM2.5, NO, O<sub>3</sub>) are collected.
- The dataset has 1.6% originally missing data.
-
- #### PhysioNet2012
- PhysioNet2012 (PhysioNet-2012 Mortality Prediction Challenge) includes multivariate clinical time series data
- collected from 11,988 patients in ICU. Each sample contains thirty-seven measurements (e.g. glucose, heart rate,
- temperature) recorded in the first forty-eight hours after admission to the ICU.
- This dataset has 80% values missing.
+ ## ❖ Datasets Generation
+ Please refer to [`data/README.md`](data/README.md).

- #### ETTm1
- ETTm1 (Electricity Transformer Temperature) records seven state features, including oil temperature and six power
- load variables of electricity transformers collected every fifteen minutes for two years.
- There is no original missingness in this dataset.
-
- ### Generation
- The scripts for generating three datasets used in this work are in the directory `data_processing`.
- To generate the preprocessed datasets, please run the shell script `generate_datasets.sh` or
- execute the below commands:
+ ## ❖ Results Reproduction
+ ### Neural network training
+ For example, to reproduce the results of SAITS on the dataset Pedestrian, please execute the following command.

```shell
- # generate PhysioNet2012 dataset
- python data/gene_physionet_2012.py
-
- # generate Air dataset
- python data/gene_air_quality.py
-
- # generate ETTm1 dataset
- python data/gene_ettm1.py
+ nohup python train_model.py \
+     --model SAITS \
+     --dataset Pedestrian \
+     --dataset_fold_path data/melbourne_pedestrian_rate01_step24_point \
+     --saving_path results_point_rate01 \
+     --device cuda:2 \
+     > results_point_rate01/SAITS_pedestrian.log &
```
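
For a feel of what such a run does, below is a condensed PyPOTS sketch. It is illustrative only, not the benchmark script itself: the data shapes are stand-ins for the Pedestrian folds, and names such as `d_ffn`, `calc_mae`, and the return type of `mcar` are assumptions that have shifted across PyGrinder/PyPOTS releases (e.g. `cal_mae` in older versions).

```python
# Condensed, illustrative version of a SAITS training/evaluation run (not train_model.py).
import numpy as np
from pygrinder import mcar  # recent PyGrinder versions return the corrupted array with NaNs
from pypots.imputation import SAITS
from pypots.utils.metrics import calc_mae  # named cal_mae in some older releases

# Stand-in arrays; the real script loads the generated .h5 folds from dataset_fold_path.
train_X = np.random.randn(64, 24, 1)  # assumed Pedestrian shape: (n_samples, n_steps=24, n_features=1)
test_X = np.random.randn(16, 24, 1)
test_X_masked = mcar(test_X, 0.1)     # artificially mask 10% of test values for evaluation

model = SAITS(  # constructor argument names (e.g. d_ffn) vary across PyPOTS versions
    n_steps=24, n_features=1, n_layers=2, d_model=64,
    n_heads=4, d_k=16, d_v=16, d_ffn=128, epochs=10,
    saving_path="results_point_rate01",  # mirrors the --saving_path flag above
)
model.fit(train_set={"X": train_X})
imputation = model.impute({"X": test_X_masked})

# Score only at the artificially-masked positions, as the benchmark does.
indicating_mask = (np.isnan(test_X_masked) & ~np.isnan(test_X)).astype(float)
mae = calc_mae(imputation, np.nan_to_num(test_X), indicating_mask)
```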


- ## ❖ Model Training and Results Reproduction
- ```shell
- # reproduce the results on the dataset PhysioNet2012
- nohup python train_models_for_physionet2012.py > physionet2012.log&
-
- # reproduce the results on the dataset Air
- nohup python train_models_for_air.py > air.log&
-
- # reproduce the results on the dataset ETTm1
- nohup python train_models_for_ettm1.py > ettm1.log&
- ```
-
- After all execution finished, please check out all logging information in the according `.log` files.
+ After the execution finishes, please check the logging information in the corresponding `.log` file.

Additionally, as stated in the paper, the hyperparameters of all models are optimized with the tuning functionality in
[PyPOTS](https://github.com/WenjieDu/PyPOTS), so the tuning configurations are available in the directory `PyPOTS_tuning_configs`.
If you'd like to explore this feature, please check out the details there.

+ ### Naive methods
+ To obtain the results of the naive methods, check out the commands in the shell script `naive_imputation.sh`.
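
For intuition about what those commands compute, the following pandas sketch shows the usual naive baselines (LOCF, mean, median). It is a stand-in on toy data, not the benchmark's exact implementation:

```python
# Naive imputation baselines on a toy (n_steps, n_features) frame — illustrative only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.standard_normal((24, 11))).mask(rng.random((24, 11)) < 0.1)  # ~10% missing

locf = X.ffill().bfill()               # last observation carried forward; back-fill leading gaps
mean_imputed = X.fillna(X.mean())      # per-feature mean imputation
median_imputed = X.fillna(X.median())  # per-feature median imputation
```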


- ## ❖ Downstream Classification
- After running `train_models_for_physionet2012.py`, all models' imputation results are persisted under according folders.
- To obtain the simple RNN's classification results on PhysioNet2012, please execute the script `downstream_classification.py`.
+ ## ❖ Downstream Tasks
+ We're cleaning up the code and updating the scripts for the downstream tasks, and will release them soon.
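
Until those scripts land, here is a rough sketch of the kind of downstream model the earlier revision of this README described (a simple RNN classifier over imputed series). Every name and shape is illustrative, not the released code:

```python
# Illustrative downstream classifier over imputed series (not the released script).
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, n_features: int, n_classes: int, d_hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(n_features, d_hidden, batch_first=True)
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_steps, n_features), already imputed so it contains no NaNs
        _, h = self.gru(x)
        return self.head(h[-1])

model = GRUClassifier(n_features=37, n_classes=2)   # e.g. PhysioNet2012 mortality labels
logits = model(torch.randn(8, 48, 37))              # toy batch of imputed series
```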
8 changes: 5 additions & 3 deletions benchmark_code/conda_env.yml
@@ -261,7 +261,6 @@ dependencies:
- pyflakes=3.2.0=pyhd8ed1ab_0
- pyg=2.5.0=py310_torch_2.2.0_cu121
- pygments=2.17.2=pyhd8ed1ab_0
-   - pygrinder=0.4=pyh60bb809_0
- pyparsing=3.1.2=pyhd8ed1ab_0
- pyqt=5.15.9=py310h04931ad_5
- pyqt5-sip=12.12.2=py310hc6cd4ac_5
@@ -326,7 +325,6 @@ dependencies:
- tornado=6.4=py310h2372a71_0
- tqdm=4.65.0=py310h2f386ee_0
- traitlets=5.14.1=pyhd8ed1ab_0
-   - tsdb=0.3.1=pyhc1e730c_0
- types-python-dateutil=2.8.19.20240311=pyhd8ed1ab_0
- typing-extensions=4.9.0=py310h06a4308_1
- typing_extensions=4.9.0=py310h06a4308_1
@@ -370,6 +368,7 @@ dependencies:
- zstd=1.5.5=hfc55251_0
- pip:
- astor==0.8.1
+   - benchpots==0.1
- cloudpickle==3.0.0
- contextlib2==21.6.0
- einops==0.8.0
@@ -379,11 +378,14 @@
- nvidia-ml-py==12.535.133
- prettytable==3.10.0
- pyarrow==15.0.1
-   - pypots==0.5
+   - pygrinder==0.6
+   - pypots==0.6
- pythonwebhdfs==0.2.3
- responses==0.25.0
- schema==0.7.5
- simplejson==3.19.2
- sphinxcontrib-gtagjs==0.2.1
+   - tsdb==0.4
- typeguard==2.13.3
- websockets==12.0
- xgboost==2.0.3
12 changes: 10 additions & 2 deletions benchmark_code/data/README.md
@@ -1,5 +1,13 @@
# Data generation

- Run `python dataset_generating.py` to generate datasets.
+ Run the below commands to generate the datasets for the experiments.
+ Note that, for the PeMS traffic dataset, you have to put the `traffic.csv` file under the current directory.
+ You can download it from https://pems.dot.ca.gov. The other datasets are integrated into `TSDB` and can be used directly.

- Note that, for PeMS traffic dataset, you have to put the `traffic.csv` file under the current directory.
+ ```shell
+ python dataset_generating_point01.py
+ python dataset_generating_point05.py
+ python dataset_generating_point09.py
+ python dataset_generating_subseq05.py
+ python dataset_generating_block05.py
+ ```
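
All five scripts share one shape: preprocess a dataset into fixed-length windows, then grind in missingness with a given pattern and rate. A condensed sketch of the point-missing case follows; the `mcar` call is an assumed PyGrinder API, the array shape is a toy stand-in, and the subseq/block scripts use sequence- and block-missing counterparts instead:

```python
# Condensed sketch of a dataset_generating_point* script (illustrative only).
import numpy as np
from pygrinder import mcar  # point-missing pattern; seq/block patterns have their own helpers

X = np.random.randn(100, 24, 862)  # stand-in for windowed traffic data (n_samples, n_steps=24, n_features)
X_point01 = mcar(X, 0.1)           # dataset_generating_point01.py masks at rate 0.1
X_point05 = mcar(X, 0.5)           # ...point05.py at rate 0.5, point09.py at rate 0.9
```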
2 changes: 1 addition & 1 deletion benchmark_code/data/dataset_generating_block05.py
@@ -91,7 +91,7 @@
block_len = 6
block_width = 6
pems_traffic = preprocess_pems_traffic(
file_path="/Users/wdu/Downloads/traffic.csv",
file_path="traffic.csv",
rate=rate,
n_steps=step,
pattern=pattern,
2 changes: 1 addition & 1 deletion benchmark_code/data/dataset_generating_point01.py
@@ -114,7 +114,7 @@

step = 24
pems_traffic = preprocess_pems_traffic(
file_path="/Users/wdu/Downloads/traffic.csv",
file_path="traffic.csv",
rate=rate,
n_steps=step,
pattern=pattern,
2 changes: 1 addition & 1 deletion benchmark_code/data/dataset_generating_point05.py
@@ -64,7 +64,7 @@

step = 24
pems_traffic = preprocess_pems_traffic(
file_path="/Users/wdu/Downloads/traffic.csv",
file_path="traffic.csv",
rate=rate,
n_steps=step,
pattern=pattern,
2 changes: 1 addition & 1 deletion benchmark_code/data/dataset_generating_point09.py
@@ -68,7 +68,7 @@

step = 24
pems_traffic = preprocess_pems_traffic(
file_path="/Users/wdu/Downloads/traffic.csv",
file_path="traffic.csv",
rate=rate,
n_steps=step,
pattern=pattern,
2 changes: 1 addition & 1 deletion benchmark_code/data/dataset_generating_subseq05.py
@@ -70,7 +70,7 @@
step = 24
seq_len = 18
pems_traffic = preprocess_pems_traffic(
file_path="/Users/wdu/Downloads/traffic.csv",
file_path="traffic.csv",
rate=rate,
n_steps=step,
pattern=pattern,