Update the repo #21

Merged 13 commits on Jun 22, 2024
4 changes: 2 additions & 2 deletions .github/workflows/greetings.yml
@@ -18,7 +18,7 @@ jobs:
steps:
- uses: actions/first-interaction@v1
with:
- repo-token: ${{ secrets.ACCESS_TOKEN }}
+ repo-token: ${{ secrets.GITHUB_TOKEN }}
issue-message: |
Hi there 👋,

@@ -34,7 +34,7 @@ jobs:
pr-message: |
Hi there 👋,

- We really really appreciate that you have taken the time to make this PR on PyPOTS' Awesome Imputation project!
+ We really appreciate that you have taken the time to make this PR on PyPOTS' Awesome Imputation project!

If you are trying to fix a bug, please reference the issue number in the description or give details about the bug.
If you are implementing a feature request, please check with the maintainers that the feature will be accepted first.
4 changes: 1 addition & 3 deletions .gitignore
@@ -1,3 +1 @@
- benchmark_code/data/physionet_2012/test.h5
- benchmark_code/data/physionet_2012/train.h5
- benchmark_code/data/physionet_2012/val.h5
+ *.h5
28 changes: 28 additions & 0 deletions LICENSE
@@ -0,0 +1,28 @@
Copyright (c) 2024-present, Wenjie Du
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
81 changes: 56 additions & 25 deletions README.md
@@ -1,43 +1,34 @@
<p align="center">
<a id="AwesomeImputation" href="#AwesomeImputation">
<img src="https://pypots.com/figs/pypots_logos/ImputationSurvey/banner.jpg"
alt="Time Series Imputation Survey" title="Time Series Imputation Survey" width="80%"
<img src="https://pypots.com/figs/pypots_logos/AwesomeImputation/banner.jpg"
alt="Time Series Imputation Survey and Benchmark"
title="Time Series Imputation Survey and Benchmark"
width="80%"
/>
</a>
</p>

- The open-resource repository for the paper [**Deep Learning for Multivariate Time Series Imputation: A Survey**](https://arxiv.org/abs/2402.04059)
+ The repository for the paper [**TSI-Bench: Benchmarking Time Series Imputation**](https://arxiv.org/abs/2406.12747)
from <a href="https://pypots.com" target="_blank"><img src="https://pypots.com/figs/pypots_logos/PyPOTS/logo_FFBG.svg" width="30px" align="center"/> PyPOTS Research</a>.
- The code and configurations for reproducing the experimental results in the paper are available under
- the folder `time_series_imputation_survey_code`.
-
- If you find this repository helpful to your work, please kindly star it and cite our survey paper (author profile links:
- [Jun Wang](https://github.com/AugustJW), [Wenjie Du](https://github.com/WenjieDu),
- [Wei Cao](https://weicao1990.github.io/), [Keli Zhang](https://github.com/kelizhang), [Wenjia Wang](https://www.wenjia-w.com/home),
- [Yuxuan Liang](https://yuxuanliang.com/), [Qingsong Wen](https://sites.google.com/site/qingsongwen8/)) as follows:
-
- ```bibtex
- @article{wang2024deep,
-   title={Deep Learning for Multivariate Time Series Imputation: A Survey},
-   author={Wang, Jun and Du, Wenjie and Cao, Wei and Zhang, Keli and Wang, Wenjia and Liang, Yuxuan and Wen, Qingsong},
-   journal={arXiv preprint arXiv:2402.04059},
-   year={2024}
- }
- ```
+ The code and configurations for reproducing the experimental results in the paper are available under the folder `benchmark_code`.
+ The README file here maintains a list of must-read papers on time-series imputation, and a collection of time-series imputation toolkits and resources.

🤗 Contributions adding new resources and articles are very welcome!

## ❖ Time-Series Imputation Toolkits
- ### Datasets
- [TSDB (Time Series Data Beans)](https://github.com/WenjieDu/TSDB): a Python toolkit can load 169 public time-series datasets with a single line of code.
+ ### `Datasets`
+ [TSDB (Time Series Data Beans)](https://github.com/WenjieDu/TSDB): a Python toolkit that can load 170 public time-series datasets with a single line of code.
<img src="https://img.shields.io/github/last-commit/WenjieDu/TSDB" align="center">

- ### Missingness
+ [BenchPOTS](https://github.com/WenjieDu/BenchPOTS): a Python suite that provides standard preprocessing pipelines for 170 public datasets for benchmarking machine learning on POTS (Partially-Observed Time Series).
+ <img src="https://img.shields.io/github/last-commit/WenjieDu/BenchPOTS" align="center">
+
+ ### `Missingness`
[PyGrinder](https://github.com/WenjieDu/PyGrinder): a Python library that grinds data beans into incomplete ones by introducing missing values with different missing patterns.
<img src="https://img.shields.io/github/last-commit/WenjieDu/PyGrinder" align="center">

- ### Algorithms
- [PyPOTS](https://github.com/WenjieDu/PyPOTS): a Python toolbox for data mining on Partially-Observed Time Series
+ ### `Algorithms`
+ [PyPOTS](https://github.com/WenjieDu/PyPOTS): a Python toolbox for data mining on POTS (Partially-Observed Time Series)
<img src="https://img.shields.io/github/last-commit/WenjieDu/PyPOTS" align="center">

[MICE](https://github.com/amices/mice): Multivariate Imputation by Chained Equations
Expand All @@ -64,6 +55,10 @@ researchers and practitioners who are interested in this field.
[[paper](https://openreview.net/pdf?id=K1mcPiDdOJ)]
[[official code](https://github.com/Chemgyu/TimeCIB)]

+ [AISTATS] **SADI: Similarity-Aware Diffusion Model-Based Imputation for Incomplete Temporal EHR Data**
+ [[paper](https://proceedings.mlr.press/v238/dai24c/dai24c.pdf)]
+ [[official code](https://github.com/bestadcarry/SADI-Similarity-Aware-Diffusion-Model-Based-Imputation-for-Incomplete-Temporal-EHR-Data)]


### `Year 2023`

@@ -215,13 +210,49 @@ researchers and practitioners who are interested in this field.


## ❖ Other Resources
- ### Repos about General Time Series
+ ### `Articles about General Missingness and Imputation`
+ [blog] [**Data Imputation: An essential yet overlooked problem in machine learning**](https://www.vanderschaar-lab.com/data-imputation-an-essential-yet-overlooked-problem-in-machine-learning/)
+
+ [Journal of Big Data] **A survey on missing data in machine learning**
+ [[paper](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00516-9)]
+
+
+ ### `Repos about General Time Series`
[Transformers in Time Series](https://github.com/qingsongedu/time-series-transformers-review)

[LLMs and Foundation Models for Time Series and Spatio-Temporal Data](https://github.com/qingsongedu/Awesome-TimeSeries-SpatioTemporal-LM-LLM)

[AI for Time Series (AI4TS) Papers, Tutorials, and Surveys](https://github.com/qingsongedu/awesome-AI-for-time-series-papers)

+ ## ❖ Citing This Work
+ If you find this repository helpful to your work, please kindly star it and cite our benchmark paper, survey paper, and PyPOTS as follows:
+
+ ```bibtex
+ @article{du2024tsibench,
+   title={TSI-Bench: Benchmarking Time Series Imputation},
+   author={Wenjie Du and Jun Wang and Linglong Qian and Yiyuan Yang and Fanxing Liu and Zepu Wang and Zina Ibrahim and Haoxin Liu and Zhiyuan Zhao and Yingjie Zhou and Wenjia Wang and Kaize Ding and Yuxuan Liang and B. Aditya Prakash and Qingsong Wen},
+   journal={arXiv preprint arXiv:2406.12747},
+   year={2024}
+ }
+ ```
+
+ ```bibtex
+ @article{wang2024deep,
+   title={Deep Learning for Multivariate Time Series Imputation: A Survey},
+   author={Jun Wang and Wenjie Du and Wei Cao and Keli Zhang and Wenjia Wang and Yuxuan Liang and Qingsong Wen},
+   journal={arXiv preprint arXiv:2402.04059},
+   year={2024}
+ }
+ ```
+
+ ```bibtex
+ @article{du2023pypots,
+   title={{PyPOTS: a Python toolbox for data mining on Partially-Observed Time Series}},
+   author={Wenjie Du},
+   journal={arXiv preprint arXiv:2305.18811},
+   year={2023}
+ }
+ ```

<details>
<summary>🏠 Visits</summary>
76 changes: 25 additions & 51 deletions benchmark_code/README.md
@@ -1,75 +1,49 @@
- # Code for the Time Series Imputation Survey
- The scripts and configurations used in the work are all put here.

+ # TSI-Bench
+ The code scripts, configurations, and logs here are for TSI-Bench,
+ the first comprehensive benchmark for time series imputation.

## ❖ Python Environment Creation
A proper Python environment is necessary to reproduce the results.
Please ensure that all the library requirements below are satisfied.

```yaml
- pypots >=0.4
- tsdb >=0.2
- pygrinder >=0.4
+ tsdb ==0.4
+ pygrinder ==0.6
+ benchpots ==0.1.1
+ pypots ==0.6
```

On Linux, you can create the environment with Conda by running `conda env create -f conda_env.yml`.
On other operating systems, the library version requirements can also be checked in `conda_env.yml`.
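
Once the environment is created, a quick sanity check can confirm that the pins resolved correctly — a minimal sketch, assuming each package exposes `__version__` (true for recent releases of these libraries, but not guaranteed for every version):

```python
# Sanity-check the pinned library versions (assumes each package exposes __version__).
import benchpots
import pygrinder
import pypots
import tsdb

expected = {"tsdb": "0.4", "pygrinder": "0.6", "benchpots": "0.1.1", "pypots": "0.6"}
for pkg in (tsdb, pygrinder, benchpots, pypots):
    installed = getattr(pkg, "__version__", "unknown")
    print(f"{pkg.__name__}: installed {installed}, expected {expected[pkg.__name__]}")
```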


- ## ❖ Datasets Introduction and Generation
- ### Introduction
- #### Air
- Air (Beijing Multi-Site Air-Quality) is collected from twelve Beijing monitoring sites hourly in forty-eight months.
- At each site, eleven air pollution variables (e.g. PM2.5, NO, O<sub>3</sub>) are collected.
- The dataset has 1.6% originally missing data.
-
- #### PhysioNet2012
- PhysioNet2012 (PhysioNet-2012 Mortality Prediction Challenge) includes multivariate clinical time series data
- collected from 11,988 patients in ICU. Each sample contains thirty-seven measurements (e.g. glucose, heart rate,
- temperature) recorded in the first forty-eight hours after admission to the ICU.
- This dataset has 80% values missing.
+ ## ❖ Datasets Generation
+ Please refer to [`data/README.md`](data/README.md).

- #### ETTm1
- ETTm1 (Electricity Transformer Temperature) records seven state features, including oil temperature and six power
- load variables of electricity transformers collected every fifteen minutes for two years.
- There is no original missingness in this dataset.
-
- ### Generation
- The scripts for generating three datasets used in this work are in the directory `data_processing`.
- To generate the preprocessed datasets, please run the shell script `generate_datasets.sh` or
- execute the below commands:
+ ## ❖ Results Reproduction
+ ### Neural network training
+ For example, to reproduce the results of SAITS on the dataset Pedestrian, please execute the following command.

```shell
- # generate PhysioNet2012 dataset
- python data/gene_physionet_2012.py
-
- # generate Air dataset
- python data/gene_air_quality.py
-
- # generate ETTm1 dataset
- python data/gene_ettm1.py
+ nohup python train_model.py \
+     --model SAITS \
+     --dataset Pedestrian \
+     --dataset_fold_path data/melbourne_pedestrian_rate01_step24_point \
+     --saving_path results_point_rate01 \
+     --device cuda:2 \
+     > results_point_rate01/SAITS_pedestrian.log &
```
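
For a feel of what such a run does, below is a condensed PyPOTS sketch. It is illustrative only, not the benchmark script itself: the data shapes are stand-ins for the Pedestrian folds, and names such as `d_ffn`, `calc_mae`, and the return type of `mcar` are assumptions that have shifted across PyGrinder/PyPOTS releases (e.g. `cal_mae` in older versions).

```python
# Condensed, illustrative version of a SAITS training/evaluation run (not train_model.py).
import numpy as np
from pygrinder import mcar  # recent PyGrinder versions return the corrupted array with NaNs
from pypots.imputation import SAITS
from pypots.utils.metrics import calc_mae  # named cal_mae in some older releases

# Stand-in arrays; the real script loads the generated .h5 folds from dataset_fold_path.
train_X = np.random.randn(64, 24, 1)  # assumed Pedestrian shape: (n_samples, n_steps=24, n_features=1)
test_X = np.random.randn(16, 24, 1)
test_X_masked = mcar(test_X, 0.1)     # artificially mask 10% of test values for evaluation

model = SAITS(  # constructor argument names (e.g. d_ffn) vary across PyPOTS versions
    n_steps=24, n_features=1, n_layers=2, d_model=64,
    n_heads=4, d_k=16, d_v=16, d_ffn=128, epochs=10,
    saving_path="results_point_rate01",  # mirrors the --saving_path flag above
)
model.fit(train_set={"X": train_X})
imputation = model.impute({"X": test_X_masked})

# Score only at the artificially-masked positions, as the benchmark does.
indicating_mask = (np.isnan(test_X_masked) & ~np.isnan(test_X)).astype(float)
mae = calc_mae(imputation, np.nan_to_num(test_X), indicating_mask)
```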


- ## ❖ Model Training and Results Reproduction
- ```shell
- # reproduce the results on the dataset PhysioNet2012
- nohup python train_models_for_physionet2012.py > physionet2012.log&
-
- # reproduce the results on the dataset Air
- nohup python train_models_for_air.py > air.log&
-
- # reproduce the results on the dataset ETTm1
- nohup python train_models_for_ettm1.py > ettm1.log&
- ```
-
- After all execution finished, please check out all logging information in the according `.log` files.
+ After the execution finishes, please check the logging information in the corresponding `.log` file.

Additionally, as stated in the paper, the hyperparameters of all models are optimized with the tuning functionality in
[PyPOTS](https://github.com/WenjieDu/PyPOTS), so the tuning configurations are available in the directory `PyPOTS_tuning_configs`.
If you'd like to explore this feature, please check out the details there.

+ ### Naive methods
+ To obtain the results of the naive methods, check out the commands in the shell script `naive_imputation.sh`.
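
For intuition about what those commands compute, the following pandas sketch shows the usual naive baselines (LOCF, mean, median). It is a stand-in on toy data, not the benchmark's exact implementation:

```python
# Naive imputation baselines on a toy (n_steps, n_features) frame — illustrative only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.standard_normal((24, 11))).mask(rng.random((24, 11)) < 0.1)  # ~10% missing

locf = X.ffill().bfill()               # last observation carried forward; back-fill leading gaps
mean_imputed = X.fillna(X.mean())      # per-feature mean imputation
median_imputed = X.fillna(X.median())  # per-feature median imputation
```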


- ## ❖ Downstream Classification
- After running `train_models_for_physionet2012.py`, all models' imputation results are persisted under according folders.
- To obtain the simple RNN's classification results on PhysioNet2012, please execute the script `downstream_classification.py`.
+ ## ❖ Downstream Tasks
+ We're cleaning up the code and updating the scripts for the downstream tasks, and will release them soon.
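
Until those scripts land, here is a rough sketch of the kind of downstream model the earlier revision of this README described (a simple RNN classifier over imputed series). Every name and shape is illustrative, not the released code:

```python
# Illustrative downstream classifier over imputed series (not the released script).
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, n_features: int, n_classes: int, d_hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(n_features, d_hidden, batch_first=True)
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_steps, n_features), already imputed so it contains no NaNs
        _, h = self.gru(x)
        return self.head(h[-1])

model = GRUClassifier(n_features=37, n_classes=2)   # e.g. PhysioNet2012 mortality labels
logits = model(torch.randn(8, 48, 37))              # toy batch of imputed series
```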
8 changes: 5 additions & 3 deletions benchmark_code/conda_env.yml
@@ -261,7 +261,6 @@ dependencies:
- pyflakes=3.2.0=pyhd8ed1ab_0
- pyg=2.5.0=py310_torch_2.2.0_cu121
- pygments=2.17.2=pyhd8ed1ab_0
-   - pygrinder=0.4=pyh60bb809_0
- pyparsing=3.1.2=pyhd8ed1ab_0
- pyqt=5.15.9=py310h04931ad_5
- pyqt5-sip=12.12.2=py310hc6cd4ac_5
@@ -326,7 +325,6 @@ dependencies:
- tornado=6.4=py310h2372a71_0
- tqdm=4.65.0=py310h2f386ee_0
- traitlets=5.14.1=pyhd8ed1ab_0
-   - tsdb=0.3.1=pyhc1e730c_0
- types-python-dateutil=2.8.19.20240311=pyhd8ed1ab_0
- typing-extensions=4.9.0=py310h06a4308_1
- typing_extensions=4.9.0=py310h06a4308_1
@@ -370,6 +368,7 @@ dependencies:
- zstd=1.5.5=hfc55251_0
- pip:
- astor==0.8.1
+   - benchpots==0.1
- cloudpickle==3.0.0
- contextlib2==21.6.0
- einops==0.8.0
@@ -379,11 +378,14 @@
- nvidia-ml-py==12.535.133
- prettytable==3.10.0
- pyarrow==15.0.1
-   - pypots==0.5
+   - pygrinder==0.6
+   - pypots==0.6
- pythonwebhdfs==0.2.3
- responses==0.25.0
- schema==0.7.5
- simplejson==3.19.2
- sphinxcontrib-gtagjs==0.2.1
+   - tsdb==0.4
- typeguard==2.13.3
- websockets==12.0
- xgboost==2.0.3
12 changes: 10 additions & 2 deletions benchmark_code/data/README.md
@@ -1,5 +1,13 @@
# Data generation

- Run `python dataset_generating.py` to generate datasets.
+ Run the below commands to generate the datasets for the experiments.
+ Note that, for the PeMS traffic dataset, you have to put the `traffic.csv` file under the current directory.
+ You can download it from https://pems.dot.ca.gov. The other datasets are integrated into `TSDB` and can be used directly.

- Note that, for PeMS traffic dataset, you have to put the `traffic.csv` file under the current directory.
+ ```shell
+ python dataset_generating_point01.py
+ python dataset_generating_point05.py
+ python dataset_generating_point09.py
+ python dataset_generating_subseq05.py
+ python dataset_generating_block05.py
+ ```
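
All five scripts share one shape: preprocess a dataset into fixed-length windows, then grind in missingness with a given pattern and rate. A condensed sketch of the point-missing case follows; the `mcar` call is an assumed PyGrinder API, the array shape is a toy stand-in, and the subseq/block scripts use sequence- and block-missing counterparts instead:

```python
# Condensed sketch of a dataset_generating_point* script (illustrative only).
import numpy as np
from pygrinder import mcar  # point-missing pattern; seq/block patterns have their own helpers

X = np.random.randn(100, 24, 862)  # stand-in for windowed traffic data (n_samples, n_steps=24, n_features)
X_point01 = mcar(X, 0.1)           # dataset_generating_point01.py masks at rate 0.1
X_point05 = mcar(X, 0.5)           # ...point05.py at rate 0.5, point09.py at rate 0.9
```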
2 changes: 1 addition & 1 deletion benchmark_code/data/dataset_generating_block05.py
@@ -91,7 +91,7 @@
block_len = 6
block_width = 6
pems_traffic = preprocess_pems_traffic(
file_path="/Users/wdu/Downloads/traffic.csv",
file_path="traffic.csv",
rate=rate,
n_steps=step,
pattern=pattern,
2 changes: 1 addition & 1 deletion benchmark_code/data/dataset_generating_point01.py
@@ -114,7 +114,7 @@

step = 24
pems_traffic = preprocess_pems_traffic(
file_path="/Users/wdu/Downloads/traffic.csv",
file_path="traffic.csv",
rate=rate,
n_steps=step,
pattern=pattern,
2 changes: 1 addition & 1 deletion benchmark_code/data/dataset_generating_point05.py
@@ -64,7 +64,7 @@

step = 24
pems_traffic = preprocess_pems_traffic(
file_path="/Users/wdu/Downloads/traffic.csv",
file_path="traffic.csv",
rate=rate,
n_steps=step,
pattern=pattern,
2 changes: 1 addition & 1 deletion benchmark_code/data/dataset_generating_point09.py
@@ -68,7 +68,7 @@

step = 24
pems_traffic = preprocess_pems_traffic(
file_path="/Users/wdu/Downloads/traffic.csv",
file_path="traffic.csv",
rate=rate,
n_steps=step,
pattern=pattern,
2 changes: 1 addition & 1 deletion benchmark_code/data/dataset_generating_subseq05.py
@@ -70,7 +70,7 @@
step = 24
seq_len = 18
pems_traffic = preprocess_pems_traffic(
file_path="/Users/wdu/Downloads/traffic.csv",
file_path="traffic.csv",
rate=rate,
n_steps=step,
pattern=pattern,