Dataloader #83

Merged on Jul 24, 2024 (58 commits)
Changes from 50 commits
dfda24f
Initial implementation with hydra
kks32 Jun 25, 2024
50c63c2
Test CI for config yaml
kks32 Jun 25, 2024
c93c335
Testing CI for hydra
kks32 Jun 25, 2024
13c5204
Try pip requirements.txt
kks32 Jun 25, 2024
7ebded9
Try pip instead of conda for docker
kks32 Jun 25, 2024
85ff713
Docker build on GitHub
kks32 Jun 25, 2024
e607f98
Copy requirements.txt file before installing on Docker container
kks32 Jun 25, 2024
9913d92
Copy requirements.txt
kks32 Jun 25, 2024
111b428
Copy requirements.txt
kks32 Jun 25, 2024
3606790
Trying with user flag
kks32 Jun 25, 2024
b6cea3e
Trying Python 3.11
kks32 Jun 25, 2024
f9cb06d
Test CircleCI with ghcr container image
kks32 Jun 25, 2024
5d8ce23
GitHub Actions workflow to test training GNS
kks32 Jun 25, 2024
b230de8
Updated dockerfile with paths and env
kks32 Jun 25, 2024
20fa700
Modify workflow to run training
kks32 Jun 25, 2024
7764f9d
Add at least one epoch to run when nsteps is fewer than 1 epoch steps
kks32 Jun 25, 2024
1e612df
Train GNS action
kks32 Jun 25, 2024
1f0b8cd
Test without docker pull on CircleCI
kks32 Jun 25, 2024
8c0ddad
Specify path and branches
kks32 Jun 25, 2024
4a7e5c4
Fix path to GNS sample output
kks32 Jun 25, 2024
c5bcc44
Worflow runs on Github and remove conda on circleci
kks32 Jun 25, 2024
dc27691
No black check
kks32 Jun 25, 2024
5329b78
Only try to build container if specific files have changed
kks32 Jun 25, 2024
a97cd5e
Fix resume training and README
kks32 Jun 25, 2024
170037b
Reduce number of steps to 100 for testing
kks32 Jun 25, 2024
40b4919
Refactor constants to data
kks32 Jun 25, 2024
c6b092d
Remove on PR
kks32 Jun 25, 2024
14f3b9e
Add config to tensorboard writer
kks32 Jun 26, 2024
5bab43b
Particle data loader
kks32 Jun 27, 2024
8a3a540
Add tests for data loader
kks32 Jun 27, 2024
29574a7
Use the correct container
kks32 Jun 27, 2024
402e677
Remove CircleCI
kks32 Jun 27, 2024
e28275c
Reduce nsteps to 10 for CI
kks32 Jun 27, 2024
e3283cf
Resolve merge conflict with config branch
kks32 Jun 28, 2024
3d99936
multi node parallel
Jun 28, 2024
6d6b1ca
resolve conflict
Jun 28, 2024
8f05d50
resolve confligt
Jun 28, 2024
66289a5
change GPU container
Jun 28, 2024
9a8ea12
update scripts and readme
Jun 28, 2024
79cd9f4
n_gpus
Jun 28, 2024
748b9f4
update Dockerfile
Jun 28, 2024
a74ced0
reformat
Jun 28, 2024
768702a
Remove unused dataloader
kks32 Jun 29, 2024
0a28073
GPU cocntainer
kks32 Jun 29, 2024
b1081b9
Remove blank lines in GPU container
kks32 Jun 29, 2024
3202caf
Update README with container image
kks32 Jun 29, 2024
291ce9a
Add GitHub badge
kks32 Jun 29, 2024
50fe084
WIP: Refactor train
kks32 Jun 29, 2024
27919d1
Fix validation dataloader
kks32 Jun 29, 2024
985a68a
Prepare data function
kks32 Jun 29, 2024
0e79137
check scripts
Jul 2, 2024
7748b64
Merge branch 'dataloader' of https://github.com/geoelements/gns into …
Jul 2, 2024
1a8546c
remove extra file
Jul 2, 2024
7f31242
use python to launch
Jul 3, 2024
2bcd131
black
Jul 3, 2024
86b4dbb
update test
Jul 3, 2024
f72b172
fix loading simulator and resuming from middle of epoch
Jul 11, 2024
8f4e92c
remove n_gpus
Jul 17, 2024
21 changes: 0 additions & 21 deletions .circleci/config.yml

This file was deleted.

44 changes: 44 additions & 0 deletions .github/workflows/container-gpu.yml
@@ -0,0 +1,44 @@
name: Build and Push GPU Image to GHCR

on:
  push:
    paths:
      - Dockerfile-GPU
      - requirements.txt

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Log in to the Container registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          file: ./Dockerfile-GPU
          push: true
          tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:gpu
          labels: ${{ steps.meta.outputs.labels }}
8 changes: 4 additions & 4 deletions .github/workflows/train.yml
@@ -7,13 +7,13 @@ jobs:
gns:
runs-on: ubuntu-latest
container:
image: ghcr.io/geoelements/gns:config
image: ghcr.io/geoelements/gns:dataloader

steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Black linter check

- name: Black linter
run: |
black --check .

@@ -26,4 +26,4 @@ jobs:
TMP_DIR="../gns-sample"
DATASET_NAME="WaterDropSample"
git clone https://github.com/geoelements/gns-sample ../gns-sample
python -m gns.train
python -m gns.train mode="train" training.steps=10
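The updated CI command passes Hydra-style `key=value` overrides (`mode="train" training.steps=10`) instead of the old `--flag` arguments. As a rough illustration of how such dotted overrides map onto a nested configuration, here is a minimal standard-library sketch. It is not Hydra's implementation: `apply_overrides` is a hypothetical helper, and the starting values in `base` are assumptions made up for the example.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied.

    Hypothetical helper for illustration only; Hydra's real override
    grammar is richer (lists, deletions, appends, and so on).
    """
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        try:
            # Parse numbers, booleans, None; fall back to a plain string.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            value = raw_value
        node = result
        *parents, leaf = dotted_key.split(".")
        for name in parents:
            # Walk (or create) intermediate sections such as `training`.
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


# Assumed starting values; only the key names come from the command above.
base = {"mode": "rollout", "training": {"steps": 2000}}
# The shell strips the quotes in mode="train", so the program sees mode=train.
updated = apply_overrides(base, ["mode=train", "training.steps=10"])
print(updated)
```

The original dictionary is left untouched, which mirrors the idea that command-line overrides change one run rather than the checked-in `config.yaml`.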
19 changes: 19 additions & 0 deletions Dockerfile-GPU
@@ -0,0 +1,19 @@
FROM nvcr.io/nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04
RUN apt-get update
RUN apt-get upgrade -y

RUN apt-get install -y python3
RUN apt-get install -y python3-pip
RUN apt-get install -y git

RUN pip install --upgrade pip ipython ipykernel

COPY requirements.txt requirements.txt
ENV PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cu118
RUN pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
RUN pip install torch_geometric
RUN pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.3.0+cu118.html
RUN pip install absl-py autopep8 numpy==1.23.1 dm-tree matplotlib pyevtk pytest tqdm toml
RUN pip install -r requirements.txt

CMD ["/bin/bash"]
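The Dockerfile pins `torch==2.3.0` to the `cu118` wheel index and then points `pip -f` at the matching PyG wheel page, so the PyTorch version and compute tag must agree in both places. A minimal sketch of how the two URLs line up is shown below; `pytorch_index_url` and `pyg_wheel_url` are hypothetical helper names, and the URL patterns are inferred only from the URLs that appear in this diff.

```python
def pytorch_index_url(compute_tag: str) -> str:
    """Index URL for PyTorch wheels, e.g. compute_tag='cu118' or 'cpu'."""
    return f"https://download.pytorch.org/whl/{compute_tag}"


def pyg_wheel_url(torch_version: str, compute_tag: str) -> str:
    """Find-links page for PyG extension wheels built against a torch build."""
    return f"https://data.pyg.org/whl/torch-{torch_version}+{compute_tag}.html"


# Values taken from Dockerfile-GPU (GPU image) and the README (CPU install).
gpu_index = pytorch_index_url("cu118")
gpu_pyg = pyg_wheel_url("2.3.0", "cu118")
cpu_pyg = pyg_wheel_url("2.3.0", "cpu")
print(gpu_index)
print(gpu_pyg)
print(cpu_pyg)
```

Keeping the version and tag in one place makes it harder for the `torch` install and the PyG extension wheels to drift apart when either is upgraded.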
63 changes: 12 additions & 51 deletions README.md
@@ -1,8 +1,8 @@
# Graph Network Simulator (GNS) and MeshNet

[![DOI](https://zenodo.org/badge/427487727.svg)](https://zenodo.org/badge/latestdoi/427487727)
[![CircleCI](https://dl.circleci.com/status-badge/img/gh/geoelements/gns/tree/main.svg?style=svg)](https://dl.circleci.com/status-badge/redirect/gh/geoelements/gns/tree/main)
[![Docker](https://quay.io/repository/geoelements/gns/status "Docker Repository on Quay")](https://quay.io/repository/geoelements/gns)
[![GitHub Actions](https://github.com/geoelements/gns/actions/workflows/train.yml/badge.svg)](https://github.com/geoelements/gns/actions/workflows/train.yml)
[![Docker](https://img.shields.io/badge/container-gpu-limegreen.svg)](https://ghcr.io/geoelements/gns:gpu)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://raw.githubusercontent.com/geoelements/gns/main/license.md)

> Krishna Kumar, The University of Texas at Austin.
@@ -227,63 +227,29 @@ The dataset is shared on [DesignSafe DataDepot](https://doi.org/10.17603/ds2-fzg

GNS uses [PyTorch Geometric](https://www.pyg.org/) and [CUDA](https://developer.nvidia.com/cuda-downloads). These packages have specific requirements; please see [PyG installation](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html) for details.

> CPU-only installation on Linux (Conda)
> CPU-only installation on Linux/MacOS

```shell
conda install -y pytorch torchvision torchaudio cpuonly -c pytorch
conda install -y pyg -c pyg
conda install -y pytorch-cluster -c pyg
conda install -y absl-py -c anaconda
conda install -y numpy dm-tree matplotlib-base pyevtk -c conda-forge
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip3 install torch_geometric
pip3 install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.3.0+cpu.html
pip3 install -r requirements.txt
```
You can use the [WaterDropletSample](https://github.com/geoelements/gns-sample) dataset to check if your `gns` code is working correctly.

To test the code you can run:

```
pytest test/
```

To test on the small waterdroplet sample:

```
git clone https://github.com/geoelements/gns-sample

TMP_DIR="./gns-sample"
DATASET_NAME="WaterDropSample"
### Build Docker Image

mkdir -p ${TMP_DIR}/${DATASET_NAME}/models/
mkdir -p ${TMP_DIR}/${DATASET_NAME}/rollout/
`Dockerfile-GPU` is supplied to build an image with GPU support. A prebuilt image can also be pulled from GHCR:

DATA_PATH="${TMP_DIR}/${DATASET_NAME}/dataset/"
MODEL_PATH="${TMP_DIR}/${DATASET_NAME}/models/"
ROLLOUT_PATH="${TMP_DIR}/${DATASET_NAME}/rollout/"

python -m gns.train --data_path=${DATA_PATH} --model_path=${MODEL_PATH} --ntraining_steps=10
```

### Building GNS environment on TACC (LS6 and Frontera)

- to setup a virtualenv

```shell
sh ./build_venv.sh
docker pull ghcr.io/geoelements/gns:gpu
```

- check tests run successfully.
- start your environment

```shell
source start_venv.sh
```

### Building GNS on MacOS
```shell
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip3 install torch_geometric
pip3 install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.3.0+cpu.html
pip3 install -r requirements.txt
```

## GNS training in parallel
GNS can be trained in parallel on multiple nodes with multiple GPUs.
@@ -296,17 +262,12 @@ GNS can be trained in parallel on multiple nodes with multiple GPUs.
> GNS scaling result on [TACC lonestar6 GPU nodes](https://docs.tacc.utexas.edu/hpc/lonestar6/#table2) with A100 GPUs.

### Usage
#### Single Node, Multi-GPU
```shell
python -m torch.distributed.launch --nnodes=1 --nproc_per_node=[GPU_PER_NODE] --node_rank=[LOCAL_RANK] --master_addr=[MAIN_RANK] gns/train_multinode.py [ARGS]
```

#### Multi-node, Multi-GPU
On each node, run
```shell
python -m torch.distributed.launch --nnodes=[NNODES] --nproc_per_node=[GPU_PER_NODE] --node_rank=[LOCAL_RANK] --master_addr=[MAIN_RANK ]gns/train_multinode.py [ARGS]
mpiexec.hydra -np $NNODES -ppn 1 ../slurm_scripts/launch_helper.sh $DOCKER_IMG_LOCATION $n_gpu_per_node
```
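The `mpiexec.hydra` line starts one launcher process per node (`-ppn 1`) and passes the number of GPUs per node to the helper script. What `launch_helper.sh` does internally is not shown in this diff; the sketch below only illustrates the common convention for deriving the world size and a global rank from a node rank and a local GPU index. The function names are hypothetical.

```python
def world_size(nnodes: int, gpus_per_node: int) -> int:
    """Total number of training processes, assuming one process per GPU."""
    return nnodes * gpus_per_node


def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Unique rank of a process, assuming ranks are numbered node by node."""
    return node_rank * gpus_per_node + local_rank


# Example layout (assumed values): 2 nodes with 4 GPUs each.
nnodes, gpus_per_node = 2, 4
layout = [
    (node, local, global_rank(node, local, gpus_per_node))
    for node in range(nnodes)
    for local in range(gpus_per_node)
]
for node, local, rank in layout:
    print(f"node {node} gpu {local} -> global rank {rank}")
```

Under this convention every process gets a distinct rank in `range(world_size)`, which is what distributed training backends generally expect.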


### Inspiration
PyTorch version of Graph Network Simulator and Mesh Graph Network Simulator are based on:
* [https://arxiv.org/abs/2002.09405](https://arxiv.org/abs/2002.09405) and [https://github.com/deepmind/deepmind-research/tree/master/learning_to_simulate](https://github.com/deepmind/deepmind-research/tree/master/learning_to_simulate)
1 change: 0 additions & 1 deletion config.yaml
@@ -45,7 +45,6 @@ training:
# Hardware configuration
hardware:
cuda_device_number: null
n_gpus: 1

# Logging configuration
logging: