Add level2 ingestion into repo #52

Open
wants to merge 125 commits into
base: main
125 commits
5b15ed5
Add new loader code
samueljackson92 Dec 11, 2024
7178a9a
Fix ruff
samueljackson92 Dec 11, 2024
76ed719
Update tests to conditionally skip
samueljackson92 Dec 11, 2024
3184d9b
Fix ruff
samueljackson92 Dec 11, 2024
185e8ce
Update tests
samueljackson92 Dec 11, 2024
df8462e
revert to ignoring tests in CI until mastcodes is ready.
samueljackson92 Dec 13, 2024
01ba84f
Merge branch 'main' into 34-updated-reader
samueljackson92 Dec 13, 2024
0b23202
Skip more tests
samueljackson92 Dec 13, 2024
d32af1c
Refactor cli and workflow code
samueljackson92 Dec 13, 2024
0aa1f9a
refactor pipelines
samueljackson92 Dec 13, 2024
763278a
Update processing for amc
samueljackson92 Dec 13, 2024
c59d689
Add fix for names
samueljackson92 Dec 13, 2024
b5207d6
Update renaming dims
samueljackson92 Dec 16, 2024
d6b5e21
Update unit mappings
samueljackson92 Dec 16, 2024
9bb6b53
Update workflow code
samueljackson92 Dec 17, 2024
a05baa5
Ruff formatting
samueljackson92 Dec 17, 2024
200731b
Update requirements.txt
samueljackson92 Dec 17, 2024
81d257f
Update mappings
samueljackson92 Dec 17, 2024
88d06bc
Update CI
samueljackson92 Dec 17, 2024
f1d221d
Update tests
samueljackson92 Dec 17, 2024
3754551
fix ruff
samueljackson92 Dec 17, 2024
82238e8
Update workflow to upload to s3
samueljackson92 Dec 18, 2024
cbe659f
remove unused file
samueljackson92 Dec 18, 2024
51692ca
Add code to rename groups
samueljackson92 Dec 18, 2024
418cd84
Tidy up name mappings
samueljackson92 Dec 18, 2024
b44d0a7
Update mappings and pipelines for better harmony
samueljackson92 Dec 18, 2024
255360c
Add unit test
samueljackson92 Dec 18, 2024
9b16462
Fix ruff
samueljackson92 Dec 18, 2024
b42f063
Fix historical mappings
samueljackson92 Dec 18, 2024
8c5b7ba
Add bes
samueljackson92 Dec 18, 2024
c1f69db
Fix tests
samueljackson92 Dec 18, 2024
ff49aa9
Update install instructions
samueljackson92 Jan 2, 2025
5e5d558
Checkpoint work on mappings
samueljackson92 Jan 2, 2025
c2ed67b
Update mappings and trasforms for MAST-U
samueljackson92 Jan 2, 2025
729db7f
remove debug code
samueljackson92 Jan 2, 2025
d99ae3f
Add support for writing metadatabase
samueljackson92 Jan 2, 2025
868b043
Fix ruff
samueljackson92 Jan 2, 2025
6ad7669
Fix rank,shape,dims
samueljackson92 Jan 2, 2025
da4d182
Refactor code structure to support level2
samueljackson92 Jan 3, 2025
246b7ad
Add level2 code to repo
samueljackson92 Jan 3, 2025
8ac18f2
Add notebooks for checking results
samueljackson92 Jan 3, 2025
48cf3ec
Update tests
samueljackson92 Jan 3, 2025
cc2df7f
Fix ruff
samueljackson92 Jan 3, 2025
422d11a
Tidy up
samueljackson92 Jan 3, 2025
27ab443
Update gitignore
samueljackson92 Jan 3, 2025
0265bb7
Update mappings location
samueljackson92 Jan 3, 2025
d72af12
remove unused geom files
samueljackson92 Jan 3, 2025
b7dd57f
Update installation instructions
samueljackson92 Jan 3, 2025
7b376cd
Update requirements
samueljackson92 Jan 3, 2025
0db175e
Ignore notebook dir in ruff
samueljackson92 Jan 3, 2025
7bf512e
Update reqs
samueljackson92 Jan 3, 2025
6bdd7af
Update reqs
samueljackson92 Jan 3, 2025
9707bfd
refactor parallel code
samueljackson92 Jan 3, 2025
56a9a04
remove code
samueljackson92 Jan 3, 2025
18133a9
Ingest in reverse
samueljackson92 Jan 3, 2025
ff3a76a
Update readme
samueljackson92 Jan 3, 2025
571027f
Update job scripts
samueljackson92 Jan 3, 2025
a9fd976
Merge branch '51-level2-ingestion' of github.com:ukaea/fair-mast-inge…
samueljackson92 Jan 3, 2025
17abe04
remove old job scripts
samueljackson92 Jan 3, 2025
7aa1377
Add level2 job file
samueljackson92 Jan 3, 2025
4817bf1
Update to include bottleneck
samueljackson92 Jan 3, 2025
2e8e18c
Update FFT transform
samueljackson92 Jan 3, 2025
613c1fc
Merge branch '51-level2-ingestion' of github.com:ukaea/fair-mast-inge…
samueljackson92 Jan 3, 2025
086446f
add sensible metadata defaults
samueljackson92 Jan 3, 2025
76c5c10
Update the uploader
samueljackson92 Jan 3, 2025
2f8f479
Fix camera data
samueljackson92 Jan 3, 2025
eb0c09c
Merge branch '51-level2-ingestion' of github.com:ukaea/fair-mast-inge…
samueljackson92 Jan 3, 2025
e65ff02
Update to fix subtle bug
samueljackson92 Jan 3, 2025
e58e2af
Fix issue with adding geometry
samueljackson92 Jan 3, 2025
1befa0f
fix geometry names
samueljackson92 Jan 3, 2025
9c0c752
Fix coordinate bug
samueljackson92 Jan 6, 2025
0bbc992
fix upload syntax
samueljackson92 Jan 6, 2025
0fac9bb
Fix esm mappings
samueljackson92 Jan 6, 2025
ac1b207
Remove debug statement
samueljackson92 Jan 6, 2025
e57eada
Update metadata parsing script
samueljackson92 Jan 6, 2025
4c561bf
Update level2 mappings for MAST to IMAS
samueljackson92 Jan 8, 2025
b1d9b48
Update mappings
samueljackson92 Jan 14, 2025
4c7d246
Add connection reset
samueljackson92 Jan 14, 2025
3f126a3
Add error handling
samueljackson92 Jan 14, 2025
3a1bf89
Add more robust code for l1
samueljackson92 Jan 15, 2025
8a8bb79
Fix imports
samueljackson92 Jan 15, 2025
b2cb3b7
Update logger message
samueljackson92 Jan 15, 2025
5b5471e
Update mappings for ALP
samueljackson92 Jan 15, 2025
754b573
Add better exception handling
samueljackson92 Jan 15, 2025
e02a6ff
Force upload to use strings
samueljackson92 Jan 15, 2025
34e21bd
Update error logging
samueljackson92 Jan 15, 2025
74630ba
Add more robust code
samueljackson92 Jan 15, 2025
d8bb8bf
Update mappings
samueljackson92 Jan 15, 2025
ee3f86b
Update mappings
samueljackson92 Jan 15, 2025
171219e
Fix race condition
samueljackson92 Jan 15, 2025
8f0cc8f
Make more robust
samueljackson92 Jan 15, 2025
42a4bdb
Fix edge case with magnetics
samueljackson92 Jan 15, 2025
6ed7f48
Update ruff and tests
samueljackson92 Jan 15, 2025
b269bc4
Update scripts
samueljackson92 Jan 15, 2025
cd79622
Add fix mast mapping
samueljackson92 Jan 15, 2025
2e91aee
Fix scaling for plasma current
samueljackson92 Jan 16, 2025
36ccf95
Add checkpointing
samueljackson92 Jan 20, 2025
a4679e5
Add upload
samueljackson92 Jan 23, 2025
fce53fa
Add updated mappings
samueljackson92 Jan 24, 2025
98a217b
Update logging
samueljackson92 Jan 27, 2025
926e773
Comment out some magnetics
samueljackson92 Jan 27, 2025
f391d17
Add metadata parsing script
samueljackson92 Jan 28, 2025
9f5cdcd
Update for ruff
samueljackson92 Jan 28, 2025
7833124
Update channel loading to use NaN for missing channels
samueljackson92 Jan 29, 2025
02eeb58
Add bes into mappings
samueljackson92 Jan 29, 2025
afa8fee
Update target units
samueljackson92 Jan 29, 2025
f5d5c09
fixed typo in mastcodes git link
jameshod5 Jan 29, 2025
0c60f39
Checkpoint mappings
samueljackson92 Jan 29, 2025
7de80e7
Checkpoint scripts
samueljackson92 Jan 29, 2025
99a2d41
Checkpoint jobs
samueljackson92 Jan 29, 2025
ff174f2
Checkpoint configs
samueljackson92 Jan 29, 2025
83418c8
Update gitignore
samueljackson92 Jan 29, 2025
02a9be5
Checkpoint notebook
samueljackson92 Jan 29, 2025
ea868ad
Update consolidate job for freia
samueljackson92 Jan 30, 2025
bd6cd47
Update the code to remove metadata writing
samueljackson92 Jan 30, 2025
6b3ef38
Update metadata jobs
samueljackson92 Jan 30, 2025
18eac4c
Merge branch '51-level2-ingestion' of github.com:ukaea/fair-mast-inge…
samueljackson92 Jan 30, 2025
9ffe577
Remove meta data writing
samueljackson92 Jan 31, 2025
9759268
Update for ruff
samueljackson92 Jan 31, 2025
ae1fc5d
Speed-up cpf parsing
samueljackson92 Jan 31, 2025
ab66c57
Update for ruff
samueljackson92 Jan 31, 2025
f683523
Update metadata jobs
samueljackson92 Feb 3, 2025
ecdc3cf
Update mapping files
samueljackson92 Feb 7, 2025
9eceb01
Update code with channel ordering
samueljackson92 Feb 18, 2025
40a11ff
Update mappings
samueljackson92 Feb 18, 2025
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -17,6 +17,7 @@ jobs:
source venv/bin/activate
uv pip install -r requirements.txt
uv pip install --upgrade --force-reinstall "numpy<2.0"
uv pip install -e ".[dev]"

- name: Run tests
run: |
5 changes: 4 additions & 1 deletion .gitignore
@@ -171,4 +171,7 @@ dask-scratch-space

fairmast-*.e*
fairmast-*.o*
.s5cfg*
fair-mast-*.e*
fair-mast-*.o*
.s5cfg*
*.db
134 changes: 55 additions & 79 deletions README.md
@@ -1,14 +1,28 @@
### FAIR MAST Data Ingestion

## Running on CSD3
### Installation on CSD3
## Project Structure

After logging into your CSD3 account (on Icelake node), first load the correct Python module:

```sh
module load python/3.9.12/gcc/pdcqf4o5
Below is a brief overview of the project structure
```
|-- campaign_shots # CSV lists of shots for each MAST campaign
|-- configs # Config files for each level of ingestion
|-- geometry # Geometry data files for each diagnostic source
|-- jobs # Job scripts for different HPC machines
|-- mappings # Mapping files for transforming units, names, dimensions, etc.
|-- notebooks # Notebooks for checking outputs
|-- scripts # Misc scripts for metadata curation
|-- src # Source code for ingestion tools
| |-- core # Core modules for ingestion, shared between all levels
| |-- level1 # Level1 data ingestion code
| |-- level2 # Level2 data ingestion code
`-- tests # Unit tests
|-- core # Core module unit tests
|-- level1 # Level1 module unit tests
|-- level2 # Level2 module unit tests
```

## Installation and Setup

Clone the repository and fetch data files (Git LFS must be installed):

```sh
@@ -17,120 +31,82 @@ cd fair-mast-ingestion
git lfs pull
```

Create a virtual environment:
Create a new Python virtual environment:

```sh
python -m venv fair-mast-ingestion
source fair-mast-ingestion/bin/activate
uv venv --python 3.12.6
source .venv/bin/activate
```

Update pip and install required packages:

```sh
python -m pip install -U pip
python -m pip install -e .
uv pip install git+ssh://[email protected]/MAST-U/mastcodes.git#subdirectory=uda/python
uv pip install -e .
uv pip install -e ".[dev]"
uv pip install -e ".[mpi]"
```

The final step to installation is to have mastcodes:
If running on CSD3, you must also source the SSL certificate information by running the following command. Without this, UDA cannot connect to the UKAEA network.

```sh
git clone [email protected]:MAST-U/mastcodes.git
cd mastcodes
```

Edit `uda/python/setup.py` and change the "version" to 1.3.9.

```sh
python -m pip install uda/python
cd ..
source ~/rds/rds-ukaea-ap002-mOlK9qn0PlQ/fairmast/uda-ssl.sh
```

#### S3 Support (Optional)

Finally, for uploading to S3 we need to install `s5cmd` and make sure it is on the path:

```sh
wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
tar -xvzf s5cmd_2.2.2_Linux-64bit.tar.gz
PATH=$PWD:$PATH
```

And add a config file for the bucket keys, by creating a file called `.s5cfg.stfc`:
Finally, for uploading to S3 we need to create a local config file with the bucket keys. Create a file called `.s5cfg.stfc` with the following information:

```
[default]
aws_access_key_id=<access-key>
aws_secret_access_key=<secret-key>
```
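
The credentials file can also be generated from a script rather than edited by hand. The following is a minimal Python sketch, assuming the two-key `[default]` layout shown above. The helper name `write_s5cmd_credentials` is hypothetical and not part of this repository, and the key values are placeholders.

```python
import configparser
import os
import tempfile
from pathlib import Path


def write_s5cmd_credentials(path: Path, access_key: str, secret_key: str) -> None:
    """Write an AWS-style credentials file with a single [default] section."""
    # interpolation=None so that a '%' in a key is stored literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    # Request mode 0600 on creation so the keys are not readable by other users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        # Emit "key=value" without spaces, matching the layout shown above.
        parser.write(handle, space_around_delimiters=False)


# Demonstration in a temporary directory; in practice the target would be `.s5cfg.stfc`.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / ".s5cfg.stfc"
    write_s5cmd_credentials(cfg_path, "<access-key>", "<secret-key>")
    written = cfg_path.read_text()

print(written)
```

Because this pull request keeps `.s5cfg*` in `.gitignore`, a file created this way in the repository root stays out of version control.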

You should now be able to run the following commands.

### Submitting runs on CSD3

#### First Run on CSD3

This will ingest data into the test folder in S3. The small_ingest script allows you to put one file of shots into the ingestion.

1. First submit a job to collect all the metadata:

```sh
sbatch ./jobs/metadata.csd3.slurm.sh
```
## Running Ingestion

2. Then submit an ingestion job
The following section details how to ingest data into a local folder with UDA.

Argument 1 (e.g. s3://mast/test/shots/) is where the data will ingest to, and argument 2 is the file of shots to ingest (e.g. campaign_shots/tiny_campaign.csv), arguments 3 and greater are the sources (e.g. amc)
First, edit both config files in `./configs/` so that the writer `output_path` points to a suitable location:

```sh
sbatch ./jobs/small_ingest.csd3.slurm.sh s3://mast/test/shots/ campaign_shots/tiny_campaign.csv amc
```yaml
...
writer:
type: "zarr"
options:
zarr_format: 2
output_path: "/common/tmp/sjackson/upload-tmp/zarr/level1"
...
```

#### Ingesting All Shots
### Level 1 Ingestion

This ingestion job runs through all shots for the specified source (e.g. amc)
The example below runs a level 1 ingestion that writes `ayc` data for shot `30421` from MAST.

```sh
sbatch ./jobs/small_ingest.csd3.slurm.sh s3://mast/test/shots/ amc
mpirun -n 4 python3 -m src.level1.main -v --facility MAST --shot 30421 -i ayc
```

## Manually Running Ingestor

### Local Ingestion

The following section details how to ingest data into a local folder on freia with UDA.

1. Parse the metadata for all signals and sources for a list of shots with the following command

```sh
mpirun -n 16 python3 -m src.create_uda_metadata data/uda campaign_shots/tiny_campaign.csv
```
### Level 2 Ingestion

The example below runs a level 2 ingestion that writes `thomson_scattering` data for shot `30421` from MAST.
```sh
mpirun -np 16 python3 -m src.main data/local campaign_shots/tiny_campaign.csv --metadata_dir data/uda --source_names amc xsx --file_format nc
mpirun -n 4 python3 -m src.level2.main mappings/level2/mast.yml -v --shot 30421 -i thomson_scattering
```
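
When the same ingestion has to be launched for several shots, it can help to build these command lines programmatically. The sketch below is one hedged way to do that in Python. It only reproduces the flags that appear in the two commands above; the helper names `level1_command` and `level2_command` are hypothetical, and nothing is assumed about other options the CLIs may accept.

```python
import shlex
from typing import List


def level1_command(shot: int, source: str, facility: str = "MAST", ranks: int = 4) -> List[str]:
    """Argument list for a level 1 ingestion of one source and one shot."""
    return ["mpirun", "-n", str(ranks), "python3", "-m", "src.level1.main",
            "-v", "--facility", facility, "--shot", str(shot), "-i", source]


def level2_command(shot: int, group: str, mapping: str = "mappings/level2/mast.yml",
                   ranks: int = 4) -> List[str]:
    """Argument list for a level 2 ingestion of one group and one shot."""
    return ["mpirun", "-n", str(ranks), "python3", "-m", "src.level2.main",
            mapping, "-v", "--shot", str(shot), "-i", group]


commands = [
    level1_command(30421, "ayc"),
    level2_command(30421, "thomson_scattering"),
]

for cmd in commands:
    # Print the shell form; to execute, pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Keeping each command as a list avoids shell quoting problems if it is later passed to `subprocess.run`.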

Files will be output in the NetCDF format to `data/local`.

### Ingestion to S3

The following section details how to ingest data into the s3 storage on freia with UDA.

1. Parse the metadata for all signals and sources for a list of shots with the following command

```sh
mpirun -n 16 python3 -m src.create_uda_metadata data/uda campaign_shots/tiny_campaign.csv
```

This will create the metadata for the tiny campaign. You may do the same for full campaigns such as `M9`.
To ingest to S3, edit the config files in `./configs` to include an upload entry specifying the endpoint and the location to upload data to.
For example, the following config sets the base path and endpoint URL for object storage at CSD3:

2. Run the ingestion pipeline by submitting the following job:

```sh
mpirun -np 16 python3 -m src.main data/local campaign_shots/tiny_campaign.csv --bucket_path s3://mast/test/shots --source_names amc xsx --file_format zarr --upload --force
```yaml
upload:
base_path: "s3://mast/test/level1/shots"
mode: 's5cmd'
credentials_file: ".s5cfg.csd3"
endpoint_url: "https://object.arcus.openstack.hpc.cam.ac.uk"
```

This will submit a job to the freia job queue that will ingest all of the shots in the tiny campaign and push them to the s3 bucket.
Then simply rerun the commands above.
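
To make the role of each `upload` field concrete, here is a hedged Python sketch of how such an entry could be turned into an `s5cmd` invocation. It assumes the uploader passes `credentials_file` and `endpoint_url` to s5cmd's global `--credentials-file` and `--endpoint-url` options and copies with `cp`. The repository's actual uploader may work differently, and both the function name `s5cmd_upload_command` and the local directory are hypothetical.

```python
from typing import Dict, List


def s5cmd_upload_command(upload: Dict[str, str], local_dir: str) -> List[str]:
    """Build an s5cmd argument list that copies local_dir to upload['base_path']."""
    if upload.get("mode") != "s5cmd":
        raise ValueError(f"unsupported upload mode: {upload.get('mode')!r}")
    return [
        "s5cmd",
        "--credentials-file", upload["credentials_file"],
        "--endpoint-url", upload["endpoint_url"],
        "cp",
        # s5cmd expands the wildcard itself, so no shell globbing is needed.
        local_dir.rstrip("/") + "/*",
        upload["base_path"].rstrip("/") + "/",
    ]


# Values copied from the example `upload` entry; the local path is a placeholder.
upload_cfg = {
    "base_path": "s3://mast/test/level1/shots",
    "mode": "s5cmd",
    "credentials_file": ".s5cfg.csd3",
    "endpoint_url": "https://object.arcus.openstack.hpc.cam.ac.uk",
}

command = s5cmd_upload_command(upload_cfg, "./data/zarr/level1")
print(" ".join(command))
```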

## CPF Metadata

25 changes: 25 additions & 0 deletions configs/level1.yml
@@ -0,0 +1,25 @@
metadatabase_file: /common/tmp/sjackson/level1.ukaea.db
readers:
uda:
type: 'uda'
fairmast:
type: 'zarr'
options:
base_path: 's3://mast/level1/shots'
protocol: 'simplecache'
target_protocol: "s3"
target_options:
anon: True
endpoint_url: "https://s3.echo.stfc.ac.uk"

writer:
type: "zarr"
options:
zarr_format: 2
output_path: "/common/tmp/sjackson/upload-tmp/zarr/level1"

# upload:
# base_path: "s3://fairmast/mastu/level1/shots"
# mode: 's5cmd'
# credentials_file: "../fair-mast-ingestion/.s5cfg.ukaea"
# endpoint_url: "http://mon3.cepheus.hpc.l:8000"
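
The `fairmast` reader options follow the pattern used by fsspec caching filesystems: `protocol` names the caching layer, while `target_protocol` and `target_options` describe the remote store behind it. The sketch below shows how those options could map onto `fsspec.filesystem`. It is an assumption about how the reader consumes the config, not a copy of the repository's code; `filesystem_kwargs` and `open_filesystem` are hypothetical names, and `open_filesystem` needs `fsspec` and `s3fs` installed.

```python
from typing import Any, Dict, Tuple


def filesystem_kwargs(options: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Split reader options into the fsspec protocol and its keyword arguments."""
    # base_path locates the data; it is not an argument of the filesystem itself.
    kwargs = {k: v for k, v in options.items() if k not in ("protocol", "base_path")}
    return options["protocol"], kwargs


def open_filesystem(options: Dict[str, Any]):
    """Create the caching filesystem described by the options."""
    import fsspec  # imported lazily so filesystem_kwargs works without fsspec

    protocol, kwargs = filesystem_kwargs(options)
    return fsspec.filesystem(protocol, **kwargs)


# Values copied from the `fairmast` reader in configs/level1.yml.
reader_options = {
    "base_path": "s3://mast/level1/shots",
    "protocol": "simplecache",
    "target_protocol": "s3",
    "target_options": {
        "anon": True,
        "endpoint_url": "https://s3.echo.stfc.ac.uk",
    },
}

protocol, kwargs = filesystem_kwargs(reader_options)
print(protocol, kwargs)
```

With `anon: True` no credentials are needed for reading, which is why the reader section carries no key material.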
27 changes: 27 additions & 0 deletions configs/level2.yml
@@ -0,0 +1,27 @@
metadatabase_file: /common/tmp/sjackson/level2.stfc.db
readers:
uda:
type: 'uda'
fairmast:
type: 'zarr'
options:
base_path: 's3://mast/level1/shots'
protocol: 'simplecache'
target_protocol: "s3"
target_options:
anon: True
endpoint_url: "https://s3.echo.stfc.ac.uk"

writer:
type: "zarr"
options:
zarr_format: 2
# output_path: "/common/tmp/sjackson/upload-tmp/zarr/mast/level2/"
output_path: "./data-test"

# upload:
# base_path: "s3://fairmast/mastu/level2/shots"
# mode: 's5cmd'
# credentials_file: "../fair-mast-ingestion/.s5cfg.ukaea"
# endpoint_url: "http://mon3.cepheus.hpc.l:8000"

22 changes: 22 additions & 0 deletions jobs/consolidate_job.freia.qsub.sh
@@ -0,0 +1,22 @@
#!/bin/bash

# Choose parallel environment
#$ -pe mpi 8

# Specify the job name in the queue system
#$ -N fairmast-consolidate

# Start the script in the current working directory
#$ -cwd

# Time requirements
#$ -l h_rt=48:00:00
#$ -l s_rt=48:00:00

source .venv/bin/activate

num_workers=8
bucket_path="s3://mast/level1/shots/"
local_path="/common/tmp/sjackson/fair-mast/consolidate"

python3 -m scripts.consolidate_s3 $bucket_path $local_path -n $num_workers --start-shot 11695 --end-shot 14830
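
The job passes a shot range (`--start-shot 11695 --end-shot 14830`) together with a worker count (`-n $num_workers`). As an illustration of how such a range might be divided into contiguous blocks, one per worker, here is a small Python sketch. It assumes both bounds are inclusive and is not taken from `scripts.consolidate_s3`, whose scheduling may differ; `split_shot_range` is a hypothetical helper.

```python
from typing import List


def split_shot_range(start: int, end: int, num_workers: int) -> List[range]:
    """Split the inclusive range [start, end] into contiguous, near-equal chunks."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if end < start:
        raise ValueError("end must not be smaller than start")
    total = end - start + 1
    base, extra = divmod(total, num_workers)
    chunks = []
    lo = start
    for worker in range(num_workers):
        # The first `extra` workers take one additional shot each.
        size = base + (1 if worker < extra else 0)
        chunks.append(range(lo, lo + size))
        lo += size
    return chunks


chunks = split_shot_range(11695, 14830, 8)
for worker, chunk in enumerate(chunks):
    if len(chunk) > 0:
        print(f"worker {worker}: shots {chunk[0]} to {chunk[-1]} ({len(chunk)} shots)")
```

Under the inclusive-bounds assumption the range holds 3136 shots, so eight workers receive 392 shots each.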
4 changes: 2 additions & 2 deletions jobs/consolidate_job.slurm.sh
@@ -12,6 +12,6 @@ num_workers=$SLURM_NTASKS

bucket_path="s3://mast/level1/shots/"
local_path="/rds/project/rds-mOlK9qn0PlQ/fairmast"
S

mpirun -n $num_workers \
python3 -m src.consolidate_s3 $bucket_path $local_path
python3 -m scripts.consolidate_s3 $bucket_path $local_path
20 changes: 20 additions & 0 deletions jobs/cpf_metadata.freia.qsub
@@ -0,0 +1,20 @@
#!/bin/bash

# Choose parallel environment
#$ -pe mpi 16

# Specify the job name in the queue system
#$ -N fairmast-cpf-writer

# Start the script in the current working directory
#$ -cwd

# Time requirements
#$ -l h_rt=48:00:00
#$ -l s_rt=48:00:00

source .venv/bin/activate

# Run script
python3 -m scripts.create_cpf_metadata mast --shot-min 11695 --shot-max 30475
python3 -m scripts.create_cpf_metadata mastu --shot-min 41139 --shot-max 51056
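
The two commands split the CPF metadata run by facility, each with its own shot range. A lookup such as the following Python sketch can be handy when deciding which of the two runs covers a given shot. The ranges are copied from this job script and treated as inclusive, which is an assumption; they are not presented as official campaign boundaries, and `facility_for_shot` is a hypothetical helper.

```python
from typing import Optional

# Shot ranges copied from the two commands in the job script (assumed inclusive).
CPF_SHOT_RANGES = {
    "mast": (11695, 30475),
    "mastu": (41139, 51056),
}


def facility_for_shot(shot: int) -> Optional[str]:
    """Return the facility whose job-script range contains `shot`, else None."""
    for facility, (shot_min, shot_max) in CPF_SHOT_RANGES.items():
        if shot_min <= shot <= shot_max:
            return facility
    return None


for shot in (30421, 45000, 35000):
    print(shot, facility_for_shot(shot))
```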
28 changes: 0 additions & 28 deletions jobs/freia_write_cpf.qsub

This file was deleted.

35 changes: 0 additions & 35 deletions jobs/ingest.csd3.slurm.sh

This file was deleted.
