Use Hydra for Data Generation + Routine Weekly Merge (#1045)
* Use only one fits file for sampling sources from DES.

* Update README.md

* Modified file generation to write files on the fly. Fixed color ordering.

* Fix default values for image size.

* Remove bands from config.

* Remove bands from image normalizer.

* Remove look-up code.

* Refactor inference.

* Refactor config.

* Initialize Sampler with dataset.

* Modify config to use new Asinh Normalizer.

* Modify encoder to use list of image normalizers.

* Modify data generation to use Hydra.

* Add attributes to main config.

* Instantiate Sampler with data source.

* Preliminary notebook for collating inference outputs.

* Interleave memberships to expand to 1280 x 1280 image.

* Rename notebook.

* Update README.md.

* Remove FileDatum, replace with Dict.

* Object to tile mapping script.

* Update data generation configuration parameters.

* Modify default batch size from 1 to 2.

* Add summary statistics in inference.

* Add Callback Prediction Writer support for DES inference.

* Inference callbacks module. Initial commit.

* Update output path for object-to-tile dictionary.

* Sample shapes for background from DES table.

* 500 files in one .pt file.

* Style fixes.

---------

Co-authored-by: gapatron <[email protected]>
kapnadak and gapatron authored Jul 19, 2024
1 parent 618d145 commit 83fa80f
Showing 16 changed files with 575 additions and 341 deletions.
2 changes: 1 addition & 1 deletion bliss/cached_dataset.py
@@ -69,7 +69,7 @@ def __call__(self, datum_in):

 class ChunkingSampler(Sampler):
     def __init__(self, dataset: Dataset) -> None:
-        super().__init__()
+        super().__init__(dataset)
         assert isinstance(dataset, ChunkingDataset), "dataset should be ChunkingDataset"
         self.dataset = dataset
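For context, PyTorch's `Sampler.__init__` has historically accepted the data source as its first argument, which this change now forwards. A minimal usage sketch, assuming the `ChunkingDataset` constructor argument shown here (it is hypothetical, not the actual signature):

```python
# Minimal usage sketch; the ChunkingDataset constructor argument is hypothetical.
from torch.utils.data import DataLoader
from bliss.cached_dataset import ChunkingDataset, ChunkingSampler

dataset = ChunkingDataset("data_dir/file_data")  # hypothetical path argument
sampler = ChunkingSampler(dataset)               # now forwards dataset to Sampler.__init__
loader = DataLoader(dataset, sampler=sampler, batch_size=2)
```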

39 changes: 14 additions & 25 deletions case_studies/galaxy_clustering/README.md
@@ -10,28 +10,17 @@ Built on work done in Winter 2024 by [Li Shihang](https://www.linkedin.com/in/sh

## Generation of Data

-Data can be generated by running the bash script `data-gen.sh`. The bash script has 3 options:
-1. `-n`: the number of files to be generated (defaults to 100)
-2. `-s`: the size of the image (defaults to 4800)
-3. `-t`: the tile size (defaults to 4)
-
-As an example, the command
-
-```
-bash data-gen.sh -n 10 -s 2400 -t 8
-```
-
-generates 10 images of size 2400 x 2400 tiled with a tile size of 8 x 8.
-
-The bash script `data-gen.sh` runs three scripts (located under the data_generation directory) for data generation:
-1. `catalog_gen.py`, which generates catalogs of images and stores them in the data/catalogs subdirectory. Keyword arguments: `image_size` and `nfiles`.
-2. `galsim-des.yaml`, which reads in these catalogs and uses GalSim to generate corresponding images, stored as .fits files (one for each band) in the data/images subdirectory. Keyword arguments: `image_size` and `nfiles`.
-3. `file_datum_generation.py`, which reads in the catalogs and images and saves them as *FileDatum* objects containing the tile catalog and images in a dictionary. Keyword arguments: `image_size` and `tile_size`.
-
-Often, after image data has been generated, we may want to retile it with a different tile size. This can be done by running `file_datum_generation.py` with appropriate arguments (you must pass in the image size as well, since it defaults to 4800). For example, to tile 80 x 80 images with a tile size of 8, run
-
-```
-python data_generation/file_datum_generation.py image_size=80 tile_size=8
-```
-
-Note that you must run the script from the galaxy_clustering directory (since it uses the current working directory for the data paths).
+The data generation routine proceeds through phases. The entire routine is wrapped into a single Python script, `data_gen.py`, which draws its parameters from the Hydra configuration located at `conf/config.yaml` under the `data_gen` key (see the example invocation after the parameter list below). The phases proceed as follows.
+
+1. **Catalog Generation.** First, we sample semi-synthetic source catalogs with their relevant properties, which are stored as `.dat` files in the `data_dir/catalogs` subdirectory.
+2. **Image Generation.** Then, we take the aforementioned source catalogs and use GalSim to render them as images, which are stored as `.fits` files (one for each band) in the `data_dir/images` subdirectory.
+3. **File Datum Generation.** Finally, we convert the full source catalogs generated in phase 1 into tile catalogs, stack them with their corresponding images, and store these objects as `.pt` files (which is what the encoder ultimately uses) in the `data_dir/file_data` subdirectory.
+
+The following parameters can be set within the configuration file `config.yaml`.
+1. `data_dir`: the path of the directory where generated data will be stored.
+2. `image_size`: size of the image (pixels).
+3. `tile_size`: size of each tile (pixels).
+4. `nfiles`: number of files to be generated.
+5. `n_catalogs_per_file`: number of catalogs to be stored in each file datum object.
+6. `bands`: survey bands to be used (`["g", "r", "i", "z"]` for DES).
+7. `min_flux_for_loss`: minimum flux for filtering.
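As an illustrative invocation (the override values here are hypothetical; any key under `data_gen` can be overridden from the command line in the same way, per standard Hydra syntax):

```
python data_gen.py data_gen.nfiles=10 data_gen.image_size=2560 data_gen.tile_size=256
```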
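For intuition about phase 3, here is a minimal sketch assuming file data are stored as plain dictionaries (consistent with the "Remove FileDatum, replace with Dict" commit above); every field name and shape below is illustrative, not the actual schema:

```python
# Hypothetical sketch of phase 3; field names and shapes are illustrative only.
import torch

n_catalogs_per_file = 2  # data_gen.n_catalogs_per_file in the real config (e.g. 500)
file_data = []
for _ in range(n_catalogs_per_file):
    tile_catalog = {"membership": torch.zeros(10, 10)}  # hypothetical tile-level field
    images = torch.zeros(4, 1280, 1280)                 # one channel per band (g, r, i, z)
    file_data.append({"tile_catalog": tile_catalog, "images": images})

torch.save(file_data, "file_data_0.pt")  # the encoder later consumes these .pt files
```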
case_studies/galaxy_clustering/conf/config.yaml
@@ -1,9 +1,17 @@
---
defaults:
-  - ../../bliss/conf@_here_: base_config
+  - ../../../bliss/conf@_here_: base_config
  - _self_
  - override hydra/job_logging: stdout

+data_gen:
+  data_dir: /nfs/turbo/lsa-regier/scratch/kapnadak/new_data
+  image_size: 1280
+  tile_size: 128
+  nfiles: 5000
+  n_catalogs_per_file: 500
+  bands: ["g", "r", "i", "z"]
+  min_flux_for_loss: 0

prior:
  _target_: case_studies.galaxy_clustering.prior.GalaxyClusterPrior
@@ -48,19 +56,15 @@ my_metrics:
  cluster_membership_acc:
    _target_: case_studies.galaxy_clustering.encoder.metrics.ClusterMembershipAccuracy

+my_image_normalizers:
+  asinh:
+    _target_: bliss.encoder.image_normalizer.AsinhQuantileNormalizer
+    q: [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 0.999, 0.9999, 0.99999]

encoder:
  _target_: case_studies.galaxy_clustering.encoder.encoder.GalaxyClusterEncoder
  survey_bands: ["g", "r", "i", "z"]
-  image_normalizer:
-    _target_: bliss.encoder.image_normalizer.ImageNormalizer
-    bands: [0, 1, 2, 3]
-    include_original: true
-    include_background: false
-    concat_psf_params: false
-    num_psf_params: 6 # for SDSS, 4 for DC2
-    log_transform_stdevs: null
-    use_clahe: false
-    clahe_min_stdev: null
+  image_normalizers: ${my_image_normalizers}
  mode_metrics:
    _target_: torchmetrics.MetricCollection
    _convert_: "partial"
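The configured `q` values suggest each image is expanded into several asinh-stretched channels, one per quantile. A rough sketch under that assumption (this is not the actual `AsinhQuantileNormalizer` implementation in `bliss.encoder.image_normalizer`, only an illustration of the idea):

```python
# Rough sketch of an asinh quantile stretch; NOT the actual BLISS implementation.
import torch

def asinh_quantile_normalize(image: torch.Tensor, q: list[float]) -> torch.Tensor:
    """Map one image (bands, H, W) to len(q) asinh-stretched copies per band."""
    quantiles = torch.quantile(image.flatten(), torch.tensor(q))
    # One set of channels per quantile: shift by the quantile, compress with asinh.
    channels = [torch.asinh(image - qv) for qv in quantiles]
    return torch.cat(channels, dim=0)

out = asinh_quantile_normalize(torch.randn(4, 128, 128), [0.01, 0.5, 0.99])
print(out.shape)  # torch.Size([12, 128, 128])
```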
@@ -80,14 +84,21 @@ predict:
  _target_: case_studies.galaxy_clustering.cached_dataset.CachedDESModule
  cached_data_path: /nfs/turbo/lsa-regier/scratch/gapatron/desdr-server.ncsa.illinois.edu/despublic/dr2_tiles
  tiles_per_img: 64
-  batch_size: 1
+  batch_size: 2
  num_workers: 4
  trainer:
    _target_: pytorch_lightning.Trainer
    accelerator: "gpu"
-    devices: "6,7"
+    devices: [6,5]
    strategy: "ddp"
    precision: ${train.trainer.precision}
+    callbacks:
+      - ${predict.callbacks.writer}
+  callbacks:
+    writer:
+      _target_: case_studies.galaxy_clustering.inference.inference_callbacks.DESPredictionsWriter
+      output_dir: "/data/scratch/des/dr2_detection_output/run_1"
+      write_interval: "batch"
  encoder: ${encoder}
  weight_save_path: /nfs/turbo/lsa-regier/scratch/gapatron/best_encoder.ckpt
  device: "cuda:0"
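For context, `write_interval: "batch"` matches PyTorch Lightning's `BasePredictionWriter` callback API. A minimal sketch of what a writer like `DESPredictionsWriter` might look like (the filename scheme and saved payload are assumptions, not the actual implementation):

```python
# Minimal sketch of a batch-interval prediction writer; filename scheme is hypothetical.
import os
import torch
from pytorch_lightning.callbacks import BasePredictionWriter

class DESPredictionsWriter(BasePredictionWriter):
    def __init__(self, output_dir: str, write_interval: str = "batch"):
        super().__init__(write_interval)
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)

    def write_on_batch_end(
        self, trainer, pl_module, prediction, batch_indices, batch, batch_idx, dataloader_idx
    ):
        # Persist each batch's predictions as soon as they are produced.
        torch.save(prediction, os.path.join(self.output_dir, f"preds_{dataloader_idx}_{batch_idx}.pt"))
```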

41 changes: 0 additions & 41 deletions case_studies/galaxy_clustering/data_generation/catalog_gen.py

This file was deleted.

67 changes: 0 additions & 67 deletions case_studies/galaxy_clustering/data_generation/data-gen.sh

This file was deleted.
