Use Hydra for Data Generation + Routine Weekly Merge (#1045)
* Use only one fits file for sampling sources from DES.

* Update README.md

* Modified file generation to write files on the fly. Fixed color ordering.

* Fix default values for image size.

* Remove bands from config.

* Remove bands from image normalizer.

* Remove look-up code.

* Refactor inference.

* Refactor config.

* Initialize Sampler with dataset.

* Modify config to use new Asinh Normalizer.

* Modify encoder to use list of image normalizers.

* Modify data generation to use Hydra.

* Add attributes to main config.

* Instantiate Sampler with data source.

* Preliminary notebook for collating inference outputs.

* Interleave memberships to expand to 1280 x 1280 image.

* Rename notebook.

* Update README.md.

* Remove FileDatum, replace with Dict.

* Object to tile mapping script.

* Update data generation configuration parameters.

* Modify default batch size from 1 to 2.

* Add summary statistics in inference.

* Add Callback Prediction Writer support for DES inference.

* Inference callbacks module. Initial commit.

* Update output path for object-to-tile dictionary.

* Sample shapes for background from DES table.

* 500 files in one .pt file.

* Style fixes.

---------

Co-authored-by: gapatron <[email protected]>
kapnadak and gapatron authored Jul 19, 2024
1 parent 618d145 commit 83fa80f
Showing 16 changed files with 575 additions and 341 deletions.
2 changes: 1 addition & 1 deletion bliss/cached_dataset.py
@@ -69,7 +69,7 @@ def __call__(self, datum_in):

 class ChunkingSampler(Sampler):
     def __init__(self, dataset: Dataset) -> None:
-        super().__init__()
+        super().__init__(dataset)
         assert isinstance(dataset, ChunkingDataset), "dataset should be ChunkingDataset"
         self.dataset = dataset
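For context, PyTorch's `Sampler.__init__` has historically accepted the data source as its first argument, which this change now forwards. A minimal usage sketch, assuming the `ChunkingDataset` constructor argument shown here (it is hypothetical, not the actual signature):

```python
# Minimal usage sketch; the ChunkingDataset constructor argument is hypothetical.
from torch.utils.data import DataLoader
from bliss.cached_dataset import ChunkingDataset, ChunkingSampler

dataset = ChunkingDataset("data_dir/file_data")  # hypothetical path argument
sampler = ChunkingSampler(dataset)               # now forwards dataset to Sampler.__init__
loader = DataLoader(dataset, sampler=sampler, batch_size=2)
```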

39 changes: 14 additions & 25 deletions case_studies/galaxy_clustering/README.md
@@ -10,28 +10,17 @@ Built on work done in Winter 2024 by [Li Shihang](https://www.linkedin.com/in/sh

## Generation of Data

-Data can be generated by running the bash script `data-gen.sh`. The bash script has 3 options:
-1. `-n`: the number of files to be generated (defaults to 100)
-2. `-s`: the size of the image (defaults to 4800)
-3. `-t`: the tile size (defaults to 4)
-
-As an example, the command
-
-```
-bash data-gen.sh -n 10 -s 2400 -t 8
-```
-
-generates 10 images of size 2400 x 2400 tiled with a tile size of 8 x 8.
-
-The bash script `data-gen.sh` runs three scripts (located under the data_generation directory) for data generation:
-1. `catalog_gen.py`, which generates catalogs of images and stores them in the data/catalogs subdirectory. Keyword arguments: `image_size` and `nfiles`.
-2. `galsim-des.yaml`, which reads in these catalogs and uses GalSim to generate corresponding images, stored as .fits files (one for each band) in the data/images subdirectory. Keyword arguments: `image_size` and `nfiles`.
-3. `file_datum_generation.py`, which reads in the catalogs and images and saves them as *FileDatum* objects containing the tile catalog and images in a dictionary. Keyword arguments: `image_size` and `tile_size`.
-
-Often, after image data has been generated, we may want to retile it with a different tile size. This can be done by running `file_datum_generation.py` with appropriate arguments (you must pass in the image size as well, since it defaults to 4800). For example, to tile 80 x 80 images with a tile size of 8, run
-
-```
-python data_generation/file_datum_generation.py image_size=80 tile_size=8
-```
-
-Note that you must run the script from the galaxy_clustering directory (since it uses the current working directory for the data paths).
+The data generation routine proceeds through phases. The entire routine is wrapped into a single Python script, `data_gen.py`, which draws its parameters from the Hydra configuration located at `conf/config.yaml` under the `data_gen` key (see the example invocation after the parameter list below). The phases proceed as follows.
+
+1. **Catalog Generation.** First, we sample semi-synthetic source catalogs with their relevant properties, which are stored as `.dat` files in the `data_dir/catalogs` subdirectory.
+2. **Image Generation.** Then, we take the aforementioned source catalogs and use GalSim to render them as images, which are stored as `.fits` files (one for each band) in the `data_dir/images` subdirectory.
+3. **File Datum Generation.** Finally, we convert the full source catalogs generated in phase 1 into tile catalogs, stack them with their corresponding images, and store these objects as `.pt` files (which is what the encoder ultimately uses) in the `data_dir/file_data` subdirectory.
+
+The following parameters can be set within the configuration file `config.yaml`.
+1. `data_dir`: the path of the directory where generated data will be stored.
+2. `image_size`: size of the image (pixels).
+3. `tile_size`: size of each tile (pixels).
+4. `nfiles`: number of files to be generated.
+5. `n_catalogs_per_file`: number of catalogs to be stored in each file datum object.
+6. `bands`: survey bands to be used (`["g", "r", "i", "z"]` for DES).
+7. `min_flux_for_loss`: minimum flux for filtering.
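As an illustrative invocation (the override values here are hypothetical; any key under `data_gen` can be overridden from the command line in the same way, per standard Hydra syntax):

```
python data_gen.py data_gen.nfiles=10 data_gen.image_size=2560 data_gen.tile_size=256
```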
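For intuition about phase 3, here is a minimal sketch assuming file data are stored as plain dictionaries (consistent with the "Remove FileDatum, replace with Dict" commit above); every field name and shape below is illustrative, not the actual schema:

```python
# Hypothetical sketch of phase 3; field names and shapes are illustrative only.
import torch

n_catalogs_per_file = 2  # data_gen.n_catalogs_per_file in the real config (e.g. 500)
file_data = []
for _ in range(n_catalogs_per_file):
    tile_catalog = {"membership": torch.zeros(10, 10)}  # hypothetical tile-level field
    images = torch.zeros(4, 1280, 1280)                 # one channel per band (g, r, i, z)
    file_data.append({"tile_catalog": tile_catalog, "images": images})

torch.save(file_data, "file_data_0.pt")  # the encoder later consumes these .pt files
```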
case_studies/galaxy_clustering/conf/config.yaml
@@ -1,9 +1,17 @@
---
defaults:
-  - ../../bliss/conf@_here_: base_config
+  - ../../../bliss/conf@_here_: base_config
  - _self_
  - override hydra/job_logging: stdout

+data_gen:
+  data_dir: /nfs/turbo/lsa-regier/scratch/kapnadak/new_data
+  image_size: 1280
+  tile_size: 128
+  nfiles: 5000
+  n_catalogs_per_file: 500
+  bands: ["g", "r", "i", "z"]
+  min_flux_for_loss: 0

prior:
  _target_: case_studies.galaxy_clustering.prior.GalaxyClusterPrior
@@ -48,19 +56,15 @@ my_metrics:
  cluster_membership_acc:
    _target_: case_studies.galaxy_clustering.encoder.metrics.ClusterMembershipAccuracy

+my_image_normalizers:
+  asinh:
+    _target_: bliss.encoder.image_normalizer.AsinhQuantileNormalizer
+    q: [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 0.999, 0.9999, 0.99999]

encoder:
  _target_: case_studies.galaxy_clustering.encoder.encoder.GalaxyClusterEncoder
  survey_bands: ["g", "r", "i", "z"]
-  image_normalizer:
-    _target_: bliss.encoder.image_normalizer.ImageNormalizer
-    bands: [0, 1, 2, 3]
-    include_original: true
-    include_background: false
-    concat_psf_params: false
-    num_psf_params: 6 # for SDSS, 4 for DC2
-    log_transform_stdevs: null
-    use_clahe: false
-    clahe_min_stdev: null
+  image_normalizers: ${my_image_normalizers}
  mode_metrics:
    _target_: torchmetrics.MetricCollection
    _convert_: "partial"
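The configured `q` values suggest each image is expanded into several asinh-stretched channels, one per quantile. A rough sketch under that assumption (this is not the actual `AsinhQuantileNormalizer` implementation in `bliss.encoder.image_normalizer`, only an illustration of the idea):

```python
# Rough sketch of an asinh quantile stretch; NOT the actual BLISS implementation.
import torch

def asinh_quantile_normalize(image: torch.Tensor, q: list[float]) -> torch.Tensor:
    """Map one image (bands, H, W) to len(q) asinh-stretched copies per band."""
    quantiles = torch.quantile(image.flatten(), torch.tensor(q))
    # One set of channels per quantile: shift by the quantile, compress with asinh.
    channels = [torch.asinh(image - qv) for qv in quantiles]
    return torch.cat(channels, dim=0)

out = asinh_quantile_normalize(torch.randn(4, 128, 128), [0.01, 0.5, 0.99])
print(out.shape)  # torch.Size([12, 128, 128])
```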
@@ -80,14 +84,21 @@ predict:
  _target_: case_studies.galaxy_clustering.cached_dataset.CachedDESModule
  cached_data_path: /nfs/turbo/lsa-regier/scratch/gapatron/desdr-server.ncsa.illinois.edu/despublic/dr2_tiles
  tiles_per_img: 64
-  batch_size: 1
+  batch_size: 2
  num_workers: 4
  trainer:
    _target_: pytorch_lightning.Trainer
    accelerator: "gpu"
-    devices: "6,7"
+    devices: [6,5]
    strategy: "ddp"
    precision: ${train.trainer.precision}
+    callbacks:
+      - ${predict.callbacks.writer}
+  callbacks:
+    writer:
+      _target_: case_studies.galaxy_clustering.inference.inference_callbacks.DESPredictionsWriter
+      output_dir: "/data/scratch/des/dr2_detection_output/run_1"
+      write_interval: "batch"
  encoder: ${encoder}
  weight_save_path: /nfs/turbo/lsa-regier/scratch/gapatron/best_encoder.ckpt
  device: "cuda:0"
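For context, `write_interval: "batch"` matches PyTorch Lightning's `BasePredictionWriter` callback API. A minimal sketch of what a writer like `DESPredictionsWriter` might look like (the filename scheme and saved payload are assumptions, not the actual implementation):

```python
# Minimal sketch of a batch-interval prediction writer; filename scheme is hypothetical.
import os
import torch
from pytorch_lightning.callbacks import BasePredictionWriter

class DESPredictionsWriter(BasePredictionWriter):
    def __init__(self, output_dir: str, write_interval: str = "batch"):
        super().__init__(write_interval)
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)

    def write_on_batch_end(
        self, trainer, pl_module, prediction, batch_indices, batch, batch_idx, dataloader_idx
    ):
        # Persist each batch's predictions as soon as they are produced.
        torch.save(prediction, os.path.join(self.output_dir, f"preds_{dataloader_idx}_{batch_idx}.pt"))
```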

41 changes: 0 additions & 41 deletions case_studies/galaxy_clustering/data_generation/catalog_gen.py

This file was deleted.

67 changes: 0 additions & 67 deletions case_studies/galaxy_clustering/data_generation/data-gen.sh

This file was deleted.
