20 changes: 10 additions & 10 deletions doc/cellmap/fibsem-transfer-flow.md
```mermaid
graph TD
SCOPE_DAT_A(<b>jeiss scope</b><br> z1_0-0-0.dat, z1_0-0-1.dat, <br> z1_0-1-0.dat, z1_0-1-1.dat) --> TRANSFER_DAT
SCOPE_DAT_B(<b>jeiss scope</b><br> zn_0-0-0.dat, zn_0-0-1.dat, <br> zn_0-1-0.dat, zn_0-1-1.dat) --> TRANSFER_DAT
TRANSFER_DAT{{transfer}} --> DM11_DAT_A & DM11_DAT_B
DM11_DAT_A(<b>prfs</b><br><s> z1_0-0-0.dat, z1_0-0-1.dat, <br> z1_0-1-0.dat, z1_0-1-1.dat </s>) --> DAT_TO_H5
DM11_DAT_B(<b>prfs</b><br><s> zn_0-0-0.dat, zn_0-0-1.dat <br> zn_0-1-0.dat, zn_0-1-1.dat </s>) --> DAT_TO_H5
DAT_TO_H5{{ convert and <br> remove prfs dats <br> after verification }} --> DM11_RAW_H5 & NRS_ALIGN_H5
DM11_RAW_H5(<b>prfs</b><br><s> z1.raw.h5, ...<br> zn.raw.h5 </s>) --> ARCHIVE_H5
NRS_ALIGN_H5(<b>nrs</b><br> z1.uint8.h5, ...<br> zn.uint8.h5)
ARCHIVE_H5{{archive and <br> remove prfs raw h5s <br> after verification}} --> NEARLINE_RAW_H5
NEARLINE_RAW_H5(<b>nearline</b><br> z1.raw-archive.h5, ...<br> zn.raw-archive.h5)
```
49 changes: 49 additions & 0 deletions doc/cellmap/project_management.md
# Project management
All projects are tracked on the [recon_fibsem repository](https://github.com/JaneliaSciComp/recon_fibsem) on GitHub.
This repository is used for documenting and exchanging the information necessary to ingest and process new datasets, tracking the progress of the processing, and reviewing the processed datasets.

```mermaid
sequenceDiagram
participant F as FIB-SEM <br/> shared resource
participant S as SciComp
participant G as GitHub
participant C as Collaborator

F->>G: Create new issue
G-->>S: Notify
activate S
S->>G: Include in project board
Note over S: Ingestion and processing
loop Review
S->>G: Post updates and questions
G-->>C: Notify
C->>G: Review and feedback
G-->>S: Notify
end
S->>C: Finalize and hand off
deactivate S
```

## Issues
For every new dataset, an issue is opened. The information to include and the processes this triggers are detailed in [part 1 of the pipeline documentation](steps/2-1_render_import.md). Here, we concentrate on managing the issues.

GitHub is mainly used to track the progress of the processing and to communicate with the collaborators. This can be sped up by assigning the issue to the responsible person, using labels, and tagging the people whose feedback is needed.

### Labels

There are three different categories of labels:

- **Workflow labels**: `00: imaging in progress`, `01: prep done`, `02: alignment done`, `03: alignment review done`, `04: z correction done`, `05: streak correction done`, `06: intensity correction done`, `07: straightening done`, `09: clean-up done`
- **Collaborator labels**: `Cell Map`, `eFIB-SEM SR` (the FIB-SEM shared resource), `Fly EM`
- **TODO labels**: `needs alignment`, `needs clean-up`, `needs cross-correlation`, `needs N5 creation`, `needs review`, `needs straightening`, `needs streak correction`, `non-rigid tile deformations`, `alignment review concerns`

Each dataset should have the right collaborator label. Apart from that, there are no strict rules about labeling. Use the TODO labels as needed to keep track of what still needs to be done. The workflow labels document the processing steps that have already been completed. When processing of a dataset is finished, all workflow labels below 06 are usually removed, since they are recorded in the issue history anyway.
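
Labels and assignees can also be set from the command line with the GitHub CLI (a minimal sketch, assuming `gh` is installed and authenticated; the issue number and username are placeholders):

```bash
# Sketch: add the collaborator label and a TODO label, and assign the issue
# (issue number and username are placeholders).
gh issue edit 123 --repo JaneliaSciComp/recon_fibsem \
    --add-label "Cell Map" --add-label "needs alignment" \
    --add-assignee <github-username>
```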

### Project board

All issues are collected on the [project board](https://github.com/JaneliaSciComp/recon_fibsem/projects/1), which has four columns:

- **Alignment**: All new issues are automatically entered in this column and stay there during all the standard processing steps.
- **Review**: Once processing is done, the issue is moved to this column, where Preibisch and the collaborators have a final look at it. If further processing is necessary, the issue stays in this column until all concerns have been addressed.
- **Done**: Once the dataset passes the review process, it is moved to this column to signal that we don’t do further processing on the dataset. The dataset is now ready to be handed off to the collaborator.
- **Cleaned Up**: Once the collaborator has taken over the dataset and has indicated that no further processing is necessary, the issue is moved to this column. At this stage, all intermediate data used for processing (including the 8-bit images for alignment) are deleted. This clean-up is done only irregularly, when disk space is scarce.
68 changes: 68 additions & 0 deletions doc/cellmap/steps/1_transfer.md
# Transfer data from scope to shared filesystem

```bash
# Commonly used paths
EMRP_ROOT="/groups/flyem/data/render/git/EM_recon_pipeline"
TRANSFER_INFO_DIR="src/resources/transfer_info/cellmap"
TRANSFER_DIR="/groups/flyem/home/flyem/bin/dat_transfer/2022"
```

## Get necessary metadata
```mermaid
flowchart LR
github(Issue on<br>GitHub)
json{{<tt>volume_transfer_info.&lt;dataset&gt;.json</tt>}}
scope(Scope)
github-- Extract data --->json
json<-. Check data ..->scope

```
There is an issue template that has to be filled out by the FIB-SEM shared resource to gather the data necessary for subsequent processing. This information should be converted to a file called `volume_transfer_info.<dataset>.json`, where `<dataset>` is the dataset name as specified by the collaborator. The project name must not contain dashes, because they cause problems with MongoDB. Put this file under `${TRANSFER_INFO_DIR}` in your own copy of this repository and commit it.
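
For illustration only, such a file might look roughly like this (a sketch with hypothetical values; the field names are taken from the checks described below, and the existing files under `${TRANSFER_INFO_DIR}` are the authoritative reference for the actual schema):

```bash
# Sketch only: structure and values are illustrative, not the authoritative schema.
cat > "${EMRP_ROOT}/${TRANSFER_INFO_DIR}/volume_transfer_info.<dataset>.json" <<'EOF'
{
  "data_set_id": "jrc_example_dataset",
  "root_dat_path": "/cygdrive/d/images/jrc_example_dataset",
  "root_keep_path": "/cygdrive/d/UploadFlags",
  "rows_per_z_layer": 2,
  "columns_per_z_layer": 4
}
EOF
```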

It is best to double-check the validity of the data from the GitHub issue; some post-processing is sometimes necessary to make sure all required data are present and correct. To this end, log in to the scope as user `flyem` and navigate to the `root_keep_path` specified in the issue, e.g.:
```bash
su flyem
ssh jeiss7.hhmi.org
cd /cygdrive/d/UploadFlags
```
Make sure that the following data are correct:
* `data_set_id` and `root_dat_path`, which can be directly seen from the filenames listed in the directory; the `root_dat_path` should not have any timestamp in it;
* `columns_per_z_layer` and `rows_per_z_layer`, which can be inferred from the file suffix, e.g., `*_0-1-3.dat` being the highest suffix means that there are 2 rows and 4 columns (see the sketch below).
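
A quick way to sanity-check the grid size from the scope directory (a sketch, assuming the suffix pattern shown above):

```bash
# Print the highest tile suffix; e.g. 0-1-3 implies 2 rows and 4 columns (indices are zero-based).
ls *.dat | sed -e 's/.*_//' -e 's/\.dat$//' | sort -t- -k2,2n -k3,3n | tail -1
```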

This is also a good point to check if there is enough space on the collaborator's drives for the data and to increase the quota if necessary. The correct group for billing can be found with `lsfgroup <username of the collaborator>`.
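
For example (a sketch; the group directory path is illustrative, and `df` only gives a rough view of free space rather than the actual quota):

```bash
# Look up the collaborator's billing group, then get a rough idea of free space on their share.
lsfgroup <username of the collaborator>
df -h /groups/<collaborator group>
```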

## Transfer and conversion
```mermaid
flowchart LR
jenkins(Jenkins<br>server)
scope(Scope)
prfs(prfs)
nrs(nrs)
nearline(nearline)
jenkins-.->|"Initiate transfer (30min)"|scope
scope-->|Raw .dat|prfs
jenkins-.->|"Initiate conversion (2h)"|prfs
prfs--->|16-bit HDF5|prfs
prfs--->|8-bit HDF5 for processing|nrs
prfs--->|move 16-bit HDF5 after conversion<br>remove corresponding .dat|nearline
```
To set up the transfer, make sure that `volume_transfer_info.<dataset>.json` is in `${EMRP_ROOT}/${TRANSFER_INFO_DIR}`. Then, go to `${TRANSFER_DIR}` and execute
```bash
./00_setup_transfer.sh cellmap/volume_transfer_info.<dataset>.json
```
This will copy the json file into `${TRANSFER_DIR}/config`, where it can be found by the processes run by the Jenkins server.
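
A quick sanity check (a sketch) is to confirm that the file actually arrived in the config directory:

```bash
# Verify that the transfer info was copied where the Jenkins processes expect it.
ls -l "${TRANSFER_DIR}/config/" | grep "volume_transfer_info.<dataset>"
```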

### Configuring and starting the Jenkins server for transfer
1. Log in to [the server](https://jenkins.int.janelia.org) and navigate to your scope under the FlyEM tab.
2. If the process is not enabled, enable it.
3. To test the process, go to the build steps in the configuration menu, set a short duration (e.g., 2 min instead of 29 min), and hit `Build now`. Make sure that the test run doesn't overlap with a scheduled run (which happens every 30 min; look at past runs and note that the times are in GMT).
4. If the run is successful (check the run's console output), set the duration back to 29 min and save the configuration.

The Jenkins process for conversion should always be enabled and runs every two hours. Note that everything resides under the FlyEM tab even if the current acquisition is done by the FIB-SEM shared resource for another collaborator; the shared resource was initially founded for FlyEM, so the name is historical.

## Set up processing directory
To set up a directory for subsequent processing, execute `11_setup_volume.sh` in `${TRANSFER_DIR}`. This will copy all relevant scripts for processing from this directory to a directory on `prfs` specified in `volume_transfer_info.<dataset>.json`.
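
A minimal sketch of the invocation (assuming, without confirmation, that the script takes the same transfer-info argument as `00_setup_transfer.sh`; check the script header for the actual arguments):

```bash
# Assumption: the argument mirrors 00_setup_transfer.sh; verify against the script itself.
cd "${TRANSFER_DIR}"
./11_setup_volume.sh cellmap/volume_transfer_info.<dataset>.json
```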

## Update GitHub issue
To automatically create a text with all relevant links for the GitHub issue, execute `gen_github_text.sh` in the processing directory. This text can be used to update the issue description.
Furthermore, assign the issue to the person responsible for the next steps, add the right labels, and add the issue to the project board.
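
The description update can also be sketched with the GitHub CLI (assuming `gh` is available and that the output of `gen_github_text.sh` is redirected to a file; the filename and issue number are placeholders):

```bash
# Sketch: capture the generated text, then use it as the issue description.
./gen_github_text.sh > github_text.md        # output filename is a placeholder
gh issue edit <issue-number> --repo JaneliaSciComp/recon_fibsem --body-file github_text.md
```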
25 changes: 25 additions & 0 deletions doc/cellmap/steps/2-1_render_import.md
# Reconstruction Part 1: Render Import
```bash
# Commonly used paths
RENDER_DIR="/groups/flyem/data/render/"
PROCESSING_DIR="<path to the processing directory set up in the previous step>"
```
We assume that the data has been successfully transferred and converted to HDF5 format, and that there is a directory set up for processing which we call `PROCESSING_DIR` (see [transfer documentation](1_transfer.md) for details). The first step for processing is to import the data into the Render database so that it can be accessed by the render clients.

## Generate TileSpecs
Go to a machine with access to `${RENDER_DIR}`, navigate to `${PROCESSING_DIR}` and run the following command:
```bash
./07_h5_to_render.sh
```
This launches a local Dask job that generates TileSpecs from the HDF5 files and uploads them to the render database. All necessary information is read from the `volume_transfer_info.<dataset>.json` file created in the previous step. After this step, a dynamically rendered stack can be accessed in the point match explorer and viewed in neuroglancer.
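
One way to sanity-check the import (a sketch, assuming a standard render web-service deployment; host, owner, project, and stack are placeholders for your setup) is to query the z values of the new stack:

```bash
# Sketch: list the z values of the freshly imported stack via the render web service.
curl "http://<render-host>:8080/render-ws/v1/owner/<owner>/project/<project>/stack/<stack>/zValues"
```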

## Set up preview volume: export to N5
NOTE: this is a feature under active development and the process will likely change in the near future. For now, the following steps are necessary.

Go to the Janelia compute cluster (e.g., `ssh login1.int.janelia.org`), navigate to `${PROCESSING_DIR}` and run the following command:
```bash
./99_append_to_export.sh <number of executors>
```
This will submit a couple of Spark cluster jobs and set up logging directories. The number of executors should be chosen based on the size of the dataset; currently, each executor occupies 10 cores on the cluster, and 3 more cores are needed for the driver. Script logs usually reside in `${PROCESSING_DIR}/logs`, but the Spark jobs set up additional log directories for each executor and the driver. These directories are printed to the console when executing the above command.
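
For example, under the sizing rule above, requesting 20 executors occupies roughly 20 × 10 + 3 = 203 cluster cores:

```bash
# Example sizing: 20 executors => 20 * 10 + 3 = 203 cores in total.
./99_append_to_export.sh 20
```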

Once finished, the location of the final N5 volume can be found either in the driver logs or by running `gen_github_text.sh` again.