Sage-Bionetworks · thomasyu888 · Oct 31, 2024 · Nov 1, 2024
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2021 Sage Bionetworks
+Copyright (c) 2024 Sage Bionetworks
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

@@ -270,12 +270,17 @@ poetry debug info
 
 Before you begin, make sure you are in the latest `develop` of the repository.
 
-The following command will install the dependencies based on what we specify in the `poetry.lock` file of this repository. If this step is taking a long time, try to go back to Step 2 and check your version of `poetry`. Alternatively, you can try deleting the lock file and regenerate it by doing `poetry install` (Please note this method should be used as a last resort because this would force other developers to change their development environment)
+The following command installs the dependencies specified in the `poetry.lock` file of this repository (which is generated from the libraries listed in the `pyproject.toml` file). If this step takes a long time, go back to Step 2 and check your version of `poetry`. Alternatively, you can delete the lock file and regenerate it with `poetry lock`. (Use this method only as a last resort, because it forces other developers to change their development environments.)
 
 ```
-poetry install --all-extras
+poetry install --with dev,doc
 ```
 
+This command will install:
+* The main dependencies required for running the package.
+* Development dependencies for testing, linting, and code formatting.
+* Documentation dependencies such as `sphinx` for building and maintaining documentation.
+
 ### 5. Set up configuration files
 
 The following section will walk through setting up your configuration files with your credentials to allow for communication between `schematic` and the Synapse API.
@@ -484,12 +489,23 @@ docker run -v %cd%:/schematic \
 
 # Contributors
 
-Main contributors and developers:
 
+Sage main contributors and developers:
+
+- [Gianna Jordan](https://github.com/giajordan)
+- [Lingling Peng](https://github.com/linglp)
+- [Bryan Fauble](https://github.com/BryanFauble)
+- [Andrew Lamb](https://github.com/andrewelamb)
+- [Brad Macdonald](https://github.com/BWMac)
 - [Milen Nikolov](https://github.com/milen-sage)
+
+## Alumni
 - [Mialy DeFelice](https://github.com/mialy-defelice)
 - [Sujay Patil](https://github.com/sujaypatil96)
 - [Bruno Grande](https://github.com/BrunoGrandePhD)
-- [Robert Allaway](https://github.com/allaway)
-- [Gianna Jordan](https://github.com/giajordan)
-- [Lingling Peng](https://github.com/linglp)
+- [Jason Hwee](https://github.com/hweej)
+- [Xengie Doan](https://github.com/xdoan)
+- [James Eddy](https://github.com/jaeddy)
+- [Yooree Chae](https://github.com/ychae)
+
+See all [contributors](https://github.com/Sage-Bionetworks/schematic/graphs/contributors)
@@ -0,0 +1,70 @@
+Setting up your asset store
+===========================
+
+
+This document covers the minimal recommended elements needed in Synapse to interface with the Data Curator App (DCA) and provides options for Synapse project layout.
+
+There are two options for setting up a DCC Synapse project:
+
+1. Each team of DCC contributors has its own Synapse project that stores the team's datasets.
+2. All DCC datasets are stored in the same Synapse project.
+
+Option 1: Distributed Synapse Projects
+--------------------------------------
+
+Pick **option 1** if you answer "yes" to one or more of the following questions:
+
+- Does the DCC have multiple contributing institutions/labs, each with different data governance and access controls?
+- Does the DCC have multiple institutions with limited cross-institutional sharing?
+- Will contributors submit more than 100 datasets per release or per month?
+- Are you unwilling to annotate each DCC dataset folder with the annotation `contentType:dataset`?
+
+Option 2: Single Synapse Project
+--------------------------------
+
+Pick **option 2** if you don't select option 1 and you answer "yes" to any of these questions:
+
+- Does the DCC have a project with pre-existing datasets in a complex folder hierarchy?
+- Does the DCC envision collaboration on the same dataset collection across multiple teams with shared access controls?
+- Are you willing to set up local access control for each dataset folder and annotate each with `contentType:dataset`?
+
+If neither option fits, select option 1.
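The decision rule above can be sketched as a small function. This is only an illustration of the logic in this section: the `DccProfile` class and its field names are hypothetical and are not part of Schematic or Synapse.

```python
from dataclasses import dataclass


@dataclass
class DccProfile:
    """Yes/no answers to the questions above (hypothetical field names)."""

    # Option 1 questions
    separate_governance_per_institution: bool = False
    limited_cross_institution_sharing: bool = False
    over_100_datasets_per_release_or_month: bool = False
    unwilling_to_annotate_dataset_folders: bool = False
    # Option 2 questions
    existing_complex_folder_hierarchy: bool = False
    shared_collection_across_teams: bool = False
    willing_to_manage_folder_access_and_annotations: bool = False


def choose_layout(profile: DccProfile) -> int:
    """Return 1 (distributed projects) or 2 (single project)."""
    option_1 = any(
        [
            profile.separate_governance_per_institution,
            profile.limited_cross_institution_sharing,
            profile.over_100_datasets_per_release_or_month,
            profile.unwilling_to_annotate_dataset_folders,
        ]
    )
    if option_1:
        return 1
    option_2 = any(
        [
            profile.existing_complex_folder_hierarchy,
            profile.shared_collection_across_teams,
            profile.willing_to_manage_folder_access_and_annotations,
        ]
    )
    # If neither option fits, fall back to option 1.
    return 2 if option_2 else 1
```

Option 1 is checked first, so a "yes" to any option 1 question wins even when option 2 questions are also answered "yes".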
+
+Option 1: Access & Project Setup - Multiple Contributing Projects
+------------------------------------------------------------------
+
+1. Create a DCC Admin Team with admin permissions.
+2. Create a Team for each data contributing institution. Begin with a "Test Team" if all teams are not yet identified.
+3. Create a Synapse Project for each institution and grant the respective team **Edit** level access.
+
+   - E.g., for institutions A, B, and C, create Projects A, B, and C with Teams A, B, and C. Team A has **Edit** access to Project A, etc.
+
+4. Within each project, create top-level dataset folders in the **Files** tab for each dataset type.
+5. Create another Synapse Project (e.g., MyDCC) containing the main **Fileview** for all DCC projects.
+
+   - Ensure all teams have **Download** level access to this Fileview.
+   - Include both file and folder entities and add the columns `id`, `name`, `type`, `parentId`, and `projectId` to the Fileview schema.
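As a quick sanity check of the column requirement above, the following sketch compares a Fileview's column names against the required set. It assumes you have already obtained the column names as plain strings (for example, by copying them from the Synapse web UI); the helper name is hypothetical and no Synapse API is called.

```python
# Columns the Fileview schema should contain for Option 1, per the list above.
REQUIRED_COLUMNS = ("id", "name", "type", "parentId", "projectId")


def missing_columns(fileview_columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the Fileview, in order."""
    present = set(fileview_columns)
    return [column for column in required if column not in present]


# Example: a Fileview that lacks two of the required columns.
print(missing_columns(["id", "name", "type"]))
```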
+
+Option 2: Access & Project Setup - Single Contributing Project
+--------------------------------------------------------------
+
+1. Create a Team for each data contributing institution.
+2. Create a single Synapse Project (e.g., MyDCC).
+3. Within this project, create dataset folders for each contributor. Organize them as needed.
+
+   - Annotate each dataset folder with `contentType:dataset`. Dataset folders must not be nested inside other dataset folders and must have unique names.
+
+4. In MyDCC, create the main **DCC Fileview** with `MyDCC` as scope. Add column `contentType` to the schema and grant teams **Download** level access.
+
+   - Add both file and folder entities and ensure columns `id`, `name`, `type`, `parentId`, `projectId`, and `contentType` are included.
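The two constraints on dataset folders (unique names, no nesting) can be checked before annotating anything. The sketch below assumes the dataset folders are given as POSIX-style path strings; the function name and the example paths are hypothetical, and nothing here talks to Synapse.

```python
from collections import Counter
from pathlib import PurePosixPath


def check_dataset_folders(paths):
    """Return a list of problems for folders meant to carry contentType:dataset."""
    folders = [PurePosixPath(p) for p in paths]
    problems = []

    # Dataset folder names must be unique across the project.
    name_counts = Counter(folder.name for folder in folders)
    for name, count in sorted(name_counts.items()):
        if count > 1:
            problems.append(f"duplicate dataset folder name: {name}")

    # A dataset folder must not sit inside another dataset folder.
    for folder in folders:
        for other in folders:
            if other != folder and other in folder.parents:
                problems.append(f"{folder} is nested inside dataset folder {other}")

    return problems
```

An empty list means the layout satisfies both constraints.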
+
+External Cloud Buckets Setup
+-----------------------------
+
+If DCC contributors require external cloud buckets, select one of the following configurations:
+
+1. **Basic External Storage Bucket (Default)**:
+
+   - Create an S3 bucket for Synapse uploads via web or CLI. Contributors will upload data without needing AWS credentials.
+   - Provision an S3 bucket, attach it to the Synapse project, and create folders for specific assay types.
+
+2. **Custom Storage Location**:
+
+   - For large datasets or if contributors prefer cloud storage, enable uploads via AWS CLI or GCP CLI.
+   - Configure the custom storage location with an AWS Lambda or Google Cloud function for syncing.
+   - If using AWS, provision a bucket, set up Lambda sync, and assign IAM write access.
+   - For GCP, use Google Cloud function sync and obtain contributor emails for access.
+
+Finally, set up a `synapse-service-lambda` account for syncing external cloud buckets with Synapse, granting "Edit & Delete" permissions on the contributor's project.
@@ -2,6 +2,38 @@
 CLI Reference
 =============
 
+
+1. Generate a new manifest as a Google Sheet
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: shell
+
+   schematic manifest -c /path/to/config.yml get -dt <your data type> -s
+
+2. Grab an existing manifest from Synapse
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: shell
+
+   schematic manifest -c /path/to/config.yml get -dt <your data type> -d <your synapse dataset folder id> -s
+
+3. Validate a manifest
+~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: shell
+
+   schematic model -c /path/to/config.yml validate -dt <your data type> -mp <your csv manifest path>
+
+4. Submit a manifest as a file
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: shell
+
+   schematic model -c /path/to/config.yml submit -mp <your csv manifest path> -d <your synapse dataset folder id> -vc <your data type> -mrt file_only
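If you drive these commands from a script, one possible approach is to build the argument lists in Python and hand them to `subprocess.run`. The sketch below only assembles the arguments shown in the four examples above; the function names are hypothetical, and the data type and Synapse ID in the usage note are placeholders.

```python
import shlex


def manifest_get_cmd(config, data_type, dataset_id=None):
    """Examples 1 and 2: generate a new manifest, or fetch an existing one."""
    cmd = ["schematic", "manifest", "-c", config, "get", "-dt", data_type]
    if dataset_id:
        cmd += ["-d", dataset_id]  # only needed to grab an existing manifest
    cmd.append("-s")  # output as a Google Sheet
    return cmd


def validate_cmd(config, data_type, manifest_path):
    """Example 3: validate a CSV manifest."""
    return ["schematic", "model", "-c", config,
            "validate", "-dt", data_type, "-mp", manifest_path]


def submit_cmd(config, data_type, manifest_path, dataset_id):
    """Example 4: submit a manifest as a file."""
    return ["schematic", "model", "-c", config, "submit",
            "-mp", manifest_path, "-d", dataset_id,
            "-vc", data_type, "-mrt", "file_only"]


# Print the command instead of running it.
print(shlex.join(manifest_get_cmd("config.yml", "Biospecimen")))
```

To execute a command, pass the list to `subprocess.run(cmd, check=True)`; using a list rather than a single string avoids shell quoting problems with paths that contain spaces.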
+
+
+
+
 .. click:: schematic.__main__:main
   :prog: schematic
   :nested: full
@@ -12,6 +12,9 @@
 #
 import os
 import sys
+
+import sphinx_rtd_theme
+
 file_dir = os.path.dirname(__file__) 
 sys.path.append(file_dir)
 from utils import _parse_toml
@@ -25,20 +28,21 @@
 
 toml_metadata = _parse_toml(toml_file_path)
 project = toml_metadata["name"]
-copyright = "2022, Sage Bionetworks"
+copyright = "2024, Sage Bionetworks"
 
 author = toml_metadata["authors"]
 
 # The full version, including alpha/beta/rc tags
 release = toml_metadata["version"]
 
 
+
 # -- General configuration ---------------------------------------------------
 
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = ["sphinx_click"]
+extensions = ["sphinx_click", "sphinx_rtd_theme"]
 
 # Add any paths that contain templates here, relative to this directory.
 templates_path = ["_templates"]
@@ -55,15 +59,21 @@
 # This pattern also affects html_static_path and html_extra_path.
 exclude_patterns = []
 
+# The master toctree document.
+master_doc = "index"
 
 # -- Options for HTML output -------------------------------------------------
 
 # The theme to use for HTML and HTML Help pages.  See the documentation for
 # a list of builtin themes.
 #
-html_theme = "alabaster"
+html_theme = "sphinx_rtd_theme"
 
 # Add any paths that contain custom static files (such as style sheets) here,
 # relative to this directory. They are copied after the builtin static files,
 # so a file named "default.css" will overwrite the builtin "default.css".
 html_static_path = ["_static"]
+
+html_theme_options = {
+    'collapse_navigation': False,
+}
diff --git a/docs/source/generate.rst b/docs/source/generate.rst
@@ -0,0 +1,4 @@
+Generate
+========
+
+Provides a manifest template for users for a particular project or data type. If a project with annotations already exists, a semi-filled-out template is provided to the user so that they do not start from scratch. If there are no existing annotations, an empty manifest template is provided.
@@ -6,7 +6,63 @@
 Welcome to Schematic's documentation!
 =====================================
 
+**SCHEMATIC** is an acronym for *Schema Engine for Manifest Ingress and Curation*. The Python-based infrastructure provides a *novel* schema-based, metadata ingress ecosystem, which is meant to streamline the process of biomedical dataset annotation, metadata validation, and submission to a data repository for various data contributors.
+
+Schematic tackles these goals:
+
+- “Ensure the highest quality structured data or metadata be contributed to Synapse BEFORE it lands in Synapse”
+- “Add accountability to data contributors for the data they upload”
+- “Visualize data models and their relationships with each other”
+
+The usage of JSON-LD
+--------------------
+
+The usage of JSON-LD to capture our data models extends beyond the creation, validation, and submission of annotations/manifests into Synapse. It can create relationships between different data models and, in the future, drive transformation of data from one data model to another. Visualization of these data models and their relationships is also possible (see *Schema Visualization - Design & Platform*), which allows the community to see the depth of connections between all the data uploaded into Synapse. As with all products, we must start somewhere.
+
+
+The following are the four main endpoints that assist with the high-level goals outlined above, with additional goals to come.
+
+1. Manifest Generation
+----------------------
+
+Provides a manifest template for users for a particular project or data type. If a project with annotations already exists, a semi-filled-out template is provided to the user so that they do not start from scratch. If there are no existing annotations, an empty manifest template is provided.
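The behaviour described above (pre-fill when annotations exist, otherwise return an empty template) can be illustrated with a few lines of Python. This is a conceptual sketch only, not Schematic's implementation; the function name and the dictionary layout are assumptions made for the example.

```python
def build_manifest_template(columns, existing_annotations=None):
    """Return a template with one pre-filled row per existing annotation record."""
    rows = []
    for record in existing_annotations or []:
        # Keep only the template's columns; leave unknown values blank.
        rows.append({column: record.get(column, "") for column in columns})
    return {"columns": list(columns), "rows": rows}


# No annotations yet: the template has headers but no rows.
empty = build_manifest_template(["Filename", "Tissue"])

# Existing annotations: the user starts from a partially filled template.
prefilled = build_manifest_template(
    ["Filename", "Tissue"],
    existing_annotations=[{"Filename": "a.bam"}],
)
```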
+
+2. Validate Manifest
+--------------------
+
+Given a filled-out manifest:
+
+- The manifest is validated against the JSON-LD schema as it maps to Great Expectations (GX) rules.
+- A ``jsonschema`` is generated from the data model. The data model can be in CSV or JSON-LD format, because input formats are decoupled from the internal data model representation within Schematic.
+- A set of validation rules is defined in the data model. Some validation rules are implemented via GX; others are custom Python code. All validation rules have the same interface.
+- Certain GX rules require looping through all projects a user has access to, or a specified scope of projects, to find other projects with manifests.
+- Validation results are provided before the manifest file is uploaded into Synapse.
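To illustrate what "all validation rules have the same interface" can look like, here is a minimal sketch in which every rule is a callable that takes a row number and a row and returns a list of error messages. The rule names, the row representation, and the message format are assumptions for this example and do not reflect Schematic's actual classes.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]
Rule = Callable[[int, Row], List[str]]  # the shared interface


def required(column: str) -> Rule:
    """Rule: the column must have a non-empty value."""
    def rule(row_number: int, row: Row) -> List[str]:
        if not row.get(column, "").strip():
            return [f"row {row_number}: '{column}' is required"]
        return []
    return rule


def one_of(column: str, allowed: Iterable[str]) -> Rule:
    """Rule: a non-empty value must be one of the allowed values."""
    allowed_values = set(allowed)

    def rule(row_number: int, row: Row) -> List[str]:
        value = row.get(column, "")
        if value and value not in allowed_values:
            return [f"row {row_number}: '{column}' has unexpected value '{value}'"]
        return []
    return rule


def validate_manifest(rows: List[Row], rules: List[Rule]) -> List[str]:
    """Apply every rule to every row and collect the errors."""
    errors: List[str] = []
    for row_number, row in enumerate(rows, start=1):
        for rule in rules:
            errors.extend(rule(row_number, row))
    return errors
```

Because every rule has the same signature, a rule backed by an external library and a rule written as custom Python can be mixed freely in the `rules` list.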
+
+3. Submit Manifest
+------------------
+
+- Validates the manifest. If errors are present, the manifest is not stored.
+- If valid:
+
+  - Stores the manifest in Synapse.
+  - Uploads the manifest to a view, updating file views with annotations according to one of the following options:
+
+    - **Store manifest only**
+    - **Store manifest and annotations** (to update a file view)
+    - **Store manifest and update a corresponding Synapse table**
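The submit flow above can be summarized as "validate first, then store according to the chosen option". The following sketch models that control flow only; the `StoreMode` names and the returned dictionary are hypothetical, and no Synapse calls are made.

```python
from enum import Enum


class StoreMode(Enum):
    """The three storage options listed above (hypothetical names)."""

    MANIFEST_ONLY = "manifest only"
    MANIFEST_AND_ANNOTATIONS = "manifest and annotations"
    MANIFEST_AND_TABLE = "manifest and table"


def submit_manifest(rows, validate, mode):
    """Validate the rows, then list the storage actions for the chosen mode."""
    errors = validate(rows)
    if errors:
        # Invalid manifests are never stored.
        return {"stored": False, "errors": errors, "actions": []}

    actions = ["store manifest"]
    if mode is StoreMode.MANIFEST_AND_ANNOTATIONS:
        actions.append("update file view annotations")
    elif mode is StoreMode.MANIFEST_AND_TABLE:
        actions.append("update Synapse table")
    return {"stored": True, "errors": [], "actions": actions}
```

Here `validate` is any callable that returns a list of error messages for the given rows; an empty list means the manifest is valid.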
+
+4. Visualize Data Models
+-------------------------
+
+This endpoint allows you to visualize your data models and their relationships with each other.
+
+
 .. toctree::
    :maxdepth: 1
-
-   cli_reference
+   :hidden:
+
+   installation
+   asset_store
+   generate
+   validate
+   submit
+   cli_reference