Update README.md and update documentation #1533

Draft
wants to merge 56 commits into
base: develop
Changes from 10 commits
Commits
391aabe
.
jaymedina Oct 31, 2024
3d1fecb
Also lock sphinx-click
jaymedina Oct 31, 2024
b8d6f6c
Update README.md
thomasyu888 Nov 1, 2024
97abaaa
move documentation libraries to doc group. update README. write new l…
jaymedina Nov 1, 2024
e5cfa79
Update copyright
thomasyu888 Nov 2, 2024
a80c1c3
Merge branch 'fds-2449-fix-rtd' into thomasyu888-patch-1
thomasyu888 Nov 2, 2024
7e83244
update version and copyright
thomasyu888 Nov 2, 2024
de14646
Add in documentation
thomasyu888 Nov 2, 2024
e0d3e10
Add to documentation
thomasyu888 Nov 2, 2024
070d58b
Edit dependencies
thomasyu888 Nov 2, 2024
028f6e8
Add to documentation
thomasyu888 Nov 2, 2024
2ed9764
edit documentation
thomasyu888 Nov 2, 2024
4ea3122
Add more details
thomasyu888 Nov 2, 2024
356a722
Add in configuration
thomasyu888 Nov 2, 2024
b1493b0
Add in more terms
thomasyu888 Nov 2, 2024
92c2db6
Edit
thomasyu888 Nov 2, 2024
99bc347
Add documentation
thomasyu888 Nov 2, 2024
5b99d89
Add documentation
thomasyu888 Nov 2, 2024
ceefd9e
Add --all-extras
thomasyu888 Nov 2, 2024
55498a9
Add typing extensions
thomasyu888 Nov 2, 2024
7292112
Add details
thomasyu888 Nov 2, 2024
8d272ef
Add documentation
thomasyu888 Nov 2, 2024
625d712
Add tutorials and troubleshooting docs
thomasyu888 Nov 5, 2024
198ed52
Add tutorial for contributing manifests
thomasyu888 Nov 5, 2024
b759ba8
Add a line
thomasyu888 Nov 5, 2024
c67ce61
Add to documentation
thomasyu888 Nov 5, 2024
8cf4342
Remove precommit
thomasyu888 Nov 5, 2024
2c1f491
Add etag value error
thomasyu888 Nov 5, 2024
abef1f6
Add schematic config section
thomasyu888 Nov 5, 2024
ad3b402
Add documentation
thomasyu888 Nov 5, 2024
0479961
Add debugging string
thomasyu888 Nov 5, 2024
f9bd85f
Fix merge conflicts
thomasyu888 Nov 5, 2024
ed2b231
Add notes
thomasyu888 Nov 5, 2024
9bf1200
Use top level fodler
thomasyu888 Nov 5, 2024
b49ec85
Fix
thomasyu888 Nov 5, 2024
1d36c37
Edit
thomasyu888 Nov 5, 2024
1ee5a80
Add more documentation around asset stores
thomasyu888 Nov 5, 2024
d50e7a6
Update section title
thomasyu888 Nov 5, 2024
42101f7
Fix formatting
thomasyu888 Nov 6, 2024
3f4b563
Edit troubleshotting docs
thomasyu888 Nov 6, 2024
3abe8ef
Add more inforamtion
thomasyu888 Nov 6, 2024
fb91a47
Add notes
thomasyu888 Nov 6, 2024
0b6ef68
Update docs
thomasyu888 Nov 6, 2024
9bd4055
Update docs
thomasyu888 Nov 6, 2024
1b43078
Add data layout
thomasyu888 Nov 6, 2024
5de2a25
Fix
thomasyu888 Nov 6, 2024
35a116d
Fix lock file
thomasyu888 Nov 7, 2024
b82d982
Fix merge conflicts
thomasyu888 Nov 7, 2024
16df9fc
Fix merge fonflicts
thomasyu888 Nov 14, 2024
a0d9da8
Add in permissions
thomasyu888 Nov 19, 2024
a49a40c
Fix merge conflicts
thomasyu888 Nov 19, 2024
b1cf2dd
Merge branch 'develop' into thomasyu888-patch-1
thomasyu888 Dec 13, 2024
970d0a6
Update docs
thomasyu888 Dec 13, 2024
558f382
Merge branch 'develop' into thomasyu888-patch-1
thomasyu888 Dec 17, 2024
8fd545b
Add documentation
thomasyu888 Dec 17, 2024
500f5b9
Fix merge conflicts
thomasyu888 Dec 18, 2024
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2021 Sage Bionetworks
Copyright (c) 2024 Sage Bionetworks

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
28 changes: 22 additions & 6 deletions README.md
@@ -270,12 +270,17 @@ poetry debug info

Before you begin, make sure you are in the latest `develop` of the repository.

The following command will install the dependencies based on what we specify in the `poetry.lock` file of this repository. If this step is taking a long time, try to go back to Step 2 and check your version of `poetry`. Alternatively, you can try deleting the lock file and regenerate it by doing `poetry install` (Please note this method should be used as a last resort because this would force other developers to change their development environment)
The following command will install the dependencies based on what we specify in the `poetry.lock` file of this repository (which is generated from the libraries listed in the `pyproject.toml` file). If this step is taking a long time, go back to Step 2 and check your version of `poetry`. Alternatively, you can delete the lock file and regenerate it with `poetry lock`. Use this method only as a last resort, because it forces other developers to change their development environment.

```
poetry install --all-extras
poetry install --with dev,doc
```

This command will install:
* The main dependencies required for running the package.
* Development dependencies for testing, linting, and code formatting.
* Documentation dependencies such as `sphinx` for building and maintaining documentation.

### 5. Set up configuration files

The following section will walk through setting up your configuration files with your credentials to allow for communication between `schematic` and the Synapse API.
@@ -484,12 +489,23 @@ docker run -v %cd%:/schematic \

# Contributors

Main contributors and developers:

Sage main contributors and developers:

- [Gianna Jordan](https://github.com/giajordan)
- [Lingling Peng](https://github.com/linglp)
- [Bryan Fauble](https://github.com/BryanFauble)
- [Andrew Lamb](https://github.com/andrewelamb)
- [Brad Macdonald](https://github.com/BWMac)
- [Milen Nikolov](https://github.com/milen-sage)

## Alumni
- [Mialy DeFelice](https://github.com/mialy-defelice)
- [Sujay Patil](https://github.com/sujaypatil96)
- [Bruno Grande](https://github.com/BrunoGrandePhD)
- [Robert Allaway](https://github.com/allaway)
- [Gianna Jordan](https://github.com/giajordan)
- [Lingling Peng](https://github.com/linglp)
- [Jason Hwee](https://github.com/hweej)
- [Xengie Doan](https://github.com/xdoan)
- [James Eddy](https://github.com/jaeddy)
- [Yooree Chae](https://github.com/ychae)

See all [contributors](https://github.com/Sage-Bionetworks/schematic/graphs/contributors)
70 changes: 70 additions & 0 deletions docs/source/asset_store.rst
@@ -0,0 +1,70 @@
Setting up your asset store
===========================


This document covers the minimal recommended elements needed in Synapse to interface with the Data Curator App (DCA) and provides options for Synapse project layout.

There are two options for setting up a DCC Synapse project:

1. Each team of DCC contributors has its own Synapse project that stores the team's datasets.
2. All DCC datasets are stored in the same Synapse project.

Option 1: Distributed Synapse Projects
--------------------------------------

Pick **option 1** if you answer "yes" to one or more of the following questions:

- Does the DCC have multiple contributing institutions/labs, each with different data governance and access controls?
- Does the DCC have multiple institutions with limited cross-institutional sharing?
- Will contributors submit more than 100 datasets per release or per month?
- Are you unwilling to annotate each DCC dataset folder with the annotation `contentType:dataset`?

Option 2: Single Synapse Project
--------------------------------

Pick **option 2** if you don't select option 1 and you answer "yes" to any of these questions:

- Does the DCC have a project with pre-existing datasets in a complex folder hierarchy?
- Does the DCC envision collaboration on the same dataset collection across multiple teams with shared access controls?
- Are you willing to set up local access control for each dataset folder and annotate each with `contentType:dataset`?

If neither option fits, select option 1.
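
The decision rules above can be sketched as a small helper. This is only an illustration of the logic described in this section; the function name and the boolean-list inputs are assumptions and are not part of Schematic or the DCA.

```python
def choose_layout(option1_answers, option2_answers):
    """Return 1 or 2 following the decision rules in this section.

    Each argument is an iterable of booleans, one per question in the
    corresponding list (True means "yes"). Hypothetical helper.
    """
    if any(option1_answers):
        # Any "yes" to an option 1 question selects distributed projects.
        return 1
    if any(option2_answers):
        # Otherwise, any "yes" to an option 2 question selects a single project.
        return 2
    # Neither list matched: the document says to fall back to option 1.
    return 1


# No option 1 triggers, but the DCC has pre-existing datasets in a
# complex folder hierarchy (first option 2 question).
print(choose_layout([False, False, False, False], [True, False, False]))
```

Note that option 1 takes precedence: a single "yes" in the first list selects it regardless of the answers to the second list.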

Option 1: Access & Project Setup - Multiple Contributing Projects
------------------------------------------------------------------

1. Create a DCC Admin Team with admin permissions.
2. Create a Team for each data contributing institution. Begin with a "Test Team" if not all teams have been identified yet.
3. Create a Synapse Project for each institution and grant the respective team **Edit** level access.
- E.g., for institutions A, B, and C, create Projects A, B, and C with Teams A, B, and C. Team A has **Edit** access to Project A, etc.
4. Within each project, create top-level dataset folders in the **Files** tab for each dataset type.
5. Create another Synapse Project (e.g., MyDCC) containing the main **Fileview** for all DCC projects.
- Ensure all teams have **Download** level access to this file view.
- Include both file and folder entities and add the columns `id`, `name`, `type`, `parentId`, and `projectId` to the Fileview schema.

Option 2: Access & Project Setup - Single Contributing Project
--------------------------------------------------------------

1. Create a Team for each data contributing institution.
2. Create a single Synapse Project (e.g., MyDCC).
3. Within this project, create dataset folders for each contributor. Organize them as needed.
- Annotate each dataset folder with `contentType:dataset`. Dataset folders must have unique names and must not be nested inside other dataset folders.
4. In MyDCC, create the main **DCC Fileview** with `MyDCC` as scope. Add column `contentType` to the schema and grant teams **Download** level access.
- Add both file and folder entities and ensure columns `id`, `name`, `type`, `parentId`, `projectId`, and `contentType` are included.
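
The constraints for the single-project layout (required Fileview columns, unique dataset folder names, no nested dataset folders) can be checked with a sketch like the one below. It works on plain Python data rather than calling Synapse; the function name and the row shape (dicts with `id`, `name`, `parentId`, `contentType`) are assumptions made for illustration, for example rows exported from the Fileview.

```python
from collections import Counter

# Columns the text above asks for in the single-project Fileview.
REQUIRED_COLUMNS = {"id", "name", "type", "parentId", "projectId", "contentType"}


def check_fileview(columns, rows):
    """Return a list of problems; an empty list means the layout looks OK.

    columns: iterable of column names in the Fileview schema.
    rows: list of dicts with id, name, parentId, contentType (assumed shape).
    """
    problems = []
    missing = REQUIRED_COLUMNS - set(columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")

    datasets = [r for r in rows if r.get("contentType") == "dataset"]
    dataset_ids = {r["id"] for r in datasets}

    # Dataset folder names must be unique.
    counts = Counter(r["name"] for r in datasets)
    for name, n in sorted(counts.items()):
        if n > 1:
            problems.append(f"duplicate dataset folder name: {name}")

    # A dataset folder must not sit inside another dataset folder,
    # so walk up the parent chain of each dataset folder.
    parent_of = {r["id"]: r.get("parentId") for r in rows}
    for r in datasets:
        ancestor = parent_of.get(r["id"])
        seen = set()
        while ancestor is not None and ancestor not in seen:
            if ancestor in dataset_ids:
                problems.append(f"{r['name']} is nested inside dataset {ancestor}")
                break
            seen.add(ancestor)
            ancestor = parent_of.get(ancestor)
    return problems
```

Running a check like this before pointing the DCA at the Fileview may catch layout mistakes early, although it does not replace reviewing permissions in Synapse.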

External Cloud Buckets Setup
-----------------------------

If DCC contributors require external cloud buckets, select one of the following configurations:

1. **Basic External Storage Bucket (Default)**:
- Create an S3 bucket for Synapse uploads via web or CLI. Contributors will upload data without needing AWS credentials.
- Provision an S3 bucket, attach it to the Synapse project, and create folders for specific assay types.

2. **Custom Storage Location**:
- For large datasets or if contributors prefer cloud storage, enable uploads via AWS CLI or GCP CLI.
- Configure the custom storage location with an AWS Lambda or Google Cloud function for syncing.
- If using AWS, provision a bucket, set up Lambda sync, and assign IAM write access.
- For GCP, use Google Cloud function sync and obtain contributor emails for access.

Finally, set up a `synapse-service-lambda` account for syncing external cloud buckets with Synapse, granting "Edit & Delete" permissions on the contributor's project.
32 changes: 32 additions & 0 deletions docs/source/cli_reference.rst
@@ -2,6 +2,38 @@
CLI Reference
=============


1. Generate a new manifest as a Google Sheet
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: shell

   schematic manifest -c /path/to/config.yml get -dt <your data type> -s

2. Grab an existing manifest from Synapse
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: shell

   schematic manifest -c /path/to/config.yml get -dt <your data type> -d <your synapse dataset folder id> -s

3. Validate a manifest
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: shell

   schematic model -c /path/to/config.yml validate -dt <your data type> -mp <your csv manifest path>

4. Submit a manifest as a file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: shell

   schematic model -c /path/to/config.yml submit -mp <your csv manifest path> -d <your synapse dataset folder id> -vc <your data type> -mrt file_only
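
When these commands are scripted, it can help to build the argument list in Python and hand it to `subprocess.run`. The sketch below only assembles the arguments using the flags shown above; the helper names and the example values (`Patient`, `manifest.csv`, `syn123`) are placeholders and not part of Schematic.

```python
import shlex


def validate_command(config_path, data_type, manifest_path):
    """Argument list for `schematic model ... validate` (flags as shown above)."""
    return [
        "schematic", "model", "-c", config_path,
        "validate", "-dt", data_type, "-mp", manifest_path,
    ]


def submit_command(config_path, manifest_path, dataset_id, data_type):
    """Argument list for submitting a manifest as a file only."""
    return [
        "schematic", "model", "-c", config_path,
        "submit", "-mp", manifest_path, "-d", dataset_id,
        "-vc", data_type, "-mrt", "file_only",
    ]


# Print the command for inspection. To execute it, pass the list to
# subprocess.run(cmd, check=True) in an environment where schematic is installed.
cmd = validate_command("config.yml", "Patient", "manifest.csv")
print(shlex.join(cmd))
```

Passing a list instead of a single string avoids shell quoting problems when paths contain spaces.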




.. click:: schematic.__main__:main
   :prog: schematic
   :nested: full
16 changes: 13 additions & 3 deletions docs/source/conf.py
@@ -12,6 +12,9 @@
#
import os
import sys

import sphinx_rtd_theme

file_dir = os.path.dirname(__file__)
sys.path.append(file_dir)
from utils import _parse_toml
@@ -25,20 +28,21 @@

toml_metadata = _parse_toml(toml_file_path)
project = toml_metadata["name"]
copyright = "2022, Sage Bionetworks"
copyright = "2024, Sage Bionetworks"

author = toml_metadata["authors"]

# The full version, including alpha/beta/rc tags
release = toml_metadata["version"]



# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ["sphinx_click"]
extensions = ["sphinx_click", "sphinx_rtd_theme"]

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
@@ -55,15 +59,21 @@
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = []

# The master toctree document.
master_doc = "index"

# -- Options for HTML output -------------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "alabaster"
html_theme = "sphinx_rtd_theme"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]

html_theme_options = {
'collapse_navigation': False,
}
4 changes: 4 additions & 0 deletions docs/source/generate.rst
@@ -0,0 +1,4 @@
Generate
========

Provides a manifest template for users for a particular project or data type. If a project with annotations already exists, a semi-filled-out template is provided to the user so that they do not start from scratch. If there are no existing annotations, an empty manifest template is provided.
60 changes: 58 additions & 2 deletions docs/source/index.rst
@@ -6,7 +6,63 @@
Welcome to Schematic's documentation!
=====================================

**SCHEMATIC** is an acronym for *Schema Engine for Manifest Ingress and Curation*. The Python-based infrastructure provides a *novel* schema-based, metadata ingress ecosystem, which is meant to streamline the process of biomedical dataset annotation, metadata validation, and submission to a data repository for various data contributors.

Schematic tackles these goals:

- “Ensure the highest quality structured data or metadata be contributed to Synapse BEFORE it lands in Synapse”
- “Add accountability to data contributors for the data they upload”
- “Visualize data models and their relationships with each other”

The usage of JSON-LD
--------------------

The usage of JSON-LD to capture our data models extends beyond the creation, validation, and submission of annotations/manifests into Synapse. It can create relationships between different data models and, in the future, drive transformation of data from one data model to another. Visualization of these data models and their relationships is also possible (see *Schema Visualization - Design & Platform*), which allows the community to see the depth of connections between all the data uploaded into Synapse. As with all products, we must start somewhere.


The following are the three main endpoints that assist with the high-level goals outlined above, with additional goals to come.

1. Manifest Generation
----------------------

Provides a manifest template for users for a particular project or data type. If a project with annotations already exists, a semi-filled-out template is provided to the user so that they do not start from scratch. If there are no existing annotations, an empty manifest template is provided.

2. Validate Manifest
--------------------

Given a filled-out manifest:

- The manifest is validated against the JSON-LD schema as it maps to GX (Great Expectations) rules.
- A ``jsonschema`` is generated from the data model. The data model can be in CSV or JSON-LD format, because input formats are decoupled from the internal data model representation within Schematic.
- A set of validation rules is defined in the data model. Some validation rules are implemented via GX; others are custom Python code. All validation rules have the same interface.
- Certain GX rules require looping through all projects a user has access to, or a specified scope of projects, to find other projects with manifests.
- Validation results are provided before the manifest file is uploaded into Synapse.
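
To illustrate what "all validation rules have the same interface" can mean in practice, here is a minimal sketch. It is not Schematic's actual implementation: the `Rule` signature, the two example rules, and the `validate` helper are all assumptions chosen to show how rules backed by different mechanisms can be called uniformly.

```python
from typing import Callable, Dict, List

# Assumed interface: a rule takes a column name and its values and
# returns a list of error messages (empty when the column passes).
Rule = Callable[[str, List[str]], List[str]]


def required(column: str, values: List[str]) -> List[str]:
    """Hypothetical rule: every cell must be non-empty."""
    return [
        f"{column}: row {i} is empty"
        for i, v in enumerate(values, start=1)
        if v == ""
    ]


def unique(column: str, values: List[str]) -> List[str]:
    """Hypothetical rule: values must not repeat within the column."""
    seen = set()
    errors = []
    for i, v in enumerate(values, start=1):
        if v in seen:
            errors.append(f"{column}: row {i} duplicates '{v}'")
        seen.add(v)
    return errors


def validate(manifest: Dict[str, List[str]], rules: Dict[str, List[Rule]]) -> List[str]:
    """Apply every rule to its column and collect all errors."""
    errors: List[str] = []
    for column, column_rules in rules.items():
        for rule in column_rules:
            errors.extend(rule(column, manifest.get(column, [])))
    return errors
```

Because every rule shares one signature, the caller does not need to know whether a rule is implemented with GX or with custom Python code.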

3. Submit Manifest
------------------

- Validates the manifest. If errors are present, the manifest is not stored.
- If valid:
- Stores the manifest in Synapse.
- Uploads the manifest to a view, updating file views with annotations as follows:

- **Store manifest only**
- **Store manifest and annotations** (to update a file view)
- **Store manifest and update a corresponding Synapse table**

4. Visualize Data Models
-------------------------

This endpoint allows you to visualize your data models and their relationships with each other.


.. toctree::
:maxdepth: 1

cli_reference
:hidden:

installation
asset_store
generate
validate
submit
cli_reference