diff --git a/gen3/docs/gen3-resources/developer-guide/microservices.md b/gen3/docs/gen3-resources/developer-guide/microservices.md index 3b6ca654..c3199750 100644 --- a/gen3/docs/gen3-resources/developer-guide/microservices.md +++ b/gen3/docs/gen3-resources/developer-guide/microservices.md @@ -45,7 +45,7 @@ This service handles reading from and writing to a user's S3 folder containing t The Metadata Service provides an API for retrieving JSON metadata of GUIDs. It is a flexible option for "semi-structured" data (key:value mappings). The content of the MDS powers the Data Portal Discovery Page for a Data Commons. The Gen3 SDK can be used to upload and edit the metadata. This service includes a feature known as the aggregated metadata service (AggMDS), which caches metadata from the metadata services of multiple data commons. The AggMDS holds the content viewable in a Data Portal Discovery page for a Data Mesh. ## [Peregrine][peregrine github] -Peregrine is the metadata seeking service which responds to GraphQL search queries and translates them to queries over our graph-like source of truth postgres database for structured data. The GraphQL service allows Commons operators and users to precisely query only the information they are most interested in from the metadata collections. The service translates the GraphQL search into the appropriate statements which are run against the PostgreSQL backend before being returned as friendly JSON. +Peregrine is the metadata seeking service which responds to GraphQL search queries and translates them to queries over the graph-like postgres database for structured data. The service translates the GraphQL search into the appropriate statements which are run against the PostgreSQL backend before being returned as friendly JSON. ## [Requestor][requestor github] Requestor exposes an API to manage access requests. 
diff --git a/gen3/docs/gen3-resources/glossary.md b/gen3/docs/gen3-resources/glossary.md index 17fbf73a..e8ba2208 100644 --- a/gen3/docs/gen3-resources/glossary.md +++ b/gen3/docs/gen3-resources/glossary.md @@ -1,9 +1,15 @@ # Gen3 Glossary +## Allow Lists +An allow list is simply a list of users (identified based on your method of authentication) that controls which users have access to which data. It is in the form of a user.yaml file that is maintained by the operator of your Gen3 system. Gaining access may require you to sign a Data Use Agreement. Data access is granted at the program or project level. + +An allow list may also refer to a list of authorized tools within a Gen3 workspace. +## API +Gen3 services expose APIs (Application Programming Interfaces), which allow users to interact directly with the system or data without using the Data Portal or GUI. An Open API refers to an API that does not require any authentication. ## Commons Services Operations Center (CSOC) A Common Services Operations Center is an operations center operated by a commons services provider for setting up, configuring, operating, and monitoring data commons, data meshes, data hubs, and other data platforms for managing, analyzing, and sharing data. ## Crosswalk -Linking patients from across data commons where some patient data exists in commons A and additional data exists in commons B. This linkage is recorded in the metadata service. An example of how to set this up is found [here][crosswalk setup]. +Typically used for linking patients across data commons where some patient data exists in commons A and additional data exists in commons B. This linkage enables metadata associations across commons and the promise of richer datasets. Crosswalks can be made for several types of metadata and are recorded in the metadata service. An example of how to set this up is found [here][crosswalk setup]. 
## Data Commons A data commons co-locates data with cloud computing infrastructure and commonly used software services, tools, and applications for managing, integrating, analyzing and sharing data that are exposed through web portals and APIs to create an interoperable resource for a research community. A data commons provides services so that the data is findable, accessible, interoperable and reusable (FAIR) ## Data Dictionary @@ -39,12 +45,7 @@ Structured data submitted to commons are stored in PostgreSQL. Querying data fro FAIR data are data which meet the principles of findability, accessibility, interoperability, and reusability [12]. There is now an extensive literature on FAIR data. ## Framework Services Framework Services or Data Commons Framework (DCF) Services is the term used by Gen3 to refer to data mesh services in the narrow middle architecture, for data meshes, such as the NCI Cancer Research Data Commons. These are set of standards-based services with open APIs for authentication, authorization, creating and accessing FAIR data objects, and for working with bulk structured data in machine-readable, self-contained format. -## Globally Unique Identifier (GUID) -A GUID is an essentially unique identifier that is generated by an algorithm so that no central authority is needed, but rather different programs running in different locations can generate GUID with a low probability that they will collide. A common format for a GUID is the hexadecimal representation of a 128-bit binary number. -## Kubernetes -An open-source system for automating deployment, scaling, and management of containerized applications, which Gen3 is built from. -## Microservice -Microservices are a software architecture that organizes software into small, independent services that communicate over well-defined APIs. These services can be developed, set up, and scaled independently. 
A more traditional architecture is to put all the APIs and other required functionality into a single application. This is sometimes called a monolithic architecture. Microservices provide important advantages for large-scale systems that require scalability and must continue to evolve even as their code base grows very large, but increases the complexity of operating small-scale systems. + ## Flattened Data Structured data that has been processed via Tube and stored in elasticsearch to accelerate searchability. ## Gen3 Client @@ -69,6 +70,18 @@ A simple list of most relevant microservices are included below. For a descript * Tube * Workspace Token Service +## Globally Unique Identifier (GUID) +A GUID is an essentially unique identifier that is generated by an algorithm so that no central authority is needed, but rather different programs running in different locations can generate GUIDs with a low probability that they will collide. A common format for a GUID is the hexadecimal representation of a 128-bit binary number. Some external systems may use the term Universally Unique Identifier (UUID), which is essentially the same thing. +## Graph model +The graph model refers to the structured data within a Gen3 data commons. The "graph" is defined by the relationship between nodes that is specified in the data dictionary. This can be flattened via Tube and stored in elasticsearch to accelerate searchability. +## Kubernetes +An open-source system for automating deployment, scaling, and management of containerized applications, which Gen3 is built from. +## Manifest +Usually refers to a file manifest. This is a JSON-formatted file that includes GUIDs, file names, md5 checksums, and file sizes for files of interest. It can be used by the gen3 client or SDK to download files provided a user has the appropriate credentials. 
+## Microservice +Microservices are a software architecture that organizes software into small, independent services that communicate over well-defined APIs. These services can be developed, set up, and scaled independently. A more traditional architecture is to put all the APIs and other required functionality into a single application. This is sometimes called a monolithic architecture. Microservices provide important advantages for large-scale systems that require scalability and must continue to evolve even as their code base grows very large, but increases the complexity of operating small-scale systems. +## Persistent Drive +This is a directory in a Gen3 workspace that allows a user to store files that will remain available after termination of the workspace session. It will be represented as `pd`. ## Portable Format for Biomedical data (PFB) PFB is a serialization file format designed to store bio-medical data and metadata. The format is built on top Avro to make it fast, extensible and interoperable between different systems. You can find the GitHub repo [here][PFB GitHub] and the publication [here][PFB Pub]. ## Workspace @@ -84,7 +97,7 @@ Gen3 workspaces are secure data analysis environments in the cloud that can acce [Data Portal User Guide]: user-guide/portal.md [Microservices]: developer-guide/microservices.md [Gen3 client docs]: user-guide/access-data.md#installation-instructions -[SDK docs]: user-guide/search.md#exporting-structured-data-programmatically +[SDK docs]: user-guide/search.md#the-gen3-sdk [PFB GitHub]: https://github.com/uc-cdis/pypfb [PFB Pub]: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010944 [workspace use]: user-guide/analyze-data.md diff --git a/gen3/docs/gen3-resources/index.md b/gen3/docs/gen3-resources/index.md index 60aee1bb..5b3b0f1f 100644 --- a/gen3/docs/gen3-resources/index.md +++ b/gen3/docs/gen3-resources/index.md @@ -4,6 +4,6 @@ This section contains the bulk of the Gen3 technical documentation. 
It is broke * **Gen3 User Guide** - This is for the data scientist, researcher, or analyst who needs to explore, download, or analyze data found within an existing instance of Gen3. -* **Gen3 Operator Guide** - This is for those organizations who operate their own Gen3 instances. It will include content on how to deploy, configure, and maintain a Gen3 instances; configuring a data dictionary and uploading data; and customizing the frontend. +* **Gen3 Operator Guide** - This is for those organizations who operate their own Gen3 instances. It will include content on how to deploy, configure, and maintain a Gen3 instance; configure a data dictionary and upload data; and customize the frontend. * **Gen3 Developer Guide** - This is for a software engineer who wants to extend Gen3 either by contributing to the source code or by integrating Gen3 services into a larger system. This section will cover the Gen3 architecture including the individual microservices and how they interact with each other. * **Glossary** - This section can be used as reference for terminology found within Gen3 technical documentation. 
diff --git a/gen3/docs/gen3-resources/operator-guide/img/Gen3 Forum July 6 2023 - Data Models.pdf b/gen3/docs/gen3-resources/operator-guide/img/Gen3 Forum July 6 2023 - Data Models.pdf new file mode 100644 index 00000000..c2888a00 Binary files /dev/null and b/gen3/docs/gen3-resources/operator-guide/img/Gen3 Forum July 6 2023 - Data Models.pdf differ diff --git a/gen3/docs/gen3-resources/operator-guide/img/Gen3_Data_Submission_Use_Form.png b/gen3/docs/gen3-resources/operator-guide/img/Gen3_Data_Submission_Use_Form.png index 2335debb..62d753d2 100644 Binary files a/gen3/docs/gen3-resources/operator-guide/img/Gen3_Data_Submission_Use_Form.png and b/gen3/docs/gen3-resources/operator-guide/img/Gen3_Data_Submission_Use_Form.png differ diff --git a/gen3/docs/gen3-resources/operator-guide/img/Gen3_Toolbar_data_submission.png b/gen3/docs/gen3-resources/operator-guide/img/Gen3_Toolbar_data_submission.png index 4b646ea4..968ab15b 100644 Binary files a/gen3/docs/gen3-resources/operator-guide/img/Gen3_Toolbar_data_submission.png and b/gen3/docs/gen3-resources/operator-guide/img/Gen3_Toolbar_data_submission.png differ diff --git a/gen3/docs/gen3-resources/operator-guide/submit-semi-structured-data.md b/gen3/docs/gen3-resources/operator-guide/submit-semi-structured-data.md index 65504ae0..d4b96409 100644 --- a/gen3/docs/gen3-resources/operator-guide/submit-semi-structured-data.md +++ b/gen3/docs/gen3-resources/operator-guide/submit-semi-structured-data.md @@ -1,3 +1,7 @@ +--- +tags: + - submission +--- # Semi-structured Data diff --git a/gen3/docs/gen3-resources/operator-guide/submit-structured-data.md b/gen3/docs/gen3-resources/operator-guide/submit-structured-data.md index 53163b30..51b73dd6 100644 --- a/gen3/docs/gen3-resources/operator-guide/submit-structured-data.md +++ b/gen3/docs/gen3-resources/operator-guide/submit-structured-data.md @@ -1,3 +1,7 @@ +--- +tags: + - submission +--- # Structured Data (Clinical or experimental data) @@ -16,7 +20,7 @@ There are a number of 
properties that deserve special mention: * `type`: Every node has a `type` property, which is simply the name of the node. By providing the node name in the "type" property, the submission portal knows which node to put the data in. -* `id`: Every record in every node in a data commons has the unique property `id`, which is not submitted by the data contributor but rather generated on the backend. The value of the property `id` is a 128-bit UUID (a unique 32 character identifier). +* `id`: Every record in every node in a data commons has the unique property `id`, which is not submitted by the data contributor but rather generated on the backend. The value of the property `id` is a 128-bit GUID (a unique 32 character identifier). * `project_id` and `code`: Every project record in a data commons is linked to a parent `program` node and has the properties `project_id` and a `code`. The property `project_id` is the dash-separated combination of `program` and the project's `code`. For example, if your project was named 'Experiment1', and this project was part of the 'Pilot' program, the project's `project_id` would be 'Pilot-Experiment1', and the project's `code` would be 'Experiment1'. Finally, just like every record in the data commons, the project has the unique property `id`, which is not to be confused with the project's `project_id`. @@ -108,14 +112,11 @@ To submit a TSV in the data portal: 3. Click on "Submit Data" beside the project of interest to submit metadata. - ![Submit Data][submit data] - 4. Click on "Upload File". ![Upload and Submit][upload file] -5. Navigate to the TSV and click "open". The contents of the TSV should appear in the grey box -below. +5. Navigate to the TSV from your local directory and click "open". The contents of the TSV should appear in the grey box below. 6. Click "Submit". 
@@ -156,6 +157,15 @@ To confirm that a data file is properly registered, enter the GUID of a data fil > __Note:__ Gen3 users can also submit metadata using the Gen3 SDK, which is a Python library containing functions for sending standard requests to the Gen3 APIs. For example, the function `submit_file` from the **Gen3Submission** class will submit data in a spreadsheet file containing multiple records in rows to a Gen3 Commons. The code is open-source and available on [GitHub](https://github.com/uc-cdis/gen3sdk-python) along with [documentation for using it](https://uc-cdis.github.io/gen3sdk-python/_build/html/index.html). Furthermore, [this section](https://gen3.org/resources/user/analyze-data/#4-using-the-gen3-sdk) describes steps on how to get started. +### Review submitted structured data + +To review the content of submitted data, you can start from the [directions above](#begin-metadata-tsv-submissions) and instead of selecting "Upload" in Step 4, you can review the graph below. You can select particular nodes to view individual records where you have the option to delete, view, or download. + +> Note: Users who are not authorized to submit data may see a “Browse Data” button instead of “Submit Data”. These users will still have access to view the graph and individual nodes, but not to upload or delete. + +The number you see underneath the node name, for example ‘subject’, reflects the number of records in that node of the project. The “Toggle View” button is used to show or hide nodes in the data model that the project has no records for. + + ### TSV Formatting Checklist @@ -174,13 +184,7 @@ Templates can be downloaded from the data dictionary page of your commons. See If the submission throws errors or claims the submission to be invalid, it will be the submitter's task to fix the submission. The best first step is to go through the outputs from the individual entities, as seen in the previous section. 
The error fields will give a rough description of what failed the validation check. The most common problems are simple issues such as spelling errors, mislabeled properties, or missing required fields. -## Learning More About the Existing Submission - -When viewing a project, clicking on a node name will allow the user to view the records in that node. From here a user can download, view, or completely delete records associated with any project for which they have delete access. - -![Node Click][Node Click] -![Node Information][Node Informatin] @@ -192,5 +196,3 @@ When viewing a project, clicking on a node name will allow the user to view the [toolbar submission]: img/Gen3_Toolbar_data_submission.png [submit data]: img/Gen3_Data_Submission_submit_data.png [upload file]: img/Gen3_Data_Submission_Use_Form.png -[Node Click]: img/Gen3_Model_Click_highlight.png -[Node Informatin]: img/Gen3_Model_node_view.png diff --git a/gen3/docs/gen3-resources/operator-guide/submit-unstructured-data.md b/gen3/docs/gen3-resources/operator-guide/submit-unstructured-data.md index 4046bdfe..db36d14f 100644 --- a/gen3/docs/gen3-resources/operator-guide/submit-unstructured-data.md +++ b/gen3/docs/gen3-resources/operator-guide/submit-unstructured-data.md @@ -1,3 +1,7 @@ +--- +tags: + - submission +--- # Unstructured Data (Data Files) diff --git a/gen3/docs/gen3-resources/user-guide/access-data.md b/gen3/docs/gen3-resources/user-guide/access-data.md index a3cc87ef..9dc8a900 100644 --- a/gen3/docs/gen3-resources/user-guide/access-data.md +++ b/gen3/docs/gen3-resources/user-guide/access-data.md @@ -25,7 +25,7 @@ Operators can also take advantage of the Requestor Service for dynamic authoriza ## Download Files Using the Gen3-client -The gen3-client provides an easy-to-use, command-line interface for uploading and downloading data files to and from a Gen3 data commons from the terminal or command prompt, respectively. In some systems "download" may be restricted to only within a Gen3 Workspace. 
+The gen3-client provides an easy-to-use, command-line interface for uploading and downloading data files to and from a Gen3 data commons from the terminal or command prompt, respectively. In some systems "download" may be restricted to only within a Gen3 Workspace. Note that Gen3 also comes with an SDK tool that can perform many of the same functions as the client for downloading along with many other features not found in the client. You can read more about the Python SDK tool [here][sdk_github]. This guide has the following sections: @@ -74,7 +74,7 @@ To check that your copy of the client is working and confirm the version, the to ### Configure a Profile with Credentials -Before using the gen3-client to upload or download data, the gen3-client needs to be configured with API credentials downloaded from the user’s data commons Profile (via Windmill data portal): +Before using the gen3-client to upload or download data, the gen3-client needs to be configured with API credentials downloaded from the user’s data commons Profile: 1. To download the “credentials.json” from the data commons, the user should start from that common’s Windmill data portal, followed by clicking on “Profile” in the top navigation bar and then creating an API key. In the popup window which informs user an API key has been successfully created, click the “Download json” button to save a local copy of the API key. @@ -248,6 +248,15 @@ Most programs require some sort of user input to run properly. Some programs wil For example, when configuring a profile with the client, the user must specify the `configure` option and also specify the profile name, API endpoint, and credentials file by adding the flags `--profile`, `--apiendpoint` and `--cred` to the end of the command (see [configuring a profile section][config profile] above for specific examples). +### Expired Token + +Many commons have a limit to how long a token is good before it is expired. 
Once the token has expired, you may receive an error such as: + +``RequestNewAccessToken with error code 401`` + +If this happens (and you are still authorized to access the data), you can download a new API token and re-create your profile using the previously used command. + + [configure auth]: ../operator-guide/gen3-authn-methods.md @@ -257,6 +266,7 @@ For example, when configuring a profile with the client, the user must specify t [Gen3 Client]: https://github.com/uc-cdis/cdis-data-client/releases/latest +[sdk_github]: https://github.com/uc-cdis/gen3sdk-python [PATH]: access-data.md#working-from-the-command-line [img create API key]: img/Gen3_Keys.png [config profile]: access-data.md#configure-a-profile-with-credentials diff --git a/gen3/docs/gen3-resources/user-guide/analyze-data.md b/gen3/docs/gen3-resources/user-guide/analyze-data.md index befb74f2..94adba0a 100644 --- a/gen3/docs/gen3-resources/user-guide/analyze-data.md +++ b/gen3/docs/gen3-resources/user-guide/analyze-data.md @@ -37,8 +37,11 @@ Terminal sessions can also be started in the Workspace and used to download othe You can manage active notebooks and terminal processes by clicking on the tab “Running”. Click “Shutdown” to close the terminal session or Jupyter Notebook. Be sure to save your notebooks before terminating them, and again, ensure any notebooks to be accessed later are saved in the “pd” directory. +> Important: Always click the "Terminate Workspace" or "Shutdown" button when finished to release cloud resources. + ![GIF showing how manage Jupyter notebooks on the Running tab][gif Jupyter Running tab] + ## Getting Files into the Gen3 Workspace Bringing in files into the Gen3 Workspace can be achieved via the UI (directly from the Exploration page) or programmatically. Find below a description of both methods. 
@@ -59,7 +62,7 @@ The Exploration page allows to search through data and create cohorts, which can ### Getting Files into the Workspace programmatically -In order to download data files directly and programmatically from a Gen3 data commons into your workspace, install and use the [gen3-client][gen3 client] in a terminal window from your Workspace. Launch a terminal window by clicking on the “New” dropdown menu, then click on “Terminal”. +In order to download data files directly and programmatically from a Gen3 data commons into your workspace, install and use the [gen3-client][gen3 client] in a terminal window from your Workspace. Launch a terminal window by clicking on the “New” dropdown menu, then click on “Terminal”. General instructions for use of the Gen3 client can be found [here][gen3_client_instructions]. From the command line, download the latest [Linux version of the gen3-client][linux gen3 client] using the `wget` command. Next, unzip the archive and add it to your path: @@ -73,6 +76,9 @@ Now the gen3-client should be ready to use in the JupyterHub terminal. Other required files, like the `credentials.json` file which contains API keys needed to configure a profile or a download `manifest.json` file can be uploaded to the workspace by clicking on the “Upload” button or by dragging and dropping into the ‘Files’ tab. Text can also be pasted into a file by clicking “New”, then choosing “Text File”. Filenames can be changed by clicking the checkbox next to the file and then clicking the “Rename” button that appears. +![GIF demonstrating how to upload files into workspace][gif upload credentials] + + Example: ``` jovyan@jupyter-user:~$ wget https://github.com/uc-cdis/cdis-data-client/releases/download/2020.11/dataclient_linux.zip @@ -121,11 +127,14 @@ GSM1558854_Sample40_3.CEL.gz 4.20 MiB / 4.20 MiB [====================.... 
jovyan@jupyter-user:~$ mv *.gz files ``` + + + ## Working with the proxy and allow lists ### Working with the Proxy -To prevent unauthorized traffic, the Gen3 VPC utilizes a proxy. If you are using one of the custom VMs setup, there is already a line in your .bashrc file to handle traffic requests. +To prevent unauthorized traffic, the Gen3 VPC utilizes a proxy server. The proxy server acts as an intermediary between the internal network resources within the VPC and the external internet or other networks. If you are using one of the custom VMs setup, there is already a line in your .bashrc file to handle traffic requests. ```bash export http_proxy=http://cloud-proxy.internal.io:3128 @@ -138,7 +147,7 @@ https_proxy=https://cloud-proxy.internal.io:3128 aws s3 ls s3://gen3-data/ --pro ### Allow lists -Additionally, to aid Gen3 Commons security, the installation of tools from outside resources is managed through an allow list. If you have problems installing a tool you need for your work, contact [support@gen3.org](mailto:support@gen3.org) and with a list of any sites from which you might wish to install tools. After passing a security review, these can be added to the allow list to facilitate access. +Additionally, to aid Gen3 Commons security, the installation of tools from outside resources is managed through an allow list. If you have problems installing a tool you need for your work, contact [support@gen3.org](mailto:support@gen3.org) (or the appropriate Gen3 operator) with a list of any sites from which you might wish to install tools. After passing a security review, these can be added to the allow list to facilitate access. ## Using the Gen3 Python SDK To make programmatic interaction with Gen3 data commons easier, the bioinformatics team at the Center for Translational Data Science (CTDS) at University of Chicago has developed the Gen3 Python SDK, which is a Python library containing functions for sending standard requests to the Gen3 APIs. 
The code is open-source and available on [GitHub][Gen3 Python SDK Github] along with [documentation for using it][Gen3 Python SDK doc]. @@ -171,7 +180,7 @@ git pull origin master ### Examples -1.) Most requests sent to a Gen3 data commons API will require an authorization token to be sent in the request’s header. The SDK class *Gen3Auth* is used for authentication purposes, and has functions for generating these access tokens. Users do need to authenticate when using the SDK from the terminal, but do not need to authenticate once being logged in and working in the workspace of a Data Commons. +1) Most requests sent to a Gen3 data commons API will require an authorization token to be sent in the request’s header. The SDK class *Gen3Auth* is used for authentication purposes, and has functions for generating these access tokens. Users do need to authenticate when using the SDK from the terminal, but do not need to authenticate once being logged in and working in the workspace of a Data Commons. From the python shell run the following: @@ -183,7 +192,7 @@ creds = "/user/directory/credentials.json" auth = Gen3Auth(endpoint, creds) ``` -2.) The structured data in a Gen3 data commons can be created, deleted, queried and exported using functions in the *Gen3Submission* class. +2) The structured data in a Gen3 data commons can be created, deleted, queried and exported using functions in the *Gen3Submission* class. 2.1) All available programs in the data commons will be shown with the function `get_programs`. The following commands: @@ -197,10 +206,12 @@ will return: `{'links': ['/v0/submission/OpenNeuro', '/v0/submission/GEO', '/v0/ 2.2) All projects under a particular program (“OpenAccess”) will be shown with the function `get_projects`. 
The following commands: -```from gen3.submission import Gen3Submission +``` +from gen3.submission import Gen3Submission sub = Gen3Submission(endpoint, auth) sub.get_projects("OpenAccess") ``` + will return “CCLE” as the project under the program “OpenAccess”: `{'links': ['/v0/submission/OpenAccess/CCLE']}` 2.3) All structured metadata stored under one node of a project can be exported as a tsv file with the function `export_node`. The following commands: @@ -216,7 +227,7 @@ sub.export_node(program, project, node_type, fileformat, filename) ``` will return: `Output written to file: OpenAccess_CCLE_aligned_reads_file.tsv` -3.) The function `get_record` in the class *Gen3Index* is used to show all metadata associated with a given id by interacting with Gen3’s Indexd service. GUIDs can be found on the [Exploration page][Exploration page] under the `Files` tab. The following commands: +3) The function `get_record` in the class *Gen3Index* is used to show all metadata associated with a given id by interacting with Gen3’s Indexd service. GUIDs can be found on the [Exploration page][Exploration page] under the `Files` tab. 
The following commands: ``` from gen3.index import Gen3Index @@ -224,6 +235,7 @@ ind = Gen3Index(endpoint, auth) record1 = ind.get_record("92183610-735e-4e43-afd6-7b15c91f6d10") print(record1) ``` + will return: `{'acl': ['*'], 'authz': ['/programs/OpenAccess/projects/CCLE'], 'baseid': 'e9bd6198-300c-40c8-97a1-82dfea8494e4', 'created_date': '2020-03-13T16:08:53.743421', 'did': '92183610-735e-4e43-afd6-7b15c91f6d10', 'file_name': None, 'form': 'object', 'hashes': {'md5': 'cbccc3cd451e09cf7f7a89a7387b716b'}, 'metadata': {}, 'rev': '13077495', 'size': 15411918474, 'updated_date': '2020-03-13T16:08:53.743427', 'uploader': None, 'urls': ['https://api.gdc.cancer.gov/data/30dc47eb-aa58-4ff7-bc96-42a57512ba97'], 'urls_metadata': {'https://api.gdc.cancer.gov/data/30dc47eb-aa58-4ff7-bc96-42a57512ba97': {}}, 'version': None}` ## Jupyter Notebook Demos @@ -252,8 +264,10 @@ When finished, please, shut down the workspace server by clicking the “Termina [img Export Data to Workspace]: img/export_to_workspace.png [gen3 client]: https://github.com/uc-cdis/cdis-data-client/releases/latest [linux gen3 client]: https://github.com/uc-cdis/cdis-data-client/releases/latest +[gen3_client_instructions]: access-data.md#download-files-using-the-gen3-client +[gif upload credentials]: img/workspace_pd.gif - + [Gen3 Python SDK Github]: https://github.com/uc-cdis/gen3sdk-python [Gen3 Python SDK doc]: https://uc-cdis.github.io/gen3sdk-python/_build/html/index.html [Jupyter demos]: analyze-data.md#jupyter-notebook-demos diff --git a/gen3/docs/gen3-resources/user-guide/img/graph-a-project.gif b/gen3/docs/gen3-resources/user-guide/img/graph-a-project.gif deleted file mode 100755 index 93068418..00000000 Binary files a/gen3/docs/gen3-resources/user-guide/img/graph-a-project.gif and /dev/null differ diff --git a/gen3/docs/gen3-resources/user-guide/img/projects-nodes-view.png b/gen3/docs/gen3-resources/user-guide/img/projects-nodes-view.png deleted file mode 100644 index fde91ff1..00000000 Binary files 
a/gen3/docs/gen3-resources/user-guide/img/projects-nodes-view.png and /dev/null differ diff --git a/gen3/docs/gen3-resources/user-guide/img/workspace_pd.gif b/gen3/docs/gen3-resources/user-guide/img/workspace_pd.gif new file mode 100644 index 00000000..149d0986 Binary files /dev/null and b/gen3/docs/gen3-resources/user-guide/img/workspace_pd.gif differ diff --git a/gen3/docs/gen3-resources/user-guide/search.md b/gen3/docs/gen3-resources/user-guide/search.md index 21a3ccdd..ffd941ca 100644 --- a/gen3/docs/gen3-resources/user-guide/search.md +++ b/gen3/docs/gen3-resources/user-guide/search.md @@ -2,16 +2,10 @@ # Searching and Exploring Structured Data The data in a Gen3 data commons can be browsed and downloaded using several different methods. The following general documentation will cover some standard methods of data access in a Gen3 data commons. Ultimately, however, the methods of data access offered in a Gen3 data commons is determined by agreements made between the data commons’ sponsors and data contributors. -Various levels of data access can be configured in a Gen3 data commons using the Gen3 Framework Services. If open access data is hosted, a data commons can be configured to allow anonymous access to data, which means users can explore data without logging in. This is the case for the [Gen3 Data Hub][Gen3 Data Hub]. - -In cases where data is controlled access, typically external users will receive instructions on how to access data and may be required to sign a DUA (Data Use Agreement) legal document. +Various levels of data access can be configured in a Gen3 data commons using the Gen3 Framework Services. If open-access data is hosted, a data commons can be configured to allow anonymous access to metadata or structured data, which means users can explore data without logging in. This is the case for the [Gen3 Data Hub][Gen3 Data Hub]. 
However, even in the Gen3 Data Hub, where there is no required Data Use Agreement (DUA), users must still log in to generate a token in order to download the open-access files. In cases where data is controlled access or otherwise protected, users will typically receive instructions on how to access data and may be required to sign a DUA legal document or perhaps signify approval of some terms of use within the portal. The following sections provide details on how to explore and access data from within the data commons website and from the command-line by sending requests to the Gen3 open APIs. -[//]: # (Alex: I would like to see this ordered like this Data Discovery -> Data Access Approval - if controlled data -> Data Exploration. I would like to promote using the Gen3 Discovery page as the initial place to find data as that's it's intended purpose.) - -[//]: # (Alex: I started thinking through user "questions" and relating them to particular groups of endpoints in our API and thinking through how users would expect to move between parts of the product. It's old and maybe needs some dust removed, but it was done while thinking through the idea of the Data Lake. The relevant section of the feature doc is here: https://docs.google.com/document/d/1UjQjvUuasmfe_5iaEfqjFzLTHsC-8EXXNYCSyP0-k_s/edit?tab=t.0#heading=h.xp6x9p5szrmw) - ## Searching for Data from the Data Portal Gen3 offers a data portal service that creates a website with graphical tools for performing the basic functionality of a data commons, like browsing data in projects, building patient cohorts across projects, downloading metadata or data files for cohorts, and building database queries. @@ -22,7 +16,7 @@ While this may vary from system to system, to find data in a data commons you ca 3. Explore files in the Exploration Page 4. Export data to workspace or download locally depending on the requirements of your system.
-Instead of using the Discovery and Exploration pages you could instead use the [API][API instructions] for locating data of interest. +If you prefer programmatic access, you could use the [API][API instructions] for locating data of interest instead of the Discovery and Exploration pages. ### Discovery Page @@ -60,13 +54,11 @@ If the table is a list of files, there should be a button for downloading a JSON -### Access Data using the API +### Use the API All the functionality of the data portal is available by sending requests to the open APIs of the Gen3 system. Detailed API specifications of the Gen3 services can be browsed in [the developer documentation][Developer API specs]. -#### Querying Structured Data Programmatically Users can submit queries to the Gen3 APIs to access structured data across the projects in the data commons. Queries can be sent using, for example, the Python “requests” package or similar functions in other programming languages. Further details on how to send queries to the API are documented [here][API documentation]. -#### Exporting Structured Data Programmatically Users with “read” access to a project can export entire structured metadata records by sending requests to the API. Single records can be exported or all records in a specific node of a project can be retrieved. For more information, see the [documentation on using the API][API documentation]. #### The Gen3 SDK @@ -74,29 +66,29 @@ To make sending requests to the Gen3 APIs easier, the bioinformatics team at the The SDK is essentially a collection of Python wrapper functions for sending requests to the API. It is open source and can be found on [Github][Gen3 SDK GitHub pg]. Thorough documentation for the SDK can be found in the GitHub repository [documentation page][SDK doc pg].
-#### Sending Queries using the SDK -The [Gen3Submission class][Gen3Submission Python SDK class] of the Python SDK has functions for [sending queries to the API][send query to API] and also for [retrieving the graphQL schema][Retrieve graphQL schema] of the data commons. Queries can be used to pinpoint specific data of interest by providing query arguments that act as filters on records in the database and providing lists of properties to retrieve for those records. If all the structured data in a record or node is desired, as opposed to only specific properties, then see the export functions below. +##### Sending Queries using the SDK +The [Gen3Submission class][Gen3Submission Python SDK class] of the Python SDK has functions for sending queries to the API and also for retrieving the graphQL schema of the data commons. Queries can be used to pinpoint specific data of interest by providing query arguments that act as filters on records in the database and providing lists of properties to retrieve for those records. If all the structured data in a record or node is desired, as opposed to only specific properties, then see the export functions below. -#### Exporting Structured Data using the SDK -Entire structured data records can be exported as a JSON or TSV file using the [Gen3Submission Python SDK class][Gen3Submission Python SDK class]. The [`export_record`][export_record] function will export a single structured metadata record in a specific node of a specific project, whereas the [`export_node`][export_node] function will export all the structured metadata records in a specified node of a specific project. +##### Exporting Structured Data using the SDK +Entire structured data records can be exported as a JSON or TSV file using the [Gen3Submission Python SDK class][Gen3Submission Python SDK class]. 
The `export_record` function will export a single structured metadata record in a specific node of a specific project, whereas the `export_node` function will export all the structured metadata records in a specified node of a specific project. More SDK examples and how to get started with the SDK can be also found in the [analyze-data section][Using Gen3 SDK]. -### Query Page +#### Query Page While you may call the API directly as described above, Gen3 also includes an interactive interface for creating [graphQL query language][GraphQL] calls on the Query Page. This can be accessed by clicking “Query” in the top navigation bar or by navigating to the URL: [https://gen3.datacommons.io/query][Query page]. The URL https://gen3.datacommons.io can be replaced with the URL of other Gen3 data commons. This query portal has been optimized to autocomplete fields based on content, increase speed and responsiveness, pass variables, and generally make it easier for users to find information. The “Docs” button will display documentation of the queryable nodes and properties. From the GraphiQL interface of the data portal, you can switch between *Graph Model* or *Flat Model* – each using endpoints that query different databases (Postgres and ElasticSearch, respectively). Notably, the same queries can be sent to both the flat and graph model API endpoints from the command-line. -#### Graph Model +##### Graph Model In the Graph Model, our microservice *Peregrine* converts GraphiQL queries and hits the PostgreSQL database. 
![Animation showing how you can use the Peregrine UI to build a GraphQL query that hits the PostgreSQL database and returns a response][gif Peregrine] -##### List all nodes of a particular node category +###### List all nodes of a particular node category ``` { @@ -107,7 +99,7 @@ In the Graph Model, our microservice *Peregrine* converts GraphiQL queries and h } ``` -##### Find specific files by querying a datanode +###### Find specific files by querying a datanode Metadata for specific files can be obtained by including arguments in “datanode” queries. The following are some commonly used arguments (not an exhaustive list): @@ -204,11 +196,11 @@ For example, if there are 2,550 records returned, and the GraphiQL query is timi (first:1000, offset:2000) # this will return records 2000-2,550 ``` -#### Flat Model +##### Flat Model In the Flat Model, our microservice *Guppy* converts GraphiQL queries and hits the Elasticsearch database. Here, queries support Aggregations for string (bin counts; number of records that each key has) and numeric (summary statistics such as minimum, maximum, sum, etc) fields. For more details see the full description in the [Guppy repository][Guppy]. -##### Querying +###### Querying Guppy allows you to query the raw data with offset, the maximum number of rows, sorting, and filters. Queries by default return the first 10 entries. To return more entries, the query call can specify a larger number such as `(first: 100)`. Example: @@ -231,7 +223,7 @@ Example: ``` The maximum number of results returned is 10,000, which can be requested with the `(first: 10000)` argument. If you need to access more than that number, we suggest using the [Guppy download endpoint][Guppy download endpoint]. -##### Aggregations +###### Aggregations Aggregation query is wrapped within the `_aggregation` keyword. 
In total, five aggregations are feasible at the moment: @@ -254,7 +246,7 @@ query ($filter: JSON) { } ``` -##### Filtering +###### Filtering Currently, Guppy uses `JSON`-based syntax for filters. Filters can be text/string/number-based, combined, or nested. For more examples, see the full description in the [Github repository][Guppy]. @@ -285,18 +277,7 @@ query ($filter: JSON) { ### Submission Page - -#### Browsing the List of Projects -A graphical model of the structured data in individual data projects of a data commons can be browsed by navigating to the /submission endpoint of a data commons website or by clicking on the “Browse Data” or “Submit Data” buttons in the top navigation bar, for example, [https://gen3.datacommons.io/submission][Gen3 Data Submission pg]. This page lists all the data projects within a commons a user has authorization to view. Clicking the “Browse” or “Submit Data” button by a project ID will open a view of that individual project’s structured metadata graph, which can be further inspected by clicking on a node in the graph model and then viewing individual records by clicking “View” by the submitter_id or downloading all the records in that node by clicking the “Download All” button. - -> Note: Users who are authorized to submit data may see a “Submit Data” button instead of “Browse Data”, and will also be able to upload or create structured data in the project on this page. - -##### Example: Browse Data in Individual Projects -![Image showing options for browsing nodes in individual projects using either the dropdown list or project graph model][img Browse Nodes in Projects] - -In the graphical model of a data project, the number you see underneath the node name, for example ‘subject’, reflects the number of records in that node of the project. The “Toggle View” button is used to show or hide nodes in the data model that the project has no records for. 
- -![GIF showing how to view the graphical model of a project, toggling to show or hide nodes that have no records][img Graphing a project] +A graphical model of the structured data of a data commons can be browsed by navigating to the /submission endpoint of a data commons website or by clicking on the “Browse Data” or “Submit Data” buttons in the top navigation bar (if available). This section is often available even if a user does not have access to submit data. You can see more details in the [Submit Structured Data section][submit_structured]. @@ -309,12 +290,13 @@ In the graphical model of a data project, the number you see underneath the node [Discovery Page]: img/Discovery_page3.gif [BRH Discovery]: https://brh.data-commons.org/discovery +[img Discover grid]: img/grid_discovery_color_080322.png [img Gen3 Toolbar Exploration]: img/Gen3_Toolbar_exploration.png [img Explorer GIF]: img/explorer_gif_2020.gif [Gen3 Data Hub]: https://gen3.datacommons.io/ -[API instructions]: search.md#access-data-using-the-api +[API instructions]: search.md#use-the-api [Gen3 client]: access-data.md#download-files-using-the-gen3-client [Gen3 bulk download]: access-data.md#multiple-file-download-with-manifest @@ -323,21 +305,8 @@ In the graphical model of a data project, the number you see underneath the node [API documentation]: using-api.md [SDK doc pg]: https://uc-cdis.github.io/gen3sdk-python/_build/html/index.html [Gen3Submission Python SDK class]: https://uc-cdis.github.io/gen3sdk-python/_build/html/submission.html -[send query to API]: https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/submission.py#L289 -[Retrieve graphQL schema]: https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/submission.py#L327 -[export_record]: https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/submission.py#L223 -[export_node]: https://github.com/uc-cdis/gen3sdk-python/blob/master/gen3/submission.py#L255 [Using Gen3 SDK]: analyze-data.md#using-the-gen3-python-sdk [Gen3 SDK 
GitHub pg]: https://github.com/uc-cdis/gen3sdk-python - -[Gen3 Data Submission pg]: https://gen3.datacommons.io/submission -[img Browse Nodes in Projects]: img/projects-nodes-view.png -[img Graphing a project]: img/graph-a-project.gif -[GraphQL]: https://graphql.org/ -[Query page]: https://gen3.datacommons.io/query -[gif Peregrine]: img/simple_query_2020.gif -[GraphQL pagination]: http://graphql.org/learn/pagination/ -[Guppy]: https://github.com/uc-cdis/guppy/blob/master/doc/queries.md -[Guppy download endpoint]: https://github.com/uc-cdis/guppy/blob/master/doc/download.md +[submit_structured]: ../operator-guide/submit-structured-data.md diff --git a/gen3/docs/gen3-resources/user-guide/using-api.md b/gen3/docs/gen3-resources/user-guide/using-api.md index e52978dd..86ae0f54 100644 --- a/gen3/docs/gen3-resources/user-guide/using-api.md +++ b/gen3/docs/gen3-resources/user-guide/using-api.md @@ -6,7 +6,10 @@ The application programming interface (API) can be a set of code, rules, functio The beauty of a Gen3 data commons is that all the functionality of the data commons website is available by sending requests to the open APIs of the data commons. Typical requests at Gen3 include querying, [uploading][Gen3 Submit Data] or downloading data, which leads to communication between Gen3 microservices such as the data portal Windmill or the metadata submission service Sheepdog via open APIs. - **Note:** The Gen3 commons uses GraphQL as the language for querying metadata across Gen3 Data Commons. To learn the basics of writing queries in GraphQL, please visit: [http://graphql.org/learn][learn GraphQL]. +> +Note: The Gen3 commons uses GraphQL as the language for querying metadata across Gen3 Data Commons. To learn the basics of writing queries in GraphQL, please visit: [http://graphql.org/learn][learn GraphQL]. You can also try out creating and executing GraphQL queries in the [Data Portal Query Page][Query_page_instructions]. 
+ + Gen3 features a variety of API endpoints such as `/submission`, `/index`, or `/graphql`, which differ in how they access the resource and contain each a subset of REST (Representational State Transfer) APIs for networked applications. REST APIs are restricted in their interactions via HTTP request methods such as GET, POST, PATCH, PUT, or DELETE. The GET request retrieves data in read-only mode, POST typically sends data and creates a new resource, PATCH typically updates/modifies a resource, PUT typically updates/replaces a resource, and DELETE deletes a resource. At Gen3, the GET endpoint ``` @@ -66,7 +69,7 @@ ql = requests.post('https://gen3.datacommons.io/api/v0/submission/graphql/', jso print(ql.text) # display the response # Data Download via API Endpoint Request: -durl = 'https://gen3.datacommons.io/api/v0/submission///export?format=tsv&ids=' + ids[0:-1] # define the download url with the UUIDs of the records to download in "ids" list +durl = 'https://gen3.datacommons.io/api/v0/submission///export?format=tsv&ids=' + ids[0:-1] # define the download url with the GUIDs of the records to download in "ids" list dl = requests.get(durl, headers=headers) print(dl.text) # display response @@ -91,41 +94,48 @@ If an error such as "You don’t have access… " occurs, then either you do not Users with read access to a project can download individual metadata records in the project or all records in a specified node of the project using the API. 
-The API endpoint for downloading all the records in a single node of a project is: -``` -{commons-url}/api/v0/submission/{program name}/{project code}/export/?node_label={node}&format={json/tsv} - Where: -{commons-url} is the gen3 data commons url (for example, 'gen3.datacommons.io'), -{program name} is the program node's property 'name' (for example, 'GEO'), -{project code} is the project node's property 'code' (for example, 'GSE63878'), -{node} is the name (or 'node_id') of the node (e.g., 'subject'), -{json/tsv} is the format in which data will be downloaded, either 'json' or 'tsv'. -``` +**Example 1** The API endpoint for downloading all the records in a single node of a project is: + +`{commons-url}/api/v0/submission/{program name}/{project code}/export/?node_label={node}&format={json/tsv}` + +Where: +`{commons-url}` is the gen3 data commons url, +`{program name}` is the program node's property 'name', +`{project code}` is the project node's property 'code', +`{node}` is the name (or 'node_id') of the node, +`{json/tsv}` is the format in which data will be downloaded, either 'json' or 'tsv'. 
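The endpoint template above can also be filled in programmatically. The following is a minimal sketch that uses only the Python standard library and only builds the URL, without sending a request. The function name `build_node_export_url` is hypothetical (it is not part of Gen3 or the Gen3 SDK), and the sketch assumes that `{commons-url}` is a bare host name such as `gen3.datacommons.io` served over HTTPS.

```python
from urllib.parse import quote, urlencode


def build_node_export_url(commons_url, program, project, node, fileformat="tsv"):
    """Fill in the node-export endpoint template.

    Hypothetical helper for illustration only; not a Gen3 or Gen3 SDK function.
    Assumes `commons_url` is a bare host name (no scheme, no trailing slash).
    """
    if fileformat not in ("json", "tsv"):
        raise ValueError("fileformat must be 'json' or 'tsv'")
    # Percent-encode the path segments in case a program name or
    # project code contains characters that are not URL-safe.
    path = f"/api/v0/submission/{quote(program, safe='')}/{quote(project, safe='')}/export/"
    # urlencode keeps the parameters in insertion order.
    query = urlencode({"node_label": node, "format": fileformat})
    return f"https://{commons_url}{path}?{query}"


url = build_node_export_url("gen3.datacommons.io", "GEO", "GSE63878", "sample")
print(url)
```

The resulting string can be passed to `requests.get` together with an authorization header, in the same way as the download request in the script earlier on this page.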
+ For example, submitting the following API request will download all the records in the ‘sample’ node of the project ‘GEO-GSE63878’ in [the Gen3 Data Hub][Gen3 DC] as a tab-separated values file (TSV): ``` https://gen3.datacommons.io/api/v0/submission/GEO/GSE63878/export/?node_label=sample&format=tsv ``` -The API endpoint for downloading a single record in a project is as follows: -``` -{commons-url}/api/v0/submission/{program name}/{project code}/export?ids={ids}&format={json/tsv} - Where: -{commons-url} is the gen3 data commons url (for example, 'gen3.datacommons.io'), -{program name} is the program node's property 'name' (for example, 'GEO'), -{project code} is the project node's property 'code' (for example, 'GSE63878'), -{ids} is a comma separated list of the UUIDs for the records to be downloaded (e.g., '180554c0-d0e1-41f2-b5b0-47655d7975ed,3e54268c-b4a6-4cf8-bedc-1e9e49f9d6e9'), -{json/tsv} is the format in which data will be downloaded, either 'json' or 'tsv'. -``` -For example, submitting the following API request will download the two records corresponding to the UUIDs (the ‘id’ property) ‘180554c0-d0e1-41f2-b5b0-47655d7975ed’ and ‘3e54268c-b4a6-4cf8-bedc-1e9e49f9d6e9’, which are records in the ‘subject’ node of the project ‘GEO-GSE63878’ in the [Gen3 Data Hub][Gen3 DC]. This downloads a ‘subjects.tsv’ file containing those two records. + + +**Example 2** The API endpoint for downloading a single record in a project is as follows: + +`{commons-url}/api/v0/submission/{program name}/{project code}/export?ids={ids}&format={json/tsv}` + +Where: +`commons-url` is the gen3 data commons url, +`program name` is the program node's property 'name', +`project code` is the project node's property 'code', +`ids` is a comma-separated list of the GUIDs for the records to be downloaded, +`json/tsv` is the format in which data will be downloaded, either 'json' or 'tsv'.
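As a sketch of how the `ids` parameter might be assembled in code, the snippet below joins a Python list of GUIDs into the comma-separated value the endpoint expects. The helper name `build_record_export_url` is hypothetical (not part of Gen3 or the Gen3 SDK), the commons URL is assumed to be a bare host name served over HTTPS, and no request is sent.

```python
from urllib.parse import quote, urlencode


def build_record_export_url(commons_url, program, project, ids, fileformat="tsv"):
    """Fill in the record-export endpoint template for one or more GUIDs.

    Hypothetical helper for illustration only; not a Gen3 or Gen3 SDK function.
    Assumes `commons_url` is a bare host name (no scheme, no trailing slash).
    """
    if fileformat not in ("json", "tsv"):
        raise ValueError("fileformat must be 'json' or 'tsv'")
    if not ids:
        raise ValueError("at least one GUID is required")
    path = f"/api/v0/submission/{quote(program, safe='')}/{quote(project, safe='')}/export"
    # safe="," keeps the commas between GUIDs literal instead of encoding them as %2C.
    query = urlencode({"ids": ",".join(ids), "format": fileformat}, safe=",")
    return f"https://{commons_url}{path}?{query}"


guids = [
    "180554c0-d0e1-41f2-b5b0-47655d7975ed",
    "3e54268c-b4a6-4cf8-bedc-1e9e49f9d6e9",
]
record_url = build_record_export_url("gen3.datacommons.io", "GEO", "GSE63878", guids)
print(record_url)
```

Joining a list with `",".join(...)` produces the comma-separated value without a trailing separator, so nothing has to be trimmed from the end of the string afterwards.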
+ +For example, submitting the following API request will download the two records corresponding to the GUIDs (the ‘id’ property) ‘180554c0-d0e1-41f2-b5b0-47655d7975ed’ and ‘3e54268c-b4a6-4cf8-bedc-1e9e49f9d6e9’. This downloads a ‘subjects.tsv’ file containing those two records. + ``` https://gen3.datacommons.io/api/v0/submission/GEO/GSE63878/export?ids=180554c0-d0e1-41f2-b5b0-47655d7975ed,3e54268c-b4a6-4cf8-bedc-1e9e49f9d6e9&format=tsv ``` -The API endpoint for querying the metadata associated with the given UUID is as follows: -``` -{commons-url}/index/{GUID} - Where: -{commons-url} is the gen3 data commons url (for example, 'gen3.datacommons.io'), -{GUID} is the globally unique identifier (for example, 'ce214f52-1a98-4a6f-bda1-2bb2731cfd61') -``` + +**Example 3** The API endpoint for querying the metadata associated with the given GUID is as follows: + +`{commons-url}/index/{GUID}` + +Where: +`commons-url` is the gen3 data commons url, +`GUID` is the globally unique identifier. + The following request will show the metadata of the indexed record in .JSON format. Thus, a browser that includes a .JSON viewer (e.g. Firefox) or a manually installed plug-in can show the .JSON in pretty-print. 
``` @@ -136,6 +146,7 @@ https://gen3.datacommons.io/index/47c46ead-f6f5-4cc9-86b9-2354cafe8c64 [Gen3 Submit Data]: ../operator-guide/submit-structured-data.md [learn GraphQL]: http://graphql.org/learn +[Query_page_instructions]: portal.md/#query-page [microservice docs]: ../developer-guide/microservices.md [Sheepdog]: https://petstore.swagger.io/?url=https://raw.githubusercontent.com/uc-cdis/sheepdog/master/openapi/swagger.yml#/ [Peregrine]: https://petstore.swagger.io/?url=https://raw.githubusercontent.com/uc-cdis/peregrine/master/openapis/swagger.yaml diff --git a/gen3/docs/index.md b/gen3/docs/index.md index 4a72db26..f0c54ab5 100644 --- a/gen3/docs/index.md +++ b/gen3/docs/index.md @@ -10,7 +10,7 @@ Gen3 documentation is organized by the category of person interacting with Gen3: * **Gen3 User** - This is a data scientist, researcher, or analyst who needs to explore, download, or analyze data found within an existing instance of Gen3. * **Gen3 Developer** - This is a software engineer who wants to extend Gen3 either by contributing to the source code or by integrating Gen3 services into a larger system. This section will cover the Gen3 architecture including the individual microservices and how they interact with each other. -* **Gen3 Operator** - This is for those organizations who operate their own Gen3 instances. It will include content on how to Deploy or Spin up a Gen3 instances, configuring and data dictionary and uploading data, and customizing the frontend. +* **Gen3 Operator** - This is for those organizations who operate their own Gen3 instances. It will include content on how to Deploy or Spin up a Gen3 instance, configure a data dictionary and upload data, and customize the frontend.