* feat(metadata): add Gen3Metadata for communication with the new metadata service, tests are WIP
* feat(metadata): add auth to metadata calls
* feat(utils): add common utils file for functions
* fix(metadata): cleanup and fix bugs in existing code, update docs
* feat(index): add async functions for querying indexd
* feat(metadata-tools): add initial script/tool for adding to the metadata service from a file
* feat(mds): docstrings, support updating for metadata ingestion, use asyncio for mds calls, centralize backoff cfg in utils
* chore(backoff): use default settings in indexd class for backoff
* chore(refactor): add docs for metadata capabilities, cleanup other docs and rename things to be more clear, add missing imports to utils
* feat(tests): update unit tests, fix some quick issues
* fix(metadata): when localhost don't use admin endpoint suffix
* fix(download): no spaces between csv header (so columns don't have leading spaces)
* feat(merge): tool for merging indexing and metadata manifests
* Apply automatic documentation changes
* Apply automatic documentation changes
* Apply automatic documentation changes
* fix(ingestion): fix bugs, handle None empty column in row and catch 404 exception when missing record
* Apply automatic documentation changes
* fix(metadata): improve logging, cleanup docstring, more clear popping from queue loop, handle exception on missing guid
* Apply automatic documentation changes
* fix(indexing): remove possibility of trying to .strip() a None value
* fix(review): cleanup unneeded/misleading code, add missing arguments, don't retry on 404, reword logs
* Apply automatic documentation changes
* Apply automatic documentation changes
* chore(docs): update docs
* Apply automatic documentation changes
* chore(merge): revert changes from bad merge
* Apply automatic documentation changes
* chore(refactor): make more readable - don't do "smaller file" logic
* Apply automatic documentation changes
* chore(docs): fix some comments/docs, add more details for clarity about capabilities/assumptions
* Apply automatic documentation changes
Co-authored-by: Alexander VT <[email protected]>
README.md: 180 additions & 1 deletion
@@ -506,6 +506,8 @@ provided. However, this is limited by indexd's ability to scale to the queries y
 want to run. Indexd's querying capabilities are limited and don't scale well with a
 large volume of records (it is meant to be a key:value store much like the metadata service).
 
+> WARNING: This is not recommended to be used at scale. Consider making these associations to metadata before ingestion. See [merging tools](#manifest-merge)
+
 ```python
 import sys
 import logging
@@ -607,4 +609,181 @@ Setting `get_guid_from_file` to `False` tells tool to try and get the guid usin
 the provided custom query function instead of relying on a column in the manifest.
 
 > NOTE: By default, the `indexed_file_object_guid` function attempts to query indexd URLs to pattern match
-whatever is in the manifest column `submitted_sample_id`.
+whatever is in the manifest column `submitted_sample_id`.
+
+
+### Manifest Merge
+
+If you have a manifest full of metadata and a manifest of indexed file objects in Indexd, you can use this script to merge the two into a metadata manifest for ingestion.
+
+For example, a common use case for this is if you have a file full of metadata from dbGaP and want to get associated GUIDs for each row. You can then add the dbGaP metadata to the metadata service for those GUIDs with the file output from this merge script.
+
+The script is also fairly configurable depending on how you need to map between the two files.
+
+The ideal scenario is when you can map column to column between your _metadata manifest_ and _indexing manifest_ (e.g. what's in indexd).
+
+The non-ideal scenario is if you need something for partially matching one column to another. For example: if one of the indexed URLs will contain `submitted_sample_id` somewhere in the filename. In this case, the efficiency of the script becomes O(n^2). If you can reliably parse out the section of the URL to match that could improve this. *tl;dr* Depending on your logic and number of rows in both files, this could be very very slow.
+
+By default this merge can match multiple GUIDs with the same metadata (depending on the configuration). This supports situations where there may exist metadata that applies to multiple files. For example: dbGaP sample metadata applied to both CRAM and CRAI genomic files.
+
+So while this supports metadata matching multiple GUIDs, it does *not* support GUIDs matching multiple sets of metadata.
+
+> IMPORTANT NOTE: The tool will log warnings about unmatched records but it will not halt execution, so be sure to check logs when using these tools.
+
+#### Ideal Scenario (Column to Column Match, Indexing:Metadata Manifest Rows)
+
+Consider the following example files.
+
+*metadata manifest*: dbGaP extract file perhaps by using [this tool](https://github.com/uc-cdis/dbgap-extract):
+The strategy here is to map from the `submitted_sample_id` from the metadata manifest into the `sample_id` and then use the `guid` from the indexing manifest in the final output. That final output can be used as the ingestion file for metadata ingestion.
+
+```python
+import sys
+import logging
+
+from gen3.tools.merge import merge_guids_into_metadata
+from gen3.tools.merge import manifests_mapping_config
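The column-to-column strategy described in the added README text can be sketched in plain Python. This is not part of the diff and does not use the `gen3.tools.merge` API. The function name `merge_by_exact_column`, the sample rows, and the `body_site` column are hypothetical; only the column names `submitted_sample_id`, `sample_id`, and `guid` come from the text. The sketch indexes the metadata rows once, then emits one output row per matching GUID, so the same metadata can attach to several GUIDs. Unmatched rows are logged as warnings rather than halting. Keeping only the first row for a duplicated metadata key is an assumption made here, not documented behavior.

```python
import logging

logger = logging.getLogger("merge_sketch")


def merge_by_exact_column(metadata_rows, indexing_rows,
                          metadata_key="submitted_sample_id",
                          indexing_key="sample_id",
                          guid_column="guid"):
    """Return one merged row per indexing row whose key exactly matches a metadata row."""
    # Index the metadata once so each later lookup is O(1) on average.
    metadata_by_key = {}
    for row in metadata_rows:
        key = (row.get(metadata_key) or "").strip()
        if not key:
            continue  # skip rows with a missing or empty key
        if key in metadata_by_key:
            # Assumption: keep the first row, since a GUID gets only one set of metadata.
            logger.warning("duplicate metadata key %r ignored", key)
            continue
        metadata_by_key[key] = row

    merged = []
    for row in indexing_rows:
        key = (row.get(indexing_key) or "").strip()
        metadata = metadata_by_key.get(key) if key else None
        if metadata is None:
            # Warn and keep going instead of halting.
            logger.warning("no metadata for guid %r (key %r)", row.get(guid_column), key)
            continue
        merged.append({guid_column: row[guid_column], **metadata})
    return merged


# Hypothetical sample data: one sample with both a CRAM and a CRAI file.
metadata_rows = [
    {"submitted_sample_id": "NWD1", "body_site": "blood"},
    {"submitted_sample_id": "NWD2", "body_site": "saliva"},
]
indexing_rows = [
    {"guid": "guid-cram-1", "sample_id": "NWD1"},
    {"guid": "guid-crai-1", "sample_id": "NWD1"},
    {"guid": "guid-cram-3", "sample_id": "NWD3"},  # no metadata: logged, not fatal
]
merged = merge_by_exact_column(metadata_rows, indexing_rows)
```

Both `NWD1` files receive the same metadata, while the `NWD3` file is skipped with a warning. Because of the dictionary lookup, the cost grows with the sum of the two file sizes rather than their product.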
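The partial-matching case that the README text calls non-ideal can be sketched the same way, again without the `gen3.tools.merge` API. The function name `merge_by_partial_match`, the `urls` column name, and the sample rows are hypothetical. Each indexing row is compared against metadata rows until one matches, which is where the quadratic cost comes from.

```python
def merge_by_partial_match(metadata_rows, indexing_rows,
                           metadata_key="submitted_sample_id",
                           url_column="urls",
                           guid_column="guid"):
    """Match a metadata row to an indexing row when its key appears inside the URL."""
    merged = []
    comparisons = 0
    for index_row in indexing_rows:
        for metadata_row in metadata_rows:
            comparisons += 1
            sample_id = metadata_row[metadata_key]
            if sample_id and sample_id in index_row[url_column]:
                merged.append({guid_column: index_row[guid_column], **metadata_row})
                break  # a GUID takes at most one set of metadata
    return merged, comparisons


# Hypothetical sample data.
metadata_rows = [
    {"submitted_sample_id": "NWD1"},
    {"submitted_sample_id": "NWD2"},
]
indexing_rows = [
    {"guid": "guid-a", "urls": "s3://bucket/NWD2.cram"},
    {"guid": "guid-b", "urls": "s3://bucket/NWD2.cram.crai"},
    {"guid": "guid-c", "urls": "s3://bucket/other.txt"},
]
partial_merged, comparisons = merge_by_partial_match(metadata_rows, indexing_rows)
```

A plain substring test is also loose: `NWD1` would match a URL containing `NWD10`. Parsing the sample ID out of the URL, as the README text suggests, would allow an exact lookup instead.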