
Commit 97603fd

Feat/metadata merge (#40)
* feat(metadata): add Gen3Metadata for communication with the new metadata service, tests are WIP
* feat(metadata): add auth to metadata calls
* feat(utils): add common utils file for functions
* fix(metadata): cleanup and fix bugs in existing code, update docs
* feat(index): add async functions for querying indexd
* feat(metadata-tools): add initial script/tool for adding to the metadata service from a file
* feat(mds): docstrings, support updating for metadata ingestion, use asyncio for mds calls, centralize backoff cfg in utils
* chore(backoff): use default settings in indexd class for backoff
* chore(refactor): add docs for metadata capabilities, cleanup other docs and rename things to be more clear, add missing imports to utils
* feat(tests): update unit tests, fix some quick issues
* fix(metadata): when localhost don't use admin endpoint suffix
* fix(download): no spaces between csv header (so columns don't have leading spaces)
* feat(merge): tool for merging indexing and metadata manifests
* Apply automatic documentation changes
* Apply automatic documentation changes
* Apply automatic documentation changes
* fix(ingestion): fix bugs, handle None empty column in row and catch 404 exception when missing record
* Apply automatic documentation changes
* fix(metadata): improve logging, cleanup docstring, more clear popping from queue loop, handle exception on missing guid
* Apply automatic documentation changes
* fix(indexing): remove possibility of trying to .strip() a None value
* fix(review): cleanup unneeded/misleading code, add missing arguments, don't retry on 404, reword logs
* Apply automatic documentation changes
* Apply automatic documentation changes
* chore(docs): update docs
* Apply automatic documentation changes
* chore(merge): revert changes from bad merge
* Apply automatic documentation changes
* chore(refactor): make more readable - don't do "smaller file" logic
* Apply automatic documentation changes
* chore(docs): fix some comments/docs, add more details for clarity about capabilities/assumptions
* Apply automatic documentation changes

Co-authored-by: Alexander VT <[email protected]>
1 parent 23ff8a7 commit 97603fd

File tree

6 files changed

+436
-9
lines changed

README.md

Lines changed: 180 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -506,6 +506,8 @@ provided. However, this is limited by indexd's ability to scale to the queries y
506506
want to run. Indexd's querying capabilities are limited and don't scale well with a
507507
large volume of records (it is meant to be a key:value store much like the metadata service).
508508

509+
> WARNING: This approach is not recommended at scale. Consider associating metadata with GUIDs before ingestion. See [merging tools](#manifest-merge).
510+
509511
```python
510512
import sys
511513
import logging
@@ -607,4 +609,181 @@ Setting `get_guid_from_file` to `False` tells tool to try and get the guid usin
607609
the provided custom query function instead of relying on a column in the manifest.
608610

609611
> NOTE: By default, the `indexed_file_object_guid` function attempts to query indexd URLs to pattern match
610-
whatever is in the manifest column `submitted_sample_id`.
612+
whatever is in the manifest column `submitted_sample_id`.
613+
614+
615+
### Manifest Merge
616+
617+
If you have a manifest full of metadata and a manifest of indexed file objects in Indexd, you can use this script to merge the two into a metadata manifest for ingestion.
618+
619+
For example, a common use case for this is if you have a file full of metadata from dbGaP and want to get associated GUIDs for each row. You can then add the dbGaP metadata to the metadata service for those GUIDs with the file output from this merge script.
620+
621+
The script is also fairly configurable depending on how you need to map between the two files.
622+
623+
The ideal scenario is when you can map column to column between your _metadata manifest_ and _indexing manifest_ (i.e. what's in indexd).
624+
625+
The non-ideal scenario is when you need to partially match one column against another. For example: one of the indexed URLs contains the `submitted_sample_id` somewhere in the filename. In this case, every metadata row has to be compared against every indexing row, so the time complexity of the script becomes O(n^2). If you can reliably parse out the section of the URL to match, that could improve this. *tl;dr* Depending on your logic and the number of rows in both files, this could be very slow.
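As a rough illustration of the "parse out the section of the URL" idea, the sketch below builds a lookup from a parsed sample ID to GUIDs in a single pass, so each metadata row becomes a dictionary lookup instead of a scan over every indexing row. It is not part of `gen3.tools.merge`: the function name `build_guid_index`, the regular expression, and the assumption that the sample ID is the filename prefix before the first `.` (with multiple URLs separated by whitespace) are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical pattern: assumes the sample ID is the part of the filename
# before the first ".", e.g. ".../NWD123456.b38.cram" -> "NWD123456".
SAMPLE_ID_IN_URL = re.compile(r"/([^/.]+)\.[^/]*$")


def build_guid_index(indexing_rows, url_column="urls", guid_column="guid"):
    """Map each sample ID parsed from a URL to the GUIDs that reference it."""
    index = defaultdict(list)
    for row in indexing_rows:
        # Assumption: multiple URLs in one cell are separated by whitespace.
        for url in row[url_column].split():
            match = SAMPLE_ID_IN_URL.search(url)
            if match is None:
                continue
            guids = index[match.group(1)]
            if row[guid_column] not in guids:
                guids.append(row[guid_column])
    return index


# Made-up rows standing in for an indexing manifest.
indexing_rows = [
    {"guid": "guid-1", "urls": "s3://bucket/path/NWD123456.b38.cram"},
    {
        "guid": "guid-2",
        "urls": "s3://bucket/path/NWD123456.b38.cram.crai "
        "gs://bucket/NWD123456.b38.cram.crai",
    },
    {"guid": "guid-3", "urls": "s3://bucket/path/NWD999999.b38.cram"},
]
index = build_guid_index(indexing_rows)

# One dictionary lookup per metadata row instead of scanning all URLs.
guids_for_sample = index.get("NWD123456", [])
```

Whether this is usable depends entirely on how regular your filenames are; if the ID can appear anywhere in the URL, partial matching may still be the only option.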
626+
627+
By default this merge can match multiple GUIDs with the same metadata (depending on the configuration). This supports situations where there may exist metadata that applies to multiple files. For example: dbGaP sample metadata applied to both CRAM and CRAI genomic files.
628+
629+
So while this supports metadata matching multiple GUIDs, it does *not* support GUIDs matching multiple sets of metadata.
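To make the one-to-many direction concrete, here is a small standalone sketch of the relationship described above. It does not use or reproduce the merge tool's internals; `merge_rows`, the sample rows, and the choice to raise an error on a conflicting GUID are illustrative assumptions only.

```python
def merge_rows(metadata_rows, indexing_rows):
    """Attach each metadata row to every GUID that shares its sample ID."""
    guids_by_sample = {}
    for row in indexing_rows:
        guids_by_sample.setdefault(row["sample_id"], []).append(row["guid"])

    merged = {}
    for meta in metadata_rows:
        for guid in guids_by_sample.get(meta["submitted_sample_id"], []):
            if guid in merged:
                # Illustrative choice: a GUID may carry only one set of metadata.
                raise ValueError(f"{guid} matches more than one metadata row")
            merged[guid] = meta
    return [{"guid": guid, **meta} for guid, meta in merged.items()]


# Made-up data: one sample with a CRAM file and its CRAI index.
metadata_rows = [{"submitted_sample_id": "NWD123456", "body_site": "Blood"}]
indexing_rows = [
    {"guid": "guid-cram", "sample_id": "NWD123456"},
    {"guid": "guid-crai", "sample_id": "NWD123456"},
]
output_rows = merge_rows(metadata_rows, indexing_rows)
```

Here the single metadata row appears twice in `output_rows`, once per GUID, which mirrors the CRAM/CRAI case.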
630+
631+
> IMPORTANT NOTE: The tool will log warnings about unmatched records but it will not halt execution, so be sure to check logs when using these tools.
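Besides reading the logs, one way to double-check a merge is to compare the IDs in the metadata manifest against the IDs that ended up with a GUID in the output file. The helper below is a minimal sketch using only the standard library; `unmatched_ids` is a hypothetical name, and the column names and tab delimiter are assumptions based on the examples in this section, not guarantees about the tool's output.

```python
import csv
import io


def unmatched_ids(
    metadata_file,
    output_file,
    id_column="submitted_sample_id",
    guid_column="guid",
    delimiter="\t",
):
    """Return IDs present in the metadata manifest but lacking a GUID in the output."""
    expected = {
        row[id_column] for row in csv.DictReader(metadata_file, delimiter=delimiter)
    }
    matched = {
        row[id_column]
        for row in csv.DictReader(output_file, delimiter=delimiter)
        # Treat a missing or blank GUID cell as "not matched".
        if (row.get(guid_column) or "").strip()
    }
    return sorted(expected - matched)


# In-memory stand-ins for the two files; real use would pass open file handles.
metadata_tsv = "submitted_sample_id\tbody_site\nNWD1\tBlood\nNWD2\tBlood\n"
output_tsv = "guid\tsubmitted_sample_id\tbody_site\nguid-1\tNWD1\tBlood\n"

missing = unmatched_ids(io.StringIO(metadata_tsv), io.StringIO(output_tsv))
print(missing)  # ['NWD2']
```

An empty list suggests every metadata row found at least one GUID; anything else is worth cross-referencing with the warnings in the log.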
632+
633+
#### Ideal Scenario (Column to Column Match, Indexing:Metadata Manifest Rows)
634+
635+
Consider the following example files.
636+
637+
*metadata manifest*: dbGaP extract file, perhaps produced with [this tool](https://github.com/uc-cdis/dbgap-extract):
638+
639+
```
640+
submitted_sample_id, dbgap_subject_id, consent_short_name, body_site, ....
641+
```
642+
643+
*indexing manifest* (perhaps provided by the data owner):
644+
645+
```
646+
guid, sample_id, file_size, md5, md5_hex, aws_uri, gcp_uri
647+
```
648+
649+
The strategy here is to map the `submitted_sample_id` from the metadata manifest to the `sample_id` in the indexing manifest and then use the `guid` from the indexing manifest in the final output. That final output can be used as the ingestion file for metadata ingestion.
650+
651+
```python
import sys
import logging

from gen3.tools.merge import merge_guids_into_metadata
from gen3.tools.merge import manifests_mapping_config


logging.basicConfig(filename="output.log", level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))

COMMONS = "https://{{insert-commons-here}}/"

def main():
    indexing_manifest = (
        "/path/to/indexing_manifest.csv"
    )
    metadata_manifest = (
        "/path/to/metadata_extract.tsv"
    )

    # what column to use as the final GUID for metadata (this MUST exist in the
    # indexing file)
    manifests_mapping_config["guid_column_name"] = "guid"

    # what column from the "metadata file" to use for mapping
    manifests_mapping_config["row_column_name"] = "submitted_sample_id"

    # this configuration tells the function to use the "sample_id" column
    # from the "indexing file" to map to the metadata column configured above
    # (and these should match EXACTLY, 1:1)
    manifests_mapping_config["indexing_manifest_column_name"] = "sample_id"

    output_filename = "metadata-manifest.tsv"

    merge_guids_into_metadata(
        indexing_manifest, metadata_manifest, output_filename=output_filename,
        manifests_mapping_config=manifests_mapping_config
    )

if __name__ == "__main__":
    main()
```
695+
696+
The final output file will contain all the columns from the metadata manifest in addition to a new GUID column which maps to indexed records.
697+
698+
*output manifest* (to be used in metadata ingestion):
699+
700+
```
701+
guid, submitted_sample_id, dbgap_subject_id, consent_short_name, body_site, ....
702+
```
703+
704+
#### Non-Ideal Scenario (Partial URL Matching)
705+
706+
Consider the following example files.
707+
708+
*metadata manifest*: dbGaP extract file, perhaps produced with [this tool](https://github.com/uc-cdis/dbgap-extract):
709+
710+
```
711+
submitted_sample_id, dbgap_subject_id, consent_short_name, body_site, ....
712+
```
713+
714+
*indexing manifest* (perhaps by using the [download manifest tool](#download-manifest)):
715+
716+
```
717+
guid, urls, authz, acl, md5, file_size, file_name
718+
```
719+
720+
> NOTE: The indexing manifest contains no exact column match to the metadata manifest.
721+
722+
The strategy here is to look for partial matches of the metadata manifest's `submitted_sample_id` in the indexing manifest's `urls` field.
723+
724+
```python
import sys
import logging

from gen3.tools.merge import (
    merge_guids_into_metadata,
    manifest_row_parsers,
    manifests_mapping_config,
    get_guids_for_manifest_row_partial_match,
)


logging.basicConfig(filename="output.log", level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))

COMMONS = "https://{{insert-commons-here}}/"


def main():
    indexing_manifest = (
        "/path/to/indexing_manifest.csv"
    )
    metadata_manifest = (
        "/path/to/metadata_extract.tsv"
    )
    # what column to use as the final GUID for metadata (this MUST exist in the
    # indexing file)
    manifests_mapping_config["guid_column_name"] = "guid"

    # what column from the "metadata file" to use for mapping
    manifests_mapping_config["row_column_name"] = "submitted_sample_id"

    # this configuration tells the function to use the "urls" column
    # from the "indexing file" to map to the metadata column configured above
    # (for partial matching the metadata column to this column)
    manifests_mapping_config["indexing_manifest_column_name"] = "urls"

    # by default, the functions for parsing the manifests and rows assume a 1:1
    # mapping. There is an additional function provided for partial string matching
    # which we can use here.
    manifest_row_parsers["guids_for_manifest_row"] = get_guids_for_manifest_row_partial_match

    output_filename = "metadata-manifest-partial.tsv"

    merge_guids_into_metadata(
        indexing_manifest=indexing_manifest,
        metadata_manifest=metadata_manifest,
        output_filename=output_filename,
        manifests_mapping_config=manifests_mapping_config,
        manifest_row_parsers=manifest_row_parsers,
    )


if __name__ == "__main__":
    main()
```
780+
781+
> WARNING: The time complexity here is O(n^2), so this does not scale well with large files.
782+
783+
The final output file will contain all the columns from the metadata manifest in addition to a new GUID column which maps to indexed records.
784+
785+
*output manifest* (to be used in metadata ingestion):
786+
787+
```
788+
guid, submitted_sample_id, dbgap_subject_id, consent_short_name, body_site, ....
789+
```

docs/_build/html/_modules/gen3/tools/metadata/ingest_manifest.html

Lines changed: 3 additions & 3 deletions
@@ -276,9 +276,9 @@ <h1>Source code for gen3.tools.metadata.ingest_manifest</h1><div class="highligh
276276
<span class="c1"># Basically make sure the resulting column is something that we can</span>
277277
<span class="c1"># later json.loads().</span>
278278
<span class="c1"># remove redudant quoting</span>
279-
<span class="n">new_row</span><span class="p">[</span><span class="n">key</span><span class="o">.</span><span class="n">strip</span><span class="p">()]</span> <span class="o">=</span> <span class="p">(</span>
280-
<span class="n">value</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s2">&quot;&#39;&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;&quot;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">&quot;&#39;&#39;&quot;</span><span class="p">,</span> <span class="s2">&quot;&#39;&quot;</span><span class="p">)</span>
281-
<span class="p">)</span>
279+
<span class="k">if</span> <span class="n">value</span><span class="p">:</span>
280+
<span class="n">value</span> <span class="o">=</span> <span class="n">value</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s2">&quot;&#39;&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;&quot;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">&quot;&#39;&#39;&quot;</span><span class="p">,</span> <span class="s2">&quot;&#39;&quot;</span><span class="p">)</span>
281+
<span class="n">new_row</span><span class="p">[</span><span class="n">key</span><span class="o">.</span><span class="n">strip</span><span class="p">()]</span> <span class="o">=</span> <span class="n">value</span>
282282
<span class="k">await</span> <span class="n">queue</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">new_row</span><span class="p">)</span>
283283

284284
<span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span>

docs/_build/html/tools/indexing.html

Lines changed: 1 addition & 1 deletion
@@ -312,7 +312,7 @@ <h1>Indexing Tools<a class="headerlink" href="#indexing-tools" title="Permalink
312312

313313
<dl class="py function">
314314
<dt id="gen3.tools.indexing.verify_manifest.async_verify_object_manifest">
315-
<em class="property">async </em><code class="sig-prename descclassname">gen3.tools.indexing.verify_manifest.</code><code class="sig-name descname">async_verify_object_manifest</code><span class="sig-paren">(</span><em class="sig-param">commons_url</em>, <em class="sig-param">manifest_file</em>, <em class="sig-param">max_concurrent_requests=24</em>, <em class="sig-param">manifest_row_parsers={'acl': &lt;function _get_acl_from_row&gt;</em>, <em class="sig-param">'authz': &lt;function _get_authz_from_row&gt;</em>, <em class="sig-param">'file_name': &lt;function _get_file_name_from_row&gt;</em>, <em class="sig-param">'file_size': &lt;function _get_file_size_from_row&gt;</em>, <em class="sig-param">'guid': &lt;function _get_guid_from_row&gt;</em>, <em class="sig-param">'md5': &lt;function _get_md5_from_row&gt;</em>, <em class="sig-param">'urls': &lt;function _get_urls_from_row&gt;}</em>, <em class="sig-param">manifest_file_delimiter=None</em>, <em class="sig-param">output_filename='verify-manifest-errors-1586811722.265852.log'</em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/indexing/verify_manifest.html#async_verify_object_manifest"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#gen3.tools.indexing.verify_manifest.async_verify_object_manifest" title="Permalink to this definition"></a></dt>
315+
<em class="property">async </em><code class="sig-prename descclassname">gen3.tools.indexing.verify_manifest.</code><code class="sig-name descname">async_verify_object_manifest</code><span class="sig-paren">(</span><em class="sig-param">commons_url</em>, <em class="sig-param">manifest_file</em>, <em class="sig-param">max_concurrent_requests=24</em>, <em class="sig-param">manifest_row_parsers={'acl': &lt;function _get_acl_from_row&gt;</em>, <em class="sig-param">'authz': &lt;function _get_authz_from_row&gt;</em>, <em class="sig-param">'file_name': &lt;function _get_file_name_from_row&gt;</em>, <em class="sig-param">'file_size': &lt;function _get_file_size_from_row&gt;</em>, <em class="sig-param">'guid': &lt;function _get_guid_from_row&gt;</em>, <em class="sig-param">'md5': &lt;function _get_md5_from_row&gt;</em>, <em class="sig-param">'urls': &lt;function _get_urls_from_row&gt;}</em>, <em class="sig-param">manifest_file_delimiter=None</em>, <em class="sig-param">output_filename='verify-manifest-errors-1587584755.0931818.log'</em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/indexing/verify_manifest.html#async_verify_object_manifest"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#gen3.tools.indexing.verify_manifest.async_verify_object_manifest" title="Permalink to this definition"></a></dt>
316316
<dd><p>Verify all file object records into a manifest csv</p>
317317
<dl class="field-list simple">
318318
<dt class="field-odd">Parameters</dt>

docs/_build/html/tools/metadata.html

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@ <h1>Metadata Tools<a class="headerlink" href="#metadata-tools" title="Permalink
101101

102102
<dl class="py function">
103103
<dt id="gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest">
104-
<em class="property">async </em><code class="sig-prename descclassname">gen3.tools.metadata.ingest_manifest.</code><code class="sig-name descname">async_ingest_metadata_manifest</code><span class="sig-paren">(</span><em class="sig-param">commons_url</em>, <em class="sig-param">manifest_file</em>, <em class="sig-param">metadata_source</em>, <em class="sig-param">auth=None</em>, <em class="sig-param">max_concurrent_requests=24</em>, <em class="sig-param">manifest_row_parsers={'guid_for_row': &lt;function _get_guid_for_row&gt;</em>, <em class="sig-param">'indexed_file_object_guid': &lt;function _query_for_associated_indexd_record_guid&gt;}</em>, <em class="sig-param">manifest_file_delimiter=None</em>, <em class="sig-param">output_filename='ingest-metadata-manifest-errors-1586811722.6718857.log'</em>, <em class="sig-param">get_guid_from_file=True</em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/metadata/ingest_manifest.html#async_ingest_metadata_manifest"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest" title="Permalink to this definition"></a></dt>
104+
<em class="property">async </em><code class="sig-prename descclassname">gen3.tools.metadata.ingest_manifest.</code><code class="sig-name descname">async_ingest_metadata_manifest</code><span class="sig-paren">(</span><em class="sig-param">commons_url</em>, <em class="sig-param">manifest_file</em>, <em class="sig-param">metadata_source</em>, <em class="sig-param">auth=None</em>, <em class="sig-param">max_concurrent_requests=24</em>, <em class="sig-param">manifest_row_parsers={'guid_for_row': &lt;function _get_guid_for_row&gt;</em>, <em class="sig-param">'indexed_file_object_guid': &lt;function _query_for_associated_indexd_record_guid&gt;}</em>, <em class="sig-param">manifest_file_delimiter=None</em>, <em class="sig-param">output_filename='ingest-metadata-manifest-errors-1587584755.5095317.log'</em>, <em class="sig-param">get_guid_from_file=True</em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/metadata/ingest_manifest.html#async_ingest_metadata_manifest"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest" title="Permalink to this definition"></a></dt>
105105
<dd><p>Ingest all metadata records into a manifest csv</p>
106106
<dl class="field-list simple">
107107
<dt class="field-odd">Parameters</dt>
