Feat/metadata merge (#40)

Avantol13 · Avantol13-machine-user · web-flow · commit 97603fd3066d · 2020-04-23T09:19:21.000-05:00
* feat(metadata): add Gen3Metadata for communication with the new metadata service, tests are WIP

* feat(metadata): add auth to metadata calls

* feat(utils): add common utils file for functions

* fix(metadata): cleanup and fix bugs in existing code, update docs

* feat(index): add async functions for querying indexd

* feat(metadata-tools): add initial script/tool for adding to the metadata service from a file

* feat(mds): docstrings, support updating for metadata ingestion, use asyncio for mds calls, centralize backoff cfg in utils

* chore(backoff): use default settings in indexd class for backoff

* chore(refactor): add docs for metadata capabilities, cleanup other docs and rename things to be more clear, add missing imports to utils

* feat(tests): update unit tests, fix some quick issues

* fix(metadata): when localhost don't use admin endpoint suffix

* fix(download): no spaces between csv header (so columns don't have leading spaces)

* feat(merge): tool for merging indexing and metadata manifests

* Apply automatic documentation changes

* Apply automatic documentation changes

* Apply automatic documentation changes

* fix(ingestion): fix bugs, handle None empty column in row and catch 404 exception when missing record

* Apply automatic documentation changes

* fix(metadata): improve logging, cleanup docstring, more clear popping from queue loop, handle exception on missing guid

* Apply automatic documentation changes

* fix(indexing): remove possibility of trying to .strip() a None value

* fix(review): cleanup unneeded/misleading code, add missing arguments, don't retry on 404, reword logs

* Apply automatic documentation changes

* Apply automatic documentation changes

* chore(docs): update docs

* Apply automatic documentation changes

* chore(merge): revert changes from bad merge

* Apply automatic documentation changes

* chore(refactor): make more readable - don't do "smaller file" logic

* Apply automatic documentation changes

* chore(docs): fix some comments/docs, add more details for clarity about capabilities/assumptions

* Apply automatic documentation changes

Co-authored-by: Alexander VT <alexander.m.vantol@gmail.com>
diff --git a/README.md b/README.md
@@ -506,6 +506,8 @@ provided. However, this is limited by indexd's ability to scale to the queries y
 want to run. Indexd's querying capabilities are limited and don't scale well with a
 large volume of records (it is meant to be a key:value store much like the metadata service).
 
+> WARNING: This approach is not recommended at scale. Consider making these associations to metadata before ingestion. See [merging tools](#manifest-merge).
+
 ```python
 import sys
 import logging
@@ -607,4 +609,181 @@ Setting `get_guid_from_file`  to `False` tells tool to try and get the guid usin
 the provided custom query function instead of relying on a column in the manifest.
 
 > NOTE: By default, the `indexed_file_object_guid` function attempts to query indexd URLs to pattern match
-whatever is in the manifest column `submitted_sample_id`.
+whatever is in the manifest column `submitted_sample_id`.
+
+
+### Manifest Merge
+
+If you have a manifest full of metadata and a manifest of indexed file objects in Indexd, you can use this script to merge the two into a metadata manifest for ingestion.
+
+For example, a common use case for this is if you have a file full of metadata from dbGaP and want to get associated GUIDs for each row. You can then add the dbGaP metadata to the metadata service for those GUIDs with the file output from this merge script.
+
+The script is also fairly configurable depending on how you need to map between the two files.
+
+The ideal scenario is when you can map column to column between your _metadata manifest_ and _indexing manifest_ (e.g. what's in indexd).
+
+The non-ideal scenario is when you need to partially match one column against another, for example when one of the indexed URLs contains the `submitted_sample_id` somewhere in the filename. In this case every metadata row may have to be compared against every indexing row, so the running time of the script becomes O(n^2). If you can reliably parse out the section of the URL to match, you can avoid that cost. *tl;dr* Depending on your logic and the number of rows in both files, this could be very slow.
+
+By default this merge can match multiple GUIDs with the same metadata (depending on the configuration). This supports situations where there may exist metadata that applies to multiple files. For example: dbGaP sample metadata applied to both CRAM and CRAI genomic files.
+
+So while this supports metadata matching multiple GUIDs, it does *not* support GUIDs matching multiple sets of metadata.
+
+> IMPORTANT NOTE: The tool will log warnings about unmatched records but it will not halt execution, so be sure to check logs when using these tools.
+
+#### Ideal Scenario (Column to Column Match, Indexing:Metadata Manifest Rows)
+
+Consider the following example files.
+
+*metadata manifest*: a dbGaP extract file, perhaps produced with [this tool](https://github.com/uc-cdis/dbgap-extract):
+
+```
+submitted_sample_id, dbgap_subject_id, consent_short_name, body_site, ....
+```
+
+*indexing manifest* (perhaps provided by the data owner):
+
+```
+guid, sample_id, file_size, md5, md5_hex, aws_uri, gcp_uri
+```
+
+The strategy here is to map the `submitted_sample_id` column from the metadata manifest to the `sample_id` column in the indexing manifest, and then use the `guid` from the indexing manifest in the final output. That final output can be used as the ingestion file for metadata ingestion.
+
+```python
+import sys
+import logging
+
+from gen3.tools.merge import merge_guids_into_metadata
+from gen3.tools.merge import manifests_mapping_config
+
+
+logging.basicConfig(filename="output.log", level=logging.DEBUG)
+logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))
+
+COMMONS = "https://{{insert-commons-here}}/"
+
+def main():
+    indexing_manifest = (
+        "/path/to/indexing_manifest.csv"
+    )
+    metadata_manifest = (
+        "/path/to/metadata_extract.tsv"
+    )
+
+    # what column to use as the final GUID for metadata (this MUST exist in the
+    # indexing file)
+    manifests_mapping_config["guid_column_name"] = "guid"
+
+    # what column from the "metadata file" to use for mapping
+    manifests_mapping_config["row_column_name"] = "submitted_sample_id"
+
+    # this configuration tells the function to use the "sample_id" column
+    # from the "indexing file" to map to the metadata column configured above
+    # (and these should match EXACTLY, 1:1)
+    manifests_mapping_config["indexing_manifest_column_name"] = "sample_id"
+
+    output_filename = "metadata-manifest.tsv"
+
+    merge_guids_into_metadata(
+        indexing_manifest, metadata_manifest, output_filename=output_filename,
+        manifests_mapping_config=manifests_mapping_config
+    )
+
+if __name__ == "__main__":
+    main()
+
+```
+
+The final output file will contain all the columns from the metadata manifest in addition to a new GUID column which maps to indexed records.
+
+*output manifest* (to be used in metadata ingestion):
+
+```
+guid, submitted_sample_id, dbgap_subject_id, consent_short_name, body_site, ....
+```
+
+#### Non-Ideal Scenario (Partial URL Matching)
+
+Consider the following example files.
+
+*metadata manifest*: a dbGaP extract file, perhaps produced with [this tool](https://github.com/uc-cdis/dbgap-extract):
+
+```
+submitted_sample_id, dbgap_subject_id, consent_short_name, body_site, ....
+```
+
+*indexing manifest* (perhaps by using the [download manifest tool](#download-manifest)):
+
+```
+guid, urls, authz, acl, md5, file_size, file_name
+```
+
+> NOTE: The indexing manifest contains no exact column match to the metadata manifest.
+
+The strategy here is to look for partial matches of the metadata manifest's `submitted_sample_id` in the indexing manifest's `urls` field.
+
+```python
+import sys
+import logging
+
+from gen3.tools.merge import (
+    merge_guids_into_metadata,
+    manifest_row_parsers,
+    manifests_mapping_config,
+    get_guids_for_manifest_row_partial_match,
+)
+
+
+logging.basicConfig(filename="output.log", level=logging.DEBUG)
+logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))
+
+COMMONS = "https://{{insert-commons-here}}/"
+
+
+def main():
+    indexing_manifest = (
+        "/path/to/indexing_manifest.csv"
+    )
+    metadata_manifest = (
+        "/path/to/metadata_extract.tsv"
+    )
+    # what column to use as the final GUID for metadata (this MUST exist in the
+    # indexing file)
+    manifests_mapping_config["guid_column_name"] = "guid"
+
+    # what column from the "metadata file" to use for mapping
+    manifests_mapping_config["row_column_name"] = "submitted_sample_id"
+
+    # this configuration tells the function to use the "urls" column
+    # from the "indexing file" to map to the metadata column configured above
+    # (for partially matching the metadata column against this column)
+    manifests_mapping_config["indexing_manifest_column_name"] = "urls"
+
+    # by default, the functions for parsing the manifests and rows assume a 1:1
+    # mapping. There is an additional function provided for partial string matching
+    # which we can use here.
+    manifest_row_parsers["guids_for_manifest_row"] = get_guids_for_manifest_row_partial_match
+
+    output_filename = "metadata-manifest-partial.tsv"
+
+    merge_guids_into_metadata(
+        indexing_manifest=indexing_manifest,
+        metadata_manifest=metadata_manifest,
+        output_filename=output_filename,
+        manifests_mapping_config=manifests_mapping_config,
+        manifest_row_parsers=manifest_row_parsers,
+    )
+
+
+if __name__ == "__main__":
+    main()
+```
+
+> WARNING: The efficiency here is O(n^2), so this does not scale well with large files.
+
+The final output file will contain all the columns from the metadata manifest in addition to a new GUID column which maps to indexed records.
+
+*output manifest* (to be used in metadata ingestion):
+
+```
+guid, submitted_sample_id, dbgap_subject_id, consent_short_name, body_site, ....
+```
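
Note on the two README scenarios above: the cost difference between column-to-column matching and partial matching can be illustrated with a minimal standalone sketch. This is not the `gen3.tools.merge` implementation; the helper names (`index_by_column`, `merge_exact`, `merge_partial`) and the sample rows are made up for illustration, and only the column names come from the README text. Exact matching builds a dictionary once and does one lookup per metadata row, while partial matching compares every metadata row with every indexing row.

```python
def index_by_column(rows, column):
    """Group indexing rows by the exact value of `column` in a single pass."""
    lookup = {}
    for row in rows:
        lookup.setdefault(row[column].strip(), []).append(row)
    return lookup


def merge_exact(metadata_rows, indexing_rows, row_column, indexing_column):
    """Column-to-column match: one dictionary lookup per metadata row."""
    lookup = index_by_column(indexing_rows, indexing_column)
    merged = []
    for meta in metadata_rows:
        for match in lookup.get(meta[row_column].strip(), []):
            merged.append({"guid": match["guid"], **meta})
    return merged


def merge_partial(metadata_rows, indexing_rows, row_column, indexing_column):
    """Substring match: each metadata row is checked against each indexing row."""
    merged = []
    for meta in metadata_rows:
        needle = meta[row_column].strip()
        for candidate in indexing_rows:
            if needle and needle in candidate[indexing_column]:
                merged.append({"guid": candidate["guid"], **meta})
    return merged


# Hypothetical sample data (not from a real manifest)
metadata_rows = [
    {"submitted_sample_id": "NWD1", "body_site": "Blood"},
    {"submitted_sample_id": "NWD2", "body_site": "Saliva"},
]
indexing_rows = [
    {"guid": "guid-1", "sample_id": "NWD1", "urls": "s3://bucket/NWD1.cram"},
    {"guid": "guid-2", "sample_id": "NWD1", "urls": "s3://bucket/NWD1.cram.crai"},
]

exact = merge_exact(metadata_rows, indexing_rows, "submitted_sample_id", "sample_id")
partial = merge_partial(metadata_rows, indexing_rows, "submitted_sample_id", "urls")
```

In this sketch one metadata row maps to two GUIDs (the CRAM and CRAI files), and the unmatched `NWD2` row is simply absent from the output. A plain substring test can also over-match (an ID such as `NWD1` would be found inside `NWD10`), which is one more reason to parse the ID out of the URL and match exactly where possible.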
diff --git a/docs/_build/html/_modules/gen3/tools/metadata/ingest_manifest.html b/docs/_build/html/_modules/gen3/tools/metadata/ingest_manifest.html
@@ -276,9 +276,9 @@ <h1>Source code for gen3.tools.metadata.ingest_manifest</h1><div class="highligh
                 <span class="c1"># Basically make sure the resulting column is something that we can</span>
                 <span class="c1"># later json.loads().</span>
                 <span class="c1"># remove redudant quoting</span>
-                <span class="n">new_row</span><span class="p">[</span><span class="n">key</span><span class="o">.</span><span class="n">strip</span><span class="p">()]</span> <span class="o">=</span> <span class="p">(</span>
-                    <span class="n">value</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s2">&quot;&#39;&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;&quot;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">&quot;&#39;&#39;&quot;</span><span class="p">,</span> <span class="s2">&quot;&#39;&quot;</span><span class="p">)</span>
-                <span class="p">)</span>
+                <span class="k">if</span> <span class="n">value</span><span class="p">:</span>
+                    <span class="n">value</span> <span class="o">=</span> <span class="n">value</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s2">&quot;&#39;&quot;</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;&quot;&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">&quot;&#39;&#39;&quot;</span><span class="p">,</span> <span class="s2">&quot;&#39;&quot;</span><span class="p">)</span>
+                <span class="n">new_row</span><span class="p">[</span><span class="n">key</span><span class="o">.</span><span class="n">strip</span><span class="p">()]</span> <span class="o">=</span> <span class="n">value</span>
             <span class="k">await</span> <span class="n">queue</span><span class="o">.</span><span class="n">put</span><span class="p">(</span><span class="n">new_row</span><span class="p">)</span>
 
     <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span>
diff --git a/docs/_build/html/tools/indexing.html b/docs/_build/html/tools/indexing.html
@@ -312,7 +312,7 @@ <h1>Indexing Tools<a class="headerlink" href="#indexing-tools" title="Permalink
 
 <dl class="py function">
 <dt id="gen3.tools.indexing.verify_manifest.async_verify_object_manifest">
-<em class="property">async </em><code class="sig-prename descclassname">gen3.tools.indexing.verify_manifest.</code><code class="sig-name descname">async_verify_object_manifest</code><span class="sig-paren">(</span><em class="sig-param">commons_url</em>, <em class="sig-param">manifest_file</em>, <em class="sig-param">max_concurrent_requests=24</em>, <em class="sig-param">manifest_row_parsers={'acl': &lt;function _get_acl_from_row&gt;</em>, <em class="sig-param">'authz': &lt;function _get_authz_from_row&gt;</em>, <em class="sig-param">'file_name': &lt;function _get_file_name_from_row&gt;</em>, <em class="sig-param">'file_size': &lt;function _get_file_size_from_row&gt;</em>, <em class="sig-param">'guid': &lt;function _get_guid_from_row&gt;</em>, <em class="sig-param">'md5': &lt;function _get_md5_from_row&gt;</em>, <em class="sig-param">'urls': &lt;function _get_urls_from_row&gt;}</em>, <em class="sig-param">manifest_file_delimiter=None</em>, <em class="sig-param">output_filename='verify-manifest-errors-1586811722.265852.log'</em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/indexing/verify_manifest.html#async_verify_object_manifest"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#gen3.tools.indexing.verify_manifest.async_verify_object_manifest" title="Permalink to this definition">¶</a></dt>
+<em class="property">async </em><code class="sig-prename descclassname">gen3.tools.indexing.verify_manifest.</code><code class="sig-name descname">async_verify_object_manifest</code><span class="sig-paren">(</span><em class="sig-param">commons_url</em>, <em class="sig-param">manifest_file</em>, <em class="sig-param">max_concurrent_requests=24</em>, <em class="sig-param">manifest_row_parsers={'acl': &lt;function _get_acl_from_row&gt;</em>, <em class="sig-param">'authz': &lt;function _get_authz_from_row&gt;</em>, <em class="sig-param">'file_name': &lt;function _get_file_name_from_row&gt;</em>, <em class="sig-param">'file_size': &lt;function _get_file_size_from_row&gt;</em>, <em class="sig-param">'guid': &lt;function _get_guid_from_row&gt;</em>, <em class="sig-param">'md5': &lt;function _get_md5_from_row&gt;</em>, <em class="sig-param">'urls': &lt;function _get_urls_from_row&gt;}</em>, <em class="sig-param">manifest_file_delimiter=None</em>, <em class="sig-param">output_filename='verify-manifest-errors-1587584755.0931818.log'</em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/indexing/verify_manifest.html#async_verify_object_manifest"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#gen3.tools.indexing.verify_manifest.async_verify_object_manifest" title="Permalink to this definition">¶</a></dt>
 <dd><p>Verify all file object records into a manifest csv</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
diff --git a/docs/_build/html/tools/metadata.html b/docs/_build/html/tools/metadata.html
@@ -101,7 +101,7 @@ <h1>Metadata Tools<a class="headerlink" href="#metadata-tools" title="Permalink
 
 <dl class="py function">
 <dt id="gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest">
-<em class="property">async </em><code class="sig-prename descclassname">gen3.tools.metadata.ingest_manifest.</code><code class="sig-name descname">async_ingest_metadata_manifest</code><span class="sig-paren">(</span><em class="sig-param">commons_url</em>, <em class="sig-param">manifest_file</em>, <em class="sig-param">metadata_source</em>, <em class="sig-param">auth=None</em>, <em class="sig-param">max_concurrent_requests=24</em>, <em class="sig-param">manifest_row_parsers={'guid_for_row': &lt;function _get_guid_for_row&gt;</em>, <em class="sig-param">'indexed_file_object_guid': &lt;function _query_for_associated_indexd_record_guid&gt;}</em>, <em class="sig-param">manifest_file_delimiter=None</em>, <em class="sig-param">output_filename='ingest-metadata-manifest-errors-1586811722.6718857.log'</em>, <em class="sig-param">get_guid_from_file=True</em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/metadata/ingest_manifest.html#async_ingest_metadata_manifest"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest" title="Permalink to this definition">¶</a></dt>
+<em class="property">async </em><code class="sig-prename descclassname">gen3.tools.metadata.ingest_manifest.</code><code class="sig-name descname">async_ingest_metadata_manifest</code><span class="sig-paren">(</span><em class="sig-param">commons_url</em>, <em class="sig-param">manifest_file</em>, <em class="sig-param">metadata_source</em>, <em class="sig-param">auth=None</em>, <em class="sig-param">max_concurrent_requests=24</em>, <em class="sig-param">manifest_row_parsers={'guid_for_row': &lt;function _get_guid_for_row&gt;</em>, <em class="sig-param">'indexed_file_object_guid': &lt;function _query_for_associated_indexd_record_guid&gt;}</em>, <em class="sig-param">manifest_file_delimiter=None</em>, <em class="sig-param">output_filename='ingest-metadata-manifest-errors-1587584755.5095317.log'</em>, <em class="sig-param">get_guid_from_file=True</em><span class="sig-paren">)</span><a class="reference internal" href="../_modules/gen3/tools/metadata/ingest_manifest.html#async_ingest_metadata_manifest"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#gen3.tools.metadata.ingest_manifest.async_ingest_metadata_manifest" title="Permalink to this definition">¶</a></dt>
 <dd><p>Ingest all metadata records into a manifest csv</p>
 <dl class="field-list simple">
 <dt class="field-odd">Parameters</dt>
diff --git a/gen3/tools/merge.py b/gen3/tools/merge.py
diff --git a/gen3/tools/metadata/ingest_manifest.py b/gen3/tools/metadata/ingest_manifest.py
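
Note on the `ingest_manifest` change visible in the generated-docs hunk above: the fix guards the quote-stripping chain so it is only applied to non-empty values. A minimal standalone sketch of that pattern follows. The `clean_row` helper name and the sample CSV text are assumptions for illustration; the stripping chain itself is copied from the hunk. It relies on the standard-library behavior that `csv.DictReader` fills missing trailing columns with `None`.

```python
import csv
import io


def clean_row(row):
    """Strip whitespace and redundant quoting, leaving None/empty values as-is."""
    new_row = {}
    for key, value in row.items():
        if value:
            # only call .strip() when there is a string to strip
            value = value.strip().strip("'").strip('"').replace("''", "'")
        new_row[key.strip()] = value
    return new_row


# The second data row has no "note" field, so DictReader yields None for it
reader = csv.DictReader(io.StringIO("guid,note\n 'abc' ,it''s fine\ndef\n"))
cleaned = [clean_row(row) for row in reader]
```

Without the `if value:` guard, the second row would raise `AttributeError` because `None` has no `.strip()` method.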