Ingest from NCBI Virus + NCBI Datasets #40

joverlee521 · 2024-05-25T02:44:35Z

Description of proposed changes

NCBI Datasets does not include the strain, serotype, and segment in the metadata, so this PR uses the NCBI Virus vvsearch2 API to pull down the metadata and joins it with the sequences downloaded via NCBI Datasets.

On this branch, I can run the ingest workflow with

nextstrain build ingest ingest_ncbi --configfile build-configs/ncbi/defaults/config.yaml

then successfully run the genome build (~~using tmp changes in 4b511fc~~)

nextstrain build . --snakefile Snakefile.genome --config local_ingest=True ingest_source=ncbi

~~Still needs some clean up but at least wanted to get this out for people to try out.~~

Major change

This PR completely changes over the h5n1-cattle-outbreak build to the NCBI data.
See the latest build at https://nextstrain.org/staging/avian-flu/ncbi/h5n1-cattle-outbreak/genome

Related issue(s)

Related to #37

Checklist

Checks pass
Add description.md for h5n1-cattle-outbreak
phylo include/exclude strain names
Add uploads to S3 -> tracking in ingest: Upload NCBI/Andersen lab outputs to S3 #41
join NCBI w/ Anderson lab data -> tracking in ingest: Join NCBI and Andersen lab data #42
add ingest automation -> tracking in Automate ingest via GH Actions #43
nextclade run as part of ingest -> tracking in ingest: Run Nextclade as part of ingest #44

Prepping for starting the ingest workflow for NCBI data. Copying over the ingest workflow from the pathgoen-repo-guide¹ as a "custom" build for the ingest workflow because the default ingest for avian-flu uses fauna. I will make changes to adopt the workflow to this repo in subsequent commits. ¹ <https://github.com/nextstrain/pathogen-repo-guide/tree/f33c43edd9ebad10aa0e8d2b0791755ddbe2f5c8/ingest>

1. Remove `ingest/build-configs/nbci/vendored` -- We will be using the shared `ingest/vendored` scripts that already exist in this repo. No need to vendored inception! 2. Remove `ingest/build-configs/ncbi/build-configs` -- the NCBI workflow itself is already a "custom" build. No need for build-configs inception! 3. Remove files and rules related to Nextclade. We can add them back later if we want to integrate Nextclade outputs with the metadata. 4. Remove extra geolocation_rules.tsv, just use the shared ingest/defaults/geolocation_rules.tsv since these fixes should be data source agnostic. 5. Remove boilerplate rules related to fetching from NCBI Entrez. 6. Remove boilerplate text from pathogen-repo-guide.

Use the `custom_rules` to include the NCBI ingest Snakefile with the main workflow. Include file paths updates to namespace the NCBI ingest outputs under `ncbi/`.

@trvrb

Using NCBI taxon id `11320` for Influenza A viruses to include all GenBank records that are then further labeled with serotypes (e.g. H5N1). Filtering down with the `--released-after` and `--geo-location` options for NCBI datasets to get a smaller subset of data that centered around the H5N1 outbreak in the U.S. Based on the filter URL for NCBI virus that was shared by @trvrb: <https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&EnvSample_s=include&VirusLineage_ss=Alphainfluenzavirus,%20taxid:197911&Serotype_s=H5N1%20H5N*&CollectionDate_dr=2023-12-31T00:00:00.00Z%20TO%202024-05-08T23:59:59.00Z&Region_s=North%20America>

ingest/build-configs/ncbi/rules/curate.smk

subrepo: subdir: "ingest/vendored" merged: "0f55d15" upstream: origin: "https://github.com/nextstrain/ingest" branch: "main" commit: "0f55d15" git-subrepo: version: "0.4.6" origin: "https://github.com/ingydotnet/git-subrepo" commit: "110b9eb"

Copying two scripts that were in nextstrain/ingest before they were removed in <nextstrain/ingest#19> - ncbi-virus-url - fetch-from-ncbi-virus Done in preparation for editing the NCBI Virus URL to be used for ingest because the NCBI Datasets does not include strain, segment, and serotype fields.

@AngieHinrichs

Thanks to comment by @AngieHinrichs¹ which linked to an example URL that uses the `Strain_s` field. Based on this field, I was able to guess the fields for serotype and segment. Keeping the `isolate` field because some records use the `isolate` for the strain name instead of the `strain` field. Also removes the `sequence` field since that is no longer returned by the API.² ¹ <#37 (comment)> ² <nextstrain/ingest#18>

Since the API no longer returns the actual genomic sequence, we will need to join the metadata with the sequences FASTA. This is simpler to do with `augur curate passthru` where the metadata is a CSV/TSV file. Then we don't need to copy over the `csv-to-ndjson` script.

Fetch from NCBI Virus to replace the metadata that is downloaded with NCBI Datasets since it includes additional metadata that we want in our builds, i.e. strain, serotype, and segment.

Transform the numeric segments from NCBI Virus into the segment names that are used in fauna.

Following the pattern in fauna and andersen-lab ingest to split the metadata and sequences by segment, then remerging the metadata into single file with an `n_segments` column added.

Running into a couple of GenBank records where their strain name and metadata are exactly the same, but they have different accessions. For example, A/redhead duck/North Carolina/W24-83A/2024 has four different GenBank records: - https://www.ncbi.nlm.nih.gov/nuccore/PP761257 - https://www.ncbi.nlm.nih.gov/nuccore/PP761548 - https://www.ncbi.nlm.nih.gov/nuccore/PP761557 - https://www.ncbi.nlm.nih.gov/nuccore/PP761576 Deduping the the data after they've been split out into segments so that we aren't dropping segments with the same strain name. Sadly, not every record has a BioSample accession, so we can only dedup on strain name.

Add hard-coded values and change metadata column names to match the output of fauna. This will make it easier to run the downstream phylogenetic analysis without making too many changes.

Transforms the scientific name of hosts to the host groupings used in the fauna outputs. Currently just done with manual curation of a host-map.tsv, but there's probably smarter ways to transform the scientific names.

Including the log output because I wanted it to help debug why certain sequences were not getting included in the build.

Makes it easier to use the newly added NCBI ingest data by changing the config param `ingest_source`.

Switch hard-coded strain names to match those from the NCBI ingest. Adding an assert to make sure that we are using the NCBI data source. This currently only supports the local ingest data since I have not added the S3 uploads to the NCBI ingest workflow (yet!). Includes changes to the Auspice config to point to GenBank as the data source. When testing with the new NCBI ingest data, I got a bunch of warnings from `augur export` about filter fields not appearing in the tree: > WARNING: The filter "domestic_status" does not appear as a property on any tree nodes. > WARNING: The filter "h5_label_clade" does not appear as a property on any tree nodes. > WARNING: The filter "gisaid_clade" does not appear as a property on any tree nodes. > WARNING: The filter "authors" does not appear as a property on any tree nodes. > WARNING: The filter "originating_lab" does not appear as a property on any tree nodes. It makes sense that domestic_status, h5_label_clade, gisaid_clade, and originating_lab are not included because they all have "?" as values. Fixing the `author` filter that _should_ be included.

Significantly outgrouped relative to cattle outbreak

…tput Allows us to include the `genbank_accession` field in the Auspice JSON so that Auspice can automatically generate the link the original GenBank record.

Use a separate description.md for the h5n1-cattle-outbreak to point to the public data sources not used by the usual avian-flu builds. I put this together with the general text from the mpox description.md¹ and the additional text on the H5N1 outbreak from the default avian-flu desciption.md² ¹ <https://github.com/nextstrain/mpox/blob/7f5adc3cab0c3034e455882581cc081f26f5ebbe/phylogenetic/defaults/description.md> ² <https://github.com/nextstrain/avian-flu/blob/master/config/description.md?plain=1#L3>

The strain name contains additional metadata that can supplement the metadata from NCBI. This expects strain names to be formatted as For non-human hosts: A / <host> / <location> / <sample id> / <year> For human hosts: A / <location> / <sample id> / <year> The host and year (as date) are added to the metadata if the original metadata field is empty. The location field is specially handled because the strain name might just contain a more detailed geolocation than the metadata location. In that case, the strain name location is just appended to the existing location field in the GenBank location format <record_location>:<strain_location>.

Including additional hosts were parsed from the strain names following the addition of `parse-metadata-from-strain` in the previous commit.

joverlee521 · 2024-05-29T22:24:20Z

Division is more filled out with metadata from the strain name using 19bc00e:

I'll plan to merge this tomorrow if there are no other comments.

genehack

LGTM; contains several very nitpicky nits picked by a nitwit.

README.md

genehack · 2024-05-29T22:12:52Z

Snakefile.genome

@@ -167,7 +173,7 @@ rule refine:
        coalescent = "const",
        date_inference = "marginal",
        clock_rate = clock_rate,
-        root_strain = "A/skunk/NewMexico/24006483001/2024"
+        root_strain = "A/skunk/NewMexico/24-006483-001/2024"


is it worth a comment here about why this strain for the root? feels kinda magic-number-y

It's the closest outgroup.

Added comment in b477716

ingest/build-configs/ncbi/bin/ncbi-virus-url

ingest/build-configs/ncbi/bin/parse-metadata-from-strain

ingest/build-configs/ncbi/bin/transform-host

ingest/build-configs/ncbi/bin/transform-segment

genehack · 2024-05-29T22:27:10Z

ingest/build-configs/ncbi/rules/curate.smk

+        all_geolocation_rules="ncbi/data/all-geolocation-rules.tsv",
+    shell:
+        """
+        cat {input.general_geolocation_rules} {input.local_geolocation_rules} >> {output.all_geolocation_rules}


why the >> instead of > ?

Good question. This was copied from pathogen-repo-guide, which was based on mpox/ingest and leads back to my own commit from ~2 years ago.

What was I thinking 🙃 My best guess is this is left over from when I had a rule to just append the local_geolocation_rules to the existing general_geolocation_rules without creating the additional all_geolocation_rules file.

Updated in 096ce90

…084466/2024 Just realized/remembered that BioSample can contain more detailed geolocation metadata than the GenBank record. This was the only sequence that I could find that needed the additional location data. Will revisit the joining of BioSample data in a future change if we see that it can fill in a lot more data.

Specify command to run to fetch data from NCBI GenBank

trvrb

Your script to parse location from strain name was very clever!

This is basically good to go for me. There was just one thing I caught where annotations don't seem to be being applied.

ingest/build-configs/ncbi/defaults/annotations.tsv

Additional comments in the annotations file need to be separated by a space after the field value. Previously separating comments by a tab resulted in a warning from the `merge-user-metadata` script: > WARNING: Could not decode annotation line ... We can make the `merge-user-metadata` script smarter about parsing comments in the annotations file, but punting that for now.

@genehack

Adding some comments to clarify the `root_strain`. Thanks @genehack for calling out the magic-number¹ We will probably want to refactor the workflow to pull out these types of values into a config YAML at some point, but I'm punting on that for now. ¹ <#40 (comment)>

@genehack

Thanks @genehack for the "several very nitpicky nits picked by a nitwit"¹ Prompted me to clea up various other nitpicky nits. ¹ <#40 (review)>

@genehack

Thanks @genehack for pointing it out in <#40 (comment)>

Applying annotation for divison of `A/domesticcat/UnitedStates/24-009311-006/2024` across the board to use '?' for divisions that are unknown.

@genehack

Adding some comments to clarify the `root_strain`. Thanks @genehack for calling out the magic-number¹ We will probably want to refactor the workflow to pull out these types of values into a config YAML at some point, but I'm punting on that for now. ¹ <#40 (comment)>

@genehack

Thanks @genehack for the "several very nitpicky nits picked by a nitwit"¹ Prompted me to clea up various other nitpicky nits. ¹ <#40 (review)>

@genehack

Thanks @genehack for pointing it out in <#40 (comment)>

Follow up to #40

joverlee521 added 5 commits May 23, 2024 16:27

ingest: Connect NCBI ingest with main workflow

4d6a8c8

Use the `custom_rules` to include the NCBI ingest Snakefile with the main workflow. Include file paths updates to namespace the NCBI ingest outputs under `ncbi/`.

ingest/build-configs/ncbi: README cleanup

3c491c9

joverlee521 commented May 28, 2024

View reviewed changes

ingest/build-configs/ncbi/rules/curate.smk Outdated Show resolved Hide resolved

joverlee521 commented May 28, 2024

View reviewed changes

ingest/build-configs/ncbi/rules/curate.smk Show resolved Hide resolved

joverlee521 mentioned this pull request May 28, 2024

transform-field-names: Skip field if old_field_name == new_field_name nextstrain/ingest#37

Closed

joverlee521 added 15 commits May 28, 2024 15:34

ingest/build-configs/ncbi: Add rules to fetch from NCBI Virus

c506aa8

Fetch from NCBI Virus to replace the metadata that is downloaded with NCBI Datasets since it includes additional metadata that we want in our builds, i.e. strain, serotype, and segment.

ingest/build-configs/ncib: Add transform-segment

32ab5ab

Transform the numeric segments from NCBI Virus into the segment names that are used in fauna.

ingest/build-configs/ncbi: Split curated data by segment

d5db437

Following the pattern in fauna and andersen-lab ingest to split the metadata and sequences by segment, then remerging the metadata into single file with an `n_segments` column added.

ingest/build-configs/ncbi: Match fauna metadata output

e785416

Add hard-coded values and change metadata column names to match the output of fauna. This will make it easier to run the downstream phylogenetic analysis without making too many changes.

ingest/build-configs/ncbi: Add transform-host

78f4ee7

Transforms the scientific name of hosts to the host groupings used in the fauna outputs. Currently just done with manual curation of a host-map.tsv, but there's probably smarter ways to transform the scientific names.

Snakemake.genome/filter: Add log output for debugging

f2cf349

Including the log output because I wanted it to help debug why certain sequences were not getting included in the build.

rules/common.smk: Add INGEST_SOURCE

1540385

Makes it easier to use the newly added NCBI ingest data by changing the config param `ingest_source`.

h5n1-cattle-outbreak: Add more dropped strains

1b4430a

Significantly outgrouped relative to cattle outbreak

README: Update instructions for h5n1-cattle-outbreak

7a08f1e

joverlee521 force-pushed the ingest-ncbi-virus branch from 4b511fc to 7a08f1e Compare May 29, 2024 01:03

joverlee521 marked this pull request as ready for review May 29, 2024 01:06

ingest/build-configs/ncbi: Include genbank_accession in metadata ou…

eb11d2e

…tput Allows us to include the `genbank_accession` field in the Auspice JSON so that Auspice can automatically generate the link the original GenBank record.

This comment was marked as resolved.

Sign in to view

joverlee521 requested review from a team and trvrb May 29, 2024 17:30

joverlee521 added 2 commits May 29, 2024 15:07

ingest/build-configs/ncbi: Add to host-map.tsv

64d8d99

Including additional hosts were parsed from the strain names following the addition of `parse-metadata-from-strain` in the previous commit.

genehack approved these changes May 29, 2024

View reviewed changes

joverlee521 and others added 3 commits May 29, 2024 15:44

Update README under ingest

03f401e

Specify command to run to fetch data from NCBI GenBank

Include additional annotations

5b51817

trvrb self-requested a review May 29, 2024 23:59

trvrb requested changes May 30, 2024

View reviewed changes

ingest/build-configs/ncbi/defaults/annotations.tsv Outdated Show resolved Hide resolved

joverlee521 added a commit that referenced this pull request May 30, 2024

Various nits

eb5e068

Thanks @genehack for the "several very nitpicky nits picked by a nitwit"¹ Prompted me to clea up various other nitpicky nits. ¹ <#40 (review)>

joverlee521 added a commit that referenced this pull request May 30, 2024

ingest/build-configs/ncbi: Switch unnecessary >> to >

096ce90

Thanks @genehack for pointing it out in <#40 (comment)>

joverlee521 added 4 commits May 30, 2024 15:52

ingest: Update geolocation_rules.tsv

67c34f2

Applying annotation for divison of `A/domesticcat/UnitedStates/24-009311-006/2024` across the board to use '?' for divisions that are unknown.

Various nits

d4b9693

Thanks @genehack for the "several very nitpicky nits picked by a nitwit"¹ Prompted me to clea up various other nitpicky nits. ¹ <#40 (review)>

ingest/build-configs/ncbi: Switch unnecessary >> to >

66e1903

Thanks @genehack for pointing it out in <#40 (comment)>

joverlee521 force-pushed the ingest-ncbi-virus branch from 096ce90 to 66e1903 Compare May 30, 2024 22:56

joverlee521 merged commit 24cdc04 into master May 30, 2024
6 checks passed

joverlee521 deleted the ingest-ncbi-virus branch May 30, 2024 23:05

This was referenced May 30, 2024

Ingest NCBI GenBank data for H5N1 outbreak #37

Closed

ingest: Upload NCBI/Andersen lab outputs to S3 #41

Closed

ingest: Join NCBI and Andersen lab data #42

Closed

Automate ingest via GH Actions #43

Closed

trvrb self-requested a review May 31, 2024 00:20

joverlee521 mentioned this pull request May 31, 2024

ingest: Run Nextclade as part of ingest #44

Open

joverlee521 added a commit that referenced this pull request May 31, 2024

.gitignore: Ignore ingest/ncbi subdirectory

f1569ec

Follow up to #40

joverlee521 mentioned this pull request May 31, 2024

.gitignore: Ignore ingest/ncbi subdirectory #45

Merged

1 task

trvrb mentioned this pull request Jun 3, 2024

Ingest extra sources #35

Closed

joverlee521 mentioned this pull request Jun 4, 2024

ingest/build-configs/ncbi: Download full NCBI dataset #48

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ingest from NCBI Virus + NCBI Datasets #40

Ingest from NCBI Virus + NCBI Datasets #40

joverlee521 commented May 25, 2024 •

edited

Loading

This comment was marked as resolved.

joverlee521 commented May 29, 2024

genehack left a comment

genehack May 29, 2024

trvrb May 30, 2024

joverlee521 May 30, 2024

genehack May 29, 2024

joverlee521 May 30, 2024

joverlee521 May 30, 2024

trvrb left a comment

Ingest from NCBI Virus + NCBI Datasets #40

Ingest from NCBI Virus + NCBI Datasets #40

Conversation

joverlee521 commented May 25, 2024 • edited Loading

Description of proposed changes

Major change

Related issue(s)

Checklist

This comment was marked as resolved.

joverlee521 commented May 29, 2024

genehack left a comment

Choose a reason for hiding this comment

genehack May 29, 2024

Choose a reason for hiding this comment

trvrb May 30, 2024

Choose a reason for hiding this comment

joverlee521 May 30, 2024

Choose a reason for hiding this comment

genehack May 29, 2024

Choose a reason for hiding this comment

joverlee521 May 30, 2024

Choose a reason for hiding this comment

joverlee521 May 30, 2024

Choose a reason for hiding this comment

trvrb left a comment

Choose a reason for hiding this comment

joverlee521 commented May 25, 2024 •

edited

Loading