Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest from NCBI Virus + NCBI Datasets #40

Merged
merged 36 commits into from
May 30, 2024
Merged

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented May 25, 2024

Description of proposed changes

NCBI Datasets does not include the strain, serotype, and segment in the metadata, so this PR uses the NCBI Virus vvsearch2 API to pull down the metadata and joins it with the sequences downloaded via NCBI Datasets.

On this branch, I can run the ingest workflow with

nextstrain build ingest ingest_ncbi --configfile build-configs/ncbi/defaults/config.yaml

then successfully run the genome build (using tmp changes in 4b511fc)

nextstrain build . --snakefile Snakefile.genome --config local_ingest=True ingest_source=ncbi

Still needs some clean up but at least wanted to get this out for people to try out.

Major change

This PR completely changes over the h5n1-cattle-outbreak build to the NCBI data.
See the latest build at https://nextstrain.org/staging/avian-flu/ncbi/h5n1-cattle-outbreak/genome

Related issue(s)

Related to #37

Checklist

Prepping for starting the ingest workflow for NCBI data.
Copying over the ingest workflow from the pathgoen-repo-guide¹ as a
"custom" build for the ingest workflow because the default ingest
for avian-flu uses fauna.

I will make changes to adopt the workflow to this repo in subsequent
commits.

¹ <https://github.com/nextstrain/pathogen-repo-guide/tree/f33c43edd9ebad10aa0e8d2b0791755ddbe2f5c8/ingest>
1. Remove `ingest/build-configs/nbci/vendored` -- We will be using the
shared `ingest/vendored` scripts that already exist in this repo.
No need to vendored inception!

2. Remove `ingest/build-configs/ncbi/build-configs` -- the NCBI workflow
itself is already a "custom" build. No need for build-configs inception!

3. Remove files and rules related to Nextclade. We can add them back
later if we want to integrate Nextclade outputs with the metadata.

4. Remove extra geolocation_rules.tsv, just use the shared
ingest/defaults/geolocation_rules.tsv since these fixes should be
data source agnostic.

5. Remove boilerplate rules related to fetching from NCBI Entrez.

6. Remove boilerplate text from pathogen-repo-guide.
Use the `custom_rules` to include the NCBI ingest Snakefile with
the main workflow. Include file paths updates to namespace the
NCBI ingest outputs under `ncbi/`.
Using NCBI taxon id `11320` for Influenza A viruses to include all
GenBank records that are then further labeled with serotypes (e.g. H5N1).

Filtering down with the `--released-after` and `--geo-location`
options for NCBI datasets to get a smaller subset of data that centered
around the H5N1 outbreak in the U.S. Based on the filter URL for NCBI
virus that was shared by @trvrb:

<https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&EnvSample_s=include&VirusLineage_ss=Alphainfluenzavirus,%20taxid:197911&Serotype_s=H5N1%20H5N*&CollectionDate_dr=2023-12-31T00:00:00.00Z%20TO%202024-05-08T23:59:59.00Z&Region_s=North%20America>
subrepo:
  subdir:   "ingest/vendored"
  merged:   "0f55d15"
upstream:
  origin:   "https://github.com/nextstrain/ingest"
  branch:   "main"
  commit:   "0f55d15"
git-subrepo:
  version:  "0.4.6"
  origin:   "https://github.com/ingydotnet/git-subrepo"
  commit:   "110b9eb"
Copying two scripts that were in nextstrain/ingest before they were
removed in <nextstrain/ingest#19>

- ncbi-virus-url
- fetch-from-ncbi-virus

Done in preparation for editing the NCBI Virus URL to be used for
ingest because the NCBI Datasets does not include strain, segment,
and serotype fields.
Thanks to comment by @AngieHinrichs¹ which linked to an example URL that
uses the `Strain_s` field. Based on this field, I was able to guess the
fields for serotype and segment.

Keeping the `isolate` field because some records use the `isolate` for
the strain name instead of the `strain` field.

Also removes the `sequence` field since that is no longer returned by
the API.²

¹ <#37 (comment)>
² <nextstrain/ingest#18>
Since the API no longer returns the actual genomic sequence, we will
need to join the metadata with the sequences FASTA. This is simpler to
do with `augur curate passthru` where the metadata is a CSV/TSV file.
Then we don't need to copy over the `csv-to-ndjson` script.
Fetch from NCBI Virus to replace the metadata that is downloaded
with NCBI Datasets since it includes additional metadata that we want
in our builds, i.e. strain, serotype, and segment.
Transform the numeric segments from NCBI Virus into the segment
names that are used in fauna.
Following the pattern in fauna and andersen-lab ingest to split the
metadata and sequences by segment, then remerging the metadata into
single file with an `n_segments` column added.
Running into a couple of GenBank records where their strain name
and metadata are exactly the same, but they have different accessions.

For example, A/redhead duck/North Carolina/W24-83A/2024 has four
different GenBank records:
- https://www.ncbi.nlm.nih.gov/nuccore/PP761257
- https://www.ncbi.nlm.nih.gov/nuccore/PP761548
- https://www.ncbi.nlm.nih.gov/nuccore/PP761557
- https://www.ncbi.nlm.nih.gov/nuccore/PP761576

Deduping the the data after they've been split out into segments so that
we aren't dropping segments with the same strain name. Sadly, not every
record has a BioSample accession, so we can only dedup on strain name.
Add hard-coded values and change metadata column names to match the
output of fauna. This will make it easier to run the downstream
phylogenetic analysis without making too many changes.
Transforms the scientific name of hosts to the host groupings used
in the fauna outputs. Currently just done with manual curation of
a host-map.tsv, but there's probably smarter ways to transform the
scientific names.
Including the log output because I wanted it to help debug why
certain sequences were not getting included in the build.
Makes it easier to use the newly added NCBI ingest data by changing
the config param `ingest_source`.
Switch hard-coded strain names to match those from the NCBI ingest.
Adding an assert to make sure that we are using the NCBI data source.
This currently only supports the local ingest data since I have not
added the S3 uploads to the NCBI ingest workflow (yet!).

Includes changes to the Auspice config to point to GenBank as the
data source. When testing with the new NCBI ingest data, I got a bunch
of warnings from `augur export` about filter fields not appearing
in the tree:

> WARNING:  The filter "domestic_status" does not appear as a property on any tree nodes.
> WARNING:  The filter "h5_label_clade" does not appear as a property on any tree nodes.
> WARNING:  The filter "gisaid_clade" does not appear as a property on any tree nodes.
> WARNING:  The filter "authors" does not appear as a property on any tree nodes.
> WARNING:  The filter "originating_lab" does not appear as a property on any tree nodes.

It makes sense that domestic_status, h5_label_clade, gisaid_clade,
and originating_lab are not included because they all have "?" as values.

Fixing the `author` filter that _should_ be included.
Significantly outgrouped relative to cattle outbreak
@joverlee521 joverlee521 marked this pull request as ready for review May 29, 2024 01:06
…tput

Allows us to include the `genbank_accession` field in the Auspice JSON
so that Auspice can automatically generate the link the original
GenBank record.
@joverlee521

This comment was marked as resolved.

Use a separate description.md for the h5n1-cattle-outbreak to point
to the public data sources not used by the usual avian-flu builds.

I put this together with the general text from the mpox description.md¹
and the additional text on the H5N1 outbreak from the default
avian-flu desciption.md²

¹ <https://github.com/nextstrain/mpox/blob/7f5adc3cab0c3034e455882581cc081f26f5ebbe/phylogenetic/defaults/description.md>
² <https://github.com/nextstrain/avian-flu/blob/master/config/description.md?plain=1#L3>
@joverlee521 joverlee521 requested review from a team and trvrb May 29, 2024 17:30
The strain name contains additional metadata that can supplement the
metadata from NCBI. This expects strain names to be formatted as

For non-human hosts:

    A / <host> / <location> / <sample id> / <year>

For human hosts:

    A / <location> / <sample id> / <year>

The host and year (as date) are added to the metadata if the original
metadata field is empty. The location field is specially handled
because the strain name might just contain a more detailed geolocation
than the metadata location. In that case, the strain name location is
just appended to the existing location field in the GenBank location
format <record_location>:<strain_location>.
Including additional hosts were parsed from the strain names following
the addition of `parse-metadata-from-strain` in the previous commit.
@joverlee521
Copy link
Contributor Author

Division is more filled out with metadata from the strain name using 19bc00e:

Screenshot 2024-05-29 at 3 19 13 PM


I'll plan to merge this tomorrow if there are no other comments.

Copy link
Contributor

@genehack genehack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; contains several very nitpicky nits picked by a nitwit.

README.md Outdated Show resolved Hide resolved
@@ -167,7 +173,7 @@ rule refine:
coalescent = "const",
date_inference = "marginal",
clock_rate = clock_rate,
root_strain = "A/skunk/NewMexico/24006483001/2024"
root_strain = "A/skunk/NewMexico/24-006483-001/2024"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth a comment here about why this strain for the root? feels kinda magic-number-y

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the closest outgroup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment in b477716

ingest/build-configs/ncbi/bin/ncbi-virus-url Outdated Show resolved Hide resolved
ingest/build-configs/ncbi/bin/ncbi-virus-url Outdated Show resolved Hide resolved
ingest/build-configs/ncbi/bin/parse-metadata-from-strain Outdated Show resolved Hide resolved
ingest/build-configs/ncbi/bin/transform-host Outdated Show resolved Hide resolved
ingest/build-configs/ncbi/bin/transform-segment Outdated Show resolved Hide resolved
all_geolocation_rules="ncbi/data/all-geolocation-rules.tsv",
shell:
"""
cat {input.general_geolocation_rules} {input.local_geolocation_rules} >> {output.all_geolocation_rules}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the >> instead of > ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. This was copied from pathogen-repo-guide, which was based on mpox/ingest and leads back to my own commit from ~2 years ago.

What was I thinking 🙃 My best guess is this is left over from when I had a rule to just append the local_geolocation_rules to the existing general_geolocation_rules without creating the additional all_geolocation_rules file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 096ce90

joverlee521 and others added 3 commits May 29, 2024 15:44
…084466/2024

Just realized/remembered that BioSample can contain more detailed
geolocation metadata than the GenBank record. This was the only
sequence that I could find that needed the additional location data.

Will revisit the joining of BioSample data in a future change if we
see that it can fill in a lot more data.
Specify command to run to fetch data from NCBI GenBank
@trvrb trvrb self-requested a review May 29, 2024 23:59
Copy link
Member

@trvrb trvrb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your script to parse location from strain name was very clever!

This is basically good to go for me. There was just one thing I caught where annotations don't seem to be being applied.

ingest/build-configs/ncbi/defaults/annotations.tsv Outdated Show resolved Hide resolved
Additional comments in the annotations file need to be separated by
a space after the field value. Previously separating comments by a tab
resulted in a warning from the `merge-user-metadata` script:

> WARNING: Could not decode annotation line ...

We can make the `merge-user-metadata` script smarter about parsing
comments in the annotations file, but punting that for now.
joverlee521 added a commit that referenced this pull request May 30, 2024
Adding some comments to clarify the `root_strain`.

Thanks @genehack for calling out the magic-number¹
We will probably want to refactor the workflow to pull out these types
of values into a config YAML at some point, but I'm punting on that
for now.

¹ <#40 (comment)>
joverlee521 added a commit that referenced this pull request May 30, 2024
Thanks @genehack for the "several very nitpicky nits picked by a nitwit"¹
Prompted me to clea up various other nitpicky nits.

¹ <#40 (review)>
joverlee521 added a commit that referenced this pull request May 30, 2024
Applying annotation for divison of `A/domesticcat/UnitedStates/24-009311-006/2024`
across the board to use '?' for divisions that are unknown.
Adding some comments to clarify the `root_strain`.

Thanks @genehack for calling out the magic-number¹
We will probably want to refactor the workflow to pull out these types
of values into a config YAML at some point, but I'm punting on that
for now.

¹ <#40 (comment)>
Thanks @genehack for the "several very nitpicky nits picked by a nitwit"¹
Prompted me to clea up various other nitpicky nits.

¹ <#40 (review)>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants