ingest: Run Nextclade as part of ingest #44

Open
joverlee521 opened this issue May 31, 2024 · 2 comments · May be fixed by #62

@joverlee521
Contributor

Follow up to #40

With the recent addition of the community H5 Nextclade datasets in nextstrain/nextclade_data#196, it should now be possible to run Nextclade as part of ingest to assign clades to the H5 sequences.

Maybe this can replace the current manual clade labeling process that relies on the clade-labeling scripts?

@joverlee521
Contributor Author

joverlee521 commented Jun 19, 2024

I'm starting with the community/moncla-lab/iav-h5/ha/all-clades Nextclade dataset since that should work across both fauna and NCBI sequences. Tested manually on the ingest-with-nextclade branch.

NCBI

nextstrain build \
    ingest \
        joined-ncbi/results/nextclade.tsv \
        --configfile build-configs/ncbi/defaults/config.yaml

Almost everything is assigned to the expected 2.3.4.4b clade, except for 3 sequences that were assigned to clade 0:

  • A/skunk/Utah/24-008032-004/2024
  • A/cattle/Texas/24-009499-002/2024
  • A/cattle/Texas/24-009308-001/2024
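
Outliers like these can be pulled out by scanning the Nextclade TSV for rows whose clade differs from the expected one. The following is a minimal Python sketch, not the workflow's actual code. It assumes the TSV has `seqName` and `clade` columns (an assumption about this dataset's output), and it runs on a tiny inline sample instead of the real `nextclade.tsv`; `example-strain-1` is a made-up name.

```python
import csv
import io


def unexpected_clades(tsv_file, expected_clade, name_col="seqName", clade_col="clade"):
    """Return (name, clade) pairs for rows whose clade is not the expected one.

    The column names are assumptions; adjust them to match the real TSV header.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [
        (row[name_col], row[clade_col])
        for row in reader
        if row[clade_col] != expected_clade
    ]


# Inline sample standing in for joined-ncbi/results/nextclade.tsv.
sample = io.StringIO(
    "seqName\tclade\n"
    "A/cattle/Texas/24-009499-002/2024\t0\n"
    "example-strain-1\t2.3.4.4b\n"
)
outliers = unexpected_clades(sample, "2.3.4.4b")
for name, clade in outliers:
    print(f"{name}\t{clade}")
```

For the real file, the same function could be called with `open("joined-ncbi/results/nextclade.tsv", newline="")` in place of the inline sample.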

Fauna

nextstrain build \
    --envdir ../env.d/seasonal-flu/ \
    ingest \
        fauna/results/nextclade.tsv \
        --configfile build-configs/ncbi/defaults/config.yaml

Since this covers all avian flu and not just H5, ~30% of the sequences are not assigned to any clade.

Detailed breakdown of counts:

clade          count  percent
(unassigned)   13371    30.65
0                598     1.37
1                630     1.44
1.1              127     0.29
1.1.1             76     0.17
1.1.2            224     0.51
2.1.1             80     0.18
2.1.2             79     0.18
2.1.3             55     0.13
2.1.3.1           36     0.08
2.1.3.2          429     0.98
2.1.3.2a         129     0.30
2.1.3.2b         133     0.30
2.1.3.3           51     0.12
2.2              817     1.87
2.2.1            555     1.27
2.2.1.1          140     0.32
2.2.1.1a         117     0.27
2.2.1.2          645     1.48
2.2.2            184     0.42
2.2.2.1           72     0.17
2.3.1             21     0.05
2.3.2            100     0.23
2.3.2.1          217     0.50
2.3.2.1a         932     2.14
2.3.2.1b          92     0.21
2.3.2.1c         211     0.48
2.3.2.1d          96     0.22
2.3.2.1e         793     1.82
2.3.2.1f         559     1.28
2.3.2.1g         384     0.88
2.3.3             28     0.06
2.3.4            574     1.32
2.3.4.1           58     0.13
2.3.4.2           86     0.20
2.3.4.3          206     0.47
2.3.4.4          307     0.70
2.3.4.4a         246     0.56
2.3.4.4b       14625    33.53
2.3.4.4c        1268     2.91
2.3.4.4d         134     0.31
2.3.4.4e         718     1.65
2.3.4.4f         173     0.40
2.3.4.4g         221     0.51
2.3.4.4h         736     1.69
2.4                6     0.01
2.5               18     0.04
3                  7     0.02
4                 26     0.06
5                 16     0.04
6                 11     0.03
7                 69     0.16
7.1               20     0.05
7.2               60     0.14
8                  4     0.01
9                 22     0.05
Am-nonGsGD      1263     2.90
EA-nonGsGD       769     1.76
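
A breakdown of this shape can be reproduced by tallying the clade column and dividing by the total row count. The sketch below is one possible way to do it in Python, not necessarily how the numbers above were produced. The `clade` column name is an assumption, and the inline sample is illustrative data, not real results.

```python
import csv
import io
from collections import Counter


def clade_breakdown(tsv_file, clade_col="clade"):
    """Return (clade, count, percent) tuples sorted by clade name.

    Rows with an empty clade are counted under the empty string, which
    corresponds to sequences with no clade assignment.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter((row[clade_col] or "").strip() for row in reader)
    total = sum(counts.values())
    return [
        (clade, n, round(100 * n / total, 2))
        for clade, n in sorted(counts.items())
    ]


# Illustrative sample only; sequence names are placeholders.
sample = io.StringIO(
    "seqName\tclade\n"
    "s1\t\n"
    "s2\t2.3.4.4b\n"
    "s3\t2.3.4.4b\n"
    "s4\t0\n"
)
breakdown = clade_breakdown(sample)
for clade, n, pct in breakdown:
    print(f"{clade or '(unassigned)'}\t{n}\t{pct:.2f}")
```

Note that the plain string sort used here orders clades lexically, which happens to match the ordering in the table but is not a true version-aware sort.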

Tomorrow (Thursday) I'm going to join the Nextclade output with the metadata and cross-check the clades against the existing clades from fauna.

@joverlee521
Contributor Author

Latest push to the ingest-with-nextclade branch now joins the metadata with the Nextclade output.

I took a brief look at the fauna side to compare the Nextclade clades with the existing GISAID clades.

Of the 43,642 records:

  • 21,714 had exactly the same clade designation from Nextclade and GISAID
  • 13,242 had no clade designation from either
  • 5,590 had a Nextclade clade but no GISAID clade
  • 3,096 had discrepancies between the Nextclade clade and the GISAID clade
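
The four categories above can be expressed as a small classification over the two clade columns of the joined metadata. The sketch below is a hedged Python illustration rather than the comparison actually run for this issue. The column names `nextclade_clade` and `gisaid_clade` are hypothetical, an empty string is assumed to mean "no clade", and records with a GISAID clade but no Nextclade clade are assumed to count as discrepancies, since the list has no separate category for them.

```python
from collections import Counter


def classify(nextclade_clade, gisaid_clade):
    """Assign one record to a comparison category.

    Assumption: an empty or missing value means "no clade designation".
    """
    new = (nextclade_clade or "").strip()
    old = (gisaid_clade or "").strip()
    if not new and not old:
        return "both_missing"
    if new == old:
        return "match"
    if new and not old:
        return "nextclade_only"
    # Covers differing clades and, by assumption, GISAID-only records.
    return "discrepancy"


def compare_clades(records, new_col="nextclade_clade", old_col="gisaid_clade"):
    """Count records per category; column names are hypothetical."""
    return Counter(classify(r.get(new_col), r.get(old_col)) for r in records)


# Illustrative records, not real data.
records = [
    {"nextclade_clade": "2.3.4.4b", "gisaid_clade": "2.3.4.4b"},
    {"nextclade_clade": "", "gisaid_clade": ""},
    {"nextclade_clade": "2.3.4.4b", "gisaid_clade": ""},
    {"nextclade_clade": "2.3.4.4b", "gisaid_clade": "2.3.4.4c"},
    {"nextclade_clade": "", "gisaid_clade": "2.3.4.4c"},
]
summary = compare_clades(records)
print(dict(summary))
```

A useful sanity check on any such tally is that the category counts add up to the number of records, as the four figures above do (they sum to 43,642).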

@joverlee521 joverlee521 linked a pull request Jun 24, 2024 that will close this issue
3 tasks