This repository has been archived by the owner on Aug 5, 2021. It is now read-only.

lineage-only metadata.csv gets KeyError & NameErrors in llama_report.md #3

Open
AngieHinrichs opened this issue Sep 15, 2020 · 0 comments

@AngieHinrichs

This may be expected with a minimal metadata.csv (just the sequence_name and lineage columns), but llama_report.md ends up with a bunch of error stack traces from a KeyError and several NameErrors. Feel free to ignore this if supporting lineage-only metadata is not a high priority for you.

Since the QC will accept any sequence with at least 10,000 bases and at most 50% Ns, I figured I would throw in some partial sequences and see what happened. I downloaded GenBank sequences LC571017.1 and LC571028.1 and concatenated them into a file sequences10kTo20k.fa. I made an input.csv file sequences10kTo20k.csv with a "name" header and the two sequence names (see attached sequences10kTo20k.zip).
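For reference, the QC thresholds mentioned above can be sketched roughly as follows. This is not llama's actual code: the function names are hypothetical, and I am assuming that the N fraction is measured over the full sequence length and that the sequence name is the first whitespace-delimited token of the FASTA header. It also writes the kind of one-column input.csv with a "name" header described above.

```python
import csv
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the name is the first token of the header line.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def passes_qc(seq, min_length=10000, max_n_fraction=0.5):
    """Return True if seq has at least min_length bases and at most max_n_fraction Ns."""
    if not seq or len(seq) < min_length:
        return False
    # Assumption: the N fraction is computed over the whole sequence length.
    n_count = seq.upper().count("N")
    return n_count / len(seq) <= max_n_fraction


def write_input_csv(names, handle):
    """Write a one-column CSV with a 'name' header, one sequence name per row."""
    writer = csv.writer(handle)
    writer.writerow(["name"])
    for name in names:
        writer.writerow([name])
```

With this sketch, something like `write_input_csv([n for n, s in read_fasta(f) if passes_qc(s)], out)` would list only the sequences that meet both thresholds.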

I made a data directory 28-08-20 with files alignment.fasta, global.tree and lineage-only metadata.csv using the tree from the 28-08-20 release of Rob Lanfear's sarscov2phylo pipeline, corresponding GISAID sequence alignments and GISAID metadata lineage assignments. I can't post it here due to GISAID Terms and Conditions, but would be glad to share it by email with other registered GISAID users.
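To make "lineage-only" concrete, here is a rough sketch of how a fuller metadata table could be reduced to the two columns used here, sequence_name and lineage. The function name, the tab delimiter default, and the source column names passed in are all hypothetical; I am not claiming this matches the layout of any particular GISAID download.

```python
import csv
import io


def write_lineage_only_metadata(source, dest, name_field, lineage_field, delimiter="\t"):
    """Copy only the name and lineage columns from source into a two-column CSV.

    name_field and lineage_field are the (hypothetical) column names in the
    source table. Rows missing either value are skipped.
    Returns the number of rows written.
    """
    reader = csv.DictReader(source, delimiter=delimiter)
    writer = csv.writer(dest)
    writer.writerow(["sequence_name", "lineage"])
    written = 0
    for row in reader:
        # row.get() may return None for short rows, so fall back to "".
        name = (row.get(name_field) or "").strip()
        lineage = (row.get(lineage_field) or "").strip()
        if name and lineage:
            writer.writerow([name, lineage])
            written += 1
    return written
```

Skipping rows without a lineage is a design choice for this sketch only; keeping them with an empty lineage would be just as reasonable.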

Then I ran llama like this:

llama -r -i sequences10kTo20k.csv -f sequences10kTo20k.fa -d 28-08-20

and it ran to completion as far as I could tell (it ended with "Weaved .../2020-09-15-114148668/llama_report.md"). The output directory is linked below; it includes llama_report.md with errors like "KeyError: 'sample_date'" and "NameError: name 'query_dict' is not defined" (NameErrors also appear for full_tax_dict, colour_dict_dict, overall_tree_number and too_tall_trees):

https://hgwdev.gi.ucsc.edu/~angie/2020-09-15-114148668/
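One plausible reading of the KeyError, which I have not confirmed against the report code, is that the report looks up a sample_date column that a lineage-only metadata.csv does not have. A small pre-flight check along these lines would show which columns are absent. The function name is hypothetical, and the only column name taken from the error output is sample_date; any other required columns would be guesses.

```python
import csv
import io


def missing_columns(handle, required):
    """Return the names in `required` that are absent from the CSV header row."""
    reader = csv.reader(handle)
    header = next(reader, [])  # an empty file yields an empty header
    present = {column.strip() for column in header}
    return [column for column in required if column not in present]
```

For a header of `sequence_name,lineage`, calling `missing_columns(f, ["sequence_name", "lineage", "sample_date"])` reports `sample_date` as the only missing column.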

I see combined_metadata.csv has the lineages of the most similar sequences; that's what I was most curious about. (Using the usher program in Yatish Turakhia's strain_phylogenetics repo, I get B.5 (e.g. Japan/DP0462/2020|EPI_ISL_416602) for LC571017.1 and just B (Japan/DP0804/2020|EPI_ISL_416631) for LC571028.1. I'm working on the UCSC Genome Browser web interface for usher; you can try uploading fasta like sequences10kTo20k.fa, or VCF with sample genotype columns. usher is really fast but our reporting of results is still pretty rudimentary.)

Thanks for making your awesome suite of tools publicly available and easy to install.

Angie
