lineage-only metadata.csv gets KeyError & NameErrors in llama_report.md #3

AngieHinrichs · 2020-09-15T22:04:46Z

This may be expected with a minimal metadata.csv (just the sequence_name and lineage columns), but llama_report.md ends up with a bunch of error stacks from a KeyError and several NameErrors. Feel free to ignore this if supporting lineage-only metadata is not a high priority for you.

Since the QC will accept any sequence with at least 10,000 bases and at most 50% Ns, I figured I would throw in some partial sequences and see what happened. I downloaded GenBank sequences LC571017.1 and LC571028.1 and concatenated them into a file sequences10kTo20k.fa. I made an input.csv file sequences10kTo20k.csv with a "name" header and the two sequence names (see attached sequences10kTo20k.zip).
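For anyone who wants to pre-check sequences before running llama, the two QC thresholds mentioned above (at least 10,000 bases, at most 50% Ns) can be sketched roughly as below. This is only an illustration, not llama's actual QC code: the function names `read_fasta`, `n_fraction` and `passes_qc` are made up here, and taking the first whitespace-delimited token of the FASTA header as the sequence name is an assumption.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines.

    Assumption: the sequence name is the first whitespace-delimited
    token after '>' in the header line.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_fraction(seq: str) -> float:
    """Fraction of the sequence that is N (case-insensitive)."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def passes_qc(seq: str, min_length: int = 10_000, max_n_fraction: float = 0.5) -> bool:
    """Hypothetical re-statement of the thresholds quoted in this issue."""
    return len(seq) >= min_length and n_fraction(seq) <= max_n_fraction
```

With real input this could be used as `for name, seq in read_fasta(open("sequences10kTo20k.fa")): print(name, passes_qc(seq))`, and the names that pass could then be written under a "name" header to build the input csv.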

I made a data directory 28-08-20 with files alignment.fasta, global.tree and lineage-only metadata.csv using the tree from the 28-08-20 release of Rob Lanfear's sarscov2phylo pipeline, corresponding GISAID sequence alignments and GISAID metadata lineage assignments. I can't post it here due to GISAID Terms and Conditions, but would be glad to share it by email with other registered GISAID users.

Then I ran llama like this:

llama -r -i sequences10kTo20k.csv -f sequences10kTo20k.fa -d 28-08-20

and it ran to completion as far as I could tell (it ended with "Weaved .../2020-09-15-114148668/llama_report.md"). The output directory is here, including llama_report.md with errors like "KeyError: 'sample_date'" and "NameError: name 'query_dict' is not defined" (there are also NameErrors for full_tax_dict, colour_dict_dict, overall_tree_number and too_tall_trees):

https://hgwdev.gi.ucsc.edu/~angie/2020-09-15-114148668/
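A quick way to see in advance which columns a metadata.csv lacks is sketched below. It is a minimal illustration, not part of llama: the helper name `missing_columns` is hypothetical, and the only column known from the report to be expected beyond sequence_name and lineage is sample_date (from the KeyError); any other expected columns would have to be confirmed against llama itself.

```python
import csv
import io
from typing import List, Sequence


def missing_columns(csv_text: str, expected: Sequence[str]) -> List[str]:
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    return [col for col in expected if col not in present]


# Lineage-only metadata as described in this issue (the row is a placeholder).
lineage_only = "sequence_name,lineage\nexample_sequence,B\n"

# 'sample_date' comes from the KeyError in llama_report.md; this list is
# an assumption about what the report needs, not llama's documented schema.
expected = ["sequence_name", "lineage", "sample_date"]

print(missing_columns(lineage_only, expected))
```

A check like this before generating the report could turn the stack traces into a single message naming the absent columns, or let the report skip the sections that need them.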

I see combined_metadata.csv has the lineages of the most similar sequences; that's what I was most curious about. (Using the usher program in Yatish Turakhia's strain_phylogenetics repo, I get B.5 (e.g. Japan/DP0462/2020|EPI_ISL_416602) for LC571017.1 and just B (Japan/DP0804/2020|EPI_ISL_416631) for LC571028.1. I'm working on the UCSC Genome Browser web interface for usher; you can try uploading fasta like sequences10kTo20k.fa, or VCF with sample genotype columns. usher is really fast but our reporting of results is still pretty rudimentary.)

Thanks for making your awesome suite of tools publicly available and easy to install.

Angie
