De-duplicate NCBI strain #79

Resolves #77. It is unclear whether the flagged records listed in the issue are true duplicates or data entry errors, but the workflow should deduplicate by strain name anyway, since we depend on the strain name to match metadata and sequences in the phylogenetic workflow.

While working through #77, I noticed that we do not deduplicate the joined GenBank/Andersen lab SRA data. We only append SRA data if they are not linked to any GenBank records, but there is still a chance that records share the same strain name. We have not run into any duplicates yet, so this adds deduplication as a precaution.
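As a rough illustration of what deduplicating by strain name means here, the sketch below keeps the first record seen for each strain and drops later ones. It is not the workflow's actual implementation: the function name `dedup_by_strain`, the `strain` field name, and the keep-first rule are all assumptions made for the example.

```python
def dedup_by_strain(records, key="strain"):
    """Return records with duplicate strain names removed.

    Hypothetical helper: keeps the first record for each strain name
    and preserves the original order. The "keep first" rule is an
    assumption, not necessarily what the workflow does.
    """
    seen = set()
    unique = []
    for record in records:
        name = record[key]
        if name in seen:
            # A record with this strain name was already kept; skip it.
            continue
        seen.add(name)
        unique.append(record)
    return unique


# Made-up example rows standing in for joined GenBank and SRA metadata.
joined = [
    {"strain": "A/example/1", "source": "genbank"},
    {"strain": "A/example/2", "source": "genbank"},
    {"strain": "A/example/1", "source": "sra"},
]
deduplicated = dedup_by_strain(joined)
```

With keep-first ordering, placing GenBank records ahead of appended SRA records means the GenBank record wins when both share a strain name.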

Commits on Jul 22, 2024