De-duplicate NCBI strain #79

Merged
merged 2 commits into from
Jul 23, 2024

Commits on Jul 22, 2024

  1. ingest/andersen-lab: dedup by strain name

    Resolves #77
    
    It is unclear whether the flagged records listed in the issue are true
    duplicates or data entry errors, but the workflow should deduplicate
    by strain name anyway, since we depend on the strain name to match
    metadata and sequences in the phylogenetic workflow.
    joverlee521 committed Jul 22, 2024
    084db57
  2. ingest/ncbi: Dedup joined-ncbi data

    I noticed while working through #77
    that we do not deduplicate the joined GenBank/Andersen lab SRA data.
    
    We only append SRA data if they are not linked to any GenBank records,
    but there's still a chance that they can share the same strain name.
    We have not run into any duplicates yet, but this adds deduplication as a
    precaution.
    joverlee521 committed Jul 22, 2024
    5d5e36a
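
The idea behind both commits, deduplicating records by strain name before metadata and sequences are matched, can be sketched in a few lines of Python. This is only an illustration and not the workflow's actual implementation: the `dedup_by_strain` helper, the `strain` and `source` field names, the placeholder strain names, and the keep-first policy are all assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator


def dedup_by_strain(
    records: Iterable[Dict[str, str]], key: str = "strain"
) -> Iterator[Dict[str, str]]:
    """Yield only the first record seen for each strain name.

    Keeping the first occurrence is an assumption for this sketch; the
    real workflow may resolve duplicates differently.
    """
    seen = set()
    for record in records:
        name = record[key]
        if name in seen:
            # A later record shares a strain name with an earlier one: skip it.
            continue
        seen.add(name)
        yield record


# Hypothetical records standing in for GenBank data and appended SRA data.
genbank = [
    {"strain": "strain-A", "source": "genbank"},
    {"strain": "strain-B", "source": "genbank"},
]
sra = [
    {"strain": "strain-B", "source": "sra"},  # same strain name as a GenBank record
    {"strain": "strain-C", "source": "sra"},
]

# Append the SRA records after the GenBank records, then deduplicate the join.
joined = list(dedup_by_strain(genbank + sra))
```

Because the GenBank records come first in the concatenated list, a keep-first policy means a GenBank record wins whenever an SRA record shares its strain name. Whether that ordering matches the real workflow is not stated in the commits.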