Merge pull request #38 from nextstrain/ingest-no-curation
ingest: Provide target for raw metadata from NCBI Datasets
joverlee521 authored Apr 3, 2024
2 parents 5938287 + aa67664 commit 0f799d7
Showing 2 changed files with 26 additions and 0 deletions.
12 changes: 12 additions & 0 deletions ingest/README.md
@@ -25,6 +25,18 @@ This produces the default outputs of the ingest workflow:
- metadata = results/metadata.tsv
- sequences = results/sequences.fasta

### Dumping the full raw metadata from NCBI Datasets

The workflow has a target for dumping the full raw metadata from NCBI Datasets.

```
nextstrain build ingest dump_ncbi_dataset_report
```

This will produce the file `ingest/data/ncbi_dataset_report_raw.tsv`,
which you can inspect to determine what fields and data to use if you want to
configure the workflow for your pathogen.
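
As a rough sketch of how the dumped file could be inspected, the snippet below lists its column names. It assumes the first row of the TSV is a header row; `list_columns` is a hypothetical helper name and is not part of the workflow.

```python
import csv


def list_columns(tsv_path):
    """Return the column names from the first row of a TSV file.

    Hypothetical helper for illustration; assumes the first row is a header.
    """
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # An empty file has no first row, so fall back to an empty list.
        return next(reader, [])
```

Calling `list_columns("ingest/data/ncbi_dataset_report_raw.tsv")` from the repository root, after running the target above, should return the available field names as a list.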

## Defaults

The defaults directory contains all of the default configurations for the ingest workflow.
14 changes: 14 additions & 0 deletions ingest/rules/fetch_from_ncbi.smk
@@ -52,6 +52,20 @@ rule fetch_ncbi_dataset_package:
            --filename {output.dataset_package}
        """

# Note: This rule is not part of the default workflow!
# It is intended to be used as a specific target for users to be able
# to inspect and explore the full raw metadata from NCBI Datasets.
rule dump_ncbi_dataset_report:
    input:
        dataset_package="data/ncbi_dataset.zip",
    output:
        ncbi_dataset_tsv="data/ncbi_dataset_report_raw.tsv",
    shell:
        """
        dataformat tsv virus-genome \
            --package {input.dataset_package} > {output.ncbi_dataset_tsv}
        """


rule extract_ncbi_dataset_sequences:
    input: