nextstrain.org/zika/ingest

This is the ingest pipeline for Zika virus sequences.

Software requirements

Follow the standard installation instructions for Nextstrain's suite of software tools.

Usage

NOTE: All command examples assume you are within the ingest directory. If you run commands from the outer zika directory, replace the . with ingest.

Fetch sequences with

nextstrain build . data/sequences.ndjson

Run the complete ingest pipeline with

nextstrain build .

This will produce two files (within the ingest directory):

  • results/metadata.tsv
  • results/sequences.fasta
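As a quick sanity check after a run, the two outputs can be compared by record count. The sketch below is only an illustration: `check_ingest_outputs` is a hypothetical helper, not part of the pipeline, and it assumes that metadata.tsv has exactly one header row and one row per sequence in sequences.fasta.

```shell
# Hypothetical helper (not part of the ingest pipeline).
# Assumption: metadata.tsv has one header row and one row per FASTA record.
check_ingest_outputs() {
    metadata="$1"
    sequences="$2"

    # Both files must exist and be non-empty.
    if [ ! -s "$metadata" ] || [ ! -s "$sequences" ]; then
        echo "missing or empty output file" >&2
        return 1
    fi

    # Metadata rows, excluding the header line.
    rows=$(( $(wc -l < "$metadata") - 1 ))
    # FASTA records, counted by their ">" header lines.
    records=$(grep -c '^>' "$sequences" || true)

    echo "metadata rows: $rows, sequences: $records"
    [ "$rows" -eq "$records" ]
}
```

From within the ingest directory this could be called as `check_ingest_outputs results/metadata.tsv results/sequences.fasta`; it returns a non-zero status when the counts differ.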

Run the complete ingest pipeline and upload results to AWS S3 with

nextstrain build \
    --env AWS_ACCESS_KEY_ID \
    --env AWS_SECRET_ACCESS_KEY \
    . \
        upload_all \
        --configfile build-configs/nextstrain-automation/config.yaml

Adding new sequences not from GenBank

Static Files

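Do the following to include sequences from static FASTA files. Unlike the usage commands above, the paths in these steps are relative to the outer zika directory.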

  1. Convert the FASTA files to NDJSON files with:

    ./ingest/bin/fasta-to-ndjson \
        --fasta {path-to-fasta-file} \
        --fields {fasta-header-field-names} \
        --separator {field-separator-in-header} \
        --exclude {fields-to-exclude-in-output} \
        > ingest/data/{file-name}.ndjson
  2. Add the following to the .gitignore to allow the file to be included in the repo:

    !ingest/data/{file-name}.ndjson
  3. Add the file name (without the .ndjson extension) as a source to ingest/defaults/config.yaml. This tells the ingest pipeline to concatenate the records with the GenBank sequences and run them through the same transform pipeline.
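To make the conversion in step 1 more concrete, here is a rough sketch of the idea: split each FASTA header into named fields and write one JSON object per record. It is a simplified stand-in and not the real fasta-to-ndjson script. The function name `fasta_to_ndjson_sketch`, its argument order, and the `sequence` key are assumptions made for illustration, and it does no JSON escaping or field exclusion.

```shell
# Hypothetical stand-in for illustration only; NOT the real fasta-to-ndjson.
# Usage: fasta_to_ndjson_sketch SEPARATOR FIELD1,FIELD2,... < input.fasta
# Assumptions: single-character separator, no characters needing JSON escaping.
fasta_to_ndjson_sketch() {
    awk -v sep="$1" -v fields="$2" '
        # Print the record collected so far as one JSON object.
        function emit(    i, out) {
            if (header == "") return
            split(header, values, sep)
            out = "{"
            for (i = 1; i <= nfields; i++)
                out = out "\"" names[i] "\":\"" values[i] "\","
            print out "\"sequence\":\"" seq "\"}"
        }
        BEGIN { nfields = split(fields, names, ",") }
        # A ">" line starts a new record: flush the previous one first.
        /^>/ { emit(); header = substr($0, 2); seq = ""; next }
        # Any other line is sequence data, possibly wrapped over several lines.
        { seq = seq $0 }
        END { emit() }
    '
}
```

For a header such as `>A1|2016-01-01`, calling the sketch with separator `|` and fields `strain,date` yields one line of NDJSON with the keys `strain`, `date`, and `sequence`.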

Configuration

Configuration takes place in defaults/config.yaml by default. Optional configs for uploading files and Slack notifications are in defaults/optional.yaml.

Environment Variables

The complete ingest pipeline with AWS S3 uploads and Slack notifications uses the following environment variables:

Required

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • SLACK_TOKEN
  • SLACK_CHANNELS
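Because a missing variable may only surface partway through a run, it can help to check for the required variables before starting. The following is a minimal sketch, not part of the pipeline; `missing_env_vars` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the ingest pipeline).
# Prints the names of the given variables that are unset or empty,
# separated by spaces; prints an empty line if all are set.
missing_env_vars() {
    missing=""
    for name in "$@"; do
        # POSIX-compatible indirect lookup of the variable named by $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    # Drop the leading space before printing.
    printf '%s\n' "${missing# }"
}
```

One way to use it is to call `missing_env_vars AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY SLACK_TOKEN SLACK_CHANNELS` before `nextstrain build` and stop if the output is not empty.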

Optional

These are optional environment variables used in our automated pipeline for providing detailed Slack notifications.

Input data

GenBank data

GenBank sequences and metadata are fetched via NCBI Datasets.

ingest/vendored

This repository uses git subrepo to manage copies of ingest scripts in ingest/vendored, from nextstrain/ingest.

See vendored/README.md for instructions on how to update the vendored scripts.