DNA search

Try it out

The client URL is https://straz.github.io/dna_search/client/

Requirements addressed

  • Technologies used are JavaScript, Python, and AWS.
  • All processing is service-based, coupled by message-passing.
  • A single script (install/start.py) bootstraps all dependencies (other than Docker) and launches all local dev processes.
  • Supports local development, with multiple developer environments coexisting on shared multitenant cloud resources.

Components

Client

The client is a simple static front end using just Bootstrap and jQuery. For local development, run serve.sh.

The client makes two API calls:

  • upload files to S3 (in {s3}/inbox/)
  • get status of current and pending jobs (from {api}/{env}/queries/{user})

The logged-in user is currently hard-coded ('[email protected]') in the client settings, but jobs are tagged by user, so multiple users are supported. Each user sees only their own data.

When uploading to S3, the files are tagged with metadata, which is captured in the jobs table.
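
For illustration, here is a rough Python/boto3 sketch of an upload carrying that metadata. The metadata keys (guid, email) and the example email address are assumptions based on the jobs-table fields below; the real client does this from the browser with jQuery.

   # Sketch only: upload a query file to the inbox with job metadata attached.
   # The metadata keys (guid, email) and the email address are assumptions.
   import uuid
   import boto3

   s3 = boto3.client("s3")

   def upload_query(path, env="dev", email="[email protected]"):  # placeholder email
       guid = str(uuid.uuid4())
       with open(path, "rb") as f:
           s3.put_object(
               Bucket="ginkgo-search",
               Key=f"{env}/inbox/{path.split('/')[-1]}",
               Body=f,
               Metadata={"guid": guid, "email": email},  # captured in the jobs table
           )
       return guid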

Files:

  client/index.html
  client/script.js
  client/style.css
  client/serve.sh

S3

S3 uses one bucket (ginkgo-search), with a top-level folder for each environment (prd, dev, bob, etc.).

Within each environment, there are two folders:

  {s3}/{env}/inbox    # for transient incoming files. Permissions are globally writable, not readable.
  {s3}/{env}/queries  # incoming files are validated, normalized, and placed here for processing

In addition, {s3}/prd/artifacts/functions.zip is the deploy package for the lambdas.

DynamoDB

There is one table in Dynamo holding all jobs. The fields are:

   guid         # primary key
   env          # prd (production) or one of the developer environments (dev, bob, etc.)
   start_time   # start of job
   email        # email of user who submitted job
   filename     # original filename (directory not included) of file submitted by user
   status       # one of: uploading, uploaded, done, error
   results      # list of dicts. Each describes either sequence matches or errors

The SAM local developer environments use the cloud-hosted DynamoDB instance for everything. Since it's sharded by env and is NoSQL, the developers can coexist peacefully.
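
As a sketch, creating a job record with boto3 might look like the following; the table name ("GinkgoJobs") is an assumption, and the fields mirror the list above.

   # Sketch: write a job record to the shared jobs table.
   # The table name "GinkgoJobs" is an assumption; fields follow the list above.
   import time
   import boto3

   jobs = boto3.resource("dynamodb").Table("GinkgoJobs")

   def create_job(guid, env, email, filename):
       jobs.put_item(Item={
           "guid": guid,                    # primary key
           "env": env,                      # prd or a developer environment
           "start_time": int(time.time()),
           "email": email,
           "filename": filename,
           "status": "uploading",
           "results": [],
       })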

SQS

When files arrive in S3, they are minimally processed (file ingestion) and a message is queued on SQS for further (sequence search) processing.

There is a separate SQS queue for each environment, with names GinkgoSQS-prd, GinkgoSQS-dev, etc.

GinkgoSQS-prd is special: it carries only 'process' messages, issued after bucket_watcher runs.

All the other (developer) queues carry both upload and process messages. Since S3 has no path to trigger the SAM local environment on uploads, bucket_watcher forwards upload events to the appropriate developer queue; these eventually end up in that developer's SAM local environment for processing.
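
A rough sketch of queuing a message on the per-environment queue; the message body fields ("type", "guid") are assumptions.

   # Sketch: send a message to the per-environment queue.
   # The body fields ("type", "guid") are assumptions.
   import json
   import boto3

   sqs = boto3.client("sqs")

   def notify(env, msg_type, guid):
       queue_url = sqs.get_queue_url(QueueName=f"GinkgoSQS-{env}")["QueueUrl"]
       sqs.send_message(
           QueueUrl=queue_url,
           MessageBody=json.dumps({"type": msg_type, "guid": guid}),
       )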

Lambda

There are four Lambdas and one layer. The worker code is in the processor lambda.

Files:

   functions/bucket_watcher.py   # accept and ingest file uploads
   functions/processor.py        # worker code: sequence matching
   functions/queries.py          # API for client gui - retrieves job data for a user
   functions/dev_proxy.py        # API for polling - forwards dev queues to the local environment
   functions/common.py           # shared code
   install/ncbi_download.py      # pulls NCBI reference data used in biopython layer
   install/build_layer.py        # pip and zip to create biopython layer

biopython layer

This layer is shared by the workers: it contains the (large) biopython and numpy libraries and the (potentially large) NCBI reference dataset.

bucket_watcher

bucket_watcher watches {s3}/{env}/inbox for incoming files. When a file is uploaded to S3:

  • the file is validated (trivially)
  • a record is entered in the database
  • the file is moved to the {s3}/{env}/queries folder
  • a message is sent to SQS to notify workers

In addition, the bucket_watcher instance on prd will forward events to each developer's queue so they can be handled by that developer's bucket_watcher instance.
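
Putting those steps together, a hedged sketch of the flow (reusing the create_job and notify helpers sketched in the DynamoDB and SQS sections; all names beyond the bucket and folder layout are assumptions):

   # Sketch of the bucket_watcher flow, not the actual handler.
   # create_job and notify are the helpers sketched above.
   import boto3

   s3 = boto3.client("s3")
   BUCKET = "ginkgo-search"

   def handle_upload(env, key, metadata):
       filename = key.split("/")[-1]
       # trivial validation: reject empty uploads (assumption)
       if s3.head_object(Bucket=BUCKET, Key=key)["ContentLength"] == 0:
           raise ValueError("empty upload")
       # enter a record in the jobs table
       create_job(metadata["guid"], env, metadata["email"], filename)
       # move the file from inbox/ to queries/
       s3.copy_object(Bucket=BUCKET, Key=f"{env}/queries/{filename}",
                      CopySource={"Bucket": BUCKET, "Key": key})
       s3.delete_object(Bucket=BUCKET, Key=key)
       # notify workers via SQS
       notify(env, "process", metadata["guid"])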

processor

processor listens on the SQS queue for jobs awaiting processing. When a message is received:

  • reference data is loaded from the biopython layer (cached in memory, as far as Lambda allows)
  • the query is retrieved from {s3}/{env}/queries
  • the search is run using biopython
  • the results are recorded in DynamoDB
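
A minimal sketch of that flow as a Lambda handler; the table name, the reference filename, and the run_search helper are placeholders, since the real biopython matching code lives in functions/processor.py.

   # Sketch of the processor flow. run_search() and the reference filename
   # are placeholders; the real biopython search lives in functions/processor.py.
   import json
   import boto3

   s3 = boto3.client("s3")
   jobs = boto3.resource("dynamodb").Table("GinkgoJobs")  # table name is an assumption
   REFERENCE = None  # cached across warm invocations, as far as Lambda allows

   def load_reference():
       global REFERENCE
       if REFERENCE is None:
           REFERENCE = open("/opt/data/reference.fasta").read()  # filename is an assumption
       return REFERENCE

   def run_search(query, reference):
       # placeholder for the biopython sequence search
       return [{"match": query.strip() in reference}]

   def handler(event, context):
       for record in event["Records"]:                    # SQS event batch
           msg = json.loads(record["body"])
           job = jobs.get_item(Key={"guid": msg["guid"]})["Item"]
           obj = s3.get_object(Bucket="ginkgo-search",
                               Key=f"{job['env']}/queries/{job['filename']}")
           results = run_search(obj["Body"].read().decode(), load_reference())
           jobs.update_item(
               Key={"guid": msg["guid"]},
               UpdateExpression="SET #s = :s, results = :r",
               ExpressionAttributeNames={"#s": "status"},  # status is a reserved word
               ExpressionAttributeValues={":s": "done", ":r": results},
           )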

queries

queries listens for HTTP requests from API Gateway. Currently it is open to the public, with no authentication. A query is GET /{env}/queries/{email}; it returns a list of the database items for that user, including the status and any results.
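
A hedged sketch of such a handler; because guid is the primary key, the lookup by env and email is shown here as a scan with a filter (whether the real table uses an index instead is not documented), and the table name is an assumption.

   # Sketch: return all jobs for one user in one environment.
   # Shown as a scan with a filter; the table name is an assumption.
   import json
   import boto3
   from boto3.dynamodb.conditions import Attr

   jobs = boto3.resource("dynamodb").Table("GinkgoJobs")

   def handler(event, context):
       env = event["pathParameters"]["env"]
       email = event["pathParameters"]["email"]
       items = jobs.scan(
           FilterExpression=Attr("env").eq(env) & Attr("email").eq(email)
       )["Items"]
       return {"statusCode": 200, "body": json.dumps(items, default=str)}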

dev_proxy

dev_proxy is a cloud-based API endpoint that lets a developer poll for pending messages in that developer's queue.
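
As an illustration, a poll could be as simple as draining the developer's queue and returning the message bodies; the path shape and response format in this sketch are assumptions.

   # Sketch: let a local SAM environment poll its developer queue over HTTP.
   # The path shape and response format are assumptions.
   import json
   import boto3

   sqs = boto3.client("sqs")

   def handler(event, context):
       env = event["pathParameters"]["env"]
       queue_url = sqs.get_queue_url(QueueName=f"GinkgoSQS-{env}")["QueueUrl"]
       resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
       messages = resp.get("Messages", [])
       for m in messages:
           sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=m["ReceiptHandle"])
       return {"statusCode": 200,
               "body": json.dumps([m["Body"] for m in messages])}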

biopython

The biopython layer contains large Python libraries and a copy of the reference data files. These are available to the Lambda at runtime; they are mounted at /opt/data while the Lambda runs.

The build.sh script for the layer is basically pip install plus zip. Building the layer on a Mac does not work: compiled binaries are involved, and the resulting error messages are quite deceptive. Build the layer on a CentOS image (or an authentic Amazon AMI) instead.

Install and configure

See Install