Cloud parser refactor #74

Merged · 20 commits · Jun 26, 2020
Commits
06e5c02
moved csv_reader to a new common.py file to decrease dup code
JeffreyThiessen Mar 31, 2020
c4239cd
Moved directory finding code to common parser code
JeffreyThiessen Mar 31, 2020
7b52198
Refactor miseq get_sequencing_run
JeffreyThiessen Apr 7, 2020
249b530
cleanup whitespace pep8
JeffreyThiessen Apr 7, 2020
ee588f9
Change sample sheet error to directory error when dir does not exist
JeffreyThiessen Apr 17, 2020
35237a7
Refactor miniseq parser for cloud support
JeffreyThiessen Apr 17, 2020
00aea45
add warning about cloud deployment to miseq parser
JeffreyThiessen Apr 17, 2020
5913c20
Merge branch 'development' of https://github.com/phac-nml/irida-uploa…
JeffreyThiessen Apr 22, 2020
b559e32
fix miseq parser tests
JeffreyThiessen Apr 27, 2020
29d23a4
cleanup empty test suite
JeffreyThiessen Apr 27, 2020
013b759
fix miniseq tests
JeffreyThiessen Apr 27, 2020
398b504
improved testing for common csv parser
JeffreyThiessen Apr 28, 2020
885c93e
updated directory parser and tests for cloud deploy
JeffreyThiessen Apr 29, 2020
4cbdc9c
added page on cloud deployment to dev docs
JeffreyThiessen Apr 29, 2020
d453bf1
Update version number and add to changelog
JeffreyThiessen Apr 29, 2020
bb45841
fix pep8 errors
JeffreyThiessen Apr 30, 2020
6dca4e7
fix broken div id's for integration tests
JeffreyThiessen Jun 1, 2020
8b98bd3
try to fix chromedriver once and for all
JeffreyThiessen Jun 2, 2020
b2e67b2
add more comments + touch of cleanup
JeffreyThiessen Jun 25, 2020
2193e88
fixed seplling
JeffreyThiessen Jun 26, 2020
19 changes: 16 additions & 3 deletions .travis.yml
@@ -27,7 +27,14 @@ matrix:
    - export PATH=${JAVA_HOME}/bin:$PATH
    - java -version
    - echo $JAVA_HOME
-   - ln -s /usr/lib/chromium-browser/chromedriver ~/bin/chromedriver
+   - CHROME_DRIVER_VERSION=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
+   - echo chromedriverversion
+   - echo $CHROME_DRIVER_VERSION
+   - wget -N https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip -P ~/
+   - unzip ~/chromedriver_linux64.zip -d ~/
+   - rm ~/chromedriver_linux64.zip
+   - sudo mv -f ~/chromedriver ~/bin/chromedriver
+   - sudo chmod 777 ~/bin/chromedriver
  - env: "TEST_SUITE=integrationtestsdev"
    python: 3.6
    before_script:
@@ -40,7 +47,14 @@ matrix:
    - export PATH=${JAVA_HOME}/bin:$PATH
    - java -version
    - echo $JAVA_HOME
-   - ln -s /usr/lib/chromium-browser/chromedriver ~/bin/chromedriver
+   - CHROME_DRIVER_VERSION=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
+   - echo chromedriverversion
+   - echo $CHROME_DRIVER_VERSION
+   - wget -N https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip -P ~/
+   - unzip ~/chromedriver_linux64.zip -d ~/
+   - rm ~/chromedriver_linux64.zip
+   - sudo mv -f ~/chromedriver ~/bin/chromedriver
+   - sudo chmod 777 ~/bin/chromedriver
  - env: "TEST_SUITE=pep8"
    python: 3.6

@@ -54,7 +68,6 @@ services:
addons:
  apt:
    packages:
-     - chromium-chromedriver
      - xvfb
  chrome: stable

6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,12 @@
Changes
=======

+Beta 0.4.2
+----------
+Developer changes:
+* Added support for cloud deployment by using the `iridauploader` available on `pip`
+* Added version updater script to the `scripts` directory
+
Beta 0.4.1
----------
Developer changes:
183 changes: 183 additions & 0 deletions docs/developers/cloud.md
@@ -0,0 +1,183 @@

# Deploying the uploader to the cloud

While there is no end-to-end solution that you can deploy straight onto the cloud, the iridauploader does allow you to use its modules to simplify your code for cloud deployment.


#### Why can't I just deploy straight to the cloud?

The main difficulty is that each cloud storage solution stores and exposes files differently, and it would not be feasible for us to support every cloud environment available.

## How to deploy to the cloud

The simplest way is to incorporate the `iridauploader` modules from `pip` / `PyPI`:

`pip install iridauploader`

Example of creating a new instance of the API and a MiSeq parser:

```python
import iridauploader.api as api
import iridauploader.parsers as parsers

api_instance = api.ApiCalls(client_id, client_secret, base_url, username, password, max_wait_time)
parser_instance = parsers.Parser.factory("miseq")
```
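
If you are unsure which parser names the factory accepts, the `parsers` package also exports `supported_parsers` (see `iridauploader/parsers/__init__.py` in this PR). A minimal sketch, assuming it is an iterable of parser name strings:

```python
import iridauploader.parsers as parsers

# build an instance of each available parser;
# assumes supported_parsers is an iterable of name strings such as "miseq"
for parser_name in parsers.supported_parsers:
    parser_instance = parsers.Parser.factory(parser_name)
    print(parser_name, type(parser_instance).__name__)
```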

## Examples for deployment on Azure Cloud

In these examples we have the following setup:
* We are using an Azure Function App written in Python
* Files are stored in blob storage containers (in our example `myblobcontainer`)
* We use a BlobTrigger that fires when a new run's sample sheet is uploaded, matching the path identifier `myblobcontainer/{name}.csv`

Example `function.json` file:

```json
{
  "scriptFile": "__init__.py",
  "disabled": false,
  "bindings": [
    {
      "name": "myblob",
      "type": "blobTrigger",
      "direction": "in",
      "path": "myblobcontainer/{name}.csv",
      "connection": "AzureWebJobsStorage"
    }
  ]
}
```

For the following example, we have this simple setup at the top of our `__init__.py` function app file.

```python
import logging
import os
import posixpath
import tempfile

from azure.storage.blob import BlobServiceClient
from azure.storage.blob import BlobClient
from azure.storage.blob import ContainerClient
import azure.functions as func

from iridauploader import parsers


# connect to our blob storage
connect_str = os.getenv('AZURE_STORAGE_CONNECTION_STRING')
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
# These strings could be fetched somehow, but this works for an example
container_name = "myblobcontainer"
container_client = blob_service_client.get_container_client(container_name)
# local scratch location for the downloaded sample sheet
# (example values; adjust for your function app)
local_path = tempfile.gettempdir()
local_file_name = "SampleSheet.csv"
```

### MiSeq example

For this example, we will fetch the entire folder for a MiSeq run as a set of blobs. When parsing runs from other sequencers, please consult the parser documentation for file structure differences.

```python
def main(myblob: func.InputStream):
    logging.info('Python blob trigger function %s', myblob.name)

    # download the sample sheet so it can be parsed
    download_sample_sheet_file_path = os.path.join(local_path, local_file_name)
    with open(download_sample_sheet_file_path, "wb") as download_file:
        download_file.write(myblob.read())
    logging.info("done downloading")

    # get the run directory name (the middle portion of the blob path)
    # example: 'myblobcontainer/miseq_run/SampleSheet.csv' -> 'miseq_run'
    run_directory_name = posixpath.split(posixpath.split(myblob.name)[0])[1]

    # we are going to use the miseq parser for this example
    my_parser = parsers.Parser.factory("miseq")
    logging.info("built parser")

    # This example was tested locally on a Windows machine, so replacing \\ with /
    # was needed for compatibility
    relative_data_path = my_parser.get_relative_data_directory().replace("\\", "/")
    full_data_dir = posixpath.join(
        run_directory_name,
        relative_data_path)

    # list the blobs of the run directory
    blob_list = list(container_client.list_blobs(full_data_dir))
    file_list = []
    # The file_blob_tuple_list can be useful at the upload stage if you do not want to
    # use the iridauploader.api module to upload to IRIDA; otherwise it can be ignored
    file_blob_tuple_list = []
    for file_blob in blob_list:
        file_name = remove_prefix(file_blob.name, full_data_dir)
        file_list.append(file_name)
        file_blob_tuple_list.append({"file_name": file_name, "blob": file_blob})

    # TODO: wrap this call in a try/except on the parser exceptions (see the sketch
    # after this example); errors in the sample sheet or missing files surface here
    sequencing_run = my_parser.get_sequencing_run(
        sample_sheet=download_sample_sheet_file_path,
        run_data_directory=full_data_dir,
        run_data_directory_file_list=file_list)
    logging.info("built sequencing run")

    # move on to upload / error handling when the parser finds an error in the run


def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    raise ValueError("expected '{}' to start with '{}'".format(text, prefix))
```
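
The TODO in the example above can be filled in along these lines. A sketch of the error handling, assuming `exceptions.SampleSheetError` and `exceptions.DirectoryError` (both raised by the parser code in this PR) cover the failures you care about; this would sit inside `main` around the `get_sequencing_run` call:

```python
from iridauploader.parsers import exceptions

try:
    sequencing_run = my_parser.get_sequencing_run(
        sample_sheet=download_sample_sheet_file_path,
        run_data_directory=full_data_dir,
        run_data_directory_file_list=file_list)
except exceptions.SampleSheetError as e:
    # the sample sheet could not be parsed, or it references missing files
    logging.error("Invalid sample sheet: %s", e)
    raise
except exceptions.DirectoryError as e:
    # the run data directory is missing or inaccessible
    logging.error("Invalid run directory: %s", e)
    raise
```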

### Directory example

In this example we will be using the basic file layout for a directory upload.

```
directory_run/
├── file_1.fastq.gz
├── file_2.fastq.gz
└── SampleList.csv
```
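
For reference, the `SampleList.csv` for this layout is a small CSV along these lines; the exact column set is defined by the directory parser's documentation, so treat this as an illustrative sketch rather than the authoritative format:

```
[Data]
Sample_Name,Project_ID,File_Forward,File_Reverse
my_sample,1,file_1.fastq.gz,file_2.fastq.gz
```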

```python
def main(myblob: func.InputStream):
    logging.info('Python blob trigger function %s', myblob.name)

    # download the sample sheet
    download_sample_sheet_file_path = os.path.join(local_path, local_file_name)
    with open(download_sample_sheet_file_path, "wb") as download_file:
        download_file.write(myblob.read())
    logging.info("done downloading")

    # get the run directory name (the middle portion of the blob path)
    # example: 'myblobcontainer/directory_run/SampleList.csv' -> 'directory_run'
    run_directory_name = posixpath.split(posixpath.split(myblob.name)[0])[1]

    # we are going to use the directory parser for this example
    my_parser = parsers.Parser.factory("directory")
    logging.info("built parser")

    # list the blobs of the run directory
    blob_list = list(container_client.list_blobs(run_directory_name))
    file_list = []
    # The file_blob_tuple_list can be useful at the upload stage if you do not want to
    # use the iridauploader.api module to upload to IRIDA; otherwise it can be ignored
    file_blob_tuple_list = []
    for file_blob in blob_list:
        file_name = remove_prefix(file_blob.name, run_directory_name)
        # trim the leading path separator
        file_name = file_name.lstrip("/")
        file_list.append(file_name)
        file_blob_tuple_list.append({"file_name": file_name, "blob": file_blob})

    # TODO: wrap this call in a try/except on the parser exceptions (see the
    # error-handling sketch after the MiSeq example); errors in the sample sheet
    # or missing files surface here
    sequencing_run = my_parser.get_sequencing_run(
        sample_sheet=download_sample_sheet_file_path,
        run_data_directory_file_list=file_list)
    logging.info("built sequencing run")

    # move on to upload / error handling when the parser finds an error in the run


def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    raise ValueError("expected '{}' to start with '{}'".format(text, prefix))
```
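
When you move on to the upload stage, the `file_blob_tuple_list` built above has everything needed to pull the read files down to local storage first. A minimal sketch, reusing `container_client` from the setup block; `download_run_files` is a hypothetical helper name, and this assumes the files fit on the function's local disk:

```python
def download_run_files(file_blob_tuple_list, destination_dir):
    """Download each blob in the run to destination_dir and return the local paths."""
    local_paths = []
    for entry in file_blob_tuple_list:
        local_file_path = os.path.join(destination_dir, entry["file_name"])
        blob_client = container_client.get_blob_client(entry["blob"])
        # stream the blob contents to a local file
        with open(local_file_path, "wb") as f:
            f.write(blob_client.download_blob().readall())
        local_paths.append(local_file_path)
    return local_paths
```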
2 changes: 2 additions & 0 deletions docs/index.md
@@ -199,3 +199,5 @@ Want to create a parser for a sequencer that we don't yet support or have an ide
[Information on the IRIDA python API](developers/api.md)

[Object Model Reference](developers/objects.md)

+[Cloud Deployment](developers/cloud.md)
2 changes: 1 addition & 1 deletion iridauploader/core/cli_entry.py
@@ -9,7 +9,7 @@

from . import api_handler, parsing_handler, logger, exit_return

VERSION_NUMBER = "0.4.1"
VERSION_NUMBER = "0.4.2"


def upload_run_single_entry(directory, force_upload=False):
1 change: 1 addition & 0 deletions iridauploader/parsers/__init__.py
@@ -1,3 +1,4 @@
from iridauploader.parsers.parsers import Parser
from iridauploader.parsers.parsers import supported_parsers
from iridauploader.parsers import exceptions
+from iridauploader.parsers import common
110 changes: 110 additions & 0 deletions iridauploader/parsers/common.py
@@ -0,0 +1,110 @@
"""
This file has generic utility methods that can be used by all parsers

These methods may rely on the os module to function, and are therefore not to be used in cloud environments.

They should be used as generic utilities for any new parser that is added to the project.
"""
import os
from csv import reader
import logging

from iridauploader.parsers import exceptions
from iridauploader import model


def get_csv_reader(sample_sheet_file):
    """
    Tries to create a csv.reader object which will be used to
    parse through the lines in SampleSheet.csv.

    Raises an error if:
        sample_sheet_file is not an existing file
        sample_sheet_file contains null byte(s)

    Arguments:
        sample_sheet_file -- path to the SampleSheet.csv file

    Returns a csv.reader object
    """

    if os.path.isfile(sample_sheet_file):
        csv_file = open(sample_sheet_file, "r")
        # strip any trailing newline characters from the end of the line
        # including Windows newline characters (\r\n)
        csv_lines = [x.rstrip('\n') for x in csv_file]
        csv_lines = [x.rstrip('\r') for x in csv_lines]

        # send the stripped lines to be parsed by csv's reader
        csv_reader = reader(csv_lines)
    else:
        raise exceptions.SampleSheetError(
            "Sample sheet cannot be parsed as a CSV file because it's not a regular file.",
            sample_sheet_file)

    return csv_reader


def find_directory_list(directory):
    """Find and return all subdirectories in the specified directory.

    Arguments:
        directory -- the directory to find subdirectories in

    Returns: a list of the full paths to each subdirectory
    """

    # Check that we can write to the given directory; raise a DirectoryError if we cannot.
    if not os.access(directory, os.W_OK):
        raise exceptions.DirectoryError(
            "The directory is not writable; cannot upload samples from this directory {}".format(directory),
            directory)

    dir_list = next(os.walk(directory))[1]  # Gets the list of directories in the directory
    full_dir_list = []
    for d in dir_list:
        full_dir_list.append(os.path.join(directory, d))
    return full_dir_list


def build_sequencing_run_from_samples(sample_list, metadata):
    """
    Create a SequencingRun object with full project/sample/sequence_file structure

    :param sample_list: List of Sample objects
    :param metadata: metadata dict to add to the run
    :return: SequencingRun
    """

    logging.debug("Building SequencingRun from parsed data")

    # create a list of projects and add samples to the appropriate project
    project_list = []
    for sample in sample_list:
        project = None
        for p in project_list:
            if sample.get('sample_project') == p.id:
                project = p
        if project is None:
            project = model.Project(id=sample.get('sample_project'))
            project_list.append(project)

        project.add_sample(sample)

    sequence_run = model.SequencingRun(metadata, project_list)
    logging.debug("SequencingRun built")
    return sequence_run


def get_file_list(directory):
    """
    Get the list of file names in the data directory

    :param directory: directory to search for files
    :return: list of file names in the data directory
    """
    # verify that the directory exists
    if not os.path.exists(directory):
        raise exceptions.DirectoryError("Could not list files, as directory does not exist.", directory)
    # Create a file list of the directory, only hitting the os once
    file_list = next(os.walk(directory))[2]
    return file_list
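
For anyone building a new parser on top of these helpers, typical usage looks like the sketch below (hypothetical local paths; as the module docstring notes, these helpers rely on `os` and are not for cloud storage):

```python
from iridauploader.parsers import common

# parse a sample sheet line by line
csv_reader = common.get_csv_reader("/sequencing/run_1/SampleSheet.csv")
for row in csv_reader:
    print(row)

# enumerate candidate run directories, then the files inside one of them
run_directories = common.find_directory_list("/sequencing")
file_names = common.get_file_list("/sequencing/run_1")
```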