Merge pull request #3 from apcamargo/new-cli
New CLI format
apcamargo authored Sep 22, 2019
2 parents 62bf335 + affb99b commit 4bb892e
Showing 8 changed files with 565 additions and 253 deletions.
24 changes: 12 additions & 12 deletions README.md
@@ -3,10 +3,10 @@
- [Overview](#overview)
- [Documentation](#documentation)
- [Installation](#installation)
- [Download the pre-trained model](#download-the-pre-trained-model)
- [Download the pre-trained models](#download-the-pre-trained-models)
- [Usage](#usage)
- [`rnasamba-train`](#rnasamba-train)
- [`rnasamba-classify`](#rnasamba-classify)
- [`rnasamba train`](#rnasamba-train)
- [`rnasamba classify`](#rnasamba-classify)
- [Examples](#examples)
- [Citation](#citation)

@@ -55,14 +55,14 @@ In case you want to train your own model, you can follow the steps shown in the

## Usage

RNAsamba provides two commands: `rnasamba-train` and `rnasamba-classify`.
RNAsamba provides two commands: `rnasamba train` and `rnasamba classify`.

### `rnasamba-train`
### `rnasamba train`

`rnasamba-train` is the command for training a new classification model from a training dataset and saving the network weights into an HDF5 file. The user can specify the batch size (`--batch_size`) and the number of training epochs (`--epochs`). The user can also choose to activate early stopping (`--early_stopping`), which reduces training time and can help avoid overfitting.
`rnasamba train` is the command for training a new classification model from a training dataset and saving the network weights into an HDF5 file. The user can specify the batch size (`--batch_size`) and the number of training epochs (`--epochs`). The user can also choose to activate early stopping (`--early_stopping`), which reduces training time and can help avoid overfitting.

```
usage: rnasamba-train [-h] [-s EARLY_STOPPING] [-b BATCH_SIZE] [-e EPOCHS]
usage: rnasamba train [-h] [-s EARLY_STOPPING] [-b BATCH_SIZE] [-e EPOCHS]
[-v {0,1,2,3}]
output_file coding_file noncoding_file
@@ -93,12 +93,12 @@ optional arguments:
epoch. (default: 0)
```

### `rnasamba-classify`
### `rnasamba classify`

`rnasamba-classify` is the command for computing the coding potential of transcripts contained in an input FASTA file and classifying them into coding or non-coding. Optionally, the user can specify an output FASTA file (`--protein_fasta`) in which RNAsamba will write the translated sequences of the predicted coding ORFs. If multiple weight files are provided, RNAsamba will ensemble their predictions into a single output.
`rnasamba classify` is the command for computing the coding potential of transcripts contained in an input FASTA file and classifying them into coding or non-coding. Optionally, the user can specify an output FASTA file (`--protein_fasta`) in which RNAsamba will write the translated sequences of the predicted coding ORFs. If multiple weight files are provided, RNAsamba will ensemble their predictions into a single output.

```
usage: rnasamba-classify [-h] [-p PROTEIN_FASTA] [-v {0,1}]
usage: rnasamba classify [-h] [-p PROTEIN_FASTA] [-v {0,1}]
output_file fasta_file weights [weights ...]
Classify sequences from a input FASTA file.
@@ -126,13 +126,13 @@ optional arguments:
- Training a new classification model using *Mus musculus* data downloaded from GENCODE:

```
rnasamba-train mouse_model.hdf5 -v 2 gencode.vM21.pc_transcripts.fa gencode.vM21.lncRNA_transcripts.fa
rnasamba train mouse_model.hdf5 -v 2 gencode.vM21.pc_transcripts.fa gencode.vM21.lncRNA_transcripts.fa
```

- Classifying sequences using our pre-trained model (`full_length_weights.hdf5`) and saving the predicted proteins into a FASTA file:

```
rnasamba-classify -p predicted_proteins.fa classification.tsv input.fa full_length_weights.hdf5
rnasamba classify -p predicted_proteins.fa classification.tsv input.fa full_length_weights.hdf5
head classification.tsv
sequence_name coding_score classification
18 changes: 9 additions & 9 deletions docs/usage.md
@@ -19,15 +19,15 @@ curl -O https://raw.githubusercontent.com/apcamargo/RNAsamba/master/data/partial
Both models achieve high classification performance in transcripts from a variety of species (see our [article](https://www.biorxiv.org/content/10.1101/620880v1)).

!!! warning ""
In case you want to train your own model, you should follow the steps described in the [`rnasamba-train`](#rnasamba-train) section.
In case you want to train your own model, you should follow the steps described in the [`rnasamba train`](#rnasamba-train) section.


## `rnasamba-train`
## `rnasamba train`

`rnasamba-train` is the command for training a new classification model from a training dataset and saving the network weights into an HDF5 file. The user can specify the batch size (`--batch_size`) and the number of training epochs (`--epochs`). The user can also choose to activate early stopping (`--early_stopping`), which reduces training time and can help avoid overfitting.
`rnasamba train` is the command for training a new classification model from a training dataset and saving the network weights into an HDF5 file. The user can specify the batch size (`--batch_size`) and the number of training epochs (`--epochs`). The user can also choose to activate early stopping (`--early_stopping`), which reduces training time and can help avoid overfitting.

```
usage: rnasamba-train [-h] [-s EARLY_STOPPING] [-b BATCH_SIZE] [-e EPOCHS]
usage: rnasamba train [-h] [-s EARLY_STOPPING] [-b BATCH_SIZE] [-e EPOCHS]
[-v {0,1,2,3}]
output_file coding_file noncoding_file
@@ -58,12 +58,12 @@ optional arguments:
epoch. (default: 0)
```

## `rnasamba-classify`
## `rnasamba classify`

`rnasamba-classify` is the command for computing the coding potential of transcripts contained in an input FASTA file and classifying them into coding or non-coding. Optionally, the user can specify an output FASTA file (`--protein_fasta`) in which RNAsamba will write the translated sequences of the predicted coding ORFs. If multiple weight files are provided, RNAsamba will ensemble their predictions into a single output.
`rnasamba classify` is the command for computing the coding potential of transcripts contained in an input FASTA file and classifying them into coding or non-coding. Optionally, the user can specify an output FASTA file (`--protein_fasta`) in which RNAsamba will write the translated sequences of the predicted coding ORFs. If multiple weight files are provided, RNAsamba will ensemble their predictions into a single output.

```
usage: rnasamba-classify [-h] [-p PROTEIN_FASTA] [-v {0,1}]
usage: rnasamba classify [-h] [-p PROTEIN_FASTA] [-v {0,1}]
output_file fasta_file weights [weights ...]
Classify sequences from a input FASTA file.
@@ -91,14 +91,14 @@ optional arguments:
- Training a new classification model using *Mus musculus* data downloaded from GENCODE:

```
rnasamba-train mouse_model.hdf5 -v 2 gencode.vM21.pc_transcripts.fa gencode.vM21.lncRNA_transcripts.fa
rnasamba train mouse_model.hdf5 -v 2 gencode.vM21.pc_transcripts.fa gencode.vM21.lncRNA_transcripts.fa
```

- Classifying sequences using our pre-trained model (`full_length_weights.hdf5`) and saving the predicted proteins into a FASTA file:

```
rnasamba-classify -p predicted_proteins.fa classification.tsv input.fa full_length_weights.hdf5
rnasamba classify -p predicted_proteins.fa classification.tsv input.fa full_length_weights.hdf5
```

```
170 changes: 122 additions & 48 deletions rnasamba/cli.py
@@ -23,59 +23,133 @@

from rnasamba import RNAsambaClassificationModel, RNAsambaTrainModel

def classify(output_file, fasta_file, weights, protein_fasta, verbose):

def classify(args):
"""Classify sequences from a input FASTA file."""
classification = RNAsambaClassificationModel(fasta_file, weights, verbose=verbose)
classification.write_classification_output(output_file)
if protein_fasta:
classification.output_protein_fasta(protein_fasta)
classification = RNAsambaClassificationModel(
args.fasta_file, args.weights, verbose=args.verbose
)
classification.write_classification_output(args.output_file)
if args.protein_fasta:
classification.output_protein_fasta(args.protein_fasta)


def train(output_file, coding_file, noncoding_file, early_stopping, batch_size, epochs, verbose):
def train(args):
"""Train a classification model from training data and saves the weights into a HDF5 file."""
trained = RNAsambaTrainModel(coding_file, noncoding_file, early_stopping=early_stopping,
batch_size=batch_size, epochs=epochs, verbose=verbose)
trained.model.save_weights(output_file)
trained = RNAsambaTrainModel(
args.coding_file,
args.noncoding_file,
early_stopping=args.early_stopping,
batch_size=args.batch_size,
epochs=args.epochs,
verbose=args.verbose,
)
trained.model.save_weights(args.output_file)


def classify_cli(parser):
parser.set_defaults(func=classify)
parser.add_argument(
'output_file',
help='output TSV file containing the results of the classification.',
)
parser.add_argument(
'fasta_file', help='input FASTA file containing transcript sequences.'
)
parser.add_argument(
'weights',
nargs='+',
help='input HDF5 file(s) containing weights of a trained RNAsamba network (if more than a file is provided, an ensembling of the models will be performed).',
)
parser.add_argument(
'-p',
'--protein_fasta',
help='output FASTA file containing translated sequences for the predicted coding ORFs.',
)
parser.add_argument(
'-v',
'--verbose',
default=0,
type=int,
choices=[0, 1],
help='print the progress of the classification. 0 = silent, 1 = current step.',
)


def train_cli(parser):
parser.set_defaults(func=train)
parser.add_argument(
'output_file',
help='output HDF5 file containing weights of the newly trained RNAsamba network.',
)
parser.add_argument(
'coding_file',
help='input FASTA file containing sequences of protein-coding transcripts.',
)
parser.add_argument(
'noncoding_file',
help='input FASTA file containing sequences of noncoding transcripts.',
)
parser.add_argument(
'-s',
'--early_stopping',
default=0,
type=int,
help='number of epochs after lowest validation loss before stopping training (a fraction of 0.1 of the training set is set apart for validation and the model with the lowest validation loss will be saved).',
)
parser.add_argument(
'-b',
'--batch_size',
default=128,
type=int,
help='number of samples per gradient update.',
)
parser.add_argument(
'-e',
'--epochs',
default=40,
type=int,
help='number of epochs to train the model.',
)
parser.add_argument(
'-v',
'--verbose',
default=0,
type=int,
choices=[0, 1, 2, 3],
help='print the progress of the training. 0 = silent, 1 = current step, 2 = progress bar, 3 = one line per epoch.',
)

def classify_cli():
parser = argparse.ArgumentParser(description='Classify sequences from a input FASTA file.',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('output_file',
help='output TSV file containing the results of the classification.')
parser.add_argument('fasta_file',
help='input FASTA file containing transcript sequences.')
parser.add_argument('weights',
nargs='+', help='input HDF5 file(s) containing weights of a trained RNAsamba network (if more than a file is provided, an ensembling of the models will be performed).')
parser.add_argument('-p', '--protein_fasta',
help='output FASTA file containing translated sequences for the predicted coding ORFs.')
parser.add_argument('-v', '--verbose',
default=0, type=int, choices=[0, 1],
help='print the progress of the classification. 0 = silent, 1 = current step.')
if len(sys.argv) < 2:
parser.print_help()
sys.exit(0)
args = parser.parse_args()
classify(**vars(args))

def train_cli():
parser = argparse.ArgumentParser(description='Train a new classification model.',
formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('output_file',
help='output HDF5 file containing weights of the newly trained RNAsamba network.')
parser.add_argument('coding_file',
help='input FASTA file containing sequences of protein-coding transcripts.')
parser.add_argument('noncoding_file',
help='input FASTA file containing sequences of noncoding transcripts.')
parser.add_argument('-s', '--early_stopping',
default=0, type=int, help='number of epochs after lowest validation loss before stopping training (a fraction of 0.1 of the training set is set apart for validation and the model with the lowest validation loss will be saved).')
parser.add_argument('-b', '--batch_size',
default=128, type=int, help='number of samples per gradient update.')
parser.add_argument('-e', '--epochs',
default=40, type=int, help='number of epochs to train the model.')
parser.add_argument('-v', '--verbose',
default=0, type=int, choices=[0, 1, 2, 3],
help='print the progress of the training. 0 = silent, 1 = current step, 2 = progress bar, 3 = one line per epoch.')
if len(sys.argv) < 2:
def cli():
parser = argparse.ArgumentParser(
description='Coding potential calculation using deep learning.',
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
subparsers = parser.add_subparsers()
classify_parser = subparsers.add_parser(
'classify',
help='classify sequences from a input FASTA file.',
description='Classify sequences from a input FASTA file.',
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
classify_cli(classify_parser)
train_parser = subparsers.add_parser(
'train',
help='train a new classification model.',
description='Train a new classification model.',
formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
train_cli(train_parser)
if len(sys.argv) == 1:
parser.print_help()
sys.exit(0)
elif len(sys.argv) == 2:
if sys.argv[1] == 'classify':
classify_parser.print_help()
sys.exit(0)
elif sys.argv[1] == 'train':
train_parser.print_help()
sys.exit(0)
args = parser.parse_args()
train(**vars(args))
args.func(args)
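
The new `cli()` entry point in the diff above routes both subcommands through `set_defaults(func=...)` and a final `args.func(args)` call. A minimal, self-contained sketch of that dispatch pattern follows. The `demo` program name, the reduced argument lists, and the string-returning handlers are illustrative assumptions, not RNAsamba's actual behaviour.

```python
import argparse


def classify(args):
    # Hypothetical stand-in for the real handler: report what would be done.
    return f"classify {args.fasta_file} -> {args.output_file}"


def train(args):
    # Hypothetical stand-in for the real handler.
    return f"train for {args.epochs} epochs -> {args.output_file}"


def build_parser():
    parser = argparse.ArgumentParser(prog="demo")
    subparsers = parser.add_subparsers()

    # Each subparser stores its handler under the `func` attribute.
    classify_parser = subparsers.add_parser("classify")
    classify_parser.set_defaults(func=classify)
    classify_parser.add_argument("output_file")
    classify_parser.add_argument("fasta_file")

    train_parser = subparsers.add_parser("train")
    train_parser.set_defaults(func=train)
    train_parser.add_argument("output_file")
    train_parser.add_argument("-e", "--epochs", default=40, type=int)
    return parser


def run(argv):
    # Parse an explicit argument list and call whichever handler was selected.
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Calling `run(["train", "weights.hdf5", "-e", "5"])` selects the `train` handler. Because each handler receives the whole namespace, adding a subcommand needs only a new handler and a new subparser.
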
34 changes: 28 additions & 6 deletions rnasamba/core/inputs.py
@@ -28,14 +28,32 @@ def __init__(self, fasta_file, maxlen=3000):
self._tokenized_sequences = sequences.read_fasta(fasta_file, tokenize=True)
self._nucleotide_sequences = sequences.read_fasta(fasta_file, tokenize=False)
self._aa_dict = {
'A': 4, 'C': 18, 'D': 12, 'E': 3, 'F': 14, 'G': 5, 'H': 16, 'I': 13, 'K': 9, 'L': 1,
'M': 19, 'N': 15, 'P': 6, 'Q': 11, 'R': 8, 'S': 2, 'T': 10, 'V': 7, 'W': 20, 'X': 21,
'Y': 17
'A': 4,
'C': 18,
'D': 12,
'E': 3,
'F': 14,
'G': 5,
'H': 16,
'I': 13,
'K': 9,
'L': 1,
'M': 19,
'N': 15,
'P': 6,
'Q': 11,
'R': 8,
'S': 2,
'T': 10,
'V': 7,
'W': 20,
'X': 21,
'Y': 17,
}
self._orfs = self.get_orfs()
self.protein_seqs = [orf[2] for orf in self._orfs]
self.maxlen = maxlen
self.protein_maxlen = int(maxlen/3)
self.protein_maxlen = int(maxlen / 3)
self.nucleotide_input = self.get_nucleotide_input()
self.kmer_frequency_input = self.get_kmer_frequency_input()
self.orf_indicator_input = self.get_orf_indicator_input()
@@ -49,7 +67,9 @@ def get_orfs(self):

def get_nucleotide_input(self):
nucleotide_input = [i[0] for i in self._tokenized_sequences]
nucleotide_input = pad_sequences(nucleotide_input, padding='post', maxlen=self.maxlen)
nucleotide_input = pad_sequences(
nucleotide_input, padding='post', maxlen=self.maxlen
)
return nucleotide_input

def get_kmer_frequency_input(self):
@@ -65,7 +85,9 @@ def get_protein_input(self):
for protein_seq in self.protein_seqs:
protein_numeric = [self._aa_dict[aa] for aa in protein_seq]
protein_input.append(protein_numeric)
protein_input = pad_sequences(protein_input, padding='post', maxlen=self.protein_maxlen)
protein_input = pad_sequences(
protein_input, padding='post', maxlen=self.protein_maxlen
)
return protein_input

def get_aa_frequency_input(self):
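
For readers unfamiliar with the encoding step in `get_protein_input`, the sketch below reproduces the idea in plain Python: map each residue to its integer code, then fix the length by right-padding. It assumes that `pad_sequences` is the Keras helper (its import is not visible in this excerpt), that over-long sequences are truncated from the front as that helper does by default, and that 0 is the padding value. `AA_DICT`, `encode_protein` and `pad_post` are hypothetical names, and the mapping is only a subset of the one in the diff.

```python
# Subset of the amino-acid-to-integer mapping shown in the diff.
# Codes start at 1, so 0 is assumed to be free for padding.
AA_DICT = {"L": 1, "S": 2, "E": 3, "A": 4, "G": 5, "M": 19, "X": 21}


def encode_protein(protein_seq, aa_dict=AA_DICT):
    # Translate each residue letter into its integer code.
    return [aa_dict[aa] for aa in protein_seq]


def pad_post(encoded, maxlen, value=0):
    # Keep at most the last `maxlen` items (front truncation),
    # then append `value` until the list has exactly `maxlen` items.
    trimmed = encoded[max(len(encoded) - maxlen, 0):]
    return trimmed + [value] * (maxlen - len(trimmed))
```

With these helpers, `pad_post(encode_protein("MSE"), 5)` yields a list of length 5 whose trailing positions are padding. A sequence longer than `maxlen` loses its leading residues rather than its trailing ones.
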