Merge pull request #3 from apcamargo/new-cli

New CLI format
apcamargo · Sep 22, 2019 · 4bb892e
2 parents 62bf335 + affb99b
commit 4bb892e

Showing 8 changed files with 565 additions and 253 deletions.
diff --git a/README.md b/README.md
@@ -3,10 +3,10 @@
 - [Overview](#overview)
 - [Documentation](#documentation)
 - [Installation](#installation)
-- [Download the pre-trained model](#download-the-pre-trained-model)
+- [Download the pre-trained models](#download-the-pre-trained-models)
 - [Usage](#usage)
-  - [`rnasamba-train`](#rnasamba-train)
-  - [`rnasamba-classify`](#rnasamba-classify)
+  - [`rnasamba train`](#rnasamba-train)
+  - [`rnasamba classify`](#rnasamba-classify)
 - [Examples](#examples)
 - [Citation](#citation)
 
@@ -55,14 +55,14 @@ In case you want to train your own model, you can follow the steps shown in the
 
 ## Usage
 
-RNAsamba provides two commands: `rnasamba-train` and `rnasamba-classify`.
+RNAsamba provides two commands: `rnasamba train` and `rnasamba classify`.
 
-### `rnasamba-train`
+### `rnasamba train`
 
-`rnasamba-train` is the command for training a new classification model from a training dataset and saving the network weights into an HDF5 file. The user can specify the batch size (`--batch_size`) and the number of training epochs (`--epochs`). The user can also choose to activate early stopping (`--early_stopping`), which reduces training time and can help avoiding overfitting.
+`rnasamba train` is the command for training a new classification model from a training dataset and saving the network weights into an HDF5 file. The user can specify the batch size (`--batch_size`) and the number of training epochs (`--epochs`). The user can also choose to activate early stopping (`--early_stopping`), which reduces training time and can help avoiding overfitting.
 
 ```
-usage: rnasamba-train [-h] [-s EARLY_STOPPING] [-b BATCH_SIZE] [-e EPOCHS]
+usage: rnasamba train [-h] [-s EARLY_STOPPING] [-b BATCH_SIZE] [-e EPOCHS]
                       [-v {0,1,2,3}]
                       output_file coding_file noncoding_file
 
@@ -93,12 +93,12 @@ optional arguments:
                         epoch. (default: 0)
 ```
 
-### `rnasamba-classify`
+### `rnasamba classify`
 
-`rnasamba-classify` is the command for computing the coding potential of transcripts contained in an input FASTA file and classifying them into coding or non-coding. Optionally, the user can specify an output FASTA file (`--protein_fasta`) in which RNAsamba will write the translated sequences of the predicted coding ORFs. If multiple weight files are provided, RNAsamba will ensemble their predictions into a single output.
+`rnasamba classify` is the command for computing the coding potential of transcripts contained in an input FASTA file and classifying them into coding or non-coding. Optionally, the user can specify an output FASTA file (`--protein_fasta`) in which RNAsamba will write the translated sequences of the predicted coding ORFs. If multiple weight files are provided, RNAsamba will ensemble their predictions into a single output.
 
 ```
-usage: rnasamba-classify [-h] [-p PROTEIN_FASTA] [-v {0,1}]
+usage: rnasamba classify [-h] [-p PROTEIN_FASTA] [-v {0,1}]
                          output_file fasta_file weights [weights ...]
 
 Classify sequences from a input FASTA file.
@@ -126,13 +126,13 @@ optional arguments:
 - Training a new classification model using *Mus musculus* data downloaded from GENCODE:
 
 ```
-rnasamba-train mouse_model.hdf5 -v 2 gencode.vM21.pc_transcripts.fa gencode.vM21.lncRNA_transcripts.fa
+rnasamba train mouse_model.hdf5 -v 2 gencode.vM21.pc_transcripts.fa gencode.vM21.lncRNA_transcripts.fa
 ```
 
 - Classifying sequences using our pre-trained model (`full_length_weights.hdf5`) and saving the predicted proteins into a FASTA file:
 
 ```
-rnasamba-classify -p predicted_proteins.fa classification.tsv input.fa full_length_weights.hdf5
+rnasamba classify -p predicted_proteins.fa classification.tsv input.fa full_length_weights.hdf5
 head classification.tsv
 
 sequence_name	coding_score	classification

diff --git a/docs/usage.md b/docs/usage.md
@@ -19,15 +19,15 @@ curl -O https://raw.githubusercontent.com/apcamargo/RNAsamba/master/data/partial
 Both models achieves high classification performance in transcripts from a variety of different species (see our [article](https://www.biorxiv.org/content/10.1101/620880v1)).
 
 !!! warning ""
-    In case you want to train your own model, you should follow the steps described in the [`rnasamba-train`](#rnasamba-train) section.
+    In case you want to train your own model, you should follow the steps described in the [`rnasamba train`](#rnasamba-train) section.
 
 
-## `rnasamba-train`
+## `rnasamba train`
 
-`rnasamba-train` is the command for training a new classification model from a training dataset and saving the network weights into an HDF5 file. The user can specify the batch size (`--batch_size`) and the number of training epochs (`--epochs`). The user can also choose to activate early stopping (`--early_stopping`), which reduces training time and can help avoiding overfitting.
+`rnasamba train` is the command for training a new classification model from a training dataset and saving the network weights into an HDF5 file. The user can specify the batch size (`--batch_size`) and the number of training epochs (`--epochs`). The user can also choose to activate early stopping (`--early_stopping`), which reduces training time and can help avoiding overfitting.
 
 ```
-usage: rnasamba-train [-h] [-s EARLY_STOPPING] [-b BATCH_SIZE] [-e EPOCHS]
+usage: rnasamba train [-h] [-s EARLY_STOPPING] [-b BATCH_SIZE] [-e EPOCHS]
                       [-v {0,1,2,3}]
                       output_file coding_file noncoding_file
 
@@ -58,12 +58,12 @@ optional arguments:
                         epoch. (default: 0)
 ```
 
-## `rnasamba-classify`
+## `rnasamba classify`
 
-`rnasamba-classify` is the command for computing the coding potential of transcripts contained in an input FASTA file and classifying them into coding or non-coding. Optionally, the user can specify an output FASTA file (`--protein_fasta`) in which RNAsamba will write the translated sequences of the predicted coding ORFs. If multiple weight files are provided, RNAsamba will ensemble their predictions into a single output.
+`rnasamba classify` is the command for computing the coding potential of transcripts contained in an input FASTA file and classifying them into coding or non-coding. Optionally, the user can specify an output FASTA file (`--protein_fasta`) in which RNAsamba will write the translated sequences of the predicted coding ORFs. If multiple weight files are provided, RNAsamba will ensemble their predictions into a single output.
 
 ```
-usage: rnasamba-classify [-h] [-p PROTEIN_FASTA] [-v {0,1}]
+usage: rnasamba classify [-h] [-p PROTEIN_FASTA] [-v {0,1}]
                          output_file fasta_file weights [weights ...]
 
 Classify sequences from a input FASTA file.
@@ -91,14 +91,14 @@ optional arguments:
 - Training a new classification model using *Mus musculus* data downloaded from GENCODE:
 
 ```
-rnasamba-train mouse_model.hdf5 -v 2 gencode.vM21.pc_transcripts.fa gencode.vM21.lncRNA_transcripts.fa
+rnasamba train mouse_model.hdf5 -v 2 gencode.vM21.pc_transcripts.fa gencode.vM21.lncRNA_transcripts.fa
 
 ```
 
 - Classifying sequences using our pre-trained model (`full_length_weights.hdf5`) and saving the predicted proteins into a FASTA file:
 
 ```
-rnasamba-classify -p predicted_proteins.fa classification.tsv input.fa full_length_weights.hdf5
+rnasamba classify -p predicted_proteins.fa classification.tsv input.fa full_length_weights.hdf5
 ```
 
 ```

diff --git a/rnasamba/cli.py b/rnasamba/cli.py
@@ -23,59 +23,133 @@
 
 from rnasamba import RNAsambaClassificationModel, RNAsambaTrainModel
 
-def classify(output_file, fasta_file, weights, protein_fasta, verbose):
+
+def classify(args):
     """Classify sequences from a input FASTA file."""
-    classification = RNAsambaClassificationModel(fasta_file, weights, verbose=verbose)
-    classification.write_classification_output(output_file)
-    if protein_fasta:
-        classification.output_protein_fasta(protein_fasta)
+    classification = RNAsambaClassificationModel(
+        args.fasta_file, args.weights, verbose=args.verbose
+    )
+    classification.write_classification_output(args.output_file)
+    if args.protein_fasta:
+        classification.output_protein_fasta(args.protein_fasta)
+
 
-def train(output_file, coding_file, noncoding_file, early_stopping, batch_size, epochs, verbose):
+def train(args):
     """Train a classification model from training data and saves the weights into a HDF5 file."""
-    trained = RNAsambaTrainModel(coding_file, noncoding_file, early_stopping=early_stopping,
-                                 batch_size=batch_size, epochs=epochs, verbose=verbose)
-    trained.model.save_weights(output_file)
+    trained = RNAsambaTrainModel(
+        args.coding_file,
+        args.noncoding_file,
+        early_stopping=args.early_stopping,
+        batch_size=args.batch_size,
+        epochs=args.epochs,
+        verbose=args.verbose,
+    )
+    trained.model.save_weights(args.output_file)
+
+
+def classify_cli(parser):
+    parser.set_defaults(func=classify)
+    parser.add_argument(
+        'output_file',
+        help='output TSV file containing the results of the classification.',
+    )
+    parser.add_argument(
+        'fasta_file', help='input FASTA file containing transcript sequences.'
+    )
+    parser.add_argument(
+        'weights',
+        nargs='+',
+        help='input HDF5 file(s) containing weights of a trained RNAsamba network (if more than a file is provided, an ensembling of the models will be performed).',
+    )
+    parser.add_argument(
+        '-p',
+        '--protein_fasta',
+        help='output FASTA file containing translated sequences for the predicted coding ORFs.',
+    )
+    parser.add_argument(
+        '-v',
+        '--verbose',
+        default=0,
+        type=int,
+        choices=[0, 1],
+        help='print the progress of the classification. 0 = silent, 1 = current step.',
+    )
+
+
+def train_cli(parser):
+    parser.set_defaults(func=train)
+    parser.add_argument(
+        'output_file',
+        help='output HDF5 file containing weights of the newly trained RNAsamba network.',
+    )
+    parser.add_argument(
+        'coding_file',
+        help='input FASTA file containing sequences of protein-coding transcripts.',
+    )
+    parser.add_argument(
+        'noncoding_file',
+        help='input FASTA file containing sequences of noncoding transcripts.',
+    )
+    parser.add_argument(
+        '-s',
+        '--early_stopping',
+        default=0,
+        type=int,
+        help='number of epochs after lowest validation loss before stopping training (a fraction of 0.1 of the training set is set apart for validation and the model with the lowest validation loss will be saved).',
+    )
+    parser.add_argument(
+        '-b',
+        '--batch_size',
+        default=128,
+        type=int,
+        help='number of samples per gradient update.',
+    )
+    parser.add_argument(
+        '-e',
+        '--epochs',
+        default=40,
+        type=int,
+        help='number of epochs to train the model.',
+    )
+    parser.add_argument(
+        '-v',
+        '--verbose',
+        default=0,
+        type=int,
+        choices=[0, 1, 2, 3],
+        help='print the progress of the training. 0 = silent, 1 = current step, 2 = progress bar, 3 = one line per epoch.',
+    )
 
-def classify_cli():
-    parser = argparse.ArgumentParser(description='Classify sequences from a input FASTA file.',
-                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-    parser.add_argument('output_file',
-                        help='output TSV file containing the results of the classification.')
-    parser.add_argument('fasta_file',
-                        help='input FASTA file containing transcript sequences.')
-    parser.add_argument('weights',
-                        nargs='+', help='input HDF5 file(s) containing weights of a trained RNAsamba network (if more than a file is provided, an ensembling of the models will be performed).')
-    parser.add_argument('-p', '--protein_fasta',
-                        help='output FASTA file containing translated sequences for the predicted coding ORFs.')
-    parser.add_argument('-v', '--verbose',
-                        default=0, type=int, choices=[0, 1],
-                        help='print the progress of the classification. 0 = silent, 1 = current step.')
-    if len(sys.argv) < 2:
-        parser.print_help()
-        sys.exit(0)
-    args = parser.parse_args()
-    classify(**vars(args))
 
-def train_cli():
-    parser = argparse.ArgumentParser(description='Train a new classification model.',
-                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-    parser.add_argument('output_file',
-                        help='output HDF5 file containing weights of the newly trained RNAsamba network.')
-    parser.add_argument('coding_file',
-                        help='input FASTA file containing sequences of protein-coding transcripts.')
-    parser.add_argument('noncoding_file',
-                        help='input FASTA file containing sequences of noncoding transcripts.')
-    parser.add_argument('-s', '--early_stopping',
-                        default=0, type=int, help='number of epochs after lowest validation loss before stopping training (a fraction of 0.1 of the training set is set apart for validation and the model with the lowest validation loss will be saved).')
-    parser.add_argument('-b', '--batch_size',
-                        default=128, type=int, help='number of samples per gradient update.')
-    parser.add_argument('-e', '--epochs',
-                        default=40, type=int, help='number of epochs to train the model.')
-    parser.add_argument('-v', '--verbose',
-                        default=0, type=int, choices=[0, 1, 2, 3],
-                        help='print the progress of the training. 0 = silent, 1 = current step, 2 = progress bar, 3 = one line per epoch.')
-    if len(sys.argv) < 2:
+def cli():
+    parser = argparse.ArgumentParser(
+        description='Coding potential calculation using deep learning.',
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    subparsers = parser.add_subparsers()
+    classify_parser = subparsers.add_parser(
+        'classify',
+        help='classify sequences from a input FASTA file.',
+        description='Classify sequences from a input FASTA file.',
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    classify_cli(classify_parser)
+    train_parser = subparsers.add_parser(
+        'train',
+        help='train a new classification model.',
+        description='Train a new classification model.',
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    train_cli(train_parser)
+    if len(sys.argv) == 1:
         parser.print_help()
         sys.exit(0)
+    elif len(sys.argv) == 2:
+        if sys.argv[1] == 'classify':
+            classify_parser.print_help()
+            sys.exit(0)
+        elif sys.argv[1] == 'train':
+            train_parser.print_help()
+            sys.exit(0)
     args = parser.parse_args()
-    train(**vars(args))
+    args.func(args)
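
Note on the `rnasamba/cli.py` change above: the two separate entry points are merged into one parser with `classify` and `train` subcommands, and each subparser registers its handler through `set_defaults(func=...)` so that `args.func(args)` dispatches to the right function. The following is a minimal sketch of that dispatch pattern, not the project's code. The handlers are stand-ins that only return their arguments, the optional flags are omitted, and `dest='command'` with `required=True` (available from Python 3.7) is an assumption added here for illustration; the diff instead checks `len(sys.argv)` and prints help by hand.

```python
import argparse


def classify(args):
    # Stand-in handler: the real one builds RNAsambaClassificationModel.
    return ('classify', args.output_file, args.fasta_file, args.weights)


def train(args):
    # Stand-in handler: the real one builds RNAsambaTrainModel.
    return ('train', args.output_file, args.coding_file, args.noncoding_file)


def build_parser():
    parser = argparse.ArgumentParser(prog='rnasamba')
    # Assumption: required=True makes argparse report a missing subcommand
    # itself; the diff above handles that case with sys.argv checks instead.
    subparsers = parser.add_subparsers(dest='command', required=True)

    classify_parser = subparsers.add_parser('classify')
    classify_parser.set_defaults(func=classify)
    classify_parser.add_argument('output_file')
    classify_parser.add_argument('fasta_file')
    classify_parser.add_argument('weights', nargs='+')

    train_parser = subparsers.add_parser('train')
    train_parser.set_defaults(func=train)
    train_parser.add_argument('output_file')
    train_parser.add_argument('coding_file')
    train_parser.add_argument('noncoding_file')
    return parser


def main(argv=None):
    # argv=None makes argparse fall back to sys.argv[1:].
    args = build_parser().parse_args(argv)
    # Each subparser stored its handler in args.func via set_defaults.
    return args.func(args)
```

Calling `main(['classify', 'out.tsv', 'in.fa', 'a.hdf5', 'b.hdf5'])` routes to the `classify` stand-in with both weight files collected into a list, which mirrors how `nargs='+'` feeds the ensembling case described in the README.
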
diff --git a/rnasamba/core/inputs.py b/rnasamba/core/inputs.py
@@ -28,14 +28,32 @@ def __init__(self, fasta_file, maxlen=3000):
         self._tokenized_sequences = sequences.read_fasta(fasta_file, tokenize=True)
         self._nucleotide_sequences = sequences.read_fasta(fasta_file, tokenize=False)
         self._aa_dict = {
-            'A': 4, 'C': 18, 'D': 12, 'E': 3, 'F': 14, 'G': 5, 'H': 16, 'I': 13, 'K': 9, 'L': 1,
-            'M': 19, 'N': 15, 'P': 6, 'Q': 11, 'R': 8, 'S': 2, 'T': 10, 'V': 7, 'W': 20, 'X': 21,
-            'Y': 17
+            'A': 4,
+            'C': 18,
+            'D': 12,
+            'E': 3,
+            'F': 14,
+            'G': 5,
+            'H': 16,
+            'I': 13,
+            'K': 9,
+            'L': 1,
+            'M': 19,
+            'N': 15,
+            'P': 6,
+            'Q': 11,
+            'R': 8,
+            'S': 2,
+            'T': 10,
+            'V': 7,
+            'W': 20,
+            'X': 21,
+            'Y': 17,
         }
         self._orfs = self.get_orfs()
         self.protein_seqs = [orf[2] for orf in self._orfs]
         self.maxlen = maxlen
-        self.protein_maxlen = int(maxlen/3)
+        self.protein_maxlen = int(maxlen / 3)
         self.nucleotide_input = self.get_nucleotide_input()
         self.kmer_frequency_input = self.get_kmer_frequency_input()
         self.orf_indicator_input = self.get_orf_indicator_input()
@@ -49,7 +67,9 @@ def get_orfs(self):
 
     def get_nucleotide_input(self):
         nucleotide_input = [i[0] for i in self._tokenized_sequences]
-        nucleotide_input = pad_sequences(nucleotide_input, padding='post', maxlen=self.maxlen)
+        nucleotide_input = pad_sequences(
+            nucleotide_input, padding='post', maxlen=self.maxlen
+        )
         return nucleotide_input
 
     def get_kmer_frequency_input(self):
@@ -65,7 +85,9 @@ def get_protein_input(self):
         for protein_seq in self.protein_seqs:
             protein_numeric = [self._aa_dict[aa] for aa in protein_seq]
             protein_input.append(protein_numeric)
-        protein_input = pad_sequences(protein_input, padding='post', maxlen=self.protein_maxlen)
+        protein_input = pad_sequences(
+            protein_input, padding='post', maxlen=self.protein_maxlen
+        )
         return protein_input
 
     def get_aa_frequency_input(self):
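
Note on the `rnasamba/core/inputs.py` hunks above: they only reformat the code, but the encoding they touch may be easier to follow with a small example. The sketch below maps each amino acid to its integer from `_aa_dict` and right-pads to `int(maxlen / 3)`. It is an illustration under assumptions: `pad_post` and `encode_proteins` are hypothetical helpers written in plain Python rather than the Keras `pad_sequences` call used in the diff, and the front truncation imitates that function's default `truncating='pre'` behaviour.

```python
# Mapping copied from the diff above; 0 is left free for padding.
AA_DICT = {
    'A': 4, 'C': 18, 'D': 12, 'E': 3, 'F': 14, 'G': 5, 'H': 16,
    'I': 13, 'K': 9, 'L': 1, 'M': 19, 'N': 15, 'P': 6, 'Q': 11,
    'R': 8, 'S': 2, 'T': 10, 'V': 7, 'W': 20, 'X': 21, 'Y': 17,
}


def pad_post(seqs, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences(..., padding='post').

    Sequences longer than maxlen keep their last maxlen items,
    shorter ones are filled with `value` on the right.
    """
    padded = []
    for seq in seqs:
        trimmed = seq[-maxlen:]
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded


def encode_proteins(protein_seqs, maxlen=3000):
    # One codon yields one amino acid, hence a third of the nucleotide length.
    protein_maxlen = int(maxlen / 3)
    numeric = [[AA_DICT[aa] for aa in seq] for seq in protein_seqs]
    return pad_post(numeric, protein_maxlen)
```

For example, `encode_proteins(['MKL'], maxlen=15)` pads the three encoded residues out to a length of 5.
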