RNA modification detection with Marathon RT Computational package for inferring positions of RNA modifications in target transcripts using sequencing data from a MarathonRT ModSeq experiment. The current method employs machine learning models trained on human 18S and 28S rRNAs for classification of 2’-O-methylations (Nm) or pseudouridines (ψ), in addition to flag positions consistent with the presence of some base modifications (m1A, m3U/ m1acp3Y, m7G).
This script takes in three text files produced by shapemapper (using the "--output-counted-mutations" flag) after processing sequencing data from a ModSeq experiment. The three files are: profile file (containing mutation rates from both Mg and Mn experimental samples), 'untreated' file (mutation counts from the Mg sample data), and 'modified' file (mutation counts from the Mn sample data). The outputs of this script are four arff files - one for each nucleotide type - primed for input into Weka.
INPUT: write the file path for the indicated input file at the indicated lines if using the .rmd script
profile file: line 24
Mg2+ sample mutation counts file: line 25
Mn2+ sample mutation counts file: line 26
uridines output: lines 225, 227, 233
cytidines output: lines 255, 257, 263
guanosines output: lines 285, 287, 293
adenosines output: lines 315, 217, 323
These files are trained models on 18S and 28S rRNA data for detection of Nm and pseudouridines. They are used in Weka for prediction. After prediction, Weka will output four arff files with the classifications. These are the inputs for the next step.
This script takes in the four outputted arff files from Weka, in addition to the original profile file from shapemapper and outputs a single csv file with position, nucleotide type, and modification status. Nm/ψ predictions are done using ML models, while base modification predictions (m1A, m3U/ m1acp3Y, m7G) are based on pre-determined mutation rate cutoffs.
INPUT: write the file path for the indicated input file at the indicated lines if using the .rmd script
adenosines input: line 23
cytidines input: line 24
guanosines input: line 25
uridines input: line 26
final classification output CSV file: line 57
The following are instructions on how to use the current MRT-ModSeq pipeline with sequencing data from a typical experiment, using both Mg and Mn cDNA samples. Sample files for 5.8S rRNA (libraries built from total RNA from Huh7.5 cells) are provided as an example.
- use raw sequencing reads from MRT-ModSeq sequencing experiment: R1/R2 fastq files from both ‘modified’ (Mn2+) and 'untreated' (Mg2+) experiments. Run the shapemapper program (Smola et al. 2015) using the following command:
shapemapper --nproc 8 --output-counted-mutations --name [job name] --target [alignment reference file (.fasta extension)] --out [output directory name] --modified --R1 [Mn2+ R1 file (.fasta or .fasta.gz)] --R2 [Mn2+ R2 file (.fasta or .fasta.gz)] --untreated --R1 [Mg2+ R1 file (.fasta or .fasta.gz)] --R2 [Mg2+ R2 file (.fasta or .fasta.gz)] --overwrite
- the output of the above shapemapper run is three files as follows: 1) profile, 2) 'untreated' file (Mg2+) and 3) 'modified' file (Mn2+). These constitute the inputs for the amplified prediction script. If using the .R script on command line, download and navigate to pylelab/MRTModSeq directory and run the following (example shown with sample data):
mkdir sample_outputs
cd sample_outputs
Rscript ../Amplified_Prediction_script.R "../sample_ecoli_23s_input_and_output/amplified_predictions_script_INPUTS/1mg2mn_ecoli_23s_ecoli_23s_april23_2021_profile.txt" "../sample_ecoli_23s_input_and_output/amplified_predictions_script_INPUTS/1mg2mn_ecoli_23s_Untreated_ecoli_23s_april23_2021_mutation_counts.txt" "../sample_ecoli_23s_input_and_output/amplified_predictions_script_INPUTS/1mg2mn_ecoli_23s_Modified_ecoli_23s_april23_2021_mutation_counts.txt"
- the output of the amplified prediction script is four arff files that, along with the model files, constitute the inputs for the Weka step.
- Open the Weka GUI and the Explorer Application.
- In the Preprocess tab, input the prediction file for the nucleotide you are analyzing. This will unlock the Classify tab.
- Change the Test Options to "Supplied Test Set" and choose the prediction arff file for the nucleotide you are analyzing.
- In the Classify tab, right click the left sidebar and choose load model.
- Right click the selection on the left sidebar and click "Re-evaluate model on current test set."
- Write click results tab on the left sidebar and click "Visualize classifier errors."
- Save the file in the resultant window. This saves as an arff file.
- Repeat the above steps for each nucleotide type. This will yield four output arff files that constitute the input for the 'output reordering' script. If using command line, run the following (example with sample data):
mkdir ordered_predictions
cd ordered_predictions
Rscript ../output_reordering.R "../sample_ecoli_23s_input_and_output/output_reordering_script_INPUTS/23S_As.arff" "../sample_ecoli_23s_input_and_output/output_reordering_script_INPUTS/23S_Cs.arff" "../sample_ecoli_23s_input_and_output/output_reordering_script_INPUTS/23S_Gs.arff" "../sample_ecoli_23s_input_and_output/output_reordering_script_INPUTS/23S_Us_cost.arff" "../sample_ecoli_23s_input_and_output/amplified_predictions_script_INPUTS/1mg2mn_ecoli_23s_ecoli_23s_april23_2021_profile.txt"
- the output for the ‘output reordering’ script is a single csv file with nucleotide positions and the predicted modification status.