diff --git a/.gitignore b/.gitignore index ad4a1f1..25a5a19 100644 --- a/.gitignore +++ b/.gitignore @@ -174,3 +174,5 @@ poetry.toml pyrightconfig.json # End of https://www.toptal.com/developers/gitignore/api/python + +.vscode diff --git a/README.md b/README.md index d9760e3..e9ce8f9 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,27 @@ -# Simulating single cell RNA library generation (scRNA-seq) +# scRNAsim: Simulating single cell RNA (scRNA-seq) library generation +The projects implements a simulation of single cell RNA sequencing (scRNA-seq), accounting for some common sources noise that complicate the analysis of the resulting data. + +### Setting up the virtual environment + +Create and activate the environment with necessary dependencies with Conda: + +```bash +conda env create -f environment.yml +conda activate scrnasim-toolz +``` + +### Tools + +The tools available in this repo are: +1. Transcript sampler +2. Structure generator +3. Sequence extractor +4. Priming site predictor +5. cDNA generator +6. Fragment selector +7. Read sequencer + +### Description Although all cells in a multicellular organism carry the same genomic information, they differ a lot in their function, due to the fact that they are equipped with distinct toolboxes of molecular functions, implemented by different proteins and RNAs. Thus, being able to detect and measure the abundance of gene products (RNAs and/or proteins) in individual cells holds the key to understanding how organisms are organized and function. In the past decade, much progress has been made in the development of technologies for single cell RNA sequencing. They make use of microfluidic devices that allow RNA-seq sample preparation for individual cells encapsulated in droplets, followed by pooling of the resulting DNA fragments and sequencing. The broadly used 10x Genomics technology uses oligo-dT primers to initiate the cDNA sequencing from the poly(A) tails of fragmented RNAs. Subsequent sequencing yields *libraries* of relatively short (100-200 nucleotides) *reads* that come predominantly from the 3’ ends of RNAs given the priming on the poly(A) tail. As in the ideal case (no amplification bias) each read came from the end of one mRNA, simply counting the reads that map to mRNAs of individual genes provides estimates of the expression levels of those genes within the respective cell. Currently, typical data sets cover thousands of genes in tens-to-hundreds of thousands of cells. However, we are still far from being able to prepare ideal libraries, for many reasons. First, as gene expression is a bursty, stochastic process, there will be fluctuations in the number of RNAs (corresponding to a given gene) that are present in any one cell at the time of sampling, even when the time-average of those RNA numbers were to be the same across cells. Secondly, the sample preparation steps are carried out by various enzymes with limited efficiency. This leads to substantial fluctuations in the number of molecules that are “captured” for a gene in a given cell, even if all cells were to have the same abundance of these molecules at the time of sampling. Third, the biochemical reactions that are part of sample preparation do not have absolute specificity. A clear example is the priming of the cDNA synthesis with oligo(dT): although the primer is intended for the poly(A) tails at the 3’ ends of RNAs, it is clear that the primer also binds to A-rich stretches that are internal to transcripts, and especially located in intronic regions. Finally, a conceptual issue with the single cell data is that one cannot apply the principle of averaging measurement values across replicate experiments to obtain more precise estimates, because we do not know which cells could be considered replicates of each other (if that is at all conceivable). diff --git a/scRNAsim_toolz/cdna_generator/README.md b/scRNAsim_toolz/cdna_generator/README.md index 19582f0..b05b734 100644 --- a/scRNAsim_toolz/cdna_generator/README.md +++ b/scRNAsim_toolz/cdna_generator/README.md @@ -1,23 +1,31 @@ # cDNA Generator module +## Usage +``` +usage: cdna-generator [-h] -ifa INPUT_FASTA -igtf INPUT_GTF -icpn INPUT_COPY_NUMBER -ofa OUTPUT_FASTA -ocsv OUTPUT_CSV [-v] + +Generate cDNA sequences based on primer probabilities. + +options: + -h, --help show this help message and exit + -ifa INPUT_FASTA, --input_fasta INPUT_FASTA + genome fasta file + -igtf INPUT_GTF, --input_gtf INPUT_GTF + gtf file + -icpn INPUT_COPY_NUMBER, --input_copy_number INPUT_COPY_NUMBER + input copy number (csv) file + -ofa OUTPUT_FASTA, --output_fasta OUTPUT_FASTA + output fasta file + -ocsv OUTPUT_CSV, --output_csv OUTPUT_CSV + output fasta file +``` +Example: +``` +cdna-generator -ifa tests/cdna_generator/files/transcript.fasta -igtf tests/cdna_generator/files/Example_GTF_Input.GTF -icpn tests/cdna_generator/files/copy_number_input.csv -ofa cdna_seq.fa -ocsv cdna_counts.csv +``` + +## Overview Generate cDNA based on mRNA transcript sequences and the coresponding priming probabilities. -## Example usage -A simple example can be run from the test_files directory: - - python ../cdna/cli.py -ifa yeast_example.fa -icpn copy_number_input.csv -igt Example_GTF_Input.GTF -ofa cDNA.fasta -ocsv cDNA.csv - -## Installation - - pip install . - -## Docker -A docker image is available, to fetch this image: - - docker pull ericdb/my-image - -To run a simple example using this image: - - docker run my-image python cdna/cli.py -ifa test_files/yeast_example.fa -icpn test_files/copy_number_input.csv -igt test_files/Example_GTF_Input.GTF -ofa test_files/cDNA.fasta -ocsv test_files/cDNA.csv ## License diff --git a/scRNAsim_toolz/cdna_generator/cdna.py b/scRNAsim_toolz/cdna_generator/cdna.py index a4f2c29..c30d892 100644 --- a/scRNAsim_toolz/cdna_generator/cdna.py +++ b/scRNAsim_toolz/cdna_generator/cdna.py @@ -32,8 +32,10 @@ def complement(res: str) -> str: def seq_complement(sequence: str) -> Optional[str]: - """Return the corresponding cDNA sequence by finding the complementary \ - base pairs and returning the reversed sequence. + """Return the corresponding cDNA sequence. + + Find the complementary base pairs and + returning the reversed sequence. Args: sequence: sequence to be converted into cDNA. @@ -78,75 +80,28 @@ def run(self) -> None: Returns: None """ - self.read_csv() - self.read_fasta() - self.read_gtf() + self.process_csv() + self.process_fasta() + self.process_gtf() self.add_sequences() self.add_complement() self.add_records() self.write_fasta() self.write_csv() - def add_records(self) -> None: - """Add data records to fasta file. - - Adds the copy number information to the fasta records. - - Returns: None - - """ - self.fasta_records = [] - for _, row in self.gtf_df.iterrows(): - if row["complement"] is not None: - copy_number = row["Transcript_Copy_Number"] - for _ in range(int(copy_number)): - record = SeqRecord( - Seq(row["complement"]), - row["cdna_ID"], - f"Transcript copy number: {copy_number}", - "", - ) - self.fasta_records.append(record) - - def add_sequences(self) -> None: - """Add the sequence for a given priming site. - - Returns: None - - """ - self.gtf_df["priming_site"] = self.gtf_df.apply( - lambda row: self.read_primingsite(row["seqname"], row["start"]), - axis=1, - ) - - def add_complement(self) -> None: - """Add the complementary cDNA sequence. - - Returns: None - - """ - self.gtf_df["complement"] = self.gtf_df["priming_site"].apply( - seq_complement - ) - - def read_primingsite(self, sequence: str, end: int) -> None: - """Read a fasta file from a given start character. - - Reads a fasta sequence with ID (sequence) and returns the - sequence starting from the index start. + def process_csv(self) -> None: + """Read a given copy number csv file. - Args: - sequence: sequence ID to be read. - end: end index of the priming site. + Wrapper for Pandas read_csv. Returns: None """ - if sequence not in self.fasta_dict.keys(): - return None - return self.fasta_dict[sequence].seq[:end] + df_csv = pd.read_csv(self.cpn, index_col=False) + df_csv = df_csv.reset_index() # make sure indexes pair with number of rows # noqa: E501 + self.csv_df = df_csv - def read_fasta(self) -> None: + def process_fasta(self) -> None: """Read a given fasta file. Wrapper for SeqIO.parse. @@ -154,23 +109,10 @@ def read_fasta(self) -> None: Returns: None """ - record = SeqIO.parse(self.fasta, "fasta") - records = list(record) + records = list(SeqIO.parse(self.fasta, "fasta")) self.fasta_dict = {x.name: x for x in records} - def read_csv(self) -> None: - """Read a given copy number csv file. - - Wrapper for Pandas read_csv. - - Returns: None - - """ - df_csv = pd.read_csv(self.cpn, index_col=False) - df_csv = df_csv.reset_index() # make sure indexes pair with number of rows # noqa: E501 - self.csv_df = df_csv - - def read_gtf(self) -> None: + def process_gtf(self) -> None: """Read and process the GTF file. Reads a GTF file and determines copy numbers from @@ -183,9 +125,7 @@ def read_gtf(self) -> None: # "feature", "seqname", "start", "end" # alongside the names of any optional keys # which appeared in the attribute column - gtf_df = read_gtf(self.gtf) - - gtf_df = gtf_df.to_pandas() # convert polars df to pandas df + gtf_df = read_gtf(self.gtf, result_type="pandas") # from gtfparse gtf_df["Binding_Probability"] = pd.to_numeric( gtf_df["Binding_Probability"] @@ -205,14 +145,14 @@ def read_gtf(self) -> None: else: count = 0 # reset count # CSV transcript ID - id_csv = str(row["seqname"]).split("_")[1] + id_csv = f"{id_}_{count}" # Calculate Normalized_Binding_Probability and add to GTF dataframe gtf_df.loc[index, "Normalized_Binding_Probability"] = ( row["Binding_Probability"] / df_norm_bind_prob[id_] ) # Calculate Normalized_Binding_Probability and add to GTF dataframe csv_transcript_copy_number = self.csv_df.loc[ - self.csv_df.iloc[:, 1] == int(id_csv), + self.csv_df["ID of transcript"] == id_csv, "Transcript copy number", ].iloc[0] # pop the first value in the frame gtf_df.loc[index, "Transcript_Copy_Number"] = round( @@ -227,6 +167,65 @@ def read_gtf(self) -> None: ].astype(int) self.gtf_df = gtf_df + def add_sequences(self) -> None: + """Add the sequence for a given priming site. + + Returns: None + + """ + self.gtf_df["priming_site"] = self.gtf_df.apply( + lambda row: self.read_primingsite(row["seqname"], row["start"]), + axis=1, + ) + + def read_primingsite(self, sequence: str, end: int) -> None: + """Read a fasta file from a given start character. + + Reads a fasta sequence with ID (sequence) and returns the + sequence starting from the index start. + + Args: + sequence: sequence ID to be read. + end: end index of the priming site. + + Returns: None + + """ + if sequence not in self.fasta_dict.keys(): + return None + return self.fasta_dict[sequence].seq[:end] + + def add_complement(self) -> None: + """Add the complementary cDNA sequence. + + Returns: None + + """ + self.gtf_df["complement"] = self.gtf_df["priming_site"].apply( + seq_complement + ) + + def add_records(self) -> None: + """Add data records to fasta file. + + Adds the copy number information to the fasta records. + + Returns: None + + """ + self.fasta_records = [] + for _, row in self.gtf_df.iterrows(): + if row["complement"] is not None: + copy_number = row["Transcript_Copy_Number"] + for _ in range(int(copy_number)): + record = SeqRecord( + Seq(row["complement"]), + row["cdna_ID"], + f"Transcript copy number: {copy_number}", + "", + ) + self.fasta_records.append(record) + def write_fasta(self) -> None: """Write cDNA fasta records to file. diff --git a/scRNAsim_toolz/cdna_generator/cli.py b/scRNAsim_toolz/cdna_generator/cli.py index 567763c..a504fe9 100644 --- a/scRNAsim_toolz/cdna_generator/cli.py +++ b/scRNAsim_toolz/cdna_generator/cli.py @@ -22,7 +22,7 @@ def main(): """ parser = argparse.ArgumentParser( - prog="cDNA generator", + prog="cdna-generator", description="Generate cDNA sequences based on primer probabilities.", ) parser.add_argument( diff --git a/scRNAsim_toolz/fragment_selector/README.md b/scRNAsim_toolz/fragment_selector/README.md index 1099b91..c3ded0c 100644 --- a/scRNAsim_toolz/fragment_selector/README.md +++ b/scRNAsim_toolz/fragment_selector/README.md @@ -1,16 +1,40 @@ -# Terminal fragment selecting +# Terminal fragment selector + +## Usage +``` +usage: fragment-selector [-h] --fasta FASTA --counts COUNTS -o OUTPUT [--mean MEAN] [--std STD] [-s SIZE] [--sep SEP] [-v] + +Takes as input FASTA file of cDNA sequences, a CSV/TSV with sequence counts, and mean and std. dev. of fragment lengths and 4 nucleotide probabilities for the cuts. Outputs most terminal fragment (within desired length +range) for each sequence. + +options: + -h, --help show this help message and exit + --fasta FASTA Path to FASTA file with cDNA sequences + --counts COUNTS Path to CSV/TSV file with sequence counts + -o OUTPUT, --output OUTPUT + output file path + --mean MEAN Mean fragment length (default: 300) + --std STD Standard deviation fragment length (defafult: 60) + -s SIZE, --size SIZE Chunk size for batch processing + --sep SEP Sequence counts file separator. +``` + +Example: + +`fragment-selector --fasta tests/fragment_selector/files/test.fasta --counts tests/fragment_selector/files/test.csv --mean 50 --output fragments.fa` +## Overview Simulating single cell RNA library generation (scRNA-seq) This repository is as part of the Uni Basel course . To test the accuracy of scRNA-seq data we generated the *synthetic data*. That is, we reconstruct the properties of the experimental data set and determine whether the computational analysis can recover properties of the data that was assumed in the simulation. This is never trivial since setting the ground truth is much needed in the computational method to evaluate the result. -# Synopsis + As part of the sub-project, we implemented python code for selecting terminal fragments. Detailed distribution used for the selecting fragments can be found below, summarised in [this paper](https://www.nature.com/articles/srep04532#MOESM1). > Next Generation Sequencing (NGS) technology is based on cutting DNA into small fragments and their massive parallel sequencing. The multiple overlapping segments termed “reads” are assembled into a contiguous sequence. To reduce sequencing errors, every genome region should be sequenced several dozen times. This sequencing approach is based on the assumption that genomic DNA breaks are random and sequence-independent. However, previously we showed that for the sonicated restriction DNA fragments the rates of double-stranded breaks depend on the nucleotide sequence. In this work we analyzed genomic reads from NGS data and discovered that fragmentation methods based on the action of the hydrodynamic forces on DNA, produce similar bias. Consideration of this non-random DNA fragmentation may allow one to unravel what factors and to what extent influence the non-uniform coverage of various genomic regions. As a whole project, we implemented a procedure for sampling reads from mRNA sequences, incorporating a few sources of “noise”. These include the presence of multiple transcript isoforms from a given gene, some that are incompletely spliced, stochastic binding of primers to RNA fragments and stochastic sampling of DNA fragments for sequencing. We will then use standard methods to estimate gene expression from the simulated data. We will repeat the process multiple times, each time corresponding to a single cell. We will then compare the estimates obtained from the simulated cells with the gene expression values assumed in the simulation. We will also try to explore which steps in the sample preparation have the largest impact on the accuracy of gene expression estimates. -# Usage + CLI arguments: - fasta (required): Path to FASTA file with cDNA sequences - counts (required): Path to CSV/TSV file with sequence counts @@ -26,20 +50,6 @@ CLI arguments: Output: - Text file with most terminal fragments for each sequence. -To install package, run - -``` -pip install . -``` - - -# Development - -To build Docker image, run - -``` -docker build -t terminal_fragment_selector . -``` # License diff --git a/scRNAsim_toolz/priming_site_predictor/README.md b/scRNAsim_toolz/priming_site_predictor/README.md index 484ae49..54a05b8 100644 --- a/scRNAsim_toolz/priming_site_predictor/README.md +++ b/scRNAsim_toolz/priming_site_predictor/README.md @@ -1,26 +1,33 @@ # Priming Site Predictor of Transcript Sequences -## Overview -Priming Site Predictor which uses a seed-and-extension algorithm (*RIblast*: https://github.com/fukunagatsu/RIblast) to *Predict Priming Sites* of oligo dT primers in target sequences. Furthermore, *Binding Energies* are calculated and classified with a threshold value. Additionally, the binding sites are associated with *Binding Probabilities* and stored in a *gtf file* for further processes. - -## Version -Version 0.1.0 (2022/11/15) - -## Installation from GitLab -Priming Site Predictor requires Python 3.9 or later. +## Usage +``` +usage: priming-site-predictor [-h] [-f FASTA_FILE] [-p PRIMER_SEQUENCE] [-e ENERGY_CUTOFF] [-r RIBLAST_OUTPUT] [-o OUTPUT_FILENAME] [-v] -Install Priming Site Predictor from GitLab using: +Compute potential priming sites using RIBlast. +options: + -h, --help show this help message and exit + -f FASTA_FILE, --fasta-file FASTA_FILE + Fasta-formatted file of transcript sequences + -p PRIMER_SEQUENCE, --primer-sequence PRIMER_SEQUENCE + Primer sequence + -e ENERGY_CUTOFF, --energy-cutoff ENERGY_CUTOFF + Energy cutoff for interactions + -r RIBLAST_OUTPUT, --riblast-output RIBLAST_OUTPUT + Path to RIBlast output file + -o OUTPUT_FILENAME, --output-filename OUTPUT_FILENAME + Path where the output gtf should be written ``` -git clone https://git.scicore.unibas.ch/zavolan_group/tools/priming-site-predictor.git -cd priming-site-predictor -``` -Create scRNA-seq-simulation conda environment: + +Example: ``` -conda env create --file environment.yml -conda activate scrna-seq-sim +priming-site-predictor --riblast-output tests/priming_site_predictor/files/RIBlast_output_example.txt --output-filename priming_sites.gtf ``` +## Overview +Priming Site Predictor which uses a seed-and-extension algorithm (*RIblast*: https://github.com/fukunagatsu/RIblast) to *Predict Priming Sites* of oligo dT primers in target sequences. Furthermore, *Binding Energies* are calculated and classified with a threshold value. Additionally, the binding sites are associated with *Binding Probabilities* and stored in a *gtf file* for further processes. + ## Usage ``` usage: primingsitepredictor [-h] [-f FASTA_FILE] [-p PRIMER_SEQUENCE] [-e ENERGY_CUTOFF] [-r RIBLAST_OUTPUT] [-o OUTPUT_FILENAME] diff --git a/scRNAsim_toolz/priming_site_predictor/psp.py b/scRNAsim_toolz/priming_site_predictor/psp.py index 190c18b..b5f14e6 100644 --- a/scRNAsim_toolz/priming_site_predictor/psp.py +++ b/scRNAsim_toolz/priming_site_predictor/psp.py @@ -130,7 +130,9 @@ def generate_gtf(self): 'Hybridization_Energy ' + f'"{row.iloc[6]}"; ' 'Interaction_Energy ' + f'"{row.iloc[7]}"; ' 'Number_of_binding_sites ' + f'"{row.iloc[12]}"; ' - 'Binding_Probability ' + f'"{row.iloc[14]}"\n' + # Not actually the binding probability, which is the + # normalized binding energy + 'Binding_Probability ' + f'"{row.iloc[13]}"\n' ) LOG.info("Generating output gtf file...") diff --git a/scRNAsim_toolz/read_sequencer/README.md b/scRNAsim_toolz/read_sequencer/README.md index 31db088..68a6958 100644 --- a/scRNAsim_toolz/read_sequencer/README.md +++ b/scRNAsim_toolz/read_sequencer/README.md @@ -1,53 +1,37 @@ # Read Sequencer -## Overview - -Read Sequencer is a python package to simulate sequencing. -It reads fasta files, simulate sequencing with specified read length and writes the resulting sequences into a new fasta file. - - -## Installation from github - -Read Sequencer requires Python 3.9 or later. - -Install Read Sequencer from Github using: - -``` -git clone https://git.scicore.unibas.ch/zavolan_group/tools/read-sequencer.git -cd read-sequencer -pip install . -``` - ## Usage - ``` -usage: readsequencer [-h] [-i INPUT] [-r READ_LENGTH] [-n N_RANDOM] [-s CHUNK_SIZE] output +usage: read-sequencer [-h] [-i INPUT] [-r READ_LENGTH] [-n N_RANDOM] [-s CHUNK_SIZE] [-v] output + Simulates sequencing of DNA sequences specified by an FASTA file. positional arguments: output path to FASTA file -optional arguments: +options: -h, --help show this help message and exit -i INPUT, --input INPUT path to FASTA file -r READ_LENGTH, --read-length READ_LENGTH read length for sequencing -n N_RANDOM, --n_random N_RANDOM - n random sequences. Just used if input fasta file is not specified. + n random sequences. Just used if inputfasta file is not specified. -s CHUNK_SIZE, --chunk-size CHUNK_SIZE chunk_size for batch processing +``` +Examples: +``` +read-sequencer -i tests/read_sequencer/files/50_seqs_50_1000_bp.fasta -r 100 sequenced_reads.fa +read-sequencer -n 50 -r 100 random_reads.fa ``` -## Docker +## Overview -The docker image is available on docker hub: https://hub.docker.com/r/grrchrr/readsequencer +Read Sequencer is a python package to simulate sequencing. +It reads fasta files, simulate sequencing with specified read length and writes the resulting sequences into a new fasta file. -``` -docker pull grrchrr/readsequencer -docker run readsequencer readsequencer --help -``` ## Contributors and Contact Information diff --git a/scRNAsim_toolz/read_sequencer/cli.py b/scRNAsim_toolz/read_sequencer/cli.py index b067c93..b376c20 100644 --- a/scRNAsim_toolz/read_sequencer/cli.py +++ b/scRNAsim_toolz/read_sequencer/cli.py @@ -17,7 +17,7 @@ def main(): """Use CLI arguments to simulate sequencing.""" parser = argparse.ArgumentParser( - prog="readsequencer", + prog="read-sequencer", description="Simulates sequencing of DNA sequences specified \ by an FASTA file.", ) diff --git a/scRNAsim_toolz/sequence_extractor/README.md b/scRNAsim_toolz/sequence_extractor/README.md index 4a5aff8..a86f14a 100644 --- a/scRNAsim_toolz/sequence_extractor/README.md +++ b/scRNAsim_toolz/sequence_extractor/README.md @@ -1,7 +1,31 @@ +# Sequence Extractor -# Extract transcript sequences +## Usage +``` +usage: sequence-extractor [-h] --mode {pre_bedtools,post_bedtools} [-i INPUT_FASTA_FILE] [-o OUTPUT_FILE_NAME] [-p POLY_A_LENGTH] [--input-gtf-file INPUT_GTF_FILE] [--output-bed-file OUTPUT_BED_FILE] [-v] + +extracts transcript sequences from genome sequence andouputs transcripts with PolyA tail added to them + +options: + -h, --help show this help message and exit + --mode {pre_bedtools,post_bedtools} + Select the mode of operation('pre_bedtools' or 'post_bedtools'). + -i INPUT_FASTA_FILE, --input-fasta-file INPUT_FASTA_FILE + Fasta-formatted file obtained from bedtools + -o OUTPUT_FILE_NAME, --output-file-name OUTPUT_FILE_NAME + Name of the output fasta file + -p POLY_A_LENGTH, --polyA-length POLY_A_LENGTH + Length of the polyA tail to be added (def: 250) + --input-gtf-file INPUT_GTF_FILE + Ordered and processed gtf file for 'pre_bedtools' mode. + --output-bed-file OUTPUT_BED_FILE + Bed file with only exons with strandednesstaken into account for 'pre_bedtools' mode. +``` + +Example: -### Project aim: +`sequence-extractor --mode post_bedtools --input-fasta-file tests/sequence_extractor/files/post_bedtools_test.fa --polyA-length 250 --output-file-name polyA_output.fa` +## Overview Given a gtf specification of transcript exon/intron structures and the genome sequence, construct the nucleotide sequence of the transcripts and add poly(A) tails. @@ -39,22 +63,6 @@ d. Append via e.g. str.join(), str.ljust() e. Output the final transcript sequences as a .fasta file. (Final Output) -## Installation - -TBA - - -## Usage/Examples - -```python script - transcript_sequence_extractor - --input_fasta_file `genome fasta file` - --input_gtf `gtf file` - --output_file_name `output fasta file` -``` - - - ## License [MIT](https://choosealicense.com/licenses/mit/) license, Copyright (c) 2022 Zavolan Lab, Biozentrum, University of Basel diff --git a/scRNAsim_toolz/structure_generator/README.md b/scRNAsim_toolz/structure_generator/README.md index e401050..ac1725b 100644 --- a/scRNAsim_toolz/structure_generator/README.md +++ b/scRNAsim_toolz/structure_generator/README.md @@ -1,19 +1,31 @@ -# Synopsis +# Structure generator -The human body contains a countless variety and diversity of cell types, states, and interactions. We wish to understand these tissues and the cell types at much deeper level. Single-cell RNA-seq (scRNA-seq) offers a look into what genes are being expressed at the level of individual cells. Overall this method allows one to identify cell types, find rare or unidentified cell types or states, identify genes that are differently expressed in different cell types, and explore changes in expression whilst including spatial, regulatory, and protein interactions. - -We hope that others would find use for this transcript_structure generator that allows one to take input gtf-files of specific gene transcripts and outputs a gtf-file containing intron/exon structures per input transcript. Moreover, one can specify a probability for intron-inclusion which is used to simulate incorrect splicing. +## Usage -# Installation +``` +usage: structure-generator [-h] [-p PROB_INCLUSION] [--log LOG] [-v] transcripts annotation -To install package, run +positional arguments: + transcripts Path to csv file with number of transcripts (ID,Count). + annotation Path to gtf-file with exon annotation. +options: + -h, --help show this help message and exit + -p PROB_INCLUSION, --prob-inclusion PROB_INCLUSION + Probability of intron inclusion. + --log LOG Level of logging. Can be one of ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"] ``` -pip install "setuptools>=62.1.0" -pip install . -``` -# Usage +Example: + +`structure-generator -p 0.3 tests/structure_generator/files/Transcript1.csv tests/structure_generator/files/Annotations2.gtf` +## Overview + +The human body contains a countless variety and diversity of cell types, states, and interactions. We wish to understand these tissues and the cell types at much deeper level. Single-cell RNA-seq (scRNA-seq) offers a look into what genes are being expressed at the level of individual cells. Overall this method allows one to identify cell types, find rare or unidentified cell types or states, identify genes that are differently expressed in different cell types, and explore changes in expression whilst including spatial, regulatory, and protein interactions. + +We hope that others would find use for this transcript_structure generator that allows one to take input gtf-files of specific gene transcripts and outputs a gtf-file containing intron/exon structures per input transcript. Moreover, one can specify a probability for intron-inclusion which is used to simulate incorrect splicing. + + Input: - csv-formatted file ("ID,Count") with counts for individual transcripts @@ -31,12 +43,12 @@ Output: To generate the sampled transcripts, run ``` -transcript-structure-generator --prob-inclusion [--log "INFO"] +structure-generator --prob-inclusion [--log "INFO"] ``` where the transcripts file should be csv-formatted, the annotation file gtf-formatted and the inclusion probability for introns a float in the range [0,1]. The log parameter is optional and can be one of `["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]`. The default is `INFO`. -Sample Transcripts and Annotation files can be found in the repository under main/tests/resources. +Sample Transcripts and Annotation files can be found in the repository under tests/. # License diff --git a/scRNAsim_toolz/transcript_sampler/README.md b/scRNAsim_toolz/transcript_sampler/README.md index bbd644b..c740b2e 100644 --- a/scRNAsim_toolz/transcript_sampler/README.md +++ b/scRNAsim_toolz/transcript_sampler/README.md @@ -1,29 +1,4 @@ -# Transcript Sampler - -## Overview -This workflow samples representative transcripts per gene, in proportion to their relative abundance levels. Sampling is done by Poisson sampling. - -This workflow takes as input: -- Path to genome annotation file in gtf format -- Path to csv or tsv file with transcript IDs and expression levels -- Path to output sample gtf file -- Path to output sample transcript IDs and counts -- Integer of number of transcripts to sample - -The outputs are : -- trancript sample gtf file -- csv file containing sample transcript IDs and counts. - -## Installation from github -Transcript sampler requires Python 3.9 or later. - -Install Transcript sampler from Github using: - -``` -git clone https://git.scicore.unibas.ch/zavolan_group/tools/transcript-sampler.git -cd transcript-sampler -pip install . -``` +# Transcript sampler ## Usage ``` @@ -47,4 +22,20 @@ options: Example : -`transcript-sampler --input_gtf="tests/inputs/test.gtf" --input_csv="tests/inputs/expression.csv" --output_gtf="output_files/output.gtf" --output_csv="output_files/output.csv" --n_to_sample=100` +`transcript-sampler --input_gtf tests/transcript_sampler/files/test.gtf --input_csv tests/transcript_sampler/files/expression.csv --output_gtf sampled.gtf --output_csv sampled.csv --n_to_sample 100` + + +## Overview +This workflow samples representative transcripts per gene, in proportion to their relative abundance levels. Sampling is done by Poisson sampling. + +This workflow takes as input: +- Path to genome annotation file in gtf format +- Path to csv or tsv file with transcript IDs and expression levels +- Path to output sample gtf file +- Path to output sample transcript IDs and counts +- Integer of number of transcripts to sample + +The outputs are : +- trancript sample gtf file +- csv file containing sample transcript IDs and counts. + \ No newline at end of file diff --git a/scRNAsim_toolz/transcript_sampler/match_explvl.py b/scRNAsim_toolz/transcript_sampler/match_explvl.py index a914d8c..b4408cc 100644 --- a/scRNAsim_toolz/transcript_sampler/match_explvl.py +++ b/scRNAsim_toolz/transcript_sampler/match_explvl.py @@ -31,7 +31,7 @@ def gtf_to_df(gtf_file: str) -> pd.DataFrame: Raises: None """ - df_gtf = read_gtf(gtf_file,).to_pandas() + df_gtf = read_gtf(gtf_file, result_type="pandas") df_gtf = df_gtf[df_gtf["feature"] == "transcript"] df_gtf = df_gtf[["gene_id", "transcript_id"]] df_gtf = df_gtf.rename(columns={ diff --git a/tests/cdna_generator/test_files/.gitkeep b/tests/cdna_generator/files/.gitkeep similarity index 100% rename from tests/cdna_generator/test_files/.gitkeep rename to tests/cdna_generator/files/.gitkeep diff --git a/tests/cdna_generator/test_files/Example_GTF_Input.GTF b/tests/cdna_generator/files/Example_GTF_Input.GTF similarity index 100% rename from tests/cdna_generator/test_files/Example_GTF_Input.GTF rename to tests/cdna_generator/files/Example_GTF_Input.GTF diff --git a/tests/cdna_generator/files/copy_number_input.csv b/tests/cdna_generator/files/copy_number_input.csv new file mode 100644 index 0000000..d9af710 --- /dev/null +++ b/tests/cdna_generator/files/copy_number_input.csv @@ -0,0 +1,7 @@ +ID of transcript,ID of parent transcript,Transcript copy number +Transcript_1_0,Transcript_1,12 +Transcript_1_1,Transcript_1,11 +Transcript_2_0,Transcript_2,33 +Transcript_3_0,Transcript_3,11 +Transcript_4_0,Transcript_4,55 +Transcript_5_0,Transcript_5,23 \ No newline at end of file diff --git a/tests/cdna_generator/test_files/transcript.fasta b/tests/cdna_generator/files/transcript.fasta similarity index 83% rename from tests/cdna_generator/test_files/transcript.fasta rename to tests/cdna_generator/files/transcript.fasta index bb37ee2..d1c83a1 100644 --- a/tests/cdna_generator/test_files/transcript.fasta +++ b/tests/cdna_generator/files/transcript.fasta @@ -1,8 +1,8 @@ ->1 +>Transcript_1 GAUAGCUAGAGGAUUCUCAGAGGAGAAGCUAGAGGAGCUAGAGGAGCUAGAGGAGCUAGAGGAGCUAGAGG ->2 +>Transcript_2 AGCUAGAGGAUAGCUAGAGGAUAGCUAGAGGAUAGCUAGAGGAGCUAGAGGAGCUAGAGGAGCUAGAGG ->3 +>Transcript_3 AGCUAGAGGAUAGCUAGAGGAUAGCUAGAGGAUAGCUAGAGGAUAGCUAGAGGAGCUAGAGG ->4 +>Transcript_4 AGCUAGAGGAUAGCUAGAGGAUAGCUAGAGGAUAGCUAGAGGAUAGCUAGAGGAGCUAGAGGAGCUAGAGG diff --git a/tests/cdna_generator/test_files/yeast_example.fa b/tests/cdna_generator/files/yeast_example.fa similarity index 100% rename from tests/cdna_generator/test_files/yeast_example.fa rename to tests/cdna_generator/files/yeast_example.fa diff --git a/tests/cdna_generator/test_files/cDNA.csv b/tests/cdna_generator/test_files/cDNA.csv deleted file mode 100644 index d88da42..0000000 --- a/tests/cdna_generator/test_files/cDNA.csv +++ /dev/null @@ -1,7 +0,0 @@ -cdna_ID,Transcript_Copy_Number -Transcript_1_0,8.0 -Transcript_1_1,4.0 -Transcript_2_0,11.0 -Transcript_3_0,33.0 -Transcript_4_0,11.0 -Transcript_5_0,55.0 diff --git a/tests/cdna_generator/test_files/cDNA.fasta b/tests/cdna_generator/test_files/cDNA.fasta deleted file mode 100644 index 2a8d9b5..0000000 --- a/tests/cdna_generator/test_files/cDNA.fasta +++ /dev/null @@ -1,86 +0,0 @@ ->Transcript_1_0 -AAAAAAAAAA ->Transcript_1_0 -AAAAAAAAAA ->Transcript_1_0 -AAAAAAAAAA ->Transcript_1_0 -AAAAAAAAAA ->Transcript_1_0 -AAAAAAAAAA ->Transcript_1_0 -AAAAAAAAAA ->Transcript_1_0 -AAAAAAAAAA ->Transcript_1_0 -AAAAAAAAAA ->Transcript_1_1 -ACATATGTGTGGATGATATAAGATTTGTCCCAAGGACGCTTCTCGAATAGTTCTTTTGTC -CCTAGATATTTCCTTCTCTAATTTCTTCTGTCTCTCCAAAACATCTTGTTCATTTTTGGG -TGGTGGCAGCATTTGTTCTTTATGTAGGAAAGCCTTTGTAGCACGGTTTACAATTTTTAC -TGCACAAACCTCATTAGTGTAACGATGTTTTGCCAGCTTCACTTTACCCATAGAACCTGC -ACCAACTGTTTCAACAAACTCCCAATCTCCTAACGATTTTCGATGAAACTGCTTAGGCAT -GCCCTGAGAAGACGAAACTCGACTCTGGCTGGTAGTATTTGGCTTTGGTGCATTCTCTCT -CGATTTACCTTCAAGTTCTACTTGTCTCTCCTTTTGCTCTGCTTGATGTGAGTTACCATT -ATTGGCGTATTTGATATCCGCCGGGGGCATTAGCGGTGTGTTCTGCTGCTGCTGCTGCTG -TGGACTTTTTCCCATCATTCTCAGCGTAGCGGGCGCCATAGTGCTTGGTTGTGTATGCAT -GCTGTTGCTTTCACTATTGCCATCATCCTGCTGGTTACCTCTGCCCATTGAGAAGGCAGT -ATTTACGTGATAATCATCCATAAAAAAAAAAAAAAAAAAA ->Transcript_1_1 -ACATATGTGTGGATGATATAAGATTTGTCCCAAGGACGCTTCTCGAATAGTTCTTTTGTC -CCTAGATATTTCCTTCTCTAATTTCTTCTGTCTCTCCAAAACATCTTGTTCATTTTTGGG -TGGTGGCAGCATTTGTTCTTTATGTAGGAAAGCCTTTGTAGCACGGTTTACAATTTTTAC -TGCACAAACCTCATTAGTGTAACGATGTTTTGCCAGCTTCACTTTACCCATAGAACCTGC -ACCAACTGTTTCAACAAACTCCCAATCTCCTAACGATTTTCGATGAAACTGCTTAGGCAT -GCCCTGAGAAGACGAAACTCGACTCTGGCTGGTAGTATTTGGCTTTGGTGCATTCTCTCT -CGATTTACCTTCAAGTTCTACTTGTCTCTCCTTTTGCTCTGCTTGATGTGAGTTACCATT -ATTGGCGTATTTGATATCCGCCGGGGGCATTAGCGGTGTGTTCTGCTGCTGCTGCTGCTG -TGGACTTTTTCCCATCATTCTCAGCGTAGCGGGCGCCATAGTGCTTGGTTGTGTATGCAT -GCTGTTGCTTTCACTATTGCCATCATCCTGCTGGTTACCTCTGCCCATTGAGAAGGCAGT -ATTTACGTGATAATCATCCATAAAAAAAAAAAAAAAAAAA ->Transcript_1_1 -ACATATGTGTGGATGATATAAGATTTGTCCCAAGGACGCTTCTCGAATAGTTCTTTTGTC -CCTAGATATTTCCTTCTCTAATTTCTTCTGTCTCTCCAAAACATCTTGTTCATTTTTGGG -TGGTGGCAGCATTTGTTCTTTATGTAGGAAAGCCTTTGTAGCACGGTTTACAATTTTTAC -TGCACAAACCTCATTAGTGTAACGATGTTTTGCCAGCTTCACTTTACCCATAGAACCTGC -ACCAACTGTTTCAACAAACTCCCAATCTCCTAACGATTTTCGATGAAACTGCTTAGGCAT -GCCCTGAGAAGACGAAACTCGACTCTGGCTGGTAGTATTTGGCTTTGGTGCATTCTCTCT -CGATTTACCTTCAAGTTCTACTTGTCTCTCCTTTTGCTCTGCTTGATGTGAGTTACCATT -ATTGGCGTATTTGATATCCGCCGGGGGCATTAGCGGTGTGTTCTGCTGCTGCTGCTGCTG -TGGACTTTTTCCCATCATTCTCAGCGTAGCGGGCGCCATAGTGCTTGGTTGTGTATGCAT -GCTGTTGCTTTCACTATTGCCATCATCCTGCTGGTTACCTCTGCCCATTGAGAAGGCAGT -ATTTACGTGATAATCATCCATAAAAAAAAAAAAAAAAAAA ->Transcript_1_1 -ACATATGTGTGGATGATATAAGATTTGTCCCAAGGACGCTTCTCGAATAGTTCTTTTGTC -CCTAGATATTTCCTTCTCTAATTTCTTCTGTCTCTCCAAAACATCTTGTTCATTTTTGGG -TGGTGGCAGCATTTGTTCTTTATGTAGGAAAGCCTTTGTAGCACGGTTTACAATTTTTAC -TGCACAAACCTCATTAGTGTAACGATGTTTTGCCAGCTTCACTTTACCCATAGAACCTGC -ACCAACTGTTTCAACAAACTCCCAATCTCCTAACGATTTTCGATGAAACTGCTTAGGCAT -GCCCTGAGAAGACGAAACTCGACTCTGGCTGGTAGTATTTGGCTTTGGTGCATTCTCTCT -CGATTTACCTTCAAGTTCTACTTGTCTCTCCTTTTGCTCTGCTTGATGTGAGTTACCATT -ATTGGCGTATTTGATATCCGCCGGGGGCATTAGCGGTGTGTTCTGCTGCTGCTGCTGCTG -TGGACTTTTTCCCATCATTCTCAGCGTAGCGGGCGCCATAGTGCTTGGTTGTGTATGCAT -GCTGTTGCTTTCACTATTGCCATCATCCTGCTGGTTACCTCTGCCCATTGAGAAGGCAGT -ATTTACGTGATAATCATCCATAAAAAAAAAAAAAAAAAAA ->Transcript_2_0 -AAA ->Transcript_2_0 -AAA ->Transcript_2_0 -AAA ->Transcript_2_0 -AAA ->Transcript_2_0 -AAA ->Transcript_2_0 -AAA ->Transcript_2_0 -AAA ->Transcript_2_0 -AAA ->Transcript_2_0 -AAA ->Transcript_2_0 -AAA ->Transcript_2_0 -AAA diff --git a/tests/cdna_generator/test_files/copy_number_file.csv b/tests/cdna_generator/test_files/copy_number_file.csv deleted file mode 100644 index 6903251..0000000 --- a/tests/cdna_generator/test_files/copy_number_file.csv +++ /dev/null @@ -1,6 +0,0 @@ -ID of transcript, ID of parent transcript, Transcript copy number -1,1,12 -2,1,11 -3,2,33 -4,3,11 -5,4,55 diff --git a/tests/cdna_generator/test_files/copy_number_input.csv b/tests/cdna_generator/test_files/copy_number_input.csv deleted file mode 100644 index d3bfaaa..0000000 --- a/tests/cdna_generator/test_files/copy_number_input.csv +++ /dev/null @@ -1,6 +0,0 @@ -ID of transcript,ID of parent transcript,Transcript copy number -1,1,12 -2,1,11 -3,2,33 -4,3,11 -5,4,55 \ No newline at end of file diff --git a/tests/fragment_selector/test_files/test.csv b/tests/fragment_selector/files/test.csv similarity index 100% rename from tests/fragment_selector/test_files/test.csv rename to tests/fragment_selector/files/test.csv diff --git a/tests/fragment_selector/test_files/test.fasta b/tests/fragment_selector/files/test.fasta similarity index 100% rename from tests/fragment_selector/test_files/test.fasta rename to tests/fragment_selector/files/test.fasta diff --git a/tests/fragment_selector/test_files/test_tab.csv b/tests/fragment_selector/files/test_tab.csv similarity index 100% rename from tests/fragment_selector/test_files/test_tab.csv rename to tests/fragment_selector/files/test_tab.csv diff --git a/tests/priming_site_predictor/test_files/RIBlast_output_example.txt b/tests/priming_site_predictor/files/RIBlast_output_example.txt similarity index 100% rename from tests/priming_site_predictor/test_files/RIBlast_output_example.txt rename to tests/priming_site_predictor/files/RIBlast_output_example.txt diff --git a/tests/priming_site_predictor/test_files/output_transcripts_df.gtf b/tests/priming_site_predictor/files/output_transcripts_df.gtf similarity index 100% rename from tests/priming_site_predictor/test_files/output_transcripts_df.gtf rename to tests/priming_site_predictor/files/output_transcripts_df.gtf diff --git a/tests/priming_site_predictor/test_files/primer1.fasta b/tests/priming_site_predictor/files/primer1.fasta similarity index 100% rename from tests/priming_site_predictor/test_files/primer1.fasta rename to tests/priming_site_predictor/files/primer1.fasta diff --git a/tests/priming_site_predictor/test_files/test_fasta.fasta b/tests/priming_site_predictor/files/test_fasta.fasta similarity index 100% rename from tests/priming_site_predictor/test_files/test_fasta.fasta rename to tests/priming_site_predictor/files/test_fasta.fasta diff --git a/tests/priming_site_predictor/test_files/yeast_test_files/dbRNA_test.fa b/tests/priming_site_predictor/files/yeast_test_files/dbRNA_test.fa similarity index 100% rename from tests/priming_site_predictor/test_files/yeast_test_files/dbRNA_test.fa rename to tests/priming_site_predictor/files/yeast_test_files/dbRNA_test.fa diff --git a/tests/priming_site_predictor/test_files/yeast_test_files/queryRNA_test.fa b/tests/priming_site_predictor/files/yeast_test_files/queryRNA_test.fa similarity index 100% rename from tests/priming_site_predictor/test_files/yeast_test_files/queryRNA_test.fa rename to tests/priming_site_predictor/files/yeast_test_files/queryRNA_test.fa diff --git a/tests/read_sequencer/fasta_testfile/50_seqs_50_1000_bp.fasta b/tests/read_sequencer/files/50_seqs_50_1000_bp.fasta similarity index 100% rename from tests/read_sequencer/fasta_testfile/50_seqs_50_1000_bp.fasta rename to tests/read_sequencer/files/50_seqs_50_1000_bp.fasta diff --git a/tests/sequence_extractor/test_files/post_bedtools_test.fa b/tests/sequence_extractor/files/post_bedtools_test.fa similarity index 100% rename from tests/sequence_extractor/test_files/post_bedtools_test.fa rename to tests/sequence_extractor/files/post_bedtools_test.fa diff --git a/tests/sequence_extractor/test_files/test.bed b/tests/sequence_extractor/files/test.bed similarity index 100% rename from tests/sequence_extractor/test_files/test.bed rename to tests/sequence_extractor/files/test.bed diff --git a/tests/sequence_extractor/test_files/test.gtf b/tests/sequence_extractor/files/test.gtf similarity index 100% rename from tests/sequence_extractor/test_files/test.gtf rename to tests/sequence_extractor/files/test.gtf diff --git a/tests/sequence_extractor/test_files/test_1.fa b/tests/sequence_extractor/files/test_1.fa similarity index 100% rename from tests/sequence_extractor/test_files/test_1.fa rename to tests/sequence_extractor/files/test_1.fa diff --git a/tests/sequence_extractor/test_files/test_2.fa b/tests/sequence_extractor/files/test_2.fa similarity index 100% rename from tests/sequence_extractor/test_files/test_2.fa rename to tests/sequence_extractor/files/test_2.fa diff --git a/tests/structure_generator/test_files/Annotation1.gtf b/tests/structure_generator/files/Annotation1.gtf similarity index 100% rename from tests/structure_generator/test_files/Annotation1.gtf rename to tests/structure_generator/files/Annotation1.gtf diff --git a/tests/structure_generator/test_files/Annotations2.gtf b/tests/structure_generator/files/Annotations2.gtf similarity index 100% rename from tests/structure_generator/test_files/Annotations2.gtf rename to tests/structure_generator/files/Annotations2.gtf diff --git a/tests/structure_generator/test_files/Transcript1.csv b/tests/structure_generator/files/Transcript1.csv similarity index 100% rename from tests/structure_generator/test_files/Transcript1.csv rename to tests/structure_generator/files/Transcript1.csv diff --git a/tests/structure_generator/test_files/Transcript2.tsv b/tests/structure_generator/files/Transcript2.tsv similarity index 100% rename from tests/structure_generator/test_files/Transcript2.tsv rename to tests/structure_generator/files/Transcript2.tsv diff --git a/tests/transcript_sampler/inputs/expression.csv b/tests/transcript_sampler/files/expression.csv similarity index 100% rename from tests/transcript_sampler/inputs/expression.csv rename to tests/transcript_sampler/files/expression.csv diff --git a/tests/transcript_sampler/inputs/new_test.gtf b/tests/transcript_sampler/files/new_test.gtf similarity index 100% rename from tests/transcript_sampler/inputs/new_test.gtf rename to tests/transcript_sampler/files/new_test.gtf diff --git a/tests/transcript_sampler/inputs/test.gtf b/tests/transcript_sampler/files/test.gtf similarity index 100% rename from tests/transcript_sampler/inputs/test.gtf rename to tests/transcript_sampler/files/test.gtf diff --git a/tests/transcript_sampler/inputs/test_dict_repr_trans.txt b/tests/transcript_sampler/files/test_dict_repr_trans.txt similarity index 100% rename from tests/transcript_sampler/inputs/test_dict_repr_trans.txt rename to tests/transcript_sampler/files/test_dict_repr_trans.txt diff --git a/tests/transcript_sampler/inputs/test_gencode.vM31.annotation_intermediat_file.txt b/tests/transcript_sampler/files/test_gencode.vM31.annotation_intermediat_file.txt similarity index 100% rename from tests/transcript_sampler/inputs/test_gencode.vM31.annotation_intermediat_file.txt rename to tests/transcript_sampler/files/test_gencode.vM31.annotation_intermediat_file.txt diff --git a/tests/transcript_sampler/inputs/test_gene_exprL b/tests/transcript_sampler/files/test_gene_exprL similarity index 100% rename from tests/transcript_sampler/inputs/test_gene_exprL rename to tests/transcript_sampler/files/test_gene_exprL diff --git a/tests/transcript_sampler/inputs/test_gene_exprL_csv.csv b/tests/transcript_sampler/files/test_gene_exprL_csv.csv similarity index 100% rename from tests/transcript_sampler/inputs/test_gene_exprL_csv.csv rename to tests/transcript_sampler/files/test_gene_exprL_csv.csv diff --git a/tests/transcript_sampler/inputs/test_ref_output.tsv b/tests/transcript_sampler/files/test_ref_output.tsv similarity index 100% rename from tests/transcript_sampler/inputs/test_ref_output.tsv rename to tests/transcript_sampler/files/test_ref_output.tsv diff --git a/tests/transcript_sampler/inputs/.gitkeep b/tests/transcript_sampler/inputs/.gitkeep deleted file mode 100644 index e69de29..0000000