README description
Synopsis
The human body contains a vast diversity of cell types, states, and interactions. We wish to understand these tissues and cell types at a much deeper level. Single-cell RNA-seq (scRNA-seq) reveals which genes are expressed in individual cells. This method makes it possible to identify cell types, find rare or previously unidentified cell types or states, identify genes that are differentially expressed between cell types, and explore changes in expression while taking spatial, regulatory, and protein interactions into account.
We hope that others will find this transcript_structure generator useful. It takes gtf files of specific gene transcripts as input and outputs a gtf file containing the intron/exon structures generated for each input transcript. In addition, one can specify a probability of intron inclusion, which is used to simulate incorrect splicing.
Installation
To install the package, run
pip install "setuptools>=62.1.0"
pip install .
Usage
Input:
csv-formatted file ("ID,Count") with counts for individual transcripts
probability of intron inclusion (float in range [0,1])
gtf-formatted file with exon coordinates of the transcripts included in the csv file
Output:
gtf-formatted file containing generated intron/exon structures per transcript
csv-formatted file ("NewTranscriptID,ID,Count") with
id of generated transcript
id of original transcript (without intron inclusions)
count
To generate the sampled transcripts, run the tool with a transcripts file, an annotation file, and an intron inclusion probability. The transcripts file should be csv-formatted, the annotation file gtf-formatted, and the inclusion probability for introns a float in the range [0,1]. The log parameter is optional and can be one of ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]. The default is INFO.
Sample transcripts and annotation files can be found in the repository under main/tests/resources.
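As an illustration of the "ID,Count" input described above, the following is a minimal sketch of how such a file could be read using only the Python standard library. The function name read_transcript_counts is hypothetical and not part of the package, and the handling of an optional header row is an assumption, since the source does not state whether the file carries one.

```python
import csv
import io


def read_transcript_counts(handle):
    """Parse "ID,Count" rows into a dict mapping transcript ID to count.

    Hypothetical helper for illustration; not the package's own reader.
    """
    counts = {}
    for line_number, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 fields, got {len(row)}"
            )
        transcript_id, raw_count = row[0].strip(), row[1].strip()
        if line_number == 1 and not raw_count.isdigit():
            continue  # assumption: a first row like "ID,Count" is a header
        if not raw_count.isdigit():
            raise ValueError(
                f"line {line_number}: count must be a non-negative integer, "
                f"got {raw_count!r}"
            )
        # Sum counts if the same ID appears more than once (an assumption).
        counts[transcript_id] = counts.get(transcript_id, 0) + int(raw_count)
    return counts


example = io.StringIO("ID,Count\nENST0001,5\nENST0002,12\n")
print(read_transcript_counts(example))
```

The transcript IDs in the example are made up; real input would use the IDs found in the annotation file.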
Given a number of transcripts to be sampled from each of a set of genes, generate their intron-exon structures and counts, allowing for some of the introns to be included in the transcripts.
Input:
Csv-formatted file ("ID,Count") with counts for individual transcripts.
Probability of an intron inclusion.
gtf-formatted file with the intron/exon coordinates of the transcripts represented in the count file.
Output:
Gtf-formatted file containing the unique intron/exon structures that have been generated.
Csv-formatted file ("NewTranscriptID,ID,Count") with the ID of the parent transcript (that did not have any intron inclusions) and the copy number of each unique transcript structure.
The structure of each transcript should be generated individually, using the same exons as there are in the input transcript, but allowing for the possibility of intron inclusion. This is done by walking along the introns implied by the intron/exon structure of the transcript and deciding whether to include them, with the specified probability for each intron. If an intron is selected as included, a new exon will be created, covering the selected intron and the exons preceding and succeeding it. The exon/intron structures of all transcripts will be written to a new gtf file, and the number of times each unique transcript form was generated is written to a new csv-formatted file.
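The walk along the introns described above can be sketched as follows. This is a minimal illustration and not the tool's actual implementation: the function name generate_structure, the representation of exons as (start, end) tuples, and the rng parameter are all assumptions made for the example.

```python
import random


def generate_structure(exons, prob_inclusion, rng=random):
    """Return the exon list after randomly including introns.

    exons: list of (start, end) tuples sorted by start, non-overlapping.
    Each intron between two consecutive exons is included independently
    with probability prob_inclusion. An included intron merges the
    preceding exon, the intron, and the succeeding exon into one exon.
    """
    if not exons:
        return []
    merged = [exons[0]]
    for start, end in exons[1:]:
        if rng.random() < prob_inclusion:
            # Intron retained: extend the previous exon across the intron
            # up to the end of the current exon.
            merged[-1] = (merged[-1][0], end)
        else:
            # Intron spliced out: the current exon stays separate.
            merged.append((start, end))
    return merged


exons = [(100, 200), (300, 400), (500, 600)]
print(generate_structure(exons, 0.0))  # no intron is ever included
print(generate_structure(exons, 1.0))  # every intron is included
```

Because consecutive included introns keep extending the same exon, a run of retained introns yields one long exon, which matches the "exon, intron, exon" merging described in the text.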
Pipeline overview description
https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation
Generate the exon/intron structure for each transcript. For each gene there is a reference set of exons (specified in input #3), but to account for the possibility that the transcript is not completely processed, introns are included in individual transcripts. This is done by going through all the possible introns of a transcript and choosing which ones to include (according to input #9). Then the generated structures are written to a gtf file (because new exons are effectively generated which look like exon_n;intron_n;exon_n+1) and the number of transcripts with each unique structure is also saved.
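To make the gtf output step more concrete, here is a hedged sketch of how the exons of one generated transcript could be formatted as gtf lines. It relies only on the general gtf layout of nine tab-separated columns (seqname, source, feature, start, end, score, strand, frame, attributes). The function name exon_gtf_lines, the attribute set, and the naming scheme for exon IDs are assumptions for illustration, and exons are numbered in genomic order for simplicity.

```python
def exon_gtf_lines(seqname, source, strand, gene_id, transcript_id, exons):
    """Format (start, end) exon tuples as gtf 'exon' lines.

    Hypothetical helper: the exon_id scheme "<transcript_id>_exon<n>" is
    an illustrative choice for interpretable names, not the tool's scheme.
    """
    lines = []
    for number, (start, end) in enumerate(exons, start=1):
        attributes = (
            f'gene_id "{gene_id}"; transcript_id "{transcript_id}"; '
            f'exon_number "{number}"; exon_id "{transcript_id}_exon{number}";'
        )
        fields = [
            seqname, source, "exon", str(start), str(end),
            ".",      # score: not used here
            strand,
            ".",      # frame: not applicable to exon features
            attributes,
        ]
        lines.append("\t".join(fields))
    return lines


# Exons (100, 200) and (300, 400) merged by inclusion of the first intron.
for line in exon_gtf_lines("chr1", "sketch", "+", "GENE1", "TX1_1",
                           [(100, 400), (500, 600)]):
    print(line)
```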
For each representative transcript, iterate through the transcript and identify its exons and introns. At each intron, generate a random number between 0 and 1; keep the intron if the random number is below the user-defined threshold, otherwise discard it. Save the resulting transcript structure.
Input:
Csv-formatted file ("ID,Count") with counts for individual transcripts
Probability of intron inclusion
gtf-formatted file with exon coordinates of the transcript included in the csv file
Question: how to find out possible introns per transcript from gtf file?
Output:
gtf-formatted file containing generated intron/exon structures per transcript
csv-formatted file ("NewTranscriptID,ID,Count") with
id of generated transcript
id of original transcript (without intron inclusions)
count
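On the question raised in the input list of how to find the possible introns per transcript: one option is to group the exon lines of the gtf file by transcript_id and treat the gaps between consecutive exons as introns. The sketch below assumes 1-based inclusive gtf coordinates and a transcript_id attribute on every exon line; the function names are hypothetical.

```python
import re
from collections import defaultdict


def exons_by_transcript(gtf_lines):
    """Collect (start, end) exon coordinates per transcript_id."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # only exon features are of interest here
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))
    return exons


def introns_from_exons(exons):
    """Return the gaps between consecutive exons as (start, end) introns."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start - prev_end > 1:  # directly adjacent exons leave no gap
            introns.append((prev_end + 1, next_start - 1))
    return introns


print(introns_from_exons([(100, 200), (300, 400), (500, 600)]))
```

A single-exon transcript yields an empty intron list, so it passes through intron inclusion unchanged.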
Read the representative transcript counts and generate sampled transcript counts by treating the input counts as the definition of a discrete distribution and drawing random variables from it (e.g. with scipy.stats.rv_discrete). Save the number of transcripts generated with each unique structure in a new csv-formatted file.
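The sampling step can be illustrated with a small sketch. Note that it swaps in random.choices from the Python standard library in place of scipy.stats.rv_discrete, purely to keep the example free of dependencies; the idea of using the input counts as weights of a discrete distribution is the same. The function name and the n_samples parameter are assumptions.

```python
import random
from collections import Counter


def sample_transcript_counts(counts, n_samples, rng=random):
    """Draw n_samples transcript IDs with probability proportional to counts.

    counts: dict mapping transcript ID to its input count (the weights).
    Returns a Counter with the number of times each ID was drawn.
    """
    ids = list(counts)
    weights = [counts[transcript_id] for transcript_id in ids]
    drawn = rng.choices(ids, weights=weights, k=n_samples)
    return Counter(drawn)


# Made-up IDs and counts; a seeded generator makes the draw reproducible.
sampled = sample_transcript_counts({"TX1": 90, "TX2": 10}, 1000,
                                   random.Random(0))
print(sampled)
```

Passing a seeded random.Random instance keeps runs reproducible, which is convenient for tests.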
The main challenge here is to properly compute and write out the exons that result from intron inclusion. It will be helpful for the transcript and exon names to be interpretable rather than random numbers.
Please also include input error-checking and appropriate tests. Think of limit cases such as a probability of intron inclusion of 0 or 1, single-exon genes, etc.
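As one possible shape for the requested input error-checking, the following sketch validates the intron inclusion probability and exercises the limit values 0 and 1. The function name check_inclusion_probability is hypothetical, and raising ValueError is an illustrative choice rather than the package's documented behaviour.

```python
def check_inclusion_probability(value):
    """Return value as a float in [0, 1] or raise ValueError."""
    try:
        prob = float(value)
    except (TypeError, ValueError):
        raise ValueError(
            f"intron inclusion probability must be a number, got {value!r}"
        ) from None
    # The chained comparison is False for NaN, so NaN is rejected as well.
    if not 0.0 <= prob <= 1.0:
        raise ValueError(
            f"intron inclusion probability must be in [0, 1], got {prob}"
        )
    return prob


# Limit cases: both boundaries are valid inputs.
print(check_inclusion_probability("0"))
print(check_inclusion_probability(1))
```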
Original issue description: Generate transcript structures
https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/2
Project design plan
https://git.scicore.unibas.ch/zavolan_group/tools/transcript-structure-generator/-/issues/1