This workflow samples representative transcripts per gene, in proportion to their relative abundance levels. Sampling is done by Poisson sampling.
This workflow takes as input:
Path to the genome annotation file in GTF format
Path to a CSV or TSV file with transcript IDs and expression levels
Path to the output sample GTF file
Path to the output CSV file of sampled transcript IDs and counts
Number of transcripts to sample (integer)
The outputs are:
Transcript sample GTF file
CSV file containing sample transcript IDs and counts
Installation from GitHub
Transcript sampler requires Python 3.9 or later.
Install Transcript sampler from GitHub using:
git clone https://git.scicore.unibas.ch/zavolan_group/tools/transcript-sampler.git
cd transcript-sampler
pip install .
Usage
usage: transcript-sampler [-h] --input_gtf INPUT_GTF --input_csv INPUT_CSV --output_gtf OUTPUT_GTF --output_csv OUTPUT_CSV --n_to_sample N_TO_SAMPLE

Transcript sampler

options:
  -h, --help            show this help message and exit
  --input_gtf INPUT_GTF
                        GTF file with genome annotation (default: None)
  --input_csv INPUT_CSV
                        CSV or TSV file with transcripts and their expression level (default: None)
  --output_gtf OUTPUT_GTF
                        Output path for the new GTF file of representative transcripts (default: None)
  --output_csv OUTPUT_CSV
                        Output path for the new CSV file of representative transcripts and their sampled number (default: None)
  --n_to_sample N_TO_SAMPLE
                        Total number of transcripts to sample (default: None)
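The help text above looks like standard argparse output. As a rough illustration only, a parser that accepts the same options could be sketched as follows. The function name build_parser and the file names passed to it are placeholders, and the integer type on --n_to_sample is an assumption based on the input description; none of this is taken from the tool's source.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="transcript-sampler",
        description="Transcript sampler",
        # Appends "(default: ...)" to each help string, as seen in the help text.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("--input_gtf", required=True,
                        help="GTF file with genome annotation")
    parser.add_argument("--input_csv", required=True,
                        help="CSV or TSV file with transcripts and their expression level")
    parser.add_argument("--output_gtf", required=True,
                        help="Output path for the new GTF file of representative transcripts")
    parser.add_argument("--output_csv", required=True,
                        help="Output path for the new CSV file of representative "
                             "transcripts and their sampled number")
    # Assumption: the total is parsed as an integer.
    parser.add_argument("--n_to_sample", required=True, type=int,
                        help="Total number of transcripts to sample")
    return parser


# Placeholder file names, for illustration only.
args = build_parser().parse_args([
    "--input_gtf", "annotation.gtf",
    "--input_csv", "expression.csv",
    "--output_gtf", "sample.gtf",
    "--output_csv", "sample.csv",
    "--n_to_sample", "100",
])
print(args.input_gtf, args.n_to_sample)
```

All five options are required, so leaving one out makes argparse exit with a usage error.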
Sample transcript counts given average expression levels
Given a total number of transcripts, their relative abundance in a sample and the genome annotation, sample representative transcripts per gene, in proportion to their relative abundance levels.
Input:
CSV-formatted file ("ID,Level") with expression levels per gene (or per transcript).
Total number of transcripts to sample.
GTF-formatted file with the intron/exon coordinates of the transcripts represented in the expression file.
Output:
GTF-formatted file of the sampled transcripts.
CSV-formatted file ("ID,Count") with the number of transcript copies for each representative transcript.
First, we pick a representative transcript for each gene in the annotation file. This transcript has the highest level of experimental support (lowest transcript support level value). If there are multiple such transcripts for a gene, the one that covers the largest genomic region is chosen (based on the coordinates of the exons).
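The selection rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tool's implementation: the exon records are assumed to be already parsed from the GTF file into hypothetical (gene ID, transcript ID, support level, start, end) tuples, "largest genomic region" is read as the span from the first exon start to the last exon end, and transcripts without a support level are not handled.

```python
# Hypothetical, already-parsed exon records:
# (gene_id, transcript_id, transcript_support_level, exon_start, exon_end)
exons = [
    ("geneA", "tx1", 1, 100, 200),
    ("geneA", "tx1", 1, 300, 400),
    ("geneA", "tx2", 1, 100, 250),
    ("geneA", "tx2", 1, 300, 900),
    ("geneA", "tx3", 3, 50, 2000),
    ("geneB", "tx4", 2, 10, 60),
]


def pick_representatives(exon_records):
    """Return {gene_id: representative transcript_id}."""
    # Collapse exons into one genomic span per transcript.
    spans = {}  # transcript_id -> [gene_id, tsl, min_start, max_end]
    for gene_id, tx_id, tsl, start, end in exon_records:
        if tx_id not in spans:
            spans[tx_id] = [gene_id, tsl, start, end]
        else:
            spans[tx_id][2] = min(spans[tx_id][2], start)
            spans[tx_id][3] = max(spans[tx_id][3], end)

    # Lowest support level wins; ties go to the longest span.
    best = {}  # gene_id -> (sort_key, transcript_id)
    for tx_id, (gene_id, tsl, start, end) in spans.items():
        key = (tsl, -(end - start + 1))
        if gene_id not in best or key < best[gene_id][0]:
            best[gene_id] = (key, tx_id)
    return {gene_id: tx_id for gene_id, (_, tx_id) in best.items()}


representatives = pick_representatives(exons)
print(representatives)
```

In this toy data tx1 and tx2 share the best support level for geneA, and tx2 is chosen because its exons span the larger region; tx3 spans more but has weaker support.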
Then, we sample transcript counts up to a specified total, in proportion to the expression levels given in input 1 (the expression file). The expression levels can be provided either per transcript ID or per gene ID. If they are given per transcript, the listed transcripts are not guaranteed to be the representative ones, so the expression has to be reassigned to the representative transcript of each gene. If they are given per gene, the level likewise needs to be assigned to the representative transcript. A dictionary of representative transcript ID : gene ID therefore has to be built first. Then the expression of all transcripts associated with a gene is summed per gene (if the values are not already provided per gene), and the resulting gene expression level is transferred to the representative transcript and written out.
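The reassignment of expression levels to representative transcripts can be illustrated with plain dictionaries. The sketch below is an assumption about one way to do it: the mapping names and toy values are hypothetical, the representative mapping is stored as gene ID to transcript ID for convenient lookup, and any ID that is not a known transcript is treated as a gene ID.

```python
from collections import defaultdict

# Hypothetical lookups derived from the annotation.
transcript_to_gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneA", "tx4": "geneB"}
representatives = {"geneA": "tx2", "geneB": "tx4"}  # gene ID -> representative transcript ID

# Hypothetical expression input, here given per transcript ("ID,Level").
expression = {"tx1": 5.0, "tx3": 2.5, "tx4": 1.0}


def expression_per_representative(expression, transcript_to_gene, representatives):
    """Sum expression per gene and attach it to the gene's representative transcript."""
    per_gene = defaultdict(float)
    for identifier, level in expression.items():
        # Transcript IDs are mapped to their gene; other IDs are taken as gene IDs.
        gene_id = transcript_to_gene.get(identifier, identifier)
        per_gene[gene_id] += level
    # Genes without a representative transcript are skipped.
    return {
        representatives[gene_id]: level
        for gene_id, level in per_gene.items()
        if gene_id in representatives
    }


levels = expression_per_representative(expression, transcript_to_gene, representatives)
print(levels)
```

Here tx2 carries no expression of its own but receives the summed level of tx1 and tx3 because it represents geneA. Passing {"geneA": 7.5, "geneB": 1.0} as per-gene input gives the same result.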
Pipeline overview description
https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation
Pick the number of transcripts coming from each gene. As input #1 we get a file with the expression levels of individual transcripts from some real sample. For simplicity, we first pick a representative transcript per gene, e.g. the one with the most annotation support (transcript support level 1, TSL=1). Then, given a total number of transcripts per cell (input #1), we generate, for each representative transcript, a Poisson sample given the average count from input #1.
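The Poisson step can be sketched with the standard library alone. Two assumptions are made here and should be checked against the actual tool: the mean for each transcript is taken as its share of the total expression multiplied by the requested number of transcripts, and the draw uses Knuth's multiplication method, which is only practical for means up to a few hundred. For larger means a library sampler such as NumPy's Generator.poisson would be the more usual choice. All function names are illustrative.

```python
import math
import random


def poisson_draw(mean, rng):
    """Draw one Poisson variate with Knuth's multiplication method (modest means only)."""
    if mean <= 0:
        return 0
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    # Multiply uniforms until the product falls below exp(-mean).
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_counts(levels, n_to_sample, seed=None):
    """Return {transcript_id: sampled count}, with means proportional to expression."""
    rng = random.Random(seed)
    total = sum(levels.values())
    if total <= 0:
        raise ValueError("expression levels must sum to a positive value")
    return {
        tx_id: poisson_draw(n_to_sample * level / total, rng)
        for tx_id, level in levels.items()
    }


# Toy levels for three hypothetical representative transcripts.
counts = sample_counts({"tx2": 7.5, "tx4": 1.0, "tx9": 0.0}, n_to_sample=100, seed=1)
print(counts)
```

Because each count is an independent Poisson draw, the counts add up to roughly the requested total rather than exactly to it.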
Original issue description
https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/1
Generate the "RNA content" of a single cell by sampling transcripts in proportion to the relative expression levels of their corresponding genes (provided as input), up to a given total transcript count.
Inputs:
Output: CSV-formatted file ("GeneID,Count") with gene expression levels in a "cell"
Other issue description
https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/28