test: #1 transcript sampler #21

Open
ninsch3000 opened this issue Oct 27, 2023 · 0 comments

README description

Overview

This workflow picks a representative transcript per gene and samples transcript counts in proportion to their relative abundance levels. Counts are drawn by Poisson sampling.

This workflow takes as input:

  • Path to the genome annotation file in GTF format
  • Path to a CSV or TSV file with transcript IDs and expression levels
  • Path to the output sample GTF file
  • Path to the output file of sample transcript IDs and counts
  • Total number of transcripts to sample (integer)

The outputs are:

  • Transcript sample GTF file
  • CSV file containing sample transcript IDs and counts

Installation from GitHub

Transcript sampler requires Python 3.9 or later.

Install Transcript sampler by cloning the repository and installing it with pip:

git clone https://git.scicore.unibas.ch/zavolan_group/tools/transcript-sampler.git
cd transcript-sampler
pip install . 

Usage

usage: transcript-sampler [-h] --input_gtf INPUT_GTF --input_csv INPUT_CSV --output_gtf OUTPUT_GTF --output_csv OUTPUT_CSV --n_to_sample N_TO_SAMPLE

Transcript sampler

options:
  -h, --help            show this help message and exit
  --input_gtf INPUT_GTF
                        GTF file with genome annotation (default: None)
  --input_csv INPUT_CSV
                        CSV or TSV file with transcripts and their expression level (default: None)
  --output_gtf OUTPUT_GTF
                        Output path for the new GTF file of representative transcripts (default: None)
  --output_csv OUTPUT_CSV
                        Output path for the new CSV file of representative transcripts and their sampled number (default: None)
  --n_to_sample N_TO_SAMPLE
                        Total number of transcripts to sample (default: None)

Example:

transcript-sampler --input_gtf tests/transcript_sampler/files/test.gtf --input_csv tests/transcript_sampler/files/expression.csv --output_gtf sampled.gtf --output_csv sampled.csv --n_to_sample 100

Original issue description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/1
Generate the "RNA content" of a single cell by sampling transcripts in proportion to the relative expression levels of their corresponding genes (provided as input), up to a given total transcript count.

Inputs:

  1. Csv-formatted file with gene expression levels "GeneID,Count"
  2. Total number of transcripts to sample for a single cell

Output: Csv-formatted file ("GeneID,Count") with gene expression levels in a "cell"
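
The issue does not say how the draw should be made. As a minimal sketch, assuming a multinomial-style draw is acceptable, the following uses only the Python standard library to pick exactly `n_total` transcripts, each gene being chosen with probability proportional to its expression level. The function name `sample_cell` and the example gene IDs are hypothetical and are not part of transcript-sampler.

```python
import random
from collections import Counter


def sample_cell(expression, n_total, seed=None):
    """Draw n_total transcripts; each gene is picked with probability
    proportional to its expression level (hypothetical helper)."""
    rng = random.Random(seed)
    # Genes with zero expression can never be drawn, so drop them up front.
    genes = [gene for gene, level in expression.items() if level > 0]
    if not genes:
        return {}
    weights = [expression[gene] for gene in genes]
    draws = rng.choices(genes, weights=weights, k=n_total)
    return dict(Counter(draws))


# Made-up "GeneID,Count" values, already parsed into a dictionary.
expression = {"geneA": 10.0, "geneB": 30.0, "geneC": 0.0}
cell = sample_cell(expression, n_total=100, seed=1)
```

With this approach the sampled counts always add up to the requested total, which differs from the Poisson sampling named in the README, where the total only matches on average.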

Other issue description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation/-/issues/28

Sample transcript counts given average expression levels

Given a total number of transcripts, their relative abundance in a sample and the genome annotation, sample representative transcripts per gene, in proportion to their relative abundance levels.

Input:

  1. Csv-formatted file ("ID,Level") with expression levels per gene (or per transcript).
  2. Total number of transcripts to sample.
  3. gtf-formatted file with the intron/exon coordinates of the transcripts represented in the expression file.

Output:

  1. Gtf-formatted file of the sampled transcripts.
  2. Csv-formatted file ("ID,Count") with the transcript copies for each representative transcript.

First, we pick a representative transcript for each gene in the annotation file. This transcript has the highest level of experimental support (lowest transcript support level value). If there are multiple such transcripts for a gene, the one that covers the largest genomic region is chosen (based on the coordinates of the exons).
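
The selection rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the transcripts are assumed to be already parsed from the GTF file into dictionaries with the hypothetical keys `transcript_id`, `gene_id`, `tsl` and `exons`, and "largest genomic region" is interpreted as the span from the first exon start to the last exon end.

```python
def pick_representatives(transcripts):
    """Return {gene_id: transcript_id} using the lowest TSL value,
    breaking ties by the largest genomic span (hypothetical helper)."""
    best = {}
    for tx in transcripts:
        starts = [start for start, _ in tx["exons"]]
        ends = [end for _, end in tx["exons"]]
        span = max(ends) - min(starts) + 1
        # Smaller tuple wins: low TSL first, then large span (negated).
        key = (tx["tsl"], -span)
        current = best.get(tx["gene_id"])
        if current is None or key < current[0]:
            best[tx["gene_id"]] = (key, tx["transcript_id"])
    return {gene: tid for gene, (_, tid) in best.items()}


# Made-up records standing in for parsed GTF entries.
transcripts = [
    {"transcript_id": "tx1", "gene_id": "g1", "tsl": 1,
     "exons": [(100, 200), (300, 400)]},
    {"transcript_id": "tx2", "gene_id": "g1", "tsl": 1,
     "exons": [(100, 200), (300, 900)]},
    {"transcript_id": "tx3", "gene_id": "g1", "tsl": 3,
     "exons": [(50, 5000)]},
    {"transcript_id": "tx4", "gene_id": "g2", "tsl": 2,
     "exons": [(10, 90)]},
]
representatives = pick_representatives(transcripts)
```

Here `tx1` and `tx2` share the best support level for `g1`, so the longer span decides in favour of `tx2`; `tx3` covers more of the genome but is ignored because its support level is worse.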

Then, we sample transcript counts up to a specified total, in proportion to the gene expression levels given in input 1. The expression levels can be provided either per transcript ID or per gene ID. If transcript expression levels are given, these transcripts are not guaranteed to be the representative ones, but the expression should still be reported per representative transcript. If the expression level is provided per gene, it needs to be assigned to the representative transcript as well. So, a dictionary of representative transcript ID : gene ID has to be built first. Then the expression of all transcripts associated with a gene is summed per gene (if the expression values are not already provided per gene), and the resulting gene expression level is transferred to the representative transcript and written out.
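
A rough sketch of this bookkeeping, assuming the IDs have already been read into dictionaries: `rep_to_gene` is the "representative transcript ID : gene ID" dictionary mentioned above, while `tx_to_gene` (every transcript ID mapped to its gene) and the function name are hypothetical additions for illustration. Any identifier that is not a known transcript ID is treated as a gene ID, which covers the per-gene input case.

```python
from collections import defaultdict


def expression_per_representative(levels, tx_to_gene, rep_to_gene):
    """Sum expression per gene, then assign each gene total to that
    gene's representative transcript (hypothetical helper)."""
    gene_totals = defaultdict(float)
    for identifier, level in levels.items():
        # Unknown transcript IDs are assumed to be gene IDs already.
        gene = tx_to_gene.get(identifier, identifier)
        gene_totals[gene] += level
    return {rep: gene_totals.get(gene, 0.0)
            for rep, gene in rep_to_gene.items()}


tx_to_gene = {"tx1": "g1", "tx2": "g1", "tx4": "g2"}
rep_to_gene = {"tx2": "g1", "tx4": "g2"}

# Expression given per transcript: tx1 and tx2 are pooled into g1.
per_transcript = expression_per_representative(
    {"tx1": 5.0, "tx2": 2.5, "tx4": 1.0}, tx_to_gene, rep_to_gene)

# Expression given per gene: values are passed through unchanged.
per_gene = expression_per_representative(
    {"g1": 4.0, "g2": 6.0}, tx_to_gene, rep_to_gene)
```

Representative transcripts whose gene is missing from the expression file receive a level of 0.0 in this sketch; whether they should instead be dropped is a design choice the issue leaves open.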

Pipeline overview description

https://git.scicore.unibas.ch/zavolan_group/pipelines/scrna-seq-simulation
Pick the number of transcripts coming from each gene. As input #1 we get a file with the expression level of individual transcripts from some real sample. For simplicity, we first pick a representative transcript per gene, e.g. the one with the most annotation support (transcript support level 1, TSL=1). Then, given a total number of transcripts per cell (input #2), we generate, for each representative transcript, a Poisson sample given the average count from input #1.
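
The Poisson step could look roughly like the sketch below. It assumes that the "average count" of a transcript is its share of the total expression multiplied by the requested number of transcripts; the overview does not spell out this scaling. The Poisson generator is Knuth's multiplication method applied in chunks, written with the standard library only; the function names are hypothetical and not taken from transcript-sampler.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) value. Knuth's method is applied to chunks
    of at most 30, since a sum of independent Poisson draws is Poisson
    and small chunks keep exp(-chunk) away from underflow."""
    count = 0
    while lam > 0:
        chunk = min(lam, 30.0)
        lam -= chunk
        threshold = math.exp(-chunk)
        product = rng.random()
        while product > threshold:
            count += 1
            product *= rng.random()
    return count


def sample_counts(levels, n_total, seed=None):
    """Poisson-sample a count per representative transcript, with mean
    n_total * level / sum(levels) (assumed scaling)."""
    rng = random.Random(seed)
    total = sum(levels.values())
    if total <= 0:
        return {tid: 0 for tid in levels}
    return {tid: poisson(n_total * level / total, rng)
            for tid, level in levels.items()}


counts = sample_counts({"tx2": 7.5, "tx4": 2.5, "tx9": 0.0}, 1000, seed=0)
```

Because each count is drawn independently, the counts add up to the requested total only on average, not exactly.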
