Documentation update: Running the rnaseq pipeline on prokaryotic samples #1084

MatthiasZepper · 2023-10-05T11:11:35Z

Description of feature

The documentation of the rnaseq pipeline that refers to running the pipeline on prokaryotic samples is unfortunately completely outdated. It specifies the required settings for featureCounts, even though that tool has been superseded by salmon for transcript quantification about 10 pipeline releases ago!

On Slack, Marine Cambon has already kindly written up what is needed to run the more recent versions of the pipeline successfully with prokaryotic RNA-seq. However, somebody needs to update the pipeline documentation accordingly. Since this requires only Markdown edits, I think it is a suitable task for the Hackathon?

For some general recommendations on how to write good technical documentation, see the website of the Diátaxis framework.

The text was updated successfully, but these errors were encountered:

d4straub · 2025-01-24T10:01:28Z

As of now (3.18.0) the documentation didnt change much.
Also, I couldn't run the pipeline following these suggestions recently. I was using data for E. coli strain BW25113 (4522 genes, 4419 CDS, 121 exon) of the NCBI RefSeq annotation from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000750555.1/.

The following seems important:

only "exon" are measured, therefore all features of interest need to have an "exon"
there may not be any special signs or maybe even not space in column number 2 (and often is)
gene_id are neccessary for each "exon" entry

The simplest solution I found for now was using the annotation in gff format and convert & simplify with gffread. Additionally gene_id have to be added then. This can be achieved as below.

gffread genomic.gff --keep-exon-attrs -F -T --force-exons | sed '/gene_id/!s/transcript_id\([^;]*\)/&; \0/;/transcript_id.*transcript_id/s/transcript_id/gene_id/' > genomic_gffread_force-exons_geneid.gtf

A similar discussion of the problem can be found in slack https://nfcore.slack.com/archives/CE8SSJV3N/p1679577061847429?thread_ts=1677835193.447669&cid=CE8SSJV3N

This solution should be applicable broadly, but I am not sure, but if that is indeed helpful then it might make sense to add that info to the docs.

pmoris · 2025-02-11T09:57:51Z

I just stumbled upon this issue while searching through Slack (see my own thread here).

I'd like to add the following info here:

My current understanding is that in cases where the automatic transcript file generation is lacking, this can be done manually, but during the STAR index creation the feature type needs to be explicitly set to the relevant one (--sjdbGTFfeatureExon =exon by default), in order for the trascriptome-aligned bam files to match any manually created transcript fasta (e.g. when your GTF lacks exons and you rely on CDS entries instead).
While this slack write-up by Marine Cambon indicates that rsem-prepare-reference uses the exon feature entries of the GTF file, but the authors of rsem mention that both (?) transcript and exon lines are used here: https://groups.google.com/forum/#!topic/rsem-users/GoWPrFzMHMI. I can't figure out which ones matter,
This appears to not just be a problem for prokaryotes, but also for eukaryotes where the GTF file consists of mainly gene and CDS features, the latter being the only mandatory feature (besides start_codon and stop_codon) according to https://agat.readthedocs.io/en/latest/gxf.html#gtf2-2. E.g., I encountered this for an (eukaryotic) GenBank annotation of P. vivax. In this case, it appears that gene is used as the top-level feature, with transcript only being used for a few t/rRNAs. ~~So that would mean I'd need to change the sjdbGTFtagExonParentTranscript flag to gene instead of the default transcript, instead of changing sjdbGTFfeatureExon.~~ As was pointed out by @tdanhorn on Slack: it's not the transcript lines (feature_type) in the GTF that are used, but the transcript_id attribute in the exon lines.

MatthiasZepper added first-timers-only Good for newcomers enhancement labels Oct 5, 2023

MatthiasZepper added this to Hackathon: November 2023 Oct 5, 2023

MatthiasZepper mentioned this issue Oct 5, 2023

Facilitate running the rnaseq pipeline on prokaryotic samples #1085

Open

JGawra assigned JGawra and unassigned JGawra Oct 16, 2023

MatthiasZepper removed this from Hackathon: November 2023 May 16, 2024

MatthiasZepper added this to Hackathon: May 2024 May 16, 2024

MatthiasZepper moved this to To do in Hackathon: May 2024 May 16, 2024

MatthiasZepper added this to Hackathon October 2024 Oct 21, 2024

MatthiasZepper moved this to Todo in Hackathon October 2024 Oct 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documentation update: Running the rnaseq pipeline on prokaryotic samples #1084

Documentation update: Running the rnaseq pipeline on prokaryotic samples #1084

MatthiasZepper commented Oct 5, 2023

d4straub commented Jan 24, 2025 •

edited

Loading

pmoris commented Feb 11, 2025 •

edited

Loading

Documentation update: Running the rnaseq pipeline on prokaryotic samples #1084

Documentation update: Running the rnaseq pipeline on prokaryotic samples #1084

Comments

MatthiasZepper commented Oct 5, 2023

Description of feature

d4straub commented Jan 24, 2025 • edited Loading

pmoris commented Feb 11, 2025 • edited Loading

d4straub commented Jan 24, 2025 •

edited

Loading

pmoris commented Feb 11, 2025 •

edited

Loading