Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trimming R1 fast files in all aligners #333

Open
ChristopherMancuso opened this issue May 24, 2024 · 4 comments
Open

Trimming R1 fast files in all aligners #333

ChristopherMancuso opened this issue May 24, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@ChristopherMancuso
Copy link

Description of feature

I'm following up on a slack post that I put out 2 months ago at https://nfcore.slack.com/archives/CHN5BV5DW/p1712178056321859

I had a question about the fastq format needed for the different aligners. For everything I’m using nextflow v23.10.1 and scrnaseq v2.5.1. From the core at my work place both the R1 and R2 fastq files each have a length of 151 for the reads, instead of R1 being “trimmed” to just be only the barcode and umi (so like 28-ish bps depending on the protocol). When using --aligner cellranger this seems to be handled fine. However, when only switching --aligner to either alevin or star it doesn’t seem to handle that R1 read format well. For alevin the pipeline completes but the number of barcodes in barcodes.tsv is ~200k, which is roughly the number of reads, whereas the expected number of cells is ~5k. For star the pipeline fails at NFCORE_SCRNASEQ:SCRNASEQ:STARSOLO:STAR_ALIGN with the error EXITING because of FATAL ERROR in input read file: the total length of barcode sequence is 151 not equal to expected 28. My questions are, is this known behavior of the pipeline? I would like to use alevin or star in the future, do I need preprocess R1 and if so, any help in doing that? Thanks!

the run command I use looks like this, just only changing the aligner argument

nextflow run nf-core/scrnaseq --input samplesheet.csv --fasta /biostats_share/mancchri/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa --gtf /biostats_share/mancchri/genomes/Homo_sapiens.GRCh38.111.gtf --aligner cellranger --protocol 10XV3 --outdir . -work-dir ./work -c config.conf -profile singularity -r 2.5.1
@ChristopherMancuso ChristopherMancuso added the enhancement New feature or request label May 24, 2024
@grst
Copy link
Member

grst commented May 27, 2024

I would have totally expected both alevin and star to only use the number of nucleotides from R1 that is specified in the chemistry definition. In this case you run with --protocol 10XV3, which should lead to the following arguments passed to the tools:

STARsolo

"10XV3": {
"protocol": "CB_UMI_Simple",
"extra_args": "--soloUMIlen 12",
"whitelist": "assets/whitelist/10x_V3_barcode_whitelist.txt.gz"

simpleaf

"10XV3": {
"protocol": "10xv3",
"whitelist": "assets/whitelist/10x_V3_barcode_whitelist.txt.gz"

@rob-p, is there additional trimming needed before running simpleaf?

@ChristopherMancuso
Copy link
Author

@grst Thanks for taking the time to comment! I'm a nextflow novice but I tried to follow how adding --protocol 10XV3 to the main run command changed the aligner options in the assets folder too and it looked like it should work, but didn't for the reasons above. I did enough googling and I think if I downloaded alevin or star as standalone software I could get it to work, but to be honest I like nextflow so much I'll just take the speed hit and use nextflow with cell ranger if this problem persists.

One extra thing is that I did try this pipeline on the test data for every aligner and it works just fine. I manually checked the fastq files used in the test run and R1 there was already "pre-trimmed" to just the umis and barcodes.

@grst
Copy link
Member

grst commented Jun 4, 2024

This should totally be supported. For me the question is if we can make alevin/starsolo directly work with the untrimmed reads (and what would be the appropriate command line options for that) or if we'd require an additional step that hard-trims the reads to the required length.

@ChristopherMancuso
Copy link
Author

I will try to look into this and post what I find out here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants