-
Notifications
You must be signed in to change notification settings - Fork 131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sc/snATAC-seq barcode correction #754
Comments
I’m not familiar with how sn/scATAC-seq data looks and admit I don’t have the time to find this out myself right now. I’ll comment a bit based on what I learned recently in a project in which I had to trim some scRNAseq data and if this doesn’t apply here, please get back to me with an explanation of how the sn/scATAC-seq reads are structured. For my project, R1 contained multiple index sequences (or barcodes) and a UMI. Let’s assume the first index is right at the 5' end. Given a list of index sequences in
The code is public. The If you don’t want the name of the adapter/index/barcode in the header, you can use
I would be open to extending the Cutadapt cannot read I1/I2 files at the moment, see issue #491. In my case, the I1 sequence was already in the read header (of R1 and R2), so I didn’t need to read the I1 file at all. |
Hi Marcel, Thanks much for the quick reply. Indeed what you describe makes perfect sense for snRNA-seq. For sc/snATAC-seq data the sequenced barcode is exclusively in the I1 data file as far as I can tell. If accessing that isn't readily supported it would understandably take some finagling. On some time scale it could be worth checking into, it'd be a nice enhancement. Thanks! Sara |
I would like to understand what exactly you would need Cutadapt to do. Could you provide a (toy) example of an input read triple (R1, R2, I1) plus the expected output? |
Hi Marcel, That's very kind. Indeed I can do so readily to illustrate. I should first say I slightly misspoke/mistyped. Technically the three files are labeled R1, R2 and R3, but the R1 and R3 become the paired input/output to Cutadapt. The middle file (I1 as I was thinking of it, technically R2 in the naming convention) contains the 16 bp barcode string that represents the cell ID. What is called for is to take the reverse complement of the last 16 nt from the R2 string, and compare that to the whitelisted set of barcodes, and apply a barcode correction to a Hamming distance of 1. This form of correction is standard of course, and would putatively be implemented amidst the demultiplexing code. Here are two example raw triplicate reads from the raw data:
The anticipated output format is two fastq files with the uncorrected and corrected barcode as read name tags, with appropriate sequence trimming (these examples weren't actually changed):
Hopefully this clarifies what I mean. Here the CR tag is the extracted reverse-complement of the last 16 nt from the R2 file, and the CB tag is the whitelisted barcode sequence it is mapped to. Naturally some nominal statistics about the correction and how many reads are lost and how many barcodes are used or not would also be helpful for QC and documentation. This is the jist of what I understand sc/snATAC-seq would require, at least for our 10x multiome data. The precise length, location and orientation of the barcode is what would vary in differing platforms/chemistries. Best, Sara |
Hi Marcel, A quick follow up, I hope the example .fastq lines above were helpful. I understand if it's not 100% doable off hand, but I'll nevertheless be intrigued if you happen to think it's possible at some point. Thank you again for obliging my inquiry! Best, Sara |
Hi Cutadapt team,
This is not a standard "issue" but an inquiry regarding sn/scATAC-seq data preprocessing (10x Genomics multiome). I'm using Cutadapt to trim snATAC-seq data reads before alignment with bwa-mem2, and it certainly works admirably, and the parallelization is nicely implemented. My concern relates to barcode assessment and correction.
In the workflow I've been testing barcode correction is done in a separate preprocessing step. The three .fastq files (R1 R2 and I1) are read into an executable and 1) the sequenced barcode is extracted from the correct position in the data for each read, 2) compared to a whitelist of anticipated barcodes (from 10x), and 3) the data are written out with the corrected/mapped barcodes as tags in the read names. Those output fastq files are passed to Cutadapt and ultimately the corrected barcode tags on the read names are used in the downstream analysis after alignment.
Such a barcode correction step appears distinct from the demultiplexing options described in the online documentation, and I am inferring this is not currently an anticipated usage for Cutadapt. Is that accurate? Nevertheless, it may be surprisingly easy to implement and a nice extension to be able to fully preprocess sn/scATAC-seq data in one step. Is this a functionality you are already implementing, or open to including? On what time scale? Thanks in advance!
Best regards,
Sara Knaack
The text was updated successfully, but these errors were encountered: