Interpretation of sequencing data obtained from NovoGene #30
I tried two different tools for demultiplexing the data. Both tools work, but both also discard a lot of reads, and I am not sure yet what causes this.
I need to look more closely at the data and the results to be sure what the cause is and how to solve it.
Hi Greg, I had a look at the files from the Sabre demultiplexing and I did not really understand why the barcodes file (wt1_enzo_barcodes.txt) has only 2 of the 4 barcodes.
I followed the tutorial on the Sabre github page, and from what I understood you can only enter one barcode per sample. That is probably why it didn't find many of the pairs. Maybe I should run it a second time, now with the other (unused) barcodes, on the file with all the unmatched reads.
I am writing a small Python code that loops over the sequencing reads and checks what the barcode is in each sample. I notice that in many read pairs one of the barcodes is missing or the barcodes appear to be mixed.
I guess this is why the other tools discard many of the reads. The question is what to do with those reads. This is a snippet of the output of the code for the first 10 reads, where first the headers of the reads from the two files are shown (those should be almost identical) and then the two barcodes that are found (if any) in the sequences are shown:
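The actual code and its output are not included above, but a minimal sketch of such a barcode-checking loop could look like the following; the file names and barcode sequences are placeholders, not the actual values used:

```python
# Minimal sketch of a barcode-checking loop over paired fastq files.
# File names and barcode sequences below are placeholders.
from itertools import islice

BARCODES = {                      # hypothetical sample barcodes
    "sample1_fwd": "GCCACATA",
    "sample1_rev": "GAGAGTAC",
    "sample2_fwd": "GCGAGTAA",
    "sample2_rev": "GTCACAGA",
}

def fastq_records(path):
    """Yield (header, sequence) for every read in a fastq file."""
    with open(path) as handle:
        while True:
            block = list(islice(handle, 4))   # header, sequence, '+', quality
            if len(block) < 4:
                return
            yield block[0].rstrip(), block[1].rstrip()

def barcodes_in(sequence):
    """Names of all known barcodes that occur somewhere in the read."""
    return [name for name, bc in BARCODES.items() if bc in sequence]

pairs = zip(fastq_records("wt_forward.fastq"), fastq_records("wt_reverse.fastq"))
for (hdr_f, seq_f), (hdr_r, seq_r) in islice(pairs, 10):   # first 10 read pairs
    print(hdr_f)
    print(hdr_r)                                           # should nearly match hdr_f
    print("forward:", barcodes_in(seq_f), "reverse:", barcodes_in(seq_r))
```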
Hi Greg, Great that you could make a working Python code for this. Did you process the entire file? If so, could you please tell me where the results are stored (or copy the results to the N drive)? Then I can have a look at it. Would it be possible to separate out the cases where one of the barcodes is missing from those that seem to have mixed barcodes?
I have not yet processed the entire file, only the first few hundred reads or so. I am trying to implement the demultiplexing myself, because it can't be that hard to do. It is just deciding what to do with the cases where the barcodes are dubious.
Here is an update on the progress of demultiplexing the dataset.
However, the problem I now have is that the files are in the wrong order. Normally the read files used for alignment for the same sample are ordered the same way (i.e. read_1 in the forward file pairs with read_1 in the reverse file, read_2 in the forward file with read_2 in the reverse file, etc.). This is not the case in the demultiplexed files, so I am afraid the alignment software is going to pair the wrong reads with each other, making the alignment inaccurate. What I now want to try is not to create four demultiplexed read files (what I have now), but two files in which the forward and reverse reads are interleaved, so you end up with one interleaved file per sample.
This would then also have the wrong order initially, but that is potentially easier to solve.
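A minimal sketch of how such an interleaved file could be produced from the original (still synchronised) forward and reverse files; the file names are placeholders:

```python
# Sketch: write forward and reverse reads alternately into one fastq file so
# that each read is immediately followed by its mate. File names are placeholders.
from itertools import islice

def fastq_blocks(path):
    """Yield complete 4-line fastq records as single strings."""
    with open(path) as handle:
        while True:
            block = list(islice(handle, 4))
            if len(block) < 4:
                return
            yield "".join(block)

with open("wt_interleaved.fastq", "w") as out:
    for fwd, rev in zip(fastq_blocks("wt_forward.fastq"),
                        fastq_blocks("wt_reverse.fastq")):
        out.write(fwd)   # forward read
        out.write(rev)   # its mate directly after it
```

Because the mates stay adjacent, any later per-read filtering that keeps or drops both records of a pair together cannot scramble the pairing.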
Hi Greg, do these files that you have now still have the headers connected to the reads? In that case it is fine and you do not need to worry about the order of the files.
I now have two demultiplexed files that each include both the forward and reverse reads from a single sample. I want to check if the demultiplexing is good, but I have not had the opportunity to check the data yet: the remote control of the Linux computer is not flawless and I have a hard time connecting to it. But there should be two demultiplexed files in the 'dataset' folder on the Linux machine.
I have posted a question on Biostars regarding the demultiplexing of the dataset.
I have found a potential solution for the demultiplexing problem. This solution searches for barcodes in the reads and stores both the forward and reverse reads in the same file (i.e. an interleaved paired-end fastq file).
These steps are repeated with the barcodes from sample 2. When this is run for the wt dataset, it finds 53756670 pairs for sample 1 (which is 88.9% of the total number of reads that contain a barcode from sample 1) and for sample 2 it finds 59550160 pairs (91.2% of the total number of reads that contain a barcode from sample 2).
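The individual steps are not spelled out above, but a rough outline of this kind of barcode-based pair selection could look as follows; the barcode sequences, file names, and the exact definition of the reported percentage are assumptions here:

```python
# Rough outline of demultiplexing into per-sample interleaved fastq files.
# Barcode sequences and file names are placeholders; a pair is kept for a
# sample here when both mates contain one of that sample's barcodes.
from itertools import islice

SAMPLE_BARCODES = {
    "sample1": ("GCCACATA", "GAGAGTAC"),   # hypothetical
    "sample2": ("GCGAGTAA", "GTCACAGA"),   # hypothetical
}

def fastq_blocks(path):
    with open(path) as handle:
        while True:
            block = list(islice(handle, 4))
            if len(block) < 4:
                return
            yield block

def demultiplex(sample, barcodes):
    reads_with_barcode = 0   # reads (forward or reverse) carrying a sample barcode
    kept_pairs = 0           # pairs where both mates carry a sample barcode
    with open(f"{sample}_interleaved.fastq", "w") as out:
        for fwd, rev in zip(fastq_blocks("wt_forward.fastq"),
                            fastq_blocks("wt_reverse.fastq")):
            fwd_hit = any(bc in fwd[1] for bc in barcodes)
            rev_hit = any(bc in rev[1] for bc in barcodes)
            reads_with_barcode += int(fwd_hit) + int(rev_hit)
            if fwd_hit and rev_hit:
                kept_pairs += 1
                out.writelines(fwd + rev)          # interleave the pair
    fraction = 2 * kept_pairs / reads_with_barcode if reads_with_barcode else 0.0
    print(f"{sample}: {kept_pairs} pairs ({100 * fraction:.1f}% of barcoded reads)")

for sample, barcodes in SAMPLE_BARCODES.items():
    demultiplex(sample, barcodes)
```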
Here I present a more thorough overview of some of the steps I have been taking in the processing pipeline for the SATAY data.
What the best trimming steps are really depends on the data. The current steps might be improved, but exactly how I am not sure; possibly with trial-and-error I will see when the quality of the data after trimming is best.
I am not yet entirely sure what this means and whether it is something to be concerned about or not. The output file of the alignment (the sam file) can be checked using samtools flagstat.
This shows that only about 54% of the reads are actually mapped. This is not much compared to the number of reads that were initially sequenced, but the absolute number of mapped reads is comparable to the number of mapped reads in the dataset from Benoit. Perhaps more worrying is that only 29% of the reads are mapped as actual pairs. This is also visible in IGV, where many reads are colored (see figure below).
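For reference, the flagstat check can be wrapped in Python so it fits the rest of the pipeline; the file name is a placeholder:

```python
# Run samtools flagstat on the alignment output and print its summary
# (total reads, mapped reads, properly paired reads, ...).
# The file name is a placeholder; older samtools versions may require
# converting the SAM file to BAM first.
import subprocess

result = subprocess.run(
    ["samtools", "flagstat", "alignment.sam"],
    capture_output=True, text=True, check=True)
print(result.stdout)
```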
This is a follow-up on point 4 raised in the previous comment about the flag handling in the transposon mapping Python code.
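Point 4 itself is not reproduced here, but as a sketch of what such flag handling usually involves: the SAM FLAG column is a bit field, so individual properties are read with bitwise AND. The flag values below follow the SAM specification, while the rule for placing the insertion site from the strand is an assumption in this sketch:

```python
# Sketch of SAM flag handling: flag values follow the SAM specification,
# but the rule for deriving the insertion coordinate is an assumption here.
FLAG_PROPER_PAIR = 0x2    # read mapped in a proper pair
FLAG_UNMAPPED    = 0x4    # read itself is unmapped
FLAG_REVERSE     = 0x10   # read maps to the reverse strand

def insertion_site(flag, leftmost_pos, read_length):
    """Return (strand, coordinate) of the transposon border for one read."""
    if flag & FLAG_UNMAPPED:
        return None
    if flag & FLAG_REVERSE:
        # For a reverse-strand read the border sits at the 3' end of the
        # alignment (assumption for this sketch; ignores indels and clipping).
        return "-", leftmost_pos + read_length - 1
    return "+", leftmost_pos

# Example: flag 83 = paired + proper pair + reverse strand + first in pair
print(insertion_site(83, 120000, 75))   # ('-', 120074)
```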
Hi Greg, nice that it worked, but the uneven distribution of reads throughout the genome is indeed troublesome. Do you maybe know why only 10 million of the reads that were left after keeping only those that contain the sequencing primer from Benoit could be aligned? And how many mismatches did you allow in the alignment of the sequencing primer? It would be good to have the number of transposons that were mapped, so that we can see whether the lower contrast between essential and non-essential genes is due to a difference in the number of transposon insertions or due to the number of reads we have. What it looks like now is that we might have filtered for locations that survive this process of trimming and alignment, which could cause the peaks in your alignment plots.
I have tried to determine statistics similar to the ones Benoit shows in his paper, but I fail to reproduce the values presented there. Probably I am doing something wrong here, but nevertheless here are the numbers as I have them so far for Enzo's dataset. I'll update the numbers as soon as I have figured out what is wrong.
I don't know, I haven't stored the rejected reads. I should recreate the dataset for this with the rejected reads included. More interestingly, I have also plotted the number of transposons instead of the number of reads. The number of transposons is less noisy compared to the read count. This shows some similarities with Benoit's dataset (see figure below, note the difference in y-scales). This clearly indicates the centromere and ADE2-gene bias in transposition. Also, there is a large peak on chromosome 12 around bp 460000. I looked at this location on SGD, but I couldn't find anything interesting that would explain such a large peak. There are mainly non-coding regions, ribosomal RNA sequences and two small genes, which I don't expect to cause such a transposition bias. Also in the read counts a high density of reads is visible at this location. Note: I used these datasets for creating the above plot:
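A minimal sketch of the difference between read counts and transposon counts mentioned above: every distinct insertion position is counted once, regardless of how many reads map there. The input format below is a made-up illustration:

```python
# Collapse read counts into transposon counts: all reads mapping to the same
# chromosome, strand, and insertion coordinate count as a single transposon.
# The input tuples below are a made-up illustration.
from collections import Counter

mapped_reads = [                     # (chromosome, strand, insertion position)
    ("chrXII", "+", 460012),
    ("chrXII", "+", 460012),         # second read at the same site
    ("chrXII", "-", 460890),
]

reads_per_site = Counter(mapped_reads)          # read count per insertion site
transposons_per_chrom = Counter(chrom for chrom, _, _ in reads_per_site)

print(dict(reads_per_site))         # {('chrXII', '+', 460012): 2, ('chrXII', '-', 460890): 1}
print(dict(transposons_per_chrom))  # {'chrXII': 2}  -> two transposons, three reads
```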
We got back 150 bp read data from NovoGene from the SATAY experiments. Based on a manual inspection of the data, we see that:
- Each read also contains the sequence of the primers used for amplification of the DNA and part of the MiniDS sequence. Different samples can therefore be separated based on the barcode that can be found in the read (this barcode is not mentioned in the fastqc file and probably has to be used as input for demultiplexing the samples).
- Each read pair can be identified from the fastqc file and can therefore be processed as a pair (which means you can match the forward and reverse reads from the same strand of DNA).
- After removing the primer+MiniDS sequence, about 40-70 bp remains for alignment to a reference genomic DNA sequence, which may not be a lot (the protocol mentions 75 bp used for alignment). It might be beneficial to use 250 bp sequencing next time if we think that might significantly improve the alignment results (a rough way to check the remaining length per read is sketched below).
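A rough way to check this remaining length, assuming everything up to and including the end of the MiniDS border sequence is trimmed; the border sequence and file name used below are placeholders:

```python
# Estimate how much genomic sequence remains per read after trimming everything
# up to and including the MiniDS border. The border sequence is a placeholder.
from itertools import islice

MINIDS_BORDER = "TTTACCGACCGTTACCGACC"   # hypothetical transposon border sequence

def genomic_length(sequence):
    """Length of the read downstream of the border, or None if no border found."""
    idx = sequence.find(MINIDS_BORDER)
    return None if idx == -1 else len(sequence) - idx - len(MINIDS_BORDER)

lengths = []
with open("wt_forward.fastq") as handle:
    for record in iter(lambda: list(islice(handle, 4)), []):
        if len(record) < 4:
            break
        remaining = genomic_length(record[1].rstrip())
        if remaining is not None:
            lengths.append(remaining)
        if len(lengths) >= 100000:           # a sample of reads is enough
            break

if lengths:
    lengths.sort()
    print("median remaining length:", lengths[len(lengths) // 2])
```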