Remove Duplicates in mirtop.tsv removes more lines than expected #174
Comments
Hi @DwinGrashof, can you check on this code and look at the table this generates to see if it seems better:
I can push this into dev if it seems to address the issue without messing up other things :)
Hi @lpantano, sorry for my late response, I've been busy over the holidays. I've looked through the code and it does seem to work better in most cases than the current mirna code. hsa-miR-365a-3p and hsa-miR-365b-3p have the same counts with all the same UIDs in my dataset. For UID iso-22-FPJYUP6XO, counts are found for both hsa-miR-107 and hsa-miR-103a-3p. Since these miRNAs have both overlapping and unique UIDs, both are reported, but the counts for hsa-miR-107 are much lower than when all its UIDs are summed.

These "problems" are, however, very much debatable: we would be retaining counts from UIDs that are in essence ambiguous, where we do not know which miRNA is actually expressed, if not both. There are of course multiple ways to handle this. In the new R code, the first instance of a duplicated UID is retained and the other duplicated instances are removed, which is a rather arbitrary way of handling these ambiguities. Other options could be removing all not-unique UIDs after the

This is just something to think about, and I completely understand if you would rather keep it as coded in your previous post :)

Best,
Hi @DwinGrashof and @lpantano, this is due to how the function duplicated() works on vectors in R (in our case, the UIDs). The first occurrence of each element is marked as FALSE and every later occurrence as TRUE, so only the first occurrence is kept when the duplicates are dropped.
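To make the behaviour of duplicated() concrete, here is a minimal sketch on toy data. The data frame, its column names (UID, miRNA, count) and its values are hypothetical and are not taken from mirtop.tsv or collapse_mirtop.r; the sketch only contrasts "keep the first occurrence" with "drop every UID that occurs more than once".

```r
# Toy data: column names and values are hypothetical, not from mirtop.tsv
df <- data.frame(
  UID   = c("iso-A", "iso-B", "iso-B", "iso-C"),
  miRNA = c("miR-x", "miR-x", "miR-y", "miR-y"),
  count = c(10, 20, 20, 5),
  stringsAsFactors = FALSE
)

# duplicated() marks the first occurrence FALSE and later occurrences TRUE
dup <- duplicated(df$UID)

# Keeping !dup retains the first row per UID, so row order decides
# which miRNA keeps the counts of a shared UID
keep_first <- df[!dup, ]

# Dropping every UID that occurs more than once needs both scan directions
non_unique  <- dup | duplicated(df$UID, fromLast = TRUE)
keep_unique <- df[!non_unique, ]
```

In this toy case keep_first assigns the shared UID to whichever miRNA happens to come first in the table, while keep_unique discards the shared UID for both miRNAs.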
From the updated code from Lorena and the suggestions from Dwin, I propose to replace the line to remove duplicates for this one:
Let me know what you think.
We are going to remove a lot of good data if we do that. There are many sequences that map to the same mature miRNA but come from different precursors. We shouldn't remove those lines. In the case of the 365 family, I wouldn't be worried: since the sequence is exactly the same, it would be nice to say that the miRNA could be one or the other. For the second case, I totally understand. It is a good point. I don't think we should keep only the unique ones, because that would remove a lot of data and mislead the results. We need to think of a better way to solve this without removing duplicates while trying to be fair to the expression of the different possible miRNAs. I would suggest leaving this for the Barcelona Summit. I can take charge of solving this issue; I have used some other strategies that are better than this, and I could adapt them to the pipeline.
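One way to picture "not removing duplicates but being fair" is to keep every row and only flag UIDs that map to more than one miRNA, so ambiguous and unambiguous counts can be reported side by side. This is just an illustrative sketch under assumed column names (UID, miRNA, count) with made-up values; it is not the strategy referred to above and not code from the pipeline.

```r
# Toy data: column names and values are hypothetical
df <- data.frame(
  UID   = c("iso-A", "iso-B", "iso-B", "iso-C"),
  miRNA = c("miR-x", "miR-x", "miR-y", "miR-y"),
  count = c(10, 20, 20, 5),
  stringsAsFactors = FALSE
)

# Number of distinct miRNAs each UID is assigned to
n_mirna <- tapply(df$miRNA, df$UID, function(x) length(unique(x)))

# Flag rows whose UID is shared by more than one miRNA; no row is removed
df$ambiguous <- as.vector(n_mirna[df$UID] > 1)

# Sum counts per miRNA, split into unambiguous and ambiguous parts
summary_tbl <- aggregate(count ~ miRNA + ambiguous, data = df, FUN = sum)
```

The resulting table keeps all counts and leaves the decision on how to treat the ambiguous part (ignore, split, or report both) to a later step.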
Description of the bug
There is a difference between the mirtop.tsv and mirna.tsv count files. The difference comes from the R script collapse_mirtop.r, where the counts for isomirs per miRNA are summed after removing UID duplicates using the code:

However, some duplicated UIDs are removed, and I don't know why you would choose to do so. For example, for miRNA hsa-miR-7-5p I have a duplicated UID with the following stats and counts:
There are more examples like this for hsa-miR-7-5p, where duplicated UIDs with identical variants are removed even though they carry substantial counts. This reduces the counts from 150 000 in the mirtop file to 70 000 in the mirna file.
Is the removal of these duplicated UIDs on purpose, or maybe a bug where the above UID should be:
Command used and terminal output
N/A
Relevant files
Added are two files:
mirtop.tsv <- the results from mirtop
mirna.tsv <- the deduplicated results after collapse_mirtop.r
mirna_mirtop_files.zip
System information
No response