parsing from sRNAbench data (II) #60

Open
JFsanchezherrero opened this issue Sep 30, 2019 · 3 comments

@JFsanchezherrero

Hi there,

We (@lsumoy and I) have found an unexpected result from miRTop when parsing sRNAbench data. It is somewhat related to the previous issue #53, but not entirely, so we opened a new one.

We came across this while working on code that turns miRTop results into a count matrix, which could be useful for differential expression (DE) analysis because it would contain all the information on canonical, mature, variant and license plate annotations. As stated before in #53 (comment), we might be interested in contributing to the code.

We think the issue shown here should be fixed by people familiar with the miRTop code, so we are reporting it.

Expected behavior and actual behavior.

We have found a discrepancy in the summed expression counts per microRNA isomiR when sRNAbench output is parsed into a miRTop GFF.

I have run some tests, and they all indicate that the parser skips the variant type "mv" (multiple variants), among others, from the sRNAbench microRNAannotation.txt and reads.annotation files. This creates an imbalance in the total counts:
e.g. hsa-miR-10b-5p
sRNAbench: 61136
miRTop: 58429

Steps to reproduce the problem.

I have attached a couple of examples in these files, including the one shown below:
microRNAannotation.txt
reads.annotation.txt

jsanchez@cacau:test$ grep 'hsa-miR-10b-5p' microRNAannotation.txt | awk '{sum += $6} END {print sum}'
61136
jsanchez@cacau:test$ grep 'hsa-miR-10b-5p' microRNAannotation.txt | grep -v 'mv' | awk '{sum += $6} END {print sum}'
58601
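For anyone who prefers to script this check, the two shell pipelines above can be mirrored in Python. This is only a minimal sketch: the function name `sum_counts` is made up for illustration, the count is assumed to be the 6th whitespace-separated field (as in the `awk` call), and the example rows are synthetic, not the real microRNAannotation.txt layout.

```python
def sum_counts(lines, mirna, exclude=None, count_field=6):
    """Sum the count field over lines that mention `mirna`.

    Mirrors: grep mirna | [grep -v exclude] | awk '{sum += $count_field}'
    `count_field` is 1-based, like awk's $6.
    """
    total = 0
    for line in lines:
        if mirna not in line:
            continue  # same as the first grep
        if exclude is not None and exclude in line:
            continue  # same as grep -v
        fields = line.split()
        if len(fields) < count_field:
            continue  # awk would add 0 for a missing field
        try:
            total += int(fields[count_field - 1])
        except ValueError:
            continue  # header or non-numeric value, contributes nothing
    return total


# Synthetic rows (hypothetical layout; only the 6th field matters here).
rows = [
    "seq1 hsa-miR-10b-5p a b lv3p 100",
    "seq2 hsa-miR-10b-5p a b mv 30",
    "seq3 hsa-miR-21-5p a b exact 7",
]

all_reads = sum_counts(rows, "hsa-miR-10b-5p")
without_mv = sum_counts(rows, "hsa-miR-10b-5p", exclude="mv")
print(all_reads, without_mv)
```

With the real file, `rows` would be the lines read from microRNAannotation.txt. As with `grep -v 'mv'`, the exclusion is a plain substring match on the whole line, so it also drops any line that happens to contain "mv" elsewhere.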

I repeated the same command for the different variant types identified for this microRNA and sample:

lv3p: 53452
nta: 3233
mv: 2535
exact: 1391
lv5p: 439
exactNucVar: 84
mlv3p: 2
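As a quick consistency check on these numbers, the arithmetic can be written out in a few lines of Python. The values are copied from this comment; the variable names are only illustrative.

```python
# Per-variant read counts for hsa-miR-10b-5p, as listed above.
per_variant = {
    "lv3p": 53452,
    "nta": 3233,
    "mv": 2535,
    "exact": 1391,
    "lv5p": 439,
    "exactNucVar": 84,
    "mlv3p": 2,
}
srnabench_total = 61136  # total reported from sRNAbench
mirtop_total = 58429     # total reported from the miRTop GFF

total = sum(per_variant.values())
without_mv = total - per_variant["mv"]
unexplained = without_mv - mirtop_total

print("sum of variant types:", total)          # 61136, matches sRNAbench
print("without mv:", without_mv)               # 58601
print("still missing in miRTop:", unexplained) # 172
```

So the variant types add up to the sRNAbench total, and dropping mv alone explains 2535 of the 2707 missing reads; 172 reads remain unaccounted for.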

I cannot clearly see what is going on and what is missing here. I cannot reproduce the miRTop total: even without the mv reads the sum is 58601, not 58429, so I guess that, besides the mv variants, some other variants are not being included.

I also include here the gff file generated by miRTop.
miRTop.gff.txt

Specifications like the version of the project, operating system, or hardware.

We are running this on:
debian8.10
linux

python2.7
mirtop (0.3.17)

miRBase v22
genome-build-id: GRCh38
genome-build-accession: NCBI_Assembly:GCA_000001405.15

@lpantano
Contributor

lpantano commented Oct 1, 2019

Thanks a lot for this. When I was implementing this, I realized that some of the variants cannot be parsed to GFF. I can look into that, since it is very important.

Can you paste the information mirtop prints while running the conversion from sRNAbench to GFF?

@lpantano lpantano added the bug label Oct 1, 2019
@lpantano lpantano self-assigned this Oct 1, 2019
@JFsanchezherrero
Author

Hi there Lorena,
They definitely seem important, at least for some samples.

We think these variants may be difficult to assign to any of the existing categories, but it could be appropriate to include them in the GFF anyway, even under a common name such as "non classified". The counts would then add up to the same total whether summed at the variant, canonical or isomiR level. The category of these variants could also be changed or refined later on.
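The idea of a catch-all category can be illustrated with a small Python sketch. This is not mirtop code: the function name `totals_by_category`, the `supported` set and the example records are all hypothetical, and serve only to show that routing unsupported variant types into one bucket keeps the grand total unchanged.

```python
from collections import defaultdict


def totals_by_category(records, supported, fallback="non_classified"):
    """Sum counts per variant type, sending unsupported types to `fallback`.

    `records` is an iterable of (variant_type, count) pairs.
    `supported` is the set of variant types the output format can express
    (hypothetical here; the real list would come from the GFF converter).
    """
    totals = defaultdict(int)
    for variant, count in records:
        key = variant if variant in supported else fallback
        totals[key] += count
    return dict(totals)


# Made-up records, for illustration only.
records = [("lv3p", 100), ("mv", 30), ("exact", 7), ("mv", 5)]
supported = {"lv3p", "exact"}

totals = totals_by_category(records, supported)
print(totals)

# No read is lost: the per-category totals add up to the input total.
print(sum(totals.values()) == sum(count for _, count in records))
```

Because nothing is dropped, the totals agree at every level of aggregation, which is the property asked for above.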

Here is the information generated during the creation of gff by mirtop. This is stored in run.log

INFO-mirtop.libs.logger(27): Run annotation
INFO-mirtop.libs.logger(47): Reads with isomiR information 19317
INFO-mirtop.libs.logger(131): Loaded 568 reads with 24382 hits
INFO-mirtop.libs.logger(132): Reads without precursor information: 1601
INFO-mirtop.libs.logger(134): Reads with MV as variant definition, not supported by GFF: 1829
INFO-mirtop.libs.logger(135): Hit Filtered by having > 3 changes: 0
INFO-mirtop.libs.logger(49): It took 0.060 minutes

I can read in the log that there are MV variants (1829) that could not be included, but this number is not the same as the one stated before (2535), which I counted from the sRNAbench microRNAannotation.txt file. I guess there is something else missing here.

Thank you very much in advance

@JFsanchezherrero
Author

Hi there,

I have just realized that I made a mistake in the previous comment.

The number reported in the run.log output when generating the GFF, the 1829 reads with MV variants, must refer to single entries. I mean that each entry can have multiple reads mapping to a given miRNA with a given variant type. The 1829 figure comes from parsing the sRNAbench results for all the miRNAs identified in this sample and condition. It has nothing to do with the 2535, which is the read-count imbalance for one example miRNA.

I have checked the total number of entries (= lines) in the microRNAannotation file for this sample that contain mv as a variant annotation, alone or combined with others (e.g. mv, mv$lv3p, ...), and it comes to 1821. (It is not the same number as the one reported, but at least it is very close.)
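The difference between counting entries and counting reads can be made explicit with a short Python sketch. It is only illustrative: the function name `entries_and_reads` is made up, the count is assumed to sit in the 6th whitespace-separated field (as in the earlier `awk` command), and the rows are synthetic.

```python
def entries_and_reads(lines, label="mv", count_field=6):
    """Return (number of entries, summed reads) for lines containing `label`.

    An entry is one line; its reads are the integer in `count_field`
    (1-based). Lines without a numeric count are skipped.
    """
    entries = 0
    reads = 0
    for line in lines:
        if label not in line:
            continue  # substring match, so "mv$lv3p" also counts
        fields = line.split()
        if len(fields) < count_field or not fields[count_field - 1].isdigit():
            continue
        entries += 1
        reads += int(fields[count_field - 1])
    return entries, reads


# Synthetic rows (hypothetical layout; only the 6th field matters here).
rows = [
    "seq1 hsa-miR-10b-5p a b mv 30",
    "seq2 hsa-miR-10b-5p a b mv$lv3p 5",
    "seq3 hsa-miR-21-5p a b lv3p 100",
]

n_entries, n_reads = entries_and_reads(rows)
print(n_entries, n_reads)
```

On the real file, the first number is the one comparable to the 1829 in run.log (entries over all miRNAs), while the second, restricted to a single miRNA, is the kind of figure behind the 2535.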

Previously I only attached to this issue, as an example, a couple of miRNA annotations with a clear imbalance in total counts between sRNAbench and miRTop. If you think it is necessary, I can send you the whole set of files generated by sRNAbench, or at least the complete microRNAannotation and reads.annotation files.

Thanks
