Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem with the reporting of very low haplotype frequencies in *-out.txt files #60

Closed
sjmack opened this issue May 17, 2023 · 7 comments
Labels
duplicate This issue or pull request already exists

Comments

@sjmack
Copy link
Collaborator

sjmack commented May 17, 2023

A haplotype frequency that is sufficiently low to require that it is recorded in scientific notation in the *-out.xml file is not properly recorded in the *-out.txt file.

For example, the attached *-out.xml file includes three haplotypes with very low frequencies:

<haplotype name="05:01:01~131:01:01"><frequency>1.4450935155304534e-05</frequency></haplotype>
<haplotype name="06:02:01~835:01:01"><frequency>7.332759277231474e-09</frequency></haplotype>
<haplotype name="06:04:01~04:01:01"><frequency>9.068719188886876e-07</frequency></haplotype>

In the corresponding *-out.txt file, these frequencies are rounded to seven digits and reported without the exponent, which makes these frequencies appear to be larger than 1 in the *-out.txt file, as below:

05:01:01~131:01:01   1.4450935    
06:02:01~835:01:01   7.3327592         
06:04:01~04:01:01    9.0687191

Weird_Controls-out.txt
Weird_Controls-out.xml.txt

I see that I reported this earlier in 2017! It is still an issue.

@alexlancaster
Copy link
Owner

alexlancaster commented May 23, 2023

Hey Steve, thanks for the report (I didn't see an old github issue - either open or closed - about this, can you point the specific one?). What version are you currently using? Are you building the pre-release v1.0.0 out of github, or are these old 0.7.0 binaries? Can you add a complete .xml file so I can reproduce the stylesheet transformation?

Also a minimal .ini and .pop for a complete end-to-end test would be great (I can turn it into a test-case).

@alexlancaster
Copy link
Owner

I see what the issue is. As you noticed the XSLT just truncates the display, losing the scientific notation part.

This does raise the issue of how we handle the text output in a uniform way. While the problem is easy to state, the solution is a bit more subtle. The quick fix is that I can use the round-to() XSLT function which I already wrote to convert to decimal and then round to the same specified number of places as would fit in padding setting (this would be 8). Then the numbers would look correct, but anything that was 1e-8 or smaller would just show up as 0.00000000, so it's "lossy" in that sense. But this might not be such a big deal since presumably most people will not be really interested in haplotype that are that small.

The better solution would be to come up with a heuristic like:

  • if it's 1e-7 or bigger show as a decimal (padded to however wide we want to make the columns),
  • but if it's smaller it would show up in scientific notation with however many significant could fit in the same padded region

This is theoretically possible, but XSLT 1.0 doesn't have good support for scientific notation - there isn't a built-in function that can take a number in scientific notation and only show a specified number of significant figures - so it would probably mean custom function which might take a while to write.

Would the quick fix be sufficient for the time being? So at least the .txt output isn't wrong. Of course the full un-rounded output will always be present in the ..xml, so that's never lost.

@sjmack
Copy link
Collaborator Author

sjmack commented May 23, 2023

Hi Alex. I'm glad to see you found the original issue. I'm using the most recent version of PyPop (at least as of a month ago) available on this repo.

The most expedient solution would be to effectively round things to 0 as you suggest, for the reasons you suggest.

A useful alternative would be, again as you suggested, to round to sufficient digits to allow the exponent to be included in the string. So 1.4450935155304534e-05 would be reported as 1.445e-05 in the .txt output.

For now, I suppose the quick fix is the easiest one, but since today is the 2100th day since the issue was originally reported, perhaps taking a little more time to address it is okay. :D

@alexlancaster
Copy link
Owner

alexlancaster commented May 23, 2023

Hi Alex. I'm glad to see you found the original issue.

Actually I didn't find an original report on the issue tracker (unless it was buried in an unrelated bug report, or if it was reported by email). Do you have a ticket number?

I'm using the most recent version of PyPop (at least as of a month ago) available on this repo.

So you compiled ok from source? since no official binaries have been released yet.

The most expedient solution would be to effectively round things to 0 as you suggest, for the reasons you suggest.

I'll implement this one for the time being - it will probably also require updating all the unit tests that generate haplotypes, since it will mean all output will be rounded rather than truncated.

A useful alternative would be, again as you suggested, to round to sufficient digits to allow the exponent to be included in the string. So 1.4450935155304534e-05 would be reported as 1.445e-05 in the .txt output.

I'll open up another bug as an enhancement for that part, after I close this one.

For now, I suppose the quick fix is the easiest one, but since today is the 2100th day since the issue was originally reported, perhaps taking a little more time to address it is okay. :D

@sjmack
Copy link
Collaborator Author

sjmack commented May 23, 2023

Actually I didn't find an original report on the issue tracker (unless it was buried in an unrelated bug report, or if it was reported by email). Do you have a ticket number?

it is #38 "Problem with Haplostats Implementation". I wasn't as detailed in my original description, I suppose.

So you compiled ok from source? since no official binaries have been released yet.

I followed the instructions and everything worked fine. I do have to leave an obnoxiously-named file in my PyPop directory to remind myself activate the conda environment if I have restarted my machine since the last time I used pypop. I only had to bang my head on rock for 20 minutes one time before doing that.

I'll implement this one for the time being - it will probably also require updating all the until tests that generate haplotypes, since it will mean all output will be rounded rather than truncated.

That is probably fine. In the future though, it would be nice to be able to set an output parameter to specify if the user wanted the output rounded or to have epsilon-like frequencies automatically set to 0.

I'll open up another bug as an enhancement for that part, after I close this one.

Thanks! Perhaps you could also include setting a frequency-reporting parameter as an enhancement.

@alexlancaster
Copy link
Owner

I'm marking this one as a duplicate of the earlier one. We can continue discussion there until I close it.

@alexlancaster
Copy link
Owner

Duplicate of #38

@alexlancaster alexlancaster added the duplicate This issue or pull request already exists label May 23, 2023
@alexlancaster alexlancaster closed this as not planned Won't fix, can't repro, duplicate, stale May 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duplicate This issue or pull request already exists
Projects
None yet
Development

No branches or pull requests

2 participants