Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch to scientific notation when frequencies can't be displayed as decimals #104

Closed
alexlancaster opened this issue Jul 30, 2023 · 4 comments · Fixed by #192
Closed

switch to scientific notation when frequencies can't be displayed as decimals #104

alexlancaster opened this issue Jul 30, 2023 · 4 comments · Fixed by #192
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@alexlancaster
Copy link
Owner

Issue #38 fixed the issue of truncating frequencies that eliminate the exponent when shown in scientific notation by forcing decimal notation. However, this means for extremely low frequency haplotypes, they might be shown as zero (e.g., 0.00). The better solution would be to truncate but preserve the scientific notation. This is non-trivial using XSLT, so opening up a new issue. Because the current temporary solution, at least makes the output is no longer incorrect, making this an enhancment.

Cc @sjmack

@alexlancaster
Copy link
Owner Author

alexlancaster commented Feb 6, 2024

hey @sjmack I had a little time to work on this again. It's almost impossible to fix in XSLT directly, but I was able to create a Python extension for xslt that implements the rules we want for this.

I've created a new branch for this: https://github.com/alexlancaster/pypop/tree/104-preserve-scientific-notation-when-truncating-viewable-haplotype-frequencies

I had to fix this more globally - not just within haplotype frequencies, but everwhere we have space with limited characters. I currently have a special round-to template for this. It currently rounds to a specified number of characters after the decimal point, but I've created the Python extension that adds some new rules for when to switch to scientific notation in the function format_number_fixed_width:

def format_number_fixed_width(context, *args):
num = float(args[0])
places = int(args[1])
zeros_before_sig_figs = num_zeros(num)
# need to reserve 4 characters for exponent
precision = places - 4 if places >= 4 else 0
if zeros_before_sig_figs >= places and zeros_before_sig_figs != inf:
#retval = "{0:.{1}E}".format(num, precision)
retval = format_float_scientific(num, exp_digits=1, precision=precision, trim='-')
else:
retval = "{0:.{1}f}".format(num, places)
return retval

So now if I set the "places" to, say 5, it will look like the following (there are now unit tests that check these examples and more):

input               formatted
0.032               0.03200
0.00000433          4.3e-6
0.000000            0.00000

This also affects the HW tables etc, which isn't so elegant because of the extra space that e- takes up, but it's probably better than being rounded to zero. e.g. here i

11:04P   1/1.1
 13:01P   2/1.6   1/0.6
-13:02P   0/0.4   0/0.3  0/4e-2
-13:03P   0/0.2   0/0.1  0/3e-2  0/6e-3
+13:02P   0/0.4   0/0.3   0/0.0
+13:03P   0/0.2   0/0.1   0/0.0   0/0.0
 14:01P   3/2.8   2/2.0   1/0.5   0/0.2   2/1.7
 15:01P 12/12.1   9/8.7   2/2.4   1/0.9 15/14.9 32/32.5
         11:04P  13:01P  13:02P  13:03P  14:01P  15:01P

So this will mean that a lot of unit and pipeline test outputs will need to be changed. How does it sound to you?

@alexlancaster alexlancaster changed the title preserve scientific notation when truncating viewable haplotype frequencies preserve scientific notation when truncating viewable frequencies Feb 6, 2024
@sjmack
Copy link
Collaborator

sjmack commented Feb 6, 2024

Thanks for addressing this @alexlancaster!

Is it possible to have "places" 'float' depending on the number of decimal places, for the formatted output to fill in the space/give the user/reviewer the most information when reviewing the .txt outputs (see below)?

For example:

input               formatted
0.0433              0.04330
0.04333             0.04333
0.000004333         4.33e-6
0.0000000004333     4.3e-10    

Even if not, this new implementation is definitely an improvement over the previous version. And I agree about not doing actually analyses with the .txt outputs; that's what the TSV outputs are there for.

I agree that the HW matrix is less elegant, but it is definitely more informative. Are the + and - in the far left column of the matrix supposed to be old (+) and new (-)? That seems counter intuitive.

How difficult would it be to dynamically space the columns in the HWE matrix in these .txt outputs to accommodate the maximum width of an o/e value pair or the allele name?

@alexlancaster
Copy link
Owner Author

alexlancaster commented Feb 6, 2024

Thanks for addressing this @alexlancaster!

Is it possible to have "places" 'float' depending on the number of decimal places, for the formatted output to fill in the space/give the user/reviewer the most information when reviewing the .txt outputs (see below)?

For example:

input               formatted
0.0433              0.04330
0.04333             0.04333
0.000004333         4.33e-6
0.0000000004333     4.3e-10    

Yes, this is the intention. It did a pretty good job of this before, but I just rejiggered it a bit so that it now maximizes the use of any given number of character "places". It's tricky because the exponent length can also vary. So it has to be dynamic: we calculate the precision based on the fixed size of the e character, plus the variable length exponent (typically 1 or 2 characters), and then use whatever is leftover for the precision. I've added your examples to the test cases, see below:

     # input                 output       places
        ('0.032',           '0.03200',  5), # pad out with leading zeros
        ('0.0433',          '0.04330',  5),
        ('0.04333',         '0.04333',  5),
        ('0.000004333',     '4.33e-6',  5), # converts to scientific notation to fit in the 5 character ('places') limit
        ('0.0000000004333', '4.3e-10',  5),
        ('0.00000433',      '0.000004', 6), # does not convert to scientific notation, because we have 6 characters
        ('0.00000491',      '0.000005', 6), # check rounding! 
        ('0.000000433',     '4.33e-7',  6), # again need scientific notation to fit
        ('0.000000',        '0.0000',   4), # handle zero as float, not sci notation

Feel free to add additional corner cases so we can make sure we capture all the variations you want to cover, and I'll put them in the unit test.

Even if not, this new implementation is definitely an improvement over the previous version. And I agree about not doing actually analyses with the .txt outputs; that's what the TSV outputs are there for.

I will add some text to the documentation as part of the PR.

I agree that the HW matrix is less elegant, but it is definitely more informative. Are the + and - in the far left column of the matrix supposed to be old (+) and new (-)? That seems counter intuitive.

Oh that's just an artifact of the order in which I did the diff (I just copy and pasted from the diff generated by the unit test, I should probably reorder it). Don't get hung up on that. I was just using the neat markdown feature where you can cut and paste diffs and have them be color coded.

How difficult would it be to dynamically space the columns in the HWE matrix in these .txt outputs to accommodate the maximum width of an o/e value pair or the allele name?

It's fiddly - it already does this partially, but it was tuned before we added scientific notation. I'd rather get this feature in, and maybe you can spawn another feature request for that? So long as it's readable and doesn't overlap other characters, that's what I care about most now.

@alexlancaster alexlancaster changed the title preserve scientific notation when truncating viewable frequencies switch to scientific notation when frequencies can't be displayed as decimals Feb 6, 2024
@alexlancaster
Copy link
Owner Author

@sjmack I've created PR #192 that should fix most things, please take a moment to review it. For the HW tables, I added an extra space to all cells to handle the exponential notation e-. This also has the net benefit of making the labels better spaced out, but the tables are a bit wider. I think it works. See some of the examples in the diffs for the updated "gold" output for the unit/pipeline tests to see what they look like:

https://github.com/alexlancaster/pypop/pull/192/files#diff-2dda9b66be4d7310af66262c7e2cb5dd3048c0bbc564790f8d334014c636a473

You can see all the differences in the diffs for the PR.

If it looks good, I'll merge it either in the next patch release (1.0.1), or if you think it's different enough to be considered a "minor" release - i.e. user visible changes, it might be 1.1.0. At one level it's a just bug fix, but it's also an enhancement and somewhat of a change in behaviour.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants