switch to scientific notation when frequencies can't be displayed as decimals #104

alexlancaster · 2023-07-30T17:56:44Z

Issue #38 fixed the issue of truncating frequencies that eliminate the exponent when shown in scientific notation by forcing decimal notation. However, this means for extremely low frequency haplotypes, they might be shown as zero (e.g., 0.00). The better solution would be to truncate but preserve the scientific notation. This is non-trivial using XSLT, so opening up a new issue. Because the current temporary solution, at least makes the output is no longer incorrect, making this an enhancment.

Cc @sjmack

The text was updated successfully, but these errors were encountered:

alexlancaster · 2024-02-06T04:43:58Z

hey @sjmack I had a little time to work on this again. It's almost impossible to fix in XSLT directly, but I was able to create a Python extension for xslt that implements the rules we want for this.

I've created a new branch for this: https://github.com/alexlancaster/pypop/tree/104-preserve-scientific-notation-when-truncating-viewable-haplotype-frequencies

I had to fix this more globally - not just within haplotype frequencies, but everwhere we have space with limited characters. I currently have a special round-to template for this. It currently rounds to a specified number of characters after the decimal point, but I've created the Python extension that adds some new rules for when to switch to scientific notation in the function format_number_fixed_width:

pypop/src/PyPop/xslt/__init__.py

Lines 52 to 65 in 032c6a0

    
           def format_number_fixed_width(context, *args): 
        
               num = float(args[0]) 
        
               places = int(args[1]) 
        
               zeros_before_sig_figs = num_zeros(num) 
        
               # need to reserve 4 characters for exponent 
        
               precision = places - 4 if places >= 4 else 0 
        
               if zeros_before_sig_figs >= places and zeros_before_sig_figs != inf: 
        
                   #retval = "{0:.{1}E}".format(num, precision) 
        
                   retval = format_float_scientific(num, exp_digits=1, precision=precision, trim='-') 
        
               else: 
        
                   retval = "{0:.{1}f}".format(num, places) 
        
               return retval

So now if I set the "places" to, say 5, it will look like the following (there are now unit tests that check these examples and more):

input               formatted
0.032               0.03200
0.00000433          4.3e-6
0.000000            0.00000

This also affects the HW tables etc, which isn't so elegant because of the extra space that e- takes up, but it's probably better than being rounded to zero. e.g. here i

11:04P   1/1.1
 13:01P   2/1.6   1/0.6
-13:02P   0/0.4   0/0.3  0/4e-2
-13:03P   0/0.2   0/0.1  0/3e-2  0/6e-3
+13:02P   0/0.4   0/0.3   0/0.0
+13:03P   0/0.2   0/0.1   0/0.0   0/0.0
 14:01P   3/2.8   2/2.0   1/0.5   0/0.2   2/1.7
 15:01P 12/12.1   9/8.7   2/2.4   1/0.9 15/14.9 32/32.5
         11:04P  13:01P  13:02P  13:03P  14:01P  15:01P

So this will mean that a lot of unit and pipeline test outputs will need to be changed. How does it sound to you?

sjmack · 2024-02-06T20:02:45Z

Thanks for addressing this @alexlancaster!

Is it possible to have "places" 'float' depending on the number of decimal places, for the formatted output to fill in the space/give the user/reviewer the most information when reviewing the .txt outputs (see below)?

For example:

input               formatted
0.0433              0.04330
0.04333             0.04333
0.000004333         4.33e-6
0.0000000004333     4.3e-10

Even if not, this new implementation is definitely an improvement over the previous version. And I agree about not doing actually analyses with the .txt outputs; that's what the TSV outputs are there for.

I agree that the HW matrix is less elegant, but it is definitely more informative. Are the + and - in the far left column of the matrix supposed to be old (+) and new (-)? That seems counter intuitive.

How difficult would it be to dynamically space the columns in the HWE matrix in these .txt outputs to accommodate the maximum width of an o/e value pair or the allele name?

alexlancaster · 2024-02-06T23:09:58Z

Thanks for addressing this @alexlancaster!

Is it possible to have "places" 'float' depending on the number of decimal places, for the formatted output to fill in the space/give the user/reviewer the most information when reviewing the .txt outputs (see below)?

For example:
input               formatted
0.0433              0.04330
0.04333             0.04333
0.000004333         4.33e-6
0.0000000004333     4.3e-10    

Yes, this is the intention. It did a pretty good job of this before, but I just rejiggered it a bit so that it now maximizes the use of any given number of character "places". It's tricky because the exponent length can also vary. So it has to be dynamic: we calculate the precision based on the fixed size of the e character, plus the variable length exponent (typically 1 or 2 characters), and then use whatever is leftover for the precision. I've added your examples to the test cases, see below:

     # input                 output       places
        ('0.032',           '0.03200',  5), # pad out with leading zeros
        ('0.0433',          '0.04330',  5),
        ('0.04333',         '0.04333',  5),
        ('0.000004333',     '4.33e-6',  5), # converts to scientific notation to fit in the 5 character ('places') limit
        ('0.0000000004333', '4.3e-10',  5),
        ('0.00000433',      '0.000004', 6), # does not convert to scientific notation, because we have 6 characters
        ('0.00000491',      '0.000005', 6), # check rounding! 
        ('0.000000433',     '4.33e-7',  6), # again need scientific notation to fit
        ('0.000000',        '0.0000',   4), # handle zero as float, not sci notation

Feel free to add additional corner cases so we can make sure we capture all the variations you want to cover, and I'll put them in the unit test.

Even if not, this new implementation is definitely an improvement over the previous version. And I agree about not doing actually analyses with the .txt outputs; that's what the TSV outputs are there for.

I will add some text to the documentation as part of the PR.

I agree that the HW matrix is less elegant, but it is definitely more informative. Are the + and - in the far left column of the matrix supposed to be old (+) and new (-)? That seems counter intuitive.

Oh that's just an artifact of the order in which I did the diff (I just copy and pasted from the diff generated by the unit test, I should probably reorder it). Don't get hung up on that. I was just using the neat markdown feature where you can cut and paste diffs and have them be color coded.

How difficult would it be to dynamically space the columns in the HWE matrix in these .txt outputs to accommodate the maximum width of an o/e value pair or the allele name?

It's fiddly - it already does this partially, but it was tuned before we added scientific notation. I'd rather get this feature in, and maybe you can spawn another feature request for that? So long as it's readable and doesn't overlap other characters, that's what I care about most now.

alexlancaster · 2024-02-08T04:39:26Z

@sjmack I've created PR #192 that should fix most things, please take a moment to review it. For the HW tables, I added an extra space to all cells to handle the exponential notation e-. This also has the net benefit of making the labels better spaced out, but the tables are a bit wider. I think it works. See some of the examples in the diffs for the updated "gold" output for the unit/pipeline tests to see what they look like:

https://github.com/alexlancaster/pypop/pull/192/files#diff-2dda9b66be4d7310af66262c7e2cb5dd3048c0bbc564790f8d334014c636a473

You can see all the differences in the diffs for the PR.

If it looks good, I'll merge it either in the next patch release (1.0.1), or if you think it's different enough to be considered a "minor" release - i.e. user visible changes, it might be 1.1.0. At one level it's a just bug fix, but it's also an enhancement and somewhat of a change in behaviour.

alexlancaster added the enhancement New feature or request label Jul 30, 2023

alexlancaster assigned alexlancaster and sjmack Jul 30, 2023

alexlancaster mentioned this issue Jul 30, 2023

HaploStats doesn't report correct low frequency haplotypes #38

Closed

alexlancaster added this to the Nice-To-Have milestone Jul 30, 2023

alexlancaster changed the title ~~preserve scientific notation when truncating viewable haplotype frequencies~~ preserve scientific notation when truncating viewable frequencies Feb 6, 2024

alexlancaster changed the title ~~preserve scientific notation when truncating viewable frequencies~~ switch to scientific notation when frequencies can't be displayed as decimals Feb 6, 2024

alexlancaster linked a pull request Feb 7, 2024 that will close this issue

switch to scientific notation when frequencies can't be displayed as decimals #192

Merged

alexlancaster closed this as completed in #192 Feb 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

switch to scientific notation when frequencies can't be displayed as decimals #104

switch to scientific notation when frequencies can't be displayed as decimals #104

alexlancaster commented Jul 30, 2023

alexlancaster commented Feb 6, 2024 •

edited

Loading

sjmack commented Feb 6, 2024

alexlancaster commented Feb 6, 2024 •

edited

Loading

alexlancaster commented Feb 8, 2024

switch to scientific notation when frequencies can't be displayed as decimals #104

switch to scientific notation when frequencies can't be displayed as decimals #104

Comments

alexlancaster commented Jul 30, 2023

alexlancaster commented Feb 6, 2024 • edited Loading

sjmack commented Feb 6, 2024

alexlancaster commented Feb 6, 2024 • edited Loading

alexlancaster commented Feb 8, 2024

alexlancaster commented Feb 6, 2024 •

edited

Loading

alexlancaster commented Feb 6, 2024 •

edited

Loading