switch to scientific notation when frequencies can't be displayed as decimals (#192)

* Python extension: convert to sci notation XSLT 1.0

* use new `format_number_fixed_width` Python extension: use it for the formatting if the `use-python-extensions` parameter is set, otherwise fall back to the previous processing (which skips scientific notation for small numbers and just truncates them)

* add more unit tests for the function

* handle zero values as plain floats (not sci notation)

* enable sci notation format extension by default

* update all unit test output and MD5 where appropriate

* pad the HW tables by an additional space in each cell to account for scientific notation
alexlancaster authored Feb 9, 2024
1 parent 287b6e5 commit 6126a29
Showing 18 changed files with 795 additions and 589 deletions.
101 changes: 101 additions & 0 deletions src/PyPop/xslt/__init__.py
@@ -0,0 +1,101 @@
#!/usr/bin/env python

# This file is part of PyPop

# Copyright (C) 2024
# All Rights Reserved.

# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2, or (at your option)
# any later version.

# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.

# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
# 02111-1307, USA.

# IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT,
# INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING
# LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS
# DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY
# OF SUCH DAMAGE.

# REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
# FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING
# DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS
# IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT,
# UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

"""
Python XSLT extensions for handling things outside the scope of XSLT 1.0
"""

from lxml import etree
from math import floor, log10, inf
from numpy import format_float_scientific

ns = etree.FunctionNamespace('http://pypop.org/lxml/functions')
ns.prefix = 'es'

def num_zeros(decimal):
    return inf if decimal == 0 else -floor(log10(abs(decimal))) - 1

def exponent_len(num):
    # length of exponent, e.g.
    # "e-3', would be two characters ('-3')
    # "e-10" would be 3, ('-10')
    return len(str(floor(log10(num))))

@ns
def format_number_fixed_width(context, *args):

    num = float(args[0])
    places = int(args[1])
    zeros_before_sig_figs = num_zeros(num)

    if zeros_before_sig_figs >= places and zeros_before_sig_figs != inf:
        # get exponent size
        exponent_size = exponent_len(num)
        # need to reserve space for 'e', plus exponent characters
        total_exponent_size = 1 + exponent_size
        # use all remaining characters for precision
        precision = places - total_exponent_size if places >= total_exponent_size else 0
        # now format it
        retval = format_float_scientific(num, exp_digits=1, precision=precision, trim='-')
    else:
        retval = "{0:.{1}f}".format(num, places)
    return retval

if __name__ == "__main__":

    # some tests

    ns['format_number_fixed_width'] = format_number_fixed_width

    root = etree.XML('<a><b>0.0000043</b></a>')
    doc = etree.ElementTree(root)

    xslt = etree.XSLT(etree.XML('''
      <stylesheet version="1.0"
             xmlns="http://www.w3.org/1999/XSL/Transform"
             xmlns:es="http://pypop.org/lxml/functions">
        <output method="text" encoding="ASCII"/>
        <template match="/">
          <text>Yep [</text>
          <value-of select="es:format_number_fixed_width(string(/a/b), 5)"/>
          <text>]</text>
        </template>
      </stylesheet>
    '''))

    print(xslt(doc))
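
The branching in `format_number_fixed_width` can be tried out without lxml or NumPy. The following is a minimal standard-library sketch, not the committed code: `fixed_width` and `sci_trimmed` are hypothetical names, and `sci_trimmed` only approximates `numpy.format_float_scientific(..., exp_digits=1, trim='-')` for the small-magnitude values that reach the scientific branch. Unlike the committed `exponent_len`, the sketch takes `abs()` before `log10`, so negative inputs do not raise.

```python
from math import floor, inf, log10


def num_zeros(decimal):
    # zeros between the decimal point and the first significant digit
    return inf if decimal == 0 else -floor(log10(abs(decimal))) - 1


def sci_trimmed(num, precision):
    # assumption: stand-in for numpy.format_float_scientific with
    # exp_digits=1 and trim='-': drop trailing mantissa zeros and
    # leading exponent zeros
    mantissa, exponent = "{0:.{1}e}".format(num, precision).split("e")
    if "." in mantissa:
        mantissa = mantissa.rstrip("0").rstrip(".")
    return "{0}e{1}".format(mantissa, int(exponent))


def fixed_width(num, places):
    num = float(num)
    zeros = num_zeros(num)
    if zeros >= places and zeros != inf:
        # reserve one character for 'e' plus the exponent's own characters
        exponent_chars = 1 + len(str(floor(log10(abs(num)))))
        # whatever is left of 'places' goes to the mantissa precision
        precision = max(places - exponent_chars, 0)
        return sci_trimmed(num, precision)
    # value fits in 'places' decimals (or is zero): plain fixed-point output
    return "{0:.{1}f}".format(num, places)


print(fixed_width(0.0000043, 5))  # 4.3e-6
print(fixed_width(0.03, 1))       # 3e-2
print(fixed_width(0.4, 1))        # 0.4
print(fixed_width(0, 5))          # 0.00000
```

The second call mirrors the `0/3e-2` cells in the regenerated Hardy-Weinberg tables: with one decimal place, a value of about 0.03 would otherwise print as `0.0`.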


2 changes: 1 addition & 1 deletion src/PyPop/xslt/emhaplofreq.xsl
@@ -476,7 +476,7 @@ MODIFICATIONS.

<xsl:value-of select="$haplos-header"/>

<!-- loop through each haplotype by name -->
<!-- loop through each haplotype by frequency -->
<xsl:for-each select="haplotype">
<xsl:sort select="frequency" data-type="number" order="descending"/>

6 changes: 4 additions & 2 deletions src/PyPop/xslt/hardyweinberg.xsl
@@ -482,9 +482,11 @@ MODIFICATIONS.
</xsl:variable>

<!-- calculate the width required for each cell, this twice the maximum -->
<!-- length of the "observed" cell 'XXX' plus space needed for chars -->
<!-- length of the "observed" cell 'XXX' plus a space for scientific notation in expected -->
<!-- plus space needed for chars -->
<!-- e.g.: XXX/XXX.0 and a padding space -->
<xsl:variable name="cell-width-max" select="$observed-max * 2 + 4"/>
<!-- FIXME: this is a big kludgy, really should also compute the expected-max including sci notation -->
<xsl:variable name="cell-width-max" select="$observed-max * 2 + 1 + 4"/>

<!-- choose the greater of the allele name or cell-width-max for the -->
<!-- standard width -->
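
The width arithmetic in this hunk can be checked with a small worked example. This is only a sketch: `cell_width_max` is a hypothetical Python mirror of the XSLT expression, and `observed_max = 1` is an assumption about the test data (single-digit observed counts).

```python
def cell_width_max(observed_max, sci_padding=1):
    # twice the widest observed count, the extra column reserved for
    # scientific notation, and 4 characters for '/', '.0' and padding
    return observed_max * 2 + sci_padding + 4


old_width = cell_width_max(1, sci_padding=0)  # expression before this commit
new_width = cell_width_max(1)                 # expression after this commit
cell = "0/3e-2"  # a cell taken from the updated test output

print(old_width - len(cell))  # 0
print(new_width - len(cell))  # 1
```

Under that assumption the old width leaves no separating space after a cell such as `0/3e-2`, while the extra column restores it.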
20 changes: 16 additions & 4 deletions src/PyPop/xslt/lib.xsl
@@ -33,8 +33,11 @@ MODIFICATIONS.
-->
<xsl:stylesheet
version='1.0'
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:es="http://pypop.org/lxml/functions">

<xsl:param name="use-python-extensions" select="1"/>

<!-- contains a library of named templates not specific to any DTD or
XML schema -->

@@ -100,9 +103,18 @@ MODIFICATIONS.
<xsl:with-param name="length" select="$places + 2"/>
</xsl:call-template>
</xsl:variable>
<xsl:value-of
select="format-number((round($factor * $node) div $factor),
$format)"/>

<xsl:choose>
<xsl:when test="$use-python-extensions = 1">
<!-- if enabled, use Python extension to use scientific notation if necessary -->
<xsl:value-of select="es:format_number_fixed_width(string($node), $places)"/>
</xsl:when>
<xsl:otherwise>
<!-- otherwise, as a fallback, just round it (doesn't do the scientific notation) -->
<xsl:value-of
select="format-number((round($factor * $node) div $factor),$format)"/>
</xsl:otherwise>
</xsl:choose>
</xsl:when>
<!-- if not a number (NaN) return as text -->
<xsl:otherwise><xsl:value-of select="$node"/></xsl:otherwise>
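
The two branches of the `xsl:choose` can differ even for values that never need scientific notation, because they break rounding ties differently. XPath 1.0 `round()` sends ties toward positive infinity, whereas Python's `format` rounds the exact binary value half-to-even. The sketch below uses hypothetical function names to imitate each branch in plain Python. It may explain why some cells in the regenerated test output move from `0.3` to `0.2`, assuming the underlying expected value is exactly 0.25 (the raw values are not shown in the diff).

```python
from math import floor


def xslt_fallback_format(value, places):
    # imitates format-number(round($factor * $node) div $factor, $format)
    factor = 10 ** places
    # XPath 1.0 round(): nearest integer, ties go toward positive infinity
    rounded = floor(value * factor + 0.5) / factor
    return "{0:.{1}f}".format(rounded, places)


def python_extension_format(value, places):
    # the non-scientific branch of format_number_fixed_width
    return "{0:.{1}f}".format(value, places)


print(xslt_fallback_format(0.25, 1))     # 0.3
print(python_extension_format(0.25, 1))  # 0.2
print(xslt_fallback_format(0.26, 1), python_extension_format(0.26, 1))  # 0.3 0.3
```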
9 changes: 6 additions & 3 deletions tests/base.py
@@ -62,19 +62,22 @@ def abspath_test_data(filename):
def filecmp_ignore_newlines(out_filename, gold_out_filename):

    l1 = l2 = True
    retval = True # default to match, unless there is a diff
    # opening up files defaults to 'universal newlines' this ignores OS-specific newline differences
    with open(out_filename, 'r') as f1, open(gold_out_filename, 'r') as f2:
        while l1 and l2:
            l1 = f1.readline()
            l2 = f2.readline()
            if l1 != l2:
                # generate the full-diff
                diff = unified_diff(open(out_filename, 'r').readlines(), open(gold_out_filename, 'r').readlines())
                diff = unified_diff(open(gold_out_filename, 'r').readlines(), open(out_filename, 'r').readlines())
                delta = ''.join(diff)
                print (delta)

                return False
    return True
                retval = False # mismatch
                break

    return retval
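
For reference, the comparison helper after this change can be written compactly as below. This is a sketch under assumptions rather than the code in `tests/base.py`: the name `files_match_ignoring_newlines` and the `fromfile`/`tofile` labels are illustrative additions. It keeps the argument order introduced by the commit (gold file first), so `-` lines in the printed diff are expected output and `+` lines are actual output.

```python
import os
import tempfile
from difflib import unified_diff


def files_match_ignoring_newlines(out_filename, gold_out_filename):
    """Return True when both text files contain the same lines."""
    # text mode uses universal newlines, so '\r\n' and '\n' compare equal
    with open(out_filename, "r") as f_out, open(gold_out_filename, "r") as f_gold:
        out_lines = f_out.readlines()
        gold_lines = f_gold.readlines()
    if out_lines == gold_lines:
        return True
    # gold first: '-' marks expected lines, '+' marks what was produced
    diff = unified_diff(gold_lines, out_lines, fromfile="gold", tofile="out")
    print("".join(diff))
    return False


# small demonstration with throwaway files
tmpdir = tempfile.mkdtemp()
out_path = os.path.join(tmpdir, "out.txt")
gold_path = os.path.join(tmpdir, "gold.txt")
with open(out_path, "w", newline="") as f:
    f.write("0/3e-2\r\n")   # Windows-style line ending
with open(gold_path, "w", newline="") as f:
    f.write("0/3e-2\n")     # Unix-style line ending
print(files_match_ignoring_newlines(out_path, gold_path))  # True
```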

def filecmp_list_of_files(filename_list, gold_out_directory):

@@ -63,15 +63,15 @@ Total 1.00000 16 | Total 1.00000 16
----------------------
Table of genotypes, format of each cell is: observed/expected.

01:01 0/0.0
02:01 0/0.3 1/0.5
02:10 0/0.1 0/0.5 0/0.1
02:18 1/0.1 0/0.3 0/0.1 0/0.0
03:01 0/0.1 0/0.5 1/0.3 0/0.1 0/0.1
25:01 0/0.1 1/0.5 0/0.3 0/0.1 0/0.3 0/0.1
32:04 0/0.2 0/0.8 1/0.4 0/0.2 1/0.4 1/0.4 0/0.3
68:14 0/0.1 1/0.3 0/0.1 0/0.1 0/0.1 0/0.1 0/0.2 0/0.0
01:01 02:01 02:10 02:18 03:01 25:01 32:04 68:14
01:01 0/3e-2
02:01 0/0.2 1/0.5
02:10 0/0.1 0/0.5 0/0.1
02:18 1/6e-2 0/0.2 0/0.1 0/3e-2
03:01 0/0.1 0/0.5 1/0.2 0/0.1 0/0.1
25:01 0/0.1 1/0.5 0/0.2 0/0.1 0/0.2 0/0.1
32:04 0/0.2 0/0.8 1/0.4 0/0.2 1/0.4 1/0.4 0/0.3
68:14 0/6e-2 1/0.2 0/0.1 0/6e-2 0/0.1 0/0.1 0/0.2 0/3e-2
01:01 02:01 02:10 02:18 03:01 25:01 32:04 68:14
[Cols: 1 to 8]

Observed Expected Chi-square DoF p-value
@@ -85,7 +85,7 @@ Table of genotypes, format of each cell is: observed/expected.
Common + lumped Value not calculated.

------------------------------------------------------------------------------------------
All heterozygotes 7 6.75 0.01 1 0.9233
All heterozygotes 7 6.75 9e-3 1 0.9233
------------------------------------------------------------------------------------------
Common heterozygotes by allele

@@ -124,16 +124,16 @@ Total 1.00000 20 | Total 1.00000 20
----------------------
Table of genotypes, format of each cell is: observed/expected.

01:02 0/0.4
02:025 1/0.4 0/0.1
03:07 0/0.8 0/0.4 1/0.4
06:05 0/0.4 0/0.2 1/0.4 0/0.1
07:12 2/0.4 0/0.2 0/0.4 0/0.2 0/0.1
08:04 0/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/0.0
12:02 0/0.4 1/0.2 0/0.4 0/0.2 0/0.2 1/0.1 0/0.1
15:07 0/0.4 0/0.2 1/0.4 1/0.2 0/0.2 0/0.1 0/0.2 0/0.1
18:01 1/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/0.1 0/0.1 0/0.1 0/0.0
01:0202:025 03:07 06:05 07:12 08:04 12:02 15:07 18:01
01:02 0/0.4
02:025 1/0.4 0/0.1
03:07 0/0.8 0/0.4 1/0.4
06:05 0/0.4 0/0.2 1/0.4 0/0.1
07:12 2/0.4 0/0.2 0/0.4 0/0.2 0/0.1
08:04 0/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/3e-2
12:02 0/0.4 1/0.2 0/0.4 0/0.2 0/0.2 1/0.1 0/0.1
15:07 0/0.4 0/0.2 1/0.4 1/0.2 0/0.2 0/0.1 0/0.2 0/0.1
18:01 1/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/5e-2 0/0.1 0/0.1 0/3e-2
01:02 02:025 03:07 06:05 07:12 08:04 12:02 15:07 18:01
[Cols: 1 to 9]

Observed Expected Chi-square DoF p-value
@@ -61,13 +61,13 @@ Total1.00000 16 | Total1.00000 16
----------------------
Table of genotypes, format of each cell is: observed/expected.

01 0/0.0
02 1/0.4 1/1.5
03 0/0.1 1/0.9 0/0.1
25 0/0.1 1/0.9 0/0.3 0/0.1
32 0/0.2 1/1.3 1/0.4 1/0.4 0/0.3
68 0/0.1 1/0.4 0/0.1 0/0.1 0/0.2 0/0.0
01 02 03 25 32 68
01 0/3e-2
02 1/0.4 1/1.5
03 0/0.1 1/0.9 0/0.1
25 0/0.1 1/0.9 0/0.2 0/0.1
32 0/0.2 1/1.3 1/0.4 1/0.4 0/0.3
68 0/6e-2 1/0.4 0/0.1 0/0.1 0/0.2 0/3e-2
01 02 03 25 32 68
[Cols: 1 to 6]

Observed Expected Chi-square DoF p-value
@@ -120,16 +120,16 @@ Total1.00000 20 | Total1.00000 20
----------------------
Table of genotypes, format of each cell is: observed/expected.

01 0/0.4
02 1/0.4 0/0.1
03 0/0.8 0/0.4 1/0.4
06 0/0.4 0/0.2 1/0.4 0/0.1
07 2/0.4 0/0.2 0/0.4 0/0.2 0/0.1
08 0/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/0.0
12 0/0.4 1/0.2 0/0.4 0/0.2 0/0.2 1/0.1 0/0.1
15 0/0.4 0/0.2 1/0.4 1/0.2 0/0.2 0/0.1 0/0.2 0/0.1
18 1/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/0.1 0/0.1 0/0.1 0/0.0
01 02 03 06 07 08 12 15 18
01 0/0.4
02 1/0.4 0/0.1
03 0/0.8 0/0.4 1/0.4
06 0/0.4 0/0.2 1/0.4 0/0.1
07 2/0.4 0/0.2 0/0.4 0/0.2 0/0.1
08 0/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/3e-2
12 0/0.4 1/0.2 0/0.4 0/0.2 0/0.2 1/0.1 0/0.1
15 0/0.4 0/0.2 1/0.4 1/0.2 0/0.2 0/0.1 0/0.2 0/0.1
18 1/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/5e-2 0/0.1 0/0.1 0/3e-2
01 02 03 06 07 08 12 15 18
[Cols: 1 to 9]

Observed Expected Chi-square DoF p-value
@@ -63,15 +63,15 @@ Total 1.00000 16 | Total 1.00000 16
----------------------
Table of genotypes, format of each cell is: observed/expected.

01:01 0/0.0
02:01 0/0.3 1/0.5
02:10 0/0.1 0/0.5 0/0.1
02:18 1/0.1 0/0.3 0/0.1 0/0.0
03:012 0/0.1 0/0.5 1/0.3 0/0.1 0/0.1
25:01 0/0.1 1/0.5 0/0.3 0/0.1 0/0.3 0/0.1
32:04 0/0.2 0/0.8 1/0.4 0/0.2 1/0.4 1/0.4 0/0.3
68:14 0/0.1 1/0.3 0/0.1 0/0.1 0/0.1 0/0.1 0/0.2 0/0.0
01:01 02:01 02:10 02:1803:012 25:01 32:04 68:14
01:01 0/3e-2
02:01 0/0.2 1/0.5
02:10 0/0.1 0/0.5 0/0.1
02:18 1/6e-2 0/0.2 0/0.1 0/3e-2
03:012 0/0.1 0/0.5 1/0.2 0/0.1 0/0.1
25:01 0/0.1 1/0.5 0/0.2 0/0.1 0/0.2 0/0.1
32:04 0/0.2 0/0.8 1/0.4 0/0.2 1/0.4 1/0.4 0/0.3
68:14 0/6e-2 1/0.2 0/0.1 0/6e-2 0/0.1 0/0.1 0/0.2 0/3e-2
01:01 02:01 02:10 02:18 03:012 25:01 32:04 68:14
[Cols: 1 to 8]

Observed Expected Chi-square DoF p-value
@@ -85,7 +85,7 @@ Table of genotypes, format of each cell is: observed/expected.
Common + lumped Value not calculated.

------------------------------------------------------------------------------------------
All heterozygotes 7 6.75 0.01 1 0.9233
All heterozygotes 7 6.75 9e-3 1 0.9233
------------------------------------------------------------------------------------------
Common heterozygotes by allele

@@ -124,16 +124,16 @@ Total 1.00000 20 | Total 1.00000 20
----------------------
Table of genotypes, format of each cell is: observed/expected.

01:02 0/0.4
02:025 1/0.4 0/0.1
03:07 0/0.8 0/0.4 1/0.4
06:05 0/0.4 0/0.2 1/0.4 0/0.1
07:12 2/0.4 0/0.2 0/0.4 0/0.2 0/0.1
08:04 0/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/0.0
12:02 0/0.4 1/0.2 0/0.4 0/0.2 0/0.2 1/0.1 0/0.1
15:07 0/0.4 0/0.2 1/0.4 1/0.2 0/0.2 0/0.1 0/0.2 0/0.1
18:01 1/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/0.1 0/0.1 0/0.1 0/0.0
01:0202:025 03:07 06:05 07:12 08:04 12:02 15:07 18:01
01:02 0/0.4
02:025 1/0.4 0/0.1
03:07 0/0.8 0/0.4 1/0.4
06:05 0/0.4 0/0.2 1/0.4 0/0.1
07:12 2/0.4 0/0.2 0/0.4 0/0.2 0/0.1
08:04 0/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/3e-2
12:02 0/0.4 1/0.2 0/0.4 0/0.2 0/0.2 1/0.1 0/0.1
15:07 0/0.4 0/0.2 1/0.4 1/0.2 0/0.2 0/0.1 0/0.2 0/0.1
18:01 1/0.2 0/0.1 0/0.2 0/0.1 0/0.1 0/5e-2 0/0.1 0/0.1 0/3e-2
01:02 02:025 03:07 06:05 07:12 08:04 12:02 15:07 18:01
[Cols: 1 to 9]

Observed Expected Chi-square DoF p-value