Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HWP output issues: missing common genotypes and incorrect Guo & Thompson stats #17

Closed
alexlancaster opened this issue Jul 6, 2017 · 8 comments
Labels
bug Something isn't working, should not be used for new features: use "enhancement" for those

Comments

@alexlancaster
Copy link
Owner

Transferring from issue #4 comment #4 (comment) originally by @sjmack:

However, I have constructed a test data file (the controls from the BIGDAWG synthetic datafile) that reveals several issues with the current HW implementations (vs version 0.7.0).

I'm attaching two versions of this test file:
BIGDAWG_SynthControl_Data.pop.txt
and
BIGDAWG_SynthControl_Data_dash.pop.txt
And the associated .ini file:
WS_BDCtrl_Test_HW.ini.txt
Be sure to remove the .txt suffices.

The difference between the two datasets is that the *_dash.pop file has the colons converted to dashes. I did this so that I could compare the current developmental version of PyPop on my Mac to v0.7.0 running on my PC. I could only run the *_dash.pop file on my PC.

I have three set of results. The git. versions were generated using this development version of PyPop, and the 070. versions with the current release version.

First of all, you will notice that there are no Common Genotypes being generated with the development version. The results below show the dash datasets, but the same happens when colons are included for the developmental version.

Compare (Git):

------------------------------------------------------------------------------
Common heterozygotes by allele
       01-01-01-01         152      152.25        0.00        0.9839      
       02-01-01-01          56       56.32        0.00        0.9658      
          02-05-01         132      131.94        0.00        0.9957      
       03-01-01-01         135      134.51        0.00        0.9662      
          03-01-03         102      102.18        0.00        0.9858      
       11-01-01-01          57       57.26        0.00        0.9723      
       11-01-01-02          57       57.26        0.00        0.9723      
          23-01-01          43       43.99        0.02        0.8814      
       24-02-01-01         145      144.70        0.00        0.9801      
          24-02-04          56       56.32        0.00        0.9658      
          25-01-01          38       37.28        0.01        0.9061      
          26-01-01          92       91.40        0.00        0.9501      
             26-08         105      104.85        0.00        0.9885      
       29-01-01-01          56       56.32        0.00        0.9658      
       29-02-01-02         102      102.18        0.00        0.9858      
       31-01-02-01          92       91.40        0.00        0.9501      
       31-01-02-02           9        8.96        0.00        0.9892      
          32-01-01          30       29.55        0.01        0.9342      
             32-02         183      182.44        0.00        0.9667      
          33-01-01           9        8.96        0.00        0.9892      
          68-01-01          76       76.81        0.01        0.9267      
             68-06         155      154.75        0.00        0.9838      

------------------------------------------------------------------------------
Common genotypes
             Total           0        0.00
------------------------------------------------------------------------------

With (v0.7.0):

------------------------------------------------------------------------------
Common heterozygotes by allele
       01-01-01-01         152      152.25        0.00        0.9839      
       02-01-01-01          56       56.32        0.00        0.9658      
          02-05-01         132      131.94        0.00        0.9957      
       03-01-01-01         135      134.51        0.00        0.9662      
          03-01-03         102      102.18        0.00        0.9858      
       11-01-01-01          57       57.26        0.00        0.9723      
       11-01-01-02          57       57.26        0.00        0.9723      
          23-01-01          43       43.99        0.02        0.8814      
       24-02-01-01         145      144.70        0.00        0.9801      
          24-02-04          56       56.32        0.00        0.9658      
          25-01-01          38       37.28        0.01        0.9061      
          26-01-01          92       91.40        0.00        0.9501      
             26-08         105      104.85        0.00        0.9885      
       29-01-01-01          56       56.32        0.00        0.9658      
       29-02-01-02         102      102.18        0.00        0.9858      
       31-01-02-01          92       91.40        0.00        0.9501      
       31-01-02-02           9        8.96        0.00        0.9892      
          32-01-01          30       29.55        0.01        0.9342      
             32-02         183      182.44        0.00        0.9667      
          33-01-01           9        8.96        0.00        0.9892      
          68-01-01          76       76.81        0.01        0.9267      
             68-06         155      154.75        0.00        0.9838      

------------------------------------------------------------------------------
Common genotypes
-01-01:01-01-01-01           7        6.88        0.00        0.9621      
-01-01-01:02-05-01          11       11.76        0.05        0.8241      
-01-01:03-01-01-01          12       12.01        0.00        0.9975      
-01-01-01:03-01-03           9        8.95        0.00        0.9856      
-01-01:24-02-01-01          13       13.00        0.00        0.9989      
-01-01-01:26-01-01           8        7.95        0.00        0.9864      
 01-01-01-01:26-08           9        9.19        0.00        0.9488      
-01-01:29-02-01-02           9        8.95        0.00        0.9856      
-01-01:31-01-02-01           8        7.95        0.00        0.9864      
 01-01-01-01:32-02          16       16.82        0.04        0.8424      
-01-01-01:68-01-01           7        6.63        0.02        0.8847      
 01-01-01-01:68-06          14       14.00        0.00        0.9998      
 02-01-01-01:32-02           6        5.88        0.00        0.9590      
 02-05-01:02-05-01           5        5.03        0.00        0.9890      
-05-01:03-01-01-01          10       10.27        0.01        0.9318      
 02-05-01:03-01-03           8        7.65        0.02        0.9001      
-05-01:24-02-01-01          11       11.12        0.00        0.9702      
 02-05-01:26-01-01           7        6.80        0.01        0.9396      
    02-05-01:26-08           8        7.87        0.00        0.9617      
-05-01:29-02-01-02           8        7.65        0.02        0.9001      
-05-01:31-01-02-01           7        6.80        0.01        0.9396      
    02-05-01:32-02          14       14.38        0.01        0.9193      
 02-05-01:68-01-01           6        5.67        0.02        0.8893      
    02-05-01:68-06          12       11.98        0.00        0.9942      
-01-01:03-01-01-01           5        5.25        0.01        0.9146      
-01-01-01:03-01-03           8        7.81        0.00        0.9471      
-01-01:24-02-01-01          12       11.36        0.04        0.8493      
-01-01-01:26-01-01           7        6.95        0.00        0.9837      
 03-01-01-01:26-08           8        8.03        0.00        0.9911      
-01-01:29-02-01-02           8        7.81        0.00        0.9471      
-01-01:31-01-02-01           7        6.95        0.00        0.9837      
 03-01-01-01:32-02          15       14.69        0.01        0.9351      
-01-01-01:68-01-01           6        5.79        0.01        0.9299      
 03-01-01-01:68-06          12       12.23        0.00        0.9480      
-01-03:24-02-01-01           8        8.46        0.03        0.8741      
 03-01-03:26-01-01           5        5.17        0.01        0.9391      
    03-01-03:26-08           6        5.98        0.00        0.9941      
-01-03:29-02-01-02           6        5.82        0.01        0.9406      
-01-03:31-01-02-01           5        5.17        0.01        0.9391      
    03-01-03:32-02          11       10.94        0.00        0.9856      
    03-01-03:68-06           9        9.11        0.00        0.9715      
 11-01-01-01:32-02           6        5.98        0.00        0.9923      
 11-01-01-02:32-02           6        5.98        0.00        0.9923      
-01-01:24-02-01-01           6        6.15        0.00        0.9518      
-02-01-01:26-01-01           8        7.52        0.03        0.8613      
 24-02-01-01:26-08           9        8.70        0.01        0.9179      
-01-01:29-02-01-02           8        8.46        0.03        0.8741      
-01-01:31-01-02-01           8        7.52        0.03        0.8613      
 24-02-01-01:32-02          16       15.90        0.00        0.9807      
-02-01-01:68-01-01           6        6.27        0.01        0.9149      
 24-02-01-01:68-06          13       13.24        0.00        0.9474      
    24-02-04:32-02           6        5.88        0.00        0.9590      
    26-01-01:26-08           5        5.32        0.02        0.8905      
-01-01:29-02-01-02           5        5.17        0.01        0.9391      
    26-01-01:32-02          10        9.72        0.01        0.9296      
    26-01-01:68-06           8        8.10        0.00        0.9731      
 26-08:29-02-01-02           6        5.98        0.00        0.9941      
 26-08:31-01-02-01           5        5.32        0.02        0.8905      
       26-08:32-02          11       11.24        0.01        0.9420      
       26-08:68-06           9        9.36        0.01        0.9061      
 29-01-01-01:32-02           6        5.88        0.00        0.9590      
-01-02:31-01-02-01           5        5.17        0.01        0.9391      
 29-02-01-02:32-02          11       10.94        0.00        0.9856      
 29-02-01-02:68-06           9        9.11        0.00        0.9715      
 31-01-02-01:32-02          10        9.72        0.01        0.9296      
 31-01-02-01:68-06           8        8.10        0.00        0.9731      
       32-02:32-02          10       10.28        0.01        0.9300      
    32-02:68-01-01           8        8.10        0.00        0.9709      
       32-02:68-06          17       17.12        0.00        0.9770      
    68-01-01:68-06           7        6.75        0.01        0.9223      
       68-06:68-06           7        7.13        0.00        0.9624      
             Total         612      612.81
------------------------------------------------------------------------------

In addition, the stats being reported for the developmental version include errors; especially for the Chen and Diff tests, where obs and exp values are 0. The differences in the p-values for the mcmc results probably stem from the Markov-Chain, but I only did each one once, so I'm not certain.

Compare (Git):

3.3. Guo and Thompson HardyWeinberg output (mcmc) [DQB1]
--------------------------------------------------------
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error:       0
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12+03-02-12 (0/0.000000) 0.0007*** 0.0007***
05-03-01-01+03-02-12 (0/0.000000) 0.0000***** 0.0000*****
05-03-01-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+06-02-01 (0/0.000000) 0.0001*** 0.0001***
06-03-01+04-02-01 (0/0.000000) 0.0000***** 0.0000*****
06-05-01+05-03-01-01 (0/0.000000) 0.0000**** 0.0000*****
06-05-01+06-05-01 (0/0.000000) 0.0049** 0.0049**


3.4. Guo and Thompson HardyWeinberg output (monte-carlo) [DQB1]
---------------------------------------------------------------
Steps in Monte-Carlo randomization: 100000
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12+03-02-12 (0/0.000000) 0.0017** 0.0017**
05-03-01-01+03-02-12 (0/0.000000) 0.0000***** 0.0000*****
05-03-01-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+05-03-01-01 (0/0.000000) 0.0000***** 0.0000*****
06-02-01+06-02-01 (0/0.000000) 0.0007*** 0.0007***
06-03-01+04-02-01 (0/0.000000) 0.0000***** 0.0000*****
06-05-01+05-03-01-01 (0/0.000000) 0.0000**** 0.0000****
06-05-01+06-05-01 (0/0.000000) 0.0016** 0.0016**

to (version 0.7.0):

3.3. Guo and Thompson HardyWeinberg output (mcmc) [DQB1]
--------------------------------------------------------
Total steps in MCMC: 1000000
Dememorization steps: 2000
Number of Markov chain samples: 1000
Markov chain sample size: 1000
Std. error:       0
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12:03-02-12:  (18/8.818363) 0.0017** 0.0017**
05-03-01-01:03-02-12:  (0/18.105788) 0.0000***** 0.0000*****
05-03-01-01:05-03-01-01:  (39/9.293663) 0.0000***** 0.0000*****
06-02-01:05-03-01-01:  (0/23.499002) 0.0000***** 0.0000*****
06-02-01:06-02-01:  (27/14.854291) 0.0000***** 0.0000*****
06-03-01:04-02-01:  (8/0.287425) 0.0000***** 0.0000*****
06-05-01:05-03-01-01:  (0/18.105788) 0.0000***** 0.0000*****
06-05-01:06-05-01:  (18/8.818363) 0.0011** 0.0011**


3.4. Guo and Thompson HardyWeinberg output (monte-carlo) [DQB1]
---------------------------------------------------------------
Steps in Monte-Carlo randomization: 100000
p-value (overall): 0.0000*****

Individual genotype p-values found to be significant
Genotype (observed/expected) [Chen's pval] [diff pval]
03-02-12:03-02-12:  (18/8.818363) 0.0017** 0.0017**
05-03-01-01:03-02-12:  (0/18.105788) 0.0000***** 0.0000*****
05-03-01-01:05-03-01-01:  (39/9.293663) 0.0000***** 0.0000*****
06-02-01:05-03-01-01:  (0/23.499002) 0.0000***** 0.0000*****
06-02-01:06-02-01:  (27/14.854291) 0.0007*** 0.0007***
06-03-01:04-02-01:  (8/0.287425) 0.0000***** 0.0000*****
06-05-01:05-03-01-01:  (0/18.105788) 0.0000**** 0.0000****
06-05-01:06-05-01:  (18/8.818363) 0.0016** 0.0016**

Here are the results:

BIGDAWG_SynthControl_Data-out.git.txt
BIGDAWG_SynthControl_Data-out.git.xml.txt
BIGDAWG_SynthControl_Data_dash-out.git.txt
BIGDAWG_SynthControl_Data_dash-out.git.xml.txt
BIGDAWG_SynthControl_Data_dash-out.070.txt
BIGDAWG_SynthControl_Data_dash-out.070.xml.txt

Again, remove the .txt from the .XML filenames.

@alexlancaster alexlancaster added the bug Something isn't working, should not be used for new features: use "enhancement" for those label Jul 6, 2017
@alexlancaster alexlancaster added this to the High milestone Jul 6, 2017
@alexlancaster alexlancaster changed the title HWP output isssues: missing common genotypes and incorrect Guo & Thompson stats HWP output issues: missing common genotypes and incorrect Guo & Thompson stats Jul 6, 2017
alexlancaster added a commit that referenced this issue Jul 11, 2017
adjust existing tests accordingly
add new test using .ini and .pop file from @sjmack
move process call into base test module to make command-line easier
@alexlancaster
Copy link
Owner Author

alexlancaster commented Jul 11, 2017

@sjmack: the problem had a common origin: generating the genotype output table, the fix I just pushed to master (i.e. just git pull and ./setup.py build again) should fix both the common genotype as well as the output in the Guo & Thompson. I added one of your files (the non-dash one) to the test suite (can you re-run py.test -s -v?). Is it worth also adding the dash version as well?

In any case, good catch! more test data files like this will improve our overall test coverage, and catch more issues like this. I would encourage @kosoegawa and others to please open up issues and add files like this.

@sjmack
Copy link
Collaborator

sjmack commented Jul 11, 2017

The dash version is only useful for testing against older versions of PyPop, but that is necessary for validation, so ....

I'll run some tests!

@alexlancaster
Copy link
Owner Author

I'll add to the test suite as well then, since it should handle dashes as well as colons. Right now, we would have problems only if you used the genotype separator (which I've hardcoded as a tilde ~) within the allele identifier. Eventually even this should be removed (see #14) so there would be no "special" character that you would have to avoid, but this would be a more major internal architectural change.

@alexlancaster
Copy link
Owner Author

@sjmack is this issue fixed from your POV? if so, I'll close.

@sjmack
Copy link
Collaborator

sjmack commented Jul 12, 2017

The missing common genotypes and incorrect stats issues are resolved; however Issue #19 discusses an additional issue with the common genotypes (which I should have noted using the *_dash.pop version of the data, but didn't).

py.test -s -v also shows some fails as py.test did earlier.

:pypop sjmack$ py.test -s -v
================================================================= test session starts ==================================================================
platform darwin -- Python 2.7.13, pytest-3.1.2, py-1.4.34, pluggy-0.4.0 -- /opt/local/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python
cachedir: .cache
rootdir: /Applications/PyPop/pypop, inifile:
collected 5 items 

tests/test_AlleleColon.py::test_AlleleColon_HardyWeinberg PASSED
tests/test_AlleleColon.py::test_AlleleColon_Emhaplofreq PASSED
tests/test_GenotypeCommon.py::test_GenotypeCommon_HardyWeinberg FAILED
tests/test_GenotypeCommon.py::test_GenotypeCommonDash_HardyWeinberg FAILED
tests/test_gthwe.py::test_gthwe SKIPPED

======================================================================= FAILURES =======================================================================
__________________________________________________________ test_GenotypeCommon_HardyWeinberg ___________________________________________________________

    def test_GenotypeCommon_HardyWeinberg():
        exit_code = base.run_pypop_process('./tests/data/WS_BDCtrl_Test_HW.ini', './tests/data/BIGDAWG_SynthControl_Data.pop')
        # check exit code
        assert exit_code == 0
        # compare with md5sum of output file
>       assert hashlib.md5(open("BIGDAWG_SynthControl_Data-out.txt", 'rb').read()).hexdigest() == '276263b0d0d9fc03b77826388d70510d'
E       AssertionError: assert 'c0d952cbc16e...3c849ea4e2095' == '276263b0d0d9f...826388d70510d'
E         - c0d952cbc16e90f5bd03c849ea4e2095
E         + 276263b0d0d9fc03b77826388d70510d

tests/test_GenotypeCommon.py:11: AssertionError
________________________________________________________ test_GenotypeCommonDash_HardyWeinberg _________________________________________________________

    def test_GenotypeCommonDash_HardyWeinberg():
        exit_code = base.run_pypop_process('./tests/data/WS_BDCtrl_Test_HW.ini', './tests/data/BIGDAWG_SynthControl_Data_dash.pop')
        # check exit code
        assert exit_code == 0
        # compare with md5sum of output file
>       assert hashlib.md5(open("BIGDAWG_SynthControl_Data_dash-out.txt", 'rb').read()).hexdigest() == 'b0f4247a2a67a65d0109b6448427cc28'
E       AssertionError: assert '93ff300c62a3...bd001ce8c592b' == 'b0f4247a2a67a...9b6448427cc28'
E         - 93ff300c62a3056c4cdbd001ce8c592b
E         + b0f4247a2a67a65d0109b6448427cc28

tests/test_GenotypeCommon.py:18: AssertionError
=================================================== 2 failed, 2 passed, 1 skipped in 157.40 seconds ====================================================

@alexlancaster
Copy link
Owner Author

I'll respin the tests as per my comment in #4 and if that works, I'll close this particular issue.

@alexlancaster
Copy link
Owner Author

Hi @sjmack, let me know if the tests look OK and I'll close.

@sjmack
Copy link
Collaborator

sjmack commented Jul 17, 2017

py.test: 11 passed, 1 skipped in 75.75 seconds.

Looks good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working, should not be used for new features: use "enhancement" for those
Projects
None yet
Development

No branches or pull requests

2 participants