Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible stochastic failure related to bad linkage matrix #59

Open
ebolyen opened this issue Aug 29, 2017 · 3 comments
Open

Possible stochastic failure related to bad linkage matrix #59

ebolyen opened this issue Aug 29, 2017 · 3 comments
Labels
type:bug Something is wrong.

Comments

@ebolyen
Copy link
Member

ebolyen commented Aug 29, 2017

Bug Description
It appears that it is possible for the visualizer to fail when it finds a "perfectly bad" split of the test and training data.

Expected Behavior
@nbokulich says that this is very unlikely to happen, but we should look into reproducing and raising a different error. If a user runs into this, they should be able to just re-run their command.

Screenshots

Test failure/traceback
_____________________ EstimatorsTests.test_maturity_index ______________________

self = <q2_sample_classifier.tests.test_classifier.EstimatorsTests testMethod=test_maturity_index>

    def test_maturity_index(self):
        maturity_index(self.temp_dir.name, self.table_ecam_fp, self.md_ecam_fp,
                       category='month', group_by='delivery', n_jobs=1,
>                      control='Vaginal', test_size=0.4)

test-env/lib/python3.5/site-packages/q2_sample_classifier/tests/test_classifier.py:250: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test-env/lib/python3.5/site-packages/q2_sample_classifier/classify.py:145: in maturity_index
    accuracy, output_dir, maz_stats=maz_stats)
test-env/lib/python3.5/site-packages/q2_sample_classifier/utilities.py:412: in _visualize_maturity_index
    g = _clustermap_from_dataframe(top, metadata, group_by, category)
test-env/lib/python3.5/site-packages/q2_sample_classifier/visuals.py:69: in _clustermap_from_dataframe
    row_cluster=False)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:1301: in clustermap
    **kwargs)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:1131: in plot
    row_linkage=row_linkage, col_linkage=col_linkage)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:1032: in plot_dendrograms
    axis=1, ax=self.ax_col_dendrogram, linkage=col_linkage)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:746: in dendrogram
    label=label, rotate=rotate)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:567: in __init__
    self.dendrogram = self.calculate_dendrogram()
test-env/lib/python3.5/site-packages/seaborn/matrix.py:644: in calculate_dendrogram
    color_threshold=-np.inf)
test-env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py:2296: in dendrogram
    is_valid_linkage(Z, throw=True, name='Z')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

Z = array([], shape=(0, 4), dtype=float64), warning = False, throw = True
name = 'Z'

    def is_valid_linkage(Z, warning=False, throw=False, name=None):
        """
        Checks the validity of a linkage matrix.
    
        A linkage matrix is valid if it is a two dimensional array (type double)
        with :math:`n` rows and 4 columns.  The first two columns must contain
        indices between 0 and :math:`2n-1`. For a given row ``i``, the following
        two expressions have to hold:
    
        .. math::
    
            0 \\leq \\mathtt{Z[i,0]} \\leq i+n-1
            0 \\leq Z[i,1] \\leq i+n-1
    
        I.e. a cluster cannot join another cluster unless the cluster being joined
        has been generated.
    
        Parameters
        ----------
        Z : array_like
            Linkage matrix.
        warning : bool, optional
            When True, issues a Python warning if the linkage
            matrix passed is invalid.
        throw : bool, optional
            When True, throws a Python exception if the linkage
            matrix passed is invalid.
        name : str, optional
            This string refers to the variable name of the invalid
            linkage matrix.
    
        Returns
        -------
        b : bool
            True if the inconsistency matrix is valid.
    
        """
        Z = np.asarray(Z, order='c')
        valid = True
        name_str = "%r " % name if name else ''
        try:
            if type(Z) != np.ndarray:
                raise TypeError('Passed linkage argument %sis not a valid array.' %
                                name_str)
            if Z.dtype != np.double:
                raise TypeError('Linkage matrix %smust contain doubles.' % name_str)
            if len(Z.shape) != 2:
                raise ValueError('Linkage matrix %smust have shape=2 (i.e. be '
                                 'two-dimensional).' % name_str)
            if Z.shape[1] != 4:
                raise ValueError('Linkage matrix %smust have 4 columns.' % name_str)
            if Z.shape[0] == 0:
>               raise ValueError('Linkage must be computed on at least two '
                                 'observations.')
E                                ValueError: Linkage must be computed on at least two observations.

test-env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py:1459: ValueError
==================== 1 failed, 12 passed in 150.50 seconds =====================

Comments
This is probably going to be difficult to fix.

References
It appears

@ebolyen ebolyen added the type:bug Something is wrong. label Aug 29, 2017
@jairideout
Copy link
Member

For now, we will set a seed in the unit tests to prevent stochastic failure, and revert that commit when this bug is fixed.

@ebolyen
Copy link
Member Author

ebolyen commented Aug 29, 2017

PR #60 is in to set the seed @jairideout mentioned. It is possible we won't be able to revert it (if say the solution is to just raise an explanatory error, but we'll learn more later).

jairideout pushed a commit that referenced this issue Aug 29, 2017
Temporary "fix" for #59. This prevents stochastic test failures.
@mortonjt
Copy link

One possibility is to have the error message print out a histogram of this perfectly bad split -- there is a chance the user specified the wrong column

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug Something is wrong.
Projects
None yet
Development

No branches or pull requests

3 participants