Crash with metis 5.1.1 #106

traversaro · 2024-01-08T13:58:58Z

Solution to issue cannot be found in the documentation.

I checked the documentation.

Issue

The fix in #88 is creating problems, even though it solved the crash in import kwant; kwant.test().

See #87 (comment) for more details.

traversaro · 2024-01-08T14:02:27Z

@akhmerov just to understand, is this crash related to python -c "import kwant; kwant.test()" (that was in theory fixed in #88) or something else?

akhmerov · 2024-01-08T15:12:27Z

It's a different segfault caught by our updated tests (that's why kwant.test() passes now, while the unreleased version of tests crashes). Because it works with the same version of mumps and with different orderings though, I expect that the problem is not on our side.

Here is the crash in CI, but I'll provide a more self-contained example in a bit.

akhmerov · 2024-01-09T14:05:34Z

I have investigated the failure further and arrived at the following reproducer, which works on Ubuntu 23.10.

First, make an environment:

name: mumps_bug
channels:
    - conda-forge
dependencies:
    - mumps-seq=5.2.1
    - metis=5.1.1
    - kwant=1.4.4
    - valgrind
    - pytest
    - pip
    - pip:
      - pytest-valgrind

Then save the following test file as test_bug.py:

import itertools
import numpy as np
import pytest
from pytest import raises
from numpy.testing import assert_almost_equal

import kwant
from kwant._common import ensure_rng

import kwant.solvers.sparse
import kwant.solvers.mumps
no_mumps = False

mumps_solver_options = [
    {'nrhs': 10, 'ordering': 'metis'},
    {'nrhs': 10, 'sparse_rhs': True, 'ordering': 'metis'},
    {'nrhs': 2, 'ordering': 'metis', 'sparse_rhs': True},
]

solvers = list(itertools.chain(
    [("mumps", opts) for opts in mumps_solver_options],
))


def solver_id(s):
    solver_name, opts = s
    args = ", ".join(f"{k}={repr(v)}" for k, v in opts.items())
    return f"{solver_name}({args})"


@pytest.fixture(scope="function", params=mumps_solver_options)
def solver(request):
    solver_opts = request.param
    solver = kwant.solvers.mumps
    solver.options(**solver_opts)
    return solver


@pytest.fixture
def smatrix(solver):
    return solver.smatrix


@pytest.fixture
def greens_function(solver):
    return solver.greens_function


@pytest.fixture
def wave_function(solver):
    return solver.wave_function

@pytest.fixture(scope="function")
def twolead_builder():
    rng = ensure_rng(4)
    system = kwant.Builder()
    left_lead = kwant.Builder(kwant.TranslationalSymmetry((-1,)))
    right_lead = kwant.Builder(kwant.TranslationalSymmetry((1,)))
    for b, site in [(system, chain(0)), (system, chain(1)),
                    (left_lead, chain(0)), (right_lead, chain(0))]:
        h = rng.random_sample((n, n)) + 1j * rng.random_sample((n, n))
        h += h.conjugate().transpose()
        b[site] = h
    for b, hopp in [(system, (chain(0), chain(1))),
                    (left_lead, (chain(0), chain(1))),
                    (right_lead, (chain(0), chain(1)))]:
        b[hopp] = (10 * rng.random_sample((n, n)) +
                   1j * rng.random_sample((n, n)))
    system.attach_lead(left_lead)
    system.attach_lead(right_lead)
    return system

n = 5
chain = kwant.lattice.chain(norbs=n)
sq = square = kwant.lattice.square(norbs=n)

def test_output(twolead_builder, smatrix):
    fsyst = twolead_builder.finalized()

    result1 = smatrix(fsyst)
    s, modes1 = result1.data, result1.lead_info
    assert s.shape == 2 * (sum(len(i.momenta) for i in modes1) // 2,)
    s1 = result1.submatrix(1, 0)
    result2 = smatrix(fsyst, 0, (), [1], [0])
    s2, modes2 = result2.data, result2.lead_info
    assert s2.shape == (len(modes2[1].momenta) // 2,
                        len(modes2[0].momenta) // 2)
    assert_almost_equal(abs(s1), abs(s2))
    assert_almost_equal(np.dot(s.T.conj(), s),
                        np.identity(s.shape[0]))
    raises(ValueError, smatrix, fsyst, out_leads=[])
    modes = smatrix(fsyst).lead_info
    h = fsyst.leads[0].cell_hamiltonian()
    t = fsyst.leads[0].inter_cell_hopping()
    modes1 = kwant.physics.modes(h, t)[0]
    h = fsyst.leads[1].cell_hamiltonian()
    t = fsyst.leads[1].inter_cell_hopping()
    modes2 = kwant.physics.modes(h, t)[0]
    raise


def test_smatrix_shape(smatrix):
    chain = kwant.lattice.chain(norbs=1)

    system = kwant.Builder()
    lead0 = kwant.Builder(kwant.TranslationalSymmetry((-1,)))
    lead1 = kwant.Builder(kwant.TranslationalSymmetry((1,)))
    for b, site in [(system, chain(0)), (system, chain(1)),
                    (system, chain(2))]:
        b[site] = 2
    lead0[chain(0)] = lambda site: lead0_val
    lead1[chain(0)] = lambda site: lead1_val

    for b, hopp in [(system, (chain(0), chain(1))),
                    (system, (chain(1), chain(2))),
                    (lead0, (chain(0), chain(1))),
                    (lead1, (chain(0), chain(1)))]:
        b[hopp] = -1
    system.attach_lead(lead0)
    system.attach_lead(lead1)
    fsyst = system.finalized()

    lead0_val = 4
    lead1_val = 4
    s = smatrix(fsyst, 1.0, (), [1], [0]).data
    assert s.shape == (0, 0)

    lead0_val = 2
    lead1_val = 2
    s = smatrix(fsyst, 1.0, (), [1], [0]).data
    assert s.shape == (1, 1)

    lead0_val = 4
    lead1_val = 2
    s = smatrix(fsyst, 1.0, (), [1], [0]).data
    assert s.shape == (1, 0)

    lead0_val = 2
    lead1_val = 4
    s = smatrix(fsyst, 1.0, (), [1], [0]).data
    assert s.shape == (0, 1)

def test_reflection_no_open_modes(greens_function):
    # Build system
    syst = kwant.Builder()
    lead = kwant.Builder(kwant.TranslationalSymmetry((-1, 0)))
    syst[(square(i, j) for i in range(3) for j in range(3))] = 4
    lead[(square(0, j) for j in range(3))] = 4
    syst[square.neighbors()] = -1
    lead[square.neighbors()] = -1
    syst.attach_lead(lead)
    syst.attach_lead(lead.reversed())
    syst = syst.finalized()

    # Sanity check; no open modes at 0 energy
    _, m = syst.leads[0].modes(energy=0)
    assert m.nmodes == 0

    assert np.isclose(greens_function(syst).transmission(0, 0), 0)

Place the file in an empty folder and activate the environment. Running py.test in that folder sometimes finishes (disregard the test failures, they are not relevant) and sometimes segfaults with:

================================================================================= test session starts ==================================================================================
platform linux -- Python 3.12.1, pytest-7.4.4, pluggy-1.3.0
rootdir: /home/anton/tmp/mumps_bug
plugins: valgrind-0.2.0
collected 9 items                                                                                                                                                                      

test_bug.py FFF...FFFatal Python error: Segmentation fault

Current thread 0x00007f6926919740 (most recent call first):
  File "/home/anton/micromamba/envs/mumps_bug/lib/python3.12/site-packages/kwant/linalg/mumps.py", line 243 in analyze
  File "/home/anton/micromamba/envs/mumps_bug/lib/python3.12/site-packages/kwant/linalg/mumps.py", line 320 in factor
  File "/home/anton/micromamba/envs/mumps_bug/lib/python3.12/site-packages/kwant/solvers/mumps.py", line 104 in _factorized
...TRUNCATED
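Because the crash is intermittent, a single run says little. The following is a minimal sketch, not part of the original reproducer, that repeats a command and counts how many runs die from a segmentation fault. The helper names died_from_segfault and count_segfaults are hypothetical, and the sketch assumes a POSIX system, where Python's subprocess module reports death by signal N as a return code of -N.

```python
import signal
import subprocess
import sys


def died_from_segfault(returncode: int) -> bool:
    """Return True if a subprocess return code indicates death by SIGSEGV."""
    # On POSIX, subprocess reports "killed by signal N" as return code -N.
    return returncode == -signal.SIGSEGV


def count_segfaults(cmd: list[str], runs: int = 20) -> int:
    """Run `cmd` `runs` times and count how many runs ended in a segfault."""
    crashes = 0
    for _ in range(runs):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if died_from_segfault(proc.returncode):
            crashes += 1
    return crashes
```

Inside the activated environment, something like count_segfaults([sys.executable, "-m", "pytest", "test_bug.py"]) should then give a rough crash frequency, assuming the pytest process itself is killed by the signal after printing the traceback.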

Furthermore, running valgrind using PYTHONMALLOC=malloc valgrind --show-leak-kinds=definite --log-file=valgrind-output py.test --valgrind --valgrind-log=valgrind-output gives (after a fairly long wait) this error that looks relevant:

________________________________________________________________________ test_reflection_no_open_modes[solver0] ________________________________________________________________________
[VALGRIND ERROR+LEAK]

Valgrind detected both an error(s) and a leak(s):

**3904598** 
**3904598** **********************************************************************
**3904598** test_bug.py::test_reflection_no_open_modes[solver0]
**3904598** **********************************************************************
==3904598== 
==3904598== More than 100 errors detected.  Subsequent errors
==3904598== will still be recorded, but in less detail than before.
==3904598== Conditional jump or move depends on uninitialised value(s)
==3904598==    at 0x53A159F8: libmetis__genmmd (in /home/anton/micromamba/envs/mumps_bug/lib/libmetis.so)
==3904598==    by 0x53A16CBC: libmetis__MMDOrder (in /home/anton/micromamba/envs/mumps_bug/lib/libmetis.so)
==3904598==    by 0x53A17090: libmetis__MlevelNestedDissection (in /home/anton/micromamba/envs/mumps_bug/lib/libmetis.so)
==3904598==    by 0x53A175DB: METIS_NodeND (in /home/anton/micromamba/envs/mumps_bug/lib/libmetis.so)
==3904598==    by 0x53963A5F: __mumps_ana_ord_wrappers_MOD_mumps_metis_nodend_mixedto32 (in /home/anton/micromamba/envs/mumps_bug/lib/libmumps_common_seq-5.2.1.so)
==3904598==    by 0x53756C60: __zmumps_ana_aux_m_MOD_zmumps_ana_f (in /home/anton/micromamba/envs/mumps_bug/lib/libzmumps_seq-5.2.1.so)
==3904598==    by 0x5384CA2F: zmumps_ana_driver_ (in /home/anton/micromamba/envs/mumps_bug/lib/libzmumps_seq-5.2.1.so)
==3904598==    by 0x538D3700: zmumps_ (in /home/anton/micromamba/envs/mumps_bug/lib/libzmumps_seq-5.2.1.so)
==3904598==    by 0x538D8ABD: zmumps_f77_ (in /home/anton/micromamba/envs/mumps_bug/lib/libzmumps_seq-5.2.1.so)
==3904598==    by 0x538CFB25: zmumps_c (in /home/anton/micromamba/envs/mumps_bug/lib/libzmumps_seq-5.2.1.so)
==3904598==    by 0x53713DE9: __pyx_pw_5kwant_6linalg_6_mumps_6zmumps_5call (in /home/anton/micromamba/envs/mumps_bug/lib/python3.12/site-packages/kwant/linalg/_mumps.cpython-312-x86_64-linux-gnu.so)
==3904598==    by 0x32F93E: UnknownInlinedFun (pycore_call.h:92)
==3904598==    by 0x32F93E: PyObject_Vectorcall (call.c:325)

akhmerov · 2024-01-09T14:08:39Z

Also: installing metis=5.1.0 makes the segfault disappear and removes the Valgrind error (the leak stays, but it's likely irrelevant).
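To check whether a given environment has the affected metis build, one option is to inspect the output of conda list --json. The sketch below is only an illustration: the helper names installed_version and uses_affected_metis are hypothetical, and it assumes the JSON output is a list of records with "name" and "version" keys.

```python
import json

# Version reported as crashing in this thread.
AFFECTED_METIS = "5.1.1"


def installed_version(conda_list_json: str, package: str):
    """Return the version of `package` in `conda list --json` output, or None."""
    for entry in json.loads(conda_list_json):
        if entry.get("name") == package:
            return entry.get("version")
    return None


def uses_affected_metis(conda_list_json: str) -> bool:
    """Return True if the environment listing contains metis 5.1.1."""
    return installed_version(conda_list_json, "metis") == AFFECTED_METIS
```

The JSON text can be obtained, for example, with subprocess.run(["conda", "list", "--json"], capture_output=True, text=True).stdout in the activated environment.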

akhmerov · 2024-01-09T20:24:49Z

After investigating the issues with Metis 5.1.1, I think the problem is unavoidable. I suggest skipping 5.1.1 and waiting until 5.2.1 arrives in the feedstock (conda-forge/metis-feedstock#41 if it succeeds).

traversaro · 2024-01-09T20:51:37Z

Thanks a lot for the thorough investigation @akhmerov, I totally agree. At this point, considering also the other failures we are seeing with metis 5.1.1 (conda-forge/gtsam-feedstock#21), we could consider reverting the migration to 5.1.1 at the conda-forge level and sticking with 5.1.0 until metis 5.2.1 is available. This has the downside that dgl will not be installable side-by-side with other conda-forge packages that depend on metis, but if anyone really needs that, they can invest in the work of either packaging metis 5.2.1 or ensuring that the package of interest builds for both metis 5.1.0 and 5.1.1.

Any opinion on this @conda-forge/metis @conda-forge/dgl @conda-forge/mumps ?

mikemhenry · 2024-01-12T04:07:26Z

I'll try and take a look into this more tomorrow, thank you so much for investigating!

hmacdope · 2024-01-14T23:57:28Z

As a partial update @traversaro @akhmerov we have some friends at Quansight looking into fixing the METIS 5.2.1 build (hopefully). Will keep you posted.

akhmerov · 2024-01-15T19:53:57Z

This is becoming a blocker for using the feedstock on windows. There:

Version 5.2.1 misplaces .lib files so that mumps isn't found by meson
Version 5.6.2 lacks mumps_int_def.h (Missing header file #100)

(I'm just working on a feedstock over here: conda-forge/staged-recipes#25042)

hmacdope · 2024-01-15T22:06:21Z

The basic blocker is KarypisLab/GKlib#23 (comment).

If we don't get a response soon we will do a release targeting latest sha.

akhmerov · 2024-01-15T23:32:28Z

I'm a bit concerned about counting on that, see the evaluation of Metis 5.2 by SuiteSparse DrTimothyAldenDavis/SuiteSparse#291 (comment)

akhmerov · 2024-01-18T12:47:34Z

Would it be an appropriate solution to have build variants for metis 5.1.0 and 5.2.1 (when it's available)? Looking at #88, the only difference is whether to apply the patch.

On the other hand, packaging only for a library version that has not been tested (metis 5.2.1), or that is not even releasable right now, seems likely to cause a lot of pain for users.

traversaro · 2024-01-18T13:44:04Z

Personally, I am in favor of stopping the metis 5.1.1 migration and switching back to metis 5.1.0 here and in the other migrated feedstocks (see https://conda-forge.org/status/#metis511). Once metis 5.2.1 is ready, we can try it and, if it works fine, proceed with the 5.2.1 migration. I am not sure whether anyone else in @conda-forge/mumps or @conda-forge/metis has a different opinion. I would be happy to do the necessary PRs to stop the migration and revert the migrated repos to 5.1.0.

akhmerov · 2024-01-19T09:03:03Z

Since the packages are now effectively broken, I would really appreciate that.

minrk · 2024-01-19T12:38:23Z

conda-forge/conda-forge-pinning-feedstock#5396 halts the migration. Rebuilds can start once that lands. It looks like only 10 packages are affected. @traversaro feel free to ping me on any un-migrate PRs.

traversaro · 2024-01-19T19:10:11Z

More examples of metis 5.1.1 problems in the wild: ami-iit/bipedal-locomotion-framework#799.

traversaro added the bug label Jan 8, 2024

traversaro mentioned this issue Jan 8, 2024

Needs pin for metis #87

Closed

1 task

traversaro mentioned this issue Jan 9, 2024

Rebuild for metis511 conda-forge/moab-feedstock#91

Closed

akhmerov mentioned this issue Jan 18, 2024

Add python-mumps from pypi conda-forge/staged-recipes#25042

Merged

10 tasks

minrk mentioned this issue Jan 19, 2024

halt metis 5.1.1 migration conda-forge/conda-forge-pinning-feedstock#5396

Merged

akhmerov mentioned this issue Jan 19, 2024

revert to metis 5.1.0 #108

Merged

5 tasks

minrk closed this as completed in #108 Jan 19, 2024

traversaro mentioned this issue Jan 19, 2024

Gravity task and distance task QPIK test fails on windows ami-iit/bipedal-locomotion-framework#799

Closed

minrk mentioned this issue Mar 12, 2024

mark mumps builds with metis 5.1.1 as broken conda-forge/admin-requests#959

Merged

3 tasks
