Crash with metis 5.1.1 #106

Closed
1 task done
traversaro opened this issue Jan 8, 2024 · 16 comments · Fixed by #108

@traversaro
Contributor

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

The fix in #88 is creating problems, even if it solved the crash in import kwant; kwant.test().

See #87 (comment) for more details.

Installed packages

.

Environment info

.
@traversaro traversaro added the bug label Jan 8, 2024
@traversaro traversaro mentioned this issue Jan 8, 2024
1 task
@traversaro
Contributor Author

@akhmerov just to understand, is this crash related to python -c "import kwant; kwant.test()" (that was in theory fixed in #88) or something else?

@akhmerov
Contributor

akhmerov commented Jan 8, 2024

It's a different segfault caught by our updated tests (that's why kwant.test() passes now, while the unreleased version of the tests crashes). However, because it works with the same version of mumps when a different ordering is used, I expect that the problem is not on our side.

Here is the crash in CI, but I'll provide a more self-contained example in a bit.

@akhmerov
Contributor

akhmerov commented Jan 9, 2024

I have investigated the failure further and arrived at the following reproducer, which works on Ubuntu 23.10.

First, make an environment:

name: mumps_bug
channels:
    - conda-forge
dependencies:
    - mumps-seq=5.2.1
    - metis=5.1.1
    - kwant=1.4.4
    - valgrind
    - pytest
    - pip
    - pip:
      - pytest-valgrind

Download this test file

contents
import itertools
import numpy as np
import pytest
from pytest import raises
from numpy.testing import assert_almost_equal

import kwant
from kwant._common import ensure_rng

import kwant.solvers.sparse
import kwant.solvers.mumps
no_mumps = False

mumps_solver_options = [
    {'nrhs': 10, 'ordering': 'metis'},
    {'nrhs': 10, 'sparse_rhs': True, 'ordering': 'metis'},
    {'nrhs': 2, 'ordering': 'metis', 'sparse_rhs': True},
]

solvers = list(itertools.chain(
    [("mumps", opts) for opts in mumps_solver_options],
))


def solver_id(s):
    solver_name, opts = s
    args = ", ".join(f"{k}={repr(v)}" for k, v in opts.items())
    return f"{solver_name}({args})"


@pytest.fixture(scope="function", params=mumps_solver_options)
def solver(request):
    solver_opts = request.param
    solver = kwant.solvers.mumps
    solver.options(**solver_opts)
    return solver


@pytest.fixture
def smatrix(solver):
    return solver.smatrix


@pytest.fixture
def greens_function(solver):
    return solver.greens_function


@pytest.fixture
def wave_function(solver):
    return solver.wave_function

@pytest.fixture(scope="function")
def twolead_builder():
    rng = ensure_rng(4)
    system = kwant.Builder()
    left_lead = kwant.Builder(kwant.TranslationalSymmetry((-1,)))
    right_lead = kwant.Builder(kwant.TranslationalSymmetry((1,)))
    for b, site in [(system, chain(0)), (system, chain(1)),
                    (left_lead, chain(0)), (right_lead, chain(0))]:
        h = rng.random_sample((n, n)) + 1j * rng.random_sample((n, n))
        h += h.conjugate().transpose()
        b[site] = h
    for b, hopp in [(system, (chain(0), chain(1))),
                    (left_lead, (chain(0), chain(1))),
                    (right_lead, (chain(0), chain(1)))]:
        b[hopp] = (10 * rng.random_sample((n, n)) +
                   1j * rng.random_sample((n, n)))
    system.attach_lead(left_lead)
    system.attach_lead(right_lead)
    return system

n = 5
chain = kwant.lattice.chain(norbs=n)
sq = square = kwant.lattice.square(norbs=n)

def test_output(twolead_builder, smatrix):
    fsyst = twolead_builder.finalized()

    result1 = smatrix(fsyst)
    s, modes1 = result1.data, result1.lead_info
    assert s.shape == 2 * (sum(len(i.momenta) for i in modes1) // 2,)
    s1 = result1.submatrix(1, 0)
    result2 = smatrix(fsyst, 0, (), [1], [0])
    s2, modes2 = result2.data, result2.lead_info
    assert s2.shape == (len(modes2[1].momenta) // 2,
                        len(modes2[0].momenta) // 2)
    assert_almost_equal(abs(s1), abs(s2))
    assert_almost_equal(np.dot(s.T.conj(), s),
                        np.identity(s.shape[0]))
    raises(ValueError, smatrix, fsyst, out_leads=[])
    modes = smatrix(fsyst).lead_info
    h = fsyst.leads[0].cell_hamiltonian()
    t = fsyst.leads[0].inter_cell_hopping()
    modes1 = kwant.physics.modes(h, t)[0]
    h = fsyst.leads[1].cell_hamiltonian()
    t = fsyst.leads[1].inter_cell_hopping()
    modes2 = kwant.physics.modes(h, t)[0]
    raise


def test_smatrix_shape(smatrix):
    chain = kwant.lattice.chain(norbs=1)

    system = kwant.Builder()
    lead0 = kwant.Builder(kwant.TranslationalSymmetry((-1,)))
    lead1 = kwant.Builder(kwant.TranslationalSymmetry((1,)))
    for b, site in [(system, chain(0)), (system, chain(1)),
                    (system, chain(2))]:
        b[site] = 2
    lead0[chain(0)] = lambda site: lead0_val
    lead1[chain(0)] = lambda site: lead1_val

    for b, hopp in [(system, (chain(0), chain(1))),
                    (system, (chain(1), chain(2))),
                    (lead0, (chain(0), chain(1))),
                    (lead1, (chain(0), chain(1)))]:
        b[hopp] = -1
    system.attach_lead(lead0)
    system.attach_lead(lead1)
    fsyst = system.finalized()

    lead0_val = 4
    lead1_val = 4
    s = smatrix(fsyst, 1.0, (), [1], [0]).data
    assert s.shape == (0, 0)

    lead0_val = 2
    lead1_val = 2
    s = smatrix(fsyst, 1.0, (), [1], [0]).data
    assert s.shape == (1, 1)

    lead0_val = 4
    lead1_val = 2
    s = smatrix(fsyst, 1.0, (), [1], [0]).data
    assert s.shape == (1, 0)

    lead0_val = 2
    lead1_val = 4
    s = smatrix(fsyst, 1.0, (), [1], [0]).data
    assert s.shape == (0, 1)

def test_reflection_no_open_modes(greens_function):
    # Build system
    syst = kwant.Builder()
    lead = kwant.Builder(kwant.TranslationalSymmetry((-1, 0)))
    syst[(square(i, j) for i in range(3) for j in range(3))] = 4
    lead[(square(0, j) for j in range(3))] = 4
    syst[square.neighbors()] = -1
    lead[square.neighbors()] = -1
    syst.attach_lead(lead)
    syst.attach_lead(lead.reversed())
    syst = syst.finalized()

    # Sanity check; no open modes at 0 energy
    _, m = syst.leads[0].modes(energy=0)
    assert m.nmodes == 0

    assert np.isclose(greens_function(syst).transmission(0, 0), 0)

Place the file in an empty folder and activate the environment. Running py.test in that folder sometimes finishes (disregard the errors, they are not relevant) and sometimes segfaults with:

================================================================================= test session starts ==================================================================================
platform linux -- Python 3.12.1, pytest-7.4.4, pluggy-1.3.0
rootdir: /home/anton/tmp/mumps_bug
plugins: valgrind-0.2.0
collected 9 items                                                                                                                                                                      

test_bug.py FFF...FFFatal Python error: Segmentation fault

Current thread 0x00007f6926919740 (most recent call first):
  File "/home/anton/micromamba/envs/mumps_bug/lib/python3.12/site-packages/kwant/linalg/mumps.py", line 243 in analyze
  File "/home/anton/micromamba/envs/mumps_bug/lib/python3.12/site-packages/kwant/linalg/mumps.py", line 320 in factor
  File "/home/anton/micromamba/envs/mumps_bug/lib/python3.12/site-packages/kwant/solvers/mumps.py", line 104 in _factorized
...TRUNCATED
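
Because the segfault is intermittent, a single py.test run can be misleading. The following is a minimal sketch, not part of the original report, that reruns a command several times and counts how many runs were killed by SIGSEGV. The helper name count_crashes and the default of 20 runs are illustrative assumptions; the file name test_bug.py is taken from the output above. It assumes a POSIX system, where a negative return code means the child process was terminated by that signal.

```python
import signal
import subprocess
import sys


def count_crashes(cmd=None, runs=20):
    """Run `cmd` repeatedly and tally how each run ended.

    Hypothetical helper for illustration; by default it runs pytest on
    test_bug.py with the current interpreter.
    """
    if cmd is None:
        cmd = [sys.executable, "-m", "pytest", "test_bug.py"]
    tally = {"segfault": 0, "other": 0}
    for _ in range(runs):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX, returncode == -N means the child was killed by signal N.
        if proc.returncode == -signal.SIGSEGV:
            tally["segfault"] += 1
        else:
            # Clean exits and ordinary test failures both land here.
            tally["other"] += 1
    return tally
```

Calling count_crashes() in the folder containing the test file gives a rough crash frequency, which makes it easier to compare environments (for example, before and after changing the metis version).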

Furthermore, running valgrind using PYTHONMALLOC=malloc valgrind --show-leak-kinds=definite --log-file=valgrind-output py.test --valgrind --valgrind-log=valgrind-output gives (after a fairly long wait) this error that looks relevant:

________________________________________________________________________ test_reflection_no_open_modes[solver0] ________________________________________________________________________
[VALGRIND ERROR+LEAK]

Valgrind detected both an error(s) and a leak(s):

**3904598** 
**3904598** **********************************************************************
**3904598** test_bug.py::test_reflection_no_open_modes[solver0]
**3904598** **********************************************************************
==3904598== 
==3904598== More than 100 errors detected.  Subsequent errors
==3904598== will still be recorded, but in less detail than before.
==3904598== Conditional jump or move depends on uninitialised value(s)
==3904598==    at 0x53A159F8: libmetis__genmmd (in /home/anton/micromamba/envs/mumps_bug/lib/libmetis.so)
==3904598==    by 0x53A16CBC: libmetis__MMDOrder (in /home/anton/micromamba/envs/mumps_bug/lib/libmetis.so)
==3904598==    by 0x53A17090: libmetis__MlevelNestedDissection (in /home/anton/micromamba/envs/mumps_bug/lib/libmetis.so)
==3904598==    by 0x53A175DB: METIS_NodeND (in /home/anton/micromamba/envs/mumps_bug/lib/libmetis.so)
==3904598==    by 0x53963A5F: __mumps_ana_ord_wrappers_MOD_mumps_metis_nodend_mixedto32 (in /home/anton/micromamba/envs/mumps_bug/lib/libmumps_common_seq-5.2.1.so)
==3904598==    by 0x53756C60: __zmumps_ana_aux_m_MOD_zmumps_ana_f (in /home/anton/micromamba/envs/mumps_bug/lib/libzmumps_seq-5.2.1.so)
==3904598==    by 0x5384CA2F: zmumps_ana_driver_ (in /home/anton/micromamba/envs/mumps_bug/lib/libzmumps_seq-5.2.1.so)
==3904598==    by 0x538D3700: zmumps_ (in /home/anton/micromamba/envs/mumps_bug/lib/libzmumps_seq-5.2.1.so)
==3904598==    by 0x538D8ABD: zmumps_f77_ (in /home/anton/micromamba/envs/mumps_bug/lib/libzmumps_seq-5.2.1.so)
==3904598==    by 0x538CFB25: zmumps_c (in /home/anton/micromamba/envs/mumps_bug/lib/libzmumps_seq-5.2.1.so)
==3904598==    by 0x53713DE9: __pyx_pw_5kwant_6linalg_6_mumps_6zmumps_5call (in /home/anton/micromamba/envs/mumps_bug/lib/python3.12/site-packages/kwant/linalg/_mumps.cpython-312-x86_64-linux-gnu.so)
==3904598==    by 0x32F93E: UnknownInlinedFun (pycore_call.h:92)
==3904598==    by 0x32F93E: PyObject_Vectorcall (call.c:325)

@akhmerov
Contributor

akhmerov commented Jan 9, 2024

Also: installing metis=5.1.0 makes the segfault disappear and removes the Valgrind error (the leak stays, but it's likely irrelevant).
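
For anyone wanting to check whether an existing environment has the affected metis version, here is a small sketch that inspects the output of conda list --json. It assumes that output is a JSON array of objects with "name" and "version" keys; the names AFFECTED_METIS, metis_version and is_affected are made up for illustration, and the set of affected versions reflects only what is reported in this thread.

```python
import json

# Assumption: only the version reported as crashing in this thread.
AFFECTED_METIS = {"5.1.1"}


def metis_version(conda_list_json):
    """Return the installed metis version, or None if metis is absent."""
    for pkg in json.loads(conda_list_json):
        if pkg.get("name") == "metis":
            return pkg.get("version")
    return None


def is_affected(conda_list_json):
    """True if the environment has a metis version listed as affected."""
    return metis_version(conda_list_json) in AFFECTED_METIS
```

The JSON text can be obtained by running conda list --json in the activated environment and passing its standard output to is_affected.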

@akhmerov
Contributor

akhmerov commented Jan 9, 2024

Having investigated the issues with Metis 5.1.1, they seem unavoidable. I suggest skipping 5.1.1 and waiting until 5.2.1 arrives in the feedstock (conda-forge/metis-feedstock#41 if it succeeds).

@traversaro
Contributor Author

Thanks a lot for the thorough investigation @akhmerov, I totally agree. At this point, considering also the other failures we are seeing with metis 5.1.1 (conda-forge/gtsam-feedstock#21), we could consider reverting the migration to 5.1.1 at the conda-forge level and sticking to 5.1.0 until metis 5.2.1 is available. This has the downside that dgl will not be installable side-by-side with other conda-forge packages that depend on metis, but anyone who really needs that can invest in the work, either by packaging metis 5.2.1 or by ensuring that the package of interest builds for both metis 5.1.0 and 5.1.1.

Any opinion on this @conda-forge/metis @conda-forge/dgl @conda-forge/mumps ?

@mikemhenry

I'll try and take a look into this more tomorrow, thank you so much for investigating!

@hmacdope

As a partial update @traversaro @akhmerov we have some friends at Quansight looking into fixing the METIS 5.2.1 build (hopefully). Will keep you posted.

@akhmerov
Contributor

akhmerov commented Jan 15, 2024

This is becoming a blocker for using the feedstock on Windows. There:

  • Version 5.2.1 misplaces .lib files so that mumps isn't found by meson
  • Version 5.6.2 lacks mumps_int_def.h (Missing header file #100)

(I'm just working on a feedstock over here: conda-forge/staged-recipes#25042)

@hmacdope

The basic blocker is KarypisLab/GKlib#23 (comment).

If we don't get a response soon we will do a release targeting latest sha.

@akhmerov
Contributor

akhmerov commented Jan 15, 2024

I'm a bit concerned about counting on that, see the evaluation of Metis 5.2 by SuiteSparse DrTimothyAldenDavis/SuiteSparse#291 (comment)

@akhmerov
Contributor

Would it be an appropriate solution to have build variants for metis 5.1.0 and 5.2.1 (when it's available)? Looking at #88, the only difference is whether to apply the patch.

On the other hand, packaging only for a library version that hasn't been tested (metis 5.2.1), or that isn't even releasable right now, seems likely to cause a lot of pain for users.

@traversaro
Contributor Author

traversaro commented Jan 18, 2024

Personally, I am in favor of stopping the metis 5.1.1 migration and switching back to metis 5.1.0 here and in the other migrated feedstocks (see https://conda-forge.org/status/#metis511). Once metis 5.2.1 is ready, we can try it and, if it works fine, proceed with the 5.2.1 migration. I am not sure whether the other @conda-forge/mumps and @conda-forge/metis maintainers have a different opinion. I would be happy to do the necessary PRs to stop the migration and revert migrated repos to 5.1.0.

@akhmerov
Contributor

Since the packages are now effectively broken, I would really appreciate that.

@minrk
Member

minrk commented Jan 19, 2024

conda-forge/conda-forge-pinning-feedstock#5396 halts the migration. Rebuilds can start once that lands. Looks like only 10 packages. @Traverso feel free to ping me on any un-migrate PRs

@traversaro
Contributor Author

More examples of metis 5.1.1 problems in the wild: ami-iit/bipedal-locomotion-framework#799.
