Trouble solving unsymmetric system in serial #54

Open
rleegates opened this issue Nov 20, 2019 · 7 comments

@rleegates

I am having trouble solving a moderately large, numerically unsymmetric, structurally symmetric system without mpirun on OS X. The solve is part of an iterative procedure that works fine most of the time but fails nondeterministically (exactly when it fails depends on whether the compiled Julia code changes). Before the problem occurs, I check whether A*b can be computed, to verify that the matrix is structurally sound. Then, after entering the MUMPS analysis step, MUMPS reports zero density and zero structural symmetry. On some computers it doesn't even get that far and segfaults immediately; on others it continues, reporting a MUMPS error with INFO(1) = 1 and reporting the number of nonzeros correctly. According to the MUMPS manual, this implies that the I, J indices are out of range. However, if the matrix were broken, I would expect A*b to fail. Here is my solver code:

using SparseArrays  # provides SparseMatrixCSC and nnz on Julia 0.7 and later
import MUMPS
import MPI

comm = MPI.COMM_WORLD

function solve_mumps(A::SparseMatrixCSC{Float64, Int}, b::Vector{Float64})
	if !MPI.Initialized()
		MPI.Init()
	end
	A*b
	warn("Success: $(nnz(A))")
	MPI.Barrier(comm)
	mumps = MUMPS.Mumps{Float64}(MUMPS.mumps_unsymmetric, MUMPS.get_icntl(ooc=true, verbose=true), MUMPS.default_cntl64)
	MUMPS.associate_matrix!(mumps, A)
	MUMPS.associate_rhs!(mumps, b)
	MPI.Barrier(comm)
	MUMPS.factorize!(mumps)
	MPI.Barrier(comm)
	MUMPS.solve!(mumps)
	MPI.Barrier(comm)
	x = MUMPS.get_solution(mumps)
	MUMPS.finalize(mumps)
	MPI.Barrier(comm)
	return vec(x)
end

The segmentation fault looks like this:

Entering DMUMPS 5.2.1 from C interface with JOB, N, NNZ =   4      569480       62978289
      executing #MPI =      1, without OMP

 =================================================
 MUMPS compiled with option -Dparmetis
 This MUMPS version includes code for SAVE_RESTORE
 =================================================
L U Solver for unsymmetric matrices
Type of parallelism: Working host

 ****** ANALYSIS STEP ********


signal (11): Segmentation fault: 11
in expression starting at ***/refine.jl:225
dmumps_ana_gnew_ at /usr/local/Cellar/brewsci-mumps/5.2.1/lib/libdmumps.dylib (unknown line)

All those Barriers were inserted in an attempt to fix the problem. Any ideas? My next step would be to extract the matrix and rhs, save them to HDF5 and to do tests on this matrix outside the main code, perhaps producing an MWE. However, I thought the error might be obvious, and therefore decided to ask first.
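
The plan of extracting the matrix and right-hand side for tests outside the main code could be sketched as follows. This is a minimal sketch that swaps HDF5 for Julia's standard-library `Serialization` module so that it has no external dependencies; `dump_system` and `load_system` are hypothetical helper names, not part of MUMPS.jl. Serialized files are generally only readable by a compatible Julia version, so HDF5 remains the better choice for sharing the data.

```julia
using SparseArrays
using Serialization

# Hypothetical helper: write the matrix and right-hand side to a single file.
function dump_system(path::AbstractString, A::SparseMatrixCSC, b::Vector)
    open(path, "w") do io
        serialize(io, (A, b))
    end
    return path
end

# Hypothetical helper: read the pair back, e.g. in a fresh Julia session.
function load_system(path::AbstractString)
    A, b = open(deserialize, path)
    return A, b
end

# Round trip on a small random system.
A = sprand(50, 50, 0.1)
b = rand(50)
path = dump_system(tempname(), A, b)
A2, b2 = load_system(path)
```

Calling `dump_system` just before `MUMPS.associate_matrix!` in the failing iteration would capture the exact input that MUMPS receives.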

@dpo
Member

dpo commented Nov 20, 2019

Does the same happen if you don't store the matrix out of core?

@rleegates
Author

Yes it does. When it does run (on another computer) I get the following output:

WARNING: Success: 63007365

Entering DMUMPS 5.2.1 from C interface with JOB, N, NNZ =   4      569480       63007365
      executing #MPI =      1, without OMP

 =================================================
 MUMPS compiled with option -Dparmetis
 This MUMPS version includes code for SAVE_RESTORE
 =================================================
L U Solver for unsymmetric matrices
Type of parallelism: Working host

 ****** ANALYSIS STEP ********

 ... Structural symmetry (in percent)=    0
 Average density of rows/columns =    0

** Error/warning return ** from Analysis *  INFO(1:2)=   1        63007365
 Scaling will be computed during analysis
Compute maximum matching (Maximum Transversal):  5
 ... JOB =  5: MAXIMIZE PRODUCT DIAGONAL AND SCALE

signal (11): Segmentation fault: 11
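
As far as I understand the MUMPS user guide, INFO(1) = +1 is a warning that some row or column indices are out of range, and INFO(2) then holds the number of such entries. If that reading is right, INFO(2) = 63007365 in the log above equals NNZ, which would mean every entry was rejected, consistent with the reported zero density. The sketch below counts out-of-range entries on the Julia side so the two can be compared. It assumes MUMPS receives 1-based coordinate triplets with 32-bit indices; `count_out_of_range` and `check_triplets` are hypothetical helpers, not part of MUMPS.jl.

```julia
using SparseArrays

# Count coordinate entries whose row or column index lies outside 1:n.
function count_out_of_range(rows::AbstractVector{<:Integer},
                            cols::AbstractVector{<:Integer}, n::Integer)
    return count(k -> !(1 <= rows[k] <= n && 1 <= cols[k] <= n),
                 eachindex(rows, cols))
end

# Inspect the triplets of a square sparse matrix and check that the
# dimension and entry count fit in 32-bit integers (an assumption about
# the index type used by the MUMPS build).
function check_triplets(A::SparseMatrixCSC)
    n = size(A, 1)
    rows, cols, _ = findnz(A)
    fits = all(x -> x <= typemax(Int32), (n, nnz(A)))
    bad = count_out_of_range(rows, cols, n)
    return (n = n, entries = length(rows), out_of_range = bad, fits_int32 = fits)
end

report = check_triplets(sparse([1, 2, 3], [1, 2, 3], [1.0, 2.0, 3.0]))
```

If `out_of_range` is zero in Julia while MUMPS rejects every entry, the index arrays are probably being altered somewhere between the wrapper and the library rather than in the matrix itself.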

@rleegates
Author

Coming back to this, let us consider this MWE (just the readme example with a different matrix, no mpirun):

using SparseArrays, LinearAlgebra  # sprand and I on Julia 0.7 and later
using MUMPS
using MPI
MPI.Init()
mumps = Mumps{Float64}(mumps_unsymmetric, default_icntl, default_cntl64)  # Real, general unsymmetric
A = sprand(570000,570000,0.00019); rhs = rand(570000)       # Happens on all cores
A=A+A'
A+=5I
associate_matrix!(mumps, A)
factorize!(mumps)
associate_rhs!(mumps, rhs)
solve!(mumps)
x = get_solution(mumps)
finalize(mumps)
MPI.Finalize()

which upon reaching factorize! gives

Entering DMUMPS 5.2.1 from C interface with JOB, N, NNZ =   4      570000      124018036
      executing #MPI =      1, without OMP

 =================================================
 MUMPS compiled with option -Dparmetis
 This MUMPS version includes code for SAVE_RESTORE
 =================================================
L U Solver for unsymmetric matrices
Type of parallelism: Working host

 ****** ANALYSIS STEP ********


signal (11): Segmentation fault: 11
in expression starting at no file:0
dmumps_ana_gnew_ at /usr/local/Cellar/brewsci-mumps/5.2.1/lib/libdmumps.dylib (unknown line)
__dmumps_ana_aux_m_MOD_dmumps_ana_f at /usr/local/Cellar/brewsci-mumps/5.2.1/lib/libdmumps.dylib (unknown line)
dmumps_ana_driver_ at /usr/local/Cellar/brewsci-mumps/5.2.1/lib/libdmumps.dylib (unknown line)
dmumps_ at /usr/local/Cellar/brewsci-mumps/5.2.1/lib/libdmumps.dylib (unknown line)
dmumps_f77_ at /usr/local/Cellar/brewsci-mumps/5.2.1/lib/libdmumps.dylib (unknown line)
dmumps_c at /usr/local/Cellar/brewsci-mumps/5.2.1/lib/libdmumps.dylib (unknown line)
macro expansion at /Volumes/rgates/.julia/dev/MUMPS/src/MUMPS.jl:23 [inlined]
factorize! at /Volumes/rgates/.julia/dev/MUMPS/src/MUMPS_lib.jl:75
jl_fptr_trampoline at /Users/osx/buildbot/slave/package_osx64/build/src/gf.c:1829
do_call at /Users/osx/buildbot/slave/package_osx64/build/src/interpreter.c:324
eval_stmt_value at /Users/osx/buildbot/slave/package_osx64/build/src/interpreter.c:363 [inlined]
eval_body at /Users/osx/buildbot/slave/package_osx64/build/src/interpreter.c:686
jl_interpret_toplevel_thunk_callback at /Users/osx/buildbot/slave/package_osx64/build/src/interpreter.c:799
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x11a4d235f)
unknown function (ip: 0xffffffffffffffff)
jl_interpret_toplevel_thunk at /Users/osx/buildbot/slave/package_osx64/build/src/interpreter.c:808
jl_toplevel_eval_flex at /Users/osx/buildbot/slave/package_osx64/build/src/toplevel.c:831
jl_toplevel_eval_in at /Users/osx/buildbot/slave/package_osx64/build/src/builtins.c:633
eval at ./boot.jl:319
eval_user_input at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:85
macro expansion at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:117 [inlined]
#28 at ./task.jl:262
jl_apply at /Users/osx/buildbot/slave/package_osx64/build/src/./julia.h:1538 [inlined]
start_task at /Users/osx/buildbot/slave/package_osx64/build/src/task.c:268
Allocations: 9498111 (Pool: 9496305; Big: 1806); GC: 29
[Coruscant:03325] *** Process received signal ***
[Coruscant:03325] Signal: Segmentation fault: 11 (11)
[Coruscant:03325] Signal code: Address not mapped (1)
[Coruscant:03325] Failing at address: 0x1437bd000
[Coruscant:03325] [ 0] 0   libsystem_platform.dylib            0x00007fff5d233f5a _sigtramp + 26
[Coruscant:03325] *** End of error message ***
Segmentation fault: 11

Can you reproduce the issue? Any ideas?
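
For anyone trying to reproduce this, the size of the MWE can be estimated before running it. The arithmetic below is a rough sketch that treats each entry of `sprand` as independently nonzero with probability `p`, and assumes 4-byte indices plus 8-byte values per coordinate entry; the variable names are illustrative only.

```julia
n = 570_000
p = 0.00019

# Expected stored entries in sprand(n, n, p).
expected_nnz_A = n^2 * p

# In A + A', an off-diagonal position is nonzero if either A[i,j] or A[j,i] is.
expected_offdiag = n * (n - 1) * (2p - p^2)

# Adding 5I fills the whole diagonal.
expected_nnz_total = expected_offdiag + n

# Rough size of the coordinate data under the stated storage assumption.
bytes_per_entry = 4 + 4 + 8
expected_gib = expected_nnz_total * bytes_per_entry / 2^30
```

The estimate comes out at roughly 1.24e8 entries, close to the NNZ of 124018036 that MUMPS prints, so the matrix handed over has about the expected size.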

@rleegates
Author

rleegates commented Nov 28, 2019 via email

@dpo
Member

dpo commented Nov 28, 2019

I'll try to look into the merger with MUMPS3.jl in the next few days.

@dpo
Member

dpo commented Nov 28, 2019

Are you using the same MUMPS library with this package and MUMPS3.jl?

@rleegates
Author

rleegates commented Nov 28, 2019 via email
