
[Python] Segmentation fault occurs on libarrow load when using the pyarrow 17.0.0 arm64 wheel #44342

Closed
vyasr opened this issue Oct 8, 2024 · 30 comments
@vyasr (Contributor) commented Oct 8, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Under a very specific set of circumstances, importing pyarrow 17.0.0 from an arm wheel triggers a segmentation fault. The error comes from the jemalloc function background_thread_entry, which is statically linked into libarrow.so. I can see libarrow.so being opened via strace, and when I run under gdb I see the following backtrace:

[Detaching after vfork from child process 895]
[New Thread 0xfffe18fff1d0 (LWP 960)]
--Type <RET> for more, q to quit, c to continue without paging--c

Thread 128 "jemalloc_bg_thd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfffe18fff1d0 (LWP 960)]
0x0000fffe1b2d2844 in background_thread_entry () from /pyenv/versions/3.12.6/lib/python3.12/site-packages/pyarrow/libarrow.so.1700

(gdb) backtrace
#0  0x0000fffe122f1844 in background_thread_entry () from /pyenv/versions/3.12.6/lib/python3.12/site-packages/pyarrow/libarrow.so.1700
#1  0x0000ffff94a3a624 in start_thread (arg=0xfffe122f17e0 <background_thread_entry>) at pthread_create.c:477
#2  0x0000ffff94b3562c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
(gdb) bt full
#0  0x0000fffe122f1844 in background_thread_entry () from /pyenv/versions/3.12.6/lib/python3.12/site-packages/pyarrow/libarrow.so.1700
No symbol table info available.
#1  0x0000ffff94a3a624 in start_thread (arg=0xfffe122f17e0 <background_thread_entry>) at pthread_create.c:477
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {281466655208940, 281474517841168, 281474517841166, 281473175642112, 281474517841167, 281466691852256,
                281466655209680, 281466655207888, 281473175646208, 281466655207888, 281466655205808, 118832585594287181, 0, 118832583903213793, 0, 0, 0,
                0, 0, 0, 0, 0}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#2  0x0000ffff94b3562c in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78
No locals.

This error is quite difficult to reproduce. In addition to only observing this particular issue with the pyarrow 17.0.0 release (the issue vanishes if I downgrade to an earlier version) and only when testing on arm architectures, it is also highly sensitive to the exact order of prior operations. In my application I load multiple Python extension modules before importing pyarrow, and the order of those imports affects whether or not this issue manifests. The cases where the issue arises do manifest reliably, so it is not a flaky error, but simply adding an unrelated extra import or reordering unrelated imports is often sufficient to make the problem vanish. I attempted to rebuild libarrow.so using the same flags used to build the wheel (I can't be sure that I got them all right, though; I based my compilation on the flags in https://github.com/apache/arrow/blob/main/ci/scripts/python_wheel_manylinux_build.sh) and then preload the library, but that too caused the segmentation fault to disappear, so it's also unlikely that I can get debug symbols into the build in any useful way. I am attempting to reduce this to an MWE in rapidsai/cudf#17022, but I am not very hopeful that it can be reduced much further.
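Independent of gdb, Python's built-in faulthandler module can at least report which Python frames were active when the signal arrived, which helps pin down which import triggers the fault. This is a generic sketch, not specific to this reproducer; the trailing imports are placeholders:

```python
import faulthandler
import sys

# Dump Python-level tracebacks for all threads to stderr if the process
# receives a fatal signal (SIGSEGV, SIGBUS, SIGILL, ...) during the
# imports that follow.
faulthandler.enable(file=sys.stderr, all_threads=True)

# The real (order-sensitive) import sequence would follow here, e.g.:
# import foo
# import bar
# import pyarrow
```

Running the real workload with `python -X faulthandler app.py` achieves the same effect without editing the script.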

Component(s)

Python

@kou (Member) commented Oct 9, 2024

Could you also share thread apply all bt full result?

Is there any other Python extension module that also uses jemalloc?

@vyasr (Contributor, Author) commented Oct 9, 2024

The output is quite large, so I've attached it in a file.
gdb.txt

None of the extensions that I built use jemalloc, but it's possible that something else being loaded into the environment does (e.g. numpy or scipy).

@kou (Member) commented Oct 9, 2024

Thanks but sorry.
I couldn't find any hints in the thread apply all bt full result...

@pitrou (Member) commented Oct 9, 2024

Hi @vyasr , jemalloc_bg_thd is a jemalloc thread. When searching online, there seem to be issues with jemalloc on Linux aarch64, see jemalloc/jemalloc#467 for example.

I would recommend you switch to mimalloc instead of jemalloc, see https://arrow.apache.org/docs/cpp/memory.html#default-memory-pool

Note that mimalloc becomes the default in 18.0.0 as well (see #43254).

On our side, perhaps we should simply disable jemalloc on Linux aarch64 wheels? @raulcd
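The allocator switch suggested above can be made without code changes via an environment variable, or per-process from Python. A minimal sketch follows; which backends are actually available depends on how the wheel was built (see the memory documentation linked above):

```python
import os

# ARROW_DEFAULT_MEMORY_POOL must be set before libarrow is first loaded,
# i.e. before the first "import pyarrow" anywhere in the process.
os.environ["ARROW_DEFAULT_MEMORY_POOL"] = "mimalloc"

def active_backend():
    """Report which allocator pyarrow actually selected (requires pyarrow)."""
    import pyarrow as pa
    return pa.default_memory_pool().backend_name  # "mimalloc", "jemalloc", or "system"

if __name__ == "__main__":
    print("requested:", os.environ["ARROW_DEFAULT_MEMORY_POOL"])
```

Setting the variable in the shell (`ARROW_DEFAULT_MEMORY_POOL=mimalloc python app.py`) avoids any ordering concerns inside the script.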

@vyasr (Contributor, Author) commented Oct 9, 2024

> Thanks but sorry.
> I couldn't find any hints in the thread apply all bt full result...

No problem @kou, I know these kinds of issues can be a huge pain to track down, especially from this limited information.

If it helps, you can see the error in this GHA run on this PR.

> When searching online, there seem to be issues with jemalloc on Linux aarch64, see jemalloc/jemalloc#467 for example.

@pitrou thanks for finding that! That makes sense since it certainly seems like the underlying issue comes from jemalloc and is not arrow-specific.

> I would recommend you switch to mimalloc instead of jemalloc

Good idea, at least for testing. I'm testing that now in this GH workflow. The arm wheel-tests-cudf job is the one to look out for, let's see if using mimalloc bypasses the issue. That being said:

> Note that mimalloc becomes the default in 18.0.0 as well (see #43254). On our side, perhaps we should simply disable jemalloc on Linux aarch64 wheels?

This seems like the right long-term solution if your suggestion to try mimalloc works for me above. pyarrow is a common enough dependency that a user could end up having pyarrow loaded in their environment without even realizing it, and if the import alone is sufficient to trigger the seg fault it would be quite challenging for the average user to debug. Making mimalloc the default seems sufficient to me since IMHO it's reasonable to expect a user explicitly setting the allocator to recognize this as a potential cause, but I wouldn't be opposed to disabling jemalloc altogether on arm either.

@vyasr (Contributor, Author) commented Oct 9, 2024

Hmm, @pitrou I still see segfaults in the job that I linked above. Am I configuring the allocator in the correct way in rapidsai/cudf@635b5e0? If so, that suggests that there is an issue with jemalloc that occurs by simply loading the relevant parts of the binary even if no allocation subroutine is invoked, in which case building aarch64 wheels without jemalloc is definitely the way to go because this is beyond the realm of user configuration.

kou added a commit to kou/arrow that referenced this issue Oct 10, 2024
@kou (Member) commented Oct 10, 2024

It seems that the jemalloc/jemalloc#467 problem was solved by #10940 .

@kou (Member) commented Oct 10, 2024

Could you try a nightly wheel that uses mimalloc by default?
https://arrow.apache.org/docs/developers/python.html#installing-nightly-packages

@pitrou (Member) commented Oct 10, 2024

> If so, that suggests that there is an issue with jemalloc that occurs by simply loading the relevant parts of the binary even if no allocation subroutine is invoked, in which case building aarch64 wheels without jemalloc is definitely the way to go because this is beyond the realm of user configuration.

Ah, that might be the case indeed, if the crash occurs right when importing PyArrow :(

pitrou added the "Priority: Blocker" label (marks a blocker for the release) Oct 10, 2024
pitrou added this to the 18.0.0 milestone Oct 10, 2024
@raulcd (Member) commented Oct 10, 2024

@vyasr is there any way to validate the issue has gone away with the nightly wheels?
https://anaconda.org/scientific-python-nightly-wheels/pyarrow/files

@vyasr (Contributor, Author) commented Oct 10, 2024

I am happy to test out a nightly wheel, but unfortunately I'm not confident that it will tell us anything conclusive. As I mentioned above, in my use case I had a lot of difficulty constructing a true MWE because even small changes like defining a new variable, moving around my imports, or moving imports from one file into another but preserving the order (which still has some effect due to the logic for loading the importing module itself) were sufficient to change whether the error appeared or not, which suggests that some sort of process memory corruption is occurring when the DSO is loaded. As a result, since I assume the nightly wheels will have accumulated many changes since the 17.0.0 release, even if I don't observe the same error it may just be that the error is now simply being hidden by other changes. I can try a few different iterations with different modifications to my scripts to see what happens, though.

@pitrou (Member) commented Oct 10, 2024

Er, are you telling us that it's not simply import pyarrow that triggers the crash?

@vyasr (Contributor, Author) commented Oct 10, 2024

If you're asking whether python -c "import pyarrow" will trigger the crash, then no, that does not crash for me. Quoting from above:

> This error is quite difficult to reproduce. In addition to only observing this particular issue with the pyarrow 17.0.0 release (the issue vanishes if I downgrade to an earlier version) and only when testing on arm architectures, it is also highly sensitive to the exact order of prior operations. In my application I load multiple Python extension modules before importing pyarrow, and the order of those imports affects whether or not this issue manifests. The cases where the issue arises do manifest reliably, so it is not a flaky error, but simply adding an unrelated extra import or reordering unrelated imports is often sufficient to make the problem vanish.

None of the modules that I directly control do any sort of relevant stateful initialization on import, but I cannot guarantee that the same is true for the other modules, so it is entirely possible that something in the stack (e.g. scipy) is doing some sort of initialization of a memory pool that introduces conflicting jemalloc symbols, or some other similar problem (it wouldn't actually be a symbol collision since IIUC libarrow does not make any of its jemalloc symbols publicly visible, but that's illustrative of the class of problems I mean). So roughly speaking, I have

import foo
import bar
... # Other imports
import pyarrow # this seg faults

and changing the sequence of import foo and import bar can change whether the seg fault appears.
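A brute-force way to map which orderings fail is to try each permutation of the imports in a fresh interpreter and check whether the child process dies. A sketch, where harmless stdlib modules stand in for the real heavy imports (cupy, cudf, pyarrow):

```python
import itertools
import subprocess
import sys

def crashes(import_order):
    """Run the import sequence in a fresh interpreter; True if the process dies."""
    code = "; ".join(f"import {mod}" for mod in import_order)
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True)
    # On POSIX a segfault surfaces as a negative returncode (-11 for SIGSEGV);
    # any nonzero code means the child did not survive the imports.
    return proc.returncode != 0

def failing_orders(modules):
    """Every permutation of the imports that kills the interpreter."""
    return [order for order in itertools.permutations(modules) if crashes(order)]

if __name__ == "__main__":
    # Placeholder modules so the sketch is runnable anywhere.
    print(failing_orders(["json", "math"]))
```

For the handful of imports involved here the factorial blow-up is tolerable, and each probe runs in a pristine process, so results are not contaminated by prior imports.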

@pitrou (Member) commented Oct 10, 2024

So, perhaps there's nothing particular that we should do in PyArrow?

@pitrou (Member) commented Oct 10, 2024

(at least if you could git bisect and find out when precisely the issue starts happening with PyArrow, that could perhaps give a clue)

@vyasr (Contributor, Author) commented Oct 10, 2024

Well OK, to my (pleasant) surprise upgrading to the latest nightly did not make the error vanish (well I suppose not pleasant that I have a seg fault, but at least pleasant that there's something reproducible happening):

root@g242-p33-0009:/repo# python -c "import cupy; import cudf;"
Segmentation fault (core dumped)
root@g242-p33-0009:/repo# python
Python 3.12.7 (main, Oct  4 2024, 15:35:43) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> pyarrow.__version__
'18.0.0.dev445'
>>> 

The backtrace is the same, still in jemalloc_bg_thd.

> So, perhaps there's nothing particular that we should do in PyArrow?

I think compiling out jemalloc, or recompiling it with the appropriate page size for arm, could still make sense. While I haven't been able to reduce my example much further yet, the fact that pyarrow < 17.0.0 works while 17.0.0 and the 18 alphas both fail indicates that something meaningful has changed in the pyarrow binary, and anyone could hit it.

> (at least if you could git bisect and find out when precisely the issue starts happening with PyArrow, that could perhaps give a clue)

I would be happy to try that, but I would also need to be able to build pyarrow wheels that are equivalent to the build process you have. As I mentioned above:

> I attempted to rebuild libarrow.so using the same flags used to build the wheel (I can't be sure that I got them all right, though; I based my compilation on the flags in https://github.com/apache/arrow/blob/main/ci/scripts/python_wheel_manylinux_build.sh) and then preload the library, but that too caused the segmentation fault to disappear

Since the latest pyarrow nightlies fail for me, that suggests that I was indeed not compiling exactly equivalent C++ to what you produce (or perhaps I was but there's also something in the Python build that's relevant since I simply LD_PRELOADed libarrow.so). The nightly index linked above unfortunately doesn't go back far enough for me to install nightlies in between 16.1 and 17 to see where the issue might have arisen.

kou changed the title from "Segmentation fault occurs on libarrow load when using the pyarrow 17.0.0 arm64 wheel" to "[Python] Segmentation fault occurs on libarrow load when using the pyarrow 17.0.0 arm64 wheel" Oct 11, 2024
kou added a commit to kou/arrow that referenced this issue Oct 11, 2024
jemalloc may have a problem on ARM.
See also: apache#44342
kou added a commit to kou/arrow that referenced this issue Oct 11, 2024
jemalloc may have a problem on ARM.
See also: apache#44342
@kou (Member) commented Oct 11, 2024

Could you try https://github.com/ursacomputing/crossbow/actions/runs/11285538259#artifacts (download the "wheel" artifact) that disables jemalloc?

@pitrou (Member) commented Oct 11, 2024

Also, can you tell us which hardware exactly you're using, and what the default page size is?

And it would be nice if you could try to disassemble at the point of the crash.
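For the page-size half of the question above, Python can report it directly without lscpu. A POSIX-only sketch; the value matters because jemalloc bakes a maximum page size into the binary via its LG_PAGE build option, which is what jemalloc/jemalloc#467 turned on:

```python
import os

def page_size_bytes():
    """The kernel's runtime page size: 4096 on most distros, but 65536
    on some aarch64 kernels built with 64K pages."""
    return os.sysconf("SC_PAGE_SIZE")

if __name__ == "__main__":
    n = page_size_bytes()
    print(f"page size: {n} bytes = 2^{n.bit_length() - 1}")
```

A jemalloc built with LG_PAGE smaller than the running kernel's page size is known to abort or misbehave, so comparing this number against the wheel's build setting is a quick sanity check.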

@raulcd (Member) commented Oct 13, 2024

I am re-running some CI jobs but this is currently the only blocker to create the initial Release Candidate for 18.0.0. Should we merge disabling jemalloc by default on ARM and create the first RC? Should I create the first RC and potentially add this as a patch release if it solves the issue? @kou @pitrou ?

@kou (Member) commented Oct 13, 2024

Let's merge GH-44380 and create the first RC!

@pitrou (Member) commented Oct 14, 2024

Note that, for now, this is the only report about segfaults on Linux aarch64, so we're not sure if it's really a problem in general or specific to that use case. Ideally I would like answers to the questions in #44342 (comment) specifically :-)

It's probably ok to disable jemalloc at least for 18.0.0, though.

raulcd pushed a commit that referenced this issue Oct 14, 2024
### Rationale for this change

jemalloc may have a problem on ARM.
See also: #44342

### What changes are included in this PR?

* Disable jemalloc by default on ARM.
* Disable jemalloc for manylinux wheel for ARM.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* GitHub Issue: #44342

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
@raulcd (Member) commented Oct 14, 2024

Issue resolved by pull request #44380.

raulcd closed this as completed Oct 14, 2024
raulcd pushed a commit that referenced this issue Oct 14, 2024
@raulcd (Member) commented Oct 14, 2024

I've merged disabling jemalloc by default on ARM to move 18.0.0 forward. We can re-open this issue once we get feedback, or open a follow-up one if necessary.

@vyasr (Contributor, Author) commented Oct 14, 2024

Installing the version with jemalloc disabled does seem to fix the problem. I installed the artifact from https://github.com/ursacomputing/crossbow/actions/runs/11286890348 (slightly different than the link posted above because I'm on Python 3.12) and tested it out, then downgraded again to be sure:

root@g242-p33-0009:/repo/new_wheel# pip install pyarrow-18.0.0.dev452-cp312-cp312-manylinux_2_28_aarch64.whl 
Looking in indexes: https://pypi.org/simple, https://pypi.anaconda.org/rapidsai-wheels-nightly/simple, https://pypi.nvidia.com
Processing ./pyarrow-18.0.0.dev452-cp312-cp312-manylinux_2_28_aarch64.whl
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 17.0.0
    Uninstalling pyarrow-17.0.0:
      Successfully uninstalled pyarrow-17.0.0
Successfully installed pyarrow-18.0.0.dev452
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
root@g242-p33-0009:/repo/new_wheel# cd ..
root@g242-p33-0009:/repo# python -c "import cupy; import cudf;"
root@g242-p33-0009:/repo# pip install pyarrow==17.0.0
Looking in indexes: https://pypi.org/simple, https://pypi.anaconda.org/rapidsai-wheels-nightly/simple, https://pypi.nvidia.com
Collecting pyarrow==17.0.0
  Using cached pyarrow-17.0.0-cp312-cp312-manylinux_2_28_aarch64.whl.metadata (3.3 kB)
Requirement already satisfied: numpy>=1.16.6 in /pyenv/versions/3.12.7/lib/python3.12/site-packages (from pyarrow==17.0.0) (2.0.2)
Using cached pyarrow-17.0.0-cp312-cp312-manylinux_2_28_aarch64.whl (38.7 MB)
Installing collected packages: pyarrow
  Attempting uninstall: pyarrow
    Found existing installation: pyarrow 18.0.0.dev452
    Uninstalling pyarrow-18.0.0.dev452:
      Successfully uninstalled pyarrow-18.0.0.dev452
Successfully installed pyarrow-17.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
root@g242-p33-0009:/repo# python -c "import cupy; import cudf;"
Segmentation fault (core dumped)

So that certainly seems promising. Once pyarrow 18 is released our CI will pick it up automatically, so we'll see if the problem recurs in any way in future builds.

@vyasr (Contributor, Author) commented Oct 14, 2024

> Also, can you tell us which hardware exactly you're using, and what the default page size is?

Here is some information; let me know if you would like anything else. The page size is 4 KiB, so jemalloc/jemalloc#467 doesn't immediately seem to be implicated, but I haven't done much more than skim that issue.

root@g242-p33-0009:/repo# lscpu
Architecture:                       aarch64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
CPU(s):                             80
On-line CPU(s) list:                0-79
Thread(s) per core:                 1
Core(s) per socket:                 80
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          ARM
Model:                              1
Model name:                         Neoverse-N1
Stepping:                           r3p1
Frequency boost:                    disabled
CPU max MHz:                        3000.0000
CPU min MHz:                        1000.0000
BogoMIPS:                           50.00
L1d cache:                          5 MiB
L1i cache:                          5 MiB
L2 cache:                           80 MiB
NUMA node0 CPU(s):                  0-79
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; CSV2, BHB
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
root@g242-p33-0009:/repo# getconf PAGE_SIZE
4096

> And it would be nice if you could try to disassemble at the point of the crash.

Here is the gdb disassembly output. I don't know enough about jemalloc to debug this without spending a bit more time to familiarize myself unfortunately, but perhaps it will be meaningful to you.

@pitrou (Member) commented Oct 14, 2024

Hmm, I've tried to understand the disassembly output (not an expert, sorry). I think the crash is happening in this function:
https://github.com/jemalloc/jemalloc/blob/02251c0070969e526cae3dde6d7b2610a4ed87ef/include/jemalloc/internal/tsd.h#L119-L135

Perhaps you could look for similar issues in the jemalloc issue tracker, and/or open an issue there? Feel free to notify me.

@vyasr (Contributor, Author) commented Oct 14, 2024

I took a look through the issue tracker but didn't see any that really seemed quite right. I'll take another look tomorrow and if I can't find anything I will open a new issue, link here, and tag you.

@vyasr (Contributor, Author) commented Oct 18, 2024

I opened jemalloc/jemalloc#2739 for further discussion on the jemalloc side.

@h-vetinari (Contributor) commented
Just stumbled over this. In conda-forge we build with

CMAKE_ARGS="${CMAKE_ARGS} -DARROW_JEMALLOC_LG_PAGE=16"

and that seems to work fine? I remember @xhochy mentioning that we cannot unvendor jemalloc in conda-forge due to "special options" being required. After reading some of the references here, I'm now assuming this is due to jemalloc/jemalloc#467.

@pitrou (Member) commented Nov 13, 2024

We already pass ARROW_JEMALLOC_LG_PAGE=16 when building Aarch64 wheels, so I don't think that one is the explanation.
