Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault possibly caused by changes to MaxOut in v8.1.1 #771

Closed
ascillitoe opened this issue Sep 22, 2022 · 15 comments
Closed

Segfault possibly caused by changes to MaxOut in v8.1.1 #771

ascillitoe opened this issue Sep 22, 2022 · 15 comments
Labels
third-party Related to third-party packages

Comments

@ascillitoe
Copy link

ascillitoe commented Sep 22, 2022

Problem

We have recently (since Sep 9th) been experiencing intermittent seg faults in our (Alibi's) Windows CI. These occur in some of our tests that use the en_core_web_md pipeline.

We believe we have narrowed down the cause to the Tok2Vec component. Since our errors started on the same day v8.1.1 was released we are wondering if the changes to MaxOut are the cause (#702)? Or perhaps the move to blis v0.9 (#736)?

We've struggled to come up with a MWE, but for a comparison, we've repeated our CI 30x with v8.1.0 and 30x with v8.1.1. With v8.1.0 we experience no failures, whilst with v8.1.1 we experience seg faults >20% of the time.

Error traceback

From this CI workflow:

Windows fatal exception: access violation

Thread 0x00001a38 (most recent call first):
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 316 in wait
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 581 in wait
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\tqdm\_monitor.py", line 60 in run
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 980 in _bootstrap_inner
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\threading.py", line 937 in _bootstrap

Current thread 0x00000760 (most recent call first):
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\ml\staticvectors.py", line 56 in forward
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 291 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\concatenate.py", line 44 in <listcomp>
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\concatenate.py", line 44 in forward
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 291 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\chain.py", line 55 in forward
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 291 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\layers\chain.py", line 55 in forward
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\thinc\model.py", line 315 in predict
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\pipeline\tok2vec.py", line 125 in predict
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\language.py", line 1020 in __call__
  File "D:\a\alibi\alibi\alibi\explainers\tests\test_anchor_text.py", line 326 in test_lm_stopwords_punctuation
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\python.py", line 192 in pytest_pyfunc_call
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\python.py", line [176](https://github.com/SeldonIO/alibi/actions/runs/3025334987/jobs/4867687688#step:7:177)1 in runtest
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 166 in pytest_runtest_call
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 259 in <lambda>
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 338 in from_call
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 258 in call_runtest_hook
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 219 in call_and_report
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 130 in runtestprotocol
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\runner.py", line 111 in pytest_runtest_protocol
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 347 in pytest_runtestloop
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 322 in _main
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 268 in wrap_session
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\main.py", line 315 in pytest_cmdline_main
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_callers.py", line 39 in _multicall
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_manager.py", line 80 in _hookexec
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\pluggy\_hooks.py", line 265 in __call__
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\config\__init__.py", line 164 in main
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\_pytest\config\__init__.py", line [187](https://github.com/SeldonIO/alibi/actions/runs/3025334987/jobs/4867687688#step:7:188) in console_main
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\Scripts\pytest.exe\__main__.py", line 7 in <module>
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\runpy.py", line 87 in _run_code
  File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\runpy.py", line 197 in _run_module_as_main
D:\a\_temp\511c6489-14db-4df4-820f-7b5[188](https://github.com/SeldonIO/alibi/actions/runs/3025334987/jobs/4867687688#step:7:189)c57acc.sh: line 2:  1[224](https://github.com/SeldonIO/alibi/actions/runs/3025334987/jobs/4867687688#step:7:225) Segmentation fault      pytest -m "not tf1" alibi
..........................................XXX......ssss.......
Error: Process completed with exit code 139.

Platform, versions etc

  • Platform: Github runner image windows-2022, version 20220905.1
  • Python versions tested: 3.9.12, 3.9.13
  • pip env: see here
@ascillitoe
Copy link
Author

ascillitoe commented Sep 22, 2022

As further info, looking at the traceback again:

File "C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\spacy\ml\staticvectors.py", line 56 in forward

@RobertSamoilescu has noticed the failure appears to originate in staticvectors.py here. We're therefore wondering if the recent change to gemm (explosion/cython-blis#72) is the cause? Please let us know if it would be better to open an issue over on cython-blis, or if we can provide more info.

@danieldk
Copy link
Contributor

danieldk commented Sep 23, 2022

@RobertSamoilescu has noticed the failure appears to originate in staticvectors.py here. We're therefore wondering if the recent change to gemm (explosion/cython-blis#72) is the cause?

Unlikely, using uninitialized memory erroneously wouldn't cause segfaults, only garbage output. We found a bunch of issues in BLIS 0.9 where gemm would read out-of-bound. These were fixed upstream and the fixes were added to cython-blis. Since we thought the issues in BLIS were fixed, we changed the upper pin to include cython-blis 0.9.x in Thinc 8.1.1. But it seems like there are more memory issues 😢.

Thanks for reporting this!

@njsmith
Copy link

njsmith commented Sep 23, 2022

Can you check whether you still see the same issue in an environment that has thinc 0.8.1 + blis 0.7.8? That would let us confirm for sure whether the issue is coming from thinc or blis.

@ascillitoe
Copy link
Author

Hi @danieldk and @njsmith, thanks for the quick responses. That's a nice idea re blis 0.7.8, we shall do that now and come back to you.

@ascillitoe
Copy link
Author

ascillitoe commented Sep 26, 2022

I've run some more CI runs here. The results are as follows:

  • CI #430 to #460 is with thinc 8.1.1 and blis 0.7.8. There were no failures in 30x runs.
  • CI #462 to #491 is with thinc 8.1.1 and blis 0.9.1. There were 5x failures in 30x runs.

The error is rather intermittent so not 100% conclusive but this does suggest an issue with more recent blis versions.

@svlandeg
Copy link
Member

Hi! We've released thinc 8.1.2 today which restricts blis to <0.8.0 again instead of <0.10.0. Thanks again for your report!

@ascillitoe
Copy link
Author

Thanks for the update @svlandeg!

@svlandeg svlandeg added the third-party Related to third-party packages label Sep 29, 2022
@shadeMe
Copy link
Collaborator

shadeMe commented Sep 30, 2022

@ascillitoe Could you reproduce the crash with the following env variable and post the stdout output of the run: BLIS_ARCH_DEBUG=1? This will help us narrow down which BLIS kernel is faulty - you should something along the lines of libblis: selecting sub-configuration 'haswell'. in the log.

@shadeMe shadeMe reopened this Sep 30, 2022
@ascillitoe
Copy link
Author

@ascillitoe Could you reproduce the crash with the following env variable and post the stdout output of the run: BLIS_ARCH_DEBUG=1? This will help us narrow down which BLIS kernel is faulty - you should something along the lines of libblis: selecting sub-configuration 'haswell'. in the log.

Sure thing, I'll have a go!

@ascillitoe
Copy link
Author

ascillitoe commented Sep 30, 2022

Hi again. I've set some more CI runs going here (508 to 518).

Interesting, it looks like all the runs that fail (like this one) have:

libblis: selecting sub-configuration 'haswell'.

whereas the runs that pass (like this one) have:

libblis: Hardware has 2 FMA units; using 'skx' sub-config.
libblis: selecting sub-configuration 'skx'.

So it looks like it might be the haswell kernels?

@shadeMe
Copy link
Collaborator

shadeMe commented Sep 30, 2022

Thanks for the quick response! Yeah, the OOB access is in one of the haswell kernels, which confirms our suspicion.

danieldk added a commit to danieldk/nixpkgs that referenced this issue Dec 21, 2022
The blis 0.9.x packages cause segmentation faults:

explosion/thinc#771 (comment)

For this reason, Thinc/spaCy (the only users of this package in
nixpkgs) have reverted to 0.7.x until this issue is resolved.
jsoo1 pushed a commit to awakesecurity/nixpkgs that referenced this issue Dec 14, 2023
The blis 0.9.x packages cause segmentation faults:

explosion/thinc#771 (comment)

For this reason, Thinc/spaCy (the only users of this package in
nixpkgs) have reverted to 0.7.x until this issue is resolved.
jsoo1 pushed a commit to awakesecurity/nixpkgs that referenced this issue Dec 22, 2023
The blis 0.9.x packages cause segmentation faults:

explosion/thinc#771 (comment)

For this reason, Thinc/spaCy (the only users of this package in
nixpkgs) have reverted to 0.7.x until this issue is resolved.
jsoo1 pushed a commit to awakesecurity/nixpkgs that referenced this issue Jan 13, 2024
The blis 0.9.x packages cause segmentation faults:

explosion/thinc#771 (comment)

For this reason, Thinc/spaCy (the only users of this package in
nixpkgs) have reverted to 0.7.x until this issue is resolved.
tm-drtina pushed a commit to awakesecurity/nixpkgs that referenced this issue Feb 27, 2024
The blis 0.9.x packages cause segmentation faults:

explosion/thinc#771 (comment)

For this reason, Thinc/spaCy (the only users of this package in
nixpkgs) have reverted to 0.7.x until this issue is resolved.
tm-drtina pushed a commit to awakesecurity/nixpkgs that referenced this issue Apr 8, 2024
The blis 0.9.x packages cause segmentation faults:

explosion/thinc#771 (comment)

For this reason, Thinc/spaCy (the only users of this package in
nixpkgs) have reverted to 0.7.x until this issue is resolved.
tm-drtina pushed a commit to awakesecurity/nixpkgs that referenced this issue May 16, 2024
The blis 0.9.x packages cause segmentation faults:

explosion/thinc#771 (comment)

For this reason, Thinc/spaCy (the only users of this package in
nixpkgs) have reverted to 0.7.x until this issue is resolved.
@dotlambda
Copy link

The version constraint for BLIS was updated in efda3b4. Does that mean this issue was fixed?

@honnibal
Copy link
Member

honnibal commented Sep 11, 2024

Thanks for bringing this up. I didn't test to confirm whether this has been fixed when I updated the Blis release to build against numpy v2.

I'll have a look at the Blis (upstream) tracker.

@honnibal
Copy link
Member

honnibal commented Sep 11, 2024

Okay Blis 1.0 includes @shadeMe and @danieldk 's patch for the OOB error. So I need to release a new version of the Python package that vendors the updated version.

Edit: Ah actually, we're building against our fork. So yes the latest release should include the fix to that OOB issue.

@honnibal
Copy link
Member

Closing this as it should be fixed. Please rereport if the problem comes up again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
third-party Related to third-party packages
Projects
None yet
Development

No branches or pull requests

7 participants