[C++][Python] Segfault during `pyarrow.dataset.write_dataset` with dataset source read with `pre_buffer=True` (#38438)
I was able to reproduce this on my M1 Mac by running the `dataset_serialize` benchmark. It was able to run one case, then generated a segfault on the second case. The logs look the same as those from the `dataset_serialize` benchmark run on Ubuntu, x86_64.
Here is a more stripped-down example:

```python
import pathlib
import shutil
import tempfile
import uuid

import pyarrow.dataset

tempdir = tempfile.TemporaryDirectory()

# First, download https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com/2009/01/data.parquet
source_ds = pyarrow.dataset.dataset("data.parquet")

for n_rows in [561000, 5610000]:
    for serialization_format in ["parquet", "arrow", "feather", "csv"]:
        data = source_ds.head(
            n_rows,
            # Uncomment this and the segfault does not happen!
            # fragment_scan_options=pyarrow.dataset.ParquetFragmentScanOptions(
            #     pre_buffer=False
            # ),
        )
        out_dir = pathlib.Path(tempdir.name) / str(uuid.uuid4())
        # This is where the segfault happens, during one of the loops
        print(f"Writing to {serialization_format}")
        pyarrow.dataset.write_dataset(
            data=data,
            format=serialization_format,
            base_dir=out_dir,
            existing_data_behavior="overwrite_or_ignore",
        )
        print("Done")
        shutil.rmtree(out_dir)
```

I ran this a few times, and there is a mix of symptoms.
Edited the above example to be slightly more minimal.
Do we have a stack trace for the coredump?
I need some help generating the stack trace, as I'm fairly new to Arrow development. Could you point me to the correct flags? I think I built in release mode, so I'll try switching to debug mode first. In the meantime, I'd appreciate it if anyone else can confirm that the above example produces a segfault on their machine.
Without rebuilding (I'm still in release mode), I used `lldb` to capture a stack trace.
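For reference, a Python-level traceback at the moment of the crash can also be captured without a debug build by enabling the standard `faulthandler` module before running the repro. A minimal sketch (assuming the crash arrives as a signal `faulthandler` hooks, such as SIGSEGV):

```python
import faulthandler

# Dump the Python-level traceback of all threads when a fatal
# signal (e.g. SIGSEGV) is received, before the process dies.
faulthandler.enable()

# ...then run the reproduction script from above as usual.
```

The same can be enabled without code changes by running the script with `python -X faulthandler`.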
Nice, let me debug the same way here. I'm having a hard time reproducing this using arrow-13.0 on my macOS... I'll try to build the latest one.
Yeah, as discussed in voltrondata-labs/arrow-benchmarks-ci#166, I think the cause is #37854, which wasn't in the 13.0.0 release yet.
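Until a fix lands, a possible interim workaround suggested by the repro above (a sketch assuming a local `data.parquet`, not an officially recommended approach) is to disable pre-buffering explicitly when scanning:

```python
import pyarrow.dataset

source_ds = pyarrow.dataset.dataset("data.parquet")

# Keep the reader on the synchronous code path by turning
# pre-buffering off; the repro above does not segfault with
# this option set.
data = source_ds.head(
    561000,
    fragment_scan_options=pyarrow.dataset.ParquetFragmentScanOptions(
        pre_buffer=False
    ),
)
```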
I'm trying to write the same logic in C++ but didn't find out the reason... Also, in your lldb output, the data segfault happens during …
Damn, I finally reproduced the bug in C++; let me find out why. It might be a bit hard for me, because ASAN reports a weird reason (…).
Thanks @austin3dickey and @mapleFU for the investigation!
…et (#38466)

### Rationale for this change

Originally mentioned in #38438:

1. When PreBuffer is enabled by default, the code in `RowGroupGenerator::FetchNext` switches to async mode. This makes the state handling more complex.
2. In `RowGroupGenerator::FetchNext`, `[this]` is captured without `shared_from_this`. This is not bad in itself; however, `this->executor_` may point to an invalid address if `this` has been destructed. This patch also fixes a lifetime issue I found in CSV handling.

### What changes are included in this PR?

1. Fix the handling in `cpp/src/parquet/arrow/reader.cc` as described above.
2. Fix a lifetime problem in CSV.

### Are these changes tested?

I tested it locally, but I don't know how to write a unit test here. Feel free to help.

### Are there any user-facing changes?

Bugfix.

* Closes: #38438

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Thanks @mapleFU! I confirmed that your PR fixed the broken benchmark. 👍
Describe the bug, including details regarding any error messages, version, and platform.

Please see voltrondata-labs/arrow-benchmarks-ci#166 for further detail.

Since #37854, the arrow-benchmarks-ci runners have been attempting to run the `dataset-serialize` benchmark on an x86_64 Ubuntu runner using Python 3.8. Each time, somewhere between 0 and 3 cases succeed before we see `Fatal Python error: Segmentation fault`.

Component(s): Benchmarking, Python