[C++][Python] Segfault during `pyarrow.dataset.write_dataset` with dataset source read with `pre_buffer=True` (#38438)
I was able to reproduce this on my M1 Mac by running the `dataset_serialize` benchmark. It was able to run one case, then generated a segfault on the second case. The logs look the same as those from the `dataset_serialize` benchmark run on Ubuntu, x86_64.
Here is a more stripped-down example:

```python
import pathlib
import shutil
import tempfile
import uuid

import pyarrow.dataset

tempdir = tempfile.TemporaryDirectory()

# First, download https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com/2009/01/data.parquet
source_ds = pyarrow.dataset.dataset("data.parquet")

for n_rows in [561000, 5610000]:
    for serialization_format in ["parquet", "arrow", "feather", "csv"]:
        data = source_ds.head(
            n_rows,
            # Uncomment this and the segfault does not happen!
            # fragment_scan_options=pyarrow.dataset.ParquetFragmentScanOptions(
            #     pre_buffer=False
            # ),
        )
        out_dir = pathlib.Path(tempdir.name) / str(uuid.uuid4())
        # This is where the segfault happens, during one of the loops
        print(f"Writing to {serialization_format}")
        pyarrow.dataset.write_dataset(
            data=data,
            format=serialization_format,
            base_dir=out_dir,
            existing_data_behavior="overwrite_or_ignore",
        )
        print("Done")
        shutil.rmtree(out_dir)
```

I ran this a few times, and there is a mix of symptoms.
Edited the above example to be slightly more minimal.
Do we have a stack trace for the coredump?
I need some help generating the stack trace, as I'm fairly new to Arrow development. Could you point me to the correct flags? I think I built in release mode, so I'll try switching to debug mode first. In the meantime, I'd appreciate it if anyone else can confirm that the above example produces a segfault on their machine.
Without rebuilding (I'm still in release mode), I used `lldb` to capture a stack trace.
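For reference, a Python-level traceback at the moment of the crash can also be captured without a debug build by enabling the standard `faulthandler` module before running the repro. A minimal sketch (assuming the crash arrives as a signal `faulthandler` hooks, such as SIGSEGV):

```python
import faulthandler

# Dump the Python-level traceback of all threads when a fatal
# signal (e.g. SIGSEGV) is received, before the process dies.
faulthandler.enable()

# ...then run the reproduction script from above as usual.
```

The same can be enabled without code changes by running the script with `python -X faulthandler`.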
Nice, let me debug the same way here. I'm having a hard time reproducing this using arrow-13.0 on my macOS... I'll try to build the latest one.
Yeah, as discussed in voltrondata-labs/arrow-benchmarks-ci#166, I think the cause is #37854, which wasn't in the 13.0.0 release yet.
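Until a fix lands, a possible interim workaround suggested by the repro above (a sketch assuming a local `data.parquet`, not an officially recommended approach) is to disable pre-buffering explicitly when scanning:

```python
import pyarrow.dataset

source_ds = pyarrow.dataset.dataset("data.parquet")

# Keep the reader on the synchronous code path by turning
# pre-buffering off; the repro above does not segfault with
# this option set.
data = source_ds.head(
    561000,
    fragment_scan_options=pyarrow.dataset.ParquetFragmentScanOptions(
        pre_buffer=False
    ),
)
```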
I'm trying to write the same logic in C++ but didn't find out the reason... Also, in your lldb output, the data segfault happens during …
Damn, I finally reproduced the bug in C++; let me find out why. It might be a bit hard for me, because ASAN reports a weird reason (…).
Thanks @austin3dickey and @mapleFU for the investigation!
…et (#38466)

### Rationale for this change

Originally mentioned in #38438:

1. When PreBuffer is enabled by default, the code in `RowGroupGenerator::FetchNext` switches to async mode. This makes the state handling more complex.
2. In `RowGroupGenerator::FetchNext`, `[this]` is captured without `shared_from_this`. This is not bad in itself; however, `this->executor_` may point to an invalid address if `this` has been destructed. This patch also fixes a lifetime issue I found in CSV handling.

### What changes are included in this PR?

1. Fix the handling in `cpp/src/parquet/arrow/reader.cc` as described above.
2. Fix a lifetime problem in CSV.

### Are these changes tested?

I tested it locally, but I don't know how to write a unit test here. Feel free to help.

### Are there any user-facing changes?

Bugfix.

* Closes: #38438

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Thanks @mapleFU! I confirmed that your PR fixed the broken benchmark. 👍
Describe the bug, including details regarding any error messages, version, and platform.

Please see voltrondata-labs/arrow-benchmarks-ci#166 for further detail.

Since #37854, the arrow-benchmarks-ci runners have been attempting to run the `dataset-serialize` benchmark on an x86_64 Ubuntu runner using Python 3.8. Each time, somewhere between 0 and 3 cases succeed before we see `Fatal Python error: Segmentation fault`.

Component(s): Benchmarking, Python