Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++][Python] Segfault during pyarrow.dataset.write_dataset with dataset source read with pre_buffer=True #38438

Closed
austin3dickey opened this issue Oct 24, 2023 · 12 comments · Fixed by #38466
Assignees
Labels
Component: Python Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Milestone

Comments

@austin3dickey
Copy link
Contributor

austin3dickey commented Oct 24, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Please see voltrondata-labs/arrow-benchmarks-ci#166 for further detail.

Since #37854, the arrow-benchmarks-ci runners have been attempting to run the dataset-serialize benchmark on an x86_64 Ubuntu runner using Python 3.8. Each time, somewhere between 0 and 3 cases succeed before we see Fatal Python error: Segmentation fault.

Component(s)

Benchmarking, Python

@austin3dickey
Copy link
Contributor Author

austin3dickey commented Oct 24, 2023

I was able to reproduce this on my M1 Mac. Steps:

  • build Arrow and pyarrow using this as reference (python 3.8)
  • Instead of running python -m buildkite.benchmark.run_benchmark_groups, I manually ran the dataset-serialize benchmark by:
    • cloning https://github.com/voltrondata-labs/benchmarks
    • checking out commit 93a57ba (I plan to submit a band-aid patch that avoids this issue in the future)
    • running pip install -e . in that repo
    • creating a data dir and saving it to BENCHMARKS_DATA_DIR
    • creating a temporary output dir and saving it to BENCHMARK_OUTPUT_DIR (only required since I'm using Darwin)
    • setting export DRY_RUN=true so it doesn't try to post to Conbench
    • running conbench dataset-serialize ALL --iterations=6 --all=true --drop-caches=true with sudo permissions

This was able to run one case, and generate a segfault on the second case. Logs look the same as here.

@austin3dickey austin3dickey changed the title [Benchmarking] [Python] Segfault during dataset_serialize benchmark on Ubuntu, x86_64 [Benchmarking] [Python] Segfault during dataset_serialize benchmark Oct 24, 2023
@austin3dickey
Copy link
Contributor Author

austin3dickey commented Oct 24, 2023

Here is a more stripped-down example:

import pathlib
import shutil
import tempfile
import uuid

import pyarrow.dataset

tempdir = tempfile.TemporaryDirectory()

# First, download https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com/2009/01/data.parquet
source_ds = pyarrow.dataset.dataset("data.parquet")

for n_rows in [561000, 5610000]:
    for serialization_format in ["parquet", "arrow", "feather", "csv"]:
        data = source_ds.head(
            n_rows,
            # # Uncomment this and the segfault does not happen!
            # fragment_scan_options=pyarrow.dataset.ParquetFragmentScanOptions(
            #     pre_buffer=False
            # ),
        )
        out_dir = pathlib.Path(tempdir.name) / str(uuid.uuid4())

        # This is where the segfault happens, during one of the loops
        print(f"Writing to {serialization_format}")
        pyarrow.dataset.write_dataset(
            data=data,
            format=serialization_format,
            base_dir=out_dir,
            existing_data_behavior="overwrite_or_ignore",
        )
        print("Done")

        shutil.rmtree(out_dir)

I ran this a few times, and there are a mix of symptoms. See:

> python ~/Desktop/test.py
Writing to parquet
[1]    47390 segmentation fault  python ~/Desktop/test.py

> python ~/Desktop/test.py
Writing to parquet
Done
Writing to arrow
[1]    47400 bus error  python ~/Desktop/test.py

> python ~/Desktop/test.py
Writing to parquet
[1]    47413 bus error  python ~/Desktop/test.py

> python ~/Desktop/test.py
Writing to parquet
Done
Writing to arrow
[1]    47431 segmentation fault  python ~/Desktop/test.py

> python ~/Desktop/test.py
Writing to parquet
Done
Writing to arrow
Done
Writing to feather
Done
Writing to csv
Done
Writing to parquet
Done
[1]    58361 segmentation fault  python ~/Desktop/test.py

@austin3dickey austin3dickey changed the title [Benchmarking] [Python] Segfault during dataset_serialize benchmark [Benchmarking] [Python] Segfault during pyarrow.dataset.write_dataset Oct 24, 2023
@austin3dickey
Copy link
Contributor Author

Edited the above example to be slightly more minimal.

@mapleFU
Copy link
Member

mapleFU commented Oct 25, 2023

Do we have some stack for the coredump?

@austin3dickey
Copy link
Contributor Author

austin3dickey commented Oct 25, 2023

I need some help generating the stack as I'm fairly new to Arrow development. Could you point me to the correct flags? I think I built in release mode so I'll try switching to debug mode first.

In the meantime I'd appreciate if anyone else can confirm that the above example produces a segfault on their machine.

@austin3dickey
Copy link
Contributor Author

Without rebuilding (I'm still on release mode) I used lldb:

> lldb /opt/miniconda3/envs/test/bin/python /Users/austin/Desktop/test.py
(lldb) target create "/opt/miniconda3/envs/test/bin/python"
Current executable set to '/opt/miniconda3/envs/test/bin/python' (arm64).
(lldb) settings set -- target.run-args  "/Users/austin/Desktop/test.py"
(lldb) run
Process 77545 launched: '/opt/miniconda3/envs/test/bin/python' (arm64)
Writing to parquet
Process 77545 stopped
* thread #28, stop reason = EXC_BAD_ACCESS (code=1, address=0x5)
    frame #0: 0x0000000105e6bdac libparquet.1400.0.0.dylib`parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::__1::vector<int, std::__1::allocator<int>> const&, std::__1::vector<int, std::__1::allocator<int>> const&, arrow::internal::Executor*) + 1720
libparquet.1400.0.0.dylib`parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups:
->  0x105e6bdac <+1720>: ldr    x8, [x0]
    0x105e6bdb0 <+1724>: ldr    x9, [x8, #0x30]
    0x105e6bdb4 <+1728>: add    x8, sp, #0x278
    0x105e6bdb8 <+1732>: add    x1, sp, #0x258
Target 0: (python) stopped.
(lldb) bt
* thread #28, stop reason = EXC_BAD_ACCESS (code=1, address=0x5)
  * frame #0: 0x0000000105e6bdac libparquet.1400.0.0.dylib`parquet::arrow::(anonymous namespace)::FileReaderImpl::DecodeRowGroups(std::__1::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, std::__1::vector<int, std::__1::allocator<int>> const&, std::__1::vector<int, std::__1::allocator<int>> const&, arrow::internal::Executor*) + 1720
    frame #1: 0x0000000105e6b52c libparquet.1400.0.0.dylib`parquet::arrow::RowGroupGenerator::ReadOneRowGroup(arrow::internal::Executor*, std::__1::shared_ptr<parquet::arrow::(anonymous namespace)::FileReaderImpl>, int, std::__1::vector<int, std::__1::allocator<int>> const&) + 164
    frame #2: 0x0000000105e753b8 libparquet.1400.0.0.dylib`std::__1::enable_if<is_future<arrow::Future<std::__1::function<arrow::Future<std::__1::shared_ptr<arrow::RecordBatch>> ()>>>::value, void>::type arrow::detail::ContinueFuture::operator()<parquet::arrow::RowGroupGenerator::FetchNext()::'lambda'(), arrow::Future<std::__1::function<arrow::Future<std::__1::shared_ptr<arrow::RecordBatch>> ()>>, arrow::Future<std::__1::function<arrow::Future<std::__1::shared_ptr<arrow::RecordBatch>> ()>>>(arrow::Future<std::__1::function<arrow::Future<std::__1::shared_ptr<arrow::RecordBatch>> ()>>, parquet::arrow::RowGroupGenerator::FetchNext()::'lambda'()&&) const + 96
    frame #3: 0x0000000105e75160 libparquet.1400.0.0.dylib`arrow::Future<arrow::internal::Empty>::ThenOnComplete<parquet::arrow::RowGroupGenerator::FetchNext()::'lambda'(), arrow::Future<arrow::internal::Empty>::PassthruOnFailure<parquet::arrow::RowGroupGenerator::FetchNext()::'lambda'()>>::operator()(arrow::Result<arrow::internal::Empty> const&) && + 72
    frame #4: 0x000000014030fa6c libarrow.1400.0.0.dylib`arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::__1::shared_ptr<arrow::FutureImpl> const&, arrow::FutureImpl::CallbackRecord&&, bool) + 292
    frame #5: 0x000000014030f804 libarrow.1400.0.0.dylib`arrow::ConcreteFutureImpl::DoMarkFinishedOrFailed(arrow::FutureState) + 200
    frame #6: 0x0000000105e671f0 libparquet.1400.0.0.dylib`arrow::Future<arrow::internal::Empty>::DoMarkFinished(arrow::Result<arrow::internal::Empty>) + 148
    frame #7: 0x0000000105e67050 libparquet.1400.0.0.dylib`void arrow::Future<arrow::internal::Empty>::MarkFinished<arrow::internal::Empty, void>(arrow::Status) + 68
    frame #8: 0x0000000105e73ef0 libparquet.1400.0.0.dylib`arrow::Future<arrow::internal::Empty> arrow::internal::Executor::DoTransfer<arrow::internal::Empty, arrow::Future<arrow::internal::Empty>, arrow::Status>(arrow::Future<arrow::internal::Empty>, bool)::'lambda'(arrow::Status const&)::operator()(arrow::Status const&) + 124
    frame #9: 0x000000014030fed4 libarrow.1400.0.0.dylib`arrow::internal::FnOnce<void ()>::FnImpl<arrow::ConcreteFutureImpl::RunOrScheduleCallback(std::__1::shared_ptr<arrow::FutureImpl> const&, arrow::FutureImpl::CallbackRecord&&, bool)::'lambda'()>::invoke() + 40
    frame #10: 0x000000014033c6a8 libarrow.1400.0.0.dylib`void* std::__1::__thread_proxy[abi:v160006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, arrow::internal::ThreadPool::LaunchWorkersUnlocked(int)::$_6>>(void*) + 504
    frame #11: 0x0000000196b5bfa8 libsystem_pthread.dylib`_pthread_start + 148

@mapleFU
Copy link
Member

mapleFU commented Oct 25, 2023

Nice, let me use the same way to debug here..

I'm hard to re-produce this using arrow-13.0 in my MacOs...I'll try to build the latest one

@austin3dickey
Copy link
Contributor Author

Yeah, as discussed in voltrondata-labs/arrow-benchmarks-ci#166 I think the cause is #37854, which wasn't in the 13.0.0 release yet.

@mapleFU
Copy link
Member

mapleFU commented Oct 25, 2023

Trying to write the same logic with C++ but didn't find out the reason...

code=1, address=0x5

From address=0x5 it might be a memory problem. I've check the logic for Head and ReadOneRowGroup. I guess just read them would cause any problems.

Also, in your lldb, data segment fault during Read rather than write. And Head will continue head on a single file...Does it really related to write_dataset? Without write_dataset, would it segfault?

@mapleFU
Copy link
Member

mapleFU commented Oct 25, 2023

Damn, I finally re-produce the bug in C++, let me find out why. It might be a bit hard for me, because ASAN report a weird reason( Future use a relesed State ) . I'll submit the code if I can't find out the reason tomorrow.

@jorisvandenbossche jorisvandenbossche changed the title [Benchmarking] [Python] Segfault during pyarrow.dataset.write_dataset [C++][Python] Segfault during pyarrow.dataset.write_dataset with dataset source read with pre_buffer=True Oct 26, 2023
@jorisvandenbossche
Copy link
Member

Thanks @austin3dickey and @mapleFU for the investigation!

bkietz pushed a commit that referenced this issue Nov 17, 2023
…et (#38466)

### Rationale for this change

Origin mentioned #38438

1. When PreBuffer is default enabled, the code in `RowGroupGenerator::FetchNext` would switch to async mode. This make the state handling more complex
2. In `RowGroupGenerator::FetchNext`, `[this]` is captured without `shared_from_this`. This is not bad, however, `this->executor_` may point to a invalid address if this dtor.

This patch also fixes a lifetime issue I founded in CSV handling.

### What changes are included in this PR?

1. Fix handling in `cpp/src/parquet/arrow/reader.cc` as I talked above
2. Fix a lifetime problem in CSV

### Are these changes tested?

I test it locality. But don't know how to write unittest here. Fell free to help.

### Are there any user-facing changes?

Bugfix

* Closes: #38438

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
@bkietz bkietz added this to the 15.0.0 milestone Nov 17, 2023
@austin3dickey
Copy link
Contributor Author

Thanks @mapleFU ! I confirmed that your PR fixed the broken benchmark. 👍

raulcd pushed a commit that referenced this issue Nov 28, 2023
…et (#38466)

### Rationale for this change

Origin mentioned #38438

1. When PreBuffer is default enabled, the code in `RowGroupGenerator::FetchNext` would switch to async mode. This make the state handling more complex
2. In `RowGroupGenerator::FetchNext`, `[this]` is captured without `shared_from_this`. This is not bad, however, `this->executor_` may point to a invalid address if this dtor.

This patch also fixes a lifetime issue I founded in CSV handling.

### What changes are included in this PR?

1. Fix handling in `cpp/src/parquet/arrow/reader.cc` as I talked above
2. Fix a lifetime problem in CSV

### Are these changes tested?

I test it locality. But don't know how to write unittest here. Fell free to help.

### Are there any user-facing changes?

Bugfix

* Closes: #38438

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
@amoeba amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Dec 7, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
… dataset (apache#38466)

### Rationale for this change

Origin mentioned apache#38438

1. When PreBuffer is default enabled, the code in `RowGroupGenerator::FetchNext` would switch to async mode. This make the state handling more complex
2. In `RowGroupGenerator::FetchNext`, `[this]` is captured without `shared_from_this`. This is not bad, however, `this->executor_` may point to a invalid address if this dtor.

This patch also fixes a lifetime issue I founded in CSV handling.

### What changes are included in this PR?

1. Fix handling in `cpp/src/parquet/arrow/reader.cc` as I talked above
2. Fix a lifetime problem in CSV

### Are these changes tested?

I test it locality. But don't know how to write unittest here. Fell free to help.

### Are there any user-facing changes?

Bugfix

* Closes: apache#38438

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Python Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Projects
None yet
6 participants