segmentation fault from opening large single column csv with small block size in pyarrow.csv.open_csv() #38878
Comments
Would you like to try the master branch? I think maybe my patch (#38466) has fixed this.
Thank you for your response! I tried building pyarrow from the main branch, and now it reports a different error. Thanks in advance for your time.
This error might be raised by the code below. How can I reproduce the problem using C++ or Python code?

```cpp
Status Chunker::ProcessWithPartial(std::shared_ptr<Buffer> partial,
                                   std::shared_ptr<Buffer> block,
                                   std::shared_ptr<Buffer>* completion,
                                   std::shared_ptr<Buffer>* rest) {
  if (partial->size() == 0) {
    // If partial is empty, don't bother looking for completion
    *completion = SliceBuffer(block, 0, 0);
    *rest = block;
    return Status::OK();
  }
  int64_t first_pos = -1;
  RETURN_NOT_OK(boundary_finder_->FindFirst(std::string_view(*partial),
                                            std::string_view(*block), &first_pos));
  if (first_pos == BoundaryFinder::kNoDelimiterFound) {
    // No delimiter in block => the current object is too large for block size
    return StraddlingTooLarge();
  } else {
    *completion = SliceBuffer(block, 0, first_pos);
    *rest = SliceBuffer(block, first_pos);
    return Status::OK();
  }
}
```
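For reference, the `StraddlingTooLarge()` branch above can be hit from Python whenever `block_size` is smaller than a single row; a minimal sketch (the file contents, block size, and expected message here are illustrative assumptions, not taken from the report):

```python
import io
import pyarrow as pa
import pyarrow.csv as csv

# Each row is 13 bytes ("0.1234567890" + "\n") but the block is only 8 bytes,
# so the chunker never finds a row boundary inside a block.
data = b"0.1234567890\n" * 10
opts = csv.ReadOptions(column_names=["dne"], block_size=8)
try:
    csv.read_csv(io.BytesIO(data), read_options=opts)
except pa.ArrowInvalid as exc:
    print(exc)  # expected: the "straddling object" error produced by the code above
```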
Here's a Python sample that I created to replicate this behavior. I tested it on my end and it seems to work; please let me know if it reproduces the problem for you. The sample takes a csv file and erases its content, then writes 30000 random numbers of length 12 to the file (to simulate a standard single-column csv file with constant row length).
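A sketch of the described steps (the file name, the exact skip value, and reading back with a 13-byte block are assumptions filled in from the surrounding discussion):

```python
import random
import pyarrow as pa
import pyarrow.csv as csv

# Overwrite the file with 30000 random numbers, each truncated to 12 characters,
# one per line -- a single-column csv with constant row length.
with open("feature.csv", "w") as f:
    for _ in range(30000):
        f.write(f"{str(random.uniform(-0.5, 0.5))[:12]}\n")

# Reading it back with a 13-byte block and a large skip_rows_after_names is
# what segfaulted on pyarrow 14.0.1 in the original report.
r_options = csv.ReadOptions(skip_rows_after_names=8300, use_threads=True,
                            column_names=["dne"], block_size=13)
c_options = csv.ConvertOptions(column_types={"dne": pa.float32()})
stream = csv.open_csv("feature.csv", read_options=r_options,
                      convert_options=c_options)
```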
By the way, I notice that Arrow aims to find a delimiter in the block and otherwise raises an error in the code you quoted. But by the standard csv format, a single-column csv file would not contain any field delimiters.
update:
cc @pitrou
I'll try to reproduce these in |
Ok, this is because there is a recursion in the handling of skipped rows:

arrow/cpp/src/arrow/csv/reader.cc Lines 934 to 938 in 94fc124

This should not be a problem in normal usage, but you are asking for a large skip_rows_after_names together with a very small block_size, so the skipping goes through thousands of recursive calls (one per block) and the stack eventually overflows, hence the segmentation fault.
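One possible workaround, sketched under the assumption that the underlying goal is to extract a row range (e.g. rows 98885 to 111200, as mentioned below) rather than to use a 13-byte block: keep the default block_size and slice the streamed record batches instead.

```python
import pyarrow as pa
import pyarrow.csv as csv

start, stop = 98885, 111200  # desired row range (assumption, from the question below)

c_options = csv.ConvertOptions(column_types={"dne": pa.float32()})
r_options = csv.ReadOptions(column_names=["dne"])  # default block_size

stream = csv.open_csv("feature.csv", read_options=r_options,
                      convert_options=c_options)

rows_seen = 0
batches = []
for batch in stream:
    # Keep only the part of this batch that falls inside [start, stop).
    lo = max(start - rows_seen, 0)
    hi = min(stop - rows_seen, batch.num_rows)
    if lo < hi:
        batches.append(batch.slice(lo, hi - lo))
    rows_seen += batch.num_rows
    if rows_seen >= stop:
        break

table = pa.Table.from_batches(batches) if batches else None
```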
Describe the usage question you have. Please include as many useful details as possible.
platform:
NAME="Ubuntu" VERSION="23.04 (Lunar Lobster)"
pyarrow version:
pyarrow 14.0.1
pyarrow-hotfix 0.5
python version:
Python 3.11.4 (main, Jun 9 2023, 07:59:55) [GCC 12.3.0] on linux
I have a very large single column csv file (about 63 million rows). I was hoping to create a lazy file streamer that reads one entry from the csv file at a time. I know each entry in my file has a length of 12 chars, so I tried setting block size to 13 (+1 for \n) with the pyarrow.csv.open_csv function.
```python
import pyarrow as pa
import pyarrow.csv as csv

c_options = csv.ConvertOptions(column_types={'dne': pa.float32()})
r_options = csv.ReadOptions(skip_rows_after_names=8200, use_threads=True,
                            column_names=["dne"], block_size=13)
stream = csv.open_csv(file, convert_options=c_options, read_options=r_options)
```
This code functions properly as expected, but when I change the skip_rows_after_names param of the read options to 8300, I start to get segmentation faults inside the open_csv function. How can I fix this (or am I using it wrong)? I want to be able to use only a portion of the file at a time (like from row 98885 to 111200). I was able to reproduce this error on another computer with the exact same platform and versions. The file was created with:
with open(f"feature_{i}.csv", "w+") as f: for i in range(FILE_LEN): n = random.uniform(-0.5, 0.5) nn = str(n)[:12] f.write(f"{nn}\n")
Component(s)
Python