[C++] get different total rows answer for one dataset #37840
Comments
@litao3rd can you give a minimal case for this (for example, one or a few files that reproduce the problem), and tell us which version of Arrow you're using?
Sorry for my mistake. I am currently using version 12.0.1 of Arrow. (Maybe I should update the library to the latest version, 13.0.0?) The original data is downloaded from this website. To simplify the process, I have written a small script that downloads a few files that reproduce this issue. The script creates a directory called "tlc-trip-record-data" in the current directory and downloads the necessary data into it. Please note that we are behind the Great Firewall, so you may need to run the script with an optional proxy using the following command: sh ./download-tcl-trip-record-data.sh --proxy protocol://host:port.
I have encountered a perplexing issue. When I use 6 months of data, the first block returns a total of 142,516,648 rows, while the other block returns 0 rows. However, when I use 5 months of data, excluding January, both blocks yield the same result. Unfortunately, I am unable to identify the bug in Arrow due to its complexity. Please note that you need to modify the path to the tlc-trip-record-data directory in the C++ code.
#!/bin/bash
set -e
dataset="tlc-trip-record-data"
test -d $dataset || mkdir $dataset
pushd $dataset
for n in $(seq 1 6); do
d="2023-0$n"
echo "[*] downloading data for $d"
# "$@" is quoted so that extra arguments such as --proxy are forwarded unchanged
curl "$@" -LO "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_$d.parquet"
curl "$@" -LO "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_$d.parquet"
curl "$@" -LO "https://d37ci6vzurychx.cloudfront.net/trip-data/fhv_tripdata_$d.parquet"
curl "$@" -LO "https://d37ci6vzurychx.cloudfront.net/trip-data/fhvhv_tripdata_$d.parquet"
echo "[*] finished download data for $d"
done
popd
Oops, let me try with master. I've checked, and I've tested that the result status of
Because field_type is null, and the column is int64.
The month1 schema is:
The month6 schema is:
I think the problem arises from the schema mismatch.
ds::FinishOptions finishOptions;
finishOptions.inspect_options.fragments = 30;
auto dataset = factory->Finish(finishOptions).ValueOrDie();
I think the result from However, the most suitable way is to set the schema in
::arrow::SchemaBuilder builder;
builder.AddField(::arrow::field("dispatching_base_num", ::arrow::large_utf8()));
builder.AddField(::arrow::field("pickup_datetime", ::arrow::timestamp(::arrow::TimeUnit::SECOND)));
builder.AddField(::arrow::field("dropoff_datetime", ::arrow::timestamp(::arrow::TimeUnit::SECOND)));
builder.AddField(::arrow::field("PULocationID", ::arrow::int32()));
builder.AddField(::arrow::field("DOLocationID", ::arrow::int32()));
builder.AddField(::arrow::field("SR_Flag", ::arrow::int32()));
builder.AddField(::arrow::field("dispatching_base_number", ::arrow::large_utf8()));
finishOptions.schema = builder.Finish().ValueOrDie();
auto dataset = factory->Finish(finishOptions).ValueOrDie();
Explicitly setting a schema is better in this case.
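To illustrate why raising inspect_options.fragments changes the outcome, here is a minimal sketch that does not use Arrow at all. ToyType, ToyMerge, MergeToyTypes, and InferFromFragments are hypothetical names invented for this illustration. The sketch assumes that schema inference looks only at the first N files and that a null-typed column can be promoted to a concrete type, which is what the thread suggests but not a statement of Arrow's exact behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column type; this is not an Arrow class.
enum class ToyType { Null, Int64, Utf8 };

// Result of merging two toy types: `ok` is false when they are incompatible.
struct ToyMerge {
  bool ok;
  ToyType type;
};

// Equal types merge to themselves, and Null is promoted to the other type.
ToyMerge MergeToyTypes(ToyType a, ToyType b) {
  if (a == b) return {true, a};
  if (a == ToyType::Null) return {true, b};
  if (b == ToyType::Null) return {true, a};
  return {false, a};
}

// Infer one column's type by looking at the first `fragments` files only.
ToyMerge InferFromFragments(const std::vector<ToyType>& per_file,
                            std::size_t fragments) {
  assert(!per_file.empty() && fragments > 0);
  ToyMerge merged = {true, per_file.front()};
  for (std::size_t i = 1;
       i < per_file.size() && i < fragments && merged.ok; ++i) {
    merged = MergeToyTypes(merged.type, per_file[i]);
  }
  return merged;
}
```

With per-file types {Null, Int64, Int64}, inspecting one file yields Null, while inspecting 30 yields Int64. In this toy model that is the difference between a schema that cannot hold the later files' values and one that can.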
It's truly remarkable! The method you demonstrated is certainly helpful for someone like me who is not very familiar with Arrow. I sincerely appreciate all your kind replies.
I am not very familiar with C++, so I could not understand this code snippet well. Could you kindly provide a complete snippet to illustrate how this code works? This would greatly help me understand it.
arrow::Status status;
fs::FileSelector selector;
selector.base_dir = base_dir;
selector.recursive = true;
auto factory = ds::FileSystemDatasetFactory::Make(filesystem, selector, format, ds::FileSystemFactoryOptions())
.ValueOrDie();
ds::FinishOptions finishOptions;
::arrow::SchemaBuilder builder;
auto s = builder.AddField(::arrow::field("dispatching_base_num", ::arrow::large_utf8()));
if (!s.ok()) {
s.Abort();
}
// operator&= keeps the first error, so the single check below covers every AddField call.
s &= builder.AddField(::arrow::field("pickup_datetime", ::arrow::timestamp(::arrow::TimeUnit::SECOND)));
s &= builder.AddField(::arrow::field("dropoff_datetime", ::arrow::timestamp(::arrow::TimeUnit::SECOND)));
s &= builder.AddField(::arrow::field("PULocationID", ::arrow::int32()));
s &= builder.AddField(::arrow::field("DOLocationID", ::arrow::int32()));
s &= builder.AddField(::arrow::field("SR_Flag", ::arrow::int32()));
s &= builder.AddField(::arrow::field("dispatching_base_number", ::arrow::large_utf8()));
if (!s.ok()) {
s.Abort();
}
finishOptions.schema = builder.Finish().ValueOrDie();
auto dataset = factory->Finish(finishOptions).ValueOrDie();
auto sb = dataset->NewScan().ValueOrDie();
s = sb->UseThreads(false);
if (!s.ok()) {
s.Abort();
}
auto scanner = sb->Finish().ValueOrDie();
{
// In this block I got total 57410540 rows
int64_t total_rows = 0;
if (scanner->options()->dataset_schema != nullptr) {
std::cout << "Schema: " << scanner->options()->dataset_schema->ToString() << std::endl;
} else {
std::cout << "Schema: " << "NONE" << std::endl;
}
status = scanner->Scan([&](ds::TaggedRecordBatch batch) -> arrow::Status {
total_rows += batch.record_batch->num_rows();
return arrow::Status::OK();
});
std::cout << "total count rows in visitor mode = " << total_rows << ", result:" << status.ToString() << "\n";
}
{
// In this block I got total 1526807659 rows
std::cout << "total count rows() = " << scanner->CountRows().ValueOrDie() << "\n";
}
return 0;
Remember to check the
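The snippet above prints the status returned by the visitor-style scan next to the running total. As a hedged illustration of why that matters, the following sketch uses no Arrow types: ToyStatus, ToyFragment, ToyScan, and CountWithVisitor are hypothetical names, and the model assumes that a scan stops at the first fragment it cannot read. Under that assumption, a total accumulated in a visitor is only partial whenever the status is not OK.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a status value; this is not arrow::Status.
struct ToyStatus {
  bool ok;
  std::string message;
};

// Hypothetical fragment: a row count plus a flag saying whether reading it fails.
struct ToyFragment {
  std::int64_t num_rows;
  bool readable;
};

// Visit fragments in order; stop at the first unreadable one and report it.
ToyStatus ToyScan(const std::vector<ToyFragment>& fragments,
                  const std::function<void(std::int64_t)>& visitor) {
  for (const auto& f : fragments) {
    assert(f.num_rows >= 0);
    if (!f.readable) return {false, "cannot read fragment with current schema"};
    visitor(f.num_rows);
  }
  return {true, ""};
}

// A total is only meaningful together with the status of the scan that produced it.
struct ToyCount {
  std::int64_t total_rows;
  ToyStatus status;
};

ToyCount CountWithVisitor(const std::vector<ToyFragment>& fragments) {
  std::int64_t total = 0;
  ToyStatus status = ToyScan(fragments, [&](std::int64_t n) { total += n; });
  return {total, status};
}
```

For fragments with 100, 200, and 300 rows where the second is unreadable, CountWithVisitor returns a total of 100 with a failed status. Ignoring the status would make 100 look like the real answer.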
@bkietz I found a problem, when
Result<std::shared_ptr<Field>> Field::MergeWith(const Field& other,
MergeOptions options) const {
if (name() != other.name()) {
return Status::Invalid("Field ", name(), " doesn't have the same name as ",
other.name());
}
if (Equals(other, /*check_metadata=*/false)) {
return Copy();
}
if (options.promote_nullability) {
if (type()->Equals(other.type())) {
return Copy()->WithNullable(nullable() || other.nullable());
}
std::shared_ptr<Field> promoted = MaybePromoteNullTypes(*this, other);
if (promoted) return promoted;
}
return Status::Invalid("Unable to merge: Field ", name(),
" has incompatible types: ", type()->ToString(), " vs ",
other.type()->ToString());
}
In this case, if we don't set an explicit schema and instead deduce it from the Parquet files, it will say
Great! Thanks for your reply. It's really helpful for me.
I think the current approach is reasonable. Promoting to a common type would be a good option to have, but I don't want it to be the default. This behavior is what I expected.
Yeah, I just raised a problem I found with the community. You'd better use self-defined metadata or a self-defined schema in this case. However, I think the
Yep. It's a good problem for the community. I hope you find a great way to solve it. I really appreciate your friendly responses. It's been an awesome experience for me.
This case is expected. UnifySchemas is designed to only merge fields which are compatible with zero conversion overhead across any reader. This includes (for example) promotion from a required field to a nullable field, since we can always read a column guaranteed to have no nulls as a column which might have nulls. It doesn't include promotion from utf8 to large_utf8 because although a Parquet column can be read into a large_utf8 column with no more steps than it could be read into a utf8 column, when reading Arrow IPC we'd need to introduce a conversion step. Supporting extended promotions such as utf8->large_utf8 and integer widening would need to operate at the level of
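The rules described in this comment, together with the Field::MergeWith code quoted earlier, can be sketched as a small self-contained model. ToyType, ToyField, ToyFieldMerge, and MergeToyFields are hypothetical names, not Arrow classes. Marking a promoted null field as nullable is an assumption of this sketch rather than something the thread states.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins; these are not Arrow classes.
enum class ToyType { Null, Int32, Int64, Utf8, LargeUtf8 };

struct ToyField {
  std::string name;
  ToyType type;
  bool nullable;
};

// `ok` is false when the two fields cannot be merged without a conversion.
struct ToyFieldMerge {
  bool ok;
  ToyField field;
};

ToyFieldMerge MergeToyFields(const ToyField& a, const ToyField& b) {
  assert(!a.name.empty() && !b.name.empty());
  // Fields with different names are never merged.
  if (a.name != b.name) return {false, a};
  // Same type: only nullability is promoted (required -> nullable).
  if (a.type == b.type) {
    return {true, ToyField{a.name, a.type, a.nullable || b.nullable}};
  }
  // A null-typed field takes the other field's type (assumed nullable).
  if (a.type == ToyType::Null) return {true, ToyField{a.name, b.type, true}};
  if (b.type == ToyType::Null) return {true, ToyField{a.name, a.type, true}};
  // Anything else, such as Utf8 vs LargeUtf8 or Int32 vs Int64, needs a
  // conversion step and is rejected.
  return {false, a};
}
```

In this model a required int64 field merges with a nullable int64 field, and a null field merges with an int64 field, but utf8 against large_utf8 is rejected. That matches the behavior described in this comment.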
The original problem has been resolved. Close this issue.
Describe the bug, including details regarding any error messages, version, and platform.
I'm learning to use Arrow with C++. It's possible that this issue isn't a bug but rather a result of incorrect usage, but I'm not certain.
The code below utilizes the "tlc-trip-record-data" dataset, which consists of 264 parquet files that I've downloaded. My objective is to calculate the total number of rows in the dataset. However, as demonstrated below, I've encountered varying results when working with this dataset. Your assistance would be greatly appreciated.
Output:
Thanks for any reply.
Component(s)
C++