In #972 we introduced a row-level filter to improve queries against parquet files with an explicit time range. A rough test on a small parquet file (~60 MB) showed a performance improvement of about 8x. To see how much more potential row filters have, I benchmarked the parquet reader with the following configurations:
- no row filter: all columns are fetched from parquet, and we check every element of the timestamp column, discarding the unrelated rows;
- a plain row filter (`PlainTimestampRowFilter`) that iterates over the timestamp column to find the desired rows; this filter fetches only the timestamp column to evaluate the predicate;
- a fast row filter (`FastTimestampRowFilter`) that evaluates the same range predicate with arrow's SIMD-accelerated comparison kernels (compared against the plain variant in the Summary below).
Only a fragment of the filter's predicate evaluation survives here; the full implementations live in src/storage/src/sst/parquet.rs, lines 390 to 402 and lines 452 to 467 (commit 75b8afe). Reconstructed around the surviving lines, the evaluation looks roughly like this (the enclosing signature and the downcast are assumptions based on parquet's `ArrowPredicate` trait):

```rust
fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
    // Assumed: the timestamp column is the only projected column.
    let ts_col = batch
        .column(0)
        .as_any()
        .downcast_ref::<TimestampMillisecondArray>()
        .unwrap(); // safety: we've checked the data type of timestamp column.
    // Vectorized kernels evaluate lower_bound <= ts < upper_bound.
    let left = arrow::compute::gt_eq_scalar(ts_col, self.lower_bound)?;
    let right = arrow::compute::lt_scalar(ts_col, self.upper_bound)?;
    arrow::compute::and(&left, &right)
}
```
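For context, here is a hedged sketch (not GreptimeDB's actual code) of how such a predicate is attached to the reader through parquet's `RowFilter` API; the file name, column index, and bounds are illustrative:

```rust
use std::fs::File;

use arrow::array::TimestampMillisecondArray;
use arrow::compute;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("bench.parquet")?; // hypothetical benchmark file
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Project only the timestamp column (assumed to be leaf 0) into the
    // predicate, so the reader decodes just that column during evaluation.
    let ts_mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
    let predicate = ArrowPredicateFn::new(ts_mask, |batch| {
        let ts = batch
            .column(0)
            .as_any()
            .downcast_ref::<TimestampMillisecondArray>()
            .expect("timestamp column");
        // 1000 <= ts < 2000, matching the fragment above.
        let left = compute::gt_eq_scalar(ts, 1000)?;
        let right = compute::lt_scalar(ts, 2000)?;
        compute::and(&left, &right)
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;
    for batch in reader {
        println!("{} matching rows", batch?.num_rows());
    }
    Ok(())
}
```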
## Benchmark
I created an 865 MB parquet file containing 86,400,000 rows; the query condition is a small timestamp range, [1000, 2000), from which 1,000 rows are expected to be retrieved.
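For reproducibility, here is a minimal sketch of how such a file could be generated with parquet's `ArrowWriter`; the schema (one millisecond timestamp plus one float value column) and the file name are assumptions, and the exact file size depends on encoding:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{Float64Array, TimestampMillisecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed schema: one millisecond timestamp plus one value column.
    let schema = Arc::new(Schema::new(vec![
        Field::new("ts", DataType::Timestamp(TimeUnit::Millisecond, None), false),
        Field::new("value", DataType::Float64, false),
    ]));
    let mut writer = ArrowWriter::try_new(File::create("bench.parquet")?, schema.clone(), None)?;

    // 86,400,000 rows = one day of millisecond timestamps, written in
    // 1M-row batches to keep memory bounded.
    const BATCH: i64 = 1_000_000;
    for start in (0..86_400_000).step_by(BATCH as usize) {
        let ts = TimestampMillisecondArray::from_iter_values(start..start + BATCH);
        let values = Float64Array::from_iter_values((start..start + BATCH).map(|v| v as f64));
        writer.write(&RecordBatch::try_new(
            schema.clone(),
            vec![Arc::new(ts), Arc::new(values)],
        )?)?;
    }
    writer.close()?;
    Ok(())
}
```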
## Summary

We can expect roughly a 5x performance improvement when scanning a small time range from a typically sized parquet file (865 MB).
The key to this improvement is that the row filter fetches only the timestamp column for evaluation; after evaluation, only the rows matching the time range are fetched and returned by the parquet reader.
By enabling the SIMD feature, the comparison between the desired time range and the row values is further accelerated: `FastTimestampRowFilter` takes only about 80% of `PlainTimestampRowFilter`'s time to produce the result. The sketch below contrasts the two variants.
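A hedged reconstruction of the difference (not the exact GreptimeDB code): the plain variant walks the timestamp values one by one, while the fast variant delegates to arrow's comparison kernels, which arrow's nightly-only `simd` cargo feature compiled to explicit SIMD in the arrow versions current at the time:

```rust
use arrow::array::{BooleanArray, TimestampMillisecondArray};
use arrow::compute;
use arrow::error::Result as ArrowResult;

// Plain variant: explicit per-element iteration over the timestamp values
// (null handling omitted for brevity).
fn plain_filter(ts: &TimestampMillisecondArray, lo: i64, hi: i64) -> BooleanArray {
    BooleanArray::from(
        ts.values()
            .iter()
            .map(|&v| lo <= v && v < hi)
            .collect::<Vec<bool>>(),
    )
}

// Fast variant: arrow's vectorized comparison kernels evaluate the same
// half-open range [lo, hi) over whole arrays at once.
fn fast_filter(ts: &TimestampMillisecondArray, lo: i64, hi: i64) -> ArrowResult<BooleanArray> {
    let left = compute::gt_eq_scalar(ts, lo)?;
    let right = compute::lt_scalar(ts, hi)?;
    compute::and(&left, &right)
}
```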
Please also note that the current query condition hits one or a few consecutive row groups, so row group pruning may also help a lot when a row filter is not present (see src/storage/src/sst/parquet.rs, lines 260 to 269 at 75b8afe); a pruning sketch follows.
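For illustration, a hedged sketch (not GreptimeDB's implementation) of row group pruning against the footer statistics, using parquet's `with_row_groups`; the file name, timestamp column index, and bounds are assumptions:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::file::statistics::Statistics;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let builder = ParquetRecordBatchReaderBuilder::try_new(File::open("bench.parquet")?)?;

    // Keep only row groups whose timestamp min/max overlaps [1000, 2000),
    // based on the footer statistics (the timestamp is assumed to be
    // column 0, stored as physical Int64).
    let keep: Vec<usize> = builder
        .metadata()
        .row_groups()
        .iter()
        .enumerate()
        .filter(|(_, rg)| match rg.column(0).statistics() {
            Some(Statistics::Int64(s)) if s.has_min_max_set() => {
                *s.min() < 2000 && *s.max() >= 1000
            }
            _ => true, // no usable stats: cannot prune, must read the group
        })
        .map(|(i, _)| i)
        .collect();

    let reader = builder.with_row_groups(keep).build()?;
    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}
```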