5.0.0 (2021-08-10)

Full Changelog

Breaking changes:

Box ScalarValue:Lists, reduce size by half #788 (alamb)
JOIN conditions are order dependent #778 (seddonm1)
Show the result of all optimizer passes in EXPLAIN VERBOSE #759 (alamb)
#723 Datafusion add option in ExecutionConfig to enable/disable parquet pruning #749 (lvheyang)
Update API for extension planning to include logical plan #643 (alamb)
Rename MergeExec to CoalescePartitionsExec #635 (andygrove)
fix 593, reduce cloning by taking ownership in logical planner's from fn #610 (Jimexist)
fix join column handling logic for On and Using constraints #605 (houqp)
Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2) #426 (alamb)
Support reading from NdJson formatted data sources #404 (heymind)
Add metrics to RepartitionExec #398 (andygrove)
Use 4.x arrow-rs from crates.io rather than git sha #395 (alamb)
Return Vec<bool> from PredicateBuilder rather than an Fn #370 (alamb)
Refactor: move RowGroupPredicateBuilder into its own module, rename to PruningPredicateBuilder #365 (alamb)
[Datafusion] NOW() function support #288 (msathis)
Implement select distinct #262 (Dandandan)
Refactor datafusion/src/physical_plan/common.rs build_file_list to take less param and reuse code #253 (Jimexist)
Support qualified columns in queries #55 (houqp)
Read CSV format text from stdin or memory #54 (heymind)
Use atomics for SQLMetric implementation, remove unused name field #25 (returnString)

Implemented enhancements:

Allow extension nodes to correctly plan physical expressions with relations #642
Filters aren't passed down to table scans in a union #557
Support pruning for boolean columns #490
Implement SQLMetrics for RepartitionExec #397
DataFusion benchmarks should show executed plan with metrics after query completes #396
Use published versions of arrow rather than github shas #393
Add Compare to GroupByScalar #364
Reusable "row group pruning" logic #363
Add an Order Preserving merge operator #362
Implement Postgres compatible now() function #251
COUNT DISTINCT does not support dictionary types #249
Use standard make_null_array for CASE #222
Implement date_trunc() function #203
COUNT DISTINCT does not support Float64 #199
Update SQLMetric to use atomics rather than a Mutex #30
Implement PartialOrd for ScalarValue #838 (viirya)
Support date datatypes in max/min #820 (viirya)
Implement vectorized hashing for DictionaryArray types #812 (alamb)
Convert unsupported conditions in left right join to filters #796 [sql] (Dandandan)
Implement streaming versions of Dataframe.collect methods #789 (andygrove)
impl from str for column and scalar #762 (Jimexist)
impl fmt::Display for PlanType #752 (Jimexist)
Remove unnecessary projection in logical plan optimization phase #747 (waynexia)
Support table columns alias #735 (Dandandan)
Derive PartialEq for datasource enums #734 (alamb)
Allow filetype to be lowercase, Implement FromStr for FileType #728 (Jimexist)
Update to use arrow 5.0 #721 (alamb)
#554: Lead/lag window function with offset and default value arguments #687 (jgoday)
dedup using join column in wildcard expansion #678 (houqp)
Implement metrics for HashJoinExec #664 (andygrove)
Show physical plan with metrics in benchmark #662 (andygrove)
Allow non-equijoin filters in join condition #660 (Dandandan)
Add End-to-end test for parquet pruning + metrics for ParquetExec #657 (alamb)
Add support for leading field in interval #647 (Dandandan)
Remove hard-coded PartitionMode from Ballista serde #637 (andygrove)
Ballista: Implement scalable distributed joins #634 (andygrove)
implement rank and dense_rank function and refactor built-in window function evaluation #631 (Jimexist)
Improve "field not found" error messages #625 (andygrove)
Support modulus op #577 (gangliao)
implement std::default::Default for execution config #570 (Jimexist)
to_timestamp_millis(), to_timestamp_micros(), to_timestamp_seconds() #567 (velvia)
Filter push down for Union #559 (Dandandan)
Implement window functions with partition_by clause #558 (Jimexist)
support table alias in join clause #547 (houqp)
Not equal predicate in physical_planning pruning #544 (jgoday)
add error handling and boundary checking for window frames #530 (Jimexist)
Implement window functions with order_by clause #520 (Jimexist)
support group by column positions #519 [sql] (jychen7)
Implement constant folding for CAST #513 (msathis)
Add window frame constructs - alternative #506 (Jimexist)
Add partition by constructs in window functions and modify logical planning #501 (Jimexist)
Add support for boolean columns in pruning logic #500 (alamb)
#215 resolve aliases for group by exprs #485 (jychen7)
Support anti join #482 (Dandandan)
Support semi join #470 (Dandandan)
add order by construct in window function and logical plans #463 (Jimexist)
Remove redundant filters (e.g. c> 5 AND c>5 --> c>5) #436 (jgoday)
fix: display the content of debug explain #434 (NGA-TRAN)
implement lead and lag built-in window function #429 (Jimexist)
add support for ndjson for datafusion-cli #427 (Jimexist)
add first_value, last_value, and nth_value built-in window functions #403 (Jimexist)
export both now and random functions #389 (Jimexist)
Function to create ArrayRef from an iterator of ScalarValues #381 (alamb)
Sort preserving merge (#362) #379 (tustvold)
Add support for multiple partitions with SortExec (#362) #378 (tustvold)
add window expression stream, delegated window aggregation to aggregate functions, and implement row_number #375 (Jimexist)
Add PartialOrd and Ord to GroupByScalar (#364) #368 (tustvold)
Implement readable explain plans for physical plans #337 (alamb)
Add window expression part 1 - logical and physical planning, structure, to/from proto, and explain, for empty over clause only #334 (Jimexist)
Use NullArray to Pass row count to ScalarFunctions that take 0 arguments #328 (Jimexist)
add --quiet/-q flag and allow timing info to be turned on/off #323 (Jimexist)
Implement hash partitioned aggregation #320 (Dandandan)
Support COUNT(DISTINCT timestamps) #319 (charlibot)
add random SQL function #303 (Jimexist)
allow datafusion cli to take -- comments #296 (Jimexist)
Add json print format mode to datafusion cli #295 (Jimexist)
Add print format param with support for tsv print format to datafusion cli #292 (Jimexist)
Add print format param and support for csv print format to datafusion cli #289 (Jimexist)
allow datafusion-cli to take a file param #285 (Jimexist)
add param validation for datafusion-cli #284 (Jimexist)
[breaking change] fix 265, log should be log10, and add ln #271 (Jimexist)
Implement count distinct for dictionary arrays #256 (alamb)
Count distinct floats #252 (pjmore)
Add rule to eliminate LIMIT 0 and replace it with an EmptyRelation #213 (Dandandan)
Allow table providers to indicate their type for catalog metadata #205 (returnString)
Use arrow eq kernels in CaseWhen expression evaluation #52 (Dandandan)
Re-export Arrow and Parquet crates from DataFusion #39 (returnString)
[DataFusion] Optimize hash join inner workings, null handling fix #24 (Dandandan)
[ARROW-12441] [DataFusion] Cross join implementation #11 (Dandandan)

Fixed bugs:

Projection pushdown removes unqualified column names even when they are used #617
Panic while running join datatypes/schema.rs:165:10 #601
Indentation is incorrect for joins in formatted physical plans #345
Error while running COUNT DISTINCT (timestamp): 'Unexpected DataType for list #314
When joining two tables, get Error: Plan("Schema contains duplicate unqualified field name 'xxx'") #311
Incorrect answers with SELECT DISTINCT queries #250
Intermittent failure in CI join_with_hash_collision #227
Concat from Dataframe API no longer accepts multiple expressions #226
Fix right, full join handling when having multiple non-matching rows at the left side #845 (Dandandan)
Qualified field resolution too strict #810 [sql] (seddonm1)
Better join order resolution logic #797 [sql] (seddonm1)
Produce correct answers for Group BY NULL (Option 1) #793 (alamb)
Use consistent version of string_to_timestamp_nanos in DataFusion #767 (alamb)
#723 limit pruning rule to simple expression #764 (lvheyang)
#699 fix return type conflict when calling builtin math functions #716 (lvheyang)
Fix Date32 and Date64 parquet row group pruning #690 (alamb)
Remove qualifiers on pushed down predicates / Fix parquet pruning #689 (alamb)
use Weak ptr to break catalog list <> info schema cyclic reference #681 (crepererum)
honor table name for csv/parquet scan in ballista plan serde #629 (houqp)
fix 621, where unnamed window functions shall be differentiated by partition and order by clause #622 (Jimexist)
RFC: Do not prune out unnecessary columns with unqualified references #619 (alamb)
[fix] select * on empty table #613 (rdettai)
fix 592, support alias in window functions #607 (Jimexist)
RepartitionExec should not error if output has hung up #576 (alamb)
Fix pruning on not equal predicate #561 (alamb)
hash float arrays using primitive unsigned integer type #556 (houqp)
Return errors properly from RepartitionExec #521 (alamb)
refactor sort exec stream and combine batches #515 (Jimexist)
Fix display of execution time in datafusion-cli #514 (Dandandan)
Wrong aggregation arguments error. #505 (jgoday)
fix window aggregation with alias and add integration test case #454 (Jimexist)
fix: don't duplicate existing filters #409 (e-dard)
Fixed incorrect logical type in GroupByScalar. #391 (jorgecarleitao)
Fix indented display for multi-child nodes #358 (alamb)
Fix SQL planner to support multibyte column names #357 (agatan)
Fix wrong projection 'optimization' #268 (Dandandan)
Fix Left join implementation is incorrect for 0 or multiple batches on the right side #238 (Dandandan)
Count distinct boolean #230 (pjmore)
Fix Filter / where clause without column names is removed in optimization pass #225 (Dandandan)

Documentation updates:

No way to get to the examples from docs.rs #186
Update docs to use vendored version of arrow #772 (alamb)
Fix typo in DEVELOPERS.md #692 (lvheyang)
update stale documentations related to window functions #598 (Jimexist)
update readme to reflect work on window functions #471 (Jimexist)
Add examples section to datafusion crate doc #457 (mluts)
add invariants spec #443 (houqp)
add output field name rfc #422 (houqp)
Update more docs and also the developer.md doc #414 (Jimexist)
use prettier to format md files #367 (Jimexist)
Add new logo svg with white background #313 (parthsarthy)
Add projects (Squirtle and Tensorbase) to list in readme #312 (parthsarthy)
docs - fix the ballista link #274 (haoxins)
misc(README): Replace Cube.js with Cube Store #248 (ovr)
Initial docs for SQL syntax #242 (Dandandan)
Deduplicate README.md #79 (msathis)

Performance improvements:

Speed up inlist for strings and primitives #813 (Dandandan)
perf: improve performance of SortPreservingMergeExec operator #722 (e-dard)
Optimize min/max queries with table statistics #719 (b41sh)
perf: Improve materialisation performance of SortPreservingMergeExec #691 (e-dard)
Optimize count(*) with table statistics #620 (Dandandan)
optimize window function's find_ranges_in_range #595 (Jimexist)
Collapse sort into window expr and do sort within logical phase #571 (Jimexist)
Use repartition in window functions to speed up #569 (Jimexist)
Constant fold / optimize to_timestamp function during planning #387 (msathis)
Speed up create_batch_from_map #339 (Dandandan)
Simplify math expression code (use unary kernel) #309 (Dandandan)

Closed issues:

Confirm git tagging strategy for releases #770
arrow::util::pretty::pretty_format_batches missing #769
move the assert_batches_eq! macros to a non part of datafusion #745
fix an issue where aliases are not respected in generating downstream schemas in window expr #592
make the planner to print more succinct and useful information in window function explain clause #526
move window frame module to be in logical_plan #517
use a more rust idiomatic way of handling nth_value #448
create a test with more than one partition for window functions #435
COUNT DISTINCT does not support Boolean #202
Read CSV format text from stdin or memory #198
Fix null handling hash join #195
Allow TableProviders to indicate their type for the information schema #191
Make DataFrame extensible #190
TPC-H Query 19 #170
TPC-H Query 7 #161
Upgrade hashbrown to 0.10 #151
Implement vectorized hashing for hash aggregate #149
More efficient LEFT join implementation #143
Implement vectorized hashing #142
RFC Roadmap for 2021 (DataFusion) #140
Implement hash partitioning #131
Grouping by column position #110
[Datafusion] GROUP BY with a high cardinality doesn't seem to finish #107
[Rust] Add support for JSON data sources #103
[Rust] Implement metrics framework #95
Publicly export Arrow crate from datafusion #36
Implement hash-partitioned hash aggregate #27
Consider using GitHub pages for DataFusion/Ballista documentation #18
Update "repository" in Cargo.toml #16

Merged pull requests:

Use RawTable API in hash join #827 (Dandandan)
Add test for window functions on dictionary #823 (alamb)
Update dependencies: prost to 0.8 and tonic to 0.5 #818 (alamb)
Move hash_array into hash_utils.rs #807 (alamb)
Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy #786 (alamb)
fix 226, make concat, concat_ws, and random work with Python crate #761 (Jimexist)
Test for parquet pruning disabling #754 (alamb)
Add explain verbose with limit push down #751 (Jimexist)
Move assert_batches_eq! macros to test_utils.rs #746 (alamb)
Show optimized physical and logical plans in EXPLAIN #744 (alamb)
update python crate to support latest pyo3 syntax and gil semantics #741 (Jimexist)
update python crate dependencies #740 (Jimexist)
provide more details on required .parquet file extension error message #729 (Jimexist)
split up windows functions into a dedicated module with separate files #724 (Jimexist)
Use pytest in integration test #715 (Jimexist)
replace once iter chain with array::IntoIter #704 (houqp)
avoid iterator materialization in column index lookup #703 (houqp)
Fix build with 1.52.1 #696 (alamb)
Fix test output due to logical merge conflict #694 (alamb)
add more integration tests #668 (Jimexist)
Bump arrow and parquet versions to 4.4 #654 (toddtreece)
Add query 15 to TPC-H queries #645 (Dandandan)
Improve error message and comments #641 (alamb)
add integration tests for rank, dense_rank, fix last_value evaluation with rank #638 (Jimexist)
round trip TPCH queries in tests #630 (houqp)
use Into<String> as argument type wherever applicable #615 (houqp)
reuse alias map in aggregate logical planning and refactor position resolution #606 (Jimexist)
fix clippy warnings #581 (Jimexist)
Add benchmarks to window function queries #564 (Jimexist)
reuse code for now function expr creation #548 (houqp)
turn on clippy rule for needless borrow #545 (Jimexist)
Refactor hash aggregates's planner building code #539 (Jimexist)
Cleanup Repartition Exec code #538 (alamb)
reuse datafusion physical planner in ballista building from protobuf #532 (Jimexist)
remove redundant into_iter() calls #527 (Jimexist)
Fix 517 - move window_frames module to logical_plan #518 (Jimexist)
Refactor window aggregation, simplify batch processing logic #516 (Jimexist)
Add datafusion::test_util, resolve test data paths without env vars #498 (mluts)
Avoid warnings in tests when compiling without default features #489 (alamb)
update cargo.toml in python crate and fix unit test due to hash joins #483 (Jimexist)
use prettier check in CI #453 (Jimexist)
Optimize nth_value, remove first_value, last_value structs and use idiomatic rust style #452 (Jimexist)
Fixed typo / logical merge conflict #433 (jorgecarleitao)
include test data and add aggregation tests in integration test #425 (Jimexist)
Add some padding around the logo #411 (parthsarthy)
Benchmark subcommand to distinguish between DataFusion and Ballista #402 (jgoday)
refactor datafusion/scalar_value to use more macro and avoid dup code #392 (Jimexist)
Update TPC-H benchmark to show physical plan when debug mode is enabled #386 (andygrove)
Update arrow dependencies again #341 (alamb)
Update arrow-rs deps #317 (alamb)
Update PR template by commenting out instructions #315 (alamb)
fix clippy warning #286 (Jimexist)
add integration test to compare datafusion-cli against psql #281 (Jimexist)
Update arrow deps #269 (alamb)
Use multi-stage build dockerfile in datafusion-cli and reduce image size from 2.16GB to 89.9MB #266 (Jimexist)
Enable redundant_field_names clippy lint #261 (Dandandan)
fix clippy lint #259 (alamb)
Move datafusion-cli to new crate #231 (Dandandan)
Make test join_with_hash_collision deterministic #229 (Dandandan)
Update arrow-rs deps (to fix build due to flatbuffers update) #224 (alamb)
Use standard make_null_array for CASE #223 (alamb)
update arrow-rs deps to latest master #216 (alamb)
MINOR: Remove empty rust dir #61 (andygrove)
