5.0.0 (2021-08-10)
Breaking changes:
- Box ScalarValue:Lists, reduce size by half size #788 (alamb)
- JOIN conditions are order dependent #778 (seddonm1)
- Show the result of all optimizer passes in EXPLAIN VERBOSE #759 (alamb)
- #723 Datafusion add option in ExecutionConfig to enable/disable parquet pruning #749 (lvheyang)
- Update API for extension planning to include logical plan #643 (alamb)
- Rename MergeExec to CoalescePartitionsExec #635 (andygrove)
- fix 593, reduce cloning by taking ownership in logical planner's from fn #610 (Jimexist)
- fix join column handling logic for On and Using constraints #605 (houqp)
- Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2) #426 (alamb)
- Support reading from NdJson formatted data sources #404 (heymind)
- Add metrics to RepartitionExec #398 (andygrove)
- Use 4.x arrow-rs from crates.io rather than git sha #395 (alamb)
- Return Vec<bool> from PredicateBuilder rather than an Fn #370 (alamb)
- Refactor: move RowGroupPredicateBuilder into its own module, rename to PruningPredicateBuilder #365 (alamb)
- [Datafusion] NOW() function support #288 (msathis)
- Implement select distinct #262 (Dandandan)
- Refactor datafusion/src/physical_plan/common.rs build_file_list to take less param and reuse code #253 (Jimexist)
- Support qualified columns in queries #55 (houqp)
- Read CSV format text from stdin or memory #54 (heymind)
- Use atomics for SQLMetric implementation, remove unused name field #25 (returnString)
Implemented enhancements:
- Allow extension nodes to correctly plan physical expressions with relations #642
- Filters aren't passed down to table scans in a union #557
- Support pruning for boolean columns #490
- Implement SQLMetrics for RepartitionExec #397
- DataFusion benchmarks should show executed plan with metrics after query completes #396
- Use published versions of arrow rather than github shas #393
- Add Compare to GroupByScalar #364
- Reusable "row group pruning" logic #363
- Add an Order Preserving merge operator #362
- Implement Postgres compatible now() function #251
- COUNT DISTINCT does not support dictionary types #249
- Use standard make_null_array for CASE #222
- Implement date_trunc() function #203
- COUNT DISTINCT does not support for Float64 #199
- Update SQLMetric to use atomics rather than a Mutex #30
- Implement PartialOrd for ScalarValue #838 (viirya)
- Support date datatypes in max/min #820 (viirya)
- Implement vectorized hashing for DictionaryArray types #812 (alamb)
- Convert unsupported conditions in left right join to filters #796 [sql] (Dandandan)
- Implement streaming versions of Dataframe.collect methods #789 (andygrove)
- impl from str for column and scalar #762 (Jimexist)
- impl fmt::Display for PlanType #752 (Jimexist)
- Remove unnecessary projection in logical plan optimization phase #747 (waynexia)
- Support table columns alias #735 (Dandandan)
- Derive PartialEq for datasource enums #734 (alamb)
- Allow filetype to be lowercase, Implement FromStr for FileType #728 (Jimexist)
- Update to use arrow 5.0 #721 (alamb)
- #554: Lead/lag window function with offset and default value arguments #687 (jgoday)
- dedup using join column in wildcard expansion #678 (houqp)
- Implement metrics for HashJoinExec #664 (andygrove)
- Show physical plan with metrics in benchmark #662 (andygrove)
- Allow non-equijoin filters in join condition #660 (Dandandan)
- Add End-to-end test for parquet pruning + metrics for ParquetExec #657 (alamb)
- Add support for leading field in interval #647 (Dandandan)
- Remove hard-coded PartitionMode from Ballista serde #637 (andygrove)
- Ballista: Implement scalable distributed joins #634 (andygrove)
- implement rank and dense_rank function and refactor built-in window function evaluation #631 (Jimexist)
- Improve "field not found" error messages #625 (andygrove)
- Support modulus op #577 (gangliao)
- implement std::default::Default for execution config #570 (Jimexist)
- to_timestamp_millis(), to_timestamp_micros(), to_timestamp_seconds() #567 (velvia)
- Filter push down for Union #559 (Dandandan)
- Implement window functions with partition_by clause #558 (Jimexist)
- support table alias in join clause #547 (houqp)
- Not equal predicate in physical_planning pruning #544 (jgoday)
- add error handling and boundary checking for window frames #530 (Jimexist)
- Implement window functions with order_by clause #520 (Jimexist)
- support group by column positions #519 [sql] (jychen7)
- Implement constant folding for CAST #513 (msathis)
- Add window frame constructs - alternative #506 (Jimexist)
- Add partition by constructs in window functions and modify logical planning #501 (Jimexist)
- Add support for boolean columns in pruning logic #500 (alamb)
- #215 resolve aliases for group by exprs #485 (jychen7)
- Support anti join #482 (Dandandan)
- Support semi join #470 (Dandandan)
- add order by construct in window function and logical plans #463 (Jimexist)
- Remove redundant filters (e.g. c> 5 AND c>5 --> c>5) #436 (jgoday)
- fix: display the content of debug explain #434 (NGA-TRAN)
- implement lead and lag built-in window function #429 (Jimexist)
- add support for ndjson for datafusion-cli #427 (Jimexist)
- add first_value, last_value, and nth_value built-in window functions #403 (Jimexist)
- export both now and random functions #389 (Jimexist)
- Function to create ArrayRef from an iterator of ScalarValues #381 (alamb)
- Sort preserving merge (#362) #379 (tustvold)
- Add support for multiple partitions with SortExec (#362) #378 (tustvold)
- add window expression stream, delegated window aggregation to aggregate functions, and implement row_number #375 (Jimexist)
- Add PartialOrd and Ord to GroupByScalar (#364) #368 (tustvold)
- Implement readable explain plans for physical plans #337 (alamb)
- Add window expression part 1 - logical and physical planning, structure, to/from proto, and explain, for empty over clause only #334 (Jimexist)
- Use NullArray to Pass row count to ScalarFunctions that take 0 arguments #328 (Jimexist)
- add --quiet/-q flag and allow timing info to be turned on/off #323 (Jimexist)
- Implement hash partitioned aggregation #320 (Dandandan)
- Support COUNT(DISTINCT timestamps) #319 (charlibot)
- add random SQL function #303 (Jimexist)
- allow datafusion cli to take -- comments #296 (Jimexist)
- Add json print format mode to datafusion cli #295 (Jimexist)
- Add print format param with support for tsv print format to datafusion cli #292 (Jimexist)
- Add print format param and support for csv print format to datafusion cli #289 (Jimexist)
- allow datafusion-cli to take a file param #285 (Jimexist)
- add param validation for datafusion-cli #284 (Jimexist)
- [breaking change] fix 265, log should be log10, and add ln #271 (Jimexist)
- Implement count distinct for dictionary arrays #256 (alamb)
- Count distinct floats #252 (pjmore)
- Add rule to eliminate LIMIT 0 and replace it with an EmptyRelation #213 (Dandandan)
- Allow table providers to indicate their type for catalog metadata #205 (returnString)
- Use arrow eq kernels in CaseWhen expression evaluation #52 (Dandandan)
- Re-export Arrow and Parquet crates from DataFusion #39 (returnString)
- [DataFusion] Optimize hash join inner workings, null handling fix #24 (Dandandan)
- [ARROW-12441] [DataFusion] Cross join implementation #11 (Dandandan)
Fixed bugs:
- Projection pushdown removes unqualified column names even when they are used #617
- Panic while running join datatypes/schema.rs:165:10 #601
- Indentation is incorrect for joins in formatted physical plans #345
- Error while running COUNT DISTINCT (timestamp): 'Unexpected DataType for list #314
- When joining two tables, get Error: Plan("Schema contains duplicate unqualified field name 'xxx'") #311
- Incorrect answers with SELECT DISTINCT queries #250
- Intermittent failure in CI join_with_hash_collision #227
- Concat from Dataframe API no longer accepts multiple expressions #226
- Fix right, full join handling when having multiple non-matching rows at the left side #845 (Dandandan)
- Qualified field resolution too strict #810 [sql] (seddonm1)
- Better join order resolution logic #797 [sql] (seddonm1)
- Produce correct answers for Group BY NULL (Option 1) #793 (alamb)
- Use consistent version of string_to_timestamp_nanos in DataFusion #767 (alamb)
- #723 limit pruning rule to simple expression #764 (lvheyang)
- #699 fix return type conflict when calling builtin math functions #716 (lvheyang)
- Fix Date32 and Date64 parquet row group pruning #690 (alamb)
- Remove qualifiers on pushed down predicates / Fix parquet pruning #689 (alamb)
- use Weak ptr to break catalog list <> info schema cyclic reference #681 (crepererum)
- honor table name for csv/parquet scan in ballista plan serde #629 (houqp)
- fix 621, where unnamed window functions shall be differentiated by partition and order by clause #622 (Jimexist)
- RFC: Do not prune out unnecessary columns with unqualified references #619 (alamb)
- [fix] select * on empty table #613 (rdettai)
- fix 592, support alias in window functions #607 (Jimexist)
- RepartitionExec should not error if output has hung up #576 (alamb)
- Fix pruning on not equal predicate #561 (alamb)
- hash float arrays using primitive unsigned integer type #556 (houqp)
- Return errors properly from RepartitionExec #521 (alamb)
- refactor sort exec stream and combine batches #515 (Jimexist)
- Fix display of execution time in datafusion-cli #514 (Dandandan)
- Wrong aggregation arguments error. #505 (jgoday)
- fix window aggregation with alias and add integration test case #454 (Jimexist)
- fix: don't duplicate existing filters #409 (e-dard)
- Fixed incorrect logical type in GroupByScalar. #391 (jorgecarleitao)
- Fix indented display for multi-child nodes #358 (alamb)
- Fix SQL planner to support multibyte column names #357 (agatan)
- Fix wrong projection 'optimization' #268 (Dandandan)
- Fix Left join implementation is incorrect for 0 or multiple batches on the right side #238 (Dandandan)
- Count distinct boolean #230 (pjmore)
- Fix Filter / where clause without column names is removed in optimization pass #225 (Dandandan)
Documentation updates:
- No way to get to the examples from docs.rs #186
- Update docs to use vendored version of arrow #772 (alamb)
- Fix typo in DEVELOPERS.md #692 (lvheyang)
- update stale documentations related to window functions #598 (Jimexist)
- update readme to reflect work on window functions #471 (Jimexist)
- Add examples section to datafusion crate doc #457 (mluts)
- add invariants spec #443 (houqp)
- add output field name rfc #422 (houqp)
- Update more docs and also the developer.md doc #414 (Jimexist)
- use prettier to format md files #367 (Jimexist)
- Add new logo svg with white background #313 (parthsarthy)
- Add projects (Squirtle and Tensorbase) to list in readme #312 (parthsarthy)
- docs - fix the ballista link #274 (haoxins)
- misc(README): Replace Cube.js with Cube Store #248 (ovr)
- Initial docs for SQL syntax #242 (Dandandan)
- Deduplicate README.md #79 (msathis)
Performance improvements:
- Speed up inlist for strings and primitives #813 (Dandandan)
- perf: improve performance of SortPreservingMergeExec operator #722 (e-dard)
- Optimize min/max queries with table statistics #719 (b41sh)
- perf: Improve materialisation performance of SortPreservingMergeExec #691 (e-dard)
- Optimize count(*) with table statistics #620 (Dandandan)
- optimize window function's find_ranges_in_range #595 (Jimexist)
- Collapse sort into window expr and do sort within logical phase #571 (Jimexist)
- Use repartition in window functions to speed up #569 (Jimexist)
- Constant fold / optimize to_timestamp function during planning #387 (msathis)
- Speed up create_batch_from_map #339 (Dandandan)
- Simplify math expression code (use unary kernel) #309 (Dandandan)
Closed issues:
- Confirm git tagging strategy for releases #770
- arrow::util::pretty::pretty_format_batches missing #769
- move the assert_batches_eq! macros to a non part of datafusion #745
- fix an issue where aliases are not respected in generating downstream schemas in window expr #592
- make the planner to print more succinct and useful information in window function explain clause #526
- move window frame module to be in logical_plan #517
- use a more rust idiomatic way of handling nth_value #448
- create a test with more than one partition for window functions #435
- COUNT DISTINCT does not support for Boolean #202
- Read CSV format text from stdin or memory #198
- Fix null handling hash join #195
- Allow TableProviders to indicate their type for the information schema #191
- Make DataFrame extensible #190
- TPC-H Query 19 #170
- TPC-H Query 7 #161
- Upgrade hashbrown to 0.10 #151
- Implement vectorized hashing for hash aggregate #149
- More efficient LEFT join implementation #143
- Implement vectorized hashing #142
- RFC Roadmap for 2021 (DataFusion) #140
- Implement hash partitioning #131
- Grouping by column position #110
- [Datafusion] GROUP BY with a high cardinality doesn't seem to finish #107
- [Rust] Add support for JSON data sources #103
- [Rust] Implement metrics framework #95
- Publicly export Arrow crate from datafusion #36
- Implement hash-partitioned hash aggregate #27
- Consider using GitHub pages for DataFusion/Ballista documentation #18
- Update "repository" in Cargo.toml #16
Merged pull requests:
- Use RawTable API in hash join #827 (Dandandan)
- Add test for window functions on dictionary #823 (alamb)
- Update dependencies: prost to 0.8 and tonic to 0.5 #818 (alamb)
- Move hash_array into hash_utils.rs #807 (alamb)
- Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy #786 (alamb)
- fix 226, make concat, concat_ws, and random work with Python crate #761 (Jimexist)
- Test for parquet pruning disabling #754 (alamb)
- Add explain verbose with limit push down #751 (Jimexist)
- Move assert_batches_eq! macros to test_utils.rs #746 (alamb)
- Show optimized physical and logical plans in EXPLAIN #744 (alamb)
- update python crate to support latest pyo3 syntax and gil semantics #741 (Jimexist)
- update python crate dependencies #740 (Jimexist)
- provide more details on required .parquet file extension error message #729 (Jimexist)
- split up windows functions into a dedicated module with separate files #724 (Jimexist)
- Use pytest in integration test #715 (Jimexist)
- replace once iter chain with array::IntoIter #704 (houqp)
- avoid iterator materialization in column index lookup #703 (houqp)
- Fix build with 1.52.1 #696 (alamb)
- Fix test output due to logical merge conflict #694 (alamb)
- add more integration tests #668 (Jimexist)
- Bump arrow and parquet versions to 4.4 #654 (toddtreece)
- Add query 15 to TPC-H queries #645 (Dandandan)
- Improve error message and comments #641 (alamb)
- add integration tests for rank, dense_rank, fix last_value evaluation with rank #638 (Jimexist)
- round trip TPCH queries in tests #630 (houqp)
- use Into<String> as argument type wherever applicable #615 (houqp)
- reuse alias map in aggregate logical planning and refactor position resolution #606 (Jimexist)
- fix clippy warnings #581 (Jimexist)
- Add benchmarks to window function queries #564 (Jimexist)
- reuse code for now function expr creation #548 (houqp)
- turn on clippy rule for needless borrow #545 (Jimexist)
- Refactor hash aggregates's planner building code #539 (Jimexist)
- Cleanup Repartition Exec code #538 (alamb)
- reuse datafusion physical planner in ballista building from protobuf #532 (Jimexist)
- remove redundant into_iter() calls #527 (Jimexist)
- Fix 517 - move window_frames module to logical_plan #518 (Jimexist)
- Refactor window aggregation, simplify batch processing logic #516 (Jimexist)
- Add datafusion::test_util, resolve test data paths without env vars #498 (mluts)
- Avoid warnings in tests when compiling without default features #489 (alamb)
- update cargo.toml in python crate and fix unit test due to hash joins #483 (Jimexist)
- use prettier check in CI #453 (Jimexist)
- Optimize nth_value, remove first_value, last_value structs and use idiomatic rust style #452 (Jimexist)
- Fixed typo / logical merge conflict #433 (jorgecarleitao)
- include test data and add aggregation tests in integration test #425 (Jimexist)
- Add some padding around the logo #411 (parthsarthy)
- Benchmark subcommand to distinguish between DataFusion and Ballista #402 (jgoday)
- refactor datafusion/scalar_value to use more macro and avoid dup code #392 (Jimexist)
- Update TPC-H benchmark to show physical plan when debug mode is enabled #386 (andygrove)
- Update arrow dependencies again #341 (alamb)
- Update arrow-rs deps #317 (alamb)
- Update PR template by commenting out instructions #315 (alamb)
- fix clippy warning #286 (Jimexist)
- add integration test to compare datafusion-cli against psql #281 (Jimexist)
- Update arrow deps #269 (alamb)
- Use multi-stage build dockerfile in datafusion-cli and reduce image size from 2.16GB to 89.9MB #266 (Jimexist)
- Enable redundant_field_names clippy lint #261 (Dandandan)
- fix clippy lint #259 (alamb)
- Move datafusion-cli to new crate #231 (Dandandan)
- Make test join_with_hash_collision deterministic #229 (Dandandan)
- Update arrow-rs deps (to fix build due to flatbuffers update) #224 (alamb)
- Use standard make_null_array for CASE #223 (alamb)
- update arrow-rs deps to latest master #216 (alamb)
- MINOR: Remove empty rust dir #61 (andygrove)