5.0.0 (2021-08-10)

Full Changelog

Breaking changes:

  • Box ScalarValue:Lists, reduce size by half #788 (alamb)
  • JOIN conditions are order dependent #778 (seddonm1)
  • Show the result of all optimizer passes in EXPLAIN VERBOSE #759 (alamb)
  • #723 Datafusion add option in ExecutionConfig to enable/disable parquet pruning #749 (lvheyang)
  • Update API for extension planning to include logical plan #643 (alamb)
  • Rename MergeExec to CoalescePartitionsExec #635 (andygrove)
  • fix 593, reduce cloning by taking ownership in logical planner's from fn #610 (Jimexist)
  • fix join column handling logic for On and Using constraints #605 (houqp)
  • Rewrite pruning logic in terms of PruningStatistics using Array trait (option 2) #426 (alamb)
  • Support reading from NdJson formatted data sources #404 (heymind)
  • Add metrics to RepartitionExec #398 (andygrove)
  • Use 4.x arrow-rs from crates.io rather than git sha #395 (alamb)
  • Return Vec<bool> from PredicateBuilder rather than an Fn #370 (alamb)
  • Refactor: move RowGroupPredicateBuilder into its own module, rename to PruningPredicateBuilder #365 (alamb)
  • [Datafusion] NOW() function support #288 (msathis)
  • Implement select distinct #262 (Dandandan)
  • Refactor datafusion/src/physical_plan/common.rs build_file_list to take less param and reuse code #253 (Jimexist)
  • Support qualified columns in queries #55 (houqp)
  • Read CSV format text from stdin or memory #54 (heymind)
  • Use atomics for SQLMetric implementation, remove unused name field #25 (returnString)

Implemented enhancements:

  • Allow extension nodes to correctly plan physical expressions with relations #642
  • Filters aren't passed down to table scans in a union #557
  • Support pruning for boolean columns #490
  • Implement SQLMetrics for RepartitionExec #397
  • DataFusion benchmarks should show executed plan with metrics after query completes #396
  • Use published versions of arrow rather than github shas #393
  • Add Compare to GroupByScalar #364
  • Reusable "row group pruning" logic #363
  • Add an Order Preserving merge operator #362
  • Implement Postgres compatible now() function #251
  • COUNT DISTINCT does not support dictionary types #249
  • Use standard make_null_array for CASE #222
  • Implement date_trunc() function #203
  • COUNT DISTINCT does not support Float64 #199
  • Update SQLMetric to use atomics rather than a Mutex #30
  • Implement PartialOrd for ScalarValue #838 (viirya)
  • Support date datatypes in max/min #820 (viirya)
  • Implement vectorized hashing for DictionaryArray types #812 (alamb)
  • Convert unsupported conditions in left right join to filters #796 [sql] (Dandandan)
  • Implement streaming versions of Dataframe.collect methods #789 (andygrove)
  • impl from str for column and scalar #762 (Jimexist)
  • impl fmt::Display for PlanType #752 (Jimexist)
  • Remove unnecessary projection in logical plan optimization phase #747 (waynexia)
  • Support table columns alias #735 (Dandandan)
  • Derive PartialEq for datasource enums #734 (alamb)
  • Allow filetype to be lowercase, Implement FromStr for FileType #728 (Jimexist)
  • Update to use arrow 5.0 #721 (alamb)
  • #554: Lead/lag window function with offset and default value arguments #687 (jgoday)
  • dedup using join column in wildcard expansion #678 (houqp)
  • Implement metrics for HashJoinExec #664 (andygrove)
  • Show physical plan with metrics in benchmark #662 (andygrove)
  • Allow non-equijoin filters in join condition #660 (Dandandan)
  • Add End-to-end test for parquet pruning + metrics for ParquetExec #657 (alamb)
  • Add support for leading field in interval #647 (Dandandan)
  • Remove hard-coded PartitionMode from Ballista serde #637 (andygrove)
  • Ballista: Implement scalable distributed joins #634 (andygrove)
  • implement rank and dense_rank function and refactor built-in window function evaluation #631 (Jimexist)
  • Improve "field not found" error messages #625 (andygrove)
  • Support modulus op #577 (gangliao)
  • implement std::default::Default for execution config #570 (Jimexist)
  • to_timestamp_millis(), to_timestamp_micros(), to_timestamp_seconds() #567 (velvia)
  • Filter push down for Union #559 (Dandandan)
  • Implement window functions with partition_by clause #558 (Jimexist)
  • support table alias in join clause #547 (houqp)
  • Not equal predicate in physical_planning pruning #544 (jgoday)
  • add error handling and boundary checking for window frames #530 (Jimexist)
  • Implement window functions with order_by clause #520 (Jimexist)
  • support group by column positions #519 [sql] (jychen7)
  • Implement constant folding for CAST #513 (msathis)
  • Add window frame constructs - alternative #506 (Jimexist)
  • Add partition by constructs in window functions and modify logical planning #501 (Jimexist)
  • Add support for boolean columns in pruning logic #500 (alamb)
  • #215 resolve aliases for group by exprs #485 (jychen7)
  • Support anti join #482 (Dandandan)
  • Support semi join #470 (Dandandan)
  • add order by construct in window function and logical plans #463 (Jimexist)
  • Remove redundant filters (e.g. c > 5 AND c > 5 --> c > 5) #436 (jgoday)
  • fix: display the content of debug explain #434 (NGA-TRAN)
  • implement lead and lag built-in window function #429 (Jimexist)
  • add support for ndjson for datafusion-cli #427 (Jimexist)
  • add first_value, last_value, and nth_value built-in window functions #403 (Jimexist)
  • export both now and random functions #389 (Jimexist)
  • Function to create ArrayRef from an iterator of ScalarValues #381 (alamb)
  • Sort preserving merge (#362) #379 (tustvold)
  • Add support for multiple partitions with SortExec (#362) #378 (tustvold)
  • add window expression stream, delegated window aggregation to aggregate functions, and implement row_number #375 (Jimexist)
  • Add PartialOrd and Ord to GroupByScalar (#364) #368 (tustvold)
  • Implement readable explain plans for physical plans #337 (alamb)
  • Add window expression part 1 - logical and physical planning, structure, to/from proto, and explain, for empty over clause only #334 (Jimexist)
  • Use NullArray to Pass row count to ScalarFunctions that take 0 arguments #328 (Jimexist)
  • add --quiet/-q flag and allow timing info to be turned on/off #323 (Jimexist)
  • Implement hash partitioned aggregation #320 (Dandandan)
  • Support COUNT(DISTINCT timestamps) #319 (charlibot)
  • add random SQL function #303 (Jimexist)
  • allow datafusion cli to take -- comments #296 (Jimexist)
  • Add json print format mode to datafusion cli #295 (Jimexist)
  • Add print format param with support for tsv print format to datafusion cli #292 (Jimexist)
  • Add print format param and support for csv print format to datafusion cli #289 (Jimexist)
  • allow datafusion-cli to take a file param #285 (Jimexist)
  • add param validation for datafusion-cli #284 (Jimexist)
  • [breaking change] fix 265, log should be log10, and add ln #271 (Jimexist)
  • Implement count distinct for dictionary arrays #256 (alamb)
  • Count distinct floats #252 (pjmore)
  • Add rule to eliminate LIMIT 0 and replace it with an EmptyRelation #213 (Dandandan)
  • Allow table providers to indicate their type for catalog metadata #205 (returnString)
  • Use arrow eq kernels in CaseWhen expression evaluation #52 (Dandandan)
  • Re-export Arrow and Parquet crates from DataFusion #39 (returnString)
  • [DataFusion] Optimize hash join inner workings, null handling fix #24 (Dandandan)
  • [ARROW-12441] [DataFusion] Cross join implementation #11 (Dandandan)

Fixed bugs:

  • Projection pushdown removes unqualified column names even when they are used #617
  • Panic while running join datatypes/schema.rs:165:10 #601
  • Indentation is incorrect for joins in formatted physical plans #345
  • Error while running COUNT DISTINCT (timestamp): 'Unexpected DataType for list #314
  • When joining two tables, get Error: Plan("Schema contains duplicate unqualified field name 'xxx'") #311
  • Incorrect answers with SELECT DISTINCT queries #250
  • Intermittent failure in CI join_with_hash_collision #227
  • Concat from Dataframe API no longer accepts multiple expressions #226
  • Fix right, full join handling when having multiple non-matching rows at the left side #845 (Dandandan)
  • Qualified field resolution too strict #810 [sql] (seddonm1)
  • Better join order resolution logic #797 [sql] (seddonm1)
  • Produce correct answers for Group BY NULL (Option 1) #793 (alamb)
  • Use consistent version of string_to_timestamp_nanos in DataFusion #767 (alamb)
  • #723 limit pruning rule to simple expression #764 (lvheyang)
  • #699 fix return type conflict when calling builtin math functions #716 (lvheyang)
  • Fix Date32 and Date64 parquet row group pruning #690 (alamb)
  • Remove qualifiers on pushed down predicates / Fix parquet pruning #689 (alamb)
  • use Weak ptr to break catalog list <> info schema cyclic reference #681 (crepererum)
  • honor table name for csv/parquet scan in ballista plan serde #629 (houqp)
  • fix 621, where unnamed window functions shall be differentiated by partition and order by clause #622 (Jimexist)
  • RFC: Do not prune out unnecessary columns with unqualified references #619 (alamb)
  • [fix] select * on empty table #613 (rdettai)
  • fix 592, support alias in window functions #607 (Jimexist)
  • RepartitionExec should not error if output has hung up #576 (alamb)
  • Fix pruning on not equal predicate #561 (alamb)
  • hash float arrays using primitive unsigned integer type #556 (houqp)
  • Return errors properly from RepartitionExec #521 (alamb)
  • refactor sort exec stream and combine batches #515 (Jimexist)
  • Fix display of execution time in datafusion-cli #514 (Dandandan)
  • Wrong aggregation arguments error. #505 (jgoday)
  • fix window aggregation with alias and add integration test case #454 (Jimexist)
  • fix: don't duplicate existing filters #409 (e-dard)
  • Fixed incorrect logical type in GroupByScalar. #391 (jorgecarleitao)
  • Fix indented display for multi-child nodes #358 (alamb)
  • Fix SQL planner to support multibyte column names #357 (agatan)
  • Fix wrong projection 'optimization' #268 (Dandandan)
  • Fix Left join implementation is incorrect for 0 or multiple batches on the right side #238 (Dandandan)
  • Count distinct boolean #230 (pjmore)
  • Fix Filter / where clause without column names is removed in optimization pass #225 (Dandandan)

Documentation updates:

Performance improvements:

  • Speed up inlist for strings and primitives #813 (Dandandan)
  • perf: improve performance of SortPreservingMergeExec operator #722 (e-dard)
  • Optimize min/max queries with table statistics #719 (b41sh)
  • perf: Improve materialisation performance of SortPreservingMergeExec #691 (e-dard)
  • Optimize count(*) with table statistics #620 (Dandandan)
  • optimize window function's find_ranges_in_range #595 (Jimexist)
  • Collapse sort into window expr and do sort within logical phase #571 (Jimexist)
  • Use repartition in window functions to speed up #569 (Jimexist)
  • Constant fold / optimize to_timestamp function during planning #387 (msathis)
  • Speed up create_batch_from_map #339 (Dandandan)
  • Simplify math expression code (use unary kernel) #309 (Dandandan)

Closed issues:

  • Confirm git tagging strategy for releases #770
  • arrow::util::pretty::pretty_format_batches missing #769
  • move the assert_batches_eq! macros to a non part of datafusion #745
  • fix an issue where aliases are not respected in generating downstream schemas in window expr #592
  • make the planner to print more succinct and useful information in window function explain clause #526
  • move window frame module to be in logical_plan #517
  • use a more rust idiomatic way of handling nth_value #448
  • create a test with more than one partition for window functions #435
  • COUNT DISTINCT does not support Boolean #202
  • Read CSV format text from stdin or memory #198
  • Fix null handling hash join #195
  • Allow TableProviders to indicate their type for the information schema #191
  • Make DataFrame extensible #190
  • TPC-H Query 19 #170
  • TPC-H Query 7 #161
  • Upgrade hashbrown to 0.10 #151
  • Implement vectorized hashing for hash aggregate #149
  • More efficient LEFT join implementation #143
  • Implement vectorized hashing #142
  • RFC Roadmap for 2021 (DataFusion) #140
  • Implement hash partitioning #131
  • Grouping by column position #110
  • [Datafusion] GROUP BY with a high cardinality doesn't seem to finish #107
  • [Rust] Add support for JSON data sources #103
  • [Rust] Implement metrics framework #95
  • Publicly export Arrow crate from datafusion #36
  • Implement hash-partitioned hash aggregate #27
  • Consider using GitHub pages for DataFusion/Ballista documentation #18
  • Update "repository" in Cargo.toml #16

Merged pull requests:

  • Use RawTable API in hash join #827 (Dandandan)
  • Add test for window functions on dictionary #823 (alamb)
  • Update dependencies: prost to 0.8 and tonic to 0.5 #818 (alamb)
  • Move hash_array into hash_utils.rs #807 (alamb)
  • Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy #786 (alamb)
  • fix 226, make concat, concat_ws, and random work with Python crate #761 (Jimexist)
  • Test for parquet pruning disabling #754 (alamb)
  • Add explain verbose with limit push down #751 (Jimexist)
  • Move assert_batches_eq! macros to test_utils.rs #746 (alamb)
  • Show optimized physical and logical plans in EXPLAIN #744 (alamb)
  • update python crate to support latest pyo3 syntax and gil semantics #741 (Jimexist)
  • update python crate dependencies #740 (Jimexist)
  • provide more details on required .parquet file extension error message #729 (Jimexist)
  • split up windows functions into a dedicated module with separate files #724 (Jimexist)
  • Use pytest in integration test #715 (Jimexist)
  • replace once iter chain with array::IntoIter #704 (houqp)
  • avoid iterator materialization in column index lookup #703 (houqp)
  • Fix build with 1.52.1 #696 (alamb)
  • Fix test output due to logical merge conflict #694 (alamb)
  • add more integration tests #668 (Jimexist)
  • Bump arrow and parquet versions to 4.4 #654 (toddtreece)
  • Add query 15 to TPC-H queries #645 (Dandandan)
  • Improve error message and comments #641 (alamb)
  • add integration tests for rank, dense_rank, fix last_value evaluation with rank #638 (Jimexist)
  • round trip TPCH queries in tests #630 (houqp)
  • use Into<String> as argument type wherever applicable #615 (houqp)
  • reuse alias map in aggregate logical planning and refactor position resolution #606 (Jimexist)
  • fix clippy warnings #581 (Jimexist)
  • Add benchmarks to window function queries #564 (Jimexist)
  • reuse code for now function expr creation #548 (houqp)
  • turn on clippy rule for needless borrow #545 (Jimexist)
  • Refactor hash aggregates's planner building code #539 (Jimexist)
  • Cleanup Repartition Exec code #538 (alamb)
  • reuse datafusion physical planner in ballista building from protobuf #532 (Jimexist)
  • remove redundant into_iter() calls #527 (Jimexist)
  • Fix 517 - move window_frames module to logical_plan #518 (Jimexist)
  • Refactor window aggregation, simplify batch processing logic #516 (Jimexist)
  • Add datafusion::test_util, resolve test data paths without env vars #498 (mluts)
  • Avoid warnings in tests when compiling without default features #489 (alamb)
  • update cargo.toml in python crate and fix unit test due to hash joins #483 (Jimexist)
  • use prettier check in CI #453 (Jimexist)
  • Optimize nth_value, remove first_value, last_value structs and use idiomatic rust style #452 (Jimexist)
  • Fixed typo / logical merge conflict #433 (jorgecarleitao)
  • include test data and add aggregation tests in integration test #425 (Jimexist)
  • Add some padding around the logo #411 (parthsarthy)
  • Benchmark subcommand to distinguish between DataFusion and Ballista #402 (jgoday)
  • refactor datafusion/scalar_value to use more macro and avoid dup code #392 (Jimexist)
  • Update TPC-H benchmark to show physical plan when debug mode is enabled #386 (andygrove)
  • Update arrow dependencies again #341 (alamb)
  • Update arrow-rs deps #317 (alamb)
  • Update PR template by commenting out instructions #315 (alamb)
  • fix clippy warning #286 (Jimexist)
  • add integration test to compare datafusion-cli against psql #281 (Jimexist)
  • Update arrow deps #269 (alamb)
  • Use multi-stage build dockerfile in datafusion-cli and reduce image size from 2.16GB to 89.9MB #266 (Jimexist)
  • Enable redundant_field_names clippy lint #261 (Dandandan)
  • fix clippy lint #259 (alamb)
  • Move datafusion-cli to new crate #231 (Dandandan)
  • Make test join_with_hash_collision deterministic #229 (Dandandan)
  • Update arrow-rs deps (to fix build due to flatbuffers update) #224 (alamb)
  • Use standard make_null_array for CASE #223 (alamb)
  • update arrow-rs deps to latest master #216 (alamb)
  • MINOR: Remove empty rust dir #61 (andygrove)