Presto Community Roadmap Discussion April 6, 2017
Attendance: Facebook, Teradata, Twitter, Jet, Walmart, Netflix, Bloomberg, Turbine/WB, Innominds & others (please add your company name here!)
*(all community contributions since the last roadmap meeting, over a year ago)
Data types: VARCHAR(n), DECIMAL, REAL, TINYINT, SMALLINT, INTEGER, CHAR(n)
Improved ROW type
Language (examples below):
- Support for correlated subqueries
- EXISTS
- INTERSECT/EXCEPT
- Quantified comparisons
- Selective aggregates
- Non-equality outer joins
- Lambda expressions
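To make a few of these concrete, here is a small sketch exercising EXISTS with a correlated subquery, a quantified comparison, and a lambda expression; the customers and orders tables and their columns are hypothetical:

```sql
-- EXISTS with a correlated subquery: customers with at least one order
SELECT c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.custkey = c.custkey);

-- Quantified comparison: orders larger than every order placed before 2017
SELECT orderkey
FROM orders
WHERE totalprice > ALL (
    SELECT totalprice FROM orders WHERE orderdate < DATE '2017-01-01');

-- Lambda expression: apply a markup to each element of an ARRAY column
SELECT transform(item_prices, p -> p * 1.08) AS prices_with_tax
FROM orders;
```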
CREATE TABLE IF NOT EXISTS
Prepared statements (example below)
EXPLAIN ANALYZE
GRANT/REVOKE
SHOW CREATE [TABLE | VIEW]
CREATE/DROP/RENAME SCHEMA
Implicit coercions for INSERT
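A quick illustration of the prepared statement syntax mentioned above (statement and table names hypothetical):

```sql
-- Prepare a parameterized query once...
PREPARE recent_orders FROM
SELECT orderkey, totalprice FROM orders WHERE orderdate > ?;

-- ...execute it with concrete parameter values...
EXECUTE recent_orders USING DATE '2017-01-01';

-- ...and drop it when done.
DEALLOCATE PREPARE recent_orders;
```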
Performance improvements:
- JOIN/GROUP BY
- window functions
- approx_percentile, map_agg, array_agg
- local parallelism
- communication between coordinator-worker
- adaptive concurrency
- intra-node parallelism for intermediate stages
Pluggable event listeners
Resource groups
Hive connector
- Kerberos support
- Optimized RCFile reader
- Transactional DELETE+INSERT
- Writing to bucketed tables
- Support for mismatched table/partition schema
Cassandra connector improvements
New connectors
- MongoDB
- Accumulo
Language (examples below):
- GROUPING
- Ordered aggregations
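A brief sketch of both (table and column names hypothetical): grouping() reports which columns are aggregated in each output row of a GROUPING SETS query, and ordered aggregations accept an ORDER BY inside the aggregate call:

```sql
-- grouping() returns a bitmask telling which grouping set produced a row
SELECT region, nation, sum(sales) AS total, grouping(region, nation) AS g
FROM orders
GROUP BY GROUPING SETS ((region, nation), (region), ());

-- Ordered aggregation: build the array in a well-defined order
SELECT custkey, array_agg(orderkey ORDER BY orderdate DESC) AS order_history
FROM orders
GROUP BY custkey;
```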
Optimizer
Spill to disk
ZSTD support for ORC reader/writer
Optimized RCFile writer
Optimized ORC writer
Bucket-by-bucket execution
Performance and scalability improvements
- Coordinator CPU utilization
- Structural types
- Intra-task scheduling/prioritization
- Memory-aware scheduling
- HTTP/2
Connectors
- TPC-DS
- Thrift
- SQL Server
SSL-based authentication
Also, Vaughn indicated that FB is working on concrete plans for increasing PR throughput.
*some may not yet be merged upstream, but they are already shipped in the Teradata distro (www.teradata.com/presto)
Connectors
- Cassandra (major improvements)
- MS SQL Server
- TPC-DS
- Memory
Security
- Kerberized Hadoop support
- LDAP Authentication
- GRANT / REVOKE / ROLES
Misc
- JDBC & ODBC by Simba (free to use)
- BI Tools support & certifications
SQL syntax
- DECIMAL, REAL, CHAR
- Subqueries
- EXISTS / EXCEPT / INTERSECT
- Non-equi joins
- grouping()
- Prepared statements
Performance
- Windowing functions
- Joins
Spill to disk (sketch below)
- Aggregation, Join, and other operators
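For context, in releases where spilling has landed it is enabled in server config and can be toggled per session; a minimal sketch, assuming the experimental property names used by later Presto releases (they may differ by version):

```sql
-- Server-side, in etc/config.properties (assumed names):
--   experimental.spill-enabled=true
--   experimental.spiller-spill-path=/tmp/presto-spill

-- Per-session toggle before a memory-heavy aggregation
SET SESSION spill_enabled = true;

SELECT custkey, count(*) AS num_orders
FROM orders
GROUP BY custkey;
```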
Statistics for Hive connector
- Collected in Hive Metastore
Execution Engine performance
- Distributed sorting
- Runtime filtering
Cost-Based Optimizer (example below)
- Selectivity / cost calculator
- Join ordering
- Join distribution type selection
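To illustrate what the optimizer would decide (table names hypothetical): given statistics, it can estimate that regions is tiny and sales is huge, put the small table on the build side, and broadcast it instead of repartitioning both inputs on the join key:

```sql
-- With row-count and selectivity estimates, a cost-based optimizer can
-- choose the join order and the distribution type (broadcast vs.
-- repartitioned) rather than relying on how the user wrote the query.
SELECT r.name, sum(s.amount) AS revenue
FROM sales s
JOIN regions r ON s.region_id = r.region_id
GROUP BY r.name;
```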
LZO Thrift (Persisted)
- The vast majority of data on Twitter’s internal HDFS is stored in LZO Thrift format.
- Working on adding support for it as a new input format in the presto-hive module.
Security - start the UI on a different port
- Presto currently uses a single port for all HTTP communications (e.g. UI, JMX, client API, coordination, query execution). It would be helpful to be able to configure the use of different ports to have more granular control of firewall settings.
- Proposal: launch a new module on a separate port and bind the UI and related instances from the main app into that module
- https://github.com/prestodb/presto/pull/7106
Nested Schema evolution/push down for Parquet
- For non-primitive fields, Presto currently compares the type in a particular partition to what's in the HMS and doesn't support evolution of those fields. If we could pass the table type down to the specific reader, giving the reader a chance to figure out the conversion from the partition schema to the table schema, it would greatly reduce the cost of maintaining the metadata (for example, data with an old schema in one table and data with a new schema in another) or of rewriting old data with a new schema.
- Depends on https://github.com/prestodb/presto/pull/4714
- Yaliang’s patch: https://github.com/prestodb/presto/pull/7305
- Pruning nested fields: https://github.com/prestodb/presto/pull/6675
Dain (FB) commented: "Nezih is working on getting everyone focused on the new optimized parquet reader to get rid of the old one and then getting all the patches together"
Tableau connection
- The issue with Tableau support is the secure connection. The tricky part is that Presto embeds the URL for fetching the next chunk of results in the response, and that URL uses plain HTTP, so it sneaks through the proxy.
- Connection via JDBC is closed source in Teradata
Improved resource pooling
- Scheduled Zeppelin queries are starting to hog the cluster after midnight. We're exploring how best to mitigate that effect and will be looking into Presto's resource groups functionality.
Interested in leveraging statistics for the Accumulo connector
- Using the Swift API connector to connect to cloud storage.
- Also using Teradata JDBC drivers
- Are there plans to implement WholeStageCodeGeneration?
TD has a prototype from a hackathon; we'll probably be working on that in the second half of the year
- Are we planning to work on supporting OR predicates in tuple domain in addition to AND?
There's an open PR; we may do it that way, or we may expose the whole expression tree to connectors
- Is concern about extending the SPI still holding back support for nested data types?
We may be able to do it without changing the SPI by creating virtual columns, and then from the API's perspective it's just a regular column
- Planning time increased since 0.160 for a table with 600 columns
Please try the latest version, because we had some regressions that have since been fixed; if it persists, file an issue on GitHub and we'll look at it.
- Can we have hooks to determine the number of splits based on the number of worker threads?
Splits are sized so that they're not too small (so you don't spend tons of CPU time setting up the computation rather than doing work) and not too large (because we don't want high latency for the query). For the Hive connector we have a heuristic of starting with small splits and then increasing the size, which helps short-running queries.
- Where's the Raptor documentation? (Lawrence from Jet)
We're rewriting it to be transactional, and the rewrite won't be backwards compatible. Once the new version is out, being used in production, and we have a story for upgrading, etc., we'll document it. Raptor can still be used now as a cache, because in that case you don't care that it's not backwards compatible.