Skip to content
This repository has been archived by the owner on Jun 29, 2021. It is now read-only.

Query Plan Cache (In progress)

sandeep akinapelli edited this page Nov 12, 2019 · 3 revisions

Note:This feature is in progress and not available in SNAP yet.

Plan cache Overview:

The goal of plan caching is to fasten up the execution of the same query in a subsequent execution by caching the execution plan and thus eliminating the compilation/optimization time. This is a shared cached, used by all the sessions.

Cache invalidation

This cache can be invalidated by running the command invalidate plan cache. This clears all the queries across all the tables. The clearing of queries belonging to specific tables is planned in future.

Identifying the Filters in the optimized plan:

Filters of the form <single column expression> <binary operator> <literal> are used as a candidates for (called cache predicates) generalized predicate replacement in future plans. Since spark doesnt support parameterized execution, the original query will be optimized, and the optimal plan along with the identified predicates are stored in the cache.

A normalized version of the query plan servers as the cache key, after all the identified predicates from the previous step, are replaced with default literals.

If a new query matches any of the existing the cached queries after normalized, it is considered as a cache hit. The predicates from the input query are used to replace the predicates in the stored cache plan.

Currently the filters in filter operator node and filterspec in snap index are checked to replace the cached predicates.

Partition plans:

The behvaior of the caching, in the presence of partitions, particulary cache predicate on partition columns need to be investigated.

Cache Configuration:

The cache will be LRU with weights. The weights of the entry is the optimization time.

Configuration parameters:

The parameter spark.sparklinedata.spmd.enable.plancache and spark.sparklinedata.spmd.enable.plancache.extravalidation controls the behavior of the cache if spark.sparklinedata.spmd.enable.plancache is set to false, no plans are cached and it defaults to the regular SNAP execution flow. If spark.sparklinedata.spmd.enable.plancache.extravalidation is set to true, Each time before a new plan is placed on the cache, The literals of the cache predicates are replaced with random values and resulting optimal plan is compared with the spark optimized plan for accuracy. This extra step has to endure the cost of optimizing additional query.

The parameter spark.sparklinedata.spmd.enable.plancache.threshold controls, which plans to cache. Only the plans whose resolution & optimization time exceeds this threshold will be stored in cache. The default value is 300 ms. In order to test for correctness,

Clone this wiki locally