
Data Movement


Overall Picture

  • Given a Spark Catalyst Plan, we will generate an equivalent Oracle query (assuming all data sources in the plan are accessible from Oracle).
  • In Spark this would appear as a scan on a data source v2 relation.
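To make the question concrete, the sketch below shows the shape of the problem (all names here, e.g. `OracleQuerySplit`, are hypothetical and not part of the project's code): each RDD/input partition of the data source v2 relation carries the pushed-down Oracle SQL plus an optional split predicate, and the strategies discussed below decide how many splits to create and which predicates to attach.

```scala
// Hypothetical sketch only: one input partition of the DSv2 relation
// corresponds to one query split sent to Oracle by one Spark task.
final case class OracleQuerySplit(pushedSql: String, splitPredicate: Option[String]) {
  // The SQL a single Spark task would actually send to Oracle.
  def taskSql: String =
    splitPredicate.fold(pushedSql)(p => s"SELECT * FROM ($pushedSql) WHERE $p")
}

object OracleQuerySplit {
  // No split predicates: exactly one RDD partition (one task).
  def single(pushedSql: String): Seq[OracleQuerySplit] =
    Seq(OracleQuerySplit(pushedSql, None))

  // One split, and hence one Spark task, per predicate.
  def fromPredicates(pushedSql: String, preds: Seq[String]): Seq[OracleQuerySplit] =
    preds.map(p => OracleQuerySplit(pushedSql, Some(p)))
}
```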

Question: How should we set up the RDD partitions of this relation?

  • Bad strategies are:
    • Set up one RDD partition: In some cases you end up streaming large amounts of data through one pipe.
    • Set up RDD partitions based on the blocks/partitions/num_rows of tables involved in the query.
      • This can go wrong in many ways. For example, if you have two tables, each with 100k blocks, and we split each table's scan into 1000 pieces, then covering the join requires every pair of pieces: a million tasks.
      • It doesn’t account for the operations applied on top of the base data: join selectivity, aggregation, etc.

Idea/Investigation:

  • Use Oracle's explain plan for the pushed-down query to decide. See reading and understanding plans.
  • If the plan tells us the output row count is small enough, we can use one task; in a large number of cases this will be true (an explain-plan sketch follows this list).
  • If the plan involves a single table, use that table's blocks/partitions/num_rows to generate multiple queries, each carrying a block/partition/row-range predicate.
  • If the plan involves a table that is much larger than the other tables, split on that dominant table in the same way, repeating the rest of the query unchanged in each split (a partition-based sketch follows this list).
  • If the plan output involves an input column, use that column's statistics to set up tasks by adding column predicates.
    • Partition columns are especially good candidates: a predicate on a partition column prunes each parallel task's scan to a disjoint set of partitions.
  • In some cases a result cache may be a good strategy: materialize the results once, then use rownum predicates to split the streaming back to Spark (a rownum sketch follows this list).
  • If none of the above is possible, descend the plan and apply these strategies to a sub-query block, choosing which block to split based on its cost estimate.
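Explain-plan sketch (referenced from the list above). The JDBC wiring and the helper name `estimatedOutputRows` are assumptions, not project code; the sketch only shows the idea of asking Oracle's optimizer for the expected output row count and using a single task when it is small.

```scala
import java.sql.{Connection, DriverManager}

// Hypothetical helper: ask Oracle's optimizer how many rows the pushed-down
// query is expected to return, using EXPLAIN PLAN and PLAN_TABLE.
def estimatedOutputRows(conn: Connection, pushedSql: String): Long = {
  val stmtId = "spark_split_probe" // assumed tag; any unique id works
  val stmt = conn.createStatement()
  try {
    stmt.execute(s"EXPLAIN PLAN SET STATEMENT_ID = '$stmtId' FOR $pushedSql")
    // Plan line id = 0 is the SELECT STATEMENT row; its CARDINALITY column
    // usually holds the optimizer's estimate of the query's output row count
    // (it can be null for some plans, in which case we assume a large result).
    val rs = stmt.executeQuery(
      s"SELECT cardinality FROM plan_table WHERE statement_id = '$stmtId' AND id = 0")
    if (rs.next()) {
      val card = rs.getLong(1)
      if (rs.wasNull()) Long.MaxValue else card
    } else Long.MaxValue
  } finally stmt.close()
}

// Usage sketch: one task when the estimate is small, otherwise split further.
// val conn = DriverManager.getConnection(url, user, password)
// val splits =
//   if (estimatedOutputRows(conn, pushedSql) < 1000000L) OracleQuerySplit.single(pushedSql)
//   else ... // fall through to the block/partition/rownum strategies
```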
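Partition-based split sketch for the single-table / dominant-table cases. This assumes the large table is partitioned and that the pushed-down SQL references it through a rewritable marker (`<BIG_TABLE>` below); the marker and the helper name are made up for illustration. Each generated query scans one partition of the big table, so tasks read disjoint slices while the rest of the pushed-down query is repeated unchanged.

```scala
import java.sql.Connection
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch: one split per partition of the dominant table.
// `owner`/`table` identify the big table, and the pushed SQL is assumed to
// reference it via the marker "<BIG_TABLE>", which we rewrite per split.
def partitionSplits(conn: Connection, pushedSql: String,
                    owner: String, table: String): Seq[String] = {
  val ps = conn.prepareStatement(
    """SELECT partition_name
      |  FROM all_tab_partitions
      | WHERE table_owner = ? AND table_name = ?
      | ORDER BY partition_position""".stripMargin)
  ps.setString(1, owner)
  ps.setString(2, table)
  val rs = ps.executeQuery()
  val parts = ArrayBuffer.empty[String]
  while (rs.next()) parts += rs.getString(1)
  ps.close()

  if (parts.isEmpty) Seq(pushedSql) // not partitioned: fall back to one task
  else parts.toSeq.map { p =>
    pushedSql.replace("<BIG_TABLE>", s"""$owner.$table PARTITION ("$p")""")
  }
}
```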
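Rownum split sketch for the result-cache idea. The scratch table name and the helper are assumptions; the point is to materialize the result once, stamping a stable row number, and then let each task stream a disjoint row range.

```scala
import java.sql.Connection

// Hypothetical sketch: materialize the pushed-down result once, then split the
// read into disjoint, covering row ranges. The staging table name is made up.
def rownumSplits(conn: Connection, pushedSql: String, numSplits: Int): Seq[String] = {
  val staging = "spark_push_result" // assumed scratch table
  val stmt = conn.createStatement()
  try {
    // Stamp each row with a stable row number at materialization time, so the
    // range predicates below are deterministic across the split queries.
    stmt.execute(
      s"CREATE TABLE $staging AS SELECT q.*, ROWNUM rn FROM ($pushedSql) q")
    val rs = stmt.executeQuery(s"SELECT COUNT(*) FROM $staging")
    rs.next()
    val total = rs.getLong(1)
    val per = math.max(1L, (total + numSplits - 1) / numSplits)
    (0L until total by per).map(lo =>
      s"SELECT * FROM $staging WHERE rn > $lo AND rn <= ${lo + per}")
  } finally stmt.close()
}
```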