Introduce whyNot API #449

sezruby · 2021-05-28T21:32:50Z

What is the context for this pull request?

Tracking Issue: [PROPOSAL] Debugging Index Usage via Why-Not #253
Parent Issue: [PROPOSAL]: HyperspaceOneRule #405
Dependencies:

What changes were proposed in this pull request?

Introduce a `whyNot` API that explains why each index is not applied to a specific sub-plan of the given DataFrame.

The following is an example definition of a disqualification reason.

  case class ColSchemaMismatch(sourceColumns: String, indexColumns: String) extends FilterReason {
    override final val codeStr: String = "COL_SCHEMA_MISMATCH"
    override val args = Seq("sourceColumns" -> sourceColumns, "indexColumns" -> indexColumns)
    override def verboseStr: String = {
      s"Column Schema does not match. Source data columns: [$sourceColumns], " +
        s"Index columns: [$indexColumns]"
    }
  }
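As a rough illustration of how `codeStr`, `args` and `verboseStr` could map onto the Reason, Message and VerboseMessage columns of the sample output in this description, here is a minimal sketch. The trait shape and the `argStr` helper are assumptions made for this example (the PR text does not show the base trait), and `MissingRequiredCol` is a hypothetical class named after the `MISSING_REQUIRED_COL` code.

```scala
// Hypothetical minimal trait; the actual FilterReason in this PR may differ.
trait FilterReason {
  def codeStr: String
  def args: Seq[(String, String)]
  def verboseStr: String

  // Assumed rendering, modeled on the Message column of the sample output:
  // comma-separated key=[value] pairs.
  def argStr: String = args.map { case (k, v) => s"$k=[$v]" }.mkString(", ")
}

// Hypothetical reason, named after the MISSING_REQUIRED_COL code in the sample output.
case class MissingRequiredCol(requiredCols: String, indexCols: String) extends FilterReason {
  override val codeStr: String = "MISSING_REQUIRED_COL"
  override val args: Seq[(String, String)] =
    Seq("requiredCols" -> requiredCols, "indexCols" -> indexCols)
  override def verboseStr: String =
    s"Index does not contain required columns. Required columns: [$requiredCols], " +
      s"Index columns: [$indexCols]"
}

object FilterReasonSketch {
  def main(cmdArgs: Array[String]): Unit = {
    val reason = MissingRequiredCol("c5,c3", "c3,c4")
    println(s"${reason.codeStr} | ${reason.argStr}")
    println(reason.verboseStr)
  }
}
```

Keeping the arguments as key/value pairs means the short message and the verbose message can be derived from the same data without parsing strings.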

These `FilterReason`s are collected for each (index entry, sub-plan) pair, and the result is printed as follows:

hs.whyNot(query) - collect reasons for all indexes

=============================================================
Plan with Hyperspace & Summary:
=============================================================
Join Inner, (c3# = c3#)
:- Project [c4#, c3#]
:  +- Filter ((isnotnull(c4#) && (c4# = 2)) && isnotnull(c3#))
:     +- Relation[c3#,c4#] Hyperspace(Type: CI, Name: leftDfFilterIndex, LogVersion: 1)
+- Project [c5#, c3#]
   +- Filter ((isnotnull(c5#) && (c5# = 3000)) && isnotnull(c3#))
      +- Relation[c3#,c5#] Hyperspace(Type: CI, Name: rightDfFilterIndex, LogVersion: 1)

Applied indexes:
- leftDfFilterIndex
- rightDfFilterIndex

Applicable indexes, but not applied due to priority:
- leftDfJoinIndex
- rightDfJoinIndex

Non-applicable indexes - index is outdated:
- No such index found.

Non-applicable indexes - no applicable query plan:
- No such index found.

For more information, please visit: https://microsoft.github.io/hyperspace/docs/why-not-result-analysis

=============================================================
Plan without Hyperspace & WhyNot reasons:
=============================================================
00 Join Inner, (c3# = c3#)
01 :- Project [c4#, c3#]
02 :  +- Filter ((isnotnull(c4#) && (c4# = 2)) && isnotnull(c3#))
03 :     +- Relation[c1#,c2#,c3#,c4#,c5#] parquet
04 +- Project [c5#, c3#]
05    +- Filter ((isnotnull(c5#) && (c5# = 3000)) && isnotnull(c3#))
06       +- Relation[c1#,c2#,c3#,c4#,c5#] parquet

+----------+------------------+---------+-------------------------+------------------------------------------------------------+
|SubPlan   |IndexName         |IndexType|Reason                   |Message                                                     |
+----------+------------------+---------+-------------------------+------------------------------------------------------------+
|Filter @2 |leftDfFilterIndex |CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c4,c3]            |
|Filter @2 |leftDfJoinIndex   |CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c4]            |
|Filter @2 |rightDfFilterIndex|CI       |NO_FIRST_INDEXED_COL_COND|firstIndexedCol=[c5], filterCols=[c4,c3]                    |
|Filter @2 |rightDfJoinIndex  |CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c5]            |
|Filter @5 |leftDfFilterIndex |CI       |NO_FIRST_INDEXED_COL_COND|firstIndexedCol=[c4], filterCols=[c5,c3]                    |
|Filter @5 |leftDfJoinIndex   |CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c4]            |
|Filter @5 |rightDfFilterIndex|CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c5,c3]            |
|Filter @5 |rightDfJoinIndex  |CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c5]            |
|Join @0   |leftDfFilterIndex |CI       |NOT_ALL_JOIN_COL_INDEXED |child=[left], joinCols=[c3], indexedCols=[c4]               |
|Join @0   |leftDfFilterIndex |CI       |NOT_ALL_JOIN_COL_INDEXED |child=[right], joinCols=[c3], indexedCols=[c4]              |
|Join @0   |leftDfJoinIndex   |CI       |MISSING_INDEXED_COL      |child=[right], requiredIndexedCols=[c5,c3], indexedCols=[c3]|
|Join @0   |rightDfFilterIndex|CI       |NOT_ALL_JOIN_COL_INDEXED |child=[left], joinCols=[c3], indexedCols=[c5]               |
|Join @0   |rightDfFilterIndex|CI       |NOT_ALL_JOIN_COL_INDEXED |child=[right], joinCols=[c3], indexedCols=[c5]              |
|Join @0   |rightDfJoinIndex  |CI       |MISSING_INDEXED_COL      |child=[left], requiredIndexedCols=[c4,c3], indexedCols=[c3] |
|Project @1|leftDfJoinIndex   |CI       |ANOTHER_INDEX_APPLIED    |appliedIndex=[leftDfFilterIndex]                            |
|Project @1|rightDfFilterIndex|CI       |NO_FIRST_INDEXED_COL_COND|firstIndexedCol=[c5], filterCols=[c4,c3]                    |
|Project @1|rightDfJoinIndex  |CI       |MISSING_REQUIRED_COL     |requiredCols=[c4,c3], indexCols=[c3,c5]                     |
|Project @4|leftDfFilterIndex |CI       |NO_FIRST_INDEXED_COL_COND|firstIndexedCol=[c4], filterCols=[c5,c3]                    |
|Project @4|leftDfJoinIndex   |CI       |MISSING_REQUIRED_COL     |requiredCols=[c5,c3], indexCols=[c3,c4]                     |
|Project @4|rightDfJoinIndex  |CI       |ANOTHER_INDEX_APPLIED    |appliedIndex=[rightDfFilterIndex]                           |
+----------+------------------+---------+-------------------------+------------------------------------------------------------+

hs.whyNot(query, indexName, extended = true) - collect reasons for the given index

=============================================================
Plan with Hyperspace & Summary:
=============================================================
Join Inner, (c3# = c3#)
:- Project [c4#, c3#]
:  +- Filter ((isnotnull(c4#) && (c4# = 2)) && isnotnull(c3#))
:     +- Relation[c3#,c4#] Hyperspace(Type: CI, Name: leftDfFilterIndex, LogVersion: 1)
+- Project [c5#, c3#]
   +- Filter ((isnotnull(c5#) && (c5# = 3000)) && isnotnull(c3#))
      +- Relation[c3#,c5#] Hyperspace(Type: CI, Name: rightDfFilterIndex, LogVersion: 1)

Applied indexes:
- leftDfFilterIndex
- rightDfFilterIndex

Applicable indexes, but not applied due to priority:
- leftDfJoinIndex
- rightDfJoinIndex

Non-applicable indexes - index is outdated:
- No such index found.

Non-applicable indexes - no applicable query plan:
- No such index found.

For more information, please visit: https://microsoft.github.io/hyperspace/docs/why-not-result-analysis

=============================================================
Plan without Hyperspace & WhyNot reasons:
=============================================================
00 Join Inner, (c3# = c3#)
01 :- Project [c4#, c3#]
02 :  +- Filter ((isnotnull(c4#) && (c4# = 2)) && isnotnull(c3#))
03 :     +- Relation[c1#,c2#,c3#,c4#,c5#] parquet
04 +- Project [c5#, c3#]
05    +- Filter ((isnotnull(c5#) && (c5# = 3000)) && isnotnull(c3#))
06       +- Relation[c1#,c2#,c3#,c4#,c5#] parquet

+----------+---------------+---------+---------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|SubPlan   |IndexName      |IndexType|Reason               |Message                                                     |VerboseMessage                                                                                                     |
+----------+---------------+---------+---------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|Filter @2 |leftDfJoinIndex|CI       |MISSING_REQUIRED_COL |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c4]            |Index does not contain required columns. Required columns: [c3,c4,c5,c2,c1], Index columns: [c3,c4]                |
|Filter @5 |leftDfJoinIndex|CI       |MISSING_REQUIRED_COL |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c4]            |Index does not contain required columns. Required columns: [c3,c4,c5,c2,c1], Index columns: [c3,c4]                |
|Join @0   |leftDfJoinIndex|CI       |MISSING_INDEXED_COL  |child=[right], requiredIndexedCols=[c5,c3], indexedCols=[c3]|Index does not contain required columns for right subplan. Required indexed columns: [c5,c3], Indexed columns: [c3]|
|Project @1|leftDfJoinIndex|CI       |ANOTHER_INDEX_APPLIED|appliedIndex=[leftDfFilterIndex]                            |Another candidate index is applied: leftDfFilterIndex                                                              |
|Project @4|leftDfJoinIndex|CI       |MISSING_REQUIRED_COL |requiredCols=[c5,c3], indexCols=[c3,c4]                     |Index does not contain required columns. Required columns: [c5,c3], Index columns: [c3,c4]                         |
+----------+---------------+---------+---------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
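Comparing the two tables, the per-index output appears to contain exactly the rows of the full output whose IndexName matches the requested index. The sketch below models that relationship in plain Scala. `WhyNotRow` and `forIndex` are hypothetical names for this illustration and are not part of the PR; treating an empty name as "all indexes" is an assumption based on the `indexName: String = ""` default in the reviewed signature.

```scala
// Hypothetical row type; the Message columns are omitted for brevity.
final case class WhyNotRow(subPlan: String, indexName: String, reason: String)

object WhyNotFilterSketch {
  // A few rows copied from the first table in this description.
  val allRows: Seq[WhyNotRow] = Seq(
    WhyNotRow("Filter @2", "leftDfFilterIndex", "MISSING_REQUIRED_COL"),
    WhyNotRow("Filter @2", "leftDfJoinIndex", "MISSING_REQUIRED_COL"),
    WhyNotRow("Join @0", "leftDfJoinIndex", "MISSING_INDEXED_COL"),
    WhyNotRow("Project @1", "leftDfJoinIndex", "ANOTHER_INDEX_APPLIED"),
    WhyNotRow("Project @4", "rightDfJoinIndex", "ANOTHER_INDEX_APPLIED")
  )

  // Assumption: an empty index name selects every index.
  def forIndex(rows: Seq[WhyNotRow], indexName: String = ""): Seq[WhyNotRow] =
    if (indexName.isEmpty) rows else rows.filter(_.indexName == indexName)

  def main(args: Array[String]): Unit =
    forIndex(allRows, "leftDfJoinIndex").foreach(println)
}
```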

Additional results for `hs.explain(query, verbose = true)`:

=============================================================
Applicable indexes:
=============================================================
Plan without Hyperspace:

00 Join Inner, (Col1# = Col1#)
01 :- Filter isnotnull(Col1#)
02 :  +- Relation[Col1#,Col2#] parquet
03 +- Filter isnotnull(Col1#)
04    +- Relation[Col1#,Col2#] parquet

+---------+---------+---------+---------------+
|SubPlan  |IndexName|IndexType|RuleName       |
+---------+---------+---------+---------------+
|Filter @1|joinIndex|CI       |FilterIndexRule|
|Filter @3|joinIndex|CI       |FilterIndexRule|
|Join @0  |joinIndex|CI       |JoinIndexRule  |
+---------+---------+---------+---------------+

Follow-up PRs

Documentation: will add exhaustive documentation for the API
Python binding

Does this PR introduce any user-facing change?

Yes, a new API is introduced.

How was this patch tested?

sezruby · 2021-07-13T22:32:35Z

@clee704 Could you review the PR?
I'd like to fix test failures - JoinIndexRuleTest/FilterIndexRuleTest/ExplainTest after confirming the output format / message.

src/main/scala/com/microsoft/hyperspace/Hyperspace.scala

clee704 · 2021-07-19T16:07:20Z

@@ -171,6 +171,17 @@ class Hyperspace(spark: SparkSession) {
    indexManager.index(indexName)
  }

+  def whyNot(df: DataFrame, indexName: String = "", extended: Boolean = false)(


nit: `indexName: Option[String]` would make a better interface.

sezruby · 2021-07-19

But this way, users need to wrap the name in `Some(...)`, which does not look intuitive:

hyperspace.whyNot(query(leftDf, rightDf)(), Some("leftDfJoinIndex"), extended = true)

clee704 · 2021-07-21

Agreed. By the way, how about supporting multiple index names? `indexNames: Seq[String] = Nil`

sezruby · 2021-07-21

We could add additional APIs later on demand, e.g. returning a DataFrame instead of printing.
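To make the trade-off in this thread concrete, the following sketch puts the three parameter styles side by side. All names are hypothetical stand-ins: `Query` replaces the DataFrame, the functions return a description string instead of printing a report, and the `extended` flag is left out. None of this is the merged API.

```scala
object WhyNotSignatureSketch {
  // Stand-in for a DataFrame, so the example has no Spark dependency.
  final case class Query(name: String)

  // (a) Empty-string default, as in the reviewed diff.
  def withDefault(q: Query, indexName: String = ""): String =
    if (indexName.isEmpty) s"all indexes on ${q.name}" else s"$indexName on ${q.name}"

  // (b) Option[String], as first suggested in review; callers must wrap in Some(...).
  def withOption(q: Query, indexName: Option[String] = None): String =
    indexName.fold(s"all indexes on ${q.name}")(n => s"$n on ${q.name}")

  // (c) Seq[String], the later suggestion; an empty list selects all indexes.
  def withSeq(q: Query, indexNames: Seq[String] = Nil): String =
    if (indexNames.isEmpty) s"all indexes on ${q.name}"
    else s"${indexNames.mkString(",")} on ${q.name}"

  def main(args: Array[String]): Unit = {
    val q = Query("joinQuery")
    println(withDefault(q, "leftDfJoinIndex"))
    println(withOption(q, Some("leftDfJoinIndex")))
    println(withSeq(q, Seq("leftDfJoinIndex", "rightDfJoinIndex")))
  }
}
```

Style (a) keeps the call site shortest, (b) makes "no index given" explicit in the type, and (c) covers the multi-index case at the cost of wrapping a single name in `Seq(...)`.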

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

src/main/scala/com/microsoft/hyperspace/index/covering/JoinIndexRule.scala

src/test/scala/com/microsoft/hyperspace/index/rules/ScoreBasedIndexPlanOptimizerTest.scala

src/main/scala/com/microsoft/hyperspace/index/plananalysis/FilterReason.scala

clee704

thanks!

sezruby mentioned this pull request May 28, 2021

[WIP] Introduce whyNot API #289

Closed

sezruby changed the title ~~Introduce whyNot API~~ [WIP] Introduce whyNot API May 28, 2021

sezruby marked this pull request as draft May 28, 2021 21:34

sezruby self-assigned this May 28, 2021

sezruby force-pushed the whynotapi branch from 77a3eb7 to c3fd62f Compare June 1, 2021 22:46

sezruby mentioned this pull request Jun 8, 2021

[PROPOSAL]: HyperspaceOneRule #405

Closed

7 tasks

sezruby force-pushed the whynotapi branch 2 times, most recently from 3671dd5 to 1bda13c Compare June 14, 2021 23:51

clee704 added the enhancement label Jun 15, 2021

sezruby requested a review from clee704 June 22, 2021 00:39

sezruby marked this pull request as ready for review June 22, 2021 00:40

sezruby changed the title ~~[WIP] Introduce whyNot API~~ Introduce whyNot API Jun 22, 2021

sezruby marked this pull request as draft June 28, 2021 19:34

sezruby force-pushed the whynotapi branch from ec2c40a to 63bc1c7 Compare July 12, 2021 17:00

Introduce whyNot API

9ec1ca2

sezruby force-pushed the whynotapi branch from 63bc1c7 to 9ec1ca2 Compare July 12, 2021 17:05

sezruby marked this pull request as ready for review July 13, 2021 22:32

Fix FilterReason class

28cbb3d

paryoja reviewed Jul 15, 2021

src/main/scala/com/microsoft/hyperspace/Hyperspace.scala

sezruby requested a review from imback82 July 16, 2021 03:40

clee704 reviewed Jul 19, 2021

sezruby force-pushed the whynotapi branch from e3532a2 to d2e8a71 Compare July 20, 2021 05:17

review commit

421cbc0

sezruby force-pushed the whynotapi branch 6 times, most recently from deed490 to 37c0bee Compare July 21, 2021 03:57

sezruby force-pushed the whynotapi branch from 37c0bee to dcab3de Compare July 21, 2021 03:58

Add assert

ac83a25

sezruby force-pushed the whynotapi branch from dcab3de to ac83a25 Compare July 21, 2021 04:40

remove space

4155cad

clee704 previously approved these changes Aug 2, 2021

Merge remote-tracking branch 'upstream/master' into whynot4

1257950

sezruby dismissed clee704’s stale review via 1257950 August 2, 2021 23:25

sezruby requested a review from clee704 August 3, 2021 00:05

clee704 approved these changes Aug 3, 2021

sezruby merged commit b60393a into microsoft:master Aug 3, 2021
