This repository has been archived by the owner on Jun 14, 2024. It is now read-only.

Introduce whyNot API #449

Merged
merged 6 commits into from
Aug 3, 2021

Conversation

Collaborator

@sezruby sezruby commented May 28, 2021

What is the context for this pull request?

What changes were proposed in this pull request?

Introduce a whyNot API that explains why each index was not applied to a specific sub-plan of the given DataFrame.

The following is an example definition of a disqualification reason.

  case class ColSchemaMismatch(sourceColumns: String, indexColumns: String) extends FilterReason {
    override final val codeStr: String = "COL_SCHEMA_MISMATCH"
    override val args = Seq("sourceColumns" -> sourceColumns, "indexColumns" -> indexColumns)
    override def verboseStr: String = {
      s"Column Schema does not match. Source data columns: [$sourceColumns], " +
        s"Index columns: [$indexColumns]"
    }
  }
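
To illustrate how a reason like this could feed the Reason, Message and VerboseMessage columns, here is a minimal, self-contained sketch. The `FilterReason` trait below is a hypothetical stand-in, not the definition from this PR; in particular, the `argStr` helper and its `key=[value]` format are assumptions modeled on the Message column in the sample output.

```scala
// Hypothetical stand-in for the FilterReason base type; the real trait in
// this PR may differ. Only codeStr, args and verboseStr come from the example.
trait FilterReason {
  def codeStr: String
  def args: Seq[(String, String)]
  def verboseStr: String

  // Assumed helper: renders args as "key=[value], key=[value]".
  final def argStr: String =
    args.map { case (k, v) => s"$k=[$v]" }.mkString(", ")
}

case class ColSchemaMismatch(sourceColumns: String, indexColumns: String)
    extends FilterReason {
  override final val codeStr: String = "COL_SCHEMA_MISMATCH"
  override val args: Seq[(String, String)] =
    Seq("sourceColumns" -> sourceColumns, "indexColumns" -> indexColumns)
  override def verboseStr: String =
    s"Column Schema does not match. Source data columns: [$sourceColumns], " +
      s"Index columns: [$indexColumns]"
}

object FilterReasonDemo {
  def main(argv: Array[String]): Unit = {
    val reason = ColSchemaMismatch("c1,c2", "c1,c3")
    println(reason.codeStr)    // short code, as in the Reason column
    println(reason.argStr)     // compact form, as in the Message column
    println(reason.verboseStr) // long form, as in the VerboseMessage column
  }
}
```

Keeping the code, the structured arguments and the verbose text on one type would let the same reason be rendered in either the compact or the extended table.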

A FilterReason is collected for each (index entry, sub-plan) pair, and the result is printed as follows:

  • hs.whyNot(query) - collect reasons for all indexes
=============================================================
Plan with Hyperspace & Summary:
=============================================================
Join Inner, (c3# = c3#)
:- Project [c4#, c3#]
:  +- Filter ((isnotnull(c4#) && (c4# = 2)) && isnotnull(c3#))
:     +- Relation[c3#,c4#] Hyperspace(Type: CI, Name: leftDfFilterIndex, LogVersion: 1)
+- Project [c5#, c3#]
   +- Filter ((isnotnull(c5#) && (c5# = 3000)) && isnotnull(c3#))
      +- Relation[c3#,c5#] Hyperspace(Type: CI, Name: rightDfFilterIndex, LogVersion: 1)

Applied indexes:
- leftDfFilterIndex
- rightDfFilterIndex

Applicable indexes, but not applied due to priority:
- leftDfJoinIndex
- rightDfJoinIndex

Non-applicable indexes - index is outdated:
- No such index found.

Non-applicable indexes - no applicable query plan:
- No such index found.

For more information, please visit: https://microsoft.github.io/hyperspace/docs/why-not-result-analysis

=============================================================
Plan without Hyperspace & WhyNot reasons:
=============================================================
00 Join Inner, (c3# = c3#)
01 :- Project [c4#, c3#]
02 :  +- Filter ((isnotnull(c4#) && (c4# = 2)) && isnotnull(c3#))
03 :     +- Relation[c1#,c2#,c3#,c4#,c5#] parquet
04 +- Project [c5#, c3#]
05    +- Filter ((isnotnull(c5#) && (c5# = 3000)) && isnotnull(c3#))
06       +- Relation[c1#,c2#,c3#,c4#,c5#] parquet

+----------+------------------+---------+-------------------------+------------------------------------------------------------+
|SubPlan   |IndexName         |IndexType|Reason                   |Message                                                     |
+----------+------------------+---------+-------------------------+------------------------------------------------------------+
|Filter @2 |leftDfFilterIndex |CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c4,c3]            |
|Filter @2 |leftDfJoinIndex   |CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c4]            |
|Filter @2 |rightDfFilterIndex|CI       |NO_FIRST_INDEXED_COL_COND|firstIndexedCol=[c5], filterCols=[c4,c3]                    |
|Filter @2 |rightDfJoinIndex  |CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c5]            |
|Filter @5 |leftDfFilterIndex |CI       |NO_FIRST_INDEXED_COL_COND|firstIndexedCol=[c4], filterCols=[c5,c3]                    |
|Filter @5 |leftDfJoinIndex   |CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c4]            |
|Filter @5 |rightDfFilterIndex|CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c5,c3]            |
|Filter @5 |rightDfJoinIndex  |CI       |MISSING_REQUIRED_COL     |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c5]            |
|Join @0   |leftDfFilterIndex |CI       |NOT_ALL_JOIN_COL_INDEXED |child=[left], joinCols=[c3], indexedCols=[c4]               |
|Join @0   |leftDfFilterIndex |CI       |NOT_ALL_JOIN_COL_INDEXED |child=[right], joinCols=[c3], indexedCols=[c4]              |
|Join @0   |leftDfJoinIndex   |CI       |MISSING_INDEXED_COL      |child=[right], requiredIndexedCols=[c5,c3], indexedCols=[c3]|
|Join @0   |rightDfFilterIndex|CI       |NOT_ALL_JOIN_COL_INDEXED |child=[left], joinCols=[c3], indexedCols=[c5]               |
|Join @0   |rightDfFilterIndex|CI       |NOT_ALL_JOIN_COL_INDEXED |child=[right], joinCols=[c3], indexedCols=[c5]              |
|Join @0   |rightDfJoinIndex  |CI       |MISSING_INDEXED_COL      |child=[left], requiredIndexedCols=[c4,c3], indexedCols=[c3] |
|Project @1|leftDfJoinIndex   |CI       |ANOTHER_INDEX_APPLIED    |appliedIndex=[leftDfFilterIndex]                            |
|Project @1|rightDfFilterIndex|CI       |NO_FIRST_INDEXED_COL_COND|firstIndexedCol=[c5], filterCols=[c4,c3]                    |
|Project @1|rightDfJoinIndex  |CI       |MISSING_REQUIRED_COL     |requiredCols=[c4,c3], indexCols=[c3,c5]                     |
|Project @4|leftDfFilterIndex |CI       |NO_FIRST_INDEXED_COL_COND|firstIndexedCol=[c4], filterCols=[c5,c3]                    |
|Project @4|leftDfJoinIndex   |CI       |MISSING_REQUIRED_COL     |requiredCols=[c5,c3], indexCols=[c3,c4]                     |
|Project @4|rightDfJoinIndex  |CI       |ANOTHER_INDEX_APPLIED    |appliedIndex=[rightDfFilterIndex]                           |
+----------+------------------+---------+-------------------------+------------------------------------------------------------+
  • hs.whyNot(query, indexName, extended = true) - collect reasons for the given index
=============================================================
Plan with Hyperspace & Summary:
=============================================================
Join Inner, (c3# = c3#)
:- Project [c4#, c3#]
:  +- Filter ((isnotnull(c4#) && (c4# = 2)) && isnotnull(c3#))
:     +- Relation[c3#,c4#] Hyperspace(Type: CI, Name: leftDfFilterIndex, LogVersion: 1)
+- Project [c5#, c3#]
   +- Filter ((isnotnull(c5#) && (c5# = 3000)) && isnotnull(c3#))
      +- Relation[c3#,c5#] Hyperspace(Type: CI, Name: rightDfFilterIndex, LogVersion: 1)

Applied indexes:
- leftDfFilterIndex
- rightDfFilterIndex

Applicable indexes, but not applied due to priority:
- leftDfJoinIndex
- rightDfJoinIndex

Non-applicable indexes - index is outdated:
- No such index found.

Non-applicable indexes - no applicable query plan:
- No such index found.

For more information, please visit: https://microsoft.github.io/hyperspace/docs/why-not-result-analysis

=============================================================
Plan without Hyperspace & WhyNot reasons:
=============================================================
00 Join Inner, (c3# = c3#)
01 :- Project [c4#, c3#]
02 :  +- Filter ((isnotnull(c4#) && (c4# = 2)) && isnotnull(c3#))
03 :     +- Relation[c1#,c2#,c3#,c4#,c5#] parquet
04 +- Project [c5#, c3#]
05    +- Filter ((isnotnull(c5#) && (c5# = 3000)) && isnotnull(c3#))
06       +- Relation[c1#,c2#,c3#,c4#,c5#] parquet

+----------+---------------+---------+---------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|SubPlan   |IndexName      |IndexType|Reason               |Message                                                     |VerboseMessage                                                                                                     |
+----------+---------------+---------+---------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
|Filter @2 |leftDfJoinIndex|CI       |MISSING_REQUIRED_COL |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c4]            |Index does not contain required columns. Required columns: [c3,c4,c5,c2,c1], Index columns: [c3,c4]                |
|Filter @5 |leftDfJoinIndex|CI       |MISSING_REQUIRED_COL |requiredCols=[c3,c4,c5,c2,c1], indexCols=[c3,c4]            |Index does not contain required columns. Required columns: [c3,c4,c5,c2,c1], Index columns: [c3,c4]                |
|Join @0   |leftDfJoinIndex|CI       |MISSING_INDEXED_COL  |child=[right], requiredIndexedCols=[c5,c3], indexedCols=[c3]|Index does not contain required columns for right subplan. Required indexed columns: [c5,c3], Indexed columns: [c3]|
|Project @1|leftDfJoinIndex|CI       |ANOTHER_INDEX_APPLIED|appliedIndex=[leftDfFilterIndex]                            |Another candidate index is applied: leftDfFilterIndex                                                              |
|Project @4|leftDfJoinIndex|CI       |MISSING_REQUIRED_COL |requiredCols=[c5,c3], indexCols=[c3,c4]                     |Index does not contain required columns. Required columns: [c5,c3], Index columns: [c3,c4]                         |
+----------+---------------+---------+---------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------+
  • Additional results for `hs.explain(query, verbose = true)`
=============================================================
Applicable indexes:
=============================================================
Plan without Hyperspace:

00 Join Inner, (Col1# = Col1#)
01 :- Filter isnotnull(Col1#)
02 :  +- Relation[Col1#,Col2#] parquet
03 +- Filter isnotnull(Col1#)
04    +- Relation[Col1#,Col2#] parquet

+---------+---------+---------+---------------+
|SubPlan  |IndexName|IndexType|RuleName       |
+---------+---------+---------+---------------+
|Filter @1|joinIndex|CI       |FilterIndexRule|
|Filter @3|joinIndex|CI       |FilterIndexRule|
|Join @0  |joinIndex|CI       |JoinIndexRule  |
+---------+---------+---------+---------------+
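
The SubPlan labels in the tables above (for example `Filter @2`) appear to combine a node's name with its line number in the plan printed without Hyperspace, which is the node's pre-order position. A minimal sketch of that labeling follows. It uses a hypothetical `PlanNode` type rather than Spark's `LogicalPlan`, and it is not the code from this PR.

```scala
// Hypothetical, simplified plan node; Spark's LogicalPlan is not used here.
final case class PlanNode(name: String, children: Seq[PlanNode] = Nil)

object SubPlanLabels {
  // Returns "<name> @<index>" for every node, numbered in pre-order
  // (a node first, then its children from left to right).
  def label(root: PlanNode): Seq[String] = {
    def preOrder(node: PlanNode): Seq[PlanNode] =
      node +: node.children.flatMap(preOrder)
    preOrder(root).zipWithIndex.map { case (n, i) => s"${n.name} @$i" }
  }

  def main(args: Array[String]): Unit = {
    // Same shape as the join plan in the sample output:
    // Join -> (Project -> Filter -> Relation) twice.
    def side: PlanNode =
      PlanNode("Project", Seq(PlanNode("Filter", Seq(PlanNode("Relation")))))
    val plan = PlanNode("Join", Seq(side, side))
    label(plan).foreach(println)
  }
}
```

For the join plan shape used in the examples, this numbering yields `Join @0`, `Project @1`, `Filter @2` and `Filter @5`, matching the labels in the whyNot tables.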

Follow-up PRs

  • documentation - will add exhaustive documentation for the API
  • python binding

Does this PR introduce any user-facing change?

Yes, a new API is introduced.

How was this patch tested?

@sezruby sezruby changed the title Introduce whyNot API [WIP] Introduce whyNot API May 28, 2021
@sezruby sezruby marked this pull request as draft May 28, 2021 21:34
@sezruby sezruby self-assigned this May 28, 2021
@sezruby sezruby mentioned this pull request Jun 8, 2021
7 tasks
@sezruby sezruby force-pushed the whynotapi branch 2 times, most recently from 3671dd5 to 1bda13c Compare June 14, 2021 23:51
@clee704 clee704 added the enhancement New feature or request label Jun 15, 2021
@sezruby sezruby requested a review from clee704 June 22, 2021 00:39
@sezruby sezruby marked this pull request as ready for review June 22, 2021 00:40
@sezruby sezruby changed the title [WIP] Introduce whyNot API Introduce whyNot API Jun 22, 2021
@sezruby sezruby marked this pull request as draft June 28, 2021 19:34
@sezruby sezruby marked this pull request as ready for review July 13, 2021 22:32
Collaborator Author

sezruby commented Jul 13, 2021

@clee704 Could you review the PR?
I'd like to fix the test failures (JoinIndexRuleTest, FilterIndexRuleTest, ExplainTest) after we confirm the output format and messages.

@sezruby sezruby requested a review from imback82 July 16, 2021 03:40
@@ -171,6 +171,17 @@ class Hyperspace(spark: SparkSession) {
indexManager.index(indexName)
}

def whyNot(df: DataFrame, indexName: String = "", extended: Boolean = false)(

nit: indexName: Option[String] would make a better interface.

Collaborator Author

But this way, users need to wrap the name in "Some(...)", which does not look intuitive:

hyperspace.whyNot(query(leftDf, rightDf)(), Some("leftDfJoinIndex"), extended = true)

Agreed. By the way, how about supporting multiple index names? indexNames: Seq[String] = Nil

Collaborator Author

We could add additional APIs later on demand, such as one that returns a DataFrame instead of printing.
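
The trade-off discussed in this thread shows up at the call site. The sketch below uses two hypothetical stand-in functions (`whyNotDefault` and `whyNotOption` are not part of Hyperspace, and they omit the DataFrame parameter) that only describe which indexes would be analyzed. Treating an empty `indexName` as "all indexes" is an assumption based on the default value in the diff.

```scala
object WhyNotSignatures {
  // Shape used in the diff: a default empty string, assumed to mean "all indexes".
  def whyNotDefault(indexName: String = "", extended: Boolean = false): String = {
    val target = if (indexName.isEmpty) "all indexes" else s"index $indexName"
    s"$target (extended = $extended)"
  }

  // Shape suggested in review: an explicit Option.
  def whyNotOption(indexName: Option[String] = None, extended: Boolean = false): String = {
    val target = indexName.fold("all indexes")(n => s"index $n")
    s"$target (extended = $extended)"
  }

  def main(args: Array[String]): Unit = {
    println(whyNotDefault())
    println(whyNotDefault("leftDfJoinIndex", extended = true))
    // The Option variant needs the extra Some(...) wrapper at every call site.
    println(whyNotOption(Some("leftDfJoinIndex"), extended = true))
  }
}
```

The `Option` form makes "no index given" explicit in the type, while the default-string form keeps calls shorter; the PR went with the shorter call site.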

@sezruby sezruby force-pushed the whynotapi branch 6 times, most recently from deed490 to 37c0bee Compare July 21, 2021 03:57
clee704 previously approved these changes Aug 2, 2021
@clee704 clee704 left a comment

thanks!

@sezruby sezruby merged commit b60393a into microsoft:master Aug 3, 2021
3 participants