Refactor Exec Parsers - remove individual parser classes #1396

nartal1 · 2024-10-29T00:51:04Z

This PR contributes to #1325 . This is the first iteration to clean up ExecParsers code.

This pull request refactors several execution parsers in the planparser package to use a new generic execution parser class. The changes aim to simplify the codebase by consolidating common logic into a single class, GenericExecParser.
In this PR, following changes are done:

If the Execs don't have expressions, then execName is assigned and ExecInfo object is created from GenericExecParser.
If the Execs have expressions then appropriate expressionParser is passed to the constructor of GenericExexParser and ExecInfo object is created from GenericParser.
If there is any additional calculation in the parse function, then only that part is overriden in the derived ExecParser and the rest of the function uses the GenereicExecParser base code.
Removed all the files which are not needed anymore
HashAggregateExec was getting incorrect metrics. Updated it to get the correct one.
Updated unit tests which had HashAggregateExec. Current unit tests in SqlPlanParserSuite passes with these changes.

Refactoring to use GenericExecParser:

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/CustomShuffleReaderExecParser.scala: Modified to extend GenericExecParser instead of implementing its own parsing logic.
core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/GenericExecParser.scala: Added a new GenericExecParser class to handle common parsing logic for execution nodes.

Removal of Individual Parser Classes:

Removed multiple individual parser classes, including AggregateInPandasExecParser, ArrowEvalPythonExecParser, CartesianProductExecParser, CoalesceExecParser, CollectLimitExecParser, ExpandExecParser, FilterExecParser, FlatMapGroupsInPandasExecParser, GenerateExecParser, and others. These classes were responsible for parsing specific execution nodes but are now replaced by a more generic approach. [1] [2] [3] [4] [5] [6] [7] [8] [9]

These changes streamline the parser configuration process, reduce redundancy, and make it easier to manage and extend the parser functionality.

This PR is to remove redundant code in many ExecParsers. We have several ExecParsers with the same functionality in parse code. In this PR, following changes are done: 1. If the Execs don't have expressions, then execName is assigned and ExecInfo object is created from GenericExecParser. 2. If the Execs have expressions then reflection is used to assign the appropriate parser to be used. 3. If there is any additional calculation in the parse function, then only that part is overriden in the ExecParser and the rest of the function uses the GenereicExecParser base code 4.Removed all the files which are not needed anymore Signed-off-by: Niranjan Artal <[email protected]>

tgravescs · 2024-10-29T14:43:57Z

Introduced json file which has all the information of Execs, expressions and so on. This can be extended to include other info such as join types, metrics and so on.

Can you please provide more details on the benefits of using a json file for this is better then just having it in the code? The base classes all make sense but having extra json file based on what I see implemented here seems like extra complexity that I want to understand.

nartal1 · 2024-10-29T18:01:31Z

Introduced json file which has all the information of Execs, expressions and so on. This can be extended to include other info such as join types, metrics and so on.

Can you please provide more details on the benefits of using a json file for this is better then just having it in the code? The base classes all make sense but having extra json file based on what I see implemented here seems like extra complexity that I want to understand.

I added json file so that we have all the Execs information at one place and can add or update the Execs/expressions/configs easily. Also, if there are any fields to be added - like joinType we can update the json file instead of updating the code.
Agree that it adds complexity of reading json and there is no type checking. The main reason was to decouple the parserNames and parserMethods from the code.
Would you prefer this information to be in the code itself?We can create a Map to hold all the Execs information something like below in ExecParserLoader.scala (instead of reading and parsing the JSON we could add it in the code itself):

val parsersConfig: Map[String, ParserConfig] = Map(
    "Union" -> ParserConfig(
      execName = "Union",
      parseExpressions = false,
    ),
    "TakeOrderedAndProject" -> ParserConfig(
      execName = "TakeOrderedAndProject",
      parseExpressions = true,
      expressionParserMethod = Some(sqlPlanParserType.decl(TermName("parseTakeOrderedExpressions")).asMethod),
      durationMetrics = None

tgravescs · 2024-10-29T19:13:18Z

So its definitely an interesting solutions but without further information or reasons for this design we seem to be introducing more complexity, less readable, performance overhead, and losing type checking and binary compatibility checks for no reasons.

I honestly don't even know that you need a map. This can just be a case statement and some subclasses to reduce most of the code duplication.

amahussein · 2024-10-30T15:25:05Z

So its definitely an interesting solutions but without further information or reasons for this design we seem to be introducing more complexity, less readable, performance overhead, and losing type checking and binary compatibility checks for no reasons.

I honestly don't even know that you need a map. This can just be a case statement and some subclasses to reduce most of the code duplication.

There are some other reasons why would need to configure the tools' handling for different execs:

In subtask of [BUG] Revisit the categorization of unsupported ops in Qual tool output #1116. Currently, we hardcode how each exec is assigned a category. This makes it difficult for:
- Internal teams won't know how an exec is going to be reported by the tools unless they run the tools cmd on an eventlog.
- Switching categories of different execs require lots of code changes; especially with the overlap between shouldRemove/UDFs/special cases of ScanOps..etc.
- Having execs configuration:
  - Makes it easy to spot that in one place
  - Allows internal team to experiment different the tools behavior by changing the configuration of specific execs.
  - Reduce the maintenance on the devs side since changes can be done by non-dev contributors.
Adding new environments will be more straightforward without changing the core classes. For example, supporting Velox/Photon would be simplified by only implementing what's different for the parsing portion.
Currently, adding new spark execs requires adding a new class which ends up being copy/paste from other existing classes.
- It introduces bugs as we find that changes in one place needs to be carried over in ten other class files.
- sync up with the plugin support files can be limited to update the configurations.
Finally, one important reason to refactor the parsing code is to allow using that parsing from Profiler/Qualifcation context. Currently, the parser code assumes the qualification; however, the same functionality would be needed as more customer are going to run the Profiling tool (some users requested detailed operator reports in the profiling output)

tgravescs · 2024-10-30T15:43:21Z

I think you might be missing my main point. I'm not against refactoring it to make use of common classes. I mentioned this years ago. I'm questioning the need for a separate json file and using reflection.

Internal teams won't know how an exec is going to be reported by the tools unless they run the tools cmd on an eventlog.

I don't know what this means, you have to run the tool to get the output. how does this change magically report output?

Having execs configuration:

- Makes it easy to spot that in one place
- Allows internal team to experiment different the tools behavior by changing the configuration of specific execs.
- Reduce the maintenance on the devs side since changes can be done by non-dev contributors.

it can be in one place in the code as well, and with a separate file I now have to have 2 files open to find what it actually is doing. Harder to navigate via IDE.
When would this experimentation happen? you either support an exec properly or you don't., The exec either supports expressions or it doesn't. If you are adding it you either add a couple lines of code of couples lines in json file
When would a non-dev modify this? This was going back to my question or if customers or someone is modifying it then this may make sense, but at the same time in my opinion this is also adding a lot of complexity.

Adding new environments will be more straightforward without changing the core classes. For example, supporting Velox/Photon would be simplified by only implementing what's different for the parsing portion.

I don't understand how this has anything to do with json/reflection vs doing it in code. Either case you are adding a few lines somewhere and do the testing. I guess adding to a json file could be easier but you have the negatives already listed as well. A developer should be easily be able to add a couple lines of code, adding in code does all the proper type checking with less complexity.

 Currently, adding new spark execs requires adding a new class which ends up being copy/paste from other existing classes.

I already stated ways to remove the new class requirement. That was mostly done originally to parallelize development, other ways to reduce code without introducing json file and reflection.

Finally, one important reason to refactor the parsing code is to allow using that parsing from Profiler/Qualifcation context. Currently, the parser code assumes the qualification; however, the same functionality would be needed as more customer are going to run the Profiling tool (some users requested detailed operator reports in the profiling output)

This again doesn't require json and reflection to reuse code. The question isn't about refactoring the code, the question is more about adding this json file and reflection.

amahussein · 2024-10-30T15:54:25Z

I don't know what this means, you have to run the tool to get the output. how does this change magically report output?

The tools would generate a formatted file report for operator characteristics (supported, category,..etc.). Or the non-dev can look at the configuration field to know the behavior
Will be easier to test. This config file can be checked versus the CSV output files to verify the behavior of the tools.

Anyway, most important part that we agree on the refactoring.
I am fine with removing the reflection.

…ve Parsers Signed-off-by: Niranjan Artal <[email protected]>

nartal1 · 2024-10-31T06:13:48Z

@tgravescs @amahussein - Please take another look and let me know if this approach seems okay.

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/CustomShuffleReaderExecParser.scala

amahussein

Thanks @nartal1
I posted a few questions

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/ObjectHashAggregateExecParser.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/SQLPlanParser.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/GenericExecParser.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/SQLPlanParser.scala

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/GenericExecParser.scala

1. default Exec name 2. added argument to GenericParser constructor

Signed-off-by: Niranjan Artal <[email protected]>

…-issue-1325-part1

nartal1 · 2024-11-05T02:16:18Z

Thanks @amahussein and @tgravescs for the review. I have addressed review comments.
This PR doesn't include refactor of all the Execs but covers most of the ones which had common code. PTAL.

amahussein

Thanks @nartal1
LGTME!

nartal1 added feature request New feature or request core_tools Scope the core module (scala) labels Oct 29, 2024

nartal1 requested a review from amahussein October 29, 2024 00:51

nartal1 self-assigned this Oct 29, 2024

nartal1 marked this pull request as draft October 29, 2024 01:26

Remove reflection instead use case classes to match and call respecti…

88e9c57

…ve Parsers Signed-off-by: Niranjan Artal <[email protected]>

nartal1 marked this pull request as ready for review October 31, 2024 06:09

tgravescs reviewed Oct 31, 2024

View reviewed changes

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/CustomShuffleReaderExecParser.scala Outdated Show resolved Hide resolved

core/src/main/scala/com/nvidia/spark/rapids/tool/planparser/CustomShuffleReaderExecParser.scala Outdated Show resolved Hide resolved

amahussein reviewed Nov 1, 2024

View reviewed changes

nartal1 added 4 commits November 3, 2024 18:10

addressed few review comments

2dc2146

1. default Exec name 2. added argument to GenericParser constructor

addressed review comments

926d82a

Signed-off-by: Niranjan Artal <[email protected]>

Merge branch 'dev' of github.com:NVIDIA/spark-rapids-tools into tools…

820d541

…-issue-1325-part1

Update tests

a1bdc3a

amahussein approved these changes Nov 6, 2024

View reviewed changes

nartal1 merged commit 019ede2 into NVIDIA:dev Nov 6, 2024
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor Exec Parsers - remove individual parser classes #1396

Refactor Exec Parsers - remove individual parser classes #1396

nartal1 commented Oct 29, 2024 •

edited

Loading

tgravescs commented Oct 29, 2024

nartal1 commented Oct 29, 2024

tgravescs commented Oct 29, 2024

amahussein commented Oct 30, 2024

tgravescs commented Oct 30, 2024 •

edited

Loading

amahussein commented Oct 30, 2024

nartal1 commented Oct 31, 2024

amahussein left a comment

nartal1 commented Nov 5, 2024

amahussein left a comment

Refactor Exec Parsers - remove individual parser classes #1396

Refactor Exec Parsers - remove individual parser classes #1396

Conversation

nartal1 commented Oct 29, 2024 • edited Loading

Removal of Individual Parser Classes:

tgravescs commented Oct 29, 2024

nartal1 commented Oct 29, 2024

tgravescs commented Oct 29, 2024

amahussein commented Oct 30, 2024

tgravescs commented Oct 30, 2024 • edited Loading

amahussein commented Oct 30, 2024

nartal1 commented Oct 31, 2024

amahussein left a comment

Choose a reason for hiding this comment

nartal1 commented Nov 5, 2024

amahussein left a comment

Choose a reason for hiding this comment

nartal1 commented Oct 29, 2024 •

edited

Loading

tgravescs commented Oct 30, 2024 •

edited

Loading