Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Exec Parsers - remove individual parser classes #1396

Merged
merged 6 commits into from
Nov 6, 2024

Conversation

nartal1
Copy link
Collaborator

@nartal1 nartal1 commented Oct 29, 2024

This PR contributes to #1325 . This is the first iteration to clean up ExecParsers code.

This pull request refactors several execution parsers in the planparser package to use a new generic execution parser class. The changes aim to simplify the codebase by consolidating common logic into a single class, GenericExecParser.
In this PR, following changes are done:

  1. If the Execs don't have expressions, then execName is assigned and ExecInfo object is created from GenericExecParser.
  2. If the Execs have expressions then appropriate expressionParser is passed to the constructor of GenericExexParser and ExecInfo object is created from GenericParser.
  3. If there is any additional calculation in the parse function, then only that part is overriden in the derived ExecParser and the rest of the function uses the GenereicExecParser base code.
  4. Removed all the files which are not needed anymore
  5. HashAggregateExec was getting incorrect metrics. Updated it to get the correct one.
  6. Updated unit tests which had HashAggregateExec. Current unit tests in SqlPlanParserSuite passes with these changes.

Refactoring to use GenericExecParser:

Removal of Individual Parser Classes:

  • Removed multiple individual parser classes, including AggregateInPandasExecParser, ArrowEvalPythonExecParser, CartesianProductExecParser, CoalesceExecParser, CollectLimitExecParser, ExpandExecParser, FilterExecParser, FlatMapGroupsInPandasExecParser, GenerateExecParser, and others. These classes were responsible for parsing specific execution nodes but are now replaced by a more generic approach. [1] [2] [3] [4] [5] [6] [7] [8] [9]

These changes streamline the parser configuration process, reduce redundancy, and make it easier to manage and extend the parser functionality.

This PR is to remove redundant code in many ExecParsers. We have
several ExecParsers with the same functionality in parse code.
In this PR, following changes are done:

1. If the Execs don't have expressions, then execName is assigned and
ExecInfo object is created from GenericExecParser.
2. If the Execs have expressions then reflection is used to assign the
appropriate parser to be used.
3. If there is any additional calculation in the parse function, then
only that part is overriden in the ExecParser and the rest of the
function uses the GenereicExecParser base code
4.Removed all the files which are not needed anymore

Signed-off-by: Niranjan Artal <[email protected]>
@nartal1 nartal1 added feature request New feature or request core_tools Scope the core module (scala) labels Oct 29, 2024
@nartal1 nartal1 self-assigned this Oct 29, 2024
@nartal1 nartal1 marked this pull request as draft October 29, 2024 01:26
@tgravescs
Copy link
Collaborator

Introduced json file which has all the information of Execs, expressions and so on. This can be extended to include other info such as join types, metrics and so on.

Can you please provide more details on the benefits of using a json file for this is better then just having it in the code? The base classes all make sense but having extra json file based on what I see implemented here seems like extra complexity that I want to understand.

@nartal1
Copy link
Collaborator Author

nartal1 commented Oct 29, 2024

Introduced json file which has all the information of Execs, expressions and so on. This can be extended to include other info such as join types, metrics and so on.

Can you please provide more details on the benefits of using a json file for this is better then just having it in the code? The base classes all make sense but having extra json file based on what I see implemented here seems like extra complexity that I want to understand.

I added json file so that we have all the Execs information at one place and can add or update the Execs/expressions/configs easily. Also, if there are any fields to be added - like joinType we can update the json file instead of updating the code.
Agree that it adds complexity of reading json and there is no type checking. The main reason was to decouple the parserNames and parserMethods from the code.
Would you prefer this information to be in the code itself?We can create a Map to hold all the Execs information something like below in ExecParserLoader.scala (instead of reading and parsing the JSON we could add it in the code itself):

val parsersConfig: Map[String, ParserConfig] = Map(
    "Union" -> ParserConfig(
      execName = "Union",
      parseExpressions = false,
    ),
    "TakeOrderedAndProject" -> ParserConfig(
      execName = "TakeOrderedAndProject",
      parseExpressions = true,
      expressionParserMethod = Some(sqlPlanParserType.decl(TermName("parseTakeOrderedExpressions")).asMethod),
      durationMetrics = None

@tgravescs
Copy link
Collaborator

So its definitely an interesting solutions but without further information or reasons for this design we seem to be introducing more complexity, less readable, performance overhead, and losing type checking and binary compatibility checks for no reasons.

I honestly don't even know that you need a map. This can just be a case statement and some subclasses to reduce most of the code duplication.

@amahussein
Copy link
Collaborator

So its definitely an interesting solutions but without further information or reasons for this design we seem to be introducing more complexity, less readable, performance overhead, and losing type checking and binary compatibility checks for no reasons.

I honestly don't even know that you need a map. This can just be a case statement and some subclasses to reduce most of the code duplication.

There are some other reasons why would need to configure the tools' handling for different execs:

  • In subtask of [BUG] Revisit the categorization of unsupported ops in Qual tool output #1116. Currently, we hardcode how each exec is assigned a category. This makes it difficult for:
    • Internal teams won't know how an exec is going to be reported by the tools unless they run the tools cmd on an eventlog.
    • Switching categories of different execs require lots of code changes; especially with the overlap between shouldRemove/UDFs/special cases of ScanOps..etc.
    • Having execs configuration:
      • Makes it easy to spot that in one place
      • Allows internal team to experiment different the tools behavior by changing the configuration of specific execs.
      • Reduce the maintenance on the devs side since changes can be done by non-dev contributors.
  • Adding new environments will be more straightforward without changing the core classes. For example, supporting Velox/Photon would be simplified by only implementing what's different for the parsing portion.
  • Currently, adding new spark execs requires adding a new class which ends up being copy/paste from other existing classes.
    • It introduces bugs as we find that changes in one place needs to be carried over in ten other class files.
    • sync up with the plugin support files can be limited to update the configurations.
  • Finally, one important reason to refactor the parsing code is to allow using that parsing from Profiler/Qualifcation context. Currently, the parser code assumes the qualification; however, the same functionality would be needed as more customer are going to run the Profiling tool (some users requested detailed operator reports in the profiling output)

@tgravescs
Copy link
Collaborator

tgravescs commented Oct 30, 2024

I think you might be missing my main point. I'm not against refactoring it to make use of common classes. I mentioned this years ago. I'm questioning the need for a separate json file and using reflection.

Internal teams won't know how an exec is going to be reported by the tools unless they run the tools cmd on an eventlog.

I don't know what this means, you have to run the tool to get the output. how does this change magically report output?

Having execs configuration:

- Makes it easy to spot that in one place
- Allows internal team to experiment different the tools behavior by changing the configuration of specific execs.
- Reduce the maintenance on the devs side since changes can be done by non-dev contributors.

  • it can be in one place in the code as well, and with a separate file I now have to have 2 files open to find what it actually is doing. Harder to navigate via IDE.
  • When would this experimentation happen? you either support an exec properly or you don't., The exec either supports expressions or it doesn't. If you are adding it you either add a couple lines of code of couples lines in json file
  • When would a non-dev modify this? This was going back to my question or if customers or someone is modifying it then this may make sense, but at the same time in my opinion this is also adding a lot of complexity.
Adding new environments will be more straightforward without changing the core classes. For example, supporting Velox/Photon would be simplified by only implementing what's different for the parsing portion.

I don't understand how this has anything to do with json/reflection vs doing it in code. Either case you are adding a few lines somewhere and do the testing. I guess adding to a json file could be easier but you have the negatives already listed as well. A developer should be easily be able to add a couple lines of code, adding in code does all the proper type checking with less complexity.

 Currently, adding new spark execs requires adding a new class which ends up being copy/paste from other existing classes.

I already stated ways to remove the new class requirement. That was mostly done originally to parallelize development, other ways to reduce code without introducing json file and reflection.

Finally, one important reason to refactor the parsing code is to allow using that parsing from Profiler/Qualifcation context. Currently, the parser code assumes the qualification; however, the same functionality would be needed as more customer are going to run the Profiling tool (some users requested detailed operator reports in the profiling output)

This again doesn't require json and reflection to reuse code. The question isn't about refactoring the code, the question is more about adding this json file and reflection.

@amahussein
Copy link
Collaborator

I don't know what this means, you have to run the tool to get the output. how does this change magically report output?

  • The tools would generate a formatted file report for operator characteristics (supported, category,..etc.). Or the non-dev can look at the configuration field to know the behavior
  • Will be easier to test. This config file can be checked versus the CSV output files to verify the behavior of the tools.

Anyway, most important part that we agree on the refactoring.
I am fine with removing the reflection.

@nartal1 nartal1 marked this pull request as ready for review October 31, 2024 06:09
@nartal1
Copy link
Collaborator Author

nartal1 commented Oct 31, 2024

@tgravescs @amahussein - Please take another look and let me know if this approach seems okay.

Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nartal1
I posted a few questions

1. default Exec name
2. added argument to GenericParser constructor
Signed-off-by: Niranjan Artal <[email protected]>
@nartal1
Copy link
Collaborator Author

nartal1 commented Nov 5, 2024

Thanks @amahussein and @tgravescs for the review. I have addressed review comments.
This PR doesn't include refactor of all the Execs but covers most of the ones which had common code. PTAL.

Copy link
Collaborator

@amahussein amahussein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nartal1
LGTME!

@nartal1 nartal1 merged commit 019ede2 into NVIDIA:dev Nov 6, 2024
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core_tools Scope the core module (scala) feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants