Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#7433: Replace NativeDataFrameMode with a complete "native" execution. #7436

Conversation

sfc-gh-mvashishtha
Copy link
Contributor

@sfc-gh-mvashishtha sfc-gh-mvashishtha commented Feb 9, 2025

Replace NativeDataFrameMode with "NativeOnNative" execution mode.

User-facing changes

Prior to this commit, users had to switch the config variable NativeDataFrameMode from the default of "Default" to "Pandas" to use native execution. Now native execution is another modin execution mode with StorageFormat of "Native" and Engine of "Native."

Integration tests and CI

Prior to this commit, we ran 1) a set of tests checking that native Modin dataframes could interoperate with non-native dataframes 2) a subset of tests in native dataframe mode.

Now, we run the interoperability test suite, but also run the entire rest of the test suite (except for some partitioned-execution-only tests) in native execution mode via the test-all job matrix. This commit also renames the interoperability test suite from modin/tests/pandas/native_df_mode to modin/tests/pandas/native_df_interoperability/.

Deleting most of the NativeQueryCompiler implementation

NativeQueryCompiler had a long implementation which was mostly the same as the BaseQueryCompiler implementation. However, there were some bugs in NativeQueryCompiler, including some correctness bugs related to copying the underlying pandas dataframe (see #7435). This commit deletes most of the NativeQueryCompiler implementation, so that the native query compiler mostly works just like the BaseQueryCompiler. The main difference is that while BaseQueryCompiler uses a partitioned pandas dataframe (under the Python execution, so all in a single process), the native query compiler does not use partitions.

Warning messages about default to pandas

While BaseQueryCompiler and BaseIO warn when they default to pandas, they should not do so when using native execution. We add class-level fields to these classes that tell whether to warn on default to pandas.

By default, we treat warnings as errors in our test suite, so in many places we have to look for the default to pandas warning only if we are not native execution mode. For convenience, this PR adds testing utility methods to 1) detect the global native execution mode 2) detect whether a dataframe or series is using native execution 3) conditionally expect a warning about defaulting to pandas.

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Add an execution mode for NativeQueryCompiler and remove NativeDataFrameMode. #7433
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

…"native" execution.

Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Signed-off-by: sfc-gh-mvashishtha <[email protected]>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI overview of changes here:

  • I added some clarifying comments here around where I made changes.
  • We now test all native dataframe mode capabilities on every pull request or push to main, instead of conditionally on test-native-dataframe-mode.
  • "native" mostly runs the same tests that we run with "PandasOnPython", and it also runs the native interoperability suite.

Copy link
Contributor Author

@sfc-gh-mvashishtha sfc-gh-mvashishtha Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: The changes here are all about enabling skipping the "default to pandas" warning in the native I/O subclass of BaseIO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI the changes here are about enabling subclasses to choose whether to warn on default to pandas.

modin/core/storage_formats/pandas/native_query_compiler.py Outdated Show resolved Hide resolved
Copy link
Contributor

@sfc-gh-joshi sfc-gh-joshi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple questions. It may also be worth pinging the people that added the Native QC in the first place, in case this would cause any breaking changes for them.

.github/workflows/ci.yml Show resolved Hide resolved
modin/core/storage_formats/pandas/native_query_compiler.py Outdated Show resolved Hide resolved
docs/usage_guide/optimization_notes/index.rst Show resolved Hide resolved
modin/core/storage_formats/pandas/native_query_compiler.py Outdated Show resolved Hide resolved
modin/tests/pandas/dataframe/test_udf.py Outdated Show resolved Hide resolved
@sfc-gh-mvashishtha
Copy link
Contributor Author

@arunjose696 @anmyachev we are trying to add a complete execution mode for Native execution so that eventually, we can make it easy to switch execution between native and Snowflake execution.

Please let us know if anyone is relying on NativeDataFrameMode.

@anmyachev
Copy link
Collaborator

@arunjose696 @anmyachev we are trying to add a complete execution mode for Native execution so that eventually, we can make it easy to switch execution between native and Snowflake execution.

Please let us know if anyone is relying on NativeDataFrameMode.

Hi @sfc-gh-mvashishtha, based on the following comment and the pull request description, the change is essentially only in the mode name (from user's perspective). If so, no problem, we'll rewrite it if necessary. Thanks for letting me know.

the NativeQueryCompiler object is still backed by an unpartitioned pandas frame, correct? You've just made it inherit most BaseQueryCompiler methods instead?

That's correct.

@sfc-gh-mvashishtha sfc-gh-mvashishtha merged commit 4152c95 into modin-project:main Feb 13, 2025
57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add an execution mode for NativeQueryCompiler and remove NativeDataFrameMode.
3 participants