Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(E6data): initial implemantation for E6data SQL Analytics platform #9517

Closed
wants to merge 19 commits into from

Conversation

hackintoshrao
Copy link

@hackintoshrao hackintoshrao commented Jul 4, 2024

Description of changes

This pull request aims to integrate E6data, a distributed SQL analytics engine, with Ibis. E6data is also designed for high-performance analytics on large-scale data, and this integration will allow Ibis users to leverage E6data's capabilities seamlessly.

Key changes and considerations:

  1. E6data Backend Implementation:

    • A new E6data backend class was added, inheriting from SQLBackend.
    • Implemented connection handling, query execution, and result fetching specific to E6data.
    • Adapted the backend to handle E6data's unique features, such as catalog support and cluster management.
  2. SQL Dialect Customization:

    • Created an E6data dialect class extending from MySQL, as E6data shares similarities with MySQL syntax.
    • Customized the Tokenizer to use double quotes for identifiers.
    • Modified the Generator to map certain data types (VARCHAR, CHAR, TEXT) to STRING, aligning with E6data's type system.
    • Custom TRANSFORMS for specific SQL functions like concat and length were added.
  3. Compiler Modifications:

    • Updated E6DataCompiler to use the new E6data dialect and E6DataType for type mapping.
    • Retained existing rewrites, including a custom limit rewrite, to ensure compatibility with E6data's query execution model.
  4. Connection String and Authentication:

    • Implemented support for E6data's connection string format, including catalog name, secure connection, auto-resume, and cluster UUID parameters.
  5. Schema and Metadata Handling:

    • Adapted schema retrieval and table listing functions to work with E6data's multi-level hierarchy (catalog, database, table).
  6. Testing:
    I need some guidance on how to go about adding tests for the integration.

  7. Documentation:
    Could you give me some pointers on how to add relevant documentation?

  8. Dependencies:

    • Currently, the platform is not open for public access; existing users require authentication keys to use the analytics engine. I would like guidance on enabling the maintainers to test and provide credentials for automated tests once they are supported.
      We're also working on a Mini-kube-based single-node testing infrastructure, which might make adding testing automation for the CI easier.

I'm new to the Ibis community, and this PR could be much better. I appreciate the time and guidance from maintainers in improving it further. Any comments are welcome; again, I appreciate your time and patience.

- Add support for catalog, secure connection, auto-resume, and cluster UUID
- Implement custom table() method to handle catalog and database hierarchy
- Modify get_schema() to use E6Data-specific column information
- Adjust execute() method for E6Data compatibility
- Update _fetch_from_cursor() to handle E6Data result format
- Customize Tokenizer to use double quotes for identifiers
- Modify Generator to map VARCHAR, CHAR, and TEXT to STRING
- Add custom TRANSFORMS for concat and length functions
- Add E6data dialect
- Use E6DataType for type mapping
- Retain existing rewrites including custom limit rewrite
- Keep other compiler configurations unchanged
Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Did a first pass, and will add a separate comment addressing some of your questions.

pyproject.toml Outdated
@@ -87,6 +87,7 @@ shapely = { version = ">=2,<3", optional = true }
# issues with versions <3.0.2
snowflake-connector-python = { version = ">=3.0.2,<4,!=3.3.0b1", optional = true }
trino = { version = ">=0.321,<1", optional = true }
e6data-python-connector = { version = "2.2.0" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
e6data-python-connector = { version = "2.2.0" }
e6data-python-connector = { version = ">=2.2.0,<3", optional = true }

Unless there's a reason to pin to a single version, this should include a version range, and optional = true is necessary to ensure this is shipped as an extra, and not a required dependency to use any other Ibis backend(s).

@@ -403,6 +403,7 @@ class Generator(Postgres.Generator):
JSON_TYPE_REQUIRED_FOR_EXTRACTION = True
SUPPORTS_UNLOGGED_TABLES = True


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Please revert this change.

from urllib.parse import parse_qs, urlparse

import numpy as np
import pymysql
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you using this somewhere? If not, pleas remove it as an e6data backend dependency and also remove this import.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem like the correct implementation for an e6data conftest.py given that it's not running in a container.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the tests until implemented correctly.

pyproject.toml Outdated
@@ -152,6 +153,7 @@ dask = ["dask", "regex", "packaging"]
datafusion = ["datafusion"]
druid = ["pydruid"]
duckdb = ["duckdb"]
e6data = ["e6data-python-connector","pymysql"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
e6data = ["e6data-python-connector","pymysql"]
e6data = ["e6data-python-connector"]

Is pymysql actually required? It doesn't seem to be used anywhere.



@public
class E6DataCompiler(SQLGlotCompiler):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not inherit from MySQLCompiler?

import polars as pl
import pyarrow as pa

class Backend(SQLBackend, CanCreateDatabase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With all the copypasting it might be worth considering inheriting from MySQLBackend.

Comment on lines 46 to 47
return ".".join(matched.groups())
def _from_url(self, url: str, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure to run your code through just fmt. There are likely other places where there are formatting inconsistencies.

@cpcloud
Copy link
Member

cpcloud commented Jul 4, 2024

  1. Testing:
    I need some guidance on how to go about adding tests for the integration.

The best place to start is to try and run pytest -m e6data, assuming you have a way to access it from your development host.

  1. Documentation:
    Could you give me some pointers on how to add relevant documentation?

Take a look at the individual backend docs pages in docs/backends/. Those are a good place to start with backend-specific docs.

  1. Dependencies:
    • Currently, the platform is not open for public access; existing users require authentication keys to use the analytics engine. I would like guidance on enabling the maintainers to test and provide credentials for automated tests once they are supported.
      We're also working on a Mini-kube-based single-node testing infrastructure, which might make adding testing automation for the CI easier.

We'll need to get whatever credentials/auth information is needed to login into a GitHub Actions secret. Let's chat in a DM on Zulip about this.

- Change Backend class to inherit from MySQLBackend instead of SQLBackend
- Remove unnecessary imports and methods duplicated in MySQLBackend
- Update connection handling to use E6data_python_connector
- Modify schema and table retrieval methods for E6data compatibility
- Replace MySQLPandasData with E6DataPandasData for data conversion
- Clean up and streamline code, removing print statements and unused methods
@hackintoshrao
Copy link
Author

Hey @cpcloud ,

I've addressed most of your comments apart from the tests and documentation. Please let me know if these changes look good. I'll be working on writing the tests now.

We'll need to get whatever credentials/auth information is needed to login into a GitHub Actions secret. Let's chat in a DM on Zulip about this.

I've Dm'ed you on Zulip, kindly need your assistance.

@cpcloud cpcloud closed this Dec 18, 2024
@cpcloud
Copy link
Member

cpcloud commented Dec 18, 2024

Please create a new PR if you're still interested in working on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants