
feat[DatabricksConnector]: connector to connect to Databricks on Cloud #580

Merged (5 commits, Sep 21, 2023)

Conversation

ArslanSaleem
Collaborator

@ArslanSaleem ArslanSaleem commented Sep 20, 2023

Summary by CodeRabbit

  • New Feature: Introduced DatabricksConnector to connect with Databricks Data Cloud, expanding the range of supported data sources.
  • Refactor: Updated SnowFlakeConnector and SQLConnector classes for better configuration handling and security. Passwords are no longer displayed in string representations.
  • Test: Added comprehensive unit tests for the new DatabricksConnector class ensuring its functionality and behavior.
  • Refactor: Simplified SQL queries in rows_count and columns_count methods for improved readability and performance.
  • New Feature: Added support for environment variables in connector configurations, providing more flexibility in setting up connections.
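The environment-variable support described in the last bullet can be sketched as a small helper; the function name, the key-to-variable mapping, and the `DATABRICKS_*` variable names below are illustrative assumptions, not the library's actual API.

```python
import os

def populate_config_from_env(config: dict, env_vars: dict) -> dict:
    """Hypothetical sketch of env-var config population: env_vars maps
    config keys to environment variable names, and a key is filled only
    when it is missing from config and the variable is set."""
    for key, env_name in env_vars.items():
        if key not in config and os.getenv(env_name):
            config[key] = os.getenv(env_name)
    return config

# Example: the token comes from the environment, the host is explicit.
os.environ["DATABRICKS_TOKEN"] = "dapi-example"  # demo value only
config = populate_config_from_env(
    {"host": "adb-1234.azuredatabricks.net"},
    {"host": "DATABRICKS_HOST", "token": "DATABRICKS_TOKEN"},
)
```

Explicit values always win; the environment only fills the gaps.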

@coderabbitai
Contributor

coderabbitai bot commented Sep 20, 2023

Walkthrough

This update introduces the DatabricksConnector to the PandasAI library, enabling users to connect to Databricks Data Cloud. It also refactors existing connectors and enhances test coverage. The changes improve code maintainability, security by excluding sensitive data from string representations, and performance by optimizing SQL queries.

Changes

File(s) and summary:
  • examples/from_databricks.py: Demonstrates usage of PandasAI with the new DatabricksConnector.
  • pandasai/connectors/.../base.py, pandasai/connectors/.../databricks.py: Introduces DatabricksConnectorConfig and DatabricksConnector classes for Databricks support.
  • pandasai/connectors/.../sql.py, pandasai/connectors/.../snowflake.py: Refactors SQL and Snowflake connectors, optimizes SQL queries, and improves security in string representations.
  • tests/connectors/.../test_databricks.py: Adds unit tests for DatabricksConnector.
  • tests/connectors/.../test_snowflake.py, tests/connectors/.../test_sql.py: Enhances test coverage for SQL and Snowflake connectors.

🐇💻

Code hopping through the fields,

Connectors grow, our project yields.

Databricks joins the merry band,

With SQL queries close at hand.

Tests ensure we're on the right track,

As we code, there's no looking back!

So here's to progress, swift and bright,

Coding rabbits, take flight! 🚀🌙



Contributor

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 12

Files that changed from the base of the PR and between 7738feb and 70c71ed.
Files ignored due to filter (2)
  • poetry.lock
  • pyproject.toml
Files selected for processing (8)
  • examples/from_databricks.py (1 hunks)
  • pandasai/connectors/__init__.py (2 hunks)
  • pandasai/connectors/base.py (1 hunks)
  • pandasai/connectors/databricks.py (1 hunks)
  • pandasai/connectors/sql.py (2 hunks)
  • tests/connectors/test_databricks.py (1 hunks)
  • tests/connectors/test_snowflake.py (1 hunks)
  • tests/connectors/test_sql.py (1 hunks)
Files skipped from review due to trivial changes (1)
  • pandasai/connectors/__init__.py
Additional comments (Suppressed): 8
examples/from_databricks.py (2)
  • 1-27: The code snippet is an example of how to use the new DatabricksConnector class. It creates a DatabricksConnector instance with a configuration, initializes an OpenAI instance, and uses these to create a SmartDataframe. The chat method is then called on the SmartDataframe instance and the response is printed.

Please ensure that the token and other sensitive information are not hardcoded in the script but rather fetched from secure sources or environment variables. Hardcoding sensitive information can lead to security vulnerabilities.

  • 8-20: Ensure that the host, database, token, port, table, httpPath, and where clause provided in the config dictionary are correct and valid for your Databricks setup. Invalid configurations could lead to connection failures or incorrect data retrieval.
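Following the reviewer's advice above about not hardcoding the token, a minimal stdlib-only sketch of fetching it from an environment variable; the variable name DATABRICKS_TOKEN and the helper function are assumptions for illustration, not the example script's actual code.

```python
import os

def get_databricks_token() -> str:
    """Illustrative helper: read the token from the environment rather
    than hardcoding it in the example script. The variable name
    DATABRICKS_TOKEN is an assumption, not an established convention
    of this codebase."""
    token = os.environ.get("DATABRICKS_TOKEN")
    if not token:
        raise RuntimeError("Set DATABRICKS_TOKEN before running the example")
    return token

os.environ["DATABRICKS_TOKEN"] = "dapi-demo"  # demo value only
token = get_databricks_token()
```

Failing fast with a clear message is preferable to passing an empty token into the connection string.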
pandasai/connectors/base.py (2)
  • 66-75: The DatabricksConnectorConfig class is introduced to hold the configuration for a Databricks connection. Ensure that all necessary parameters are included and correctly named according to the Databricks API documentation. Also, consider adding default values where applicable.

  • 77-79: No changes have been made to the BaseConnector class in this hunk. The class continues to serve as an abstract base class for all connectors.

tests/connectors/test_sql.py (2)
  • 76-83: The new implementation of test_columns_count_property is more efficient and less error-prone than the previous one. Instead of executing a SQL query to get the number of columns, it now simply calls the len function on the columns returned by the head method. This change simplifies the code and reduces the risk of SQL injection attacks. However, ensure that the head method always returns a DataFrame object with a columns attribute.

  • 79-81: This mock setup for the DataFrame object and its columns is a good practice for unit testing. It isolates the test from external dependencies and makes it deterministic.

tests/connectors/test_snowflake.py (2)
  • 76-83: The new implementation of test_columns_count_property is more efficient and less error-prone than the previous one. Instead of executing a SQL query to fetch the column count, it now directly uses the len function on the columns returned by the head method. This change reduces the complexity of the code and makes it easier to understand and maintain. However, ensure that the head method reliably returns a DataFrame with the correct columns.

  • 79-81: Mocking the head method of the connector to return a DataFrame with two columns is a good way to isolate the test from external dependencies. This ensures that the test only verifies the functionality of the columns_count property and does not fail due to issues in the head method or the database connection.

Comment on lines 19 to 28
self.config = DatabricksConnectorConfig(
dialect="databricks",
host="ehxzojy-ue47135",
port=443,
token="token",
database="SNOWFLAKE_SAMPLE_DATA",
httpPath="/sql/1.0/warehouses/1241rsa32",
table="lineitem",
where=[["column_name", "=", "value"]],
).dict()
Contributor

The DatabricksConnectorConfig instance is created with hardcoded values. This might not be a good practice for unit tests as it reduces flexibility and reusability of the test case. Consider using setUpClass method or a fixture to set these values, which can then be overridden in individual test cases if needed.
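The setUpClass suggestion could look roughly like this; the class name, fields, and values are illustrative, not the PR's actual test code.

```python
import unittest

class TestDatabricksConnectorConfig(unittest.TestCase):
    """Sketch of the setUpClass fixture suggested above."""

    @classmethod
    def setUpClass(cls):
        # Shared defaults that individual tests may override.
        cls.base_config = {
            "dialect": "databricks",
            "host": "example-host",
            "port": 443,
            "token": "token",
        }

    def test_override_host(self):
        # Override a single value without repeating the whole config.
        config = {**self.base_config, "host": "other-host"}
        self.assertEqual(config["host"], "other-host")
        self.assertEqual(config["port"], 443)

# Run the suite programmatically so the sketch stays self-contained.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(
    TestDatabricksConnectorConfig
)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```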

Comment on lines 33 to 39
def test_constructor_and_properties(self):
# Test constructor and properties
self.assertEqual(self.connector._config, self.config)
self.assertEqual(self.connector._engine, self.mock_engine)
self.assertEqual(self.connector._connection, self.mock_connection)
self.assertEqual(self.connector._cache_interval, 600)

Contributor

@coderabbitai coderabbitai bot Sep 20, 2023

The test test_constructor_and_properties checks the properties of the DatabricksConnector object after its creation. However, it does not verify if the _load_connector_config method was called during the initialization. This is an important part of the initialization process and should be tested.

self.assertTrue(self.connector._load_connector_config.called)

Collaborator

@ArslanSaleem agree, let's add it to also increase the coverage!

Contributor

Looks like humans are talking. I'll be quiet for now. If you need me again, just tag @coderabbitai in a new comment.

tests/connectors/test_databricks.py (two resolved review comments, one outdated)
Comment on lines +76 to +83
def test_columns_count_property(self):
# Test columns_count property
self.connector._columns_count = None
mock_df = Mock()
mock_df.columns = ["Column1", "Column2"]
self.connector.head = Mock(return_value=mock_df)
columns_count = self.connector.columns_count
self.assertEqual(columns_count, 2)
Contributor

@coderabbitai coderabbitai bot Sep 20, 2023

The tests test_rows_count_property and test_columns_count_property do not check if the head method was called when the _rows_count and _columns_count properties are None. This is an important part of the logic and should be tested.

self.assertTrue(self.connector.head.called)

Collaborator

@ArslanSaleem let's add it!

Contributor

Looks like humans are talking. I'll be quiet for now. If you need me again, just tag @coderabbitai in a new comment.

pandasai/connectors/sql.py (resolved review comment)
Comment on lines 15 to 16
SnowFlake connectors are used to connect to SnowFlake Data Cloud.
"""
Contributor

The class docstring seems to be copied from the SnowFlake connector and not updated. It should reflect that this is a Databricks connector.

-    """
-    SnowFlake connectors are used to connect to SnowFlake Data Cloud.
-    """
+    """
+    Databricks connectors are used to connect to Databricks SQL Warehouse on Azure, AWS, and GCP.
+    """

Comment on lines 20 to 24
Initialize the SnowFlake connector with the given configuration.

Args:
config (ConnectorConfig): The configuration for the SnowFlake connector.
"""
Contributor

Again, the docstring for the __init__ method refers to SnowFlake instead of Databricks. This needs to be corrected.

-        Initialize the SnowFlake connector with the given configuration.
+        Initialize the Databricks connector with the given configuration.

-            config (ConnectorConfig): The configuration for the SnowFlake connector.
+            config (DatabricksConnectorConfig): The configuration for the Databricks connector.

Comment on lines 29 to 30
if "database" not in config and os.getenv("SNOWFLAKE_DATABASE"):
config["database"] = os.getenv("SNOWFLAKE_DATABASE")
Contributor

The environment variable being checked here is for Snowflake, not Databricks. This should be corrected to check for a Databricks-specific environment variable.

-        if "database" not in config and os.getenv("SNOWFLAKE_DATABASE"):
-            config["database"] = os.getenv("SNOWFLAKE_DATABASE")
+        if "database" not in config and os.getenv("DATABRICKS_DATABASE"):
+            config["database"] = os.getenv("DATABRICKS_DATABASE")

Comment on lines 59 to 63
Return the string representation of the SnowFlake connector.

Returns:
str: The string representation of the SnowFlake connector.
"""
Contributor

The docstring for the __repr__ method also refers to SnowFlake instead of Databricks. This needs to be corrected.

-        Return the string representation of the SnowFlake connector.
+        Return the string representation of the Databricks connector.

-        str: The string representation of the SnowFlake connector.
+        str: The string representation of the Databricks connector.

@codecov-commenter

codecov-commenter commented Sep 20, 2023

Codecov Report

Merging #580 (637348a) into main (9d55463) will increase coverage by 0.67%.
Report is 4 commits behind head on main.
The diff coverage is 84.44%.


@@            Coverage Diff             @@
##             main     #580      +/-   ##
==========================================
+ Coverage   83.22%   83.89%   +0.67%     
==========================================
  Files          54       55       +1     
  Lines        2682     2689       +7     
==========================================
+ Hits         2232     2256      +24     
+ Misses        450      433      -17     
Files Changed Coverage Δ
pandasai/connectors/sql.py 64.81% <45.45%> (+4.92%) ⬆️
pandasai/connectors/base.py 87.05% <90.90%> (+0.57%) ⬆️
pandasai/connectors/__init__.py 100.00% <100.00%> (ø)
pandasai/connectors/databricks.py 100.00% <100.00%> (ø)
pandasai/connectors/snowflake.py 96.15% <100.00%> (+15.59%) ⬆️

... and 3 files with indirect coverage changes


Contributor

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 3

Files that changed from the base of the PR and between 70c71ed and 358c612.
Files selected for processing (1)
  • pandasai/connectors/databricks.py (1 hunks)

pandasai/connectors/databricks.py (outdated review comment, resolved)

"""
self._engine = create_engine(
f"{config.dialect}://token:{config.token}@{config.host}:{config.port}?http_path={config.httpPath}"
Contributor

@coderabbitai coderabbitai bot Sep 20, 2023

The connection string is being created without any sanitization or encoding of the input parameters. This could potentially lead to security issues such as SQL injection attacks if the config parameters are user-supplied. Consider using a safer method to construct the connection string, such as URL encoding the parameters or using a library function designed for this purpose.

Collaborator

@coderabbitai, could you make an example?

Contributor

@gventuri, sure! Here's an example of how you could safely construct the connection string using URL encoding:

from urllib.parse import quote_plus

class DatabricksConnector(SQLConnector):
    # ...

    def _init_connection(self, config: DatabricksConnectorConfig):
        """
        Initialize Database Connection

        Args:
            config (DatabricksConnectorConfig): Configurations to load database

        """
        safe_token = quote_plus(config.token)
        safe_host = quote_plus(config.host)
        safe_port = quote_plus(str(config.port))
        safe_httpPath = quote_plus(config.httpPath)

        self._engine = create_engine(
            f"{config.dialect}://token:{safe_token}@{safe_host}:{safe_port}?http_path={safe_httpPath}"
        )

        self._connection = self._engine.connect()

In this example, I've used Python's built-in urllib.parse.quote_plus function to URL encode the input parameters. This will ensure that any special characters in the parameters are properly escaped, preventing potential SQL injection attacks.
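A quick standalone check of quote_plus shows why the encoding matters for tokens containing URL metacharacters (the token value here is made up):

```python
from urllib.parse import quote_plus

# '+', '/', and '@' would otherwise be parsed as URL structure inside
# the connection string; quote_plus percent-encodes them.
encoded = quote_plus("dapi+abc/def@ghi")
print(encoded)  # dapi%2Babc%2Fdef%40ghi
```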

Comment on lines 64 to 68
return (
f"<{self.__class__.__name__} dialect={self._config.dialect} "
f"token={self._config.token} "
f"host={self._config.host} port={self._config.port} "
f"database={self._config.database} httpPath={str(self._config.httpPath)}"
Contributor

@coderabbitai coderabbitai bot Sep 20, 2023

Exposing sensitive information like tokens in the string representation of the object can be a security risk. If this object is ever logged or printed in a debugging statement, it could expose the token to unauthorized users. Consider removing the token from the string representation.

Collaborator

@ArslanSaleem let's not log the sensitive info!

Contributor

Looks like humans are talking. I'll be quiet for now. If you need me again, just tag @coderabbitai in a new comment.

Collaborator

@gventuri gventuri left a comment

Great job, only a couple of minor comments!

Comment on lines 282 to 285
query = sql.text(f"SELECT COUNT(*) FROM {self._config.table}")

# Return the number of rows
self._rows_count = self._connection.execute(query).fetchone()[0]
Collaborator

Unfortunately this could create a potential attack. We are using bindparams to validate the table name. Also the suggestion from @coderabbitai might work!
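Since bind parameters generally cannot stand in for identifiers such as table names, one complementary guard (a sketch under that assumption, not the PR's implementation) is to validate the identifier against a strict allow-list pattern before interpolating it:

```python
import re

# Strict allow-list pattern for a bare SQL identifier.
IDENTIFIER_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def safe_count_query(table: str) -> str:
    """Build a COUNT(*) query only when the table name matches a strict
    identifier pattern; anything else is rejected outright."""
    if not IDENTIFIER_RE.match(table):
        raise ValueError(f"invalid table name: {table!r}")
    return f"SELECT COUNT(*) FROM {table}"

query = safe_count_query("lineitem")

try:
    safe_count_query("lineitem; DROP TABLE users")
    rejected = False
except ValueError:
    rejected = True
```

The pattern is deliberately narrow; schema-qualified or quoted names would need their own handling.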

examples/from_databricks.py (resolved review comment)
Databricks connectors are used to connect to Databricks Data Cloud.
"""

def __init__(self, config: DatabricksConnectorConfig):
Collaborator

This should also accept a dictionary (Union[DatabricksConnectorConfig, dict]); when a dict is passed, we need to convert it to a DatabricksConnectorConfig.
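The dict-or-config normalization requested here might be sketched as follows; a plain dataclass stands in for the real DatabricksConnectorConfig (a pydantic model), so this is only a shape sketch.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class DatabricksConfig:
    """Stand-in for DatabricksConnectorConfig; field set is illustrative."""
    dialect: str
    host: str
    port: int
    token: str

def normalize_config(config: Union[DatabricksConfig, dict]) -> DatabricksConfig:
    # Accept either the typed config or a plain dict, as the review asks.
    if isinstance(config, dict):
        return DatabricksConfig(**config)
    return config

cfg = normalize_config(
    {"dialect": "databricks", "host": "example-host", "port": 443, "token": "t"}
)
```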

pandasai/connectors/databricks.py (outdated review comment, resolved)

"""
self._engine = create_engine(
f"{config.dialect}://token:{config.token}@{config.host}:{config.port}?http_path={config.httpPath}"
Collaborator

@coderabbitai, could you make an example?

host="ehxzojy-ue47135",
port=443,
token="token",
database="SNOWFLAKE_SAMPLE_DATA",
Collaborator

Let's replace with DATABRICKS

Comment on lines 33 to 39
def test_constructor_and_properties(self):
# Test constructor and properties
self.assertEqual(self.connector._config, self.config)
self.assertEqual(self.connector._engine, self.mock_engine)
self.assertEqual(self.connector._connection, self.mock_connection)
self.assertEqual(self.connector._cache_interval, 600)

Collaborator

@ArslanSaleem agree, let's add it to also increase the coverage!

expected_repr = (
"<DatabricksConnector dialect=databricks token=token "
"host=ehxzojy-ue47135 port=443 "
"database=SNOWFLAKE_SAMPLE_DATA httpPath=/sql/1.0/warehouses/1241rsa32"
Collaborator

Leftover, let's use DATABRICKS

Comment on lines +76 to +83
def test_columns_count_property(self):
# Test columns_count property
self.connector._columns_count = None
mock_df = Mock()
mock_df.columns = ["Column1", "Column2"]
self.connector.head = Mock(return_value=mock_df)
columns_count = self.connector.columns_count
self.assertEqual(columns_count, 2)
Collaborator

@ArslanSaleem let's add it!

Comment on lines +85 to +91
def test_column_hash_property(self):
# Test column_hash property
mock_df = Mock()
mock_df.columns = ["Column1", "Column2"]
self.connector.head = Mock(return_value=mock_df)
column_hash = self.connector.column_hash
self.assertIsNotNone(column_hash)
Collaborator

@ArslanSaleem let's also test its correctness (maybe against a hardcoded value)!

@gventuri
Collaborator

@ArslanSaleem also, let's add it to the documentation (connectors.md)!

Contributor

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 10

Files that changed from the base of the PR and between 358c612 and 6ef5416.
Files ignored due to filter (2)
  • poetry.lock
  • pyproject.toml
Files selected for processing (8)
  • examples/from_databricks.py (1 hunks)
  • pandasai/connectors/base.py (3 hunks)
  • pandasai/connectors/databricks.py (1 hunks)
  • pandasai/connectors/snowflake.py (4 hunks)
  • pandasai/connectors/sql.py (8 hunks)
  • tests/connectors/test_databricks.py (1 hunks)
  • tests/connectors/test_snowflake.py (4 hunks)
  • tests/connectors/test_sql.py (4 hunks)
Files skipped from review due to trivial changes (1)
  • tests/connectors/test_databricks.py
Additional comments (Suppressed): 17
pandasai/connectors/base.py (3)
  • 67-76: The new DatabricksConnectorConfig class extends SQLBaseConnectorConfig and adds specific properties for Databricks connection such as host, port, token, and httpPath. Ensure that these properties are correctly used in the DatabricksConnector class and that they match the expected configuration parameters for a Databricks connection.

  • 110-126: The _populate_config_from_env method populates the configuration dictionary with values from environment variables if they do not exist in the config. This is a good practice as it allows for flexibility in configuration management and adheres to the twelve-factor app methodology. However, ensure that sensitive data like tokens or passwords are securely managed when using environment variables.

  • 128-130: The _init_connection method signature has been changed to include a config parameter of type BaseConnectorConfig. Make sure all calls to this method throughout the codebase have been updated to match the new signature.

pandasai/connectors/snowflake.py (4)
  • 2-9: The import statement for os has been removed. If the environment variables are still being used elsewhere in the code, this could cause a NameError.

  • 18-18: The function signature of __init__ has changed from accepting a SnowFlakeConnectorConfig to accepting a Union[SQLConnectorConfig, dict]. Ensure that all calls to this function throughout the codebase have been updated to match the new signature.

  • 24-38: The method of populating the configuration from environment variables has been refactored. Instead of checking and assigning each variable individually, a dictionary of environment variable names is created and passed to a _populate_config_from_env method. This change improves the maintainability and readability of the code by reducing redundancy and complexity.

  • 85-91: The __repr__ method no longer includes the username and password in its output. This is a good security practice as it prevents sensitive information from being accidentally logged or displayed.
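The bullet above notes that __repr__ now omits credentials; a toy class (not the PR's code) illustrating that pattern:

```python
class Connector:
    """Toy class showing the pattern: the secret is stored for use but
    deliberately left out of the string representation."""

    def __init__(self, host: str, database: str, password: str):
        self.host = host
        self.database = database
        self._password = password  # needed to connect, never printed

    def __repr__(self) -> str:
        return (
            f"<{self.__class__.__name__} host={self.host} "
            f"database={self.database}>"
        )

conn = Connector("example-host", "sample_db", "s3cret")
rep = repr(conn)
```

Logging `conn` anywhere now leaks the host and database names at most, never the password.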

tests/connectors/test_snowflake.py (1)
  • 9-13: The patch for pandasai.connectors.sql.sql has been removed in the new hunk. If this is intentional, ensure that all dependencies on this mock within the test cases have been properly updated or removed.
pandasai/connectors/databricks.py (1)
  • 49-50: As previously discussed, the connection string is being created without any sanitization or encoding of the input parameters. This could potentially lead to security issues such as SQL injection attacks if the config parameters are user-supplied. Consider using a safer method to construct the connection string, such as URL encoding the parameters or using a library function designed for this purpose.
from urllib.parse import quote_plus

safe_token = quote_plus(config.token)
safe_host = quote_plus(config.host)
safe_port = quote_plus(str(config.port))
safe_httpPath = quote_plus(config.httpPath)

self._engine = create_engine(
    f"{config.dialect}://token:{safe_token}@{safe_host}:{safe_port}?http_path={safe_httpPath}"
)
pandasai/connectors/sql.py (4)
  • 12-14: The import of sql from sqlalchemy has been removed. Ensure that this does not affect any other parts of the codebase where sql might be used.

  • 95-101: The __repr__ method no longer includes the username and password in its output. This is a good practice for security reasons as it prevents sensitive information from being logged or displayed inadvertently.

  • 389-390: The __init__ methods of the MySQLConnector and PostgreSQLConnector classes now accept a dictionary in addition to an instance of SQLConnectorConfig. This change provides more flexibility in how the configuration can be provided. However, ensure that all calls to these constructors throughout the codebase have been updated to match the new signatures.

  • 399-407: The environment variable loading logic has been refactored into a separate _populate_config_from_env method. This change improves modularity and reduces code duplication. However, ensure that the _populate_config_from_env method correctly handles cases where the environment variables are not set or are set to invalid values.

tests/connectors/test_sql.py (4)
  • 9-13: The patch for "pandasai.connectors.sql.sql" has been removed in the new hunk. If this is intentional and the SQL module is no longer used in the setup, then this change is fine. However, if the SQL module is still being used in the setup or elsewhere in the tests, this could cause issues. Please verify that the removal of this patch does not affect the functionality of the tests.

  • 33-46: The constructor test now includes checks for _load_connector_config and _init_connection methods to be called during the object creation. This is a good practice as it ensures that these important methods are indeed invoked when an instance of SQLConnector is created. No changes needed here.

  • 48-54: The expected representation string in the __repr__ method test has changed. The username and password fields have been removed from the expected string. If this is due to a change in the __repr__ method of the SQLConnector class where it no longer includes the username and password in its representation, then this change is fine. However, please verify that this change aligns with the current implementation of the __repr__ method.

  • 82-89: The test for the columns_count property has been modified. Instead of mocking a return value for self.mock_connection.execute.return_value.fetchone, it now mocks a return value for self.connector.head. This change seems to reflect a modification in the columns_count property implementation, where it now relies on the head method instead of directly executing a SQL query. Please ensure that this change aligns with the current implementation of the columns_count property.

examples/from_databricks.py (two resolved review comments)
@@ -3,6 +3,7 @@
"""

from abc import ABC, abstractmethod
import os
Contributor

The os module is imported but not used in the provided hunk. If it's not used elsewhere in the file, consider removing this import to keep the code clean and efficient.

- import os

Comment on lines +33 to +45
@patch("pandasai.connectors.SnowFlakeConnector._load_connector_config")
@patch("pandasai.connectors.SnowFlakeConnector._init_connection")
def test_constructor_and_properties(
self, mock_load_connector_config, mock_init_connection
):
# Test constructor and properties
self.assertEqual(self.connector._config, self.config)
self.assertEqual(self.connector._engine, self.mock_engine)
self.assertEqual(self.connector._connection, self.mock_connection)
self.assertEqual(self.connector._cache_interval, 600)
SnowFlakeConnector(self.config)
mock_load_connector_config.assert_called()
mock_init_connection.assert_called()
Contributor

The constructor of SnowFlakeConnector is now being tested with two additional mocked methods: _load_connector_config and _init_connection. Ensure these methods are not required to be tested separately as their functionality is now covered under the constructor test.

Comment on lines 82 to +89
     def test_columns_count_property(self):
         # Test columns_count property
         self.connector._columns_count = None
-        self.mock_connection.execute.return_value.fetchone.return_value = (
-            8,
-        )  # Sample columns count
+        mock_df = Mock()
+        mock_df.columns = ["Column1", "Column2"]
+        self.connector.head = Mock(return_value=mock_df)
         columns_count = self.connector.columns_count
-        self.assertEqual(columns_count, 8)
+        self.assertEqual(columns_count, 2)
Contributor

The method to test columns_count property has been changed. Previously, it was directly fetching the count from the database, but now it's using the head method to get a dataframe and then counting the columns. This change might affect the performance if the table has a large number of rows. Consider reverting back to the old method if performance is a concern.

Comment on lines 95 to +101
         self.connector.head = Mock(return_value=mock_df)
         column_hash = self.connector.column_hash
-        self.assertIsNotNone(column_hash)
+        self.assertEqual(
+            column_hash,
+            "ea6a80582b83e1511f8be83412b13e7b86d20c45b96fcf9731f3b99dc3b568aa",
+        )
Contributor

The expected value of column_hash is hardcoded in the test. This could lead to test failures if the implementation of generating the hash changes. Consider mocking the method used to generate the hash or dynamically generating the expected hash based on the mock data.
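One way to avoid the hardcoded digest, assuming hypothetically that column_hash is a SHA-256 hex digest over the joined column names (the real implementation may join or hash differently, so a test would mirror the connector's actual logic), is to compute the expectation from the same mock data:

```python
import hashlib

columns = ["Column1", "Column2"]

# Hypothetical reconstruction of the hashing step: join the column
# names and take a SHA-256 hex digest. In a real test, replicate the
# connector's exact join and hash so the expectation tracks the code.
expected_hash = hashlib.sha256(",".join(columns).encode()).hexdigest()
```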

pandasai/connectors/databricks.py (resolved review comment)
Comment on lines +61 to +65
return (
f"<{self.__class__.__name__} dialect={self._config.dialect} "
f"host={self._config.host} port={self._config.port} "
f"database={self._config.database} httpPath={str(self._config.httpPath)}"
)

The __repr__ method is revealing potentially sensitive information such as the host and port. It's generally not a good practice to expose such details. Consider removing these from the representation.

-        return (
-            f"<{self.__class__.__name__} dialect={self._config.dialect} "
-            f"host={self._config.host} port={self._config.port} "
-            f"database={self._config.database} httpPath={str(self._config.httpPath)}"
-        )
+        return f"<{self.__class__.__name__} database={self._config.database}>"

Comment on lines 279 to 285
)

# Run a SQL query to get the number of rows
query = sql.text(
"SELECT COUNT(*) FROM information_schema.columns "
"WHERE table_name = :table_name"
).bindparams(table_name=self._config.table)
query = select(text("COUNT(*)")).select_from(text(self._config.table))

# Return the number of rows
self._rows_count = self._connection.execute(query).fetchone()[0]

The SQL query to get the number of rows has been simplified. The new query counts rows directly on the target table; the previous query counted entries in information_schema.columns, which returns the number of columns rather than rows, so this is a correctness fix as well as a simplification. However, ensure that the user executing this query has sufficient permissions on the target table.
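The simplified pattern — COUNT(*) straight on the target table instead of consulting information_schema — can be illustrated with a plain DB-API sketch. sqlite3 stands in for the Databricks connection here, and the table name is made up for the example.

```python
import sqlite3

# In-memory database with a small illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0)],
)

# Direct COUNT(*) on the target table — one query, no catalog lookup.
rows_count = conn.execute("SELECT COUNT(*) FROM loans").fetchone()[0]
print(rows_count)  # 3
```

In the connector the same shape is expressed through SQLAlchemy's select(text("COUNT(*)")), but the executed SQL is equivalent.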

Comment on lines 304 to 308
f"{self._config.dialect}"
)

# Run a SQL query to get the number of columns
query = sql.text(
"SELECT COUNT(*) FROM information_schema.columns "
f"WHERE table_name = '{self._config.table}'"
)

# Return the number of columns
self._columns_count = self._connection.execute(query).fetchone()[0]
self._columns_count = len(self.head().columns)
return self._columns_count

The columns_count method now uses the len function on the columns returned by the head method instead of running a SQL query. This change could improve performance by reducing the number of database queries. However, make sure that the head method always returns the correct and complete set of columns.
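The idea behind len(self.head().columns) — fetch a tiny sample and count columns from its metadata instead of querying information_schema — can be sketched with stdlib sqlite3 standing in for the real connection; the table is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (id INTEGER, amount REAL, status TEXT)")

# head-like sample: LIMIT keeps the query cheap even on large tables.
cur = conn.execute("SELECT * FROM loans LIMIT 5")

# cursor.description carries one entry per column, rows or no rows.
columns_count = len(cur.description)
print(columns_count)  # 3
```

Note the column metadata is available even when the table is empty, so the count does not depend on the sampled rows.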

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 0

Files that changed from the base of the PR and between commits 6ef5416 and 637348a.
Files selected for processing (1)
  • pandasai/connectors/snowflake.py (4 hunks)
Additional comments (Suppressed): 4
pandasai/connectors/snowflake.py (4)
  • 2-7: The import statement for os module has been removed. If the environment variables are being used elsewhere in the code, this could cause a NameError.

  • 18-18: The function signature of __init__ method has been changed to accept either a SnowFlakeConnectorConfig instance or a dictionary as the configuration parameter. Ensure that all calls to this function throughout the codebase have been updated to match the new signature.

  • 24-38: The way environment variables are loaded into the config has been refactored. This is a good change as it reduces redundancy and makes the code more maintainable. However, ensure that the _populate_config_from_env method correctly populates the config from environment variables.

  • 85-91: The __repr__ method no longer includes the username and password in its output. This is a good security practice as it prevents sensitive information from being logged or displayed inadvertently.

@gventuri gventuri changed the title Feat[DatabricksConnector]: Connector to connect to Databricks on Cloud feat[DatabricksConnector]: connector to connect to Databricks on Cloud Sep 21, 2023
@gventuri gventuri merged commit 325fed6 into main Sep 21, 2023
9 checks passed
3 participants