
Feat: Validate dataframe with Pydantic schema #522

Merged 4 commits into sinaptik-ai:main on Sep 4, 2023

Conversation

ArslanSaleem
Collaborator

@ArslanSaleem ArslanSaleem commented Sep 2, 2023

Summary by CodeRabbit

  • New Feature: Introduced a DataFrame validation module DFValidator that uses Pydantic schemas to validate data in pandas DataFrames. This feature enhances data integrity and consistency.
  • New Feature: Added a validate method to the SmartDataframe class, allowing users to validate their data directly from the DataFrame instance.
  • Test: Expanded test coverage for the SmartDataframe class with new test cases focusing on DataFrame validation using Pydantic schemas. This ensures robustness and reliability of the new features.
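The feature is easiest to see end to end. Below is a minimal, self-contained sketch of the validation pattern this PR introduces, using pandas and pydantic directly; the new `SmartDataframe.validate` method wraps the same logic. The schema, column names, and constraints are illustrative, not taken from the PR's tests.

```python
from pydantic import BaseModel, Field, ValidationError
import pandas as pd


class RowSchema(BaseModel):
    # Illustrative row schema: an integer column with range constraints
    # and a string column.
    A: int = Field(gt=0, lt=100)
    B: str


df = pd.DataFrame({"A": [1, 50, 99], "B": ["x", "y", "z"]})

# Convert rows to dicts and validate each one against the schema,
# collecting errors instead of raising on the first failure.
errors = []
for i, row in enumerate(df.to_dict(orient="records")):
    try:
        RowSchema(**row)
    except ValidationError as e:
        errors.append({"row": i, "errors": e.errors()})

passed = len(errors) == 0
print(passed)  # True: every row satisfies the schema
```

A row with `A` outside the (0, 100) range, or a non-coercible value, would land in `errors` with pydantic's per-field error details.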

@coderabbitai
Contributor

coderabbitai bot commented Sep 2, 2023

Walkthrough

The changes introduce a new module, DFValidator, for validating pandas DataFrames using Pydantic schemas. The SmartDataframe class now has a validate method that leverages this functionality. New test cases have been added to ensure the validation works as expected.

Changes

File Path — Summary
  • .../helpers/df_validator.py — Introduced DFValidator and DfValidationResult classes for DataFrame validation using Pydantic schemas.
  • .../smart_dataframe/__init__.py — Added validate method to SmartDataframe class, utilizing DFValidator.
  • tests/test_smartdataframe.py — Added test cases for DataFrame validation with Pydantic schemas in SmartDataframe.

🐇💻

In the land of code where the shadows lie,
A rabbit hopped forth, his eyes on the sky.
With each line he wrote, each test he did pass,
He built a validator, as clear as glass.
Now data frames dance, their structure in check,
Thanks to our hero, CodeRabbit, tech deck! 🎉


Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • Mention @coderabbitai in any review comment for bot assistance. The bot stays silent unless called upon.
  • Note: Review comments are made on code diffs or files, not on the PR overview.

Pause Incremental Reviews

  • Insert @coderabbitai: ignore in the PR description to halt the bot's ongoing reviews. Remove the line to resume.

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 6

Commits: files that changed from the base of the PR and between commits caf2fca and e4cf54d.
Files selected for processing (3)
  • pandasai/helpers/df_validator.py (1 hunks)
  • pandasai/smart_dataframe/__init__.py (2 hunks)
  • tests/test_smartdataframe.py (2 hunks)
Additional comments (Suppressed): 8
pandasai/smart_dataframe/__init__.py (1)
  • 24-30: The import of the pydantic module and DFValidator class is correctly done. The new imports are well organized and do not conflict with existing ones.
tests/test_smartdataframe.py (6)
  • 8-14: The import of BaseModel and Field from the pydantic module is new in this hunk. Ensure that the pydantic module is included in the project dependencies.

  • 663-679: This test case validates a DataFrame with integer values using a Pydantic schema. The validation should pass as the data types match the schema.

  • 681-697: This test case validates a DataFrame with string values in column "A" using a Pydantic schema that expects integers. The validation should fail, which is the expected behavior.

  • 699-714: This test case validates a Polars DataFrame with integer values using a Pydantic schema. The validation should pass as the data types match the schema.

  • 716-733: This test case validates a DataFrame with one string value in column "A" using a Pydantic schema that expects integers. The validation should fail for one record, which is the expected behavior.

  • 735-759: These test cases validate a DataFrame against a Pydantic schema with complex constraints (greater-than and less-than conditions). The first validation (a greater-than-5 constraint) should fail because all values in column "A" are less than 5, while the second (a less-than-5 constraint) should pass for the same values.

pandasai/helpers/df_validator.py (1)
  • 46-49: Creating a new class within a method can lead to issues with object identity and garbage collection. Consider defining this class at module level and parameterizing it as needed.

Comment on lines 238 to 254
self._sample_head = df_head.to_csv(index=False)
return self._sample_head

def validate(self, schema: pydantic.BaseModel, verbose: bool = False):
"""
Validates Dataframe rows on the basis Pydantic schema input
(Args):
schema: Pydantic schema class
n_jobs: Parallelism for larger dataframe
verbose: Print Errors
"""
df_validator = DFValidator(self.original_import, verbose)
return df_validator.validate(schema)

@property
def datalake(self):
return self._dl

The validate method has been added to the SmartDataframe class. This method uses the DFValidator class to validate the DataFrame rows against a given Pydantic schema. The method signature looks correct, and the use of the DFValidator class seems appropriate. However, the docstring mentions an argument n_jobs which is not present in the function definition. Please update the docstring to match the function signature.

-            n_jobs: Parallelism for larger dataframe

Comment on lines 8 to 9
def __init__(self, passed: bool = True, errors: List[Dict] = []):
self._passed = passed

Default mutable arguments are a common source of unexpected behaviors and bugs in Python. Here, errors: List[Dict] = [] is a mutable default argument. If the errors list is modified, it will persist across all future function calls that do not provide an errors argument. This can lead to unexpected behavior. Consider using a sentinel value to denote "no argument provided" and create a new list inside the function.

-    def __init__(self, passed: bool = True, errors: List[Dict] = []):
+    def __init__(self, passed: bool = True, errors: List[Dict] = None):
+        if errors is None:
+            errors = []
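The pitfall is easy to reproduce in isolation. A self-contained demonstration (class names hypothetical, not from the PR):

```python
class BadResult:
    """Buggy: the default list is created once, at function definition time,
    and shared by every call that omits the argument."""
    def __init__(self, errors=[]):
        self.errors = errors


class GoodResult:
    """Fixed: use None as a sentinel and build a fresh list per call."""
    def __init__(self, errors=None):
        self.errors = [] if errors is None else errors


a = BadResult()
a.errors.append("oops")
b = BadResult()
print(b.errors)  # ['oops'] — the shared default already contains the entry

c = GoodResult()
c.errors.append("oops")
d = GoodResult()
print(d.errors)  # [] — each instance gets its own list
```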

Comment on lines 23 to 24
self.passed = False
self._errors.append(error_message)

The passed attribute is defined as a read-only property on lines 12-14, but you're trying to modify it here. This will raise an AttributeError. You should modify the _passed attribute directly.

-        self.passed = False
+        self._passed = False
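Why the assignment fails: a property defined without a setter rejects writes with AttributeError, so internal methods must mutate the backing field directly. A stripped-down illustration of the pattern (names hypothetical):

```python
class ValidationResult:
    def __init__(self):
        self._passed = True

    @property
    def passed(self):
        """Read-only view of the internal flag; no setter is defined."""
        return self._passed

    def add_error(self):
        # self.passed = False  # would raise AttributeError (no setter)
        self._passed = False   # mutate the backing attribute instead


r = ValidationResult()
r.add_error()
print(r.passed)  # False
```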

elif dataframe_type == "polars":
return df.to_dicts()
else:
[]

This line seems to be a no-op. It creates an empty list and then does nothing with it. If you intended to return an empty list when the dataframe type is neither "pandas" nor "polars", you should use the return statement.

-            []
+            return []

Comment on lines 34 to 37
def __init__(self, df, verbose=False):
self._df = df
self._verbose = verbose


Consider making df and verbose keyword-only arguments for better readability and to prevent positional argument errors.

-    def __init__(self, df, verbose=False):
+    def __init__(self, *, df, verbose=False):

@codecov-commenter

codecov-commenter commented Sep 2, 2023

Codecov Report

Merging #522 (c69547a) into main (caf2fca) will increase coverage by 0.04%.
The diff coverage is 88.46%.


@@            Coverage Diff             @@
##             main     #522      +/-   ##
==========================================
+ Coverage   86.89%   86.93%   +0.04%     
==========================================
  Files          46       47       +1     
  Lines        1968     2020      +52     
==========================================
+ Hits         1710     1756      +46     
- Misses        258      264       +6     
Files Changed | Coverage Δ
pandasai/helpers/df_validator.py | 87.23% <87.23%> (ø)
pandasai/smart_dataframe/__init__.py | 93.08% <100.00%> (+0.14%) ⬆️


@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 3

Commits: files that changed from the base of the PR and between commits e4cf54d and c69547a.
Files selected for processing (1)
  • pandasai/helpers/df_validator.py (1 hunks)
Additional comments (Suppressed): 1
pandasai/helpers/df_validator.py (1)
  • 47-47: The use of List[schema] seems incorrect. The List generic expects a type as its parameter, but here it's being given an instance of BaseModel. This should be replaced with the actual class that represents the schema.



class DFValidationResult:
def __init__(self, passed: bool = True, errors: List[Dict] = []):

Default mutable arguments in Python can lead to unexpected behavior. Here, errors: List[Dict] = [] is a mutable default argument. If this list is modified, the change will persist across function calls. Consider changing the default value to None and assigning an empty list within the function if the argument is None.

-    def __init__(self, passed: bool = True, errors: List[Dict] = []):
+    def __init__(self, passed: bool = True, errors: List[Dict] = None):
+        if errors is None:
+            errors = []

"""
Add an error message to the validation results.
"""
self.passed = False

You are trying to set the attribute passed which is not defined in the class. You probably meant to set _passed.

-        self.passed = False
+        self._passed = False

Comment on lines 69 to 85
def validate(self, schema: BaseModel) -> DFValidationResult:
"""
Args:
schema: Pydantic schema to be validated for the dataframe row
"""
dataframe_type = df_type(self._df)
if dataframe_type is None:
raise ValueError("UnSupported DataFrame")

df_json: List[Dict] = self._df_to_list_of_dict(self._df, dataframe_type)

errors = self._validate_batch(schema, df_json)

if len(errors) > 0:
return DFValidationResult(False, errors)
else:
return DFValidationResult(True)

This method could be simplified by directly returning the DFValidationResult object without checking the length of errors. The DFValidationResult constructor already sets passed to True by default, so you only need to create a new DFValidationResult when there are errors.

-        if len(errors) > 0:
-            return DFValidationResult(False, errors)
-        else:
-            return DFValidationResult(True)
+        return DFValidationResult(len(errors) == 0, errors)
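The suggested one-liner is behavior-preserving because the constructor already maps an empty error list to a passing result. A quick equivalence check with a hypothetical stand-in for the class under review (using the sentinel-default fix for `errors`):

```python
class DFValidationResult:
    # Stand-in mirroring the reviewed class, with a sentinel default.
    def __init__(self, passed=True, errors=None):
        self._passed = passed
        self._errors = errors if errors is not None else []

    @property
    def passed(self):
        return self._passed


def verbose(errors):
    # The original branching form.
    if len(errors) > 0:
        return DFValidationResult(False, errors)
    else:
        return DFValidationResult(True)


def concise(errors):
    # The suggested one-liner.
    return DFValidationResult(len(errors) == 0, errors)


for errs in ([], [{"loc": ("A",)}]):
    assert verbose(errs).passed == concise(errs).passed
print("equivalent")
```

One behavioral nuance of the one-liner: a passing result now carries the (empty) `errors` list it was given rather than always a fresh one, which is harmless with the sentinel-default constructor.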

@ArslanSaleem changed the title from "Feat df schema" to "Feat: Validate dataframe with Pydantic schema" on Sep 4, 2023
@gventuri
Collaborator

gventuri commented Sep 4, 2023

@ArslanSaleem thanks a lot for the great feature, this is super cool!

@gventuri gventuri merged commit c986050 into sinaptik-ai:main Sep 4, 2023
@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 2

Commits: files that changed from the base of the PR and between commits c69547a and c6463fe.
Files selected for processing (3)
  • pandasai/helpers/df_validator.py (1 hunks)
  • pandasai/smart_dataframe/__init__.py (2 hunks)
  • tests/test_smartdataframe.py (4 hunks)
Additional comments (Suppressed): 9
pandasai/smart_dataframe/__init__.py (2)
  • 24-29: The import of pydantic and DfValidator from pandasai.helpers.df_validator is correctly done. The new imports are placed in the correct section of the code, following Python's best practices for organizing imports.

  • 241-249: The validate method has been added to the SmartDataframe class. This method uses the DfValidator class to validate the rows of the dataframe against a given Pydantic schema. The implementation looks correct based on the provided context. However, it would be good to verify that the DfValidator class and its validate method have been implemented correctly and handle all possible edge cases.

tests/test_smartdataframe.py (4)
  • 14-14: The BaseModel from Pydantic is imported but not used in this hunk. Ensure that it's used in the subsequent parts of the code.

  • 603-609: The changes made here are just renaming and do not affect the functionality of the code. The new name aligns with the class being tested.

  • 635-641: Similar to the previous comment, the changes made here are just renaming and do not affect the functionality of the code. The new name aligns with the class being tested.

  • 663-752: This new test case is added to validate the functionality of dataframe validation using Pydantic schemas. It covers various scenarios including successful validation, validation failures, and complex schema constraints. This is a good addition as it ensures the newly added feature works as expected.

pandasai/helpers/df_validator.py (3)
  • 1-52: The DfValidationResult class is well implemented. It encapsulates the result of a validation operation, including whether it passed and any errors that occurred. The use of properties for passed and errors ensures that these attributes are read-only, which is good for immutability. The add_error method allows adding an error message to the results, and the __bool__ method defines the truthiness of the object based on whether the validation passed or not.

  • 54-126: The DfValidator class is responsible for validating a DataFrame using a Pydantic schema. It uses private methods _validate_batch and _df_to_list_of_dict to perform the actual validation. However, there's a potential issue in the _validate_batch method.

  • 110-126: The validate method validates the DataFrame against the provided schema and returns a DfValidationResult. It first determines the type of the DataFrame, then converts it to a list of dictionaries, and finally validates each batch of rows against the schema. The implementation seems correct and efficient.

Comment on lines +71 to +90
def _validate_batch(self, schema, df_json: List[Dict]):
"""
Args:
schema: Pydantic schema
batch_df: dataframe batch

Returns:
list of errors
"""
try:
# Create a Pydantic Validator to validate rows of dataframe
class PdVal(BaseModel):
df: List[schema]

PdVal(df=df_json)
return []

except ValidationError as e:
return e.errors()


In the _validate_batch method, you're creating a new Pydantic model PdVal inside the try block every time this method is called. This could lead to performance issues if this method is called frequently, as creating a new class is an expensive operation. Consider defining this class outside of the method and reusing it.

-            class PdVal(BaseModel):
-                df: List[schema]
+            PdVal.df = List[schema]
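For reference, the `List[schema]` container trick itself is valid when `schema` is a model class: wrapping the rows in a batch model validates them all in one construction call. A self-contained sketch of that pattern (names hypothetical; note the bot's suggested `PdVal.df = List[schema]` mutation would not re-run model building, so a per-schema model is still needed — `create_model` is one way to build it without hand-writing a class each time):

```python
from typing import Dict, List, Type

from pydantic import BaseModel, ValidationError, create_model


class Row(BaseModel):
    A: int
    B: str


def validate_batch(schema: Type[BaseModel], rows: List[Dict]) -> List[Dict]:
    """Validate all rows at once by wrapping them in a container model.

    List[schema] parameterizes the container with the row model, so one
    construction call checks every row and aggregates all field errors.
    """
    Batch = create_model("Batch", df=(List[schema], ...))
    try:
        Batch(df=rows)
        return []
    except ValidationError as e:
        return e.errors()


print(validate_batch(Row, [{"A": 1, "B": "x"}]))       # []
print(validate_batch(Row, [{"A": "bad", "B": "x"}]))   # non-empty: the bad value is reported
```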

Comment on lines +91 to +109
def _df_to_list_of_dict(self, df: DataFrameType, dataframe_type: str) -> List[Dict]:
"""
Create list of dict of dataframe rows on basis of dataframe type
Supports only polars and pandas dataframe

Args:
df: dataframe to be converted
dataframe_type: type of dataframe

Returns:
list of dict of dataframe rows
"""
if dataframe_type == "pandas":
return df.to_dict(orient="records")
elif dataframe_type == "polars":
return df.to_dicts()
else:
return []


The _df_to_list_of_dict method converts a DataFrame into a list of dictionaries based on the type of DataFrame (either pandas or polars). If the DataFrame type is neither of these, it returns an empty list. This could potentially hide errors if an unsupported DataFrame type is used. Instead, consider raising an exception when an unsupported DataFrame type is encountered.

-            else:
-                return []
+            raise ValueError(f"Unsupported DataFrame type: {dataframe_type}")

Development

Successfully merging this pull request may close these issues:
  • DataFrame Validation with Pydantic

4 participants