
Feat: Validate dataframe with Pydantic schema #522

Merged 4 commits into sinaptik-ai:main on Sep 4, 2023

Conversation

ArslanSaleem
Collaborator

@ArslanSaleem ArslanSaleem commented Sep 2, 2023

Summary by CodeRabbit

  • New Feature: Introduced a DataFrame validation module DFValidator that uses Pydantic schemas to validate data in pandas DataFrames. This feature enhances data integrity and consistency.
  • New Feature: Added a validate method to the SmartDataframe class, allowing users to validate their data directly from the DataFrame instance.
  • Test: Expanded test coverage for the SmartDataframe class with new test cases focusing on DataFrame validation using Pydantic schemas. This ensures robustness and reliability of the new features.
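The feature is easiest to see end to end. Below is a minimal, self-contained sketch of the validation pattern this PR introduces, using pandas and pydantic directly; the new `SmartDataframe.validate` method wraps the same logic. The schema, column names, and constraints are illustrative, not taken from the PR's tests.

```python
from pydantic import BaseModel, Field, ValidationError
import pandas as pd


class RowSchema(BaseModel):
    # Illustrative row schema: an integer column with range constraints
    # and a string column.
    A: int = Field(gt=0, lt=100)
    B: str


df = pd.DataFrame({"A": [1, 50, 99], "B": ["x", "y", "z"]})

# Convert rows to dicts and validate each one against the schema,
# collecting errors instead of raising on the first failure.
errors = []
for i, row in enumerate(df.to_dict(orient="records")):
    try:
        RowSchema(**row)
    except ValidationError as e:
        errors.append({"row": i, "errors": e.errors()})

passed = len(errors) == 0
print(passed)  # True: every row satisfies the schema
```

A row with `A` outside the (0, 100) range, or a non-coercible value, would land in `errors` with pydantic's per-field error details.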

@coderabbitai
Contributor

coderabbitai bot commented Sep 2, 2023

Walkthrough

The changes introduce a new module, DFValidator, for validating pandas DataFrames using Pydantic schemas. The SmartDataframe class now has a validate method that leverages this functionality. New test cases have been added to ensure the validation works as expected.

Changes

File Path — Summary
  • .../helpers/df_validator.py — Introduced DFValidator and DfValidationResult classes for DataFrame validation using Pydantic schemas.
  • .../smart_dataframe/__init__.py — Added validate method to SmartDataframe class, utilizing DFValidator.
  • tests/test_smartdataframe.py — Added test cases for DataFrame validation with Pydantic schemas in SmartDataframe.

🐇💻

In the land of code where the shadows lie,
A rabbit hopped forth, his eyes on the sky.
With each line he wrote, each test he did pass,
He built a validator, as clear as glass.
Now data frames dance, their structure in check,
Thanks to our hero, CodeRabbit, tech deck! 🎉


Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • Mention @coderabbitai in any review comment for bot assistance. The bot stays silent unless called upon.
  • Note: Review comments are made on code diffs or files, not on the PR overview.

Pause Incremental Reviews

  • Insert @coderabbitai: ignore in the PR description to halt the bot's ongoing reviews. Remove the line to resume.

@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 6

Commits: files that changed from the base of the PR and between commits caf2fca and e4cf54d.
Files selected for processing (3)
  • pandasai/helpers/df_validator.py (1 hunks)
  • pandasai/smart_dataframe/__init__.py (2 hunks)
  • tests/test_smartdataframe.py (2 hunks)
Additional comments (Suppressed): 8
pandasai/smart_dataframe/__init__.py (1)
  • 24-30: The import of the pydantic module and DFValidator class is correctly done. The new imports are well organized and do not conflict with existing ones.
tests/test_smartdataframe.py (6)
  • 8-14: The import of BaseModel and Field from the pydantic module is new in this hunk. Ensure that the pydantic module is included in the project dependencies.

  • 663-679: This test case validates a DataFrame with integer values using a Pydantic schema. The validation should pass as the data types match the schema.

  • 681-697: This test case validates a DataFrame with string values in column "A" using a Pydantic schema that expects integers. The validation should fail, which is the expected behavior.

  • 699-714: This test case validates a Polars DataFrame with integer values using a Pydantic schema. The validation should pass as the data types match the schema.

  • 716-733: This test case validates a DataFrame with one string value in column "A" using a Pydantic schema that expects integers. The validation should fail for one record, which is the expected behavior.

  • 735-759: These test cases validate a DataFrame against a Pydantic schema with complex constraints (greater-than and less-than conditions). The first validation (a greater-than-5 constraint) should fail because all values in column "A" are less than 5, while the second (a less-than-5 constraint) should pass for the same values.

pandasai/helpers/df_validator.py (1)
  • 46-49: Creating a new class within a method can lead to issues with object identity and garbage collection. Consider defining this class at module level and parameterizing it as needed.

Comment on lines 238 to 254
self._sample_head = df_head.to_csv(index=False)
return self._sample_head

def validate(self, schema: pydantic.BaseModel, verbose: bool = False):
"""
Validates Dataframe rows on the basis Pydantic schema input
(Args):
schema: Pydantic schema class
n_jobs: Parallelism for larger dataframe
verbose: Print Errors
"""
df_validator = DFValidator(self.original_import, verbose)
return df_validator.validate(schema)

@property
def datalake(self):
return self._dl

The validate method has been added to the SmartDataframe class. This method uses the DFValidator class to validate the DataFrame rows against a given Pydantic schema. The method signature looks correct, and the use of the DFValidator class seems appropriate. However, the docstring mentions an argument n_jobs which is not present in the function definition. Please update the docstring to match the function signature.

-            n_jobs: Parallelism for larger dataframe

Comment on lines 8 to 9
def __init__(self, passed: bool = True, errors: List[Dict] = []):
self._passed = passed

Default mutable arguments are a common source of unexpected behaviors and bugs in Python. Here, errors: List[Dict] = [] is a mutable default argument. If the errors list is modified, it will persist across all future function calls that do not provide an errors argument. This can lead to unexpected behavior. Consider using a sentinel value to denote "no argument provided" and create a new list inside the function.

-    def __init__(self, passed: bool = True, errors: List[Dict] = []):
+    def __init__(self, passed: bool = True, errors: List[Dict] = None):
+        if errors is None:
+            errors = []
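The pitfall is easy to reproduce in isolation. A self-contained demonstration (class names hypothetical, not from the PR):

```python
class BadResult:
    """Buggy: the default list is created once, at function definition time,
    and shared by every call that omits the argument."""
    def __init__(self, errors=[]):
        self.errors = errors


class GoodResult:
    """Fixed: use None as a sentinel and build a fresh list per call."""
    def __init__(self, errors=None):
        self.errors = [] if errors is None else errors


a = BadResult()
a.errors.append("oops")
b = BadResult()
print(b.errors)  # ['oops'] — the shared default already contains the entry

c = GoodResult()
c.errors.append("oops")
d = GoodResult()
print(d.errors)  # [] — each instance gets its own list
```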

Comment on lines 23 to 24
self.passed = False
self._errors.append(error_message)

The passed attribute is defined as a read-only property on lines 12-14, but you're trying to modify it here. This will raise an AttributeError. You should modify the _passed attribute directly.

-        self.passed = False
+        self._passed = False
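Why the assignment fails: a property defined without a setter rejects writes with AttributeError, so internal methods must mutate the backing field directly. A stripped-down illustration of the pattern (names hypothetical):

```python
class ValidationResult:
    def __init__(self):
        self._passed = True

    @property
    def passed(self):
        """Read-only view of the internal flag; no setter is defined."""
        return self._passed

    def add_error(self):
        # self.passed = False  # would raise AttributeError (no setter)
        self._passed = False   # mutate the backing attribute instead


r = ValidationResult()
r.add_error()
print(r.passed)  # False
```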

elif dataframe_type == "polars":
return df.to_dicts()
else:
[]

This line seems to be a no-op. It creates an empty list and then does nothing with it. If you intended to return an empty list when the dataframe type is neither "pandas" nor "polars", you should use the return statement.

-            []
+            return []

Comment on lines 34 to 37
def __init__(self, df, verbose=False):
self._df = df
self._verbose = verbose


Consider making df and verbose keyword-only arguments for better readability and to prevent positional argument errors.

-    def __init__(self, df, verbose=False):
+    def __init__(self, *, df, verbose=False):

@codecov-commenter

codecov-commenter commented Sep 2, 2023

Codecov Report

Merging #522 (c69547a) into main (caf2fca) will increase coverage by 0.04%.
The diff coverage is 88.46%.


@@            Coverage Diff             @@
##             main     #522      +/-   ##
==========================================
+ Coverage   86.89%   86.93%   +0.04%     
==========================================
  Files          46       47       +1     
  Lines        1968     2020      +52     
==========================================
+ Hits         1710     1756      +46     
- Misses        258      264       +6     
Files Changed | Coverage Δ
pandasai/helpers/df_validator.py | 87.23% <87.23%> (ø)
pandasai/smart_dataframe/__init__.py | 93.08% <100.00%> (+0.14%) ⬆️


@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 3

Commits: files that changed from the base of the PR and between commits e4cf54d and c69547a.
Files selected for processing (1)
  • pandasai/helpers/df_validator.py (1 hunks)
Additional comments (Suppressed): 1
pandasai/helpers/df_validator.py (1)
  • 47-47: The use of List[schema] seems incorrect. The List generic expects a type as its parameter, but here it's being given an instance of BaseModel. This should be replaced with the actual class that represents the schema.



class DFValidationResult:
def __init__(self, passed: bool = True, errors: List[Dict] = []):

Default mutable arguments in Python can lead to unexpected behavior. Here, errors: List[Dict] = [] is a mutable default argument. If this list is modified, the change will persist across function calls. Consider changing the default value to None and assigning an empty list within the function if the argument is None.

-    def __init__(self, passed: bool = True, errors: List[Dict] = []):
+    def __init__(self, passed: bool = True, errors: List[Dict] = None):
+        if errors is None:
+            errors = []

"""
Add an error message to the validation results.
"""
self.passed = False

You are trying to set the attribute passed which is not defined in the class. You probably meant to set _passed.

-        self.passed = False
+        self._passed = False

Comment on lines 69 to 85
def validate(self, schema: BaseModel) -> DFValidationResult:
"""
Args:
schema: Pydantic schema to be validated for the dataframe row
"""
dataframe_type = df_type(self._df)
if dataframe_type is None:
raise ValueError("UnSupported DataFrame")

df_json: List[Dict] = self._df_to_list_of_dict(self._df, dataframe_type)

errors = self._validate_batch(schema, df_json)

if len(errors) > 0:
return DFValidationResult(False, errors)
else:
return DFValidationResult(True)

This method could be simplified by directly returning the DFValidationResult object without checking the length of errors. The DFValidationResult constructor already sets passed to True by default, so you only need to create a new DFValidationResult when there are errors.

-        if len(errors) > 0:
-            return DFValidationResult(False, errors)
-        else:
-            return DFValidationResult(True)
+        return DFValidationResult(len(errors) == 0, errors)
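The suggested one-liner is behavior-preserving because the constructor already maps an empty error list to a passing result. A quick equivalence check with a hypothetical stand-in for the class under review (using the sentinel-default fix for `errors`):

```python
class DFValidationResult:
    # Stand-in mirroring the reviewed class, with a sentinel default.
    def __init__(self, passed=True, errors=None):
        self._passed = passed
        self._errors = errors if errors is not None else []

    @property
    def passed(self):
        return self._passed


def verbose(errors):
    # The original branching form.
    if len(errors) > 0:
        return DFValidationResult(False, errors)
    else:
        return DFValidationResult(True)


def concise(errors):
    # The suggested one-liner.
    return DFValidationResult(len(errors) == 0, errors)


for errs in ([], [{"loc": ("A",)}]):
    assert verbose(errs).passed == concise(errs).passed
print("equivalent")
```

One behavioral nuance of the one-liner: a passing result now carries the (empty) `errors` list it was given rather than always a fresh one, which is harmless with the sentinel-default constructor.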

@ArslanSaleem changed the title from "Feat df schema" to "Feat: Validate dataframe with Pydantic schema" on Sep 4, 2023
@gventuri
Collaborator

gventuri commented Sep 4, 2023

@ArslanSaleem thanks a lot for the great feature, this is super cool!

@gventuri gventuri merged commit c986050 into sinaptik-ai:main Sep 4, 2023
@coderabbitai coderabbitai bot left a comment

Review Status

Actionable comments generated: 2

Commits: files that changed from the base of the PR and between commits c69547a and c6463fe.
Files selected for processing (3)
  • pandasai/helpers/df_validator.py (1 hunks)
  • pandasai/smart_dataframe/__init__.py (2 hunks)
  • tests/test_smartdataframe.py (4 hunks)
Additional comments (Suppressed): 9
pandasai/smart_dataframe/__init__.py (2)
  • 24-29: The import of pydantic and DfValidator from pandasai.helpers.df_validator is correctly done. The new imports are placed in the correct section of the code, following Python's best practices for organizing imports.

  • 241-249: The validate method has been added to the SmartDataframe class. This method uses the DfValidator class to validate the rows of the dataframe against a given Pydantic schema. The implementation looks correct based on the provided context. However, it would be good to verify that the DfValidator class and its validate method have been implemented correctly and handle all possible edge cases.

tests/test_smartdataframe.py (4)
  • 14-14: The BaseModel from Pydantic is imported but not used in this hunk. Ensure that it's used in the subsequent parts of the code.

  • 603-609: The changes made here are just renaming and do not affect the functionality of the code. The new name aligns with the class being tested.

  • 635-641: Similar to the previous comment, the changes made here are just renaming and do not affect the functionality of the code. The new name aligns with the class being tested.

  • 663-752: This new test case is added to validate the functionality of dataframe validation using Pydantic schemas. It covers various scenarios including successful validation, validation failures, and complex schema constraints. This is a good addition as it ensures the newly added feature works as expected.

pandasai/helpers/df_validator.py (3)
  • 1-52: The DfValidationResult class is well implemented. It encapsulates the result of a validation operation, including whether it passed and any errors that occurred. The use of properties for passed and errors ensures that these attributes are read-only, which is good for immutability. The add_error method allows adding an error message to the results, and the __bool__ method defines the truthiness of the object based on whether the validation passed or not.

  • 54-126: The DfValidator class is responsible for validating a DataFrame using a Pydantic schema. It uses private methods _validate_batch and _df_to_list_of_dict to perform the actual validation. However, there's a potential issue in the _validate_batch method.

  • 110-126: The validate method validates the DataFrame against the provided schema and returns a DfValidationResult. It first determines the type of the DataFrame, then converts it to a list of dictionaries, and finally validates each batch of rows against the schema. The implementation seems correct and efficient.

Comment on lines +71 to +90
def _validate_batch(self, schema, df_json: List[Dict]):
"""
Args:
schema: Pydantic schema
batch_df: dataframe batch

Returns:
list of errors
"""
try:
# Create a Pydantic Validator to validate rows of dataframe
class PdVal(BaseModel):
df: List[schema]

PdVal(df=df_json)
return []

except ValidationError as e:
return e.errors()


In the _validate_batch method, you're creating a new Pydantic model PdVal inside the try block every time this method is called. This could lead to performance issues if this method is called frequently, as creating a new class is an expensive operation. Consider defining this class outside of the method and reusing it.

-            class PdVal(BaseModel):
-                df: List[schema]
+            PdVal.df = List[schema]
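For reference, the `List[schema]` container trick itself is valid when `schema` is a model class: wrapping the rows in a batch model validates them all in one construction call. A self-contained sketch of that pattern (names hypothetical; note the bot's suggested `PdVal.df = List[schema]` mutation would not re-run model building, so a per-schema model is still needed — `create_model` is one way to build it without hand-writing a class each time):

```python
from typing import Dict, List, Type

from pydantic import BaseModel, ValidationError, create_model


class Row(BaseModel):
    A: int
    B: str


def validate_batch(schema: Type[BaseModel], rows: List[Dict]) -> List[Dict]:
    """Validate all rows at once by wrapping them in a container model.

    List[schema] parameterizes the container with the row model, so one
    construction call checks every row and aggregates all field errors.
    """
    Batch = create_model("Batch", df=(List[schema], ...))
    try:
        Batch(df=rows)
        return []
    except ValidationError as e:
        return e.errors()


print(validate_batch(Row, [{"A": 1, "B": "x"}]))       # []
print(validate_batch(Row, [{"A": "bad", "B": "x"}]))   # non-empty: the bad value is reported
```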

Comment on lines +91 to +109
def _df_to_list_of_dict(self, df: DataFrameType, dataframe_type: str) -> List[Dict]:
"""
Create list of dict of dataframe rows on basis of dataframe type
Supports only polars and pandas dataframe

Args:
df: dataframe to be converted
dataframe_type: type of dataframe

Returns:
list of dict of dataframe rows
"""
if dataframe_type == "pandas":
return df.to_dict(orient="records")
elif dataframe_type == "polars":
return df.to_dicts()
else:
return []


The _df_to_list_of_dict method converts a DataFrame into a list of dictionaries based on the type of DataFrame (either pandas or polars). If the DataFrame type is neither of these, it returns an empty list. This could potentially hide errors if an unsupported DataFrame type is used. Instead, consider raising an exception when an unsupported DataFrame type is encountered.

-            else:
-                return []
+            raise ValueError(f"Unsupported DataFrame type: {dataframe_type}")

Development

Successfully merging this pull request may close these issues:
  • DataFrame Validation with Pydantic

4 participants