-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Validate dataframe with Pydantic schema #522
Conversation
WalkthroughThe changes introduce a new module, Changes
TipsChat with CodeRabbit Bot (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 6
Files selected for processing (3)
- pandasai/helpers/df_validator.py (1 hunks)
- pandasai/smart_dataframe/init.py (2 hunks)
- tests/test_smartdataframe.py (2 hunks)
Additional comments (Suppressed): 8
pandasai/smart_dataframe/__init__.py (1)
- 24-30: The import of the
pydantic
module andDFValidator
class is correctly done. The new imports are well organized and do not conflict with existing ones.tests/test_smartdataframe.py (6)
8-14: The import of
BaseModel
andField
from thepydantic
module is new in this hunk. Ensure that thepydantic
module is included in the project dependencies.663-679: This test case validates a DataFrame with integer values using a Pydantic schema. The validation should pass as the data types match the schema.
681-697: This test case validates a DataFrame with string values in column "A" using a Pydantic schema that expects integers. The validation should fail, which is the expected behavior.
699-714: This test case validates a Polars DataFrame with integer values using a Pydantic schema. The validation should pass as the data types match the schema.
716-733: This test case validates a DataFrame with one string value in column "A" using a Pydantic schema that expects integers. The validation should fail for one record, which is the expected behavior.
735-759: These test cases validate a DataFrame against a Pydantic schema with complex constraints (greater than and less than conditions). The first validation should fail as all values in column "A" are less than 5, while the second validation should pass as all values in column "A" are less than 5.
pandasai/helpers/df_validator.py (1)
- 46-49: Creating a new class within a method can lead to issues with object identity and garbage collection. Consider defining this class at module level and parameterizing it as needed.
pandasai/smart_dataframe/__init__.py
Outdated
self._sample_head = df_head.to_csv(index=False) | ||
return self._sample_head | ||
|
||
def validate(self, schema: pydantic.BaseModel, verbose: bool = False): | ||
""" | ||
Validates Dataframe rows on the basis Pydantic schema input | ||
(Args): | ||
schema: Pydantic schema class | ||
n_jobs: Parallelism for larger dataframe | ||
verbose: Print Errors | ||
""" | ||
df_validator = DFValidator(self.original_import, verbose) | ||
return df_validator.validate(schema) | ||
|
||
@property | ||
def datalake(self): | ||
return self._dl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validate
method has been added to the SmartDataframe
class. This method uses the DFValidator
class to validate the DataFrame rows against a given Pydantic schema. The method signature looks correct, and the use of the DFValidator
class seems appropriate. However, the docstring mentions an argument n_jobs
which is not present in the function definition. Please update the docstring to match the function signature.
- n_jobs: Parallelism for larger dataframe
pandasai/helpers/df_validator.py
Outdated
def __init__(self, passed: bool = True, errors: List[Dict] = []): | ||
self._passed = passed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default mutable arguments are a common source of unexpected behaviors and bugs in Python. Here, errors: List[Dict] = []
is a mutable default argument. If the errors
list is modified, it will persist across all future function calls that do not provide an errors
argument. This can lead to unexpected behavior. Consider using a sentinel value to denote "no argument provided" and create a new list inside the function.
- def __init__(self, passed: bool = True, errors: List[Dict] = []):
+ def __init__(self, passed: bool = True, errors: List[Dict] = None):
+ if errors is None:
+ errors = []
pandasai/helpers/df_validator.py
Outdated
self.passed = False | ||
self._errors.append(error_message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The passed
attribute is defined as a read-only property on lines 12-14, but you're trying to modify it here. This will raise an AttributeError. You should modify the _passed
attribute directly.
- self.passed = False
+ self._passed = False
pandasai/helpers/df_validator.py
Outdated
elif dataframe_type == "polars": | ||
return df.to_dicts() | ||
else: | ||
[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line seems to be a no-op. It creates an empty list and then does nothing with it. If you intended to return an empty list when the dataframe type is neither "pandas" nor "polars", you should use the return
statement.
- []
+ return []
pandasai/helpers/df_validator.py
Outdated
def __init__(self, df, verbose=False): | ||
self._df = df | ||
self._verbose = verbose | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider making df
and verbose
keyword-only arguments for better readability and to prevent positional argument errors.
- def __init__(self, df, verbose=False):
+ def __init__(self, *, df, verbose=False):
Codecov Report
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the GitHub App Integration for your organization. Read more. @@ Coverage Diff @@
## main #522 +/- ##
==========================================
+ Coverage 86.89% 86.93% +0.04%
==========================================
Files 46 47 +1
Lines 1968 2020 +52
==========================================
+ Hits 1710 1756 +46
- Misses 258 264 +6
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 3
Files selected for processing (1)
- pandasai/helpers/df_validator.py (1 hunks)
Additional comments (Suppressed): 1
pandasai/helpers/df_validator.py (1)
- 47-47: The use of
List[schema]
seems incorrect. TheList
generic expects a type as its parameter, but here it's being given an instance ofBaseModel
. This should be replaced with the actual class that represents the schema.
pandasai/helpers/df_validator.py
Outdated
|
||
|
||
class DFValidationResult: | ||
def __init__(self, passed: bool = True, errors: List[Dict] = []): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default mutable arguments in Python can lead to unexpected behavior. Here, errors: List[Dict] = []
is a mutable default argument. If this list is modified, the change will persist across function calls. Consider changing the default value to None
and assigning an empty list within the function if the argument is None
.
- def __init__(self, passed: bool = True, errors: List[Dict] = []):
+ def __init__(self, passed: bool = True, errors: List[Dict] = None):
+ if errors is None:
+ errors = []
pandasai/helpers/df_validator.py
Outdated
""" | ||
Add an error message to the validation results. | ||
""" | ||
self.passed = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are trying to set the attribute passed
which is not defined in the class. You probably meant to set _passed
.
- self.passed = False
+ self._passed = False
pandasai/helpers/df_validator.py
Outdated
def validate(self, schema: BaseModel) -> DFValidationResult: | ||
""" | ||
Args: | ||
schema: Pydantic schema to be validated for the dataframe row | ||
""" | ||
dataframe_type = df_type(self._df) | ||
if dataframe_type is None: | ||
raise ValueError("UnSupported DataFrame") | ||
|
||
df_json: List[Dict] = self._df_to_list_of_dict(self._df, dataframe_type) | ||
|
||
errors = self._validate_batch(schema, df_json) | ||
|
||
if len(errors) > 0: | ||
return DFValidationResult(False, errors) | ||
else: | ||
return DFValidationResult(True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method could be simplified by directly returning the DFValidationResult
object without checking the length of errors
. The DFValidationResult
constructor already sets passed
to True
by default, so you only need to create a new DFValidationResult
when there are errors.
- if len(errors) > 0:
- return DFValidationResult(False, errors)
- else:
- return DFValidationResult(True)
+ return DFValidationResult(len(errors) == 0, errors)
@ArslanSaleem thanks a lot for the great feature, this is super cool! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review Status
Actionable comments generated: 2
Files selected for processing (3)
- pandasai/helpers/df_validator.py (1 hunks)
- pandasai/smart_dataframe/init.py (2 hunks)
- tests/test_smartdataframe.py (4 hunks)
Additional comments (Suppressed): 9
pandasai/smart_dataframe/__init__.py (2)
24-29: The import of
pydantic
andDfValidator
frompandasai.helpers.df_validator
is correctly done. The new imports are placed in the correct section of the code, following Python's best practices for organizing imports.241-249: The
validate
method has been added to theSmartDataframe
class. This method uses theDfValidator
class to validate the rows of the dataframe against a given Pydantic schema. The implementation looks correct based on the provided context. However, it would be good to verify that theDfValidator
class and itsvalidate
method have been implemented correctly and handle all possible edge cases.tests/test_smartdataframe.py (4)
14-14: The
BaseModel
from Pydantic is imported but not used in this hunk. Ensure that it's used in the subsequent parts of the code.603-609: The changes made here are just renaming and do not affect the functionality of the code. The new name aligns with the class being tested.
635-641: Similar to the previous comment, the changes made here are just renaming and do not affect the functionality of the code. The new name aligns with the class being tested.
663-752: This new test case is added to validate the functionality of dataframe validation using Pydantic schemas. It covers various scenarios including successful validation, validation failures, and complex schema constraints. This is a good addition as it ensures the newly added feature works as expected.
pandasai/helpers/df_validator.py (3)
1-52: The
DfValidationResult
class is well implemented. It encapsulates the result of a validation operation, including whether it passed and any errors that occurred. The use of properties forpassed
anderrors
ensures that these attributes are read-only, which is good for immutability. Theadd_error
method allows adding an error message to the results, and the__bool__
method defines the truthiness of the object based on whether the validation passed or not.54-126: The
DfValidator
class is responsible for validating a DataFrame using a Pydantic schema. It uses private methods_validate_batch
and_df_to_list_of_dict
to perform the actual validation. However, there's a potential issue in the_validate_batch
method.110-126: The
validate
method validates the DataFrame against the provided schema and returns aDfValidationResult
. It first determines the type of the DataFrame, then converts it to a list of dictionaries, and finally validates each batch of rows against the schema. The implementation seems correct and efficient.
def _validate_batch(self, schema, df_json: List[Dict]): | ||
""" | ||
Args: | ||
schema: Pydantic schema | ||
batch_df: dataframe batch | ||
|
||
Returns: | ||
list of errors | ||
""" | ||
try: | ||
# Create a Pydantic Validator to validate rows of dataframe | ||
class PdVal(BaseModel): | ||
df: List[schema] | ||
|
||
PdVal(df=df_json) | ||
return [] | ||
|
||
except ValidationError as e: | ||
return e.errors() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the _validate_batch
method, you're creating a new Pydantic model PdVal
inside the try block every time this method is called. This could lead to performance issues if this method is called frequently, as creating a new class is an expensive operation. Consider defining this class outside of the method and reusing it.
- class PdVal(BaseModel):
- df: List[schema]
+ PdVal.df = List[schema]
def _df_to_list_of_dict(self, df: DataFrameType, dataframe_type: str) -> List[Dict]: | ||
""" | ||
Create list of dict of dataframe rows on basis of dataframe type | ||
Supports only polars and pandas dataframe | ||
|
||
Args: | ||
df: dataframe to be converted | ||
dataframe_type: type of dataframe | ||
|
||
Returns: | ||
list of dict of dataframe rows | ||
""" | ||
if dataframe_type == "pandas": | ||
return df.to_dict(orient="records") | ||
elif dataframe_type == "polars": | ||
return df.to_dicts() | ||
else: | ||
return [] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _df_to_list_of_dict
method converts a DataFrame into a list of dictionaries based on the type of DataFrame (either pandas or polars). If the DataFrame type is neither of these, it returns an empty list. This could potentially hide errors if an unsupported DataFrame type is used. Instead, consider raising an exception when an unsupported DataFrame type is encountered.
- else:
- return []
+ raise ValueError(f"Unsupported DataFrame type: {dataframe_type}")
Summary by CodeRabbit
DFValidator
that uses Pydantic schemas to validate data in pandas DataFrames. This feature enhances data integrity and consistency.validate
method to theSmartDataframe
class, allowing users to validate their data directly from the DataFrame instance.SmartDataframe
class with new test cases focusing on DataFrame validation using Pydantic schemas. This ensures robustness and reliability of the new features.