Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Version 1.2 #534

Merged
merged 8 commits into from
Sep 7, 2023
Merged

Version 1.2 #534

merged 8 commits into from
Sep 7, 2023

Conversation

gventuri
Copy link
Collaborator

@gventuri gventuri commented Sep 7, 2023

Summary by CodeRabbit

Release Notes

  • New Feature: Introduced DfValidator for validating dataframes using Pydantic schemas.
  • New Feature: Added flexibility to the configuration handling by allowing either a Config object or a dictionary in various classes and methods.
  • New Feature: Added an optional sample_head parameter to the SmartDataframe class.
  • New Feature: Updated chart saving functionality to allow custom directories.
  • Deprecation: Marked Starcoder and Falcon classes as deprecated with warnings suggesting alternatives.
  • Test: Added new test cases for dataframe validation and chart saving functionality.
  • Documentation: Updated documentation with deprecation notices and minor changes.

gventuri and others added 8 commits September 4, 2023 17:48
* fix: directory for saving charts issue (#513)

* (fix): update template (remove hardcode of directory) in
 `GeneratePythonCodePrompt` class
* (fix): add passing "save_charts_path" from the config object to
  `get_prompt()`

* tests: custom path for charts (#513)

* (tests): add test method for a case passing custom path with
  `save_charts_path` parameter

* test: add test for the custom chart prompt

---------

Co-authored-by: Gabriele Venturi <[email protected]>
* feat[DFValidator]: Add return types

* fix pre-commit errors

* add return value

* refactor: minor changes in validation

---------

Co-authored-by: arslan-agory <[email protected]>
Co-authored-by: Gabriele Venturi <[email protected]>
* fix: update type hint for config parameter (#526)

* (fix): add `dict` as a possible type of object to pass in methods and
  functions where `config` parameter occurs
* (feat): add logging of an exception to `load_config()` instead of
  silently suppressing an error

* fix: use Union for config hints (#526)

* (fix): replace "bitwise or" (`|`) usage with `Union` in type hints for
  `config` object
* feat(llm): deprecate falcon and starcoder

* docs: mention Starcoder and Falcon being deprecated

---------

Co-authored-by: Gabriele Venturi <[email protected]>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 7, 2023

Walkthrough

This update introduces flexibility in configuration handling, deprecates certain classes, enhances dataframe validation with Pydantic schemas, and improves chart saving functionality. It also includes necessary test updates to validate these changes.

Changes

File(s) Summary
docs/API/llms.md, pandasai/llm/falcon.py, pandasai/llm/starcoder.py Deprecated Starcoder and Falcon classes with alternative LLMs suggested.
docs/LLMs/llms.md Updated documentation with image link addition and OpenAI models section removal.
pandasai/__init__.py, pandasai/config.py, pandasai/helpers/code_manager.py, pandasai/smart_dataframe/__init__.py, pandasai/smart_datalake/__init__.py Introduced flexibility in configuration handling by allowing Config instances or dictionaries.
pandasai/helpers/df_validator.py, tests/test_smartdataframe.py Added dataframe validation using Pydantic schemas and corresponding tests.
pandasai/prompts/generate_python_code.py, tests/prompts/test_generate_python_code_prompt.py Improved chart saving functionality with custom directory support and updated tests.

🐇💻

Code changes here, code changes there,

Making our software fair and square.

Deprecations, validations, and more,

That's what this update has in store.

So let's hop to it, without any delay,

Celebrating the improvements we made today! 🎉🥕


Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • Mention @coderabbitai in any review comment for bot assistance.
  • Note: Review comments are made on code diffs or files, not on the PR overview.

Pause Incremental Reviews

  • Insert @coderabbitai: ignore in the PR description to halt the bot's ongoing reviews. Remove the line to resume.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 3

Commits Files that changed from the base of the PR and between 3a92d33 and 9e6a71a commits.
Files ignored due to filter (2)
  • mkdocs.yml
  • pyproject.toml
Files selected for processing (13)
  • docs/API/llms.md (1 hunks)
  • docs/LLMs/llms.md (2 hunks)
  • pandasai/init.py (4 hunks)
  • pandasai/config.py (2 hunks)
  • pandasai/helpers/code_manager.py (2 hunks)
  • pandasai/helpers/df_validator.py (1 hunks)
  • pandasai/llm/falcon.py (2 hunks)
  • pandasai/llm/starcoder.py (2 hunks)
  • pandasai/prompts/generate_python_code.py (1 hunks)
  • pandasai/smart_dataframe/init.py (5 hunks)
  • pandasai/smart_datalake/init.py (5 hunks)
  • tests/prompts/test_generate_python_code_prompt.py (2 hunks)
  • tests/test_smartdataframe.py (6 hunks)
Files skipped from review due to trivial changes (5)
  • docs/API/llms.md
  • docs/LLMs/llms.md
  • pandasai/llm/falcon.py
  • pandasai/llm/starcoder.py
  • pandasai/prompts/generate_python_code.py
Additional comments (Suppressed): 23
pandasai/config.py (1)
  • 12-15: The type hint for override_config has been updated to accept either a Config object or a dictionary. This is a good change as it provides more flexibility in how the configuration can be provided.
pandasai/smart_dataframe/__init__.py (3)
  • 56-62: The function signature has been updated to accept either a Config object or a dictionary for configuration. Ensure that all calls to this function throughout the codebase have been updated to match the new signature.

  • 65-71: The docstring has been updated to reflect the changes in the function signature. The description of the config parameter now correctly indicates that it can be either a Config object or a dictionary.

  • 238-253: A new method validate has been added which uses the DfValidator class to validate the dataframe rows based on a Pydantic schema input. This is a good addition as it allows for robust data validation using Pydantic's powerful and flexible validation capabilities.

pandasai/__init__.py (5)
  • 103-117: The __init__ method of the PandasAI class has been updated to accept either a Config object or a dictionary for configuration. This change enhances flexibility and allows users to pass in configurations in a more convenient way. However, ensure that all calls to this function throughout the codebase have been updated to match the new signature.

  • 146-148: Deprecation warning added for the PandasAI class, suggesting users to use SmartDataframe instead. This is a good practice as it informs users about the upcoming changes and gives them time to adapt their code accordingly.

  • 150-150: The _config attribute of the PandasAI class is now initialized with a Config object. This is consistent with the change made in the __init__ method where it can now accept either a Config object or a dictionary for configuration.

  • 163-172: No significant changes detected in the run method of the PandasAI class. The method signature and functionality remain the same.

  • 200-209: No significant changes detected in the __call__ method of the PandasAI class. The method signature and functionality remain the same.

pandasai/helpers/df_validator.py (2)
  • 1-52: The DfValidationResult class is well implemented. It encapsulates the result of a validation operation, including whether it passed and any errors that occurred. The use of properties for passed and errors ensures that these attributes are read-only, which is good for immutability. The add_error method allows adding an error message to the results and automatically sets passed to False, which is a logical behavior. The __bool__ method is also correctly implemented to return the value of passed, allowing a DfValidationResult object to be used in a boolean context.

  • 54-126: The DfValidator class is responsible for validating a dataframe using a Pydantic schema. The _validate_batch method validates a batch of rows from the dataframe against the provided schema and returns any errors. The _df_to_list_of_dict method converts the dataframe to a list of dictionaries, which is the format required for validation. The validate method orchestrates the validation process by determining the type of the dataframe, converting it to a list of dictionaries, validating it, and returning the results. This class is well designed and implemented.

However, there's a potential issue with the _df_to_list_of_dict method. If the dataframe_type is neither "pandas" nor "polars", it returns an empty list. This could lead to silent failures if an unsupported dataframe type is used. It would be better to raise an exception in this case.

tests/prompts/test_generate_python_code_prompt.py (1)
  • 26-27: The new variable save_charts_path is being set but not used in the assertion. Please ensure that the test case checks for the correct usage of this variable.
pandasai/helpers/code_manager.py (2)
  • 21-27: The type hint for _config has been updated to accept either a Config object or a dictionary. This change enhances the flexibility of configuration, but it also introduces potential issues. If a dictionary is passed instead of a Config object, there might be missing keys that the Config object would have guaranteed. Ensure that all usages of _config in this class handle both Config objects and dictionaries correctly.

  • 30-35: The __init__ method now accepts either a Config object or a dictionary for the config parameter. As mentioned above, ensure that all usages of config in this class handle both Config objects and dictionaries correctly.

pandasai/smart_datalake/__init__.py (5)
  • 44-50: The type of _config has been changed from Config to Union[Config, dict]. This change allows for more flexibility in the configuration of the SmartDatalake class. However, it's important to ensure that all parts of the code that interact with _config can handle both Config objects and dictionaries.

  • 60-72: The __init__ method now accepts either a Config object or a dictionary for the config parameter. This is a good change as it enhances the flexibility of the SmartDatalake class. However, make sure that the rest of the codebase is updated to reflect this change where necessary.

  • 138-154: The _load_config method has been updated to accept either a Config object or a dictionary. The load_config function is then used to load the configuration. This is a good change as it enhances the flexibility of the SmartDatalake class. However, make sure that the load_config function can handle both Config objects and dictionaries.

  • 164-170: The check for whether an LLM is a PandasAI LLM has been simplified by checking if the LLM has the _llm_type attribute. This is a cleaner way of performing the check and should improve readability.

  • 262-268: The default_values dictionary now includes the save_charts_path key. This is a good addition as it allows the path for saving charts to be customized. However, make sure that the rest of the codebase is updated to reflect this change where necessary.

tests/test_smartdataframe.py (4)
  • 14-14: The import of BaseModel and Field from pydantic is new. Ensure that the pydantic package is included in your project dependencies.

  • 125-127: The test case test_init_without_llm has been updated to pass a configuration dictionary with llm set to None. This change aligns with the updates made to the SmartDataframe class to accept a configuration dictionary. The test case should fail as expected if no LLM is provided.

  • 244-309: This is a new test case for saving charts in a non-default directory. It checks whether the chart is saved in the correct directory and whether the path to the saved chart is correctly returned. The test case seems well-written and covers all necessary assertions.

  • 728-824: These are new test cases for validating dataframes using Pydantic schemas. They cover various scenarios, including validation passing, validation failing due to incorrect types, validation with Polars dataframes, validation failing for one record, and validation with complex Pydantic schemas. These tests are important for ensuring the correct functioning of the new dataframe validation feature.

Comment on lines 32 to 40
if config.get("callback") and not override_config.get("callback"):
config["callback"] = getattr(callbacks, config["callback"])()
except Exception:
pass
logger.error("Could not load configuration", exc_info=True)

if override_config:
config.update(override_config)

config = Config(**config)

return config
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error handling has been improved by logging the exception details instead of silently passing it. However, the code no longer converts the final config dictionary into a Config object before returning it. This could potentially lead to issues if the rest of the codebase expects a Config object rather than a dictionary. Please verify that this change is compatible with the rest of the codebase.

-    return config
+    config = Config(**config)
+    return config

Comment on lines +92 to +108
"""
Create list of dict of dataframe rows on basis of dataframe type
Supports only polars and pandas dataframe

Args:
df: dataframe to be converted
dataframe_type: type of dataframe

Returns:
list of dict of dataframe rows
"""
if dataframe_type == "pandas":
return df.to_dict(orient="records")
elif dataframe_type == "polars":
return df.to_dicts()
else:
return []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-        else:
-            return []
+        else:
+            raise ValueError(f"Unsupported DataFrame type: {dataframe_type}")

Comment on lines +66 to 112
def test_str_with_custom_save_charts_path(self):
"""Test that the __str__ method is implemented"""

llm = FakeLLM("plt.show()")
dfs = [
SmartDataframe(
pd.DataFrame({"a": [1], "b": [4]}),
config={"llm": llm},
)
]

prompt = GeneratePythonCodePrompt()
prompt.set_var("dfs", dfs)
prompt.set_var("conversation", "Question")
prompt.set_var("save_charts_path", "custom_path")

assert (
prompt.to_string()
== """
You are provided with the following pandas DataFrames with the following metadata:

Dataframe dfs[0], with 1 rows and 2 columns.
This is the metadata of the dataframe dfs[0]:
a,b
1,4


This is the initial python code to be updated:
```python
# TODO import all the dependencies required
import pandas as pd

# Analyze the data
# 1. Prepare: Preprocessing and cleaning data if necessary
# 2. Process: Manipulating data for analysis (grouping, filtering, aggregating, etc.)
# 3. Analyze: Conducting the actual analysis (if the user asks to create a chart save it to an image in custom_path/temp_chart.png and do not show the chart.)
# 4. Output: return a dictionary of:
# - type (possible values "text", "number", "dataframe", "plot")
# - value (can be a string, a dataframe or the path of the plot, NOT a dictionary)
# Example output: { "type": "text", "value": "The average loan amount is $15,000." }
def analyze_data(dfs: list[pd.DataFrame]) -> dict:
# Code goes here (do not add comments)


# Declare a result variable
result = analyze_data(dfs)
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new test case test_str_with_custom_save_charts_path has been added to verify the functionality of setting a custom path for saving charts. The test case seems well-structured and covers the necessary assertions. However, it would be beneficial to add an assertion to check if the chart is actually saved at the specified location.

+        assert os.path.exists("custom_path/temp_chart.png")

@mspronesti
Copy link
Contributor

@gventuri Are we planning to also discontinue the HuggingfaceLLM class? After the suggestions we got in this thread #525, we could consider to keep that class, parametrizing the model name (to append it to the api) in order to allow users to use starchat-beta, codellama-7b, codellama-13b, llama-2-7b and any future fine-tune but without "promising" any performances, as we would with a dedicated class.
Does it make sense? :)

@gventuri gventuri merged commit 9955000 into feature/v1.2 Sep 7, 2023
10 checks passed
@gventuri
Copy link
Collaborator Author

gventuri commented Sep 7, 2023

@mspronesti I suppose we should basically change the implementation so that we go for the "official" way (not limited to 10 tokens with the "hacky" thing of running multiple calls to get the whole code generated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants