#16 OOP Upgrade Feature Engineering Agent Example
mdancho84 committed Jan 10, 2025
1 parent 9b9aba3 commit 5e97589
Showing 2 changed files with 54,493 additions and 587 deletions.
126 changes: 0 additions & 126 deletions examples/data_cleaning_agent.ipynb
@@ -1187,132 +1187,6 @@
"data_cleaning_agent.get_recommended_cleaning_steps(markdown=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Explore the agent documentation for more information"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[0;31mType:\u001b[0m DataCleaningAgent\n",
"\u001b[0;31mString form:\u001b[0m <ai_data_science_team.agents.data_cleaning_agent.DataCleaningAgent object at 0x7f86d9036290>\n",
"\u001b[0;31mFile:\u001b[0m ~/Desktop/course_code/ai-data-science-team/ai_data_science_team/agents/data_cleaning_agent.py\n",
"\u001b[0;31mDocstring:\u001b[0m \n",
"Creates a data cleaning agent that can process datasets based on user-defined instructions or default cleaning steps. \n",
"The agent generates a Python function to clean the dataset, performs the cleaning, and logs the process, including code \n",
"and errors. It is designed to facilitate reproducible and customizable data cleaning workflows.\n",
"\n",
"The agent performs the following default cleaning steps unless instructed otherwise:\n",
"\n",
"- Removing columns with more than 40% missing values.\n",
"- Imputing missing values with the mean for numeric columns.\n",
"- Imputing missing values with the mode for categorical columns.\n",
"- Converting columns to appropriate data types.\n",
"- Removing duplicate rows.\n",
"- Removing rows with missing values.\n",
"- Removing rows with extreme outliers (values 3x the interquartile range).\n",
"\n",
"User instructions can modify, add, or remove any of these steps to tailor the cleaning process.\n",
"\n",
"Parameters\n",
"----------\n",
"model : langchain.llms.base.LLM\n",
" The language model used to generate the data cleaning function.\n",
"n_samples : int, optional\n",
" Number of samples used when summarizing the dataset. Defaults to 30. Reducing this number can help \n",
" avoid exceeding the model's token limits.\n",
"log : bool, optional\n",
" Whether to log the generated code and errors. Defaults to False.\n",
"log_path : str, optional\n",
" Directory path for storing log files. Defaults to None.\n",
"file_name : str, optional\n",
" Name of the file for saving the generated response. Defaults to \"data_cleaner.py\".\n",
"overwrite : bool, optional\n",
" Whether to overwrite the log file if it exists. If False, a unique file name is created. Defaults to True.\n",
"human_in_the_loop : bool, optional\n",
" Enables user review of data cleaning instructions. Defaults to False.\n",
"bypass_recommended_steps : bool, optional\n",
" If True, skips the default recommended cleaning steps. Defaults to False.\n",
"bypass_explain_code : bool, optional\n",
" If True, skips the step that provides code explanations. Defaults to False.\n",
"\n",
"Methods\n",
"-------\n",
"update_params(**kwargs)\n",
" Updates the agent's parameters and rebuilds the compiled state graph.\n",
"ainvoke(user_instructions: str, data_raw: pd.DataFrame, max_retries=3, retry_count=0)\n",
" Cleans the provided dataset asynchronously based on user instructions.\n",
"invoke(user_instructions: str, data_raw: pd.DataFrame, max_retries=3, retry_count=0)\n",
" Cleans the provided dataset synchronously based on user instructions.\n",
"explain_cleaning_steps()\n",
" Returns an explanation of the cleaning steps performed by the agent.\n",
"get_log_summary()\n",
" Retrieves a summary of logged operations if logging is enabled.\n",
"get_state_keys()\n",
" Returns a list of keys from the state graph response.\n",
"get_state_properties()\n",
" Returns detailed properties of the state graph response.\n",
"get_data_cleaned()\n",
" Retrieves the cleaned dataset as a pandas DataFrame.\n",
"get_data_raw()\n",
" Retrieves the raw dataset as a pandas DataFrame.\n",
"get_data_cleaner_function()\n",
" Retrieves the generated Python function used for cleaning the data.\n",
"get_recommended_cleaning_steps()\n",
" Retrieves the agent's recommended cleaning steps.\n",
"\n",
"Examples\n",
"--------\n",
"```python\n",
"import pandas as pd\n",
"from langchain_openai import ChatOpenAI\n",
"from ai_data_science_team.agents import DataCleaningAgent\n",
"\n",
"llm = ChatOpenAI(model=\"gpt-4o-mini\")\n",
"\n",
"data_cleaning_agent = DataCleaningAgent(\n",
" model=llm, n_samples=50, log=True, log_path=\"logs\", human_in_the_loop=True\n",
")\n",
"\n",
"df = pd.read_csv(\"https://raw.githubusercontent.com/business-science/ai-data-science-team/refs/heads/master/data/churn_data.csv\")\n",
"\n",
"data_cleaning_agent.invoke(\n",
" user_instructions=\"Don't remove outliers when cleaning the data.\",\n",
" data_raw=df,\n",
" max_retries=3,\n",
" retry_count=0\n",
")\n",
"\n",
"cleaned_data = data_cleaning_agent.get_data_cleaned()\n",
"\n",
"response = data_cleaning_agent.response\n",
"```\n",
"\n",
"Returns\n",
"--------\n",
"DataCleaningAgent : langchain.graphs.CompiledStateGraph \n",
" A data cleaning agent implemented as a compiled state graph. \n",
"\u001b[0;31mInit docstring:\u001b[0m\n",
"Initialize the agent with provided parameters.\n",
"\n",
"Parameters:\n",
" **params: Arbitrary keyword arguments representing the agent's parameters."
]
}
],
"source": [
"?data_cleaning_agent"
]
},
{
"cell_type": "markdown",
"metadata": {},
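The deleted docstring above lists "Removing rows with extreme outliers (values 3x the interquartile range)" among the agent's default cleaning steps. The agent generates its own cleaning code at runtime via the LLM, so there is no fixed implementation to cite; the sketch below is only a plausible illustration of that IQR rule in plain pandas, with the function name and the 3× multiplier chosen to mirror the docstring.

```python
import pandas as pd

def remove_extreme_outliers(df: pd.DataFrame, multiplier: float = 3.0) -> pd.DataFrame:
    """Drop rows where any numeric value falls outside
    [Q1 - multiplier*IQR, Q3 + multiplier*IQR] for its column.

    Illustrative only -- the DataCleaningAgent generates its own
    cleaning function, which may differ from this sketch.
    """
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    # Keep a row only if every numeric value is within bounds.
    mask = ((numeric >= lower) & (numeric <= upper)).all(axis=1)
    return df[mask]

df = pd.DataFrame({"x": [1, 2, 3, 4, 1000]})
cleaned = remove_extreme_outliers(df)
# The 1000 lies far above Q3 + 3*IQR (= 4 + 3*2 = 10) and is dropped.
```

Passing `user_instructions="Don't remove outliers when cleaning the data."` to `invoke()`, as in the example above, tells the agent to omit this step entirely.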
