Data Agent is a project designed to utilize large models to generate SQL queries for fetching data, subsequently creating corresponding reports and charts. The project architecture includes:
- auto_plot_mat.py: Script for generating charts, which can deal with most of the situations of the dataframe with time_column, numeric_column and category_column.
- main.py: Main function script.
- data/: Directory containing two example CSV files for testing.
- 
Upload Data: Begin by uploading one or more CSV files. These files will be used for generating SQL queries and subsequent analysis. 
- 
Explore Data Schema: : Upon uploading data, the system generates a description of the schema, allowing users to understand the structure of their data for better insights and query formulation. 
- 
Ask Questions: Describe the data you want to query in the provided textbox. 
- 
View Reports and Visualizations: The system will generate SQL queries based on your description, fetch the data, debug the sql and regenerate until there is no errors, create visualizations, and present a detailed report. 
- 
Data Schema Description:: Upon data upload, the system provides a detailed description of the schema, including data types and descriptions of columns. This enables users to understand their data better and formulate more precise queries. 
- 
Intelligent SQL Querying: Utilizes a large language model to generate complex SQL queries based on natural language descriptions. 
- 
Error Correction: In case of SQL errors, the llm can automatically debug the sql and regenerate the correct SQL until there is no errors, allowing for a smoother workflow. 
- 
Data Visualization: The system automatically generates visualizations such as charts based on the fetched data. 
- 
Workflow: This project is designed with a workflow in mind, ensuring a structured approach to data analysis: - 
Data Upload: CSV files are uploaded for processing. 
- 
Data Schema Generation: The system generates a schema description to help users understand the data structure. 
- 
Query Description: Users describe the data they want to analyze in natural language. 
- 
SQL Query Generation: The system generates SQL queries based on the description. 
- 
SQL Execution: Queries are executed on the uploaded data to fetch relevant information. 
- 
Report and Visualization: The system generates detailed reports and visualizations from the fetched data. 
- 
Error Handling: If there are SQL errors, the system provides suggestions for correction. 
- 
Final Report: Users receive a comprehensive report, including SQL queries used, retrieved data, and visualizations. 
 
- 
Here are a few examples of the types of questions you can ask:
- "What is the average income of individuals with a bachelor's degree?"
- "What is the average income by geographic region?"
- "How many individuals are single?"
- 
Ensure you have the necessary dependencies installed. 
- 
Run main.pywith your OpenAI API key as an argument. Example:python main.py --openai_key YOUR_OPENAI_API_KEY
- 
Follow the instructions in the Gradio interface to upload data, ask questions, and view reports. 
- openai: OpenAI Python SDK for interfacing with GPT-3.5 Turbo.
- pandas: For data manipulation and analysis.
- pandasql: For running SQL queries on DataFrame objects.
- gradio: For creating the user interface.
- SQL Error Correction: If there are errors in the generated SQL queries, the system guides the user by suggesting corrected queries, ensuring a smoother workflow.
For any inquiries or feedback, please contact Dalin Wang at [email protected].
This project is licensed under the MIT License. See the LICENSE file for details.
