Prasukj7-arch/Info-Extraction

AI agent that reads through a dataset (CSV or Google Sheets) and performs a web search to retrieve specific information for each entity in a chosen column.


AI Agent for Web Search and Information Extraction

Project Description

This project implements an AI agent that reads a dataset (CSV or Google Sheets) and performs a web search for each entity in a chosen column. The agent uses a large language model (LLM) to parse the search results and extract the requested data, such as email addresses, company details, or other specified information. It also includes a user-friendly dashboard where users can upload files, define search queries, and view or download the extracted results.

Key Features

  • Upload CSV files or connect to Google Sheets.
  • Specify search queries with dynamic placeholders for entity values.
  • Perform web searches and extract relevant information using LLMs.
  • View extracted information in a structured format.
  • Download the extracted results as a CSV.
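The last feature, downloading results as a CSV, amounts to serializing one row per entity. A minimal standard-library sketch; the `results_to_csv` helper and the column names are illustrative assumptions, not names from this repository:

```python
import csv
import io

def results_to_csv(rows):
    """Serialize extracted results (one dict per entity) into CSV text
    suitable for a download button. `rows` is a list of dicts that all
    share the same keys; column order follows the first row."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In a Streamlit app, the returned string could be handed to `st.download_button` as the file contents.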

Setup Instructions

Prerequisites

Before setting up the project, ensure you have the following installed:

  • Python 3.x
  • pip (Python package installer)
  • A Google Cloud Project with Sheets API enabled and a service account key for authentication (for Google Sheets integration).

Installing Dependencies

  1. Clone the repository:

    git clone https://github.com/Prasukj7-arch/Info-Extraction.git
    cd Info-Extraction
  2. Create a virtual environment (optional but recommended):

    python3 -m venv venv
    source venv/bin/activate  # For macOS/Linux
    venv\Scripts\activate  # For Windows
  3. Install the required dependencies:

    pip install -r requirements.txt

Configuring Environment Variables and Service Account

Step 1: Create and Configure the .env File

  1. In the root directory of the project, create a .env file.

  2. Add the following variables to the .env file:

    SERPAPI_KEY=your_serpapi_key
    HUGGINGFACE_API_KEY=your_huggingface_api_key

  • Replace your_serpapi_key with your SerpAPI key for performing web searches.
  • Replace your_huggingface_api_key with your HuggingFace API key for natural language processing.
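At runtime the application needs both keys present in the environment. A minimal sketch of a startup check, assuming the .env file has already been loaded into the environment (for example via python-dotenv); the `load_api_keys` helper is an illustrative name, not from this repository:

```python
import os

REQUIRED_KEYS = ("SERPAPI_KEY", "HUGGINGFACE_API_KEY")

def load_api_keys():
    """Read the required API keys from the environment and fail fast
    with a clear message if any of them is missing or empty."""
    keys = {}
    for name in REQUIRED_KEYS:
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        keys[name] = value
    return keys
```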

Step 2: Set Up the config/ Folder

  1. Create a config/ folder in the root directory of your project.

  2. Place your Google Service Account JSON file inside the config/ folder. You can create a service account and download its JSON key in the Google Cloud console (IAM & Admin → Service Accounts).

  3. Add the path to the key in your .env file:

    GOOGLE_SERVICE_ACCOUNT_JSON=config/gcp_service_account.json
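Before attempting to authenticate, it can help to sanity-check the key file. A sketch of such a check, assuming the standard service-account JSON layout; the `validate_service_account` helper is an illustrative name, not from this repository:

```python
import json
from pathlib import Path

# Fields that google-auth/gspread need before authentication can work.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")

def validate_service_account(path):
    """Load the service-account JSON and verify the fields the Google
    Sheets client libraries rely on are present."""
    data = json.loads(Path(path).read_text())
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"{path} is missing fields: {missing}")
    if data.get("type") != "service_account":
        raise ValueError(f"{path} is not a service-account key (type={data.get('type')!r})")
    return data
```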

Usage Guide

Running the Application

Once you've installed the dependencies and set up the .env file and config/ folder, you can run the application using Streamlit.

  1. Start the Streamlit app:

    streamlit run app.py
  2. The application will launch in your web browser, displaying the dashboard where you can:

    • Upload CSV: Choose a CSV file with data.
    • Connect Google Sheets: Provide the link to your Google Sheet.
    • Select Primary Column: Choose the column from your dataset that contains the entities (e.g., company names).
    • Define a Query: Enter a custom query, such as "Get the email address of {company}", where {company} will be replaced with each entity's name from the dataset.
    • Extract Information: Click "Run Search" to start the search process and display extracted information.
    • Download Results: After the search completes, you can download the extracted results as a CSV file.
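The dynamic-placeholder step above can be sketched as a small helper that expands the template once per entity; `build_queries` is an illustrative name, and the exact placeholder handling is an assumption about the app's behavior:

```python
import re

def build_queries(template, entities):
    """Expand a query template such as 'Get the email address of {company}'
    for every entity in the chosen column. The first {placeholder} found
    is substituted; its actual name does not matter."""
    match = re.search(r"\{(\w+)\}", template)
    if not match:
        raise ValueError("The query template must contain a {placeholder}")
    token = "{" + match.group(1) + "}"
    return [template.replace(token, str(entity)) for entity in entities]
```

Each resulting query string can then be sent to the web-search API and the response passed to the LLM for extraction.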

Google Sheets Integration

To connect Google Sheets:

  1. Ensure that your Google Sheet is shared with the link set to "Anyone with the link can view."
  2. Paste the link of your Google Sheet into the input field.
  3. The app will load data from the sheet, allowing you to select a column and query it for information.
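To read the sheet, the app first needs the spreadsheet ID out of the pasted link. A minimal sketch of that extraction; the `sheet_id_from_url` helper is an illustrative name, not from this repository:

```python
import re

def sheet_id_from_url(url):
    """Pull the spreadsheet ID out of a shared Google Sheets link, e.g.
    https://docs.google.com/spreadsheets/d/<ID>/edit#gid=0 ."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", url)
    if not match:
        raise ValueError("Not a valid Google Sheets link")
    return match.group(1)
```

With gspread, the extracted ID could then be opened via `client.open_by_key(sheet_id)`.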

API Keys and Environment Variables

For the application to function properly, you need to configure the following API keys:

  • SerpAPI Key: This key is used for performing web searches. You can get your key by signing up on SerpAPI.
  • HuggingFace API Key: This key allows you to use the HuggingFace API for natural language processing. Obtain it from HuggingFace.
  • Google Service Account Key: You will need a Google service account key to authenticate with the Google Sheets API. Create the service account and download its JSON key in the Google Cloud console (IAM & Admin → Service Accounts).

Once you have the API keys, add them to your .env file as mentioned in the setup instructions.


YouTube Video

AI Agent for Web Search and Information Extraction - Video Tutorial

Watch the video below to see a demonstration of how the AI agent works, performing web searches and extracting specific information from datasets.

Video Highlights:

  • Introduction to the Project: Overview of the AI agent and its key features.
  • How the Agent Works: Walkthrough of how it processes CSV/Google Sheets data and performs web searches.
  • Custom Query Handling: See how users can define custom queries to extract specific information.
  • Results Extraction: Watch the process of collecting and viewing the extracted data.

YouTube Video Details

  • Title: AI Agent for Web Search and Information Extraction
  • Description: In this tutorial, we explain the functionalities of the AI agent, how it interacts with datasets, and extracts information using web searches and large language models.
  • Duration: 3:45
  • Published on: [Date]
  • Link: Watch the video here

Folder Structure

Info-Extraction/
├── app.py                # Main application file
├── requirements.txt      # List of dependencies
├── .env                  # Environment variables file
├── config/               # Folder for configuration files
│   └── gcp_service_account.json  # Google Service Account Key
└── README.md             # Project documentation