Prasukj7-arch/Info-Extraction

AI agent that reads through a dataset (CSV or Google Sheets) and performs a web search to retrieve specific information for each entity in a chosen column.


AI Agent for Web Search and Information Extraction

Project Description

This project implements an AI agent that reads a dataset (CSV or Google Sheets) and performs a web search for each entity in a chosen column. The agent uses a large language model (LLM) to parse the search results and extract the requested data, such as email addresses, company details, or other specified information. It also includes a user-friendly dashboard where users can upload files, define search queries, and view or download the extracted results.

Key Features

  • Upload CSV files or connect to Google Sheets.
  • Specify search queries with dynamic placeholders for entity values.
  • Perform web searches and extract relevant information using LLMs.
  • View extracted information in a structured format.
  • Download the extracted results as a CSV.
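The last feature, downloading results as a CSV, amounts to serializing one row per entity. A minimal standard-library sketch; the `results_to_csv` helper and the column names are illustrative assumptions, not names from this repository:

```python
import csv
import io

def results_to_csv(rows):
    """Serialize extracted results (one dict per entity) into CSV text
    suitable for a download button. `rows` is a list of dicts that all
    share the same keys; column order follows the first row."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In a Streamlit app, the returned string could be handed to `st.download_button` as the file contents.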

Setup Instructions

Prerequisites

Before setting up the project, ensure you have the following installed:

  • Python 3.x
  • pip (Python package installer)
  • A Google Cloud Project with Sheets API enabled and a service account key for authentication (for Google Sheets integration).

Installing Dependencies

  1. Clone the repository:

    git clone https://github.com/Prasukj7-arch/Info-Extraction.git
    cd Info-Extraction
  2. Create a virtual environment (optional but recommended):

    python3 -m venv venv
    source venv/bin/activate  # For macOS/Linux
    venv\Scripts\activate  # For Windows
  3. Install the required dependencies:

    pip install -r requirements.txt

Configuring Environment Variables and Service Account

Step 1: Create and Configure the .env File

  1. In the root directory of the project, create a .env file.

  2. Add the following variables to the .env file:

    SERPAPI_KEY=your_serpapi_key
    HUGGINGFACE_API_KEY=your_huggingface_api_key

  • Replace your_serpapi_key with your SerpAPI key for performing web searches.
  • Replace your_huggingface_api_key with your HuggingFace API key for natural language processing.
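At runtime the application needs both keys present in the environment. A minimal sketch of a startup check, assuming the .env file has already been loaded into the environment (for example via python-dotenv); the `load_api_keys` helper is an illustrative name, not from this repository:

```python
import os

REQUIRED_KEYS = ("SERPAPI_KEY", "HUGGINGFACE_API_KEY")

def load_api_keys():
    """Read the required API keys from the environment and fail fast
    with a clear message if any of them is missing or empty."""
    keys = {}
    for name in REQUIRED_KEYS:
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Missing required environment variable: {name}")
        keys[name] = value
    return keys
```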

Step 2: Set Up the config/ Folder

  1. Create a config/ folder in the root directory of your project.

  2. Place your Google Service Account JSON file inside the config/ folder. You can create a service account and download its JSON key in the Google Cloud console (IAM & Admin → Service Accounts).

  3. Add the path to the key in your .env file:

    GOOGLE_SERVICE_ACCOUNT_JSON=config/gcp_service_account.json
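Before attempting to authenticate, it can help to sanity-check the key file. A sketch of such a check, assuming the standard service-account JSON layout; the `validate_service_account` helper is an illustrative name, not from this repository:

```python
import json
from pathlib import Path

# Fields that google-auth/gspread need before authentication can work.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")

def validate_service_account(path):
    """Load the service-account JSON and verify the fields the Google
    Sheets client libraries rely on are present."""
    data = json.loads(Path(path).read_text())
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        raise ValueError(f"{path} is missing fields: {missing}")
    if data.get("type") != "service_account":
        raise ValueError(f"{path} is not a service-account key (type={data.get('type')!r})")
    return data
```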

Usage Guide

Running the Application

Once you've installed the dependencies and set up the .env file and config/ folder, you can run the application using Streamlit.

  1. Start the Streamlit app:

    streamlit run app.py
  2. The application will launch in your web browser, displaying the dashboard where you can:

    • Upload CSV: Choose a CSV file with data.
    • Connect Google Sheets: Provide the link to your Google Sheet.
    • Select Primary Column: Choose the column from your dataset that contains the entities (e.g., company names).
    • Define a Query: Enter a custom query, such as "Get the email address of {company}", where {company} will be replaced with each entity's name from the dataset.
    • Extract Information: Click "Run Search" to start the search process and display extracted information.
    • Download Results: After the search completes, you can download the extracted results as a CSV file.
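The dynamic-placeholder step above can be sketched as a small helper that expands the template once per entity; `build_queries` is an illustrative name, and the exact placeholder handling is an assumption about the app's behavior:

```python
import re

def build_queries(template, entities):
    """Expand a query template such as 'Get the email address of {company}'
    for every entity in the chosen column. The first {placeholder} found
    is substituted; its actual name does not matter."""
    match = re.search(r"\{(\w+)\}", template)
    if not match:
        raise ValueError("The query template must contain a {placeholder}")
    token = "{" + match.group(1) + "}"
    return [template.replace(token, str(entity)) for entity in entities]
```

Each resulting query string can then be sent to the web-search API and the response passed to the LLM for extraction.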

Google Sheets Integration

To connect Google Sheets:

  1. Ensure that your Google Sheet is shared with the link set to "Anyone with the link can view."
  2. Paste the link of your Google Sheet into the input field.
  3. The app will load data from the sheet, allowing you to select a column and query it for information.
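To read the sheet, the app first needs the spreadsheet ID out of the pasted link. A minimal sketch of that extraction; the `sheet_id_from_url` helper is an illustrative name, not from this repository:

```python
import re

def sheet_id_from_url(url):
    """Pull the spreadsheet ID out of a shared Google Sheets link, e.g.
    https://docs.google.com/spreadsheets/d/<ID>/edit#gid=0 ."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", url)
    if not match:
        raise ValueError("Not a valid Google Sheets link")
    return match.group(1)
```

With gspread, the extracted ID could then be opened via `client.open_by_key(sheet_id)`.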

API Keys and Environment Variables

For the application to function properly, you need to configure the following API keys:

  • SerpAPI Key: This key is used for performing web searches. You can get your key by signing up on SerpAPI.
  • HuggingFace API Key: This key allows you to use the HuggingFace API for natural language processing. Obtain it from HuggingFace.
  • Google Service Account Key: You will need a Google service account key to authenticate with the Google Sheets API. Create the service account and download its JSON key in the Google Cloud console (IAM & Admin → Service Accounts).

Once you have the API keys, add them to your .env file as mentioned in the setup instructions.


YouTube Video

AI Agent for Web Search and Information Extraction - Video Tutorial

Watch the video below to see a demonstration of how the AI agent works, performing web searches and extracting specific information from datasets.

Video Highlights:

  • Introduction to the Project: Overview of the AI agent and its key features.
  • How the Agent Works: Walkthrough of how it processes CSV/Google Sheets data and performs web searches.
  • Custom Query Handling: See how users can define custom queries to extract specific information.
  • Results Extraction: Watch the process of collecting and viewing the extracted data.

YouTube Video Details

  • Title: AI Agent for Web Search and Information Extraction
  • Description: In this tutorial, we explain the functionalities of the AI agent, how it interacts with datasets, and extracts information using web searches and large language models.
  • Duration: 3:45
  • Published on: [Date]
  • Link: Watch the video here

Folder Structure

Info-Extraction/
├── app.py                # Main application file
├── requirements.txt      # List of dependencies
├── .env                  # Environment variables file
├── config/               # Folder for configuration files
│   └── gcp_service_account.json  # Google Service Account Key
└── README.md             # Project documentation