Skip to content

Latest commit

 

History

History
105 lines (67 loc) · 7.92 KB

README.md

File metadata and controls

105 lines (67 loc) · 7.92 KB

NLP on GFE testing

AI contracts

This repository contains the Jupyter Notebooks for the ACF Data Surge Team’s project, titled NLP on GFE Testing.

Project Objectives

This project aims to identify AI use cases across federal agencies, particularly those providing public benefits (e.g., HHS, USDA, DOE, HUD, SSA, SBA, VA), and assess their applicability to the Administration for Children and Families (ACF). The project will explore the use of natural language processing (NLP) to analyze federal contract datasets, identifying AI-related procurements and summarizing key findings to enhance decision-making. Additionally, it will investigate the limitations of Government Furnished Equipment (GFE) laptops for NLP tasks, including the identification of suitable local LLM chatbot options that can be easily installed and operated within HHS-approved software constraints.

Project Overview

This project comprises a series of Jupyter notebooks dedicated to various machine learning and NLP tasks on contract datasets. The notebooks serve different purposes, including filtering the dataset for the last five years to identify AI-related keywords and visualizing word frequency through word clouds. Text pre-processing and summarization are conducted using the BART model, while topic modeling is achieved with NLTK and Gensim, followed by the LDA model to display and assign dominant topics. Frequently used AI techniques are extracted, contracts are clustered, and summarized using LLMs like BART, phi3, and llama3. URLs related to attachments are extracted using the Selenium library, and the attachments are handled and downloaded using a headless Firefox WebDriver. Additionally, text extraction from PDFs and Word files is performed for keyword identification and summarization. The project also includes a notebook that applies machine learning models to a COVID dataset and NLP on the attachments of contracts dataset, simulating various scenarios, noting computational loads on GFE.

Below are the links to the notebooks used in this project:

Data Pre-processing

AI usecases and Topic Tagging

Note: large language model (llama3) has outperformed BART model in both text summarization and topic identification for a given text and the prompt. Both models are heavily depend on the accuracy of the clustering technique used to group similar contracts.

GFE limits

Setting up a local chatbot using ollama and anything-llm:

To ensure you don't encounter network issues, ensure that VPN is disconnected prior to following the steps below. It also recommended that GitHub Desktop, and docker desktop applications to be installed on the GFE.

Setting up OLLAMA API on GFE

To integrate LLMs with Python on a GFE (OS: Windows), it needs to set up the Ollama API and install the Phi3 and Llama3 models. Follow these steps:

Download and Install Ollama API:

  • Visit Ollama's website.
  • Click on the "Download" button to get the Windows installer.
  • Run the installer and follow the on-screen instructions to complete the installation.

Install Models:

  • Open a command prompt or PowerShell.
  • Use the following commands to install the models:
    ollama pull phi3
    ollama pull llama3.1
    

It can take several minutes to download these models.

These steps will set up the Ollama API and the models on the GFE.

Setting up Anything-LLM on GFE

The "anything-llm" repository on GitHub is an all-in-one AI application that offers full AI Chatbot capabilities and allows for interacting with documents. This application can be run locally or hosted remotely and provides a customizable chat UI, developer API, and integration with popular vector databases.

In order to setup the local chatbot using this repository, firstly the GitHub repository needs to be cloned to the GFE using the link below:

https://github.com/ypadilla-arch/anything-llm.git

To run this docker:

Make sure you have disconnected from any VPN to avoid connection issues for the next step.

In a command prompt navigate to docker folder in the cloned reporitory.

Once here, execute the following command:

docker compose -f "docker-compose.yml" up -d

It will take a several minutes to download, build and execute the container for the application.

Once everything is loaded you can navigate on your browser to: http://localhost:3001

The Anything-LLM application is already pre-configured in this repo for ease of use. The next steps are to:

  • Create a workspace
  • Upload some documents there (short documents have worked better in our testing)
  • Add the documents to the workspace

After this step you can chat with your documents as context in the chat interface. Depending on how much it is provided to the model, it can take several minutes for the model to provide a response. You can follow additional guidance through the main page.

Execute the following command to stop/remove the docker containers:

docker compose down