Retrieval-Augmented Generation (RAG) System for Scientific Articles

Overview

This repository implements a Retrieval-Augmented Generation (RAG) system designed to extract information from scientific articles in PDF format, generate embeddings, perform document retrieval, and produce answers or summaries using GPT-2.

The system focuses on extracting and processing text from PDFs, embedding the text into vectors, retrieving relevant document chunks with FAISS, and using a GPT-2 model for answer generation.

Project Structure

data/ - This folder holds the PDF files (e.g., a1.pdf, a2.pdf) that will be processed.
output/ - Stores the embeddings and any generated answers or summaries.
src/ - Contains Jupyter notebooks for each core step of the RAG pipeline:
- rag_for_biomass.ipynb: Extracts text from PDF files using PyMuPDF. Generates text embeddings using Sentence-Transformer. Implements FAISS-based retrieval of relevant document chunks. Generates answers using GPT-2 based on retrieved chunks.
requirements.txt - Lists all required Python libraries. Install them using pip.

Prerequisites and Requirements

Before running the code, make sure you have the following:

Python: Version 3.6 to 3.10 (compatible with faiss-gpu).
PyMuPDF: For efficient PDF text extraction.
Sentence-Transformers: For generating sentence embeddings using the all-MiniLM-L6-v2 model.
FAISS: For efficient similarity search and retrieval.
Transformers: For natural language processing tasks using GPT2LMHeadModel and GPT2Tokenizer for answer generation.

Install dependencies

You can install the required libraries using: pip install -r requirements.txt

Name	Name	Last commit message	Last commit date
Latest commit erenarkangil Update RAG_for_bioeconomy_articles.ipynb Nov 6, 2024 7a478a6 · Nov 6, 2024 History 41 Commits
data	data	Delete data/deleteme.html	Nov 6, 2024
RAG_for_bioeconomy_articles.ipynb	RAG_for_bioeconomy_articles.ipynb	Update RAG_for_bioeconomy_articles.ipynb	Nov 6, 2024
README.md	README.md	Update README.md	Nov 6, 2024
requirements.txt	requirements.txt	Update requirements.txt	Nov 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Retrieval-Augmented Generation (RAG) System for Scientific Articles

Overview

Project Structure

Prerequisites and Requirements

Install dependencies

About

Releases

Packages

Languages

erenarkangil/RAG_for_circular_economy_articles

Folders and files

Latest commit

History

Repository files navigation

Retrieval-Augmented Generation (RAG) System for Scientific Articles

Overview

Project Structure

Prerequisites and Requirements

Install dependencies

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages