Academic Paper RAG System

A Retrieval-Augmented Generation (RAG) system for academic papers that enables semantic search and question answering using OpenAI embeddings and language models.

Overview

This system allows you to search for academic papers on Google Scholar, download and process them, and then query them using natural language. The system uses OpenAI embeddings for semantic search and OpenAI's GPT models for answering questions based on the retrieved papers.

Features

Search for academic papers on Google Scholar using SerpAPI
Download PDFs and convert them to Markdown using Docling
Generate embeddings for papers using OpenAI's text-embedding-3-small model
Store embeddings in a FAISS vector database for efficient retrieval
Answer queries using RAG with OpenAI's GPT-4o-mini model

Requirements

API Keys

The following API keys are required and should be set in a .env file:

OPENAI_API_KEY: Required for embedding generation and RAG
SERPAPI_API_KEY: Required for Google Scholar search

Environment Variables

Additional environment variables that can be set in the .env file:

PDF_STORAGE_PATH: Directory to store downloaded PDFs (default: "./pdfs")
MARKDOWN_STORAGE_PATH: Directory to store converted Markdown files (default: "./papers")
CHROMA_PERSIST_DIRECTORY: Directory to store the vector database (default: "./vector_db"). Despite the Chroma name, the vector store described in this README is FAISS.
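
As an illustration of how the variables above might be read at startup, here is a minimal sketch that uses only the standard library. The names Settings and load_settings are hypothetical and not taken from the repository. It assumes the .env file has already been copied into the process environment, for example by calling load_dotenv() from python-dotenv before this function.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the configuration values listed in this README."""
    openai_api_key: str
    serpapi_api_key: str
    pdf_storage_path: str
    markdown_storage_path: str
    chroma_persist_directory: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read required keys and optional paths, failing early if a key is missing."""
    env = os.environ if env is None else env
    required = ("OPENAI_API_KEY", "SERPAPI_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],
        serpapi_api_key=env["SERPAPI_API_KEY"],
        # Defaults mirror the ones documented above.
        pdf_storage_path=env.get("PDF_STORAGE_PATH", "./pdfs"),
        markdown_storage_path=env.get("MARKDOWN_STORAGE_PATH", "./papers"),
        chroma_persist_directory=env.get("CHROMA_PERSIST_DIRECTORY", "./vector_db"),
    )
```

Passing a plain dict instead of os.environ makes the function easy to exercise in tests without touching the real environment.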

Python Dependencies

pip install requests python-dotenv serpapi docling llama-index openai faiss-cpu

Usage

Command Line Interface

The system provides a command-line interface through main.py:

Ingesting Papers

python main.py ingest "quantum computing algorithms" --num-results 5

This command will:

Search for "quantum computing algorithms" on Google Scholar
Download the top 5 papers
Convert them to Markdown
Generate embeddings using OpenAI
Store the embeddings in the vector database
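
The five steps above can be pictured as a single function that passes data from one stage to the next. The sketch below is not the code in main.py: the function name ingest and every callable it accepts are hypothetical placeholders standing in for the SerpAPI search, the PDF download, the Docling conversion, the OpenAI embedding call, and the FAISS store.

```python
from typing import Callable, List, Sequence


def ingest(
    query: str,
    num_results: int,
    search: Callable[[str, int], List[str]],
    download: Callable[[str], str],
    to_markdown: Callable[[str], str],
    embed: Callable[[Sequence[str]], List[List[float]]],
    store: Callable[[List[List[float]], List[str]], None],
) -> int:
    """Run the ingestion steps in order and return the number of papers stored."""
    pdf_urls = search(query, num_results)            # 1. search Google Scholar
    pdf_paths = [download(url) for url in pdf_urls]  # 2. download each PDF
    documents = [to_markdown(p) for p in pdf_paths]  # 3. convert PDF to Markdown
    if not documents:
        return 0                                     # nothing found, nothing to embed
    vectors = embed(documents)                       # 4. one embedding per document
    store(vectors, documents)                        # 5. add to the vector database
    return len(documents)
```

Injecting the stages as callables keeps the control flow visible and lets each stage be replaced by a fake during testing.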

Querying Papers

python main.py query "What are the latest advancements in quantum computing algorithms?"

This command will:

Convert your query to an embedding using OpenAI
Find the most relevant papers in the vector database
Use RAG with GPT-4o-mini to generate an answer based on the retrieved papers
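
To make the retrieval step concrete, the following sketch ranks stored vectors against a query vector and assembles a prompt from the best matches. It is a pure-Python stand-in, not the project's implementation: a brute-force cosine similarity replaces the FAISS search, and the function names (cosine, top_k, build_prompt) and the prompt wording are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar documents, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Combine numbered passages and the question into one prompt string."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

In the real system the query vector would come from the embedding model and the resulting prompt would be sent to the language model; both calls are left out here so the sketch runs without network access.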

Architecture

The system consists of the following components:

scholar_search.py: Google Scholar search via SerpAPI
text_extraction.py: PDF downloading and conversion to Markdown
embedding.py: OpenAI embedding generation
vector_store.py: FAISS vector database integration
rag_engine.py: LlamaIndex + GPT-4o-mini for answering queries
main.py: Main entry point with CLI
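
For readers curious how an entry point like main.py could expose the two commands shown under Usage, here is a minimal argparse sketch. It is an assumption about the structure, not a copy of the repository's code: the helper names, the argument names search_query and question, and the default of 5 for --num-results are all hypothetical.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the ingest and query subcommands."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Academic Paper RAG System"
    )
    sub = parser.add_subparsers(dest="command", required=True)

    ingest = sub.add_parser("ingest", help="Search, download, and index papers")
    ingest.add_argument("search_query", help="Google Scholar search query")
    ingest.add_argument(
        "--num-results",
        type=int,
        default=5,  # assumed default; the README only shows an explicit value
        help="Number of papers to download",
    )

    query = sub.add_parser("query", help="Ask a question about ingested papers")
    query.add_argument("question", help="Natural-language question")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv (or sys.argv when None) into a namespace."""
    return build_parser().parse_args(argv)
```

With this layout, args.command tells the caller which component to dispatch to, and argparse stores --num-results as args.num_results.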

Notes

The system uses a FAISS vector store rather than ChromaDB to avoid compatibility issues with ChromaDB's Rust bindings on Windows.
The FAISS index is held in memory while the program runs and is saved to disk so that it persists across runs.
OpenAI embeddings provide high-quality semantic search capabilities.

Repository: balakrish181/semantic_search
