This project is a RAG (Retrieval-Augmented Generation) based smart search system designed to assist users in finding and querying information related to the free courses available on Analytics Vidhya. The system provides relevant course recommendations and answers user queries based on natural language inputs.
The system uses a RAG-based search architecture to help users discover relevant free courses on the Analytics Vidhya website through natural language queries. The workflow operates as follows:
1. Data Collection and Preprocessing:
- The system begins by scraping course information and webpage content from Analytics Vidhya's website
- The scraped text content is cleaned and structured for further processing
2. Vector Processing:
- An embedding model converts the processed text content into high-dimensional vectors
- These vectors capture the semantic meaning of the course content
- The vectorized content is stored in a vector database (vectorstore) for efficient retrieval
3. Query Processing:
- A user submits a request as a natural language query or as keywords
- The same embedding model converts these queries into vectors for comparison
- The system performs vector-based similarity search to find relevant matches in the vectorstore
4. Content Retrieval and Generation:
- The vector similarity search identifies the most relevant course content
- This relevant context is passed to a Large Language Model (LLM)
- The LLM processes the retrieved context along with the user's query
- It generates personalized responses and course recommendations
5. Output:
- The system provides detailed responses about relevant free courses
- Includes specific course recommendations tailored to the user's interests
- Offers context-aware suggestions based on the query's intent
The key advantage of this RAG architecture is that it combines the power of semantic search through vector embeddings with the natural language understanding capabilities of LLMs, resulting in more accurate and contextually relevant course recommendations for users interested in Analytics Vidhya's free educational content.
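The query-time half of this workflow (embed the query, rank stored chunks by similarity, pass the best matches to the LLM) can be sketched in plain Python. This is a minimal illustration under loud assumptions, not the project's code: the bag-of-words `embed` function stands in for the real embedding model, an in-memory list stands in for the vectorstore, the course snippets are made-up placeholders, and the final LLM call is left out so that only prompt assembly is shown. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, store, k=2):
    """Return the k stored chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query, chunks):
    """Assemble the retrieved context and the user query into one LLM prompt."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return (
        "Answer using only the course context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )


# Made-up placeholder snippets, not real scraped course data.
docs = [
    "Introduction to Python: learn Python basics for data science",
    "Getting started with neural networks and deep learning",
    "Tableau for beginners: build dashboards and visualizations",
]
# The "vectorstore": each chunk is stored next to its vector.
store = [(doc, embed(doc)) for doc in docs]

query = "python course for data science"
top_chunks = retrieve(query, store, k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

In the real system the prompt would be sent to the LLM, and the toy pieces would be replaced by the embedding model and vector database listed in the tech stack.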
- AI-Powered Search: Utilizes a combination of semantic search and generative AI to deliver precise results.
- Llama 3.3 70B via Groq: The system uses Meta's Llama 3.3 70B model, served through Groq, for natural language understanding and response generation.
- Course Content Scraper: Automated web scraping to collect course content and metadata.
- RAG Implementation: Combines vector-based retrieval with a language model to improve search relevance.
- Vector Search with Pinecone: Efficient vector storage and similarity search using Pinecone.
- Streamlit Web App: Interactive, user-friendly interface built with Streamlit.
- Python: Core programming language
- BeautifulSoup4 & Requests: Web scraping tools
- Streamlit: Framework for building web applications
- Llama 3.3 70B via Groq: Large language model for response generation
- BGE-base-en-v1.5: Embedding model used to generate text embeddings for semantic search
- Pinecone: Vector database for storing and managing embeddings
- LangChain: Library for integrating RAG components
- Data Collection:
- Stored all course links in a text file for batch processing.
- Scraped course content from Analytics Vidhya using BeautifulSoup4 and requests.
- Data Cleaning:
- Removed redundant content (e.g., common descriptions across all courses).
- Split the content into smaller chunks suitable for the model's context window.
- Embedding & Storage:
- Generated vector embeddings for the cleaned content using the BGE-base-en-v1.5 embedding model.
- Stored vectors in Pinecone for efficient retrieval.
- RAG Implementation:
- Used a retrieval-augmented generation approach where relevant course sections are retrieved and passed to the model for generating answers.
- Streamlit Web App:
- Developed a visually appealing Streamlit app to allow users to search and get course recommendations.
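The data cleaning step above (dropping descriptions shared by every course, then splitting text into chunks) could look roughly like the following. This is a plain-Python sketch under stated assumptions rather than the repository's actual code: the function names, the word-based chunking, and the default `chunk_size` and `overlap` values are all hypothetical, and the project may instead rely on a LangChain text splitter.

```python
from collections import Counter


def remove_common_lines(pages):
    """Drop lines that appear on every page, such as a shared boilerplate description.

    `pages` maps a page identifier to its scraped text.
    """
    if len(pages) < 2:
        # With a single page nothing can be judged "common".
        common = set()
    else:
        counts = Counter()
        for text in pages.values():
            # Count each distinct non-empty line once per page.
            counts.update({line.strip() for line in text.splitlines() if line.strip()})
        common = {line for line, n in counts.items() if n == len(pages)}

    cleaned = {}
    for key, text in pages.items():
        kept = [
            line.strip()
            for line in text.splitlines()
            if line.strip() and line.strip() not in common
        ]
        cleaned[key] = "\n".join(kept)
    return cleaned


def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most `chunk_size` words.

    Neighbouring chunks share `overlap` words so that context is not cut
    abruptly at a chunk boundary.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Made-up placeholder pages, not real scraped data.
pages = {
    "course-a": "Free course\nLearn Python",
    "course-b": "Free course\nLearn SQL",
}
cleaned = remove_common_lines(pages)
small_chunks = chunk_words("one two three four five", chunk_size=3, overlap=1)
print(cleaned)
print(small_chunks)
```

Each resulting chunk would then be embedded and stored in the vector database together with metadata such as the course link.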
📂 CourseLens
├── 📂 images
│   ├── img-1.png
│   ├── img-2.png
│   ├── img-3.png
│   ├── img-4.png
│   └── workflow.png
├── 📂 data
│   ├── av-free-course-data.txt
│   └── extracted_content.csv
├── 📂 scraping
│   ├── collect_links.py
│   └── scrape.py
├── indexing.py
├── app.py
├── query.py
├── requirements.txt
├── README.md
└── .gitignore
- Clone the Repository:
git clone https://github.com/devroopsaha744/CourseLens
cd CourseLens
- Install Dependencies:
pip install -r requirements.txt
- Run the Streamlit App:
streamlit run app.py
- Integration with additional LLMs.
- Fine-tuning the Llama 3.3 70B model for domain-specific tasks.
- Multi-language support.
Feel free to reach out if you have any questions or suggestions!
Created by datafreak aka Devroop Saha