Skip to content

JB RAG Frameworks

Trishal Kumar edited this page Jul 29, 2024 · 2 revisions

Ragas, an open-sourced evaluation framework to test the reliability of answers generated from any RAG pipeline was used to test Jugalbandi. This process involved the following steps:

Synthetic dataset creation: Ragas used OpenAI LLMs to generate a varied set of questions from a predefined dataset. These questions could be one of three:

  • simple,
  • questions whose answers require reasoning, and
  • questions whose answers require information from different contexts

The dataset stores the corresponding context from which the questions were generated, and refers to it as the ‘ground truth’.

Jugalbandi answer generation: The questions generated in the synthetic dataset are passed to Jugalbandi to generate answers. Jugalbandi retrieves relevant chunks (or embeddings) of information from the knowledge base using its similarity search algorithms. These chunks are then passed to a generation model (such as GPT-3.5 or GPT-4o) to produce the answers.

Comparison of Jugalbandi’s answers with the ground truth: In the evaluation of Jugalbandi’s RAG pipeline against Ragas’ ground truth, four key metrics were used to assess performance:

  • Answer Relevance: How relevant the generated answer is to the question.
  • Faithfulness: How accurately the answer reflects the information in the context (ground truth).
  • Context Recall: The ability to retrieve relevant chunks from the knowledge base.
  • Context Precision: The precision of the retrieved chunks in providing the correct answer.

Results: The metrics are analysed to determine the effectiveness of the RAG pipeline. For example, in one experiment which used a legal information knowledge base, with 998 questions, Jugalbandi achieved 85.45% context precision and 92.74% context recall. These results indicate a high level of accuracy in retrieving and generating answers based on the provided knowledge base.

Experiment limitations: While the evaluation framework is a useful tool to assess the performance of Jugalbandi’s chunking and retrieval strategies, some limitations that couldn’t be tested are:

  1. Question Variety: The synthetic dataset mostly includes straightforward questions, whereas user questions can vary significantly in complexity and phrasing. This limitation can affect the evaluation's comprehensiveness. Citizen centric applications will tend to get more open-ended questions, where determining the intent of the user and relating it to the available knowledge base is not always a straightforward task.

  2. Model Dependency: The quality of answers depends on the underlying LLM (e.g., GPT-3.5 or GPT-4). Smaller models like Phi-3 may not perform as well for complex questions, although they are more lightweight and can be deployed on devices with limited resources. Testing with the same LLMs being used both for Jugalbandi and Ragas means that the context may be very similar for the generated and evaluation datasets. In the experiments it was found that answers generated using GPT-4 tended to yield better results in their evaluation, as compared to when GPT-3.5 was used. While this may be attributed to the advances made in the LLM, it’s unclear how it affects the assessment using automated frameworks.

  3. Limited to legal information: The experiments conducted with the evaluation framework were limited to legal datasets. With advances being made in context specific RAG pipelines, the performance metrics generated when evaluating answers from a legal dataset may not apply to other contexts.

  4. Need for manual checks by SMEs: The answers generated by Jugalbandi were compared with the ‘ground truth’ as determined by the frameworks, which in turn is generated using Open AI’s embedding models. It is preferable for the comparison to be made with the ground truth as determined by a subject matter expert for any given context.

  5. Inaccurate Relevance Scores: Evaluation frameworks sometimes overestimate how closely an answer matches the full context. For example, if an answer is based on only a small part of the provided information, Ragas might still give it a high score, which can be misleading.

While this evaluation framework provides valuable insights, it is one component of what must be a much more comprehensive testing process. This highlights the need for extensive testing of the RAG pipeline by experienced professionals or individuals familiar with the knowledge base.

Clone this wiki locally