
Commit 3e47e15

Merge pull request #9 from BU-Spark/aseef_qa
copied our detailed research from notion and had GenAI organize it fo…
2 parents ff0a42b + fd7c159 commit 3e47e15

File tree

4 files changed: +167 -11 lines

assets/llm_performance_graph.png (81.5 KB)
assets/llm_price_ratio.png (69 KB)
assets/wip_rag_pipeline.png (137 KB)

research.md (+167 -11)

# Research Documentation

This document contains our team's collective research from the past few weeks. Our original research was compiled in Notion.

## Table of Contents

1. [Cost Analysis](#1-cost-analysis)
2. [Context Queries](#2-context-queries)
3. [Model Temperature for Consistency](#3-model-temperature-for-consistency)
4. [Retrieval-Augmented Generation (RAG)](#4-retrieval-augmented-generation-rag)
5. [Model Comparisons](#5-model-comparisons)
6. [College Exam Grader using LLM AI Models](#6-college-exam-grader-using-llm-ai-models)
7. [Other Relevant Papers](#7-other-relevant-papers)
8. [Other Relevant Projects (and Frameworks)](#8-other-relevant-projects-and-frameworks)

---

## 1. Cost Analysis

**Objective:**
To compare the cost of self-hosting an LLM on a cloud provider for inference against using a hosted API.

**Model:**
DeepSeek-R1-Distill-Llama-70B (43 GB VRAM requirement)
- **GPUs:** NVIDIA RTX A6000 or A100

**Token Output Estimates:**
- A100: ~40-45 tokens/sec (estimated)
- A6000: ~20-25 tokens/sec (estimated)
- Actual tested performance on an A6000: 12-13 tokens/sec

**Cost Analysis:**
- Spot VM pricing (GCP/Azure/Hyperstack):
  - A100: ~$1.29/hr
  - A6000: ~$0.50/hr
- 1M output tokens on an A6000 at ~13 tokens/sec: ~21.4 GPU-hours, or roughly $10.70 (see the arithmetic sketch below)

**Conclusion:**
Using an API is significantly more cost-effective than self-hosting, owing to the optimization expertise and economies of scale of API providers.
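
As a sanity check on the numbers above, a minimal sketch of the hourly-rate-to-per-token conversion. This assumes one continuous single-stream generation at the quoted spot rates; real costs depend heavily on batching, utilization, and startup overhead:

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """USD cost to generate 1M output tokens on a rented GPU VM."""
    hours_needed = 1_000_000 / tokens_per_sec / 3600
    return hours_needed * hourly_rate_usd

print(cost_per_million_tokens(0.50, 13))   # A6000 spot, measured throughput: ~$10.68
print(cost_per_million_tokens(1.29, 42))   # A100 spot, estimated throughput:  ~$8.53
```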

---

## 2. Context Queries

**Description:**
Exploring how queries that rely on previous conversation context can be processed cost-effectively across platforms.

**Findings:**
- LLMs are stateless; they don't retain conversation history between requests.
- Previous conversation history must be resent with each new query to maintain continuity (see the sketch below).
- Optimizations include:
  - Summarizing past contexts.
  - Utilizing **Key-Value (KV) Caches** for efficient memory use.
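
Since the model is stateless, the client has to carry the conversation. A minimal sketch of that pattern with a naive summarization hook to cap context growth (the `summarize` stub and the message budget are illustrative assumptions, not any specific provider's API):

```python
# Each request must carry the full context the model should see.
history = [{"role": "system", "content": "You are a grading assistant."}]

def summarize(messages: list[dict]) -> dict:
    # Placeholder: in practice, ask the LLM itself to compress old turns
    # into a short summary message.
    text = " ".join(m["content"] for m in messages)
    return {"role": "system", "content": f"Summary of earlier turns: {text[:500]}"}

def build_request(user_msg: str, max_messages: int = 20) -> list[dict]:
    global history
    history.append({"role": "user", "content": user_msg})
    if len(history) > max_messages:
        # Compress everything except the most recent turns.
        old, recent = history[:-10], history[-10:]
        history = [summarize(old)] + recent
    return history  # send this full list with every API call
```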

**Resources:**
- [Efficient Scaling with KV Caches (Research Paper)](https://arxiv.org/pdf/2211.05102)
- [DeepSeek API Pricing](https://api-docs.deepseek.com/quick_start/pricing)
- [LangChain Memory Blog](https://medium.com/@vinayakdeshpande111/how-to-make-llm-remember-conversation-with-langchain-924083079d95)

---

## 3. Model Temperature for Consistency

**Objective:**
To test whether setting temperature to zero eliminates grading inconsistencies.

**Findings:**
- Lowering temperature improves consistency but doesn't fully eliminate randomness.
- GPU rounding errors can introduce variability even at a temperature of zero.
- Consistency improves when self-hosting the LLM (running every request on the same GPU setup); a simple probe for measuring this is sketched below.
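
A quick way to quantify the residual randomness: send the same prompt repeatedly at temperature 0 and count distinct outputs. A sketch assuming an OpenAI-compatible chat-completions endpoint (the model name is a placeholder):

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def probe_consistency(prompt: str, n: int = 10, model: str = "gpt-4o-mini") -> Counter:
    outputs = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        outputs.append(resp.choices[0].message.content)
    # With perfect determinism, this Counter would contain a single key.
    return Counter(outputs)

print(probe_consistency("Grade this answer out of 10: 'Photosynthesis makes ATP.'"))
```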

**Discussion Threads:**
- [OpenAI Randomness Discussion 1](https://community.openai.com/t/why-does-openai-api-behave-randomly-with-temperature-0-and-top-p-0/934104)
- [OpenAI Randomness Discussion 2](https://community.openai.com/t/clarifications-on-setting-temperature-0/886447)

---

## 4. Retrieval-Augmented Generation (RAG)

**Description:**
RAG combines LLMs with external knowledge retrieval to produce up-to-date, accurate responses.

**Advantages:**
- No retraining required for knowledge updates.
- Reduces hallucination risk by grounding answers in curated data.
- Enhances transparency by referencing sources.

**Workflow (sketched in code below):**
1. The user query is sent to a knowledge base.
2. Relevant information is retrieved.
3. The LLM generates output using the retrieved data.
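
A minimal sketch of that three-step loop. The toy lexical-overlap scorer stands in for a real embedding model, and the documents are illustrative; this is not our actual pipeline:

```python
def score(query: str, doc: str) -> float:
    # Toy lexical overlap; a real pipeline would use dense embeddings
    # and cosine similarity instead.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

documents = [
    "Rubric: full credit requires citing the source.",
    "Policy: late submissions lose 10% per day.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Steps 1-2: search the knowledge base and keep the top-k matches.
    return sorted(documents, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str) -> str:
    # Step 3: ground the LLM's generation in the retrieved context.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context above."

print(build_prompt("How are late submissions graded?"))
```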

**Our Current (very) W.I.P. RAG Pipeline:**
![RAG Pipeline](./assets/wip_rag_pipeline.png)

**Resources:**
- [AWS - RAG Explanation](https://aws.amazon.com/what-is/retrieval-augmented-generation/)
- [Azure - RAG Overview](https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-retrieval-augmented-generation-rag)
- [RAG Research Paper](http://proceedings.mlr.press/v119/guu20a/guu20a.pdf)

---

## 5. Model Comparisons

![LLM Performance](./assets/llm_performance_graph.png)
![LLM Price Ratio](./assets/llm_price_ratio.png)

A comprehensive site for crowdsourced, human-judged model comparisons:
- https://lmarena.ai/

---

## 6. College Exam Grader using LLM AI Models

**Citation:**
J. X. Lee and Y.-T. Song, "College Exam Grader using LLM AI models," *2024 IEEE/ACIS 27th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)*, Beijing, China, 2024, pp. 282-289. [DOI](https://doi.org/10.1109/SNPD61259.2024.10673924)

**Notes:**
Our project will extend this research by exploring additional models not tested in their experiments. While they addressed variance in grading, especially for borderline cases, our work will focus on handling more complex and nuanced questions through advanced prompt and rubric engineering.

**Key Insights:**
- **Models Used:** GPT-3.5, GPT-4.0, Gemini-Pro
  - GPT-4.0 had the best consistency and lowest error rates across multiple grading experiments.
- **Question Types:** Primarily short-answer questions with objective grading metrics.
- **Rubric Design:** Utilized conditional logic and set theory for accurate point allocation.
- **Prompt Engineering:** Crafted prompts embedding the rubric and sample responses for one-shot learning with low randomness (temperature 0.2); a rough sketch of this prompt shape follows below.
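
To make the prompt-engineering point concrete, here is one shape such a one-shot grading prompt could take. The rubric text, sample answer, and field names are our own illustrations, not taken from the paper:

```python
RUBRIC = """Award 2 points if the answer names the mechanism,
1 point if it only describes the effect, 0 otherwise."""

SAMPLE_ANSWER = "Enzymes lower activation energy."   # the graded example
SAMPLE_GRADE = "2/2 - names the mechanism."

def grading_prompt(student_answer: str) -> str:
    # One-shot: the rubric plus a single graded example, then the new answer.
    return (
        f"You are a strict exam grader.\n"
        f"Rubric:\n{RUBRIC}\n\n"
        f"Example answer: {SAMPLE_ANSWER}\n"
        f"Example grade: {SAMPLE_GRADE}\n\n"
        f"Student answer: {student_answer}\n"
        f"Grade (points and one-line justification):"
    )

# Send grading_prompt(...) to the model with temperature=0.2, per the paper.
```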

**Experiments Summary:**
1. **Mock-up Grading:** Tested variance across repeated evaluations.
2. **Human vs. AI:** Compared grading accuracy and consistency, with GPT-4 outperforming humans in certain aspects.
3. **Rephrased Responses:** Maintained scoring consistency across different phrasings of the same answer.
4. **Actual Exam Grading:** Achieved 91.5% grading alignment with human TAs.

**Discussion:**
- **Strengths:** GPT-4.0 demonstrated high reliability on objective grading tasks.
- **Limitations:** Struggles with subjective questions and typos; large-scale deployment also raises cost concerns.

**CoGrader:**
- A commercial AI grading platform primarily used in K-12 settings.
- Supports custom rubrics and offers a 30-day free trial for experimentation.
- [CoGrader Website](https://cograder.com/)

---

## 7. Other Relevant Papers

- **Chain of Thought Prompting:** [Read Here](https://arxiv.org/abs/2201.11903)
  - Chain-of-thought prompting is an important technique for self-consistency checks; this is the foundational paper in the field.
- **Self-Consistency Improves Chain of Thought Reasoning:** [PDF](https://arxiv.org/pdf/2203.11171)
  - Enhances grading accuracy by rerunning reasoning paths and selecting the most consistent answer (see the sketch after this list).
  - Similar in spirit to tree ensembles: the same model is run several times and the results are aggregated to improve the reliability of the output.
- **Other Grading Research Papers:**
  - Mok et al., 2024 - Physics grading using LLMs - [arXiv](https://arxiv.org/html/2411.13685v1)
    - Provides a hands-on evaluation of AI grading in STEM subjects, applicable for testing a model's effectiveness on objective grading tasks.
  - Flodén, 2024 - AI vs. Human Grading - [DOI](https://doi.org/10.1002/berj.4069)
    - Highlights the strengths and weaknesses of AI-based grading compared to human assessment, which can guide benchmarking efforts for the project.
    - For example, whether current AI grading consistency matches human consistency is still an open question.
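
A minimal sketch of the self-consistency idea: sample several reasoning paths at a nonzero temperature, then majority-vote on the final answer. The `ask_model` stub below simulates a stochastic chat-completion call rather than hitting a real API:

```python
import random
from collections import Counter

def ask_model(prompt: str) -> str:
    # Placeholder for a chat-completion call with temperature > 0.
    # Here we simulate reasoning paths that usually, but not always, agree.
    return random.choices(["2/2", "1/2"], weights=[0.7, 0.3])[0]

def self_consistent_grade(prompt: str, n_samples: int = 5) -> str:
    # Sample several reasoning paths and keep the most common final answer.
    answers = [ask_model(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistent_grade("Grade: 'Mitochondria produce ATP.'"))
```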

---

## 8. Other Relevant Projects (and Frameworks)

### Haystack for Knowledge Retrieval
- A production-ready RAG pipeline framework
- [Website](https://haystack.deepset.ai/)

### AI-Handwrite-Grader by wongcyrus
- Grades handwritten submissions using AI
- Accessible via GitHub Codespaces
- [GitHub Repository](https://github.com/wongcyrus/AI-Handwrite-Grader)

### GradeAI by GradeAI
- An auto-grader built in Python using GPT-3.5, OCR, and Flask
- Streamlines the grading of student submissions
- [GitHub Repository](https://github.com/GradeAI/GradeAI)
