# Research Documentation

This document contains the collective research of our team over the past few weeks. Our original research was compiled in Notion.
## Table of Contents
1. [Cost Analysis](#1-cost-analysis)
2. [Context Queries](#2-context-queries)
3. [Model Temperature for Consistency](#3-model-temperature-for-consistency)
4. [Retrieval-Augmented Generation (RAG)](#4-retrieval-augmented-generation-rag)
5. [Model Comparisons](#5-model-comparisons)
6. [College Exam Grader using LLM AI Models](#6-college-exam-grader-using-llm-ai-models)
7. [Other Relevant Papers](#7-other-relevant-papers)
8. [Other Relevant Projects (and Frameworks)](#8-other-relevant-projects-and-frameworks)

---
## 1. Cost Analysis

**Objective:**
Compare the cost of self-hosting an LLM on a cloud provider for inference vs. using a third-party API.

**Model:**
DeepSeek-R1-Distill-Llama-70B (43 GB VRAM requirement)
- **GPUs:** NVIDIA RTX A6000 or A100

**Token Output Estimates:**
- A100: ~40-45 tokens/sec (estimated)
- A6000: ~20-25 tokens/sec (estimated)
- Actual tested performance on A6000: 12-13 tokens/sec

**Cost Analysis:**
- Spot VM pricing (GCP/Azure/Hyperstack):
  - A100: ~$1.29/hr
  - A6000: ~$0.50/hr
- 1M tokens on A6000: ~$641 (at ~13 tokens/sec)

**Conclusion:**
Using an API is significantly more cost-effective than self-hosting, thanks to API providers' optimization expertise and economies of scale.
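
As a back-of-envelope reference, here is a minimal sketch of the arithmetic behind per-token self-hosting estimates (assumptions: sustained throughput, spot pricing, output tokens only). The effective cost per token rises sharply when the billed GPU sits idle between requests, which is one reason quoted totals vary so widely.

```python
# Back-of-envelope self-hosting cost, using the spot prices and measured
# throughput listed above. `utilization` is the fraction of billed GPU
# time actually spent generating tokens; 1.0 is an idealized floor that
# bursty, single-user workloads rarely approach.
def usd_per_million_tokens(tokens_per_sec: float, usd_per_hour: float,
                           utilization: float = 1.0) -> float:
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return 1_000_000 / tokens_per_hour * usd_per_hour

# A6000 spot instance at the measured ~13 tokens/sec:
print(f"${usd_per_million_tokens(13, 0.50):.2f} per 1M tokens (GPU never idle)")
print(f"${usd_per_million_tokens(13, 0.50, 0.02):.2f} per 1M tokens (GPU ~98% idle)")
```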

---

## 2. Context Queries

**Description:**
Exploring how queries that rely on previous conversation context can be processed cost-effectively across platforms.

**Findings:**
- LLMs are stateless; they don't retain conversation history between requests.
- Previous conversation history must therefore be resent with each new query to maintain context (see the sketch below).
- Optimizations include:
  - Summarizing past context.
  - Utilizing **Key-Value (KV) Caches** for efficient memory use.
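
Below is a minimal sketch of this resend-everything pattern using the OpenAI Python client (the model name is a placeholder; DeepSeek's API exposes the same OpenAI-compatible interface):

```python
# Minimal sketch: chat LLM APIs are stateless, so the full message
# history is resent on every call. Cost therefore grows with history
# length, which is what summarization and KV caching mitigate.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a grading assistant."}]

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=history,     # the entire history goes up each time
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

ask("Summarize the grading rubric.")
ask("Now apply it to the first student answer.")  # sees the prior turn
```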

**Resources:**
- [Efficient Scaling with KV Caches (Research Paper)](https://arxiv.org/pdf/2211.05102)
- [DeepSeek API Pricing](https://api-docs.deepseek.com/quick_start/pricing)
- [LangChain Memory Blog](https://medium.com/@vinayakdeshpande111/how-to-make-llm-remember-conversation-with-langchain-924083079d95)

---

## 3. Model Temperature for Consistency

**Objective:**
Test whether setting temperature to zero removes grading inconsistencies.

**Findings:**
- Lowering temperature improves consistency but doesn't fully eliminate randomness.
- GPU floating-point rounding errors can introduce variability even at a temperature of zero.
- Consistency can be further improved by self-hosting the LLM on a fixed GPU setup (see the sketch below).
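
A minimal sketch of a low-randomness grading request (placeholder model name; the `seed` parameter is best-effort and not guaranteed to make outputs bit-identical):

```python
# Sketch: request near-deterministic output. Even with temperature=0
# (greedy decoding), provider-side batching and floating-point rounding
# can still flip tokens whose probabilities are almost tied.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Grade this answer against the rubric: ..."}],
    temperature=0,  # always pick the most likely next token
    seed=42,        # best-effort reproducibility where supported
)
print(response.choices[0].message.content)
```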

**Discussion Threads:**
- [OpenAI Randomness Discussion 1](https://community.openai.com/t/why-does-openai-api-behave-randomly-with-temperature-0-and-top-p-0/934104)
- [OpenAI Randomness Discussion 2](https://community.openai.com/t/clarifications-on-setting-temperature-0/886447)

---

## 4. Retrieval-Augmented Generation (RAG)

**Description:**
RAG combines LLMs with external knowledge retrieval to produce up-to-date, accurate responses.

**Advantages:**
- No retraining required for knowledge updates.
- Reduces hallucination risk by grounding answers in curated data.
- Enhances transparency by referencing sources.

**Workflow:**
1. The user query is sent to a knowledge base.
2. Relevant information is retrieved.
3. The LLM generates output using the retrieved data (see the sketch below).
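
A minimal, self-contained sketch of this workflow, with a toy bag-of-words similarity standing in for a real embedding model and vector store:

```python
# Minimal RAG sketch: retrieve the most relevant documents for a query,
# then stuff them into the prompt. A real pipeline would use embeddings
# and a vector database instead of word-count cosine similarity.
from collections import Counter
import math

DOCS = [
    "Rubric: full marks require defining RAG and naming one benefit.",
    "Lecture 3: retrieval-augmented generation grounds LLM output in retrieved text.",
    "Admin note: exams are graded out of 100 points.",
]

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    return sorted(DOCS, key=lambda d: similarity(query, d), reverse=True)[:k]

query = "What is retrieval-augmented generation?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` would now be sent to the LLM (step 3 above).
print(prompt)
```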

**Our Current (very) W.I.P. RAG Pipeline:**

**Resources:**
- [AWS - RAG Explanation](https://aws.amazon.com/what-is/retrieval-augmented-generation/)
- [Azure - RAG Overview](https://azure.microsoft.com/en-us/resources/cloud-computing-dictionary/what-is-retrieval-augmented-generation-rag)
- [REALM: Retrieval-Augmented Language Model Pre-Training (RAG research paper)](http://proceedings.mlr.press/v119/guu20a/guu20a.pdf)

---

## 5. Model Comparisons

Comprehensive site for crowdsourced, human-voted ML model comparisons:
* https://lmarena.ai/

---

## 6. College Exam Grader using LLM AI Models

**Citation:**
J. X. Lee and Y.-T. Song, "College Exam Grader using LLM AI models," *2024 IEEE/ACIS 27th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD)*, Beijing, China, 2024, pp. 282-289. [DOI](https://doi.org/10.1109/SNPD61259.2024.10673924)

**Notes:**
Our project will extend this research by exploring additional models not tested in their experiments. While they addressed variance in grading, especially for borderline cases, our work will focus on handling more complex and nuanced questions with advanced prompt and rubric engineering.

**Key Insights:**
- **Models Used:** GPT-3.5, GPT-4.0, Gemini-Pro
  - GPT-4.0 had the best consistency and lowest error rates across multiple grading experiments.
- **Question Types:** Primarily short-answer questions with objective grading metrics.
- **Rubric Design:** Used conditional logic and set theory for precise point allocation.
- **Prompt Engineering:** Crafted prompts embedding the rubric and sample responses for one-shot learning with low randomness (temperature 0.2); a sketch follows this list.
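
Below is a minimal sketch of such a rubric-embedded, one-shot grading prompt (the rubric and worked example are hypothetical placeholders, not taken from the paper):

```python
# Sketch of a rubric-embedded grading prompt in the spirit of Lee & Song:
# rubric + one worked example (one-shot), sent at low temperature (0.2).
RUBRIC = """Q: Define a deadlock. (3 points)
- +1 processes hold resources while waiting
- +1 circular wait among processes
- +1 no process can proceed"""

EXAMPLE = """Student answer: "Processes wait on each other's locks forever."
Score: 2/3 (missing the hold-and-wait condition)"""

def grading_prompt(student_answer: str) -> str:
    return (
        "You are a strict grader. Apply the rubric exactly; award points "
        "only for conditions explicitly satisfied.\n\n"
        f"Rubric:\n{RUBRIC}\n\n"
        f"Worked example (one-shot):\n{EXAMPLE}\n\n"
        f'Student answer: "{student_answer}"\n'
        "Respond as: Score: X/3, then a one-line justification."
    )

# Sent with temperature=0.2 to keep grading near-deterministic.
print(grading_prompt("A cycle of processes each holding a lock the next one needs."))
```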

**Experiments Summary:**
1. **Mock-up Grading:** Tested variance across repeated evaluations.
2. **Human vs. AI:** Compared grading accuracy and consistency, with GPT-4.0 outperforming humans in certain aspects.
3. **Rephrased Responses:** Maintained scoring consistency across different phrasings of the same answer.
4. **Actual Exam Grading:** Achieved 91.5% grading alignment with human TAs.

**Discussion:**
- **Strengths:** GPT-4.0 demonstrated high reliability for objective grading tasks.
- **Limitations:** Struggles with subjective questions and typo handling; also raises cost concerns for large-scale deployment.

**CoGrader:**
- A commercial AI grading platform primarily used in K-12 settings.
- Supports custom rubrics and offers a 30-day free trial for experimentation.
- [CoGrader Website](https://cograder.com/)

---

## 7. Other Relevant Papers

- **Chain of Thought Prompting:** [Read Here](https://arxiv.org/abs/2201.11903)
  - The foundational paper on chain-of-thought prompting, an important technique that also underpins self-consistency checks.
- **Self-Consistency Improves Chain of Thought Reasoning:** [PDF](https://arxiv.org/pdf/2203.11171)
  - Enhances grading accuracy by rerunning reasoning paths and selecting the most consistent answer (see the sketch after this list).
  - Similar in spirit to ensemble methods: the same model is run several times and the results are aggregated to improve the reliability of the output.
- **Other Grading Research Papers:**
  - Mok et al., 2024 - Physics grading using LLMs - [arXiv](https://arxiv.org/html/2411.13685v1)
    - A hands-on evaluation of AI grading in STEM subjects, applicable for testing a model's effectiveness on objective grading tasks.
  - Flodén, 2024 - AI vs. Human Grading - [DOI](https://doi.org/10.1002/berj.4069)
    - Highlights the strengths and weaknesses of AI-based grading compared to human assessment, which can guide benchmarking efforts for the project. Whether current AI grading consistency matches human consistency, for example, remains an open question.
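
Below is a minimal sketch of self-consistency applied to grading: sample several chain-of-thought runs at nonzero temperature, parse each run's final score, and keep the majority vote. The model name is a placeholder, and the prompt is assumed to instruct the model to reason step by step and finish with a line like `Score: 2`.

```python
# Self-consistency sketch: majority-vote the final score over several
# sampled reasoning paths (cf. Wang et al., arXiv:2203.11171).
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_score(prompt: str) -> str | None:
    """One chain-of-thought grading run; returns the parsed score, if any."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,      # diversity across runs is the point here
    )
    text = response.choices[0].message.content
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", text)
    return match.group(1) if match else None

def self_consistent_score(prompt: str, n: int = 5) -> str | None:
    votes = Counter(s for s in (sample_score(prompt) for _ in range(n)) if s)
    return votes.most_common(1)[0][0] if votes else None
```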

---

## 8. Other Relevant Projects (and Frameworks)

### Haystack for Knowledge Retrieval
- Production-ready framework for building RAG pipelines
- [Website](https://haystack.deepset.ai/)

### AI-Handwrite-Grader by wongcyrus
- Grades handwritten submissions using AI
- Accessible via GitHub Codespaces
- [GitHub Repository](https://github.com/wongcyrus/AI-Handwrite-Grader)

### GradeAI by GradeAI
- Auto-grader using GPT-3.5, OCR, and Flask
- Streamlines the grading process
- [GitHub Repository](https://github.com/GradeAI/GradeAI)