Confab + evals

rmusser01 · Nov 18, 2024 · 2af9491
1 parent 2b733b3
commit 2af9491

Showing 2 changed files with 36 additions and 10 deletions.
diff --git a/Docs/Citations_and_Confabulations.md b/Docs/Citations_and_Confabulations.md
@@ -5,10 +5,6 @@
 2. [Confabulations](#confabulations)
 3. [References](#references)
 
-
-
-
-
 RAG
   https://www.lycee.ai/blog/rag-ragallucinations-and-how-to-fight-them
 
@@ -18,26 +14,42 @@ Attributions
 Benchmarks
   https://github.com/lechmazur/confabulations/
   https://huggingface.co/spaces/vectara/Hallucination-evaluation-leaderboard
+  https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard
+  https://osu-nlp-group.github.io/AttributionBench/
+
 
 Research
-  https://github.com/EdinburghNLP/awesome-hallucination-detection  
+  https://github.com/EdinburghNLP/awesome-hallucination-detection
   https://arxiv.org/abs/2407.13481
   https://arxiv.org/abs/2408.06195
   https://arxiv.org/abs/2407.19813
   https://arxiv.org/abs/2407.16557
+  https://arxiv.org/abs/2407.16604
+  https://thetechoasis.beehiiv.com/p/eliminating-hallucinations-robots-imitate-us
   https://arxiv.org/abs/2407.19825
+  https://arxiv.org/pdf/2406.02543
   https://arxiv.org/abs/2406.10279
   https://arxiv.org/pdf/2409.18475
   https://llm-editing.github.io/
+  https://arxiv.org/pdf/2407.03651
   https://cleanlab.ai/blog/trustworthy-language-model/
+  https://arxiv.org/abs/2408.07852
   Detecting Hallucinations
     https://arxiv.org/abs/2410.22071
     https://arxiv.org/abs/2410.02707
-
+  Reflective thinking
+    https://arxiv.org/html/2404.09129v1
+    https://github.com/yanhong-lbh/LLM-SelfReflection-Eval
+  Semantic Entropy
+    https://www.nature.com/articles/s41586-024-07421-0
+    https://arxiv.org/abs/2406.15927
+  HALVA
+    https://research.google/blog/halva-hallucination-attenuated-language-and-vision-assistant/
 
 
 Finetuning: 
 - https://eugeneyan.com/writing/finetuning/
+- 
 
 Evals:
 - https://github.com/yanhong-lbh/LLM-SelfReflection-Eval
@@ -49,7 +61,21 @@ LLM As Judge:
   https://arxiv.org/pdf/2404.12272
   https://arize.com/blog/breaking-down-evalgen-who-validates-the-validators/
   https://huggingface.co/vectara/hallucination_evaluation_model
+  https://arxiv.org/pdf/2404.12272
+  https://arize.com/blog/breaking-down-evalgen-who-validates-the-validators/
+
 
+Long context generation
+  https://arxiv.org/pdf/2408.15518
+  https://arxiv.org/pdf/2408.14906
+  https://arxiv.org/pdf/2408.15496
+  https://arxiv.org/pdf/2408.11745
+  https://arxiv.org/pdf/2407.14482
+  https://arxiv.org/pdf/2407.09450
+  https://arxiv.org/pdf/2407.14057
+  https://www.turingpost.com/p/longrag
+  https://www.turingpost.com/p/deepseek
+  https://arxiv.org/pdf/2408.07055
 
 - Detecting Hallucinations using Semantic Entropy:
 - https://www.nature.com/articles/s41586-024-07421-0
@@ -66,8 +92,7 @@ Lynx/patronus
 - https://github.com/NVIDIA/NeMo-Guardrails/blob/develop/examples/configs/patronusai/prompts.yml
 - https://github.com/NVIDIA/NeMo-Guardrails/blob/develop/docs/user_guides/community/patronus-lynx.md
 - https://huggingface.co/PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct-Q4_K_M-GGUF
-- https://www.patronus.ai/blog/lynx-state-of-the-art-open-source-hallucination-detection-model
-
+- https://arxiv.org/abs/2407.08488
 
 ----------------------------------------------------------------------------------------------------------------
 ### <a name="citations"></a> Citations

diff --git a/Docs/Evaluation_Plans.md b/Docs/Evaluation_Plans.md
@@ -11,7 +11,7 @@
 - [VLM Evaluations](#vlm-evals)
 ----------------------------------------------------------------------------------------------------------------
 
-
+https://eugeneyan.com/writing/evals/
 Benchmarking with distilabel
     https://distilabel.argilla.io/latest/sections/pipeline_samples/examples/benchmarking_with_distilabel/
 
@@ -250,6 +250,7 @@ Finetuning
         - https://stackoverflow.com/questions/9879276/how-do-i-evaluate-a-text-summarization-tool
     - https://github.com/confident-ai/deepeval/tree/99aae8ebc09093b8691c7bd6791f6927385cafa8/deepeval/metrics/summarization
     - https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task
+    - https://arxiv.org/abs/2009.01325
     - https://arxiv.org/abs/2407.01370v1
     - https://arxiv.org/html/2403.19889v1
     - https://github.com/salesforce/summary-of-a-haystack
@@ -334,7 +335,7 @@ Retrieval Granularity
 
 ----------------------------------------------------------------------------------------------------------------
 ### <a name="rag-eval"></a> RAG Evaluation
-
+https://blog.streamlit.io/ai21_grounded_multi_doc_q-a/
 https://archive.is/OtPVh
 https://towardsdatascience.com/how-to-create-a-rag-evaluation-dataset-from-documents-140daa3cbe71
 - **101**