Aider-AI · Dec 19, 2023 · 837fd9e
1 parent 3e63963
commit 837fd9e
Showing 1 changed file with 7 additions and 5 deletions.
diff --git a/docs/unified-diffs.md b/docs/unified-diffs.md
@@ -7,7 +7,9 @@
 Aider now asks GPT-4 Turbo to use
 [unified diffs](#choose-a-familiar-editing-format)
 to edit your code.
-This dramatically improves GPT-4 Turbo's performance on a complex benchmark 
+This dramatically improves GPT-4 Turbo's performance on a
+challenging
+new benchmark 
 and significantly reduces its bad habit of "lazy" coding,
 where it writes
 code with comments
@@ -17,15 +19,15 @@ Aider also has a new "laziness" benchmark suite
 designed to both provoke and quantify lazy coding.
 It consists of
 89 python refactoring tasks
-which tend to make GPT-4 Turbo very lazy.
-On these tasks it often produces comments like
+which tend to make GPT-4 Turbo lazy
+and write comments like
 "...include the original method body...".
 
 This new laziness benchmark produced the following results with `gpt-4-1106-preview`:
 
-- **GPT-4 Turbo only scored 20% as a baseline** using aider's existing "SEARCH/REPLACE block" edit format. It output "lazy comments" on 12 of the tasks.
+- **GPT-4 Turbo only scored 20% as a baseline** using aider's existing "SEARCH/REPLACE block" edit format. It outputs "lazy comments" on 12 of the tasks.
 - **Aider's new unified diff edit format raised the score to 61%**. Using this format reduced laziness by 3X, with GPT-4 Turbo only using lazy comments on 4 of the tasks.
-- **It's worse to prompt that the user is blind, without hands, will tip $2000 and fears truncated code trauma.** These widely circulated folk remedies performed worse on the benchmark when added to the system prompt for the baseline SEARCH/REPLACE and new unified diff editing formats. These prompts did *slightly* reduce the amount of laziness, but at a large cost to successful benchmark outcomes.
+- **It's worse to prompt that the user is blind, without hands, will tip $2000 and fears truncated code trauma.** These widely circulated folk remedies performed worse on the benchmark when added to the system prompt for the baseline SEARCH/REPLACE and new unified diff editing formats. These prompts did slightly reduce the amount of laziness against baseline (to 8 lazy tasks). It increased the lazy tasks to 5 when added to the unified diff prompt.
 
 The older `gpt-4-0613` also did better on the laziness benchmark using unified diffs: