
BEAMEngineUpdates_1.0 #496

Open · wants to merge 12 commits into base: v1-dev

Conversation

@keithclift24 (Author):

Refines the BEAM "Compare", and "Fusion" features for improved results.


vercel bot commented Apr 9, 2024

The latest updates on your projects.

| Name | Status | Preview | Updated (UTC) |
| big-agi-open | ✅ Ready | Visit Preview | Apr 15, 2024 1:26am |

@enricoros (Owner) left a comment:

(Pardon the spelling errors; I didn't use AI for this important stuff.) Thanks, feedback is inlined. I need to test side-by-side in my installs, on a dozen cases, and then make the subjective call. I tested the existing prompts for days in the weeks prior to release, so naturally I'm more familiar with them.

Review (see inline comments):

  • highlights: increased precision, an improved table, improved ranges, and better commands
  • I can attest that some of the changes will make a positive impact and will be merged for sure

Important: some prompts are more commanding (larger) than the baseline today in Big-AGI. The tradeoff of giving more instructions is the following: we get more predictable and controllable output, but often at the expense of degrees of freedom in the response space. Meaning we get what we ask for, but we take away the chance of the LLMs performing some operation that would be surprising and useful to us.

This happens in:

  • the Table has 2 fixed columns, which set a "foundation" to build the other columns upon
  • the Fusion's user prompt is larger. This will make for better fusions, but not all the time. Example below:

Fusion.userPrompt contains ".. ensuring the narrative is comprehensive, nuanced.. ". Now, imagine the user's first prompt was to summarize a PDF. The Fusion user prompt will "undo" the summarization request and instead fuse the elements in a comprehensive and nuanced way (and, predictably, produce "less" of the summary that the user expects).

Sorry for the length, but I owe you some explanation of why I'm not merging right away, as you have some very good work here. I have been thinking about this space at length, and I am glad I can have this sort of discussion with a peer.

Future:
1. (likely) a selective merge, but I need to get to a few more important issues first;
2. another approach would be a query compiler, which takes a good instruction (like yours!) and the initial user prompts, and reformulates the fusion instruction to tailor it to the user's query. I have the feeling that 2 could work;
3. I will need to hire someone to work on this, so we can have the best fusion, compilers, stores, etc.
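The query compiler idea (option 2) could be sketched roughly like this — a hypothetical illustration only; `compileFusionInstruction` and its prompt wording are mine, not part of Big-AGI (a real version would likely use an LLM call rather than plain string templating):

```typescript
// Sketch of a "query compiler": rather than using a fixed Fusion instruction,
// reformulate the instruction around the user's original query, so the fusion
// preserves intent (e.g. a summarization request stays a summary).
// All names here are illustrative, not from the Big-AGI codebase.
function compileFusionInstruction(baseInstruction: string, userQuery: string): string {
  return [
    baseInstruction,
    `The user's original request was: "${userQuery}".`,
    'Preserve the intent, style, and scope of that request in the fused response;',
    'do not expand a brief or summarized answer into a comprehensive one.',
  ].join('\n');
}

// Example: the summarization constraint survives into the fusion instruction,
// instead of being "undone" by generic "comprehensive, nuanced" wording.
const fusionInstruction = compileFusionInstruction(
  'Synthesize the best elements of the {{N}} alternatives into one response.',
  'Summarize this PDF in 5 bullet points',
);
```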

Now I'll go merge it into one of my private builds to test.

| R2 | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... |
| RN | ... | ... | ... | ... | ... | ... |
Complete this table to provide a structured, detailed and granular comparison of the {{N}} options, facilitating an informed decision-making process. Finally, are careful review of the results,
@enricoros (Owner):

Are -> After?

@keithclift24 (Author):

Yes, "after"

Synthesize the perfect response that merges the key insights and provides clear guidance or answers based on the collective intelligence of the alternatives.`.trim(),
Your task is to orchestrate a synthesis of elements from {{N}} response alternatives, derived from separate LLMs, each powered by unique architectures and training paradigms. Your role involves:

Analyzing the diverse array of responses to unearth common themes, address contradictions, exclude inaccuracies, and spotlight unique insights and content.
@enricoros (Owner):

Note: I'll need to remove the spacing in the lines, as template-literal blocks keep literal indentation (all the spaces at the start). I've also just found the package https://www.npmjs.com/package/dedent that can do it. I'll take care of this.
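For reference, a minimal sketch of what `dedent` does (the real npm package is a tagged template; this standalone function is only an illustration of the behavior):

```typescript
// Strip the common leading indentation that template literals keep.
// Illustrative sketch of the `dedent` npm package's behavior, not its code.
function dedent(text: string): string {
  const lines = text.split('\n');
  const nonEmpty = lines.filter((l) => l.trim().length > 0);
  // Smallest leading-whitespace width across non-empty lines.
  const minIndent = nonEmpty.length
    ? Math.min(...nonEmpty.map((l) => l.length - l.trimStart().length))
    : 0;
  return lines.map((l) => l.slice(minIndent)).join('\n').trim();
}

// An indented multi-line prompt comes out flush-left.
const prompt = dedent(`
    Analyze the responses.
    Exclude inaccuracies.
`);
```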

Your response should integrate the most relevant insights from these inputs into a cohesive and actionable answer.

Synthesize the perfect response that merges the key insights and provides clear guidance or answers based on the collective intelligence of the alternatives.`.trim(),
Your task is to orchestrate a synthesis of elements from {{N}} response alternatives, derived from separate LLMs, each powered by unique architectures and training paradigms. Your role involves:
@enricoros (Owner):

👍 I like the improved precision of your commands


- The checklist should contain no more than 3-9 orthogonal items, especially points of difference, in a single brief line each (no end period).
+ The checklist should contain no more than 20 orthogonal items, especially points of difference, in a single brief line each (no end period).
@enricoros (Owner):

3-9 was too low and vague; 20 is possibly too much. It depends on the scope of the answer. It would be good to give a "sizing" of the checklist commensurate to the input, so for an easy job (a simple joke) you get 5 items, and for a legal doc you get 15.
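The "sizing commensurate to the input" idea could look something like this — a hypothetical heuristic; the thresholds and the `checklistBudget` name are mine, not from the PR:

```typescript
// Scale the checklist item budget with the input size, so a one-line joke
// yields ~5 items and a long legal document caps out at 15.
// Illustrative thresholds only, not Big-AGI code.
function checklistBudget(inputChars: number): number {
  const min = 5;
  const max = 15;
  // One extra item per ~2,000 characters of input, clamped to [min, max].
  const scaled = min + Math.floor(inputChars / 2000);
  return Math.min(max, scaled);
}
```

The computed budget could then be interpolated into the prompt, e.g. "The checklist should contain no more than N orthogonal items".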

@keithclift24 (Author):

Agree, after testing, 20 is a bit much. We could have it decide the number based on its own assessment. Did you try no limit?

@enricoros (Owner):

Yes, I tried, and the models usually don't have a "scale" to refer to. You usually get ~10 items, whether for a "hello" fusion or a legal document.

@@ -122,44 +129,50 @@ The final output should reflect a deep understanding of the user's preferences a
addLabel: 'Add Breakdown',
cardTitle: 'Evaluation Table',
Icon: TableViewRoundedIcon,
- description: 'Analyzes and compares AI responses, offering a structured framework to support your response choice.',
+ description: 'Analyzes and compares AI responses, offering a structured framework to support your response choice. Model names are hidden and coded (R1, R2, etc.) to remove potential bias.',
@enricoros (Owner):

Love the explanation of the coding of the model names.

Now that you have reviewed the {{N}} alternatives, proceed with the following steps:

- 1. **Identify Criteria:** Define the most important orthogonal criteria for evaluating the responses. Identify up to 2 criteria for simple evaluations, or up to 6 for more complex evaluations. Ensure these criteria are distinct and relevant to the responses provided.
+ 1. **Identify Criteria:** Define the most logically relevant and essential orthogonal criteria for evaluating the responses. Always include Accuracy and Pertinence as primary criteria.
@enricoros (Owner):

Do you think Accuracy and Pertinence are a must? It's a good idea, but I have to see whether adding this constraint removes degrees of freedom from the other criteria.

Selecting Accuracy and Pertinence defines those 2 as the most important vectors in any message decomposition. It's possible that they are, and setting those 2 vectors is important for establishing a reliable and repeatable framework that doesn't leave too much room to the RNG.

There's some brilliance to this - need to test.

(Accuracy may need to be defined further - Pertinence probably has a narrower definition, which is good)

@keithclift24 (Author):

I spent a lot of time debating with AI itself over what really matters to get a net higher quality fusion response. Relevancy never quite fit, and I think pertinence nails it. Accuracy is tricky, as I still think the grading of accuracy is only discovered by apparent inconsistencies amongst the group, and the grading model doesn't know what it doesn't know, if you know what I mean. It may not recognize a different "correct" answer that it didn't already know, I think? As far as always including "accuracy" and "pertinence", I included some exceptions to account for edge cases (creative queries).

@enricoros (Owner):

Accuracy is tricky also because it can mean different things to different models. I'm almost leaning towards preferring Pertinence over accuracy.

@keithclift24 (Author) Apr 9, 2024:

Yeah, I need to think more. "Correctness"? Here are some we can consider:

• Precision: Exactness in measurement or performance.
• Exactness: The degree of conformity to a standard or truth.
• Fidelity: Faithfulness to the original or to a standard.
• Veracity: Conformity to facts; accuracy.
• Validity: The extent to which a concept, conclusion, or measurement is well-founded and corresponds accurately to the real world.
• Credibility: Worthy of belief or confidence.
• Accuracy: Correctness or precision of information or measurement.
• Conformity: Agreement or compliance with standards, rules, or laws.
• Consistency: Uniformity or steadiness in quality or performance.
• Exactitude: The quality of being exact or precise in detail.
• Meticulousness: Extreme or excessive care in the consideration or treatment of details.
• Rigor: Strictness, severity, or thoroughness in maintaining standards.
• Faithfulness: Accuracy in reproducing a sound or image.
• Correctness: Free from error; in accordance with fact or truth.
• Adherence: The quality of sticking strictly to standards, rules, or practices.
• Exactness: The quality of being very accurate and precise.
• Precision: Exactness in the language model’s responses to queries.
• Fidelity: Faithfulness of the model’s output to the facts or source material.
• Veracity: Adherence to truth and accuracy of factual information provided.
• Relevance: The degree to which the model’s responses pertain to the given tasks or questions.
• Adaptability: The model’s ability to adjust its responses based on new information or feedback.
• Critical Thinking: The model’s ability to analyze, evaluate, and synthesize information in its responses.
• Innovation: Originality and creativity in generating solutions or responses.
• Factual Accuracy: Correctness of factual statements, essential for trustworthiness.
• Logical Reasoning: Clear and sound reasoning in constructing arguments or explanations.
• Problem-Solving Skills: Effectiveness in identifying and proposing solutions.
• Detail Orientation: Attention to and incorporation of significant details in responses.
• Coherence: Logical consistency and clarity in responses, ensuring they are understandable and follow a logical flow.
• User Engagement: The ability to maintain the user’s interest and promote further interaction through engaging and relevant content.
• Comprehensiveness: The extent to which the model can cover all relevant topics or knowledge areas for a given task.
• Synthesis Ability: The model’s capacity to integrate and combine information from various sources into a coherent whole, showing depth of understanding.
• Clarity: The ease with which users can understand the model’s responses, emphasizing clear and accessible language.
• Insightfulness: The depth of understanding and the ability to provide novel insights or perspectives in responses.
• Responsiveness: The precision with which the model addresses and adapts to the specific parts of a prompt or question, including the nuanced understanding of user intent.
• Emotional Intelligence: The model’s ability to recognize and appropriately respond to emotional cues in text, demonstrating sensitivity to the emotional context.
• Technical Proficiency: The accuracy and depth of knowledge in responses to queries requiring specialized understanding or technical expertise.

- 3. **Generate Table:** Organize your analysis into a table. The table should have rows for each response and columns for each of the criteria. Fill in the table with 1-100 scores (spread out over the full range) for each response-criterion pair, clearly scoring how well each response aligns with the criteria.
+ 3. **Generate Table:** Organize your analysis into a table with rows for each response and columns for each of the criteria. Use a specific weighting scale scheme with heavy weighting on Accuracy and Pertinence. Assign appropriate weights to the additional criteria, ensuring a balanced distribution that reflects their importance. Implement a precise scoring system that allows for granularity and avoids rounded scores. Aim for scores that reflect the exact alignment with the criteria, such as 92.3 or 87.6, rather than rounded figures like 90 or 85.
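The weighting scheme this hunk describes could be computed roughly as follows — a sketch only; the `weightedTotal` helper, the weights, and the criteria names are illustrative, not part of the PR:

```typescript
// Weighted per-response total: heavy weights on Accuracy and Pertinence,
// lighter weights on the remaining criteria. Illustrative values only.
type Scores = Record<string, number>; // criterion -> 1-100 score (or weight)

function weightedTotal(scores: Scores, weights: Scores): number {
  let total = 0;
  let weightSum = 0;
  for (const [criterion, weight] of Object.entries(weights)) {
    total += (scores[criterion] ?? 0) * weight;
    weightSum += weight;
  }
  // Normalize so the total stays on the same 1-100 scale as the inputs.
  return total / weightSum;
}

const weights = { Accuracy: 0.3, Pertinence: 0.3, Clarity: 0.2, Coherence: 0.2 };
// Fine-grained scores like 92.3 (rather than 90) keep responses differentiated.
const r1 = weightedTotal(
  { Accuracy: 92.3, Pertinence: 87.6, Clarity: 74.1, Coherence: 80.5 },
  weights,
);
```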
@enricoros (Owner):

Good job on better defining the distribution.

@keithclift24 (Author):

This is another area where it could be tightened up lengthwise (and elsewhere). I don't know if "don't round" is really that important; I was just trying to yield more exact, differentiated results.

@enricoros (Owner):

I love this one.

@keithclift24 (Author):

I imagined you'd have comments 😃, and you've already spent time massaging the language, since some parameters were very particular. I agree the length is an issue in parts. Note that I tried to account for creative and brevity query scenarios at the expense of tokens. From my testing it actually does a pretty good job of handling both.

@enricoros (Owner):

Yes, I can tell that you really did a lot of work. I need to BEAM this thread to prioritize the changes :) I may just cherry-pick the safe parts for now, while I test the more extensive parts (Fusion, Table).

@enricoros (Owner):

Merged to my personal branch.
