
Added best practice guide for QNA #3

Open · wants to merge 1 commit into base: main
62 changes: 62 additions & 0 deletions docs/taxonomy/qna_yaml_best_practices.md
@@ -0,0 +1,62 @@

- Things to Avoid
  - Historically, LLMs have been bad at math.
  - Do not provide complex math calculations in Q&A seeds.
> **Contributor:** I'm wondering if this would look better as a single line because it references the same subject.


- Context
  - What if knowledge is based on documents that do not exist in the base model?
  - In the qna.yaml file, you can pass context within a chunk of information (text from the document that the Q&A are based on). Adding context to the skill Q&A file might generate better-quality data.
> **@kelbrown20 (Contributor, Nov 18, 2024):** For the formatting, I was thinking instead of a second bullet point, just leave it as a paragraph. For example:
>
> - What if knowledge is based on documents not existing in the base model?
>
>   In the qna.yaml file, you can pass context within a chunk of information (text from the document that Q&A are based on). Adding context to the skill QnA file might generate better-quality data.
>
> And follow that pattern for the others as well. WDYT?
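For reference, a minimal sketch of what a knowledge qna.yaml with a context block might look like. Field names follow the InstructLab knowledge schema; all values here are placeholders, and your schema version may differ:

```yaml
version: 3
domain: example_domain           # placeholder
created_by: your_github_handle   # placeholder
seed_examples:
  - context: |
      A short excerpt from the source document that the
      questions and answers below are grounded in.
    questions_and_answers:
      - question: What does the excerpt say about the topic?
        answer: A concise answer grounded in the excerpt above.
document_outline: One-line description of the source document.
document:
  repo: https://github.com/your-org/your-docs   # placeholder
  commit: <commit-sha>
  patterns:
    - your_document.md
```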


- Formatting (front-end specific and may change)
> **@kelbrown20 (Contributor, Nov 18, 2024):** Same thing here, something like:
>
> - How to format data in the Q&A file, especially how to format tables?
>
>   Currently, only files in Markdown format are supported.
>   - If the files are in any other format, they must be converted to Markdown format.
>   - For automatic converters, we recommend experimenting with other Markdown conversions like `markdown_strict`, `asciidoc` and `gfm`.

  - How to format data in the Q&A file, especially how to format tables?
  - Currently, only files in Markdown format are supported.
    - If the files are in any other format, they must be converted to Markdown format.
    - For automatic converters, we recommend experimenting with other Markdown conversions like `markdown_strict`, `asciidoc` and `gfm`.

- Intervene in Training
  - Can I use the generated JSON files for prompt tuning (watsonx.ai) or with Hugging Face directly?
  - The output of SDG is in JSON format and can also be used for traditional fine-tuning.

- Quantities
  - The number of seed examples: how many seeds should I provide?
    - Generating ~300 Q&A pairs from ~5 seed examples is recommended by the InstructLab product team.
> **Contributor:** I think the QnA, QNA, and qna should be switched to Q&A to be more consistent. Right now I'm between using Q&A or QnA, but I do think we should be consistent. What do folks think?

    - Knowledge requires 5 pieces of context from the document, each with 3 Q&As specific to that context piece, for a total of 15 Q&A pairs.
    - We tried with fewer than 300 Q&A pairs but found the Q&A quality only satisfactory.
    - The task description should be grounded in the domain/document.
    - Given this recommendation, keep in mind that more complex cases can be split into smaller chunks of information.

  - What is the size limit of the context window in the Q&A file (qna.yaml)?
  - Context size limitation:
    - There is a context size limit of ~2300 in the qna.yaml file.
    - It is advised to keep the ground-truth answers concise to respect this limit.
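A minimal sketch (not the official InstructLab tooling) that checks a parsed qna.yaml against the rules of thumb above: ~5 contexts with 3 Q&A pairs each, and the ~2300 context size limit (assumed here to be characters, since the guide does not state the unit). Field names follow the knowledge qna.yaml schema; `check_qna` is a hypothetical helper name.

```python
def check_qna(qna: dict, contexts: int = 5, pairs: int = 3,
              context_limit: int = 2300) -> list[str]:
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    seeds = qna.get("seed_examples", [])
    # Rule of thumb: ~5 seed examples per knowledge contribution.
    if len(seeds) != contexts:
        problems.append(f"expected {contexts} seed examples, found {len(seeds)}")
    for i, seed in enumerate(seeds):
        # Rule of thumb: 3 Q&A pairs per context piece.
        qas = seed.get("questions_and_answers", [])
        if len(qas) != pairs:
            problems.append(f"seed {i}: expected {pairs} Q&A pairs, found {len(qas)}")
        # Assumed character-based context limit.
        n = len(seed.get("context", ""))
        if n > context_limit:
            problems.append(f"seed {i}: context is {n} chars, over ~{context_limit}")
    return problems
```

To run it against a real file, parse the YAML first (e.g. with `yaml.safe_load` if PyYAML is installed) and pass the resulting dict.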

- After Training
  - How do I check the quality of the data in a large data set generated from the qna.yaml file?
  - You don't have to check all of the synthetic data generated by the SDG process. After generating synthetic data internally, the IBM Research team samples it to check quality (there is no need to check it all, especially for an extensive set).

- Quality

> **Contributor:** I really like how this section is formatted!

  - How to measure the quality of obtained data?
  - To evaluate SDG, you can use the following rating range (1-5):
    1. Irrelevant answer.
    2. Relevant but not close to ground truth; the model might be hallucinating.
    3. Relevant, model not hallucinating, partly matching the ground truth.
    4. Relevant, model not hallucinating, but the model adds irrelevant/unnecessary information.
    5. Excellent answer, matches closely with ground truth.

  - Keep in mind:
    - During manual validation, we identified the entity and intent of the question and searched for the same entity and intent in the corresponding document. The document information was provided in the generated JSON file.
    - As the next step, a manual search validated whether the steps or definitions contained in the answer were indeed in the corresponding document.
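The 1-5 rubric above can be captured as a small lookup helper for scoring scripts. This is only a sketch; the labels are paraphrased from the rubric, and `label` is a hypothetical helper name.

```python
# Rating rubric for SDG answers, paraphrased from the guide's 1-5 scale.
RUBRIC = {
    1: "Irrelevant answer",
    2: "Relevant but not close to ground truth; model might be hallucinating",
    3: "Relevant, not hallucinating, partly matching the ground truth",
    4: "Relevant, not hallucinating, but adds irrelevant/unnecessary information",
    5: "Excellent answer, matches closely with ground truth",
}

def label(score: int) -> str:
    """Map a 1-5 rating to its rubric description."""
    if score not in RUBRIC:
        raise ValueError(f"score must be 1-5, got {score}")
    return RUBRIC[score]
```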

  - How to enhance the quality of data generated in SDG?
    - Task description: Add a task description relevant to the knowledge documents. We tried adding a custom task description to improve the SDG.
    - Prompt template: Add guidelines for instruction and output to stick to document-related keywords and to generate instructions from tables. We specifically added these instructions to the prompt template.
    - Chunk word count: Increase the word count to increase the chunk sizes taken from the documents in SDG for long-answer Q&A pairs.
    - ROUGE threshold: To strictly enforce/penalize data quality, one can increase the ROUGE threshold in the `ilab generate` command.
    - The question and answer pairs should be complete sentences, well formed, and use proper grammar. Longer answers are better than a short yes or no.
    - Also, the question and answer pairs must be answerable from the associated context.
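The ROUGE threshold is mentioned without detail, so for intuition, here is a from-scratch sketch of a ROUGE-L F1 score over whitespace tokens (the implementation InstructLab actually uses may differ; `rouge_l_f1` and `lcs_len` are hypothetical helper names). A higher score means the candidate overlaps more with the reference, which is what a dedup/quality threshold filters on.

```python
def lcs_len(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    l = lcs_len(c, r)
    if l == 0:
        return 0.0
    precision, recall = l / len(c), l / len(r)
    return 2 * precision * recall / (precision + recall)
```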

- Formatting
  - How many leaf nodes are kept in the taxonomy after adding a Q&A file?
  - The documents are kept in a single leaf node, which has one qna.yaml file and one attribution.txt file.
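As an illustration of that leaf-node layout (the directory names here are hypothetical), a knowledge contribution occupies one leaf directory in the taxonomy tree:

```
taxonomy/
└── knowledge/
    └── example_domain/        # hypothetical path
        └── example_topic/
            ├── qna.yaml
            └── attribution.txt
```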
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -24,6 +24,7 @@ nav:
- Slack Moderation Guide: community/InstructLab_SLACK_MODERATION_GUIDE.md
- Taxonomy:
- About Taxonomy: taxonomy/index.md
- qna.yaml Best Practices: taxonomy/qna_yaml_best_practices.md
- Skills Overview: taxonomy/skills/index.md
- Skills Guide: taxonomy/skills/skills_guide.md
- Knowledge Overview: taxonomy/knowledge/index.md