First input file is the only one processed #61

Open
mjh624 opened this issue Oct 1, 2024 · 3 comments

mjh624 commented Oct 1, 2024

Our input folder contains 11 files. All appear to be read in:
Successfully read file: ./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/innovationqkb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/ipcomkb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/iqideaskb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/ipcomkb.faq.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/priorartdatabasekb.WordPress.2024-07-26.xml.md
JSON file saved successfully.
Successfully read file: ./input/iqideaskb.faq.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/innovationqkb.faq.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/priorartdatabasekb.glossary.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/iqideaskb.glossary.WordPress.2024-07-27.xml.md
JSON file saved successfully.
Successfully read file: ./input/priorartdatabasekb.faq.WordPress.2024-07-27.xml.md
Pretraining set created.

However, question/answer pairs are produced only for the first file, ./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md.

The augmentoolkit output messages do not appear to indicate whether there is an issue:
COMPLETED PHASE 0
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/a0db9260-500e>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/a7bc5e7d-950c>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/bd42e735-6bba>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/c6ddcde3-8678>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/15a96815-222f>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/91488b07-8c85>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/f8913bd2-afb5>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/8ef1f903-2906>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/ee46ce5b-8461>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/86b05937-c2a0>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/1b6b185c-b3fd>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/0adb9229-3210>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/d51cf6a9-1745>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/921636fe-4076>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/456d0805-b8c7>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/f09a3fd9-3fcf>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/bf45b67b-81da>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/01b56723-abdf>
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/b89ac365-1286>
FAILED TO GENERATE QUESTIONS!
Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/1f0d3583-f9e7>
COMPLETED PHASE 1

Each file written in phase 1 appears to correspond to questions/answers related to a paragraph in the document: ./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md

What are some possible reasons that files in the input folder are skipped?
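
To make the log easier to reason about, the three message types in it can be tallied. The following is a minimal sketch that assumes the messages keep exactly the prefixes shown in the output above; `summarize_log` is a hypothetical helper, not part of augmentoolkit. Applied to the full output pasted above, it should report 11 files read, 20 outputs written, and 7 failures.

```python
from collections import Counter


def summarize_log(lines):
    """Tally the three message types seen in the console output above."""
    counts = Counter()
    for line in lines:
        text = line.strip()
        if text.startswith("Successfully read file:"):
            counts["read"] += 1
        elif text.startswith("Output written to"):
            counts["written"] += 1
        elif text.startswith("FAILED TO GENERATE QUESTIONS!"):
            counts["failed"] += 1
    return counts


# A few lines copied from the log above; other lines are ignored.
sample = [
    "Successfully read file: ./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
    "COMPLETED PHASE 0",
    "Output written to /tmp/augmentoolkit/original/output/question_generation_generations/question_generation_generations/a0db9260-500e>",
    "FAILED TO GENERATE QUESTIONS!",
    "COMPLETED PHASE 1",
]
print(dict(summarize_log(sample)))  # {'read': 1, 'written': 1, 'failed': 1}
```

The tally only shows how often generation failed; it does not show which input file each failure belongs to, because the log lines carry no file name.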

mjh624 commented Oct 1, 2024

A check of the metadata field in the qatuples_filtered folder shows that only the first file led to question/answer pairs:

grep -r metadata qatuples_filtered
qatuples_filtered/para_6_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_7_q_5.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_0_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_19_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_18_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_19_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_8_q_6.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_6_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_4_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_15_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_3_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_12_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_12_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_6_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_7_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_8_q_5.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_3_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_2_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_8_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_7_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_18_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_3_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_8_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_1_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_0_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_6_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_9_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_19_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_18_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_4_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_12_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_8_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_9_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_7_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_1_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_3_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_0_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_15_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_7_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_19_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_9_q_6.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_2_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_2_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_18_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_2_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_8_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_15_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_8_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_1_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_7_q_6.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_9_q_5.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_9_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_9_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_9_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_1_q_3.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_6_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_0_q_4.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_1_q_0.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_7_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_0_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_3_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_4_q_2.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
qatuples_filtered/para_15_q_1.json: "metadata": "./input/innovationqkb.glossary.WordPress.2024-07-27.xml.md",
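
The same check can be scripted so that it also lists the input files that never show up as a source. This is a rough sketch under two assumptions: each JSON file in the output folder stores the source path as a string under a `metadata` key (as the grep output suggests, though the nesting depth is not known from it), and the folder names match the ones used in this thread. The helper names are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def find_metadata(obj):
    """Yield every string stored under a 'metadata' key, at any nesting depth."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == "metadata" and isinstance(value, str):
                yield value
            else:
                yield from find_metadata(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_metadata(item)


def count_sources(output_dir):
    """Count how many JSON files in output_dir name each source file."""
    counts = Counter()
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # A set, so one JSON file counts once per source it mentions.
        counts.update(set(find_metadata(data)))
    return counts


def missing_inputs(input_dir, counts):
    """Return the input file names that no output file names as its source."""
    seen = {Path(source).name for source in counts}
    return sorted(
        p.name for p in Path(input_dir).iterdir()
        if p.is_file() and p.name not in seen
    )


if __name__ == "__main__":
    out_dir, in_dir = Path("qatuples_filtered"), Path("input")
    if out_dir.is_dir() and in_dir.is_dir():
        counts = count_sources(out_dir)
        for source, n in counts.most_common():
            print(f"{n:5d}  {source}")
        print("No QA pairs from:", missing_inputs(in_dir, counts))
```

Sources are matched by file name only, so two input files with the same name in different folders would be treated as one.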

@e-p-armstrong
Owner

Hmm, that's strange. I'm not able to reproduce this using the toy example the repo starts with, so that leaves a few possibilities:

  1. We're running into an edge case in the code that isn't triggered by the three default input files.
  2. Somehow everything sourced from the other files is failing validation and never gets to question generation.
  3. All questions made from those files get dropped for some reason.
  4. Something else.

Would you be against sharing your input files and maybe your config so I can try to repro it on my end, or is that stuff confidential?

mjh624 commented Oct 5, 2024

Here is the config file:
API:
  API_KEY: xxxx
  BASE_URL: http://localhost:11434/v1
  LARGE_LOGICAL_MODEL: llama3.1:70b
  LOGICAL_MODEL: llama3.1:70b
HUGGINGFACE:
  HUB_PATH: < our info here >
  PRIVATE: False
  PUSH_TO_HUB: False
PATH:
  DEFAULT_PROMPTS: ./prompts
  INPUT: ./input
  OUTPUT: ./output
  PROMPTS: ./prompts
PHASE:
  PHASE_INDEX: 3
  WORK_IN_PHASES: False
SKIP:
  ANSWER_RELEVANCY_CHECK: False
  FILTER_CHUNKS: True
  QUESTION_CHECK: False
SYSTEM:
  CHUNK_SIZE: 1900
  COMPLETION_MODE: False
  CONCURRENCY_LIMIT: 3
  CONVERSATION_INSTRUCTIONS: For this conversation, you are generating a chat between
    a generalist, generic AI assistant, and a human.
  DOUBLE_CHECK_COUNTER: 1
  DO_NOT_USE_SYSTEM_PROMPTS: True
  FINAL_ASSISTANT_PROMPT_NO_RAG: 'You are a helpful AI assistant.

    '
  FINAL_ASSISTANT_PROMPT_RAG: 'You are a helpful AI assistant.

    Context information is below:


    ----------------------

    {data}

    '
  MODE: api
  STOP: True
  SUBSET_SIZE: 15
  USE_FILENAMES: False
  USE_SUBSET: False

Unfortunately, I cannot share our input files.

I found the config file that was used to process army training manuals:

https://github.com/e-p-armstrong/augmentoolkit/blob/master/original/config_overrides/army_model/config.yaml

I modified that config to use our input files and model, and most, if not all, of the input files now appear to be processed.
However, processing started on October 1, 2024 and is still running after 4 days.
I would like to understand which settings allowed the other files to be processed, and why the run is taking so much longer.
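
One way to narrow down which settings matter is to diff the two configs key by key. The sketch below works on plain nested dictionaries; loading the two YAML files (for example with PyYAML's `yaml.safe_load`) is left out and assumes PyYAML is installed. `flatten` and `diff_configs` are hypothetical helpers. In the example, the first dictionary uses values from the config posted above, while the values in the second are placeholders for illustration only and are not taken from the army_model config.

```python
MISSING = "<missing>"  # marker for a key present in only one config


def flatten(config, prefix=""):
    """Turn nested dicts into a flat {'SECTION.KEY': value} mapping."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def diff_configs(old, new):
    """Return {key: (old_value, new_value)} for every setting that differs."""
    a, b = flatten(old), flatten(new)
    return {
        key: (a.get(key, MISSING), b.get(key, MISSING))
        for key in sorted(a.keys() | b.keys())
        if a.get(key, MISSING) != b.get(key, MISSING)
    }


# Values copied from the config posted in this thread.
posted = {"SYSTEM": {"CHUNK_SIZE": 1900, "CONCURRENCY_LIMIT": 3, "USE_SUBSET": False}}
# Placeholder values only; substitute the settings from the other config.
other = {"SYSTEM": {"CHUNK_SIZE": 1900, "CONCURRENCY_LIMIT": 50, "USE_SUBSET": False}}

for key, (before, after) in diff_configs(posted, other).items():
    print(f"{key}: {before} -> {after}")
```

Only settings that differ are printed, which keeps the comparison short when the two files share most values.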
