Leaf nodes with empty sdg output #357

Open
acsankar opened this issue Nov 9, 2024 · 1 comment

acsankar commented Nov 9, 2024

I am not getting an error, but Q&A generation ran for about 20 minutes and produced empty datasets.
Please let me know what the cause could be.

(granite1) sankar@Sankars-MacBook-Pro test1 % ilab data generate --model /Users/sankar/.cache/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf
INFO 2024-11-09 19:43:03,301 numexpr.utils:161: NumExpr defaulting to 11 threads.
INFO 2024-11-09 19:43:03,536 datasets:59: PyTorch version 2.4.1 available.
INFO 2024-11-09 19:43:04,174 instructlab.model.backends.llama_cpp:125: Trying to connect to model server at http://127.0.0.1:8000/v1
WARNING 2024-11-09 19:43:04,193 instructlab.data.generate_data:72: Disabling SDG batching - unsupported with llama.cpp serving
INFO 2024-11-09 19:43:04,200 instructlab.data.generate_data:82: Generating synthetic data using 'full' pipeline, '/Users/sankar/.cache/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf' model, '/Users/sankar/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:8000/v1 server
INFO 2024-11-09 19:43:05,341 instructlab.sdg.generate_data:356: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-11-09 19:43:06,222 instructlab.sdg.checkpointing:59: No existing checkpoints found in /Users/sankar/.local/share/instructlab/datasets/checkpoints/knowledge_application_InfyBill_overview, generating from scratch
INFO 2024-11-09 19:43:06,223 instructlab.sdg.pipeline:153: Running pipeline single-threaded
INFO 2024-11-09 19:43:06,223 instructlab.sdg.pipeline:197: Running block: duplicate_document_col
INFO 2024-11-09 19:43:06,223 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_query_4', 'icl_query_5', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'icl_response_4', 'icl_response_5'],
num_rows: 20
})
INFO 2024-11-09 19:43:07,527 instructlab.sdg.llmblock:52: LLM server supports batched inputs: False
INFO 2024-11-09 19:43:07,527 instructlab.sdg.pipeline:197: Running block: gen_spellcheck
INFO 2024-11-09 19:43:07,527 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_query_4', 'icl_query_5', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'icl_response_4', 'icl_response_5', 'base_document'],
num_rows: 20
})
gen_spellcheck Prompt Generation: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [14:54<00:00, 44.72s/it]
INFO 2024-11-09 19:58:01,978 instructlab.sdg.pipeline:197: Running block: flatten_auxiliary_columns
INFO 2024-11-09 19:58:01,979 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_query_4', 'icl_query_5', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'icl_response_4', 'icl_response_5', 'base_document', 'spellcheck'],
num_rows: 20
})
INFO 2024-11-09 19:58:02,013 instructlab.sdg.pipeline:197: Running block: rename_to_document_column
INFO 2024-11-09 19:58:02,013 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_query_4', 'icl_query_5', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'icl_response_4', 'icl_response_5', 'dataset_type', 'corrected_document'],
num_rows: 40
})
INFO 2024-11-09 19:58:02,015 instructlab.sdg.pipeline:197: Running block: gen_knowledge
INFO 2024-11-09 19:58:02,015 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'raw_document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_query_4', 'icl_query_5', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'icl_response_4', 'icl_response_5', 'dataset_type', 'document'],
num_rows: 40
})
gen_knowledge Prompt Generation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [06:05<00:00, 9.15s/it]
WARNING 2024-11-09 20:04:07,835 instructlab.sdg.generate_data:386: Empty dataset for qna node: knowledge_application_InfyBill_overview
INFO 2024-11-09 20:04:07,837 instructlab.sdg.generate_data:420: Generation took 1263.64s
WARNING 2024-11-09 20:04:07,837 instructlab.sdg.generate_data:422: Leaf nodes with empty sdg output: knowledge_application_InfyBill_overview

@bbrowning
Contributor

Unfortunately, the merlinite-7b-lab-Q4_K_M.gguf model often results in empty datasets when using the full data generation pipeline. If you are on a resource-constrained system and trying to run the full pipeline, then a model such as https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ is known to work more reliably. The full pipeline is generally designed to work with the much larger unquantized Mixtral-8x7B-Instruct, but the quantized version linked here also works fairly well.

If that quantized Mixtral is too large for your system, then you can continue using the merlinite GGUF, but you'll need to pass --pipeline simple to run the simple pipeline. That pipeline produces substantially worse results, but will run on smaller hardware.
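
For reference, here is a minimal sketch of that second option, assuming the same model path that appears in the log above. The helper name `build_generate_command` is hypothetical and not part of InstructLab; only the `--model` and `--pipeline simple` arguments come from this thread. The sketch only assembles and prints the command, and the line that would actually run it is left commented out.

```python
import shlex
from pathlib import Path


def build_generate_command(model_path: str, pipeline: str = "simple") -> list[str]:
    """Assemble the argv list for `ilab data generate` with an explicit pipeline.

    Hypothetical helper for illustration; it only builds the argument list.
    """
    return [
        "ilab", "data", "generate",
        "--model", str(model_path),
        "--pipeline", pipeline,
    ]


# Assumption: the model lives where the log in this issue shows it.
model = Path.home() / ".cache/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf"
cmd = build_generate_command(str(model))

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))

# To run it directly (requires ilab installed and a model server running):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing `pipeline="full"` to the same helper gives the command form for the full pipeline, which is what you would use after switching to the quantized Mixtral model.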
