Leaf nodes with empty sdg output #357

acsankar · 2024-11-09T15:13:46Z

I am not getting an error, but after running Q&A generation, which took about 20 minutes, I got empty datasets.
Please let me know what the cause could be.

(granite1) sankar@Sankars-MacBook-Pro test1 % ilab data generate --model /Users/sankar/.cache/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf
INFO 2024-11-09 19:43:03,301 numexpr.utils:161: NumExpr defaulting to 11 threads.
INFO 2024-11-09 19:43:03,536 datasets:59: PyTorch version 2.4.1 available.
INFO 2024-11-09 19:43:04,174 instructlab.model.backends.llama_cpp:125: Trying to connect to model server at http://127.0.0.1:8000/v1
WARNING 2024-11-09 19:43:04,193 instructlab.data.generate_data:72: Disabling SDG batching - unsupported with llama.cpp serving
INFO 2024-11-09 19:43:04,200 instructlab.data.generate_data:82: Generating synthetic data using 'full' pipeline, '/Users/sankar/.cache/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf' model, '/Users/sankar/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:8000/v1 server
INFO 2024-11-09 19:43:05,341 instructlab.sdg.generate_data:356: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-11-09 19:43:06,222 instructlab.sdg.checkpointing:59: No existing checkpoints found in /Users/sankar/.local/share/instructlab/datasets/checkpoints/knowledge_application_InfyBill_overview, generating from scratch
INFO 2024-11-09 19:43:06,223 instructlab.sdg.pipeline:153: Running pipeline single-threaded
INFO 2024-11-09 19:43:06,223 instructlab.sdg.pipeline:197: Running block: duplicate_document_col
INFO 2024-11-09 19:43:06,223 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_query_4', 'icl_query_5', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'icl_response_4', 'icl_response_5'],
num_rows: 20
})
INFO 2024-11-09 19:43:07,527 instructlab.sdg.llmblock:52: LLM server supports batched inputs: False
INFO 2024-11-09 19:43:07,527 instructlab.sdg.pipeline:197: Running block: gen_spellcheck
INFO 2024-11-09 19:43:07,527 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_query_4', 'icl_query_5', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'icl_response_4', 'icl_response_5', 'base_document'],
num_rows: 20
})
gen_spellcheck Prompt Generation: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [14:54<00:00, 44.72s/it]
INFO 2024-11-09 19:58:01,978 instructlab.sdg.pipeline:197: Running block: flatten_auxiliary_columns
INFO 2024-11-09 19:58:01,979 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_query_4', 'icl_query_5', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'icl_response_4', 'icl_response_5', 'base_document', 'spellcheck'],
num_rows: 20
})
INFO 2024-11-09 19:58:02,013 instructlab.sdg.pipeline:197: Running block: rename_to_document_column
INFO 2024-11-09 19:58:02,013 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_query_4', 'icl_query_5', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'icl_response_4', 'icl_response_5', 'dataset_type', 'corrected_document'],
num_rows: 40
})
INFO 2024-11-09 19:58:02,015 instructlab.sdg.pipeline:197: Running block: gen_knowledge
INFO 2024-11-09 19:58:02,015 instructlab.sdg.pipeline:198: Dataset({
features: ['icl_document', 'raw_document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_query_4', 'icl_query_5', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'icl_response_4', 'icl_response_5', 'dataset_type', 'document'],
num_rows: 40
})
gen_knowledge Prompt Generation: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [06:05<00:00, 9.15s/it]
WARNING 2024-11-09 20:04:07,835 instructlab.sdg.generate_data:386: Empty dataset for qna node: knowledge_application_InfyBill_overview
INFO 2024-11-09 20:04:07,837 instructlab.sdg.generate_data:420: Generation took 1263.64s
WARNING 2024-11-09 20:04:07,837 instructlab.sdg.generate_data:422: Leaf nodes with empty sdg output: knowledge_application_InfyBill_overview

bbrowning · 2024-11-10T02:51:17Z

Unfortunately, the merlinite-7b-lab-Q4_K_M.gguf model often results in empty datasets when using the full data generation pipeline. If you are on a resource-constrained system and trying to run the full pipeline, then a model such as https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ is known to work more reliably. The full pipeline is generally designed to work with the much larger unquantized Mixtral-8x7B-Instruct, but the quantized version linked here also works fairly well.

If that quantized Mixtral is too large for your system, you can keep using the merlinite GGUF, but you'll need to pass --pipeline simple to run the simple pipeline. That pipeline produces substantially worse results, but it will run on smaller hardware.
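
For reference, here is a rough sketch of what that invocation might look like. It reuses the model path and the `--model` flag from the original report and only adds `--pipeline simple`; the `command -v` guard and the `MODEL` variable are just illustrative scaffolding, so adjust the path for your own setup.

```shell
# Model path taken from the log in the original report; change it to match your system.
MODEL="$HOME/.cache/instructlab/models/merlinite-7b-lab-Q4_K_M.gguf"

# Same command as in the report, with --pipeline simple added.
set -- ilab data generate --pipeline simple --model "$MODEL"

if command -v ilab >/dev/null 2>&1; then
    # Run the command exactly as assembled above.
    "$@"
else
    # Illustrative fallback only: show the command when ilab is not installed.
    echo "ilab not found on PATH; would run: $*"
fi
```

If the run still ends with the "Leaf nodes with empty sdg output" warning, the model is the more likely culprit than the taxonomy, and trying the quantized Mixtral is the next step.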
