
[BUG]: Running LLama.Examples => KernelMemory.cs throws System.AccessViolationException: "Attempted to read or write protected memory. This is often an indication that other memory is corrupt." #980

Open
freefer opened this issue Nov 13, 2024 · 3 comments


freefer commented Nov 13, 2024

Description

Model: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/blob/main/qwen2.5-coder-7b-instruct-q5_k_m.gguf
Built from source with GPU (CUDA) support enabled.
Ran the source example LLama.Examples => KernelMemory.cs.
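For context, the failing example boils down to the following (a minimal sketch based on the LLamaSharp.KernelMemory package in 0.19.0; the shipped KernelMemory.cs sets more options, so the exact configuration may differ):

```csharp
using System;
using Microsoft.KernelMemory;
using LLamaSharp.KernelMemory;

// Wire LLamaSharp in as both the text generator and the embeddings
// generator for Kernel Memory, backed by the Qwen2.5 GGUF model.
var config = new LLamaSharpConfig(@"Assets\qwen2.5-coder-7b-instruct-q5_k_m.gguf");

var memory = new KernelMemoryBuilder()
    .WithLLamaSharpDefaults(config)
    .Build<MemoryServerless>();

// Import the two sample PDFs, then ask a question; the
// AccessViolationException below fires while the answer is generated.
await memory.ImportDocumentAsync("Assets/sample-SK-Readme.pdf", documentId: "sk");
await memory.ImportDocumentAsync("Assets/sample-KM-Readme.pdf", documentId: "km");

var answer = await memory.AskAsync("What formats does KM support");
Console.WriteLine(answer.Result);
```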

Reproduction Steps

[llama 1]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
[llama 1]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[llama 1]: ggml_cuda_init: found 1 CUDA devices:
[llama 1]: Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
[llama 1]: llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2080) - 7113 MiB free
[llama 1]: llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from I:\LLamaSharp-0.19.0\LLama.Examples\bin\x64\Release\net8.0\Assets\qwen2.5-coder-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
[llama 1]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[llama 1]: llama_model_loader: - kv 0: general.architecture str = qwen2
[llama 1]: llama_model_loader: - kv 1: general.type str = model
[llama 1]: llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 7B Instruct GGUF
[llama 1]: llama_model_loader: - kv 3: general.finetune str = Instruct-GGUF
[llama 1]: llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
[llama 1]: llama_model_loader: - kv 5: general.size_label str = 7B
[llama 1]: llama_model_loader: - kv 6: qwen2.block_count u32 = 28
[llama 1]: llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
[llama 1]: llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
[llama 1]: llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
[llama 1]: llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
[llama 1]: llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
[llama 1]: llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
[llama 1]: llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
[llama 1]: llama_model_loader: - kv 14: general.file_type u32 = 17
[llama 1]: llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
[llama 1]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
[llama 1]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
[llama 1]: llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[llama 1]: llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[llama 1]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
[llama 1]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
[llama 1]: llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
[llama 1]: llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
[llama 1]: llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
[llama 1]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
[llama 1]: llama_model_loader: - kv 26: split.no u16 = 0
[llama 1]: llama_model_loader: - kv 27: split.count u16 = 0
[llama 1]: llama_model_loader: - kv 28: split.tensors.count i32 = 339
[llama 1]: llama_model_loader: - type f32: 141 tensors
[llama 1]: llama_model_loader: - type q5_K: 169 tensors
[llama 1]: llama_model_loader: - type q6_K: 29 tensors
[llama Info]: llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151649 '<|box_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151647 '<|object_ref_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151648 '<|box_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151644 '<|im_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151646 '<|object_ref_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
[llama 1]: llm_load_vocab: special tokens cache size = 22
[llama 1]: llm_load_vocab: token to piece cache size = 0.9310 MB
[llama 1]: llm_load_print_meta: format = GGUF V3 (latest)
[llama 1]: llm_load_print_meta: arch = qwen2
[llama 1]: llm_load_print_meta: vocab type = BPE
[llama 1]: llm_load_print_meta: n_vocab = 152064
[llama 1]: llm_load_print_meta: n_merges = 151387
[llama 1]: llm_load_print_meta: vocab_only = 0
[llama 1]: llm_load_print_meta: n_ctx_train = 131072
[llama 1]: llm_load_print_meta: n_embd = 3584
[llama 1]: llm_load_print_meta: n_layer = 28
[llama 1]: llm_load_print_meta: n_head = 28
[llama 1]: llm_load_print_meta: n_head_kv = 4
[llama 1]: llm_load_print_meta: n_rot = 128
[llama 1]: llm_load_print_meta: n_swa = 0
[llama 1]: llm_load_print_meta: n_embd_head_k = 128
[llama 1]: llm_load_print_meta: n_embd_head_v = 128
[llama 1]: llm_load_print_meta: n_gqa = 7
[llama 1]: llm_load_print_meta: n_embd_k_gqa = 512
[llama 1]: llm_load_print_meta: n_embd_v_gqa = 512
[llama 1]: llm_load_print_meta: f_norm_eps = 0.0e+00
[llama 1]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
[llama 1]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
[llama 1]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
[llama 1]: llm_load_print_meta: f_logit_scale = 0.0e+00
[llama 1]: llm_load_print_meta: n_ff = 18944
[llama 1]: llm_load_print_meta: n_expert = 0
[llama 1]: llm_load_print_meta: n_expert_used = 0
[llama 1]: llm_load_print_meta: causal attn = 1
[llama 1]: llm_load_print_meta: pooling type = 0
[llama 1]: llm_load_print_meta: rope type = 2
[llama 1]: llm_load_print_meta: rope scaling = linear
[llama 1]: llm_load_print_meta: freq_base_train = 1000000.0
[llama 1]: llm_load_print_meta: freq_scale_train = 1
[llama 1]: llm_load_print_meta: n_ctx_orig_yarn = 131072
[llama 1]: llm_load_print_meta: rope_finetuned = unknown
[llama 1]: llm_load_print_meta: ssm_d_conv = 0
[llama 1]: llm_load_print_meta: ssm_d_inner = 0
[llama 1]: llm_load_print_meta: ssm_d_state = 0
[llama 1]: llm_load_print_meta: ssm_dt_rank = 0
[llama 1]: llm_load_print_meta: ssm_dt_b_c_rms = 0
[llama 1]: llm_load_print_meta: model type = ?B
[llama 1]: llm_load_print_meta: model ftype = Q5_K - Medium
[llama 1]: llm_load_print_meta: model params = 7.62 B
[llama 1]: llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)
[llama 1]: llm_load_print_meta: general.name = Qwen2.5 Coder 7B Instruct GGUF
[llama 1]: llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
[llama 1]: llm_load_print_meta: EOS token = 151645 '<|im_end|>'
[llama 1]: llm_load_print_meta: EOT token = 151645 '<|im_end|>'
[llama 1]: llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
[llama 1]: llm_load_print_meta: LF token = 148848 'ÄĬ'
[llama 1]: llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
[llama 1]: llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
[llama 1]: llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
[llama 1]: llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
[llama 1]: llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
[llama 1]: llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
[llama 1]: llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
[llama 1]: llm_load_print_meta: EOG token = 151645 '<|im_end|>'
[llama 1]: llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
[llama 1]: llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
[llama 1]: llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
[llama 1]: llm_load_print_meta: max token length = 256
[llama 1]: llm_load_tensors: ggml ctx size = 0.30 MiB
[llama 1]: llm_load_tensors: offloading 20 repeating layers to GPU
[llama 1]: llm_load_tensors: offloaded 20/29 layers to GPU
[llama 1]: llm_load_tensors: CPU buffer size = 5186.92 MiB
[llama 1]: llm_load_tensors: CUDA0 buffer size = 3136.56 MiB

[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host output buffer size = 0.01 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 11.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 986
[llama 1]: llama_new_context_with_model: graph splits = 116
[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host output buffer size = 0.01 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 11.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 986
[llama 1]: llama_new_context_with_model: graph splits = 116
[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CPU output buffer size = 0.00 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 731.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 989
[llama 1]: llama_new_context_with_model: graph splits = 116
Importing 1 of 2: I:\LLamaSharp-0.19.0\LLama.Examples\bin\x64\Release\net8.0\Assets\sample-SK-Readme.pdf
Completed in 00:00:04.0542873

Importing 2 of 2: I:\LLamaSharp-0.19.0\LLama.Examples\bin\x64\Release\net8.0\Assets\sample-KM-Readme.pdf
Completed in 00:00:01.8282991

Question: What formats does KM support
Generating answer...
[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host output buffer size = 0.01 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 11.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 986
[llama 1]: llama_new_context_with_model: graph splits = 116
[llama Warning]: llama_get_logits_ith: invalid logits id 343, reason: no logits
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Repeat 2 times:

at LLama.Native.SafeLLamaSamplerChainHandle.<Sample>g__llama_sampler_sample|4_0(LLama.Native.SafeLLamaSamplerChainHandle, LLama.Native.SafeLLamaContextHandle, Int32)

at LLama.Native.SafeLLamaSamplerChainHandle.Sample(LLama.Native.SafeLLamaContextHandle, Int32)
at LLama.Sampling.BaseSamplingPipeline.Sample(LLama.Native.SafeLLamaContextHandle, Int32)
at LLama.StatelessExecutor+<InferAsync>d__18.MoveNext()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext(System.Threading.Thread)
at System.Runtime.CompilerServices.TaskAwaiter+<>c.<OutputWaitEtwEvents>b__12_0(System.Action, System.Threading.Tasks.Task)
at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean)
at System.Threading.Tasks.Task.RunContinuations(System.Object)
at System.Threading.Tasks.Task.FinishSlow(Boolean)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()

Environment & Configuration

Windows 10 x64
LLamaSharp-0.19.0

Known Workarounds

No response


freefer commented Nov 19, 2024

(screenshot of the modified example code attached)
Setting `Embeddings` to false seems to work properly.
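For anyone else hitting this, a minimal sketch of the workaround (assuming the same `ModelParams`-based setup the example uses; the crucial part is `Embeddings = false` on the parameters behind the text-generation executor):

```csharp
using LLama;
using LLama.Common;

var parameters = new ModelParams(@"Assets\qwen2.5-coder-7b-instruct-q5_k_m.gguf")
{
    ContextSize = 2048,
    GpuLayerCount = 20,
    // Workaround: leave embeddings disabled on the context used for text
    // generation. With Embeddings = true the context produces embeddings
    // instead of logits, and sampling then fails with the
    // "llama_get_logits_ith: ... no logits" warning followed by the
    // AccessViolationException above.
    Embeddings = false,
};

using var weights = LLamaWeights.LoadFromFile(parameters);
var executor = new StatelessExecutor(weights, parameters);
```

The trade-off, as discussed below, is that a context configured this way cannot produce embeddings, which matters because Kernel Memory also needs an embeddings generator.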

antoniovalentini commented

I'm having the same issue, and the workaround works fine. What are the downsides of setting `Embeddings = false`?

Environment & Configuration

Windows 11 x64
LLamaSharp-0.19.0
NVIDIA GeForce RTX 4070 Laptop GPU

martindevans (Member) commented

What are the downsides of setting Embeddings = false ?

Embeddings generation won't work. So line 89 of the image above is suspect: it's creating an embeddings generator with `Embeddings = false`.

This isn't fatal though, since embeddings can be toggled on and off at runtime. We just need a PR to make the text generator and embeddings generator do that :)
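For reference, llama.cpp does expose a native runtime toggle, `llama_set_embeddings(ctx, bool)`. A rough sketch of how such a PR might surface it (whether LLamaSharp 0.19.0 already binds this function is an assumption, so the sketch declares the import itself; the `SetEmbeddings` extension name is hypothetical):

```csharp
using System.Runtime.InteropServices;
using LLama.Native;

internal static class EmbeddingsToggleExtensions
{
    // Native toggle from llama.cpp (llama.h): enables or disables
    // embeddings output on an existing context without recreating it.
    [DllImport("llama", CallingConvention = CallingConvention.Cdecl)]
    private static extern void llama_set_embeddings(
        SafeLLamaContextHandle ctx,
        [MarshalAs(UnmanagedType.I1)] bool embeddings);

    /// <summary>
    /// Hypothetical helper: flip a context between logits mode (for the
    /// text generator) and embeddings mode (for the embeddings generator).
    /// </summary>
    public static void SetEmbeddings(this SafeLLamaContextHandle ctx, bool enabled)
        => llama_set_embeddings(ctx, enabled);
}
```

The text generator would then call `ctx.SetEmbeddings(false)` before sampling and the embeddings generator `ctx.SetEmbeddings(true)` before decoding, so both can share one model without the sampler ever reading from a context that produced no logits.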
