
[BUG]: Running LLama.Examples => KernelMemory.cs throws System.AccessViolationException: "Attempted to read or write protected memory. This is often an indication that other memory is corrupt." #980

Open
freefer opened this issue Nov 13, 2024 · 3 comments


freefer commented Nov 13, 2024

Description

Model: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/blob/main/qwen2.5-coder-7b-instruct-q5_k_m.gguf
Built from source with GPU (CUDA) support enabled.
Ran the source example LLama.Examples => KernelMemory.cs.
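For context, the failing example boils down to the following (a minimal sketch based on the LLamaSharp.KernelMemory package in 0.19.0; the shipped KernelMemory.cs sets more options, so the exact configuration may differ):

```csharp
using System;
using Microsoft.KernelMemory;
using LLamaSharp.KernelMemory;

// Wire LLamaSharp in as both the text generator and the embeddings
// generator for Kernel Memory, backed by the Qwen2.5 GGUF model.
var config = new LLamaSharpConfig(@"Assets\qwen2.5-coder-7b-instruct-q5_k_m.gguf");

var memory = new KernelMemoryBuilder()
    .WithLLamaSharpDefaults(config)
    .Build<MemoryServerless>();

// Import the two sample PDFs, then ask a question; the
// AccessViolationException below fires while the answer is generated.
await memory.ImportDocumentAsync("Assets/sample-SK-Readme.pdf", documentId: "sk");
await memory.ImportDocumentAsync("Assets/sample-KM-Readme.pdf", documentId: "km");

var answer = await memory.AskAsync("What formats does KM support");
Console.WriteLine(answer.Result);
```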

Reproduction Steps

[llama 1]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
[llama 1]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[llama 1]: ggml_cuda_init: found 1 CUDA devices:
[llama 1]: Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
[llama 1]: llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2080) - 7113 MiB free
[llama 1]: llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from I:\LLamaSharp-0.19.0\LLama.Examples\bin\x64\Release\net8.0\Assets\qwen2.5-coder-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
[llama 1]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[llama 1]: llama_model_loader: - kv 0: general.architecture str = qwen2
[llama 1]: llama_model_loader: - kv 1: general.type str = model
[llama 1]: llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 7B Instruct GGUF
[llama 1]: llama_model_loader: - kv 3: general.finetune str = Instruct-GGUF
[llama 1]: llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
[llama 1]: llama_model_loader: - kv 5: general.size_label str = 7B
[llama 1]: llama_model_loader: - kv 6: qwen2.block_count u32 = 28
[llama 1]: llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
[llama 1]: llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
[llama 1]: llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
[llama 1]: llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
[llama 1]: llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
[llama 1]: llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
[llama 1]: llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
[llama 1]: llama_model_loader: - kv 14: general.file_type u32 = 17
[llama 1]: llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
[llama 1]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
[llama 1]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
[llama 1]: llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[llama 1]: llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[llama 1]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
[llama 1]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
[llama 1]: llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
[llama 1]: llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
[llama 1]: llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
[llama 1]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
[llama 1]: llama_model_loader: - kv 26: split.no u16 = 0
[llama 1]: llama_model_loader: - kv 27: split.count u16 = 0
[llama 1]: llama_model_loader: - kv 28: split.tensors.count i32 = 339
[llama 1]: llama_model_loader: - type f32: 141 tensors
[llama 1]: llama_model_loader: - type q5_K: 169 tensors
[llama 1]: llama_model_loader: - type q6_K: 29 tensors
[llama Info]: llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151649 '<|box_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151647 '<|object_ref_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151648 '<|box_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151644 '<|im_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151646 '<|object_ref_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
[llama 1]: llm_load_vocab: special tokens cache size = 22
[llama 1]: llm_load_vocab: token to piece cache size = 0.9310 MB
[llama 1]: llm_load_print_meta: format = GGUF V3 (latest)
[llama 1]: llm_load_print_meta: arch = qwen2
[llama 1]: llm_load_print_meta: vocab type = BPE
[llama 1]: llm_load_print_meta: n_vocab = 152064
[llama 1]: llm_load_print_meta: n_merges = 151387
[llama 1]: llm_load_print_meta: vocab_only = 0
[llama 1]: llm_load_print_meta: n_ctx_train = 131072
[llama 1]: llm_load_print_meta: n_embd = 3584
[llama 1]: llm_load_print_meta: n_layer = 28
[llama 1]: llm_load_print_meta: n_head = 28
[llama 1]: llm_load_print_meta: n_head_kv = 4
[llama 1]: llm_load_print_meta: n_rot = 128
[llama 1]: llm_load_print_meta: n_swa = 0
[llama 1]: llm_load_print_meta: n_embd_head_k = 128
[llama 1]: llm_load_print_meta: n_embd_head_v = 128
[llama 1]: llm_load_print_meta: n_gqa = 7
[llama 1]: llm_load_print_meta: n_embd_k_gqa = 512
[llama 1]: llm_load_print_meta: n_embd_v_gqa = 512
[llama 1]: llm_load_print_meta: f_norm_eps = 0.0e+00
[llama 1]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
[llama 1]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
[llama 1]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
[llama 1]: llm_load_print_meta: f_logit_scale = 0.0e+00
[llama 1]: llm_load_print_meta: n_ff = 18944
[llama 1]: llm_load_print_meta: n_expert = 0
[llama 1]: llm_load_print_meta: n_expert_used = 0
[llama 1]: llm_load_print_meta: causal attn = 1
[llama 1]: llm_load_print_meta: pooling type = 0
[llama 1]: llm_load_print_meta: rope type = 2
[llama 1]: llm_load_print_meta: rope scaling = linear
[llama 1]: llm_load_print_meta: freq_base_train = 1000000.0
[llama 1]: llm_load_print_meta: freq_scale_train = 1
[llama 1]: llm_load_print_meta: n_ctx_orig_yarn = 131072
[llama 1]: llm_load_print_meta: rope_finetuned = unknown
[llama 1]: llm_load_print_meta: ssm_d_conv = 0
[llama 1]: llm_load_print_meta: ssm_d_inner = 0
[llama 1]: llm_load_print_meta: ssm_d_state = 0
[llama 1]: llm_load_print_meta: ssm_dt_rank = 0
[llama 1]: llm_load_print_meta: ssm_dt_b_c_rms = 0
[llama 1]: llm_load_print_meta: model type = ?B
[llama 1]: llm_load_print_meta: model ftype = Q5_K - Medium
[llama 1]: llm_load_print_meta: model params = 7.62 B
[llama 1]: llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)
[llama 1]: llm_load_print_meta: general.name = Qwen2.5 Coder 7B Instruct GGUF
[llama 1]: llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
[llama 1]: llm_load_print_meta: EOS token = 151645 '<|im_end|>'
[llama 1]: llm_load_print_meta: EOT token = 151645 '<|im_end|>'
[llama 1]: llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
[llama 1]: llm_load_print_meta: LF token = 148848 'ÄĬ'
[llama 1]: llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
[llama 1]: llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
[llama 1]: llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
[llama 1]: llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
[llama 1]: llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
[llama 1]: llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
[llama 1]: llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
[llama 1]: llm_load_print_meta: EOG token = 151645 '<|im_end|>'
[llama 1]: llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
[llama 1]: llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
[llama 1]: llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
[llama 1]: llm_load_print_meta: max token length = 256
[llama 1]: llm_load_tensors: ggml ctx size = 0.30 MiB
[llama 1]: llm_load_tensors: offloading 20 repeating layers to GPU
[llama 1]: llm_load_tensors: offloaded 20/29 layers to GPU
[llama 1]: llm_load_tensors: CPU buffer size = 5186.92 MiB
[llama 1]: llm_load_tensors: CUDA0 buffer size = 3136.56 MiB

[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host output buffer size = 0.01 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 11.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 986
[llama 1]: llama_new_context_with_model: graph splits = 116
[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host output buffer size = 0.01 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 11.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 986
[llama 1]: llama_new_context_with_model: graph splits = 116
[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CPU output buffer size = 0.00 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 731.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 989
[llama 1]: llama_new_context_with_model: graph splits = 116
Importing 1 of 2: I:\LLamaSharp-0.19.0\LLama.Examples\bin\x64\Release\net8.0\Assets\sample-SK-Readme.pdf
Completed in 00:00:04.0542873

Importing 2 of 2: I:\LLamaSharp-0.19.0\LLama.Examples\bin\x64\Release\net8.0\Assets\sample-KM-Readme.pdf
Completed in 00:00:01.8282991

Question: What formats does KM support
Generating answer...
[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host output buffer size = 0.01 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 11.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 986
[llama 1]: llama_new_context_with_model: graph splits = 116
[llama Warning]: llama_get_logits_ith: invalid logits id 343, reason: no logits
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Repeat 2 times:

at LLama.Native.SafeLLamaSamplerChainHandle.<Sample>g__llama_sampler_sample|4_0(LLama.Native.SafeLLamaSamplerChainHandle, LLama.Native.SafeLLamaContextHandle, Int32)

at LLama.Native.SafeLLamaSamplerChainHandle.Sample(LLama.Native.SafeLLamaContextHandle, Int32)
at LLama.Sampling.BaseSamplingPipeline.Sample(LLama.Native.SafeLLamaContextHandle, Int32)
at LLama.StatelessExecutor+<InferAsync>d__18.MoveNext()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext(System.Threading.Thread)
at System.Runtime.CompilerServices.TaskAwaiter+<>c.<OutputWaitEtwEvents>b__12_0(System.Action, System.Threading.Tasks.Task)
at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean)
at System.Threading.Tasks.Task.RunContinuations(System.Object)
at System.Threading.Tasks.Task.FinishSlow(Boolean)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()

Environment & Configuration

Windows 10 x64
LLamaSharp-0.19.0

Known Workarounds

No response


freefer commented Nov 19, 2024

(screenshot of the modified example code attached)
Setting `Embeddings` to false seems to work properly.
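For anyone else hitting this, a minimal sketch of the workaround (assuming the same `ModelParams`-based setup the example uses; the crucial part is `Embeddings = false` on the parameters behind the text-generation executor):

```csharp
using LLama;
using LLama.Common;

var parameters = new ModelParams(@"Assets\qwen2.5-coder-7b-instruct-q5_k_m.gguf")
{
    ContextSize = 2048,
    GpuLayerCount = 20,
    // Workaround: leave embeddings disabled on the context used for text
    // generation. With Embeddings = true the context produces embeddings
    // instead of logits, and sampling then fails with the
    // "llama_get_logits_ith: ... no logits" warning followed by the
    // AccessViolationException above.
    Embeddings = false,
};

using var weights = LLamaWeights.LoadFromFile(parameters);
var executor = new StatelessExecutor(weights, parameters);
```

The trade-off, as discussed below, is that a context configured this way cannot produce embeddings, which matters because Kernel Memory also needs an embeddings generator.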

antoniovalentini commented

I'm having the same issue, and the workaround works fine. What are the downsides of setting `Embeddings = false`?

Environment & Configuration

Windows 11 x64
LLamaSharp-0.19.0
NVIDIA GeForce RTX 4070 Laptop GPU

martindevans (Member) commented

What are the downsides of setting Embeddings = false ?

Embeddings generation won't work. So line 89 of the image above is suspect: it's creating an embeddings generator with `Embeddings = false`.

This isn't fatal though, since embeddings can be toggled on and off at runtime. We just need a PR to make the text generator and embeddings generator do that :)
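For reference, llama.cpp does expose a native runtime toggle, `llama_set_embeddings(ctx, bool)`. A rough sketch of how such a PR might surface it (whether LLamaSharp 0.19.0 already binds this function is an assumption, so the sketch declares the import itself; the `SetEmbeddings` extension name is hypothetical):

```csharp
using System.Runtime.InteropServices;
using LLama.Native;

internal static class EmbeddingsToggleExtensions
{
    // Native toggle from llama.cpp (llama.h): enables or disables
    // embeddings output on an existing context without recreating it.
    [DllImport("llama", CallingConvention = CallingConvention.Cdecl)]
    private static extern void llama_set_embeddings(
        SafeLLamaContextHandle ctx,
        [MarshalAs(UnmanagedType.I1)] bool embeddings);

    /// <summary>
    /// Hypothetical helper: flip a context between logits mode (for the
    /// text generator) and embeddings mode (for the embeddings generator).
    /// </summary>
    public static void SetEmbeddings(this SafeLLamaContextHandle ctx, bool enabled)
        => llama_set_embeddings(ctx, enabled);
}
```

The text generator would then call `ctx.SetEmbeddings(false)` before sampling and the embeddings generator `ctx.SetEmbeddings(true)` before decoding, so both can share one model without the sampler ever reading from a context that produced no logits.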
