[BUG]: Running LLama.Examples => KernelMemory.cs throws System.AccessViolationException: "Attempted to read or write protected memory. This is often an indication that other memory is corrupt."
#980 · Open · freefer opened this issue on Nov 13, 2024 · 3 comments
Comments
What are the downsides of setting Embeddings = false?
Embeddings generation won't work, so line 89 of the image above is suspect: it is creating an embeddings generator with Embeddings = false. This isn't fatal, though; you can toggle embeddings on and off at runtime. We just need a PR to make the text generator and embeddings generator do that :)
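As an illustration of that suggestion, here is a minimal sketch (not the actual example code; the path and parameter values are placeholders taken from the log below) of keeping embeddings enabled only for the embedding generator while the text generator leaves them off, using the public LLamaSharp types ModelParams, LLamaWeights, LLamaEmbedder and StatelessExecutor:

```csharp
using LLama;
using LLama.Common;

// Placeholder path; substitute the model used in the report.
var modelPath = @"Assets\qwen2.5-coder-7b-instruct-q5_k_m.gguf";

// Text-generation context: embeddings stay off.
var textParams = new ModelParams(modelPath)
{
    ContextSize = 2048,
    GpuLayerCount = 20,
    Embeddings = false,
};

// Embedding context: embeddings must be on, otherwise the embedder
// has nothing to read back after a decode.
var embedParams = new ModelParams(modelPath)
{
    ContextSize = 2048,
    GpuLayerCount = 20,
    Embeddings = true,
};

// One set of weights can back both components.
using var weights = LLamaWeights.LoadFromFile(textParams);

// Used by KernelMemory to generate answers.
var executor = new StatelessExecutor(weights, textParams);

// Used by KernelMemory to embed documents during ingestion.
using var embedder = new LLamaEmbedder(weights, embedParams);
```

Whether the generator and embedder can instead flip the flag on a live context at runtime (rather than each carrying its own parameter set, as above) is exactly what the follow-up PR mentioned in the comment would add.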
Description
Model: https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF/blob/main/qwen2.5-coder-7b-instruct-q5_k_m.gguf
Running from source with GPU (CUDA) enabled.
Run the example KernelMemory.cs from LLama.Examples (a condensed sketch of its wiring is shown below).
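For context, the example's KernelMemory wiring follows roughly this shape. This is a condensed sketch, assuming the LLamaSharp.KernelMemory helpers LLamaSharpConfig and WithLLamaSharpDefaults; the document IDs and the ContextSize value are illustrative, not copied from the example:

```csharp
using LLamaSharp.KernelMemory;
using Microsoft.KernelMemory;

// Model shipped alongside the example assets (see the log below).
var config = new LLamaSharpConfig(@"Assets\qwen2.5-coder-7b-instruct-q5_k_m.gguf")
{
    ContextSize = 2048,
};

// Build a serverless KernelMemory instance backed by LLamaSharp for
// both text generation and embedding generation.
var memory = new KernelMemoryBuilder()
    .WithLLamaSharpDefaults(config)
    .Build<MemoryServerless>();

// Index the two sample PDFs, then ask the question that triggers the crash.
await memory.ImportDocumentAsync(@"Assets\sample-SK-Readme.pdf", documentId: "sk-readme");
await memory.ImportDocumentAsync(@"Assets\sample-KM-Readme.pdf", documentId: "km-readme");

var answer = await memory.AskAsync("What formats does KM support");
Console.WriteLine(answer.Result);
```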
Reproduction Steps
[llama 1]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
[llama 1]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[llama 1]: ggml_cuda_init: found 1 CUDA devices:
[llama 1]: Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
[llama 1]: llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 2080) - 7113 MiB free
[llama 1]: llama_model_loader: loaded meta data with 29 key-value pairs and 339 tensors from I:\LLamaSharp-0.19.0\LLama.Examples\bin\x64\Release\net8.0\Assets\qwen2.5-coder-7b-instruct-q5_k_m.gguf (version GGUF V3 (latest))
[llama 1]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[llama 1]: llama_model_loader: - kv 0: general.architecture str = qwen2
[llama 1]: llama_model_loader: - kv 1: general.type str = model
[llama 1]: llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 7B Instruct GGUF
[llama 1]: llama_model_loader: - kv 3: general.finetune str = Instruct-GGUF
[llama 1]: llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
[llama 1]: llama_model_loader: - kv 5: general.size_label str = 7B
[llama 1]: llama_model_loader: - kv 6: qwen2.block_count u32 = 28
[llama 1]: llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
[llama 1]: llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
[llama 1]: llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
[llama 1]: llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
[llama 1]: llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
[llama 1]: llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
[llama 1]: llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
[llama 1]: llama_model_loader: - kv 14: general.file_type u32 = 17
[llama 1]: llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
[llama 1]: llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
[llama 1]: llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
[llama 1]: llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[llama 1]: llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[llama 1]: llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
[llama 1]: llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
[llama 1]: llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
[llama 1]: llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
[llama 1]: llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
[llama 1]: llama_model_loader: - kv 25: general.quantization_version u32 = 2
[llama 1]: llama_model_loader: - kv 26: split.no u16 = 0
[llama 1]: llama_model_loader: - kv 27: split.count u16 = 0
[llama 1]: llama_model_loader: - kv 28: split.tensors.count i32 = 339
[llama 1]: llama_model_loader: - type f32: 141 tensors
[llama 1]: llama_model_loader: - type q5_K: 169 tensors
[llama 1]: llama_model_loader: - type q6_K: 29 tensors
[llama Info]: llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151649 '<|box_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151647 '<|object_ref_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151648 '<|box_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151644 '<|im_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151646 '<|object_ref_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
[llama Info]: llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
[llama 1]: llm_load_vocab: special tokens cache size = 22
[llama 1]: llm_load_vocab: token to piece cache size = 0.9310 MB
[llama 1]: llm_load_print_meta: format = GGUF V3 (latest)
[llama 1]: llm_load_print_meta: arch = qwen2
[llama 1]: llm_load_print_meta: vocab type = BPE
[llama 1]: llm_load_print_meta: n_vocab = 152064
[llama 1]: llm_load_print_meta: n_merges = 151387
[llama 1]: llm_load_print_meta: vocab_only = 0
[llama 1]: llm_load_print_meta: n_ctx_train = 131072
[llama 1]: llm_load_print_meta: n_embd = 3584
[llama 1]: llm_load_print_meta: n_layer = 28
[llama 1]: llm_load_print_meta: n_head = 28
[llama 1]: llm_load_print_meta: n_head_kv = 4
[llama 1]: llm_load_print_meta: n_rot = 128
[llama 1]: llm_load_print_meta: n_swa = 0
[llama 1]: llm_load_print_meta: n_embd_head_k = 128
[llama 1]: llm_load_print_meta: n_embd_head_v = 128
[llama 1]: llm_load_print_meta: n_gqa = 7
[llama 1]: llm_load_print_meta: n_embd_k_gqa = 512
[llama 1]: llm_load_print_meta: n_embd_v_gqa = 512
[llama 1]: llm_load_print_meta: f_norm_eps = 0.0e+00
[llama 1]: llm_load_print_meta: f_norm_rms_eps = 1.0e-06
[llama 1]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
[llama 1]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
[llama 1]: llm_load_print_meta: f_logit_scale = 0.0e+00
[llama 1]: llm_load_print_meta: n_ff = 18944
[llama 1]: llm_load_print_meta: n_expert = 0
[llama 1]: llm_load_print_meta: n_expert_used = 0
[llama 1]: llm_load_print_meta: causal attn = 1
[llama 1]: llm_load_print_meta: pooling type = 0
[llama 1]: llm_load_print_meta: rope type = 2
[llama 1]: llm_load_print_meta: rope scaling = linear
[llama 1]: llm_load_print_meta: freq_base_train = 1000000.0
[llama 1]: llm_load_print_meta: freq_scale_train = 1
[llama 1]: llm_load_print_meta: n_ctx_orig_yarn = 131072
[llama 1]: llm_load_print_meta: rope_finetuned = unknown
[llama 1]: llm_load_print_meta: ssm_d_conv = 0
[llama 1]: llm_load_print_meta: ssm_d_inner = 0
[llama 1]: llm_load_print_meta: ssm_d_state = 0
[llama 1]: llm_load_print_meta: ssm_dt_rank = 0
[llama 1]: llm_load_print_meta: ssm_dt_b_c_rms = 0
[llama 1]: llm_load_print_meta: model type = ?B
[llama 1]: llm_load_print_meta: model ftype = Q5_K - Medium
[llama 1]: llm_load_print_meta: model params = 7.62 B
[llama 1]: llm_load_print_meta: model size = 5.07 GiB (5.71 BPW)
[llama 1]: llm_load_print_meta: general.name = Qwen2.5 Coder 7B Instruct GGUF
[llama 1]: llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
[llama 1]: llm_load_print_meta: EOS token = 151645 '<|im_end|>'
[llama 1]: llm_load_print_meta: EOT token = 151645 '<|im_end|>'
[llama 1]: llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
[llama 1]: llm_load_print_meta: LF token = 148848 'ÄĬ'
[llama 1]: llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
[llama 1]: llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
[llama 1]: llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
[llama 1]: llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
[llama 1]: llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
[llama 1]: llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
[llama 1]: llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
[llama 1]: llm_load_print_meta: EOG token = 151645 '<|im_end|>'
[llama 1]: llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
[llama 1]: llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
[llama 1]: llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
[llama 1]: llm_load_print_meta: max token length = 256
[llama 1]: llm_load_tensors: ggml ctx size = 0.30 MiB
[llama 1]: llm_load_tensors: offloading 20 repeating layers to GPU
[llama 1]: llm_load_tensors: offloaded 20/29 layers to GPU
[llama 1]: llm_load_tensors: CPU buffer size = 5186.92 MiB
[llama 1]: llm_load_tensors: CUDA0 buffer size = 3136.56 MiB
[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host output buffer size = 0.01 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 11.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 986
[llama 1]: llama_new_context_with_model: graph splits = 116
[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host output buffer size = 0.01 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 11.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 986
[llama 1]: llama_new_context_with_model: graph splits = 116
[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CPU output buffer size = 0.00 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 731.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 989
[llama 1]: llama_new_context_with_model: graph splits = 116
Importing 1 of 2: I:\LLamaSharp-0.19.0\LLama.Examples\bin\x64\Release\net8.0\Assets\sample-SK-Readme.pdf
Completed in 00:00:04.0542873
Importing 2 of 2: I:\LLamaSharp-0.19.0\LLama.Examples\bin\x64\Release\net8.0\Assets\sample-KM-Readme.pdf
Completed in 00:00:01.8282991
Question: What formats does KM support
Generating answer...
[llama 1]: llama_new_context_with_model: n_ctx = 2048
[llama 1]: llama_new_context_with_model: n_batch = 512
[llama 1]: llama_new_context_with_model: n_ubatch = 512
[llama 1]: llama_new_context_with_model: flash_attn = 0
[llama 1]: llama_new_context_with_model: freq_base = 1000000.0
[llama 1]: llama_new_context_with_model: freq_scale = 1
[llama 1]: llama_kv_cache_init: CUDA_Host KV buffer size = 32.00 MiB
[llama 1]: llama_kv_cache_init: CUDA0 KV buffer size = 80.00 MiB
[llama 1]: llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host output buffer size = 0.01 MiB
[llama 1]: llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
[llama 1]: llama_new_context_with_model: CUDA_Host compute buffer size = 11.01 MiB
[llama 1]: llama_new_context_with_model: graph nodes = 986
[llama 1]: llama_new_context_with_model: graph splits = 116
[llama Warning]: llama_get_logits_ith: invalid logits id 343, reason: no logits
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
Repeat 2 times:
at LLama.Native.SafeLLamaSamplerChainHandle.<Sample>g__llama_sampler_sample|4_0(LLama.Native.SafeLLamaSamplerChainHandle, LLama.Native.SafeLLamaContextHandle, Int32)
at LLama.Native.SafeLLamaSamplerChainHandle.Sample(LLama.Native.SafeLLamaContextHandle, Int32)
at LLama.Sampling.BaseSamplingPipeline.Sample(LLama.Native.SafeLLamaContextHandle, Int32)
at LLama.StatelessExecutor+<InferAsync>d__18.MoveNext()
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext(System.Threading.Thread)
at System.Runtime.CompilerServices.TaskAwaiter+<>c.<OutputWaitEtwEvents>b__12_0(System.Action, System.Threading.Tasks.Task)
at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Action, Boolean)
at System.Threading.Tasks.Task.RunContinuations(System.Object)
at System.Threading.Tasks.Task.FinishSlow(Boolean)
at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
Environment & Configuration
Windows 10 x64
LLamaSharp-0.19.0
Known Workarounds
No response