
out of RAM #209

Open
ChichoSkruch opened this issue Feb 11, 2025 · 6 comments
@ChichoSkruch

I use ollama in the terminal with deepseek-r1:14b and even 32b (very slow) on this Linux Zenbook Pro Duo UX851G. When I chat with deepseek-coder-v2 or qwen2.5-coder it works fine. But if I try code review or code suggestions I get this error:

Debugger entered--Lisp error: (error "Error calling the LLM: model requires more system ...")
error("Error calling the LLM: %s" "model requires more system memory (49.0 GiB) than ...")
#("model requires more system memory (49.0 GiB) than ...")
#f(compiled-function (_ msg) #<bytecode 0x1dd94b5c526701a6>)(error "model requires more system memory (49.0 GiB) than ...")
#f(compiled-function (type err) #<bytecode 0x1c89dbf61d0fce3b>)(error "model requires more system memory (49.0 GiB) than ...")
llm-provider-utils-callback-in-buffer(# #f(compiled-function (type err) #<bytecode 0x1c89dbf61d0fce3b>) error "model requires more system memory (49.0 GiB) than ...")
#f(compiled-function (_ data) #<bytecode -0xe7e2c5f10e42101>)(error ((error . "model requires more system memory (49.0 GiB) than ...")))
llm-request-plz--handle-error(#s(plz-error :curl-error nil :response #s(plz-response :version 1.1 :status 500 :headers ((content-type . "application/json; charset=utf-8") (date . "Tue, 11 Feb 2025 12:32:35 GMT") (content-length . "85$
#f(compiled-function (error) #<bytecode -0x8cecea1549f9e66>)(#s(plz-error :curl-error nil :response #s(plz-response :version 1.1 :status 500 :headers ((content-type . "application/json; charset=utf-8") (date . "Tue, 11 Feb 2025 12:32$
#f(compiled-function (error) #<bytecode 0x14bca7baea1431be>)(#s(plz-error :curl-error nil :response #s(plz-response :version 1.1 :status 500 :headers ((content-type . "application/json; charset=utf-8") (date . "Tue, 11 Feb 2025 12:32$
plz--respond(# #<buffer plz-request-curl> "finished\n")
apply(plz--respond (# #<buffer plz-request-curl> "finished\n"))
timer-event-handler([t 26539 17251 109626 nil plz--respond (# #<buffer plz-request-curl> "finished\n") nil 478000 nil])

Is there a configuration I can make so that I can use the models that run in my terminal?

@s-kostyaev
Owner

Show me your ellama configuration

@s-kostyaev
Owner

Also you can try to enable flash attention and cache quantization: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-enable-flash-attention
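For reference, the linked FAQ enables these through environment variables on the ollama server. On a Linux install managed by systemd it is roughly the following sketch (the variable names come from the ollama docs; q8_0 is only one of the supported cache types, so treat this as an example rather than a recommendation):

# systemctl edit ollama.service, then add under [Service]:
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
# afterwards: systemctl daemon-reload && systemctl restart ollama

With a quantized KV cache, memory use at large num_ctx values should drop noticeably.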

@s-kostyaev
Owner

s-kostyaev commented Feb 11, 2025

In your configuration I want to see num_ctx, the context length. The higher the value you set, the more VRAM/RAM you need. Models also differ in size, so you can pick a smaller model or a smaller quant of the same model. If num_ctx is not set, ollama defaults to 2k AFAIK.
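As a minimal sketch (not from the thread), a lighter-weight coding provider could look like this; the model tag "qwen2.5-coder:7b-instruct-q4_K_M" is only an illustration of a smaller quant, so substitute whatever tag ollama list actually shows on your machine:

(require 'llm-ollama)
(setq ellama-coding-provider
      (make-llm-ollama
       ;; smaller quant of the same model family (example tag, adjust to what you pulled)
       :chat-model "qwen2.5-coder:7b-instruct-q4_K_M"
       :embedding-model "nomic-embed-text"
       ;; the KV cache grows linearly with num_ctx, so 8192 needs
       ;; roughly a quarter of the memory that 32768 does
       :default-chat-non-standard-params '(("num_ctx" . 8192))))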

@s-kostyaev
Owner

Also, you can mark a region before calling ellama-code-review so that only the active region is sent instead of the whole buffer content.

@ChichoSkruch
Author

ChichoSkruch commented Feb 11, 2025

(use-package ellama
  :demand t
  :bind ("C-c m" . ellama-transient-main-menu)
  :init
  ;; setup key bindings
  (setq ellama-keymap-prefix "C-c e")
  ;; language you want ellama to translate to
  (setq ellama-language "German")
  ;; could be llm-openai for example
  (require 'llm-ollama)
  (setq ellama-provider
          (make-llm-ollama
           ;; this model should be pulled before using it
           ;; the value should be the same as the one you used when pulling it in the terminal
           :chat-model "qwen2.5-coder"
           :embedding-model "nomic-embed-text"
           :default-chat-non-standard-params '(("num_ctx" . 8192))))
  (setq ellama-summarization-provider
          (make-llm-ollama
           :chat-model "qwen2.5-coder"
           :embedding-model "nomic-embed-text"
           :default-chat-non-standard-params '(("num_ctx" . 32768))))
  (setq ellama-coding-provider
          (make-llm-ollama
           :chat-model "qwen2.5-coder"
           :embedding-model "nomic-embed-text"
           :default-chat-non-standard-params '(("num_ctx" . 32768))))
  ;; Predefined llm providers for interactive switching.
  ;; You shouldn't add ollama providers here - they can be selected interactively
  ;; without it. This is just an example.
  (setq ellama-providers
          '(("deepseek-r1:14b" . (make-llm-ollama
                         :chat-model "qwen2.5-coder"
                         :embedding-model "qwen2.5-coder"))
            ("deepseek-coder-v2" . (make-llm-ollama
                          :chat-model "qwen2.5-coder"
                          :embedding-model "nomic-embed-text"))
            ("qwen2.5-coder" . (make-llm-ollama
                          :chat-model "qwen2.5-coder"
                          :embedding-model "nomic-embed-text"))))
  ;; Naming new sessions with llm
  (setq ellama-naming-provider
          (make-llm-ollama
           :chat-model "qwen2.5-coder"
           :embedding-model "nomic-embed-text"
           :default-chat-non-standard-params '(("stop" . ("\n")))))
  (setq ellama-naming-scheme 'ellama-generate-name-by-llm)
  ;; Translation llm provider
  (setq ellama-translation-provider
          (make-llm-ollama
           :chat-model "qwen2.5-coder"
           :embedding-model "nomic-embed-text"
           :default-chat-non-standard-params
           '(("num_ctx" . 32768))))
  (setq ellama-extraction-provider (make-llm-ollama
                                      :chat-model "qwen2.5-coder"
                                      :embedding-model "nomic-embed-text"
                                      :default-chat-non-standard-params
                                      '(("num_ctx" . 32768))))
  ;; customize display buffer behaviour
  ;; see ~(info "(elisp) Buffer Display Action Functions")~
  (setq ellama-chat-display-action-function #'display-buffer-full-frame)
  (setq ellama-instant-display-action-function #'display-buffer-at-bottom)
  :config
  ;; send last message in chat buffer with C-c C-c
  (add-hook 'org-ctrl-c-ctrl-c-hook #'ellama-chat-send-last-message))

I set the flag but nothing changes - same result whether the whole buffer is sent or only the selected code.

As you can see, the num_ctx values are the defaults (the ones provided here). The deepseek-coder model is not that big, and I use it with no problem in terminal ollama run mode.

I should mention that after setting the flash attention and cache quantization flags, code suggestion works with qwen2.5-coder.

@s-kostyaev
Owner

Your configuration contains only "qwen2.5-coder" (see the :chat-model fields) with different context lengths under different aliases. It looks like you can't use it with the full context length of 32768.
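For illustration only (the deepseek tags below are the ones mentioned at the top of the thread; check ollama list for the exact names you pulled), the aliases could point at the models their names suggest, so that switching providers actually switches models, and the heavier ones can be given a smaller num_ctx:

(setq ellama-providers
        '(("deepseek-r1:14b" . (make-llm-ollama
                       ;; example: alias and chat model now match
                       :chat-model "deepseek-r1:14b"
                       :embedding-model "nomic-embed-text"
                       :default-chat-non-standard-params '(("num_ctx" . 8192))))
          ("deepseek-coder-v2" . (make-llm-ollama
                        :chat-model "deepseek-coder-v2"
                        :embedding-model "nomic-embed-text"
                        :default-chat-non-standard-params '(("num_ctx" . 8192))))
          ("qwen2.5-coder" . (make-llm-ollama
                        :chat-model "qwen2.5-coder"
                        :embedding-model "nomic-embed-text"
                        :default-chat-non-standard-params '(("num_ctx" . 8192))))))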
