
out of RAM #209

Open
ChichoSkruch opened this issue Feb 11, 2025 · 6 comments
@ChichoSkruch

I use ollama in the terminal with deepseek-r1:14b and even 32b (very slow) on this Linux Zenbook Pro Duo UX851G. When I chat with deepseek-coder-v2 or qwen2.5-coder it works fine. But if I try code review or code suggestions I get this error:

Debugger entered--Lisp error: (error "Error calling the LLM: model requires more system ...")
error("Error calling the LLM: %s" "model requires more system memory (49.0 GiB) than ...")
#("model requires more system memory (49.0 GiB) than ...")
#f(compiled-function (_ msg) #<bytecode 0x1dd94b5c526701a6>)(error "model requires more system memory (49.0 GiB) than ...")
#f(compiled-function (type err) #<bytecode 0x1c89dbf61d0fce3b>)(error "model requires more system memory (49.0 GiB) than ...")
llm-provider-utils-callback-in-buffer(# #f(compiled-function (type err) #<bytecode 0x1c89dbf61d0fce3b>) error "model requires more system memory (49.0 GiB) than ...")
#f(compiled-function (_ data) #<bytecode -0xe7e2c5f10e42101>)(error ((error . "model requires more system memory (49.0 GiB) than ...")))
llm-request-plz--handle-error(#s(plz-error :curl-error nil :response #s(plz-response :version 1.1 :status 500 :headers ((content-type . "application/json; charset=utf-8") (date . "Tue, 11 Feb 2025 12:32:35 GMT") (content-length . "85$
#f(compiled-function (error) #<bytecode -0x8cecea1549f9e66>)(#s(plz-error :curl-error nil :response #s(plz-response :version 1.1 :status 500 :headers ((content-type . "application/json; charset=utf-8") (date . "Tue, 11 Feb 2025 12:32$
#f(compiled-function (error) #<bytecode 0x14bca7baea1431be>)(#s(plz-error :curl-error nil :response #s(plz-response :version 1.1 :status 500 :headers ((content-type . "application/json; charset=utf-8") (date . "Tue, 11 Feb 2025 12:32$
plz--respond(# #<buffer plz-request-curl> "finished\n")
apply(plz--respond (# #<buffer plz-request-curl> "finished\n"))
timer-event-handler([t 26539 17251 109626 nil plz--respond (# #<buffer plz-request-curl> "finished\n") nil 478000 nil])

Is there a configuration I can make so that I can use the models that run in my terminal?

@s-kostyaev
Owner

Show me your ellama configuration

@s-kostyaev
Owner

Also you can try to enable flash attention and cache quantization: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-enable-flash-attention
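For reference, the linked FAQ enables these through environment variables on the ollama server. On a Linux install managed by systemd it is roughly the following sketch (the variable names come from the ollama docs; q8_0 is only one of the supported cache types, so treat this as an example rather than a recommendation):

# systemctl edit ollama.service, then add under [Service]:
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
# afterwards: systemctl daemon-reload && systemctl restart ollama

With a quantized KV cache, memory use at large num_ctx values should drop noticeably.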

@s-kostyaev
Owner

s-kostyaev commented Feb 11, 2025

In your configuration I want to see num_ctx, the context length. The higher the value you set, the more VRAM/RAM you need. Models also differ in size, so you can pick a smaller model or a smaller quant of the same model. If num_ctx is not set, ollama defaults to 2k AFAIK.
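As a minimal sketch (not from the thread), a lighter-weight coding provider could look like this; the model tag "qwen2.5-coder:7b-instruct-q4_K_M" is only an illustration of a smaller quant, so substitute whatever tag ollama list actually shows on your machine:

(require 'llm-ollama)
(setq ellama-coding-provider
      (make-llm-ollama
       ;; smaller quant of the same model family (example tag, adjust to what you pulled)
       :chat-model "qwen2.5-coder:7b-instruct-q4_K_M"
       :embedding-model "nomic-embed-text"
       ;; the KV cache grows linearly with num_ctx, so 8192 needs
       ;; roughly a quarter of the memory that 32768 does
       :default-chat-non-standard-params '(("num_ctx" . 8192))))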

@s-kostyaev
Owner

Also, you can mark a region before calling ellama-code-review so that only the active region is sent instead of the whole buffer content.

@ChichoSkruch
Author

ChichoSkruch commented Feb 11, 2025

(use-package ellama
  :demand t
  :bind ("C-c m" . ellama-transient-main-menu)
  :init
  ;; setup key bindings
  (setq ellama-keymap-prefix "C-c e")
  ;; language you want ellama to translate to
  (setq ellama-language "German")
  ;; could be llm-openai for example
  (require 'llm-ollama)
  (setq ellama-provider
          (make-llm-ollama
           ;; this model should be pulled before using it
           ;; the value should be the same as the one you used when pulling it in the terminal
           :chat-model "qwen2.5-coder"
           :embedding-model "nomic-embed-text"
           :default-chat-non-standard-params '(("num_ctx" . 8192))))
  (setq ellama-summarization-provider
          (make-llm-ollama
           :chat-model "qwen2.5-coder"
           :embedding-model "nomic-embed-text"
           :default-chat-non-standard-params '(("num_ctx" . 32768))))
  (setq ellama-coding-provider
          (make-llm-ollama
           :chat-model "qwen2.5-coder"
           :embedding-model "nomic-embed-text"
           :default-chat-non-standard-params '(("num_ctx" . 32768))))
  ;; Predefined llm providers for interactive switching.
  ;; You shouldn't add ollama providers here - they can be selected interactively
  ;; without it. This is just an example.
  (setq ellama-providers
          '(("deepseek-r1:14b" . (make-llm-ollama
                         :chat-model "qwen2.5-coder"
                         :embedding-model "qwen2.5-coder"))
            ("deepseek-coder-v2" . (make-llm-ollama
                          :chat-model "qwen2.5-coder"
                          :embedding-model "nomic-embed-text"))
            ("qwen2.5-coder" . (make-llm-ollama
                          :chat-model "qwen2.5-coder"
                          :embedding-model "nomic-embed-text"))))
  ;; Naming new sessions with llm
  (setq ellama-naming-provider
          (make-llm-ollama
           :chat-model "qwen2.5-coder"
           :embedding-model "nomic-embed-text"
           :default-chat-non-standard-params '(("stop" . ("\n")))))
  (setq ellama-naming-scheme 'ellama-generate-name-by-llm)
  ;; Translation llm provider
  (setq ellama-translation-provider
          (make-llm-ollama
           :chat-model "qwen2.5-coder"
           :embedding-model "nomic-embed-text"
           :default-chat-non-standard-params
           '(("num_ctx" . 32768))))
  (setq ellama-extraction-provider (make-llm-ollama
                                      :chat-model "qwen2.5-coder"
                                      :embedding-model "nomic-embed-text"
                                      :default-chat-non-standard-params
                                      '(("num_ctx" . 32768))))
  ;; customize display buffer behaviour
  ;; see ~(info "(elisp) Buffer Display Action Functions")~
  (setq ellama-chat-display-action-function #'display-buffer-full-frame)
  (setq ellama-instant-display-action-function #'display-buffer-at-bottom)
  :config
  ;; send last message in chat buffer with C-c C-c
  (add-hook 'org-ctrl-c-ctrl-c-hook #'ellama-chat-send-last-message))

I set the flag but nothing changes - same result whether the whole buffer is sent or only the selected code.

As you can see, the num_ctx values are the defaults (the ones provided here). The deepseek-coder model is not that big, and I use it with no problem in terminal ollama run mode.

I should mention that after setting the flash attention and cache quantization flags, code suggestion works with qwen2.5-coder.

@s-kostyaev
Owner

Your configuration contains only "qwen2.5-coder" (see the :chat-model fields) with different context lengths under different aliases. It looks like you can't use it with the full context length of 32768.
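For illustration only (the deepseek tags below are the ones mentioned at the top of the thread; check ollama list for the exact names you pulled), the aliases could point at the models their names suggest, so that switching providers actually switches models, and the heavier ones can be given a smaller num_ctx:

(setq ellama-providers
        '(("deepseek-r1:14b" . (make-llm-ollama
                       ;; example: alias and chat model now match
                       :chat-model "deepseek-r1:14b"
                       :embedding-model "nomic-embed-text"
                       :default-chat-non-standard-params '(("num_ctx" . 8192))))
          ("deepseek-coder-v2" . (make-llm-ollama
                        :chat-model "deepseek-coder-v2"
                        :embedding-model "nomic-embed-text"
                        :default-chat-non-standard-params '(("num_ctx" . 8192))))
          ("qwen2.5-coder" . (make-llm-ollama
                        :chat-model "qwen2.5-coder"
                        :embedding-model "nomic-embed-text"
                        :default-chat-non-standard-params '(("num_ctx" . 8192))))))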
