Ability to use local LLM (LM Studio or Ollama) #13

Open · eliliam opened this issue Apr 14, 2025 · 20 comments

Comments

@eliliam commented Apr 14, 2025

This is such an amazing package, but with some of the larger codebases I work with, the costs of running it against cloud models would just be too high. How difficult would it be to support locally running LLMs through the likes of LM Studio or Ollama? I know both provide OpenAI-compatible APIs as well as a suite of other ways to interact with the locally running model. This feature would be killer and would make this a tool similar to Claude Code for local codebase analysis.

@zachary62 (Contributor) commented Apr 14, 2025

Hi @eliliam,
The tool needs a call_llm function that takes a string as input and outputs a string: https://github.com/The-Pocket/Tutorial-Codebase-Knowledge/blob/main/utils/call_llm.py
So you can simply replace the function with an implementation based on LM Studio or Ollama: https://the-pocket.github.io/PocketFlow/utility_function/llm.html
Let me know if this makes sense. Thanks!
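
For reference, a minimal sketch of such a drop-in call_llm, assuming a local server that exposes an OpenAI-compatible API (LM Studio serves one at http://localhost:1234/v1 by default; Ollama at http://localhost:11434/v1). The model name is a placeholder, not something from the repo:

# Minimal sketch (not the project's official implementation): call_llm backed
# by any OpenAI-compatible local server such as LM Studio or Ollama.
from openai import OpenAI  # pip install openai

# Assumption: LM Studio's server on its default port; for Ollama use
# base_url="http://localhost:11434/v1" instead.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

def call_llm(prompt: str, use_cache: bool = True) -> str:
    # The placeholder model name must match a model loaded locally.
    response = client.chat.completions.create(
        model="your-local-model",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content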

@sitestudio commented:

So I asked an LLM what to put in here and it came back with this (which works beautifully):

import ollama

def call_llm(prompt, use_cache: bool = True):
    """
    Calls an Ollama model to generate a text response.

    Args:
        prompt (str): The prompt to send to the model.
        use_cache (bool, optional): Whether to use Ollama's caching mechanism. Defaults to True.

    Returns:
        str: The generated text response from the model.
    """
    try:
        response = ollama.chat(
            model='cogito:14b',  # deepcoder:14b  gemma3:12b  phi4:14b Replace with your desired Ollama model name
            messages=[
                {
                    'role': 'user',
                    'content': prompt,
                },
            ],
            stream=False,  # stream must be False to get the full response in one object
            options={
                'use_cache': use_cache,
            },
        )
        return response['message']['content']
    except ollama.ResponseError as e:
        print(f"Ollama Error: {e}")
        return None  # Or handle the error as needed.

Also had to add the following to requirements.txt:

ollama >=0.4.7

@zachary62 (Contributor) commented:

@sitestudio
Amazing! One minor change I would suggest is to remove the except: return None block.
It may otherwise generate, e.g., a tutorial with empty chapters. Just let it fail, as Pocket Flow has native node retry.
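
Applying that suggestion, the snippet above would shrink to something like this (a sketch; the model name is still a placeholder, and errors simply propagate so the node retry can kick in):

import ollama

def call_llm(prompt, use_cache: bool = True):
    """Call an Ollama model and return its text response; let errors propagate."""
    response = ollama.chat(
        model='cogito:14b',  # placeholder: replace with the model you have pulled
        messages=[{'role': 'user', 'content': prompt}],
        stream=False,
        options={'use_cache': use_cache},
    )
    return response['message']['content']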

@xiongyw commented Apr 16, 2025

FYI: the following simple update works for me, tested with both Ollama and Grok: call_llm() for ollama and grok

@FatCache commented:

Are there any plans to merge the Ollama support into the codebase?

@zachary62 (Contributor) commented:

> Are there any plans to merge the Ollama support into the codebase?

Check out the code snippet provided by @xiongyw for ollama support: call_llm() for ollama and grok

zachary62 mentioned this issue Apr 23, 2025
@piranna commented Apr 23, 2025

Ollama also allows working with DeepSeek. I don't know if it supports Gemini, OpenAI, and the other closed models, but if so, we could replace the call_llm() function with a single one that uses Ollama, delegating the choice of model to it :-)
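
A sketch of that idea, assuming the model choice is delegated to an environment variable so call_llm itself never hard-codes one (the variable name and default model are made up for illustration):

import os
import ollama

def call_llm(prompt: str, use_cache: bool = True) -> str:
    # The concrete model is chosen outside the code, e.g.
    # OLLAMA_MODEL=deepseek-r1:14b python main.py ...
    model = os.environ.get("OLLAMA_MODEL", "qwen3:8b")  # assumed default
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=False,
    )
    return response["message"]["content"]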

@eliliam (Author) commented Apr 23, 2025

@zachary62 the code provided by @xiongyw looks good. Could we have that made into a PR to get merged into master?

@zachary62 (Contributor) commented:

> @zachary62 the code provided by @xiongyw looks good. Could we have that made into a PR to get merged into master?

Yes! Could you make this code commented out by default, and say something like "uncomment it for Ollama"? Thank you!
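
For example (a sketch of the suggested layout only, not the actual file), utils/call_llm.py could keep the default cloud implementation active and carry the Ollama variant as a commented-out block:

# utils/call_llm.py (sketch of the suggested layout)

# Default implementation: cloud provider, left as in the repository.
def call_llm(prompt: str, use_cache: bool = True) -> str:
    ...

# --- Uncomment for Ollama -------------------------------------------------
# import ollama
#
# def call_llm(prompt: str, use_cache: bool = True) -> str:
#     response = ollama.chat(
#         model='qwen3:8b',  # placeholder model name
#         messages=[{'role': 'user', 'content': prompt}],
#         stream=False,
#     )
#     return response['message']['content']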

@Le09 commented Apr 24, 2025

I cherry-picked the commit and implemented a simple switch in #50; let me know if you'd prefer any changes to it.

@gethari commented Apr 29, 2025

What's the best model to use for a React + TS codebase? I tried the generation with llama3.2 and the output is just hot garbage; none of the generated content is relevant to the codebase I pointed it at.

@sitestudio commented Apr 29, 2025

Try the brand new Qwen3:8b (or bigger if your hardware can handle it). I switched to it yesterday and was getting much better results than with anything else. I was only generating Python and have limited hardware at the moment, but I was very impressed with the step up in its "cognition".

@gethari commented Apr 30, 2025

@sitestudio I tried using codellama:13b, but the script seems to fail randomly. I tried Copilot to fix this, but I'm not a Python expert, so I couldn't get this model to work.

 yaml_str = response.strip().split("```yaml")[1].split("```")[0].strip()
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Looks like the response is not in the expected format and differs across models.
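
The failing line assumes the model wraps its YAML in a ```yaml fence. A more tolerant extraction (a sketch, not the project's code; it falls back to treating the whole response as YAML when no fence is present) could look like this:

import re
import yaml  # PyYAML

def extract_yaml(response: str) -> dict:
    """Pull a YAML block out of an LLM response, tolerating missing fences."""
    match = re.search(r"```(?:yaml|yml)?\s*(.*?)```", response, re.DOTALL)
    yaml_str = match.group(1) if match else response
    return yaml.safe_load(yaml_str.strip())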

@sitestudio commented:

@gethari my experience has been similar now that I've actually tried to compile the Python code that comes from qwen3-8b; I suspect it's a lack of RAM/VRAM locally. I'm waiting to get my M1 MacBook repaired to see if that helps, or looking for some bigger hardware.

My other issue is that if I run in Plan mode in Cline, I'm unable to respond to the first output: I get an error message about tool_name and about setting up a Modelfile with settings like PARAMETER stop "</attempt_completion>".

However, my research also suggests that this issue can be related to various things, such as the version of Ollama, the individual model, or the amount of RAM/VRAM I have.

So for now I have resorted to running in Plan mode, including answers in subsequent Plan mode prompts, then running an individual Act mode prompt and using that code to move forward.

@gethari commented Apr 30, 2025

> So for now I have resorted to running in Plan mode

How do I do this, @sitestudio?

@sitestudio commented:

@gethari I am using Cline and at the bottom right corner of the Extension window (just below where your prompt would go) you can toggle between Plan and Act mode.

@TheHawk3r commented:

@gethari I had this issue too.

yaml_str = response.strip().split("```yaml")[1].split("```")[0].strip()
              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

@zachary62 (Contributor) commented:

This issue is caused by the model not being capable enough to output a valid YAML string. Please use a more capable model.
Check out #61

@TheHawk3r commented:

import requests
from termcolor import cprint
import json

# Configure the connection to the local LM Studio server
local_api_url = "http://127.0.0.1:1234/v1/completions"  # The correct address for your local server
api_key = "gemma-3-27b-it"  # If you need an API key for access

def call_llm(prompt: str, use_cache: bool = False):
    cprint("[Querying local LLM via LM Studio]", "cyan")
    try:
        # Set the payload for the POST request
        data = {
            "model": "gemma-3-27b-it",  # The model specified in LM Studio
            "prompt": prompt,
            "max_tokens": 100,  # You can adjust this parameter
            "temperature": 0.7,  # You can adjust this parameter
        }

        # Set the headers for the POST request
        headers = {
            "Authorization": f"Bearer {api_key}",  # If an API key is required
            "Content-Type": "application/json"
        }

        # Send the POST request to the local LM Studio server
        response = requests.post(local_api_url, headers=headers, data=json.dumps(data))

        # Check whether the request succeeded
        if response.status_code == 200:
            result = response.json()
            response_text = result.get("choices")[0].get("text").strip()  # Extract the response text from the JSON
            return response_text
        else:
            cprint(f"[Error calling LM Studio] {response.status_code}: {response.text}", "red")
            return "[Error calling LLM]"

    except Exception as e:
        cprint(f"[Error calling LM Studio] {e}", "red")
        return "[Error calling LLM]"

Here is the call_llm.py script I modified with ChatGPT to work using LM Studio. gemma-3-27b-it gives me an error, so another model is most probably needed.

ValueError: Missing keys in abstraction item: {'name': 'Logging System', 'description': 'Imagine it as'}
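
One possible cause (an assumption, not confirmed in the thread): with max_tokens set to 100, longer completions get cut off mid-YAML, which would explain the description ending at "Imagine it as" and the missing keys. A larger token budget in the request payload is a cheap thing to try; a hypothetical helper showing the adjusted payload:

def build_payload(prompt: str) -> dict:
    # Hypothetical helper: same payload as in the script above, but with a
    # larger token budget, since max_tokens=100 truncates longer YAML answers.
    return {
        "model": "gemma-3-27b-it",
        "prompt": prompt,
        "max_tokens": 4096,  # assumption: enough headroom for full chapter YAML
        "temperature": 0.7,
    }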

@TheHawk3r commented:

I managed to make it work with deepseek-r1-distill-qwen-14b.
