This repository has been archived by the owner on Sep 12, 2024. It is now read-only.

Llama2 quantized q5_1 #108

Open
HolmesDomain opened this issue Jul 25, 2023 · 1 comment

Comments

HolmesDomain commented Jul 25, 2023

I am getting this error:

llama.cpp: loading model from /Documents/Proj/delta/llama-2-7b-chat/ggml-model-q5_1.bin
error loading model: unrecognized tensor type 14

llama_init_from_file: failed to load model
node:internal/process/promises:289
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[Error: Failed to initialize LLama context from file: /Documents/Proj/delta/llama-2-7b-chat/ggml-model-q5_1.bin] {
  code: 'GenericFailure'
}

My index.js:

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = path.resolve(process.cwd(), "./llama-2-7b-chat/ggml-model-q5_1.bin");
const llama = new LLM(LLamaCpp);
const config = {
    modelPath: model,
    enableLogging: false,
    nCtx: 1024,
    seed: 0,
    f16Kv: false,
    logitsAll: false,
    vocabOnly: false,
    useMlock: false,
    embedding: false,
    useMmap: true,
    nGpuLayers: 0
};

const run = async () => {
    await llama.load(config);

    await llama.createCompletion({
        prompt: "My favorite movie is",
        nThreads: 4,
        nTokPredict: 1024,
        topK: 40,
        topP: 0.1,
        temp: 0.3,
        repeatPenalty: 1,
    }, (response) => {
        process.stdout.write(response.token);
    });
};

run();

It worked before I quantized the model, but it is very slow, so I'm hoping quantization will make it faster (I'm assuming that is what would fix the speed).
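As a rough illustration of why quantization can help with speed, here is a small sketch that estimates model file sizes from bits per weight. It assumes about 6.74 billion parameters for Llama 2 7B and uses the per-block layouts of the ggml q8_0, q5_1 and q4_0 formats (32 weights per block). It ignores the few tensors that stay at higher precision, so treat the numbers as approximations; `estimateSizeGB` and `BITS_PER_WEIGHT` are illustrative names, not part of llama-node. CPU token generation tends to be limited by memory bandwidth, so a smaller file usually generates faster, although the actual speedup depends on the hardware.

```javascript
// Approximate storage cost per weight for a few ggml formats.
// Quantized formats pack 32 weights per block plus per-block scale data.
const BITS_PER_WEIGHT = {
  f16: 16, // 2 bytes per weight, no blocks
  q8_0: ((2 + 32) * 8) / 32, // fp16 scale + 32 int8 values = 34 bytes -> 8.5 bits
  q5_1: ((2 + 2 + 4 + 16) * 8) / 32, // fp16 scale, fp16 min, 4 bytes of 5th bits, 16 bytes of nibbles -> 6 bits
  q4_0: ((2 + 16) * 8) / 32, // fp16 scale + 16 bytes of nibbles = 18 bytes -> 4.5 bits
};

// Estimated size in decimal gigabytes, ignoring tensors kept at higher precision.
function estimateSizeGB(nParams, format) {
  const bpw = BITS_PER_WEIGHT[format];
  if (bpw === undefined) {
    throw new Error(`unknown format: ${format}`);
  }
  return (nParams * bpw) / 8 / 1e9;
}

// Assumption: roughly 6.74 billion parameters for Llama 2 7B.
const N_PARAMS = 6.74e9;

for (const format of Object.keys(BITS_PER_WEIGHT)) {
  console.log(`${format}: ~${estimateSizeGB(N_PARAMS, format).toFixed(2)} GB`);
}
```

Under these assumptions q5_1 needs 6/16, or 37.5%, of the f16 storage: roughly 5 GB instead of roughly 13.5 GB.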

HolmesDomain commented Jul 25, 2023

I got it running by using the .bin file from here: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main

I had no luck generating the q5_1 file myself by following the instructions here: https://github.com/ggerganov/llama.cpp#prepare-data--run

If this is a common problem, maybe you could point people toward just downloading the quantized file directly from TheBloke.
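To interpret the `unrecognized tensor type 14` message from the first comment, the sketch below maps ggml tensor type ids to names. The table mirrors the `ggml_type` enum in ggml.h as of mid-2023 (an assumption worth checking against the llama.cpp version you build), where 14 is `GGML_TYPE_Q6_K`, one of the k-quant types. If that mapping holds, the self-quantized file contains at least one k-quant tensor that the llama.cpp loader bundled with llama-node does not recognize, which would be consistent with the downloaded file loading fine; this thread does not confirm the cause. `describeTensorType` and `typeFromError` are hypothetical helpers, not llama-node APIs.

```javascript
// ggml tensor type ids, mirroring the ggml_type enum in ggml.h (mid-2023).
// Ids 4 and 5 (Q4_2, Q4_3) were removed upstream and are left out here.
const GGML_TYPE_NAMES = new Map([
  [0, "F32"], [1, "F16"], [2, "Q4_0"], [3, "Q4_1"],
  [6, "Q5_0"], [7, "Q5_1"], [8, "Q8_0"], [9, "Q8_1"],
  // k-quant types
  [10, "Q2_K"], [11, "Q3_K"], [12, "Q4_K"], [13, "Q5_K"], [14, "Q6_K"], [15, "Q8_K"],
]);

const K_QUANT_IDS = new Set([10, 11, 12, 13, 14, 15]);

// Turn a numeric type id into a readable name, flagging k-quant types.
function describeTensorType(id) {
  const name = GGML_TYPE_NAMES.get(id);
  if (name === undefined) {
    return `unknown ggml type ${id}`;
  }
  return K_QUANT_IDS.has(id) ? `${name} (k-quant)` : name;
}

// Extract the id from a loader message like "unrecognized tensor type 14".
function typeFromError(message) {
  const match = /unrecognized tensor type (\d+)/.exec(message);
  return match ? Number(match[1]) : null;
}

const msg = "error loading model: unrecognized tensor type 14";
console.log(describeTensorType(typeFromError(msg))); // prints "Q6_K (k-quant)"
```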
