Provides an interface for extensions to use language models directly in the browser.
Powered by @mlc-ai/web-llm.
Requirements: SillyTavern 1.12.4 or later.
Install using the link:
https://github.com/SillyTavern/Extension-WebLLM
Note: For simplicity, concurrent requests are not executed in parallel; they are queued and processed one at a time.
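In practice, this means callers can issue several requests at once without any extra coordination; they simply complete one after another. A minimal sketch of this behavior, using the engine API described below:

```js
const engine = SillyTavern.llm.getEngine();

// Both requests are accepted immediately, but the extension processes them
// one at a time, so no locking is needed on the caller's side.
const [first, second] = await Promise.all([
    engine.generateChatPrompt([{ role: 'user', content: 'Say "one".' }]),
    engine.generateChatPrompt([{ role: 'user', content: 'Say "two".' }])
]);
```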
Access the default API engine instance from the SillyTavern.llm object, or create your private instance using the SillyTavern.llm.getEngine method.
Method arguments:
- modelId: The model ID to use.
- silent: If true, the engine will not display any toast messages.
```js
const modelId = 'gemma-2-2b-it-q4f16_1-MLC';
const silent = false;
const engine = SillyTavern.llm.getEngine(modelId, silent);
```
To get all the available models, use the getModels method.
Returns an array of objects with the following properties:
- id: The model ID.
- vram_required: The amount of VRAM required to load the model in MB.
- context_size: The maximum number of tokens that can be processed in a single request.
```js
const models = await SillyTavern.llm.getModels();
// models = [{id: 'gemma-2-2b-it-q4f16_1-MLC', vram_required: 1895.3, context_size: 4096}, ...]
```
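For example, the returned list can be used to pick a model that fits a given VRAM budget before creating an engine. A minimal sketch (the 2000 MB threshold is an arbitrary assumption):

```js
const models = await SillyTavern.llm.getModels();

// Pick the first model whose reported VRAM requirement fits the budget.
const candidate = models.find((model) => model.vram_required <= 2000);

if (candidate) {
    const engine = SillyTavern.llm.getEngine(candidate.id);
}
```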
To get the current model (even if not loaded), use the getCurrentModel method. If the engine has not been initialized, it will return null.
```js
const modelId = 'gemma-2-2b-it-q4f16_1-MLC';
const engine = SillyTavern.llm.getEngine(modelId);
const model = engine.getCurrentModel();
// model = {id: 'gemma-2-2b-it-q4f16_1-MLC', vram_required: 1895.3, context_size: 4096}
```
The model will be loaded the first time it is accessed, or when the loadModel method is called.
```js
const modelId = 'gemma-2-2b-it-q4f16_1-MLC';
const engine = SillyTavern.llm.getEngine();
await engine.loadModel(modelId);
```
To count the number of tokens in a string with the loaded model's tokenizer, use the countTokens method.
```js
const text = 'Hello, world!';
const engine = SillyTavern.llm.getEngine();
const tokens = await engine.countTokens(text);
// tokens = 5
```
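Token counts pair naturally with the context_size reported for the model, for example to warn before sending a prompt that will not fit. A minimal sketch:

```js
const engine = SillyTavern.llm.getEngine();
const model = engine.getCurrentModel();
const text = 'A long prompt assembled by the extension...';
const tokens = await engine.countTokens(text);

// context_size comes from the model info object described above.
if (model && tokens > model.context_size) {
    console.warn(`Prompt is ${tokens} tokens, but the model only supports ${model.context_size}.`);
}
```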
Set the parameters to be used in all subsequent requests with setDefaultParams. If a parameter is not set, the model default will be used.
Supported parameters:
- max_tokens: The maximum number of tokens to generate.
- temperature: Controls the randomness of the generated text.
- top_p: An alternative way to control the randomness of the generated text.
- frequency_penalty: Penalizes tokens based on how often they already appear in the generated text.
- presence_penalty: Penalizes tokens that have already appeared at least once in the generated text.
- stop: A list of strings at which the model should stop generating text.
```js
const params = {
    max_tokens: 100,
    temperature: 0.7,
    top_p: 1,
    frequency_penalty: 0,
    presence_penalty: 0,
    stop: ['\n']
};
engine.setDefaultParams(params);
```
generateChatPrompt accepts a list of chat message objects (similar to the OpenAI format) and returns the generated text. Additionally, you can specify override parameters to use in this request only.
```js
const prompt = [
    { role: 'user', content: 'Hello!' }
];
const overrideParams = {
    max_tokens: 50
};
const response = await engine.generateChatPrompt(prompt, overrideParams);
// response = 'Hello! How are you?'
```
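The message list can carry a full conversation, not just a single turn. A minimal sketch of a multi-turn prompt with system instructions and prior history (the content is purely illustrative):

```js
const engine = SillyTavern.llm.getEngine();
const prompt = [
    { role: 'system', content: 'You are a terse assistant.' },
    { role: 'user', content: 'What is WebLLM?' },
    { role: 'assistant', content: 'A library for running language models in the browser.' },
    { role: 'user', content: 'Does it need a server?' }
];
const response = await engine.generateChatPrompt(prompt);
```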
generateJSON accepts a list of chat message objects (similar to the OpenAI format) and returns a generated JSON object. If the model fails to produce a valid JSON object, the method returns null. You must instruct the model to generate JSON in the prompt! Additionally, you can specify override parameters to use in this request only.
```js
const prompt = [
    { role: 'system', content: 'Generate a JSON object.' },
    { role: 'user', content: 'Describe a person with the following attributes: name, age, city.' }
];
const overrideParams = {
    temperature: 0.2
};
const response = await engine.generateJSON(prompt, overrideParams);
// response = {name: 'John', age: 25, city: 'New York'}
```
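Because an invalid generation yields null, it is worth checking the result and retrying (or falling back) instead of using it directly. A minimal sketch:

```js
const engine = SillyTavern.llm.getEngine();
const prompt = [
    { role: 'system', content: 'Generate a JSON object.' },
    { role: 'user', content: 'Describe a person with the following attributes: name, age, city.' }
];

let person = null;
// Retry a couple of times, since a failed attempt returns null rather than throwing.
for (let attempt = 0; attempt < 3 && person === null; attempt++) {
    person = await engine.generateJSON(prompt, { temperature: 0.2 });
}

if (person === null) {
    console.warn('The model did not produce valid JSON.');
}
```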
generateChatStream accepts a list of chat message objects (similar to the OpenAI format) and returns an AsyncGenerator that yields the text generated so far (the full text on each iteration, not deltas). Additionally, you can specify override parameters to use in this request only.
```js
const prompt = [
    { role: 'user', content: 'Hello!' }
];
const overrideParams = { /* ... */ };
const stream = await engine.generateChatStream(prompt, overrideParams);
for await (const { text } of stream) {
    console.log(text);
}
```
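Since every yielded value contains the entire text generated so far, a consumer should replace its output on each iteration rather than append to it. A minimal sketch that renders the stream into a page element (the #output element is a hypothetical target):

```js
const engine = SillyTavern.llm.getEngine();
const output = document.getElementById('output'); // hypothetical element in the page
const stream = await engine.generateChatStream([{ role: 'user', content: 'Hello!' }]);

for await (const { text } of stream) {
    // Each value carries the full text so far, so overwrite instead of appending.
    output.textContent = text;
}
```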
Default parameters can be configured in the extension settings, including the preferred model.
These parameters will be used in requests for the default engine instance unless overridden by the calling code.
To open a demo playground, use the "Try it out!" button in the extension settings.
This will use the default engine instance with the default parameters.
To build the extension from the source code, run:

```sh
npm install
npm run build
```

License: AGPL-3.0