Releases: LostRuins/koboldcpp
koboldcpp-1.45.2
- Improved embedded horde worker: more responsive, and added Session Stats (Total Kudos Earned, EarnRate, Timings)
- Added a new parameter to the grammar sampler API, grammar_retain_state, which lets you persist the grammar state across multiple requests.
- Allow launching by picking a .kcpps file in the file selector GUI combined with --skiplauncher. That settings file must already have a model selected. (Similar to --config, but that one doesn't use the GUI at all.)
- Added a new flag toggle --foreground for Windows users. This sends the console terminal to the foreground every time a new prompt is generated, to avoid some idling slowdown issues.
- Increased max supported context with --contextsize to 32k, but only for GGUF models. It's still limited to 16k for older model versions. GGUF now has no hard limit on max context since it switched to using allocators, but this is not compatible with older models. Additionally, models not trained with extended context are unlikely to work when RoPE-scaled beyond 32k.
- Added a simple OpenAI-compatible completions API, which you can access at /v1/completions (see the request sketch after this list). You're still recommended to use the Kobold API, as it has many more settings.
- Increased the stop_sequence limit to 16.
- Improved SSE streaming by batching pending tokens between events.
- Upgraded Lite polled-streaming to work even in multiuser mode. This works by sending a unique key for each request.
- Improved Makefile to reduce unnecessary builds, added flag for skipping K-quants.
- Enhanced Remote-Link.cmd to also work on Linux, simply run it to create a Cloudflare tunnel to access koboldcpp anywhere.
- Improved the default colab notebook to use mmq.
- Updated Lite and pulled other fixes and improvements from upstream llama.cpp.
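For illustration, a minimal request sketch for the new OpenAI-compatible endpoint. The payload fields shown (prompt, max_tokens, temperature) and the choices[0]["text"] response shape follow the standard OpenAI completions format and are assumptions here, not an exhaustive list of what is accepted:

```python
# Sketch: query the new OpenAI-compatible /v1/completions endpoint.
# Field names follow the usual OpenAI completions format (assumed here);
# the Kobold API remains the recommended, more configurable route.
import json
import urllib.request

payload = {
    "prompt": "Once upon a time,",
    "max_tokens": 60,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:5001/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

# OpenAI-style responses place the generated text under choices[0]["text"].
print(result["choices"][0]["text"])
```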
Important: Deprecation Notice for KoboldCpp 1.45.1
The following command line arguments are deprecated and will be removed in a future version.
--psutil_set_threads - will be removed as it's now generally unhelpful; the defaults are usually sufficient.
--stream - a Kobold Lite only parameter, which is now a toggle saved inside Lite's settings and thus no longer necessary.
--unbantokens - EOS unbans should only be set via the generate API, in the use_default_badwordsids json field.
--usemirostat - Mirostat values should only be set via the generate API, in the mirostat, mirostat_tau and mirostat_eta json fields.
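For illustration, a minimal sketch of setting these values per-request via the generate API instead of the deprecated flags. The endpoint path (/api/v1/generate) and the results[0]["text"] response shape are assumed to follow the usual Kobold API layout; only the json field names are taken from the notes above:

```python
# Sketch: per-request EOS unban and Mirostat settings via the generate API,
# replacing the deprecated --unbantokens and --usemirostat flags.
import json
import urllib.request

payload = {
    "prompt": "Write a haiku about kobolds.",
    "max_length": 80,
    "use_default_badwordsids": False,  # False = do not ban EOS (assumed semantics)
    "mirostat": 2,                     # Mirostat mode
    "mirostat_tau": 5.0,
    "mirostat_eta": 0.1,
}

req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",  # assumed standard Kobold API path
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["results"][0]["text"])
```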
Hotfix for 1.45.2 - Fixed a bug with reading thread counts in 1.45 and 1.45.1. Also moved the OpenAI endpoint from /api/extra/oai/v1/completions to just /v1/completions.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag.
koboldcpp-1.44.2
A.K.A The "Mom: we have SillyTavern at home edition"
- Added multi-user mode with --multiuser, which allows up to 5 concurrent incoming /generate requests from multiple clients to be queued up and processed in sequence, instead of rejecting other requests while busy. Note that the /check and /abort endpoints are inactive while multiple requests are in-queue; this is to prevent one user from accidentally reading or cancelling a different user's request.
- Added a new launcher argument --onready, which allows you to pass a terminal command (e.g. start a python script) to be executed after KoboldCpp has finished loading. This runs as a subprocess, and can be useful for starting cloudflare tunnels, displaying URLs, etc.
- Added Grammar Sampling for all architectures, which can be accessed via the web API (also in Lite). Older models are also supported.
- Added a new API endpoint /api/extra/true_max_context_length, which allows fetching the true max context limit, separate from the horde-friendly value (see the sketch after this list).
- Added support for selecting a 4th GPU from the UI and command line (was max 3 before).
- Tweaked automatic RoPE scaling
- Pulled other fixes and improvements from upstream.
- Note: Using --usecublas with the prebuilt Windows executables here is only intended for Nvidia devices. For AMD users, please check out @YellowRoseCx's koboldcpp-rocm fork instead.
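A minimal sketch of querying the new endpoint, assuming the default port and a simple JSON response with the limit in a value field (the exact response shape is not spelled out in these notes):

```python
# Sketch: fetch the true max context limit, separate from the horde-friendly value.
# Assumes a JSON response along the lines of {"value": 8192}.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:5001/api/extra/true_max_context_length") as resp:
    data = json.loads(resp.read())

print("True max context length:", data.get("value"))
```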
Major Update for Kobold Lite:
- Kobold Lite has undergone a massive overhaul, renamed and rearranged elements for a cleaner UI.
- Added Aesthetic UI for chat mode, which is now automatically selected when importing Tavern cards. You can easily switch between the different UIs for chat and instruct modes from the settings panel.
- Added Mirostat UI configs to settings panel.
- Allowed Idle Responses in all modes; it is now a global setting. Also fixed an idle response detection bug.
- Smarter group chats: mentioning a specific name when inside a group chat will cause that character to respond, instead of one being picked at random.
- Added support for automagically increasing the max context size slider limit, if a larger context is detected.
- Added scenario for importing characters from Chub.Ai
- Added a settings checkbox to enable streaming whenever applicable, without requiring messing with URLs. Streaming can now be easily toggled from the settings UI, similar to EOS unbanning, although the --stream flag is still kept for compatibility.
- Added a few Instruct Tag Presets in a dropdown.
- Supports instruct placeholders, allowing easy switching between instruct formats without rewriting the text. Added a toggle option to use "Raw Instruct Tags" (the old method) as an alternative to placeholder tags like {{[INPUT]}} and {{[OUTPUT]}}.
- Added a toggle for "Newline After Memory" which can be set in the memory panel.
- Added a toggle for "Show Rename Save File" which shows a popup the user can use to rename the json save file before saving.
- You can specify a BNF grammar string in settings to use when generating; this controls grammar sampling (see the sketch below).
- Various minor bugfixes; also fixed stop_sequences still appearing in the AI outputs, they should be correctly truncated now.
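For illustration, a minimal sketch of grammar sampling through the web API rather than the Lite settings field. The grammar parameter name and the results[0]["text"] response shape are assumptions; the grammar itself uses the GBNF-style syntax from llama.cpp's grammar sampler:

```python
# Sketch (parameter name assumed): constrain generation with a BNF-style grammar
# via the generate API. This grammar restricts the output to "yes" or "no".
import json
import urllib.request

grammar = r'''
root ::= "yes" | "no"
'''

payload = {
    "prompt": "Is the sky blue? Answer yes or no: ",
    "max_length": 4,
    "grammar": grammar,  # assumed field name for the grammar sampler
}

req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",  # assumed standard Kobold API path
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["results"][0]["text"])
```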
v1.44.1 update - added queue number to perf endpoint, and updated lite to fix a few formatting bugs.
v1.44.2 update - fixed a speed regression from sched_yield again.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag.
koboldcpp-1.43
- Re-added support for automatic rope scale calculations based on a model's training context (n_ctx_train); this triggers if you do not explicitly specify a --ropeconfig. For example, this means llama2 models will (by default) use a smaller rope scale compared to llama1 models, for the same specified --contextsize. Setting --ropeconfig will override this. This was bugged and removed in the previous release, but it should be working fine now.
- HIP and CUDA visible devices are set to that GPU only, if a GPU number is provided and tensor split is not specified.
- Fixed RWKV models being broken after recent upgrades.
- Tweaked --unbantokens to decrease the banned token logit values further, as very rarely they could still appear. Still not using -inf, as that causes issues with typical sampling.
- Integrated SSE streaming improvements from @kalomaze
- Added mutex for thread-safe polled-streaming from @Elbios
- Added support for older GGML (ggjt_v3) for 34B llama2 models by @vxiiduu, note that this may still have issues if n_gqa is not 1, in which case using GGUF would be better.
- Fixed support for Windows 7, which should work in noavx2 and failsafe modes again. Also, SSE3 flags are now enabled for failsafe mode.
- Updated Kobold Lite, now uses placeholders for instruct tags that get swapped during generation.
- Tab navigation order improved in GUI launcher, though some elements like checkboxes still require mouse to toggle.
- Pulled other fixes and improvements from upstream.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag.
Of Note:
- Reminder that HIPBLAS requires self compilation, and is not included by default in the prebuilt executables.
- Remember that token unbans can now be set via API (and Lite) in addition to the command line.
koboldcpp-1.42.1
- Added support for LLAMA GGUFv2 models, handled automatically. All older models will still continue to work normally.
- Fixed a problem with certain logit values that were causing segfaults when using the Typical sampler. Please let me know if it happens again.
- Merged rocm support from @YellowRoseCx so you should now be able to build AMD compatible GPU builds with HIPBLAS, which should be faster than using CLBlast.
- Merged upstream support for GGUF Falcon models. Note that GPU layer offload for Falcon is unavailable with --useclblast, but works with CUDA. Older pre-gguf Falcon models are not supported.
- Added support for unbanning EOS tokens directly from the API, and by extension it can now be triggered from Lite UI settings. Note: your command line --unbantokens flag will force override this.
- Added support for automatic rope scale calculations based on a model's training context (n_ctx_train); this triggers if you do not explicitly specify a --ropeconfig. For example, this means llama2 models will (by default) use a smaller rope scale compared to llama1 models, for the same specified --contextsize. Setting --ropeconfig will override this. (Reverted in 1.42.1 for now, as it was not set up correctly.)
- Updated Kobold Lite, now with Tavern-style portraits in Aesthetic Instruct mode.
- Pulled other fixes and improvements from upstream.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag.
koboldcpp-1.41 (beta)
It's been a while since the last release, and quite a lot has changed upstream under the hood, so consider this release a beta.
- Added support for LLAMA GGUF models, handled automatically. All older models will still continue to work normally. Note that GGUF format support for other non-llama architectures has not been added yet.
- Added a --config flag to load a .kcpps settings file when launching from the command line (Credits: @poppeman). These files can also be imported/exported from the GUI.
- Added a new endpoint /api/extra/tokencount which can be used to tokenize and accurately measure how many tokens any string has (see the sketch after this list).
- Fix for bell characters occasionally causing the terminal to beep in debug mode.
- Fix for incorrect list of backends & missing backends displayed in the GUI.
- Set MMQ to be the default for CUDA when running from GUI.
- Updated Lite, and merged all the improvements and fixes from upstream.
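A minimal sketch of using the new tokencount endpoint; the request field (prompt) and response field (value) are assumptions, since the notes only name the endpoint:

```python
# Sketch: tokenize a string and count its tokens via /api/extra/tokencount.
# The "prompt" and "value" field names are assumed.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:5001/api/extra/tokencount",
    data=json.dumps({"prompt": "Niko the kobold stalked carefully down the alley."}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print("Token count:", json.loads(resp.read())["value"])
```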
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag.
koboldcpp-1.40.1
This release is mostly for bugfixes to the previous one, but enough small stuff has changed that I chose to make it a new version instead of a patch for the previous one.
- Fixed a regression in format detection for LLAMA 70B.
- Converted the embedded horde worker into daemon mode, which hopefully solves the occasional exceptions
- Fixed some OOMs for blasbatchsize 2048, adjusted buffer sizes
- Slight modification to the look ahead (2 to 5%) for the cuda pool malloc.
- Pulled some bugfixes from upstream
- Added a new field, idle, to the /api/extra/perf endpoint, which allows checking whether a generation is in progress without sending one (see the sketch below).
- Fixed cmake compilation for cudatoolkit 12.
- Updated Lite, includes option for aesthetic instruct UI (early beta by @Lyrcaxis, please send them your feedback)
hotfix 1.40.1:
- handle stablecode-completion-alpha-3b
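A minimal sketch of polling the perf endpoint for the new field; only idle is named in these notes, so the rest of the response is ignored here:

```python
# Sketch: check whether a generation is in progress via /api/extra/perf.
# Only the "idle" field is documented above; other fields are left untouched.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:5001/api/extra/perf") as resp:
    perf = json.loads(resp.read())

if perf.get("idle"):
    print("Server is idle; safe to send a new generation request.")
else:
    print("A generation is currently in progress.")
```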
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag.
koboldcpp-1.39.1
- Fix SSE streaming to handle headers correctly during abort (Credits: @duncannah)
- Bugfix for --blasbatchsize -1 and 1024 (fixes the alloc blocks error)
- Added experimental support for --blasbatchsize 2048 (note: buffers are doubled if that is selected, using much more memory)
- Added support for 12k and 16k --contextsize options. Please let me know if you encounter issues.
- Pulled upstream improvements, further CUDA speedups for MMQ mode for all quant types.
- Fix for some LLAMA 65B models being detected as LLAMA2 70B models.
- Revert to upstream approach for CUDA pool malloc (1.39.1 - done only for MMQ).
- Updated Lite, includes adding support for importing Tavern V2 card formats, with world info (character book) and clearer settings edit boxes.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag.
koboldcpp-1.38
- Added upstream support for Quantized MatMul (MMQ) prompt processing, a new option for CUDA (enabled by adding --usecublas mmq, or via the toggle in the GUI). This uses slightly less memory, and is slightly faster for Q4_0 but slower for K-quants.
- Fixed SSE streaming for multibyte characters (for Tavern compatibility)
- --noavx2 mode now does not use OpenBLAS (same as Failsafe), due to numerous compatibility complaints.
- GUI dropdown preset only displays built platforms (Credit: @YellowRoseCx)
- Added a Help button in the GUI
- Fixed an issue with mirostat not reading correct value from GUI
- Fixed an issue with context size slider being limited to 4096 in the GUI
- Displays a terminal warning if the received context exceeds the max context allocated at launch
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag.
koboldcpp-1.37.1
- NEW: KoboldCpp now comes with an embedded Horde Worker, which allows anyone to share their ggml models with the AI Horde without downloading additional dependencies. --hordeconfig now accepts 5 parameters: [hordemodelname] [hordegenlength] [hordemaxctx] [hordeapikey] [hordeworkername]; filling in all 5 will start a Horde worker for you that serves horde requests automatically in the background. For the previous behavior, exclude the last 2 parameters to continue using your own Horde worker (e.g. HaidraScribe/KAIHordeBridge). This feature can also be enabled via the GUI.
- Added support for LLAMA2 70B models. This should work automatically; GQA will be set to 8 if it's detected.
- Fixed a bug with mirostat v2 that was causing overly deterministic results. Please try it again. (Credit: @ycros)
- Added additional information to /api/extra/perf for the last generation, including the stopping reason as well as generated token counts.
- Exposed the --tensor_split parameter, which works exactly like it does upstream. Only for CUDA.
- Trying to support Kepler as a CUDA target as well on henky's suggestion; can't guarantee it will work as I don't have a K80, but it might.
- Retained support for --blasbatchsize 1024 after it was removed upstream. Scratch & KV buffer sizes will be larger when using this.
- Minor bugfixes, pulled other upstream fixes and optimizations, updated Kobold Lite (chat mode improvements)
Hotfix 1.37.1
- Fixed clblast to work correctly for LLAMA2 70B
- Fixed sending Client-Agent for embedded horde worker in addition to Bridge Agent and User Agent
- Changed rms_norm_eps to 5e-6 for better results for both llama1 and llama2
- Fixed some streaming bugs in Lite
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag.
koboldcpp-1.36
- Reverted an upstream change to sched_yield() that caused slowdowns for certain systems. This should fix speed regressions in 1.35. If you're still experiencing poorer speeds compared to earlier versions, please raise an issue with details.
- Reworked command line args on RoPE for extended context to be similar to upstream. Thus, --linearrope has been removed. Instead, you can now use --ropeconfig to customize both the RoPE frequency scale (Linear) and RoPE frequency base (NTK-Aware) values, e.g. --ropeconfig 0.5 10000 for a 2x linear scale. By default, long context NTK-Aware RoPE will be automatically configured based on your --contextsize parameter, similar to previously. If you're using LLAMA2 at 4K context, you'd probably want to use --ropeconfig 1.0 10000 to take advantage of the native 4K tuning without scaling (see the sketch after this list). For ease of use, this can be set in the GUI too.
- Exposed additional token counter information through the API at /api/extra/perf
- The warning for poor sampler orders has been limited to show only once per session, and excludes mirostat. I've heard some people have issues with it, so please let me know if it's still causing problems, though it's only a text warning and should not affect actual operation.
- Model busy flag replaced by Thread Lock, credits @ycros.
- Tweaked scratch and KV buffer allocation sizes for extended context.
- Updated Kobold Lite, with better whitespace trim support and a new toggle for partial chat responses.
- Pulled other upstream fixes and optimizations.
- Downgraded CUDA windows libraries to 11.4 for smaller exe filesizes, same version previously tried by @henk717. Please do report any issues or regressions encountered with this version.
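To make the linear-scale arithmetic above concrete, here is an illustrative sketch. It only covers the linear frequency-scale case implied by the 0.5 = 2x example; the NTK-aware frequency-base scaling applied automatically follows a different formula not described in these notes:

```python
# Illustrative only: the linear RoPE frequency scale implied by the example above,
# where --ropeconfig 0.5 10000 corresponds to a 2x linear context extension.
def linear_rope_freq_scale(trained_ctx: int, target_ctx: int) -> float:
    """Frequency scale for stretching a model from trained_ctx to target_ctx."""
    if target_ctx <= trained_ctx:
        return 1.0  # within the native window, e.g. LLAMA2 (4K) at 4K context
    return trained_ctx / target_ctx

print(linear_rope_freq_scale(4096, 8192))  # 0.5 -> "--ropeconfig 0.5 10000"
print(linear_rope_freq_scale(4096, 4096))  # 1.0 -> "--ropeconfig 1.0 10000"
```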
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full koboldai client).
For more information, be sure to run the program from the command line with the --help flag.