A minimal, cross-platform LLM chat app built with BELLE, using quantized on-device offline models and a Flutter UI. It runs on macOS (done), Windows, Android, iOS (see Known Issues), and more.
Please refer to Releases.
Downloads and usage instructions for different platforms: see Usage.
Only macOS is supported for now. More platforms coming soon!
You can download the quantized model from the Hugging Face repo ChatBELLE-int4.
Run the ChatBELLE app once first; it will create the folder `~/Library/Containers/com.barius.chatbelle`. Then rename the downloaded model and move it to the path displayed in the app. The default is `~/Library/Containers/com.barius.chatbelle/Data/belle-model.bin`.
The app uses llama.cpp's 4-bit quantization to speed up on-device inference and reduce RAM usage. Quantization causes accuracy loss and degrades model quality: 4-bit quantization trades accuracy for model size, and our current 4-bit model shows a significant quality gap compared with the fp32 or fp16 models; it is intended for users to try out. With better algorithms being developed and more powerful chips arriving on mobile devices, we believe on-device model performance will improve, and we will keep a close watch on this.
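As a rough illustration of the trade-off described above, here is a toy sketch of symmetric 4-bit block quantization in Python (NumPy). It is not llama.cpp's actual q4_0 format or code, just a hedged example of how storing 4-bit integers plus one scale per block shrinks storage roughly 8x while introducing rounding error:

```python
# Toy sketch of symmetric 4-bit block quantization (illustrative only; this is
# NOT llama.cpp's exact q4_0 layout). Each block of 32 weights shares one fp32
# scale, and each weight is rounded to a 4-bit integer in [-8, 7].
import numpy as np

def quantize_q4(weights: np.ndarray, block_size: int = 32):
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # per-block scale
    scales[scales == 0] = 1.0                                  # avoid division by zero
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4(w)
w_hat = dequantize(q, s)

fp32_bytes = w.nbytes                      # 4 bytes per weight
q4_bytes = w.size // 2 + s.size * 4        # 4 bits per weight + one fp32 scale per block
print(f"{fp32_bytes} bytes -> ~{q4_bytes} bytes, "
      f"mean abs rounding error: {np.abs(w - w_hat).mean():.4f}")
```

In a full model this per-weight rounding error accumulates across billions of weights, which is one source of the quality gap mentioned above.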
GPTQ employs one-shot quantization to achieve lower accuracy loss or a higher compression rate. We will keep track of this line of work.
- More devices
- Multi-round chat
- Model selection
- Chat history
- Chat list
We recommend an Apple Silicon (M1/M2) Mac with 16GB of RAM for the best experience. If inference is slow, try closing other apps to free up memory. Inference with 8GB of RAM will be very slow. Intel Macs may also work (not tested) but could be very slow.
- Download `chatbelle.dmg` from the Releases page, double-click to open it, then drag the `Chat Belle` app into the `Applications` folder.
- Open the `Chat Belle` app in the `Applications` folder by right-clicking (or Ctrl-clicking) it and choosing `Open`, then click `Open` in the dialog.
- The app will show the intended model file path and fail to load the model. Close the app.
- Download the quantized model from ChatBELLE-int4.
- Move and rename the model to the path shown by the app. The default is `~/Library/Containers/com.barius.chatbelle/Data/belle-model.bin` (a scripted example follows this list).
- Reopen the app (double-clicking now works).
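If you prefer to do the move-and-rename step from a script, here is a minimal Python sketch using only the standard library. The source filename is an assumption; adjust it to whatever your download is actually named, and prefer the path the app itself displays if it differs from the default:

```python
# Minimal sketch: move the downloaded model into the path ChatBELLE expects.
from pathlib import Path
import shutil

SOURCE = Path.home() / "Downloads" / "belle-model-q4.bin"  # assumed download name; adjust
TARGET = Path.home() / "Library/Containers/com.barius.chatbelle/Data/belle-model.bin"

TARGET.parent.mkdir(parents=True, exist_ok=True)  # normally created by the app's first run
shutil.move(str(SOURCE), TARGET)
print(f"Model installed at {TARGET}")
```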
- Stay tuned
- Stay tuned
- Stay tuned
- On macOS devices with 8GB of RAM, inference is very slow due to constant swapping. Devices with 16GB of RAM may see the same slowdown if other applications are occupying a lot of memory.
- Inference on Macs with Intel chips is slow.
- The 3GB per-app RAM limit on iOS devices prevents even the smallest model (~4.3GB) from loading. Reference
This program is for learning and research purposes only. The developers take no responsibility for any damage caused by using or distributing this program.
- LLaMA model inference code uses llama.cpp
- Flutter chat UI uses flyer.chat