shubham0204/SmolChat-Android

SmolChat - On-Device Inference of SLMs in Android

Project Goals

  • Provide a usable interface for interacting with SLMs (small language models) running locally, on-device
  • Allow users to add/remove SLMs (GGUF models) and modify their system prompts or inference parameters (temperature, min-p)
  • Allow users to quickly create specific downstream tasks and use SLMs to generate responses for them
  • Simple, easy to understand, extensible codebase
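
As a rough illustration of the two inference parameters named above, the following C++ sketch shows what temperature and min-p conventionally do to a token distribution: temperature rescales the logits before the softmax, and min-p discards every token whose probability falls below `min_p` times the probability of the most likely token. The function name `apply_temperature_and_min_p` is hypothetical and is not part of the app or of llama.cpp; the order in which a real sampler chain applies these steps may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the app or llama.cpp): returns renormalized
// token probabilities after temperature scaling and min-p filtering.
std::vector<double> apply_temperature_and_min_p(const std::vector<double>& logits,
                                                double temperature,
                                                double min_p) {
    assert(!logits.empty() && temperature > 0.0);

    // Temperature-scaled softmax; subtracting the max logit keeps exp() stable.
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;

    // min-p: drop tokens whose probability is below min_p * (top probability).
    const double threshold = min_p * *std::max_element(probs.begin(), probs.end());
    double kept = 0.0;
    for (double& p : probs) {
        if (p < threshold) p = 0.0;
        kept += p;
    }

    // Renormalize the surviving tokens so they sum to 1 again.
    for (double& p : probs) p /= kept;
    return probs;
}
```

With logits `{2.0, 1.0, -3.0}`, a temperature of 1.0 and a min-p of 0.1, the third token falls below the threshold and is removed, while the first two are renormalized to sum to 1. Lower temperatures sharpen the distribution, which in turn makes the min-p cut-off remove more of the unlikely tokens.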

Setup

  1. Clone the repository along with its llama.cpp submodule:
git clone --depth=1 https://github.com/shubham0204/SmolChat-Android
cd SmolChat-Android
git submodule update --init --recursive

  2. Open the project in Android Studio, which starts building it automatically. If it does not, select Build > Rebuild Project to start a build.

  3. After a successful build, connect an Android device to your system. Once connected, the device name should appear in the top menu bar of Android Studio.

Working

  1. The application uses llama.cpp to load and execute GGUF models. As llama.cpp is written in pure C/C++, it is easy to compile on Android-based targets using the NDK.

  2. The smollm module contains llm_inference.cpp, whose LLMInference class interacts with llama.cpp's C-style API to execute the GGUF model, and a JNI binding, smollm.cpp. Both C++ source files are available in the repository. On the Kotlin side, the SmolLM class provides the methods required to call the JNI (C++ side) bindings.

  3. The app module contains the application logic and UI code. Whenever a new chat is opened, the app instantiates the SmolLM class and passes it the model file path, which is stored by the LLMModel entity in the ObjectBox database. Next, the app retrieves the chat's messages with role user and system from the database and adds them to the chat using LLMInference::add_chat_message.

  4. For tasks, the messages are not persisted; the app signals this to LLMInference by passing store_chats=false to LLMInference::load_model.
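
The flow in steps 3 and 4 can be sketched with a small stand-in class. This is only an illustration: `MockLLMInference`, `ChatMessage`, `get_response` and `history_size` are hypothetical names, the parameter lists are assumptions, and the real LLMInference class wraps llama.cpp rather than returning a placeholder string. How the real class treats messages when store_chats is false is also an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical message record; the app's real message type may differ.
struct ChatMessage {
    std::string role;     // "system", "user" or "assistant"
    std::string content;
};

// Hypothetical stand-in for LLMInference. Only the method names load_model and
// add_chat_message and the store_chats flag come from the text above.
class MockLLMInference {
public:
    void load_model(const std::string& model_path, bool store_chats) {
        assert(!model_path.empty());  // a real loader would open the GGUF file here
        model_path_ = model_path;
        store_chats_ = store_chats;
        messages_.clear();
    }

    // Appends a message to the in-memory conversation history.
    void add_chat_message(const std::string& content, const std::string& role) {
        messages_.push_back({role, content});
    }

    // Placeholder for generation: echoes the query instead of running a model.
    std::string get_response(const std::string& query) {
        const std::string response = "(response to: " + query + ")";
        if (store_chats_) {
            // Chat mode: keep the exchange so later turns can see it.
            add_chat_message(query, "user");
            add_chat_message(response, "assistant");
        }
        // Task mode (store_chats == false): nothing is retained (assumed behavior).
        return response;
    }

    std::size_t history_size() const { return messages_.size(); }

private:
    std::string model_path_;
    bool store_chats_ = true;
    std::vector<ChatMessage> messages_;
};
```

In this sketch, a chat would call `load_model(path, true)`, replay the stored system and user messages with `add_chat_message`, and then generate, so the history grows with each turn. A task would call `load_model(path, false)`, after which the history holds only what was added explicitly, such as the task's system prompt.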

Technologies

  • ggerganov/llama.cpp is a pure C/C++ framework to execute machine learning models on multiple execution backends. It provides a primitive C-style API to interact with LLMs converted to the GGUF format native to ggml/llama.cpp. The app uses JNI bindings to interact with a small class, smollm.cpp, which uses llama.cpp to load and execute GGUF models.

  • ObjectBox is an on-device, high-performance NoSQL database with bindings available in multiple languages. The app uses ObjectBox to store the model, chat and message metadata.

  • noties/Markwon is a markdown rendering library for Android. The app uses Markwon and Prism4j (for code syntax highlighting) to render Markdown responses from the SLMs.

Future

The following features/tasks are planned for future releases of the app:

  • Assign names to chats automatically (just like ChatGPT and Claude)
  • Add a search bar to the navigation drawer to search for messages within chats using ObjectBox's query capabilities
  • Add a background service which uses Bluetooth/HTTP/Wi-Fi to communicate with a desktop application, so that queries can be sent from the desktop to the mobile device for inference
  • Enable auto-scroll while a partial response is being generated in ChatActivity
  • Measure RAM consumption
  • Add app shortcuts for tasks
  • Integrate Android-Doc-QA for on-device RAG-based question answering from documents
  • Check if llama.cpp can be compiled to use Vulkan for inference on Android devices (and use the mobile GPU)
  • Check if multilingual GGUF models can be supported