llmedge

llmedge is a lightweight Android library for running GGUF language models and Stable Diffusion image generation fully on-device, powered by llama.cpp and stable-diffusion.cpp.

See the examples repository for sample usage.

Acknowledgments to Shubham Panchal and upstream projects are listed in CREDITS.md.

Note

This library is in early development and may change significantly.


Features

  • LLM Inference: Run GGUF models directly on Android using llama.cpp (JNI)
  • Model Downloads: Download and cache models from Hugging Face Hub
  • Optimized Inference: KV Cache reuse for multi-turn conversations
  • Speech-to-Text (STT): Whisper.cpp integration with timestamp support, language detection, streaming transcription, and SRT generation
  • Text-to-Speech (TTS): Bark.cpp integration with ARM optimizations
  • Image Generation: Stable Diffusion with EasyCache and LoRA support
  • Video Generation: Wan 2.1 models (4-64 frames) with sequential loading
  • On-device RAG: PDF indexing, embeddings, vector search, Q&A
  • OCR: Google ML Kit text extraction
  • Memory Metrics: Built-in RAM usage monitoring
  • Vision Models: Architecture prepared for LLaVA-style models (requires specific model formats)
  • Vulkan Acceleration: Optional GPU acceleration (Android 11+ with Vulkan 1.2)

Table of Contents

  1. Installation
  2. Usage
  3. Building
  4. Architecture
  5. Technologies
  6. Memory Metrics
  7. Notes
  8. Testing

Installation

Warning

For development, it is strongly recommended to work on Linux, since the Vulkan backend build for Stable Diffusion does not work on Windows.

Clone the repository along with the llama.cpp and stable-diffusion.cpp submodules:

git clone --depth=1 https://github.com/Aatricks/llmedge
cd llmedge
git submodule update --init --recursive

Open the project in Android Studio. If it does not build automatically, use Build > Rebuild Project.

Usage

Quick Start

The easiest way to use the library is via the LLMEdgeManager singleton, which handles model loading, caching, and threading automatically.

// Generate text using a default lightweight model (SmolLM-135M)
// The model is automatically downloaded if needed.
CoroutineScope(Dispatchers.IO).launch {
    val reply = LLMEdgeManager.generateText(
        context = context,
        params = LLMEdgeManager.TextGenerationParams(
            prompt = "Summarize on-device LLMs in one sentence."
        )
    )

    withContext(Dispatchers.Main) {
        outputView.text = reply
    }
}

For streaming responses or custom model paths, you can also access the underlying SmolLM instance directly or use LLMEdgeManager's streaming callbacks (see Usage).
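
As a rough illustration of the direct route, the sketch below streams tokens from a SmolLM instance. The Flow-returning method name getResponseAsFlow is an assumption (verify it against the actual SmolLM API); the loading call mirrors the loadFromHuggingFace example in the next subsection.

// Hedged sketch: stream tokens from SmolLM directly. getResponseAsFlow is an assumed
// method name; everything else mirrors calls shown elsewhere in this README.
CoroutineScope(Dispatchers.IO).launch {
    val smol = SmolLM()
    smol.loadFromHuggingFace(context = context, modelId = "unsloth/Qwen3-0.6B-GGUF")
    smol.getResponseAsFlow("Summarize on-device LLMs in one sentence.")  // assumed Flow<String> API
        .collect { token ->
            withContext(Dispatchers.Main) { outputView.append(token) }
        }
}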

Downloading Models

llmedge can download and cache GGUF model weights directly from Hugging Face:

val smol = SmolLM()

val download = smol.loadFromHuggingFace(
    context = context,
    modelId = "unsloth/Qwen3-0.6B-GGUF",
    filename = "Qwen3-0.6B-Q4_K_M.gguf", // optional
    forceDownload = false,
    preferSystemDownloader = true
)

Log.d("llmedge", "Loaded ${download.file.name} from ${download.file.parent}")

Key points:

  • loadFromHuggingFace downloads (if needed) and loads the model immediately after.

  • Supports onProgress callbacks and private repositories via token (see the sketch after this list).

  • Requests to old mirrors automatically resolve to up-to-date Hugging Face repos.

  • Automatically uses the model's declared context window (minimum 1K tokens) and caps it to a heap-aware limit (2K–8K). Override with InferenceParams(contextSize = …) if needed.

  • Large downloads use Android's DownloadManager when preferSystemDownloader = true to keep transfers out of the Dalvik heap.

  • Advanced users can call HuggingFaceHub.ensureModelOnDisk() to manage caching and quantization manually.
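
The progress and authentication hooks mentioned above could look roughly like the sketch below. Only the parameter names onProgress and token come from this README; the lambda shape and the token source are assumptions, so treat this as illustrative rather than the exact signature.

// Hedged sketch of the download hooks listed above. The onProgress lambda arguments and the
// hfToken value are assumptions; verify the real signature before relying on it.
val download = smol.loadFromHuggingFace(
    context = context,
    modelId = "unsloth/Qwen3-0.6B-GGUF",
    token = hfToken,                                // hypothetical Hugging Face access token
    onProgress = { downloadedBytes, totalBytes ->   // hypothetical callback shape
        Log.d("llmedge", "Downloaded $downloadedBytes / $totalBytes bytes")
    }
)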

Reasoning Controls

SmolLM lets you disable or re-enable "thinking" traces produced by reasoning-aware models through the ThinkingMode enum and the optional reasoningBudget parameter. The default configuration keeps thinking enabled (ThinkingMode.DEFAULT, reasoning budget -1). To start a session with thinking disabled (equivalent to passing --no-think or --reasoning-budget 0), specify it when loading the model:

val smol = SmolLM()

val params = SmolLM.InferenceParams(
    thinkingMode = SmolLM.ThinkingMode.DISABLED,
    reasoningBudget = 0, // explicit override, optional when the mode is DISABLED
)
smol.load(modelPath, params)

At runtime you can flip the behaviour without reloading the model:

smol.setThinkingEnabled(true)              // restore the default
smol.setReasoningBudget(0)                 // force-disable thoughts again
val budget = smol.getReasoningBudget()     // inspect the current budget
val mode = smol.getThinkingMode()          // inspect the current mode

Setting the budget to 0 always disables thinking, while -1 leaves it unrestricted. If you omit reasoningBudget, the library chooses 0 when the mode is DISABLED and -1 otherwise. The API also injects the /no_think tag automatically when thinking is disabled, so you do not need to modify prompts manually.

Image Text Extraction (OCR)

llmedge uses Google ML Kit Text Recognition for extracting text from images.

Quick Start

import io.aatricks.llmedge.LLMEdgeManager

// Process an image (Bitmap)
val text = LLMEdgeManager.extractText(context, bitmap)
println("Extracted text: $text")

OCR Engines

Google ML Kit Text Recognition

  • Fast and lightweight
  • No additional data files needed
  • Good for Latin scripts
  • Add dependency: implementation("com.google.mlkit:text-recognition:16.0.0")

Processing Modes

enum class VisionMode {
    AUTO_PREFER_VISION,  // Try vision model first, fall back to OCR
    AUTO_PREFER_OCR,     // Try OCR first (ML Kit)
    FORCE_VISION,        // Vision model only (error if unavailable)
    FORCE_MLKIT          // ML Kit OCR only
}

Vision Models

Analyze images using Vision Language Models (like LLaVA or Phi-3 Vision) via LLMEdgeManager.

val description = LLMEdgeManager.analyzeImage(
    context = context,
    params = LLMEdgeManager.VisionAnalysisParams(
        image = bitmap,
        prompt = "Describe this image in detail."
    )
) { status ->
    Log.d("Vision", "Status: $status")
}

The manager handles the complex pipeline of:

  1. Preprocessing the image
  2. Loading the vision projector and model
  3. Encoding the image to embeddings
  4. Generating the textual response

Vision model support is currently experimental and requires specific model architectures (like LLaVA-Phi-3).

Speech-to-Text (Whisper)

Transcribe audio using the high-level LLMEdgeManager API:

import io.aatricks.llmedge.LLMEdgeManager

// Simple transcription
val text = LLMEdgeManager.transcribeAudioToText(
    context = context,
    audioSamples = audioSamples  // 16kHz mono PCM float32
)

// Full transcription with timing
val segments = LLMEdgeManager.transcribeAudio(
    context = context,
    params = LLMEdgeManager.TranscriptionParams(
        audioSamples = audioSamples,
        language = "en"  // null for auto-detect
    )
) { progress ->
    Log.d("Whisper", "Progress: $progress%")
}
segments.forEach { segment ->
    println("[${segment.startTimeMs}ms] ${segment.text}")
}

// Language detection
val lang = LLMEdgeManager.detectLanguage(context, audioSamples)

// Generate SRT subtitles
val srt = LLMEdgeManager.transcribeToSrt(context, audioSamples)

Real-time Streaming Transcription

For live captioning, use the streaming transcription API with a sliding window approach:

import io.aatricks.llmedge.LLMEdgeManager
import kotlinx.coroutines.launch

// Create a streaming transcriber
val transcriber = LLMEdgeManager.createStreamingTranscriber(
    context = context,
    params = LLMEdgeManager.StreamingTranscriptionParams(
        stepMs = 3000,      // Process every 3 seconds
        lengthMs = 10000,   // Use 10-second windows
        keepMs = 200,       // Keep 200ms overlap for context
        language = "en",    // null for auto-detect
        useVad = true       // Skip silent segments
    )
)

// Start collecting transcription results
launch {
    transcriber.start().collect { segment ->
        // Update UI with real-time captions
        updateCaptions(segment.text)
    }
}

// Feed audio samples from microphone as they become available
audioRecorder.onAudioChunk { samples ->
    transcriber.feedAudio(samples)  // 16kHz mono PCM float32
}

// When done recording
transcriber.stop()
LLMEdgeManager.stopStreamingTranscription()

Streaming parameters:

  • stepMs: How often transcription runs (default: 3000ms). Lower = faster updates, higher CPU usage.
  • lengthMs: Audio window size (default: 10000ms). Longer windows improve accuracy.
  • keepMs: Overlap with previous window (default: 200ms). Helps maintain context.
  • useVad: Voice Activity Detection - skips silent audio (default: true).
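
For instance, a lower-latency configuration simply tightens the values described above (the numbers here are illustrative, not recommended defaults):

// Lower-latency variant of the documented parameters: faster caption updates, more CPU use.
val lowLatencyParams = LLMEdgeManager.StreamingTranscriptionParams(
    stepMs = 1500,     // run transcription every 1.5 seconds
    lengthMs = 8000,   // slightly shorter audio window
    keepMs = 200,      // keep the default overlap for context
    useVad = true      // still skip silent segments
)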

For low-level control, see Whisper Low-Level API.

Recommended models:

  • ggml-tiny.bin (~75MB) - Fast, lower accuracy
  • ggml-base.bin (~142MB) - Good balance
  • ggml-small.bin (~466MB) - Higher accuracy

Text-to-Speech (Bark)

Generate speech using the high-level LLMEdgeManager API:

import io.aatricks.llmedge.LLMEdgeManager

// Generate speech
val audio = LLMEdgeManager.synthesizeSpeech(
    context = context,
    params = LLMEdgeManager.SpeechSynthesisParams(
        text = "Hello, world!"
    )
) { step, progress ->
    Log.d("Bark", "${step.name}: $progress%")
}

// Save directly to file
LLMEdgeManager.synthesizeSpeechToFile(
    context = context,
    text = "Hello, world!",
    outputFile = File(cacheDir, "output.wav")
)

For low-level control, see Bark Low-Level API.

Stable Diffusion (image generation)

Generate images on-device using LLMEdgeManager. This automatically handles model downloading (MeinaMix), caching, and memory safety.

val bitmap = LLMEdgeManager.generateImage(
    context = context,
    params = LLMEdgeManager.ImageGenerationParams(
        prompt = "a cute pastel anime cat, soft colors, high quality <lora:detail_tweaker:1.0>",
        width = 512,
        height = 512,
        steps = 20,
        // Optional: Apply a LoRA model from a directory
        loraModelDir = "/path/to/loras",
        loraApplyMode = StableDiffusion.LoraApplyMode.AUTO
    )
)
imageView.setImageBitmap(bitmap)

Key Optimizations:

  • EasyCache: Automatically detected and enabled for supported models (like Flux/Wan), speeding up generation by reusing intermediate diffusion states.
  • Flash Attention: Automatically enabled for compatible image dimensions.
  • LoRA: Apply fine-tuned weights on the fly without merging models.

For advanced usage (custom models, explicit memory control), you can still use the StableDiffusion class directly as shown in the llmedge-examples repository.

Video Generation

Generate short video clips using Wan models via LLMEdgeManager. This handles the complex requirement of loading three separate models (Diffusion, VAE, T5 Encoder) and offers sequential loading for devices with limited RAM.

Hardware Requirements:

  • 12GB+ RAM recommended for standard loading.
  • 8GB+ RAM supported via forceSequentialLoad = true (slower but memory-safe).

// 1. Define parameters
val params = LLMEdgeManager.VideoGenerationParams(
    prompt = "a cat walking in a garden, high quality",
    videoFrames = 8,  // Start small: 8-16 frames
    width = 512,
    height = 512,
    steps = 20,
    cfgScale = 7.0f,
    flowShift = 3.0f,
    forceSequentialLoad = true // Recommended for mobile
)

// 2. Generate video
val frames = LLMEdgeManager.generateVideo(
    context = context,
    params = params
) { status, current, total ->
    // Update progress: "Generating frame 1/8"
    Log.d("VideoGen", "$status")
}

// 3. Use the frames (List<Bitmap>)
previewImageView.setImageBitmap(frames.first())

LLMEdgeManager automatically:

  1. Downloads the necessary Wan 2.1 model files (Diffusion, VAE, T5).
  2. Sequentially loads components to minimize peak memory usage (if requested).
  3. Manages the generation loop and frame conversion.

See llmedge-examples for a complete UI implementation.

Running the example app:

  1. Build the library AAR and copy it into the example app (from the repo root):
./gradlew :llmedge:assembleRelease
cp llmedge/build/outputs/aar/llmedge-release.aar llmedge-examples/app/libs/llmedge-release.aar
  2. Build and install the example app:
cd llmedge-examples
../gradlew :app:assembleDebug
../gradlew :app:installDebug
  3. Open the app on device and pick the "Stable Diffusion" demo from the launcher. The demo downloads any missing files from Hugging Face and runs a quick txt2img generation.

Notes:

  • The example explicitly downloads a VAE safetensors file for the Meina/MeinaMix demo; many repos provide the VAE as a separate file, while some GGUF model repos bundle everything you need. If the repo lacks a GGUF model file, you'll get an IllegalArgumentException; provide a filename or choose a different repo in that case.
  • Use the system downloader for large safetensors/gguf files to avoid heap pressure on Android.

On-device RAG

The library includes a minimal on-device RAG pipeline, similar to Android-Doc-QA, built with:

  • Sentence embeddings (ONNX)
  • Whitespace TextSplitter
  • In-memory cosine VectorStore with JSON persistence
  • SmolLM for context-aware responses

Setup

  1. Download embeddings

    From the Hugging Face repository sentence-transformers/all-MiniLM-L6-v2, place:

llmedge/src/main/assets/embeddings/all-minilm-l6-v2/model.onnx
llmedge/src/main/assets/embeddings/all-minilm-l6-v2/tokenizer.json
  2. Build the library
./gradlew :llmedge:assembleRelease
  3. Use in your application
    val smol = SmolLM()
    val rag = RAGEngine(context = this, smolLM = smol)

    CoroutineScope(Dispatchers.IO).launch {
        rag.init()
        val count = rag.indexPdf(pdfUri)
        val answer = rag.ask("What are the key points?")
        withContext(Dispatchers.Main) {
            // render answer
        }
    }

Building

Building with Vulkan enabled

If you want to enable Vulkan acceleration for the native inference backend, follow these additional notes and requirements. The project supports building the Vulkan backend for Android, but the runtime device must also support Vulkan 1.2.

Prerequisites

  • Android NDK r27 or newer (r27 was used in development; the NDK provides the Vulkan C headers). Make sure the NDK version configured in Gradle matches the one installed on your machine.
  • CMake 3.22+ and Ninja (the Android Gradle plugin will pick up CMake when configured).
  • Gradle (use the wrapper: ./gradlew).
  • Android API (minSdk) 30 or higher when enabling the Vulkan backend; ggml-vulkan requires Vulkan 1.2, which most Android 11+ devices support.
  • (Optional) VULKAN_SDK set in environment if you build shaders or use Vulkan SDK tools on the host. The build will fetch a matching vulkan.hpp header if needed.

Build flags

  • Enable Vulkan at CMake configure time using Gradle external native build arguments. For example (bash/fish):
./gradlew :llmedge:assembleRelease -Pandroid.injected.build.api=30 -Pandroid.jniCmakeArgs="-DSD_VULKAN=ON -DGGML_VULKAN=ON"

Alternatively, set these flags in llmedge/src/main/cpp/CMakeLists.txt or in your Android Studio CMake configuration. The important flags are -DSD_VULKAN=ON and -DGGML_VULKAN=ON so ggml's Vulkan backend and the Stable Diffusion integration compile with Vulkan support.
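
If you prefer keeping the flags in the module build file rather than on the command line, the standard Android Gradle Kotlin DSL looks roughly like this (a generic externalNativeBuild sketch, not a verbatim copy of llmedge/build.gradle.kts):

// Sketch: passing the Vulkan flags through the module's Gradle Kotlin DSL instead of -P arguments.
android {
    defaultConfig {
        externalNativeBuild {
            cmake {
                arguments += listOf("-DSD_VULKAN=ON", "-DGGML_VULKAN=ON")
            }
        }
    }
}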

Notes about headers and toolchain

  • The build fetches Vulkan-Hpp (vulkan.hpp) and pins it to the NDK's Vulkan headers to avoid API mismatch. If you have a local VULKAN_SDK you can point to it, otherwise the project will use the fetched headers.
  • The repository also builds a small host toolchain to generate SPIR-V shaders at build time; ensure your build host has a working C++ toolchain (clang/gcc) and CMake configured.

Runtime verification

  • The AAR will include native libraries that link against libvulkan.so when Vulkan is enabled. To verify Vulkan is actually being used at runtime:
    • Run the app on a device with Android 11+ and a Vulkan 1.2-capable GPU.
    • Use the Kotlin API SmolLM.isVulkanEnabled() to check whether the library thinks Vulkan is available (a minimal probe is sketched after these steps).
    • Inspect runtime logs: filter logcat for the SmolSD tag (the native logger) and look for backend initialization messages. Example:
adb logcat -s SmolSD:* | sed -n '1,200p'
Look for messages indicating successful Vulkan initialization. Note: some builds logged "Using Vulkan backend" before initialization completed — make sure you see no subsequent "Failed to initialize Vulkan backend" or "Using CPU backend" messages.
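
To complement the log check, a minimal in-app probe using the isVulkanEnabled() call mentioned above could look like this:

// Minimal runtime probe: report whether the library believes the Vulkan backend is available.
// Assumes SmolLM.isVulkanEnabled() is callable as referenced earlier in this section.
val vulkanActive = SmolLM.isVulkanEnabled()
Log.i("llmedge", if (vulkanActive) "Vulkan backend reported available" else "Running on the CPU backend")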

Troubleshooting

  • If you see "Vulkan 1.2 required" or linker errors for Vulkan symbols, confirm minSdk is set to 30 or higher in llmedge/build.gradle.kts and that your NDK provides the expected Vulkan headers.
  • If your device lacks Vulkan 1.2 support, the native code will fall back to the CPU backend. Use a modern device (Android 11+) or an emulator/image with Vulkan 1.2 support.

Notes:

  • Uses com.tom-roush:pdfbox-android for PDF parsing.
  • Embeddings library: io.gitlab.shubham0204:sentence-embeddings:v6.
  • Scanned PDFs require OCR (e.g., ML Kit or Tesseract) before indexing.
  • ONNX token_type_ids errors are automatically handled; override via EmbeddingConfig if required.

Architecture

  1. llama.cpp (C/C++) provides the core inference engine, built via the Android NDK.
  2. Stable Diffusion (C/C++) provides the image generation backend, built via the Android NDK.
  3. whisper.cpp (C/C++) provides speech-to-text transcription, built via the Android NDK.
  4. bark.cpp (C/C++) provides text-to-speech synthesis, built via the Android NDK.
  5. LLMInference.cpp wraps the llama.cpp C API.
  6. smollm.cpp, whisper_jni.cpp, bark_jni.cpp expose JNI bindings for Kotlin.
  7. The SmolLM, Whisper, and BarkTTS Kotlin classes provide high-level APIs.

Technologies

  • llama.cpp — Core LLM backend
  • stable-diffusion.cpp — Image/video generation backend
  • whisper.cpp — Speech-to-text backend
  • bark.cpp — Text-to-speech backend
  • GGUF / GGML — Model formats
  • Android NDK / JNI — Native bindings
  • ONNX Runtime — Sentence embeddings
  • Android DownloadManager — Large file downloads

Memory Metrics

You can measure RAM usage at runtime:

val snapshot = MemoryMetrics.snapshot(context)
Log.d("Memory", snapshot.toPretty(context))

Typical measurement points:

  • Before model load
  • After model load
  • After blocking prompt
  • After streaming prompt

Key fields:

  • totalPssKb: Total proportional RAM usage. Best for overall tracking.
  • dalvikPssKb: JVM-managed heap and runtime.
  • nativePssKb: Native heap (llama.cpp, ONNX, tensors, KV cache).
  • otherPssKb: Miscellaneous memory.

Monitor nativePssKb closely during model loading and inference to understand LLM memory footprint.
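
For example, a rough way to attribute a model's footprint is to diff nativePssKb around the load call (field names as listed above; the load call mirrors the earlier SmolLM examples):

// Sketch: measure how much native memory a model load adds, using the fields described above.
val before = MemoryMetrics.snapshot(context)
smol.load(modelPath, params)   // same load call as in the Reasoning Controls example
val after = MemoryMetrics.snapshot(context)
Log.d("Memory", "Native PSS grew by ${after.nativePssKb - before.nativePssKb} KB during model load")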

Notes

  • Vulkan SDK may be required; set the VULKAN_SDK environment variable when building with Vulkan.
  • Vulkan acceleration can be checked via SmolLM.isVulkanEnabled().

ProGuard/R8 Configuration

The library includes consumer ProGuard rules. If you need to add custom rules:

# Keep OCR engines
-keep class io.aatricks.llmedge.vision.** { *; }
-keep class org.bytedeco.** { *; }
-keep class com.google.mlkit.** { *; }

# Suppress warnings for optional dependencies
-dontwarn org.bytedeco.**
-dontwarn com.google.mlkit.**

Licenses

  • llmedge: Apache 2.0
  • llama.cpp: MIT
  • stable-diffusion.cpp: MIT
  • whisper.cpp: MIT
  • bark.cpp: MIT
  • Leptonica: Custom (BSD-like)
  • Google ML Kit: Proprietary (see ML Kit terms)
  • JavaCPP: Apache 2.0

License and Credits

This project builds upon work by Shubham Panchal, ggerganov, and PABannier. See CREDITS.md for full details.

Testing

Looking to run unit and instrumentation tests locally, including optional native txt2img E2E checks? See the step-by-step guide in docs/testing.md.