EdgeMind: Edge MLOps Experiments on Android

experimental
Android Kotlin ONNX Runtime Machine Learning Edge AI NNAPI

EdgeMind: Bringing LLMs to the Edge

EdgeMind is an experimental on-device LLM inference system for Android. The project explores how large language models can run on mobile devices with careful optimization, reaching 4 tokens/second on consumer hardware with complete privacy: no cloud, no data transmission, 100% local. The project is currently in the research and experimentation phase.

Project Evolution

Three Major Pivots

Initial Vision (Android Auto Voice Assistant)

  • Goal: Privacy-first voice assistant for cars
  • Stack: Whisper (STT) + Llama 3.2 (LLM) + Pico TTS
  • Reality: Too ambitious for first iteration
  • Learning: Start with core technology, expand later

First Pivot (Local AI Assistant)

  • Focus: Nail on-device LLM inference first
  • Scope: Single model, production quality
  • Decision: Android Auto features become future work
  • Impact: Allowed deep focus on hard problems

Second Pivot (Two Branches)

  • Branch 1: Document parsing with Granite Docling (258M params)
  • Branch 2: Conversational AI with Phi-3 mini (3.8B params)
  • Rationale: Different use cases, different optimizations
  • Status: Conversational branch production-ready

Technical Architecture

Core Components

Model Layer

  • Phi-3 mini INT4 (3.8B parameters quantized to 2.6GB)
  • ONNX Runtime 1.19.2 with IR version 7
  • NNAPI hardware acceleration (NPU/TPU/DSP)
  • Memory-mapped model loading
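
A minimal sketch of how this layer can be wired together with the ONNX Runtime Android API; the helper name and error handling here are illustrative, not EdgeMind's actual code:

```kotlin
import ai.onnxruntime.OrtEnvironment
import ai.onnxruntime.OrtSession

// Hypothetical helper; the real app resolves the model path from app-private storage.
fun createSession(modelPath: String): OrtSession {
    val env = OrtEnvironment.getEnvironment()
    val options = OrtSession.SessionOptions()
    try {
        // Ask for NNAPI so inference can run on the NPU/DSP/GPU where available.
        options.addNnapi()
    } catch (e: Exception) {
        // Graceful fallback: without NNAPI the session still runs on the CPU.
    }
    // Loading from a file path keeps the 2.6GB INT4 model off the Java heap;
    // the native runtime manages the weights itself.
    return env.createSession(modelPath, options)
}
```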

Tokenization

  • Custom SentencePiece BPE implementation
  • 32,064 vocabulary tokens
  • 61,249 merge rules
  • Handles byte tokens, special tokens, space markers
  • Zero external dependencies (350 lines of Kotlin)
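
A rough sketch of what parsing that configuration involves, using Android's bundled org.json; the field names follow the Hugging Face tokenizer.json layout, and the code is illustrative rather than the project's actual parser:

```kotlin
import org.json.JSONObject

// Pulls the vocabulary and ranked merge rules out of tokenizer.json.
fun loadTokenizerConfig(json: String): Pair<Map<String, Int>, Map<Pair<String, String>, Int>> {
    val model = JSONObject(json).getJSONObject("model")

    // "vocab" maps each piece (including byte tokens such as <0x0A>) to its id.
    val vocabJson = model.getJSONObject("vocab")
    val vocab = vocabJson.keys().asSequence().associateWith { vocabJson.getInt(it) }

    // "merges" lists pairs such as "▁t he"; their position is the merge priority.
    val mergesJson = model.getJSONArray("merges")
    val merges = (0 until mergesJson.length()).associate { i ->
        val (left, right) = mergesJson.getString(i).split(" ", limit = 2)
        (left to right) to i
    }
    return vocab to merges
}
```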

Inference Pipeline

  • KV cache implementation (32 layers × 2 tensors)
  • Streaming generation via Kotlin Flow
  • Smart stopping criteria (n-gram detection)
  • Proper memory management (no leaks)
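
A sketch of the streaming loop as a cold Flow; decodeStep, detokenize, and shouldStop are placeholders for the project's real inference, tokenizer, and heuristic code:

```kotlin
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow

fun generateTokens(
    promptIds: List<Int>,
    eosId: Int,
    maxTokens: Int = 400,
    decodeStep: (List<Int>) -> Int,      // one forward pass, reusing the KV cache
    detokenize: (Int) -> String,
    shouldStop: (List<Int>) -> Boolean   // n-gram repetition and other heuristics
): Flow<String> = flow {
    val tokens = promptIds.toMutableList()
    for (step in 0 until maxTokens) {
        val nextId = decodeStep(tokens)
        if (nextId == eosId) break       // natural end of sequence
        tokens += nextId
        emit(detokenize(nextId))         // the UI collects this Flow and appends each piece
        if (shouldStop(tokens)) break    // repetition or safety limit reached
    }
}
```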

User Interface

  • Jetpack Compose Material 3
  • Real-time token streaming
  • Manual stop button
  • Clean Architecture (MVVM + Domain/Data)
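
A small Compose sketch of that streaming UI; streamedText and onStop are illustrative names rather than the project's real interfaces:

```kotlin
import androidx.compose.foundation.layout.Column
import androidx.compose.material3.Button
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import androidx.compose.runtime.collectAsState
import androidx.compose.runtime.getValue
import kotlinx.coroutines.flow.StateFlow

@Composable
fun ChatScreen(streamedText: StateFlow<String>, onStop: () -> Unit) {
    val text by streamedText.collectAsState()
    Column {
        Text(text)                   // grows in place as tokens arrive
        Button(onClick = onStop) {   // manual stop for runaway generations
            Text("Stop")
        }
    }
}
```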

Key Technical Achievements

1. Custom Tokenizer from Scratch

Built a complete SentencePiece BPE tokenizer by parsing tokenizer.json:

  • Avoided 50MB+ external dependency
  • Full control over encoding/decoding behavior
  • Deep understanding of tokenization process
  • Production-ready implementation in 350 lines
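
A compressed sketch of the core merge loop, assuming the vocabulary and ranked merge rules have already been parsed from tokenizer.json; space markers, byte fallback, and special tokens are omitted and the names are illustrative:

```kotlin
fun bpeEncode(
    word: String,
    ranks: Map<Pair<String, String>, Int>,   // merge pair -> priority (lower merges first)
    vocab: Map<String, Int>
): List<Int> {
    var parts = word.map { it.toString() }
    while (parts.size > 1) {
        // Find the adjacent pair with the best (lowest) merge rank.
        val best = parts.zipWithNext()
            .filter { it in ranks }
            .minByOrNull { ranks.getValue(it) } ?: break
        val merged = mutableListOf<String>()
        var i = 0
        while (i < parts.size) {
            if (i < parts.size - 1 && parts[i] == best.first && parts[i + 1] == best.second) {
                merged += parts[i] + parts[i + 1]   // apply the merge
                i += 2
            } else {
                merged += parts[i]
                i += 1
            }
        }
        parts = merged
    }
    // The real tokenizer falls back to byte tokens for unknown pieces; this sketch skips them.
    return parts.mapNotNull { vocab[it] }
}
```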

2. KV Cache Optimization

Implemented transformer key-value caching for a 6x speedup:

  • Before: 800ms per token, degrading past 30 seconds as the context grew (quadratic slowdown)
  • After: 120-250ms per token (consistent performance)
  • Complexity per generated token: O(n²) → O(n)
  • Critical for mobile viability
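
A conceptual sketch of the cached decoding loop; runModel stands in for the actual ONNX Runtime call, and the tensor types are simplified to plain arrays:

```kotlin
data class StepOutput(val logits: FloatArray, val presentKv: List<FloatArray>)

fun decodeWithCache(
    promptIds: List<Int>,
    maxNewTokens: Int,
    runModel: (inputIds: List<Int>, pastKv: List<FloatArray>?) -> StepOutput
): List<Int> {
    val generated = mutableListOf<Int>()
    // First pass: the whole prompt goes in, there is no cache yet.
    var out = runModel(promptIds, null)
    repeat(maxNewTokens) {
        val nextId = out.logits.indices.maxBy { out.logits[it] }   // greedy pick
        generated += nextId
        // Later passes feed only the newest token plus the cached key/value tensors
        // (32 layers x 2 each), so per-token cost stays roughly constant instead of
        // re-encoding the entire sequence.
        out = runModel(listOf(nextId), out.presentKv)
    }
    return generated
}
```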

3. Memory Leak Resolution

Debugged and fixed critical OOM issues:

  • Symptom: App crashes after ~200 tokens
  • Root cause: 12,800+ leaked tensors
  • Solution: Proper cache cleanup lifecycle
  • Result: Stable memory, no crashes
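
The fix boils down to treating every cached tensor as a native resource with an explicit owner. A sketch of that lifecycle (illustrative, not the project's exact class):

```kotlin
import ai.onnxruntime.OnnxTensor

// Holds the tensors kept between decoding steps and closes them when replaced,
// so native memory outside the Java heap cannot accumulate.
class TensorCache {
    private val held = mutableListOf<OnnxTensor>()

    fun replaceAll(newTensors: List<OnnxTensor>) {
        held.forEach { it.close() }   // release the previous step's native buffers
        held.clear()
        held += newTensors
    }

    fun release() {                   // called when generation ends or is cancelled
        held.forEach { it.close() }
        held.clear()
    }
}
```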

4. Smart Stopping Criteria

A multi-layered approach to ending generation naturally:

  • EOS token detection
  • Consecutive repetition (10+ same tokens)
  • N-gram pattern recognition (4-grams, 5-grams)
  • 400 token safety limit
  • Manual user control via UI button
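
A sketch of how those checks can combine into a single stopping decision; the thresholds mirror the list above and the function names are illustrative:

```kotlin
// True when the most recent n-gram already appeared earlier in the output,
// which is treated as the model starting to loop.
fun repeatsNgram(tokens: List<Int>, n: Int): Boolean {
    if (tokens.size < 2 * n) return false
    val lastNgram = tokens.takeLast(n)
    for (start in 0..tokens.size - 2 * n) {
        if (tokens.subList(start, start + n) == lastNgram) return true
    }
    return false
}

fun shouldStop(tokens: List<Int>, eosId: Int): Boolean =
    tokens.lastOrNull() == eosId ||                                      // EOS token
    tokens.size >= 400 ||                                                // safety limit
    (tokens.size >= 10 && tokens.takeLast(10).distinct().size == 1) ||   // stuck on one token
    repeatsNgram(tokens, 4) || repeatsNgram(tokens, 5)                   // n-gram loops
```

The manual stop button can then sit outside these heuristics entirely, for example by cancelling the coroutine that collects the token Flow.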

5. Hardware Acceleration

NNAPI integration for NPU utilization:

  • 4x speedup from hardware acceleration
  • Combined with KV cache: 24x total speedup
  • Works across different device chipsets
  • Graceful CPU fallback

Production Metrics

Performance

Test: "Tell me a short story"
Generated: 311 tokens in 78 seconds
Throughput: 4 tokens/second
Per-token: ~250ms average
Memory: Stable at ~500MB
Crashes: None
Quality: Coherent narrative with plot and characters

Quality Example

Input: "Tell me a short story"

Output: "Certainly! Here's a short story for you:

Once upon a time, in a peaceful village nestled between lush green
hills, there lived a young girl named Lily. She had a kind heart and
a curious mind. One day, while exploring the nearby forest, she
discovered a mysterious, ancient artifact - a beautifully crafted
golden amulet with intricate engravings...

[Continues with coherent plot involving village elders, a historian
named Elara, and a wise decision about the amulet's use - 311 tokens
total, stopped naturally via n-gram repetition detection]"

Development Process

From Scratch Philosophy

Why Build Custom vs. Use Libraries?

Decision matrix applied to each component:

Tokenizer:

  • Library: ONNX Runtime GenAI (50MB+, manual AAR, limited control)
  • Custom: 350 lines Kotlin (3.5MB config, full control, deep learning)
  • Choice: Custom (worth the effort)

KV Cache:

  • Library: Not available in standard ONNX Runtime
  • Custom: Manual implementation required
  • Choice: Custom (only option)

Inference Runtime:

  • Library: ONNX Runtime (mature, well-tested)
  • Custom: Would take months, reinventing wheel
  • Choice: Library (right tool for the job)

The pattern: Build custom when it provides significant value (learning, control, size). Use libraries when they’re the right tool.

Testing Strategy

Multi-device testing with devices found around my flat:

  • Old S10+ from my mom
  • Pixel 7a
  • Various other Android devices

Why this matters:

  • Different NPU capabilities per chipset
  • Varying memory constraints
  • Different thermal throttling thresholds
  • Real-world hardware diversity

Edge ML must work on the chaos of actual devices, not just the development phone.

Technical Challenges Solved

Challenge 1: Tokenizer Mismatch

  • Problem: 99-token placeholder vs. 32k-token model
  • Solution: Custom BPE implementation from tokenizer.json
  • Result: Perfect text quality, zero dependencies

Challenge 2: Exponential Slowdown

  • Problem: O(n²) complexity without caching
  • Solution: Transformer KV cache implementation
  • Result: 6x speedup, constant-time tokens

Challenge 3: Memory Leaks

  • Problem: OOM crashes after ~200 tokens
  • Solution: Proper tensor lifecycle management
  • Result: Stable memory, unlimited generation

Challenge 4: Premature Stopping

  • Problem: Heuristics stopped mid-sentence
  • Solution: N-gram detection + user control
  • Result: Natural endings, manual override

Challenge 5: Performance

  • Problem: 800ms/token too slow for production
  • Solution: NNAPI + KV cache + quantization
  • Result: 250ms/token (24x total speedup)

Current Status

Production Ready: Conversational AI branch with Phi-3 mini

  • All critical issues resolved
  • Stable memory management
  • Hardware acceleration working
  • Quality text generation
  • User control implemented

Experimental: Document parsing branch with Granite Docling

  • OCR and table extraction
  • Multi-page document processing
  • Structured data export
  • Active research and development

Future Roadmap

Short Term

  • Temperature/top-p sampling (currently greedy)
  • Conversation history persistence with Room DB
  • Multiple concurrent chat sessions
  • Model parameter tuning UI
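
Picking up the first item above, a rough sketch of what a temperature/top-p sampling step could look like in place of the current greedy argmax; this is purely illustrative and not yet part of the project:

```kotlin
import kotlin.math.exp
import kotlin.random.Random

fun sampleToken(logits: FloatArray, temperature: Float = 0.8f, topP: Double = 0.9): Int {
    // Softmax with temperature (numerically stabilised by subtracting the max).
    val scaled = logits.map { (it / temperature).toDouble() }
    val maxLogit = scaled.maxOrNull() ?: 0.0
    val exps = scaled.map { exp(it - maxLogit) }
    val sum = exps.sum()
    val probs = exps.map { it / sum }

    // Keep the smallest set of tokens whose cumulative probability reaches topP.
    val sorted = probs.withIndex().sortedByDescending { it.value }
    val kept = mutableListOf<IndexedValue<Double>>()
    var cumulative = 0.0
    for (entry in sorted) {
        kept += entry
        cumulative += entry.value
        if (cumulative >= topP) break
    }

    // Sample from the renormalised nucleus.
    var r = Random.nextDouble() * cumulative
    for (entry in kept) {
        r -= entry.value
        if (r <= 0) return entry.index
    }
    return kept.last().index
}
```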

Medium Term

  • Speech-to-text integration (Whisper)
  • Text-to-speech synthesis
  • Android Auto voice interface
  • Voice-controlled document queries

Long Term

  • Model distillation for smaller variants
  • On-device LoRA fine-tuning
  • INT2 quantization experiments
  • Multi-model orchestration

Technology Stack

Android

  • Kotlin 1.9.20
  • Jetpack Compose
  • Material 3
  • Coroutines + Flow
  • Hilt dependency injection

Machine Learning

  • ONNX Runtime 1.19.2
  • NNAPI acceleration
  • Custom tokenization
  • INT4 quantization

Architecture

  • Clean Architecture
  • MVVM pattern
  • Repository pattern
  • Use case layer

Key Learnings

Edge MLOps is Different

  1. Memory discipline: One leak kills mobile apps
  2. Performance required: Users won’t wait 30s/token
  3. Hardware diversity: Test on real, varied devices
  4. Dependencies cost: APK size, complexity, control
  5. User control: Models are unpredictable, give override

From Scratch Benefits

  • Deep understanding of underlying algorithms
  • Debugging becomes possible (not black box)
  • Full control over behavior and optimization
  • Minimal dependencies and APK size
  • Learning compounds over time

Mobile Constraints Drive Innovation

  • Forced quantization (discovered INT4 works)
  • Required KV cache (learned transformer internals)
  • Demanded memory discipline (better code)
  • Necessitated hardware acceleration (NNAPI expertise)
  • Enabled creativity (custom solutions)

Open Source

The complete source code is available on GitHub: github.com/IgnacioLD/edgemind

Contributions welcome. Issues and pull requests appreciated.

Conclusion

EdgeMind demonstrates that production-quality LLM inference on Android is achievable with proper engineering. The project evolved through three major pivots, each narrowing the scope and deepening the technical work. What started as an ambitious multi-model system became a deep exploration of edge MLOps: quantization, tokenization, caching, acceleration, and memory management.

The result: 4 tokens/second, coherent text, stable memory, complete privacy. All running locally on Android. All from scratch. All open source.

The journey continues. Edge AI is just getting started.