Running Transformer Models on Mobile NPUs

Key takeaways:

Generic LiteRT NPU dispatch doesn't work for transformers — LayerNorm and Softmax fragment the graph into 7 subgraphs, only 3 on NPU
Qualcomm AI Hub compilation fixes this — it fuses ops that the runtime dispatcher can't, putting all subgraphs on NPU
Use INT8 weights + INT16 activations (w8a16) — only 1.4% accuracy loss vs 8.1% with blind INT8
Always use a two-step pipeline — quantize first, then compile (workaround for AI Hub dtype conflict)
Result: 48.3% accuracy at 67ms on NPU — comparable to ONNX CPU (49.7% at 46.5ms), all 8 CPU cores freed
NPU value isn't speed — it's power — 5–10× less energy, no thermal throttle, buttery UI

The Bottom Line: 5–10× Battery Savings

Our RoBERTa transformer runs at 67ms on the Hexagon NPU vs 46.5ms on CPU, so the NPU is actually slower. But that misses the point. The NPU draws 5–10× less power while running inference (~2–3mW vs ~15mW). In a 1-hour journaling session with ~4,500 sentiment inferences, that's the difference between noticeable battery drain and barely registering. All 8 CPU cores stay free for buttery 120fps UI, the phone doesn't thermal throttle, and background sync runs uninterrupted.

NPU isn't about speed. It's about building apps that don't kill the battery.

The Problem

We have a 28-emotion sentiment classifier (SamLowe/roberta-base-go_emotions, 125M parameters) running on-device in SentiLog, an Android journaling app. It detects emotions as you type — English, German, Japanese, and Korean — using ONNX Runtime at 46.5ms per inference on CPU. We wanted NPU acceleration for better battery life and CPU offload.

The hardware is good: OnePlus 13, Snapdragon 8 Elite, Hexagon HTP v79 NPU capable of 45 TOPS. In theory, a transformer should fly on this. In practice, getting a transformer to actually run on an NPU — not just fall back to CPU through the NPU codepath — required understanding how the model maps to hardware at the op level.

Inside the Transformer: What the NPU Actually Sees

Before you can diagnose NPU failures, you need to understand what happens when a text like "I can't stop thinking about her" enters the model. Here's every compute step, with what the NPU can and can't handle:

Step 1: Tokenization (~0.5ms, CPU-only)
  Text → byte-level BPE tokens → integer IDs
  "I can't stop..." → [0, ..., 2]  (RoBERTa wraps the sequence in <s>=0 and </s>=2)
  CPU table lookup. Not a neural op. NPU never sees this.

Step 2: Embedding Lookup (~2ms)
  [token_ids, position_ids] → 768-dim vectors per token
  Three embedding tables, each 768-wide, results summed + LayerNorm.
  ⚠ LayerNorm: Mean + Variance + Normalize + Scale = 4 ops
  NPU dispatcher sees 4 separate ops, not 1 fused instruction.
  Result: this entire step falls back to CPU.

Step 3: 12× Transformer Layers (bulk of compute)
  Per layer:
  ┌─ Multi-Head Self-Attention (12 heads, 64 dim each)
  │    QKV projections 768→768   ← MatMul   ✓ NPU
  │    Q·Kᵀ / √64                ← MatMul   ✓ NPU
  │    Softmax(scores)           ← Softmax  ✗ CPU (FP intermediate)
  │    Attention × V             ← MatMul   ✓ NPU
  │    Output projection         ← MatMul   ✓ NPU
  │    LayerNorm                 ← 4 ops    ✗ CPU
  └─ Feed-Forward Network (768→3072→768)
       Linear 768→3072           ← MatMul   ✓ NPU
       GELU activation           ← GELU     ✗ CPU (unsupported)
       Linear 3072→768           ← MatMul   ✓ NPU
       LayerNorm                 ← 4 ops    ✗ CPU

  12 layers × (2 LayerNorm + 1 Softmax + 1 GELU) = 48 CPU fallback points
  Each fallback = DMA copy NPU→CPU, compute, DMA copy CPU→NPU
  Each DMA: ~0.5–2ms. 48 transfers × avg 1ms = ~48ms overhead alone.

Step 4: Pooler + Classifier (~1ms)
  CLS token → Linear 768→768 → Tanh → Linear 768→28 → Softmax
  Softmax again: CPU fallback

This is why generic NPU dispatch produces 3/7 subgraphs on NPU instead of all 7. The MatMul ops go to NPU. Everything else — LayerNorm, Softmax, GELU — falls back to CPU. The memory transfer overhead between those 7 subgraphs is what makes it 290ms instead of 46ms.
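The fallback arithmetic from the walkthrough can be reproduced in a few lines. This is a back-of-envelope sketch that uses only the figures quoted in this post (12 layers, 4 unsupported ops per layer, 0.5–2ms per transfer). The function name and the `transfers_per_fallback` parameter are hypothetical, and the default of one transfer per fallback is an assumption chosen to match the "48 transfers" estimate, not a measured count.

```python
def fallback_overhead_ms(layers=12, layernorms=2, softmaxes=1, gelus=1,
                         ms_per_transfer=1.0, transfers_per_fallback=1):
    """Estimate CPU-fallback points and transfer overhead for the encoder stack."""
    # Every unsupported op in every layer is one round trip to the CPU.
    fallback_points = layers * (layernorms + softmaxes + gelus)
    overhead = fallback_points * transfers_per_fallback * ms_per_transfer
    return fallback_points, overhead

points, typical = fallback_overhead_ms()                # avg 1ms per transfer
_, best = fallback_overhead_ms(ms_per_transfer=0.5)     # optimistic bound
_, worst = fallback_overhead_ms(ms_per_transfer=2.0)    # pessimistic bound
print(points, best, typical, worst)  # 48 24.0 48.0 96.0
```

Setting `transfers_per_fallback=2` counts both the NPU→CPU and the CPU→NPU copy for each fallback, which doubles every estimate.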

What Qualcomm AI Hub does differently: It compiles the model on cloud servers that know the exact Hexagon HTP v79 instruction set. LayerNorm gets fused into a single HTP-native instruction. Softmax gets fused. GELU gets fused. The compiled binary has zero CPU fallback points. That's why AI Hub gives 67ms instead of 290ms.

What We Tried (and What Failed)

1. LiteRT with Accelerator.NPU (Generic Dispatch)

val model = CompiledModel.create(
    modelPath,
    CompiledModel.Options(Accelerator.NPU, Accelerator.CPU)
)

Issue: LayerNorm, Softmax, and GELU aren't recognized as NPU-fusable by the runtime dispatcher

Effect: 3/7 subgraphs on NPU, 4/7 on CPU. ~72 DMA transfers across 12 transformer layers. Latency: 290ms (slower than CPU-only at 270ms)

Fix: Use ahead-of-time compilation (AI Hub), which fuses these ops at the model level, not at runtime

2. QualcommOptions Tuning (Burst + O3)

Issue: Runtime hints (burst mode, O3 optimization) can't add hardware support for structurally unsupported ops

Effect: Same 3/7 split. ~290ms. No measurable improvement over default dispatch.

Fix: Op support is determined at compile time. Tuning hints only help if ops are already on NPU.

3. Smaller Models (DistilBERT / DistilRoBERTa)

Issue: DistilBERT (67M params, 6 layers) has the same LayerNorm/Softmax/GELU pattern, just fewer of them

Effect: 3/5 subgraphs on NPU. 165–215ms. Still slower than ONNX CPU, and ~4% worse accuracy on our validation set.

Fix: Model size isn't the bottleneck. The architecture is. Fewer layers = less memory transfer overhead, but it doesn't solve the root problem.

4. Full Static INT8 Quantization + LiteRT

Issue: INT8 quantization doesn't change which ops run on NPU; it only changes the dtype of the ops that already do

Effect: Same 3/7 split. 385ms. Accuracy dropped erratically (41.6%) due to blind quantization without calibration data.

Fix: When using INT8, always provide calibration data. Without it, the quantizer guesses activation ranges and clips important signal.

5. NNAPI Delegate

Issue: NNAPI is deprecated on Android 15+ and routes to the NNAPI reference CPU backend on modern devices

Effect: ~740ms. Either silently uses the slow software backend or fails entirely depending on device firmware.

Fix: Don't use NNAPI for new development. Use LiteRT with vendor-specific delegates or QNN EP directly.

What Actually Works: Qualcomm AI Hub Two-Step Pipeline

The key insight: compile the model on Qualcomm's cloud servers, not at runtime on the device. AI Hub's compiler knows the Hexagon HTP v79 instruction set and fuses LayerNorm, Softmax, and GELU into NPU-native instructions before the binary is ever deployed.

AI Hub Bug — Dtype Conflict: Combining --quantize_full_type w8a16 + --truncate_64bit_tensors + calibration data in a single submit_compile_job() causes internal dtype validation conflicts (the quantizer produces INT16 activations but the truncation pass expects INT8). Workaround: always separate quantization from compilation into two distinct jobs.

import qai_hub as hub
from qai_hub import QuantizeDtype

# STEP 1: Quantize with calibration data
# (no truncation — works with int64 model + int64 calibration)
quantize_job = hub.submit_quantize_job(
    model="model_fp32.onnx",          # clean FP32, no pre-quantization
    calibration_data=cal_dict,        # 100 representative texts (int64 IDs)
    weights_dtype=QuantizeDtype.INT8,
    activations_dtype=QuantizeDtype.INT16,  # w8a16
)
quantize_job.wait()
# Download quantized ONNX; merge external weights into single file

# STEP 2: Compile to QNN context binary
# (truncation here — already quantized, no calibration needed)
compile_job = hub.submit_compile_job(
    model="model_w8a16_merged.onnx",
    device=hub.Device("Snapdragon 8 Elite QRD"),
    options="--target_runtime qnn_context_binary --truncate_64bit_io --truncate_64bit_tensors",
)
compile_job.wait()
compile_job.download_target_model("model_w8a16_qnn.bin")

Result: 48.3% accuracy at ~67ms on the Hexagon NPU. That is only 1.4 percentage points below the ONNX CPU baseline (49.7%), with all 8 CPU cores completely free for UI, sync, and background work.
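The pipeline passes a `cal_dict` without showing how it is built. The sketch below is one plausible way to shape the calibration samples, not the project's actual code. The helper names, the input names `input_ids` and `attention_mask`, and the single-sample batch layout are all assumptions. The pad ID of 1 is RoBERTa's `<pad>` token. Each sample would likely still need converting to an int64 NumPy array of shape (1, 128) before being handed to `submit_quantize_job`.

```python
def to_fixed_length(token_ids, max_len=128, pad_id=1):
    """Truncate or right-pad one tokenized text; return (input_ids, attention_mask).

    Simplification: naive slicing can cut off the trailing </s> token; a real
    tokenizer with truncation enabled keeps it.
    """
    ids = list(token_ids[:max_len])
    mask = [1] * len(ids)              # 1 = real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 = padding


def build_calibration_lists(tokenized_texts, max_len=128):
    """Collect per-input sample lists keyed by (assumed) ONNX input names."""
    samples = {"input_ids": [], "attention_mask": []}
    for token_ids in tokenized_texts:
        ids, mask = to_fixed_length(token_ids, max_len)
        samples["input_ids"].append([ids])         # leading batch dim of 1
        samples["attention_mask"].append([mask])
    return samples


# Two hypothetical, already-tokenized texts (IDs are made up for illustration)
cal_lists = build_calibration_lists([[0, 5, 2], [0, 6, 7, 2]], max_len=4)
print(cal_lists["input_ids"])
```

The static length used here has to match the sequence length the ONNX model was exported with, otherwise the calibration inputs will not fit the model's fixed input shapes.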

Full Benchmark Results

Tested on OnePlus 13 (Snapdragon 8 Elite, Hexagon HTP v79, Android 16). Validation: 149 curated emotion texts, RoBERTa-base, top-1 accuracy.

Backend	Accuracy	Latency	Size	Notes
ONNX INT8 CPU	49.7%	46.5ms	120MB	Production baseline
QNN w8a16 NPU (AI Hub)	48.3%	~67ms	158MB	Best NPU result — all subgraphs on NPU
QNN FP32 NPU (AI Hub)	49.7%	228ms	240MB	Full accuracy, too slow
QNN INT8 NPU (no calibration)	41.6%	~86ms	121MB	Blind INT8 = 8.1% accuracy loss
AI Hub INT8 TFLite + LiteRT NPU	41.6%	170ms	122MB	2/3 subgraphs on NPU
LiteRT INT8 CPU (XNNPACK)	50.3%	270ms	122MB	No NPU
LiteRT INT8 NPU (generic dispatch)	50.3%	290ms	122MB	3/7 subgraphs — slower than CPU
DistilBERT LiteRT NPU	~46%	165–215ms	83MB	3/5 subgraphs, worse accuracy
NNAPI CPU Backend	49.7%	~740ms	120MB	Android 15+: deprecated, software fallback

Quantization Experiment: Where Accuracy Dies

The accuracy impact of each quantization step, measured against the 149-text validation set:

Quantization	Accuracy	vs FP32 Baseline	Model Size	Latency (NPU)
FP32 (no quant)	49.7%	—	476MB	228ms
INT8 dynamic (ONNX Runtime CPU)	49.7%	±0%	120MB	46.5ms (CPU)
INT8 static, no calibration	41.6%	−8.1%	121MB	~86ms
w8a16 with 100-sample calibration	48.3%	−1.4%	158MB	~67ms

The 6.7 percentage-point difference between blind INT8 and calibrated w8a16 comes from how transformers use activations. LayerNorm and Softmax produce narrow-range outputs that INT8 clips aggressively without calibration. INT16 activations give these layers the numerical headroom they need.
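A toy example makes the headroom argument concrete. This is a minimal sketch of symmetric uniform quantization in plain Python, not AI Hub's quantizer, whose internals we have not inspected. The activation values and the "blind" range of 0.5 are invented for illustration, and the function names are hypothetical.

```python
def quantize(values, max_abs, bits):
    """Symmetric uniform quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 32767 for INT16
    scale = max_abs / qmax
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax, min(qmax, q))    # clip anything outside the assumed range
        out.append(q * scale)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Made-up softmax-like activations: many tiny values, one dominant one
acts = [0.001, 0.004, 0.02, 0.075, 0.9]
observed_max = max(abs(v) for v in acts)   # what calibration data would reveal

err_blind_int8 = mean_abs_error(acts, quantize(acts, 0.5, 8))            # guessed range
err_calib_int8 = mean_abs_error(acts, quantize(acts, observed_max, 8))
err_calib_int16 = mean_abs_error(acts, quantize(acts, observed_max, 16))
print(err_blind_int8, err_calib_int8, err_calib_int16)
```

With a guessed range that is too narrow, the dominant value is clipped and the error is large. Calibration removes the clipping, and the wider INT16 grid then resolves the small values that INT8 rounds to zero or to the nearest coarse step.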

SentiLog Design Decisions

Theory is cheap. Here's every architectural decision that shaped SentiLog's on-device AI, with the reasoning and the alternatives we rejected.

Why offline-first architecture?

Chose: Fully on-device inference, zero cloud calls for AI
Rejected: API-based sentiment (OpenAI, Google Cloud NLP)

User journal entries are private. We don't trust cloud providers with unencrypted journal text — and users shouldn't have to. On-device also means: zero API cost per-inference, works on a flight, no latency from network round-trips, GDPR compliance by default. The engineering cost (120MB model bundle) was worth the trust guarantee.

Why RoBERTa-base, not a smaller model?

Chose: 125M parameter RoBERTa-base (fine-tuned on GoEmotions)
Rejected: DistilBERT (67M), MiniLM (22M), keyword classifier

We benchmarked DistilBERT at 165–215ms on NPU with 3/5 fragmented subgraphs — still slower than ONNX CPU, and 4% worse accuracy. The NPU bottleneck is architecture (LayerNorm/Softmax), not parameter count. A smaller model with the same attention mechanism hits the same wall. MiniLM achieved 31% top-1 accuracy on our test set, unusable for 28-emotion classification. The keyword classifier is a fallback, not a primary.

Why sequence length 128 and not 64?

Chose: 128 tokens max
Rejected: 64 tokens (tested as V4 in nightrun)

GoEmotions training data averages ~15 words (~20 tokens). Journal entries are typically 50–100 words. At 64 tokens, we truncate ~8% of entries and see a measurable accuracy drop on long-form reflective writing, which is exactly where emotion detection matters most. The CPU saving from seq=64 is ~15ms (46.5ms → ~31ms), and the NPU saving is about the same (~15ms). The 0.8% accuracy loss made it not worth it for the initial release.
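The truncation rate for a candidate sequence length is easy to measure on your own corpus before committing to it. The sketch below is a hypothetical helper, not SentiLog code, and the token counts are invented, so the rates it prints say nothing about the ~8% figure above. It assumes two special tokens per sequence (`<s>` and `</s>`).

```python
def truncation_rate(token_counts, max_len, special_tokens=2):
    """Fraction of entries whose tokens do not fit in max_len once special tokens are added."""
    budget = max_len - special_tokens
    truncated = sum(1 for n in token_counts if n > budget)
    return truncated / len(token_counts)


# Hypothetical token counts per journal entry (not SentiLog data)
counts = [18, 35, 52, 61, 64, 70, 90, 110, 126, 140]
for max_len in (64, 128):
    print(max_len, truncation_rate(counts, max_len))
```

Running this over real entries, split by entry type, would show whether the truncated share is concentrated in the long reflective entries.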

Why ship CPU as the production backend?

Chose: ONNX INT8 CPU as production default
Rejected: QNN NPU as production default (yet)

ONNX INT8 CPU at 46.5ms is reliable, well-understood, and works on every Android device. QNN NPU requires: the AI Hub-compiled binary (Snapdragon 8 Elite specific), the onnxruntime-android-qnn dependency, correct ADSP library path setup, and a QNN context binary that matches the exact ORT version. We hit QNN error 5000 (context binary load failure) during testing — a version mismatch between the AI Hub SDK and ORT 1.22.0. NPU is a v2.5 upgrade path, not a v2.4 production risk.

Why 800ms debounce on sentiment analysis?

Chose: Re-analyze every 800ms while typing
Rejected: Every keypress, every word, every sentence

Typing speed averages 40–60 WPM (~4–6 chars/sec). Analyzing on every keypress means ~5 inferences/second = 18,000/hour. At 46.5ms each, that ties up a core for roughly a quarter of every second. 800ms catches a natural typing pause (end of phrase), feels responsive, and keeps inferences at ~4,500/hour. At ~15mW of CPU power for the 46.5ms each inference runs: 18,000/hr × 15mW × 46.5ms ≈ 12.6J vs 4,500/hr × 15mW × 46.5ms ≈ 3.1J. Four times less battery just from the debounce.
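The debounce trade-off can be checked numerically. This sketch uses the post's figures (46.5ms latency, ~15mW CPU power) and treats the full inference latency as the active time per inference, which is an assumption. The absolute joule figures depend on that assumption; the 4× ratio does not. The function name is hypothetical.

```python
def hourly_budget(interval_s, latency_ms=46.5, power_mw=15.0):
    """Inferences per hour, CPU busy fraction, and energy in joules for one hour of typing."""
    inferences = 3600 / interval_s
    busy_fraction = (latency_ms / 1000) / interval_s
    energy_j = inferences * (power_mw / 1000) * (latency_ms / 1000)
    return inferences, busy_fraction, energy_j


per_key = hourly_budget(0.2)     # ~5 keypresses per second
debounced = hourly_budget(0.8)   # 800ms debounce

print(per_key)
print(debounced)
print(per_key[2] / debounced[2])  # energy ratio between the two policies
```

Because both the inference count and the energy scale linearly with the trigger rate, any change to the debounce interval changes battery cost by the same factor.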

Trade-offs & Decisions

Every architectural choice involves a trade-off. Here's the decision matrix for SentiLog's AI stack:

Decision	What We Chose	What We Rejected	Why
Quantization	w8a16 (INT8 weights + INT16 activations)	Blind INT8	8.1% accuracy loss with INT8 vs 1.4% with calibrated w8a16
NPU compilation	Ahead-of-time (AI Hub)	Runtime dispatch (LiteRT)	Runtime: 3/7 subgraphs on NPU, 290ms. AOT: all on NPU, 67ms
Sequence length	128 tokens	64 tokens	0.8% accuracy gain worth more than 15ms CPU saving
Production backend	ONNX INT8 CPU (46.5ms)	QNN NPU (67ms)	QNN hit error 5000; CPU is 20ms faster and has no vendor dependency
Model architecture	RoBERTa-base (125M params)	DistilBERT / MiniLM	Smaller models still hit the same NPU subgraph fragmentation, with worse accuracy
Inference trigger	800ms debounce	Per-keypress	4,500 inferences/hr vs ~18,000/hr — 4× battery reduction
Language models	Separate model per language	Single multilingual model for all languages	Language-specific fine-tuned models outperform generic multilingual on GoEmotions (DE: F1 0.447 vs ~0.31)

Best Practices

1. Don't Rely on Generic Runtime Dispatch

LiteRT's Accelerator.NPU does per-op dispatch. For CNNs with all NPU-native ops, this works. For transformers, it fragments the graph into alternating NPU/CPU subgraphs with ~0.5–2ms per memory transfer. Always use ahead-of-time compilation via Qualcomm AI Hub or equivalent vendor tools (MediaTek NeuroPilot, Samsung ONE, etc.).
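To see why per-op dispatch fragments a transformer, it helps to model it. The sketch below is a toy partitioner, not LiteRT's actual algorithm: it assumes only `MatMul` is accepted by the NPU and groups consecutive ops by device. The op list is one encoder layer as described earlier in this post, and the names `NPU_OPS`, `LAYER_OPS`, and `partition` are hypothetical. Its subgraph counts will not match the 3/7 split measured on device, since the real runtime partitions differently.

```python
from itertools import groupby

NPU_OPS = {"MatMul"}  # assumption: only MatMul is dispatched to the NPU

# One encoder layer, in execution order
LAYER_OPS = [
    "MatMul",      # QKV projections
    "MatMul",      # Q·K^T scores
    "Softmax",
    "MatMul",      # attention × V
    "MatMul",      # output projection
    "LayerNorm",
    "MatMul",      # FFN 768→3072
    "GELU",
    "MatMul",      # FFN 3072→768
    "LayerNorm",
]


def partition(ops, npu_ops=NPU_OPS):
    """Group consecutive ops by target device into (device, [ops]) subgraphs."""
    def device(op):
        return "NPU" if op in npu_ops else "CPU"
    return [(dev, list(group)) for dev, group in groupby(ops, key=device)]


subgraphs = partition(LAYER_OPS)
for dev, ops in subgraphs:
    print(dev, ops)
print(len(subgraphs), "subgraphs,", len(subgraphs) - 1, "device boundaries")

# If a compiler can place every op on the NPU, the layer collapses to one subgraph
fused = partition(LAYER_OPS, {"MatMul", "Softmax", "LayerNorm", "GELU"})
print(len(fused), "subgraph after fusion")
```

Every device boundary in the output is a memory transfer, which is the cost that ahead-of-time compilation removes.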

2. Fix Your Input Shapes Before Submission

NPUs require static tensor shapes. Convert dynamic ONNX axes to fixed values before submitting to AI Hub:

import onnx
model = onnx.load("model.onnx")
for inp in model.graph.input:
    dims = inp.type.tensor_type.shape.dim
    dims[0].dim_value = 1    # batch size
    dims[1].dim_value = 128  # sequence length
onnx.save(model, "model_fixed.onnx")

3. Use w8a16, Not Blind INT8

INT8 weights + INT16 activations preserves numerical precision in LayerNorm and Softmax. Blind INT8 drops accuracy by ~8%; calibrated w8a16 loses only ~1.4%. Always provide 100+ representative calibration samples — the quantizer needs them to correctly set activation clipping ranges.

4. Submit Clean FP32 to AI Hub

Don't pre-quantize with ONNX Runtime before submitting to AI Hub. ONNX-specific quantization ops (QLinearMatMul, etc.) confuse the QNN converter and produce exit code 255. Let AI Hub handle quantization from a clean FP32 model with int64 inputs.

5. Ship CPU as Fallback — Always

// Graceful degradation across backends:
val result = qnn?.analyze(text)            // QNN NPU: 67ms, 5-10x less battery
    ?: onnxCpu?.analyze(text)              // ONNX CPU: 46.5ms, reliable
    ?: keywordClassifier.analyze(text)     // Keywords: 0ms, 100% compatibility

QNN binaries are device-specific (compiled for Snapdragon 8 Elite) and ORT-version-specific. Any device running a different chipset or a different ORT version will fall through to ONNX CPU automatically.

The Real Value of NPU

For transformer models, a well-optimized CPU is actually faster than NPU (46.5ms vs 67ms). The real NPU value is in power efficiency and sustained performance:

	CPU Inference	NPU Inference
Power draw during inference	~15mW	~2–3mW (5–10× less)
CPU availability	1 core saturated during inference	All 8 cores free
Thermal after 30 min	Throttles: 46ms → 80ms+	Consistent 67ms (no throttle)
UI smoothness	Occasional frame drop during inference	Butter smooth 120fps
1-hour session battery	Noticeable drain	Barely registers

The right question isn't "is NPU faster?" It's "does NPU deliver a better user experience over time?" For a journaling app running sentiment analysis every 800ms: same perceived speed (both well under the 800ms debounce), 5–10× less battery drain, smoother UI, no thermal throttle during extended sessions.

Architecture

PyTorch fine-tuned model
    ↓ torch.onnx.export (FP32, static shapes, opset 17, int64 inputs)
Clean FP32 ONNX (~476MB, with external weights)
    ↓
    ├──→ AI Hub submit_quantize_job (w8a16, 100-sample calibration)
    │        ↓
    │    Quantized ONNX (INT8 weights + INT16 activations, ~158MB)
    │        ↓ AI Hub submit_compile_job (QNN context binary, Snapdragon 8 Elite)
    │        ↓
    │    QNN Context Binary (~158MB) → ONNX Runtime QNN EP → Hexagon NPU (67ms)
    │
    └──→ ONNX Runtime quantize_dynamic (INT8, dynamic quantization)
             ↓
         ONNX INT8 (~120MB) → ONNX Runtime CPU EP → CPU (46.5ms) [production]

Fallback chain in Android app:
    QNN NPU → ONNX CPU → Keyword Classifier (0ms)

Tools Used

LiteRT 2.1.1 — Google's TFLite successor with CompiledModel API and QualcommOptions
ONNX Runtime 1.22.0 — CPU inference + QNN EP for NPU; 16KB page-aligned .so for Android 15+
Qualcomm AI Hub — Cloud compilation for Snapdragon NPU (free account, ~$0.40/job)
litert-torch 0.9.0 — PyTorch → TFLite conversion
ai-edge-quantizer — Post-training quantization for TFLite path
Kotlin 2.2.0 — Required for LiteRT 2.1.1 CompiledModel API

Device

OnePlus 13 — Snapdragon 8 Elite (SM8750), Hexagon HTP v79, 16GB LPDDR5X, Android 16. All benchmarks run on-device using a custom ADB Debug API with a curated 149-text multilingual validation dataset. Testing on mid-range devices (Snapdragon 7 series, Dimensity 9000, Exynos) is planned for v2.5; results will vary due to different NPU architectures and memory bandwidth.

This is based on real benchmarks from the SentiLog project (May 2026). Results are specific to Snapdragon 8 Elite — other chipsets will behave differently. Source code and test datasets are in the repository.