Key takeaways:
- Generic LiteRT NPU dispatch doesn't work for transformers: LayerNorm and Softmax create CPU fallback subgraphs that kill performance
- Qualcomm AI Hub compilation fixes this: it fuses ops that the runtime dispatcher can't
- Use INT8 weights + INT16 activations (w8a16): Google's recipe for transformer NPU deployment
- Use a two-step pipeline: quantize first, then compile (workaround for AI Hub dtype conflict)
- Provide calibration data: blind INT8 drops accuracy by ~8 percentage points, w8a16 with calibration loses only ~1.4
- Result: 48.3% accuracy at 67ms on NPU, within 1.4 points of ONNX CPU accuracy (49.7% at 46.5ms), with the CPU freed for other work
The Bottom Line: 5-10x Battery Savings
Our RoBERTa transformer runs at 67ms on the Hexagon NPU vs 46.5ms on CPU, so the NPU is about 44% slower per inference. But latency alone misses the point. The NPU draws roughly 5-10x less power while running inference (~2-3mW vs ~15mW). In a 1-hour journaling session with ~4,500 sentiment inferences (one every 800ms), that's the difference between noticeable battery drain and barely registering. On top of that, all 8 CPU cores stay free for a smooth 120fps UI, the phone doesn't thermal throttle, and background sync runs uninterrupted.
NPU isn't about speed. It's about building apps that don't kill the battery.
The Problem
We have a 28-emotion sentiment classifier (SamLowe/roberta-base-go_emotions, 125M parameters) running on-device in SentiLog, an Android journaling app. It detects emotions as you type using ONNX Runtime at 46.5ms per inference on CPU. We wanted to explore NPU acceleration for better battery life and CPU offload.
What We Tried (and What Failed)
1. LiteRT with Accelerator.NPU (Generic Dispatch)
val model = CompiledModel.create(
    modelPath,
    CompiledModel.Options(Accelerator.NPU, Accelerator.CPU)
)
Result: 3 of 7 subgraphs ran on the NPU and 4 on the CPU. Latency was 290ms, slower than LiteRT's CPU-only path (270ms).
Why it failed: LiteRT's runtime dispatcher evaluates each op individually. Transformer ops like LayerNorm (Mean + Variance + Normalize + Scale) appear as 4 separate ops — the NPU doesn't recognize the pattern. Each CPU↔NPU boundary requires a memory copy (~0.5-2ms). With 12 transformer layers, that's ~72 memory transfers that negate any compute savings.
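To put that overhead in perspective, the arithmetic can be sketched in a few lines. The transfer count and per-copy cost come from the paragraph above; the breakdown of 72 into 12 layers × 3 fallback patterns × 2 crossings is an assumption about how the figure arises, not something measured here.

```python
# Back-of-the-envelope cost of CPU<->NPU boundary crossings.
LAYERS = 12
FALLBACK_OPS_PER_LAYER = 3    # assumption: 2 LayerNorm + 1 Softmax per layer
CROSSINGS_PER_FALLBACK = 2    # assumption: NPU -> CPU, then CPU -> NPU

COPY_MS_LOW, COPY_MS_HIGH = 0.5, 2.0   # per-copy cost quoted in the text


def transfer_overhead_ms(transfers, low=COPY_MS_LOW, high=COPY_MS_HIGH):
    """Return the (best case, worst case) total copy time in milliseconds."""
    return transfers * low, transfers * high


transfers = LAYERS * FALLBACK_OPS_PER_LAYER * CROSSINGS_PER_FALLBACK
low_ms, high_ms = transfer_overhead_ms(transfers)
print(f"{transfers} transfers cost {low_ms:.0f}-{high_ms:.0f} ms in copies alone")
```

Under these assumptions, even the low end of the range is a large share of the 46.5ms that the ONNX CPU path needs for a whole inference.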
2. QualcommOptions Tuning (Burst + O3)
Result: Same 3/7 split. ~290ms. No improvement. Tuning can't add support for structurally unsupported ops.
3. Smaller Models (DistilBERT, DistilRoBERTa)
Tested 6-layer models (67M-82M params). Result: 3/5 subgraphs on NPU. ~165-215ms. Better, but still slower than ONNX CPU.
4. Full Static INT8 Quantization
Result: Same 3/7 split. 385ms. Accuracy broke on some runs. The subgraph split is about op support, not quantization format.
5. NNAPI Delegate
Result: NNAPI is deprecated on Android 15+. Either fails or uses the slow NNAPI reference CPU backend (~740ms). Don't use it for new development.
What Actually Works
Qualcomm AI Hub: Two-Step Compilation Pipeline
The key insight: compile the model on Qualcomm's cloud servers, not at runtime on the device. AI Hub's compiler knows the Hexagon hardware and fuses LayerNorm/Softmax into NPU-native instructions.
AI Hub Bug: Combining --quantize_full_type w8a16 + --truncate_64bit_tensors + calibration data in a single submit_compile_job() causes dtype validation conflicts. The workaround: separate quantization from compilation.
import qai_hub as hub
from qai_hub import QuantizeDtype

# STEP 1: Quantize (no truncation — works with int64 model + int64 calibration)
quantize_job = hub.submit_quantize_job(
    model="model_fp32.onnx",
    calibration_data=cal_dict,  # 100 representative texts, int64
    weights_dtype=QuantizeDtype.INT8,
    activations_dtype=QuantizeDtype.INT16,
)
quantize_job.wait()

# Download & merge external weights into single ONNX file

# STEP 2: Compile (with truncation, no calibration — already quantized)
compile_job = hub.submit_compile_job(
    model="model_w8a16_merged.onnx",
    device=hub.Device("Snapdragon 8 Elite QRD"),
    options="--target_runtime qnn_context_binary --truncate_64bit_io --truncate_64bit_tensors",
)
compile_job.wait()
compile_job.download_target_model("model_w8a16_qnn.bin")
Result: 48.3% accuracy at ~67ms on Hexagon NPU. That is only 1.4 percentage points below the ONNX CPU baseline (49.7%), with the CPU freed for UI, sync, and other work.
Full Benchmark Results
Snapdragon 8 Elite, RoBERTa-base, 149 curated emotion texts:
| Backend | Accuracy | Latency | Size | Notes |
|---|---|---|---|---|
| ONNX INT8 CPU | 49.7% | 46.5ms | 120MB | Production baseline |
| QNN w8a16 NPU | 48.3% | ~67ms | 158MB | Best NPU result |
| QNN FP32 NPU | 49.7% | 228ms | 240MB | Matches baseline accuracy, slow |
| QNN INT8 NPU | 41.6% | ~86ms | 121MB | Blind INT8 = ~8-point loss |
| AI Hub INT8 TFLite + LiteRT NPU | 41.6% | 170ms | 122MB | 2/3 subgraphs on NPU |
| LiteRT INT8 NPU (generic) | 50.3% | 290ms | 122MB | 3/7 subgraphs (fragmented) |
| LiteRT INT8 CPU (XNNPACK) | 50.3% | 270ms | 122MB | No NPU |
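With only 149 texts, each percentage in the table corresponds to a small number of predictions. The sketch below converts the reported accuracies back into counts and expresses the gaps in percentage points. The rounding to whole counts is an assumption about how the percentages were produced.

```python
# Convert reported accuracies into counts of correct predictions on the
# 149-text validation set, and express gaps in percentage points (pp).
N_TEXTS = 149

reported = {
    "ONNX INT8 CPU": 49.7,
    "QNN w8a16 NPU": 48.3,
    "QNN INT8 NPU": 41.6,
    "LiteRT INT8 CPU (XNNPACK)": 50.3,
}


def correct_count(accuracy_pct, n=N_TEXTS):
    """Nearest whole number of correct predictions for a reported accuracy."""
    return round(accuracy_pct / 100 * n)


baseline = reported["ONNX INT8 CPU"]
for name, acc in reported.items():
    gap_pp = acc - baseline
    print(f"{name:28s} {correct_count(acc):3d}/{N_TEXTS}  {gap_pp:+.1f} pp vs baseline")
```

Read this way, the w8a16 gap comes down to about 2 texts out of 149 and the blind INT8 gap to about 12, so differences of a point or two rest on very few examples.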
Best Practices
1. Don't Rely on Generic Runtime Dispatch
LiteRT's Accelerator.NPU does per-op dispatch. For CNNs this generally works, because their ops tend to be NPU-native. For transformers, it fragments the graph into alternating NPU/CPU subgraphs. Prefer ahead-of-time compilation via Qualcomm AI Hub or similar vendor tools.
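The fragmentation effect can be illustrated with a toy model of per-op dispatch. This is only a sketch: the op sequence and the supported-op set are hypothetical stand-ins, not a dump of the real graph or of what the Hexagon NPU accepts.

```python
from itertools import groupby

# Toy op sequence for one transformer layer (illustrative only).
LAYER_OPS = ["MatMul", "Softmax", "MatMul", "Add", "LayerNorm",
             "MatMul", "Gelu", "MatMul", "Add", "LayerNorm"]

# Hypothetical support set: everything except the unfused patterns.
NPU_SUPPORTED = {"MatMul", "Add", "Gelu"}


def partition(ops, supported):
    """Group consecutive ops by target, as a per-op dispatcher would."""
    tagged = (("NPU" if op in supported else "CPU", op) for op in ops)
    return [(target, [op for _, op in group])
            for target, group in groupby(tagged, key=lambda t: t[0])]


def count_boundaries(subgraphs):
    """Each adjacent pair of subgraphs implies one CPU<->NPU hand-off."""
    return max(len(subgraphs) - 1, 0)


two_layers = LAYER_OPS * 2
per_op = partition(two_layers, NPU_SUPPORTED)
fused = partition(two_layers, NPU_SUPPORTED | {"Softmax", "LayerNorm"})

print(len(per_op), "subgraphs,", count_boundaries(per_op), "boundaries (per-op dispatch)")
print(len(fused), "subgraph(s),", count_boundaries(fused), "boundaries (patterns fused)")
```

Once the compiler can treat the unfused patterns as supported, the alternating subgraphs collapse into a single NPU region, which is the effect ahead-of-time compilation is aiming for.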
2. Fix Your Input Shapes
NPUs require static tensor shapes. Convert dynamic ONNX shapes to fixed before submission:
import onnx

model = onnx.load("model.onnx")
# Assumes every graph input is 2-D: [batch, sequence]
for inp in model.graph.input:
    dims = inp.type.tensor_type.shape.dim
    dims[0].dim_value = 1    # batch
    dims[1].dim_value = 128  # sequence
onnx.save(model, "model_fixed.onnx")
3. Use w8a16, Not Blind INT8
INT8 weights + INT16 activations preserve numerical precision in LayerNorm and Softmax, where transformers need it most. Blind INT8 drops accuracy by ~8 percentage points; w8a16 with calibration loses only ~1.4.
4. Export Clean FP32 ONNX
Don't pre-quantize with ONNX Runtime before submitting to AI Hub. The ONNX-specific quantization ops confuse the QNN converter (exit code 255). Let AI Hub handle quantization from a clean FP32 model.
5. Provide Calibration Data
100 representative inputs were enough in our tests. The calibration data determines activation ranges; without it, the quantizer guesses and accuracy suffers.
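Because the compiled model expects fixed shapes, each calibration sample has to be padded or truncated to the same length before it reaches the quantizer. The sketch below shows that step in plain Python. The input names `input_ids` and `attention_mask`, the helper names, and the sample token IDs are assumptions for illustration; the pad ID of 1 is RoBERTa's default. Converting each entry to an int64 array of shape (1, 128) before submission is left out.

```python
SEQ_LEN = 128   # fixed sequence length, matching the static shape set earlier
PAD_ID = 1      # RoBERTa's default <pad> token id


def pad_or_truncate(token_ids, seq_len=SEQ_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly seq_len long."""
    ids = list(token_ids[:seq_len])
    mask = [1] * len(ids)
    padding = seq_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


def build_calibration_dict(tokenized_texts):
    """Collect one entry per sample under each (assumed) model input name."""
    cal = {"input_ids": [], "attention_mask": []}
    for token_ids in tokenized_texts:
        ids, mask = pad_or_truncate(token_ids)
        cal["input_ids"].append([ids])         # leading batch dimension of 1
        cal["attention_mask"].append([mask])
    return cal


# Hypothetical tokenizer output for two short texts (0 = <s>, 2 = </s>).
samples = [[0, 100, 200, 2], [0, 300, 2]]
cal_dict = build_calibration_dict(samples)
print(len(cal_dict["input_ids"]), "samples of length", len(cal_dict["input_ids"][0][0]))
```

Truncation here simply cuts the tail; a real tokenizer configured with a maximum length would also keep the closing token, so in practice it is safer to let the tokenizer do the truncation.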
6. Ship CPU as Fallback
// Backend priority with graceful fallback:
val result = qnn?.analyze(text)            // QNN NPU (67ms)
    ?: onnx?.analyze(text)                 // ONNX CPU (46.5ms)
    ?: keywordClassifier.analyze(text)     // Keywords (0ms)
The Real Value of NPU
For transformer models, a well-optimized CPU is actually faster than NPU (46.5ms vs 67ms). The real NPU value is elsewhere:
| | CPU Inference | NPU Inference |
|---|---|---|
| Power during inference | ~15mW | ~2-3mW (5-10x less) |
| CPU availability | 1 core blocked | All cores free |
| Thermal after 30 min | Throttles (46ms → 80ms+) | Consistent 67ms |
| UI smoothness | Occasional jank | Butter smooth |
| Battery (1hr session) | Noticeable drain | Barely registers |
The right question isn't "is NPU faster?" It's "does NPU deliver a better user experience?" For a journaling app running sentiment analysis every 800ms: same perceived speed, 5-10x less battery drain, smoother UI, no thermal throttle.
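The session-level numbers follow from simple arithmetic, sketched below. The latencies, the 800ms interval, and the mW figures come from this guide; treating the mW figures as average power drawn for the duration of one inference is an assumption, and the resulting energy values are estimates rather than measurements.

```python
# Session-level arithmetic from the figures quoted in this guide.
SESSION_S = 3600      # 1-hour journaling session
INTERVAL_S = 0.8      # one inference every 800ms

CPU_LATENCY_MS, NPU_LATENCY_MS = 46.5, 67.0
CPU_POWER_MW = 15.0                 # assumed average power while inferring
NPU_POWER_MW_RANGE = (2.0, 3.0)     # same assumption, low and high estimate


def inferences_per_session(session_s=SESSION_S, interval_s=INTERVAL_S):
    return round(session_s / interval_s)


def energy_mj(power_mw, latency_ms):
    """Energy = power x time; mW x ms gives microjoules, so /1000 for mJ."""
    return power_mw * latency_ms / 1000


n = inferences_per_session()
cpu_busy_s = n * CPU_LATENCY_MS / 1000
npu_busy_s = n * NPU_LATENCY_MS / 1000
cpu_mj = energy_mj(CPU_POWER_MW, CPU_LATENCY_MS)
npu_mj = [energy_mj(p, NPU_LATENCY_MS) for p in NPU_POWER_MW_RANGE]

print(f"{n} inferences per session")
print(f"busy time per hour: CPU {cpu_busy_s:.0f}s, NPU {npu_busy_s:.0f}s")
print(f"energy per inference: CPU {cpu_mj:.2f} mJ, NPU {npu_mj[0]:.2f}-{npu_mj[1]:.2f} mJ")
```

Because the NPU run takes longer, the per-inference energy gap under this reading (roughly 3.5-5x) is narrower than the raw power gap, which is worth keeping in mind when comparing the two backends.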
Architecture
PyTorch Model
    ↓ torch.onnx.export (FP32, static shapes, opset 17)
Clean FP32 ONNX (int64 inputs, ~476MB)
    ↓
    ├─→ AI Hub submit_quantize_job (w8a16, calibration data)
    │       ↓ Quantized ONNX (int8 weights + int16 activations)
    │       ↓ AI Hub submit_compile_job (QNN context binary)
    │       ↓
    │   QNN Context Binary (~158MB) → ONNX Runtime QNN EP → Hexagon NPU
    │
    └─→ ONNX Runtime quantize_dynamic (INT8)
            ↓
        ONNX INT8 (~120MB) → ONNX Runtime CPU (production fallback)
Tools Used
- LiteRT 2.1.1 — Google's TFLite successor with CompiledModel API
- ONNX Runtime 1.22.0 — CPU inference + QNN EP for NPU
- Qualcomm AI Hub — Cloud compilation for Snapdragon NPU (free account)
- litert-torch 0.9.0 — PyTorch → TFLite conversion
- ai-edge-quantizer — Post-training quantization
- Kotlin 2.2.0 — Required for LiteRT 2.1.1 CompiledModel API
Device
OnePlus 13 — Snapdragon 8 Elite, Hexagon NPU, Android 16. All benchmarks run on-device using a custom ADB Debug API with a curated 149-text multilingual validation dataset.
This guide is based on real benchmarks from the SentiLog project (May 2026). Results may vary with different model architectures, chipsets, and framework versions. Source code and test datasets are available in the repository.