May 1, 2026 • 15 min read • Mobile AI • NPU • On-Device Inference

Running Transformer Models on Mobile NPUs

Lessons learned deploying RoBERTa-base (125M params) on Snapdragon 8 Elite's Hexagon NPU for real-time emotion detection. We tested 8 backends, discovered the w8a16 quantization sweet spot, and learned why NPU value is about power efficiency — not raw speed.

Snapdragon 8 Elite • Hexagon NPU • LiteRT / TFLite • ONNX Runtime • Qualcomm AI Hub • RoBERTa • w8a16 Quantization

Key takeaways:

  1. Generic LiteRT NPU dispatch doesn't work for transformers — LayerNorm and Softmax create CPU fallback subgraphs that kill performance
  2. Qualcomm AI Hub compilation fixes this — it fuses ops that the runtime dispatcher can't
  3. Use INT8 weights + INT16 activations (w8a16) — Google's recipe for transformer NPU deployment
  4. Use a two-step pipeline — quantize first, then compile (workaround for AI Hub dtype conflict)
  5. Provide calibration data: blind INT8 drops accuracy by ~8 percentage points (49.7% → 41.6%), while w8a16 with calibration loses only ~1.4 points (49.7% → 48.3%)
  6. Result: 48.3% accuracy at 67ms on NPU — comparable to ONNX CPU (49.7% at 46.5ms), CPU freed for other work

The Bottom Line: 5-10x Battery Savings

Our RoBERTa transformer runs at 67ms on the Hexagon NPU vs 46.5ms on CPU, so the NPU is about 44% slower per inference. But that misses the point. The NPU draws 5-10x less power while running inference (~2-3mW vs ~15mW). In a 1-hour journaling session with ~4,500 sentiment inferences, that's the difference between noticeable battery drain and barely registering. Plus: all 8 CPU cores stay free for a smooth 120fps UI, the phone doesn't thermal throttle, and background sync runs uninterrupted.

NPU isn't about speed. It's about building apps that don't kill the battery.

The Problem

We have a 28-emotion sentiment classifier (SamLowe/roberta-base-go_emotions, 125M parameters) running on-device in SentiLog, an Android journaling app. It detects emotions as you type using ONNX Runtime at 46.5ms per inference on CPU. We wanted to explore NPU acceleration for better battery life and CPU offload.

What We Tried (and What Failed)

1. LiteRT with Accelerator.NPU (Generic Dispatch)

val model = CompiledModel.create(
    modelPath,
    CompiledModel.Options(Accelerator.NPU, Accelerator.CPU)
)

Result: 3 out of 7 subgraphs on NPU, 4 on CPU. 290ms (slower than CPU-only 270ms).

Why it failed: LiteRT's runtime dispatcher evaluates each op individually. Transformer ops like LayerNorm (Mean + Variance + Normalize + Scale) appear as 4 separate ops — the NPU doesn't recognize the pattern. Each CPU↔NPU boundary requires a memory copy (~0.5-2ms). With 12 transformer layers, that's ~72 memory transfers that negate any compute savings.
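
The fragmentation described above can be illustrated with a toy partitioner. This is a minimal sketch, not LiteRT's actual dispatcher: the op names, the `NPU_SUPPORTED` set, and the `partition` helper are hypothetical, chosen only to show how unsupported ops split a linear graph into alternating NPU/CPU subgraphs.

```python
from itertools import groupby

# Hypothetical set of ops the NPU accepts on its own (an assumption for
# illustration, not the real Hexagon op list).
NPU_SUPPORTED = {"MatMul", "Add", "Gelu"}

def partition(ops):
    """Group consecutive ops by target device, as a per-op dispatcher would."""
    def device(op):
        return "NPU" if op in NPU_SUPPORTED else "CPU"
    return [(dev, list(group)) for dev, group in groupby(ops, key=device)]

def count_boundaries(subgraphs):
    """Each hand-off between adjacent subgraphs implies one memory copy."""
    return max(len(subgraphs) - 1, 0)

# One simplified transformer block, with LayerNorm decomposed into 4 primitive ops.
layer = [
    "MatMul", "Softmax", "MatMul", "Add",
    "Mean", "Variance", "Normalize", "Scale",
    "MatMul", "Gelu", "MatMul", "Add",
]

subgraphs = partition(layer)
for dev, ops in subgraphs:
    print(dev, ops)
print("boundaries:", count_boundaries(subgraphs))
```

In this toy block, a compiler that mapped the four LayerNorm primitives to a single NPU-supported op would remove the second CPU island and two of the four boundaries.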

2. QualcommOptions Tuning (Burst + O3)

Result: Same 3/7 split. ~290ms. No improvement. Tuning can't add support for structurally unsupported ops.

3. Smaller Models (DistilBERT, DistilRoBERTa)

Tested 6-layer models (67M-82M params). Result: 3/5 subgraphs on NPU. ~165-215ms. Better, but still slower than ONNX CPU.

4. Full Static INT8 Quantization

Result: Same 3/7 split. 385ms. Accuracy broke on some runs. The subgraph split is about op support, not quantization format.

5. NNAPI Delegate

Result: NNAPI is deprecated as of Android 15. It either fails outright or falls back to the slow NNAPI reference CPU backend (~740ms). Don't use it for new development.

What Actually Works

Qualcomm AI Hub: Two-Step Compilation Pipeline

The key insight: compile the model on Qualcomm's cloud servers, not at runtime on the device. AI Hub's compiler knows the Hexagon hardware and fuses LayerNorm/Softmax into NPU-native instructions.

AI Hub Bug: Combining --quantize_full_type w8a16 + --truncate_64bit_tensors + calibration data in a single submit_compile_job() causes dtype validation conflicts. The workaround: separate quantization from compilation.

import qai_hub as hub
from qai_hub import QuantizeDtype

# STEP 1: Quantize (no truncation — works with int64 model + int64 calibration)
quantize_job = hub.submit_quantize_job(
    model="model_fp32.onnx",
    calibration_data=cal_dict,         # 100 representative texts, int64
    weights_dtype=QuantizeDtype.INT8,
    activations_dtype=QuantizeDtype.INT16,
)
quantize_job.wait()
# Download & merge external weights into single ONNX file

# STEP 2: Compile (with truncation, no calibration — already quantized)
compile_job = hub.submit_compile_job(
    model="model_w8a16_merged.onnx",
    device=hub.Device("Snapdragon 8 Elite QRD"),
    options="--target_runtime qnn_context_binary --truncate_64bit_io --truncate_64bit_tensors",
)
compile_job.wait()
compile_job.download_target_model("model_w8a16_qnn.bin")

Result: 48.3% accuracy at ~67ms on Hexagon NPU. That is only 1.4 percentage points below the ONNX CPU baseline (49.7%), with the CPU freed for UI, sync, and other work.

Full Benchmark Results

Snapdragon 8 Elite, RoBERTa-base, 149 curated emotion texts:

Backend                           Accuracy  Latency  Size   Notes
ONNX INT8 CPU                     49.7%     46.5ms   120MB  Production baseline
QNN w8a16 NPU                     48.3%     ~67ms    158MB  Best NPU result
QNN FP32 NPU                      49.7%     228ms    240MB  Matches baseline accuracy, slow
QNN INT8 NPU                      41.6%     ~86ms    121MB  Blind INT8 = ~8-point loss
AI Hub INT8 TFLite + LiteRT NPU   41.6%     170ms    122MB  2/3 subgraphs on NPU
LiteRT INT8 NPU (generic)         50.3%     290ms    122MB  3/7 subgraphs (fragmented)
LiteRT INT8 CPU (XNNPACK)         50.3%     270ms    122MB  No NPU
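
The headline comparisons in the table reduce to two numbers per backend: the accuracy change in percentage points and the latency relative to the production baseline. A small sketch using three rows copied from the table (the `compare` helper is hypothetical, and the approximate ~67ms and ~86ms figures are treated as exact):

```python
# (backend, accuracy %, latency ms) copied from the benchmark table
BASELINE = ("ONNX INT8 CPU", 49.7, 46.5)
rows = [
    ("QNN w8a16 NPU", 48.3, 67.0),
    ("QNN FP32 NPU", 49.7, 228.0),
    ("QNN INT8 NPU", 41.6, 86.0),
]

def compare(row, baseline=BASELINE):
    """Return (accuracy delta in percentage points, latency ratio vs baseline)."""
    _, acc, ms = row
    _, base_acc, base_ms = baseline
    return round(acc - base_acc, 1), round(ms / base_ms, 2)

for row in rows:
    delta, ratio = compare(row)
    print(f"{row[0]}: {delta:+.1f} points, {ratio:.2f}x latency")
```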

Best Practices

1. Don't Rely on Generic Runtime Dispatch

LiteRT's Accelerator.NPU does per-op dispatch. For CNNs this usually works, because their ops are typically NPU-native. For transformers, it fragments the graph into alternating NPU/CPU subgraphs. Prefer ahead-of-time compilation via Qualcomm AI Hub or similar vendor tools.

2. Fix Your Input Shapes

NPUs require static tensor shapes. Convert dynamic ONNX shapes to fixed before submission:

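import onnx

# Pin the dynamic batch and sequence dimensions of every model input
model = onnx.load("model.onnx")
for inp in model.graph.input:
    dims = inp.type.tensor_type.shape.dim
    dims[0].dim_value = 1    # batch size
    dims[1].dim_value = 128  # sequence length (pad or truncate inputs to match)
onnx.save(model, "model_fixed.onnx")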
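
With shapes fixed at [1, 128], every input has to be padded or truncated to exactly 128 tokens before inference. A minimal sketch of that step, assuming the model takes input IDs plus an attention mask and uses RoBERTa's standard pad token ID (1) and end-of-sequence ID (2); the helper name `to_fixed_length` is hypothetical:

```python
PAD_ID = 1      # RoBERTa <pad> token ID
EOS_ID = 2      # RoBERTa </s> token ID
SEQ_LEN = 128   # must match the static sequence dimension baked into the model

def to_fixed_length(token_ids, seq_len=SEQ_LEN):
    """Truncate or right-pad token IDs; return (input_ids, attention_mask)."""
    ids = list(token_ids[:seq_len])
    if len(token_ids) > seq_len:
        ids[-1] = EOS_ID  # keep the end-of-sequence marker after truncation
    real = len(ids)
    input_ids = ids + [PAD_ID] * (seq_len - real)
    attention_mask = [1] * real + [0] * (seq_len - real)
    return input_ids, attention_mask

# "<s> ... </s>" style sequence with 3 content tokens (content IDs are made up).
input_ids, attention_mask = to_fixed_length([0, 100, 200, 300, 2])
print(len(input_ids), sum(attention_mask))
```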

3. Use w8a16, Not Blind INT8

INT8 weights + INT16 activations preserve numerical precision in LayerNorm and Softmax, where transformers need it most. Blind INT8 drops accuracy by ~8 percentage points; w8a16 with calibration loses only ~1.4.
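
A rough way to see why 16-bit activations help: for uniform quantization over a fixed range, the worst-case rounding error is half a step, and the step shrinks by a factor of 257 when going from 8 to 16 bits. The sketch below is a generic illustration, not AI Hub's quantizer; the [-8, 8] range and the sample value are assumptions.

```python
def quant_step(lo, hi, bits):
    """Step size of a uniform quantizer covering [lo, hi] with 2**bits levels."""
    return (hi - lo) / (2 ** bits - 1)

def quantize_dequantize(x, lo, hi, bits):
    """Round x to the nearest representable level and map it back to a float."""
    step = quant_step(lo, hi, bits)
    clamped = min(max(x, lo), hi)
    level = round((clamped - lo) / step)
    return lo + level * step

LO, HI = -8.0, 8.0  # assumed activation range, for illustration only
x = 0.0123          # a small activation value

for bits in (8, 16):
    err = abs(quantize_dequantize(x, LO, HI, bits) - x)
    print(f"{bits}-bit: step={quant_step(LO, HI, bits):.6f}, error={err:.6f}")
```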

4. Export Clean FP32 ONNX

Don't pre-quantize with ONNX Runtime before submitting to AI Hub. The ONNX-specific quantization ops confuse the QNN converter (exit code 255). Let AI Hub handle quantization from a clean FP32 model.

5. Provide Calibration Data

In our tests, 100 representative inputs were enough. Calibration data determines the activation ranges; without it, the quantizer has to guess them and accuracy suffers.
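
One way to keep a small calibration set representative is to draw texts evenly across emotion labels rather than purely at random. The sketch below is an assumption about how such a set could be chosen, not the selection procedure used for these benchmarks; `pick_calibration_texts` and the sample pool are hypothetical.

```python
import random
from collections import defaultdict

def pick_calibration_texts(labeled_texts, n=100, seed=0):
    """Round-robin over labels so every label contributes before any repeats."""
    by_label = defaultdict(list)
    for text, label in labeled_texts:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed keeps the set reproducible
    for texts in by_label.values():
        rng.shuffle(texts)

    picked = []
    labels = sorted(by_label)
    while len(picked) < n and any(by_label[label] for label in labels):
        for label in labels:
            if by_label[label] and len(picked) < n:
                picked.append(by_label[label].pop())
    return picked

# Toy pool: 3 labels with uneven counts (a real pool would cover all 28 emotions).
pool = [(f"joy {i}", "joy") for i in range(50)]
pool += [(f"anger {i}", "anger") for i in range(5)]
pool += [(f"grief {i}", "grief") for i in range(2)]

calibration = pick_calibration_texts(pool, n=12)
print(len(calibration))
```

Rare labels are exhausted first and common ones fill the remainder, so a label with only a few examples is still represented.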

6. Ship CPU as Fallback

// Backend priority with graceful fallback:
val result = qnn?.analyze(text)       // QNN NPU (67ms)
    ?: onnx?.analyze(text)            // ONNX CPU (46.5ms)
    ?: keywordClassifier.analyze(text) // Keywords (0ms)

The Real Value of NPU

For this transformer, a well-optimized CPU path is actually faster than the NPU (46.5ms vs 67ms). The real NPU value is elsewhere:

Metric                   CPU Inference              NPU Inference
Power during inference   ~15mW                      ~2-3mW (5-10x less)
CPU availability         1 core blocked             All cores free
Thermal after 30 min     Throttles (46ms → 80ms+)   Consistent 67ms
UI smoothness            Occasional jank            Butter smooth
Battery (1hr session)    Noticeable drain           Barely registers

The right question isn't "is NPU faster?" It's "does NPU deliver a better user experience?" For a journaling app running sentiment analysis every 800ms: same perceived speed, 5-10x lower power draw during inference, smoother UI, no thermal throttle.
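
The latency and power figures combine into energy per inference as power × time. A back-of-the-envelope sketch, assuming the quoted mW values are average power over one inference (how they were measured is not stated here):

```python
CPU_POWER_MW, CPU_LATENCY_MS = 15.0, 46.5
NPU_POWER_MW_RANGE, NPU_LATENCY_MS = (2.0, 3.0), 67.0
INTERVAL_MS = 800  # one inference every 800 ms

def energy_uj(power_mw, latency_ms):
    """mW x ms = microjoules."""
    return power_mw * latency_ms

inferences_per_hour = 3600 * 1000 // INTERVAL_MS
cpu_uj = energy_uj(CPU_POWER_MW, CPU_LATENCY_MS)
npu_uj = [energy_uj(p, NPU_LATENCY_MS) for p in NPU_POWER_MW_RANGE]

print(f"inferences per hour: {inferences_per_hour}")
print(f"CPU: {cpu_uj:.1f} uJ per inference")
print(f"NPU: {npu_uj[0]:.1f}-{npu_uj[1]:.1f} uJ per inference")
print(f"energy ratio: {cpu_uj / npu_uj[1]:.1f}x to {cpu_uj / npu_uj[0]:.1f}x")
```

Under that assumption, the NPU's longer latency narrows the gap in energy per inference to roughly 3.5-5x, since each inference keeps the NPU active about 44% longer than the CPU.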

Architecture

PyTorch Model
    ↓ torch.onnx.export (FP32, static shapes, opset 17)
Clean FP32 ONNX (int64 inputs, ~476MB)
    │
    ├──→ AI Hub submit_quantize_job (w8a16, calibration data)
    │       ↓ Quantized ONNX (int8 weights + int16 activations)
    │       ↓ AI Hub submit_compile_job (QNN context binary)
    │       ↓
    │   QNN Context Binary (~158MB) → ONNX Runtime QNN EP → Hexagon NPU
    │
    └──→ ONNX Runtime quantize_dynamic (INT8)
            ↓
        ONNX INT8 (~120MB) → ONNX Runtime CPU (production fallback)

Tools Used

Device

OnePlus 13 — Snapdragon 8 Elite, Hexagon NPU, Android 16. All benchmarks run on-device using a custom ADB Debug API with a curated 149-text multilingual validation dataset.

This guide is based on real benchmarks from the SentiLog project (May 2026). Results may vary with different model architectures, chipsets, and framework versions. Source code and test datasets are available in the repository.