Cold LLM inference on the GoS1 bench: TinyLlama (1.1B) at ~0.07 mWh/token, Mistral 7B at ~0.96 mWh/token (both on the pre-S30 ladder)

What was measured

Two cold inferences on the same T3 task, executed via Ollama and measured at the wall through the Tapo P110:

tinyllama (1.1B parameters, GGUF Q4): single T3 run, 2026-04-24
mistral (7B v0.3, GGUF Q4): T3 run, 2026-05-08 (the run that more closely matches the prose in CLAUDE.md)

Each model was unloaded before measurement so that the run started cold. A baseline was taken over 10 polls at 1 s intervals; the task was then issued and wall power was polled at 1 Hz until completion.
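The procedure above can be sketched as follows. This is a minimal illustration, not the bench's actual code: `read_watts` and `task_done` are hypothetical callables standing in for the smart-plug reading and the completion check, and subtracting the baseline before integrating is an assumption about how the stored figures were derived.

```python
import time
from typing import Callable, List


def mean_watts(read_watts: Callable[[], float], polls: int = 10,
               interval_s: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> float:
    """Average idle power over a fixed number of polls (the baseline)."""
    samples = []
    for _ in range(polls):
        samples.append(read_watts())
        sleep(interval_s)
    return sum(samples) / len(samples)


def poll_until_done(read_watts: Callable[[], float],
                    task_done: Callable[[], bool],
                    interval_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Collect one power sample per interval until the task reports completion."""
    samples = []
    while not task_done():
        samples.append(read_watts())
        sleep(interval_s)
    return samples


def net_energy_mwh(task_samples: List[float], baseline_w: float,
                   interval_s: float = 1.0) -> float:
    """Integrate power above baseline (rectangle rule) and convert J to mWh."""
    joules = sum(w - baseline_w for w in task_samples) * interval_s
    return joules / 3.6  # 1 mWh = 3.6 J
```

Dividing the result of `net_energy_mwh` by the number of generated tokens would give a per-token figure in the same unit as the table. At 1 Hz, short runs yield few samples, so the rectangle-rule integral is coarse.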

Numbers, from the stored result files

| Model | mWh per token | Tokens / s | W_base / W_task | Confidence |
|---|---|---|---|---|
| TinyLlama 1.1B | 0.0718 | (per file) | (per file) | 🟡 |
| Mistral 7B | 0.9639 | 49.7 | 61.8 W / 234.2 W | 🟢 |

Per-token ratio: Mistral 7B / TinyLlama ≈ 13.4× (0.9639 / 0.0718). In this measurement, the larger model uses roughly an order of magnitude more energy per token.
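As a sanity check on the Mistral row, the per-token figure can be approximately recovered from the other columns. The sketch below assumes that W_task is the mean wall power during the task and that the per-token energy is net of baseline; neither is stated explicitly in the result files quoted here.

```python
def mwh_per_token(w_base: float, w_task: float, tokens_per_s: float) -> float:
    """Net energy per token in mWh, assuming constant power above baseline."""
    joules_per_token = (w_task - w_base) / tokens_per_s  # W / (tokens/s) = J/token
    return joules_per_token / 3.6                        # 1 mWh = 3.6 J


# Values taken from the Mistral 7B row of the table.
mistral = mwh_per_token(w_base=61.8, w_task=234.2, tokens_per_s=49.7)

# Ratio of the two stored per-token figures.
ratio = 0.9639 / 0.0718

print(round(mistral, 3), round(ratio, 1))  # prints 0.964 13.4
```

The recomputed value lands within rounding distance of the stored 0.9639, which is consistent with (but does not prove) the net-of-baseline reading. The same check cannot be run for TinyLlama because its throughput and power columns are only given as "(per file)".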

What this measurement does not establish

Answer quality is not measured here. CLAUDE.md notes that TinyLlama produced corpus-grounded but generic answers on the RAG faithfulness probe (see the separate rag-faithfulness-rem-question finding); per-token energy alone is not a complete picture.
The bench does not measure end-to-end LLM inference (no client device, no network).
These figures predate the S30 ladder refresh and the move to the dynamic model catalog. Re-measuring on qwen3:1.7b and mistral-nemo:12b is a follow-up that would create a v2.

Read alongside

The RAG faithfulness finding (rag-faithfulness-rem-question), run on the same TinyLlama version: with the same retrieval, the smaller model generated a hallucinated answer. Energy per token and answer correctness are independent axes.
