OWL Measurement Methodology
How OWL measures the energy cost of compute tasks — and what it doesn’t measure.
Scope
OWL measures what happens inside one machine when it performs a real task. This is intentionally narrow. The energy cost of streaming is distributed across data centres, networks, and consumer devices — each with different measurement challenges and attribution problems. We start with the layer we can measure directly, at the wall, with no modelling assumptions.
This scoping decision means OWL results are not lifecycle assessments and should not be cited as total-cost-of-delivery figures. They answer a specific question: how much additional energy does this server draw to perform this task, above its idle baseline?
Measurement Principle
OWL uses wall-power delta measurement: the difference between what the server draws at idle and what it draws under load, captured by an external smart plug.
This follows the GoS REM (Remote Energy Measurement) approach: real devices, real workloads, measured externally, at polling intervals short enough to capture the task’s energy profile.
Measurement Protocol
Every test in OWL — video, LLM, image generation, RAG — follows the same core protocol:
- Focus mode. Suppress background system tasks (apt, cron, man-db, fwupd, etc.) that would introduce energy noise. Managed via `systemctl stop` with dedicated sudoers rules.
- Model unload (LLM/RAG only). Send `keep_alive=0` to Ollama and wait 3 seconds for GPU memory release. This ensures a cold start when cold-inference mode is selected.
- Baseline capture. Poll the Tapo P110 at 1-second intervals for a configurable period (currently 5 polls, configurable in Settings). The mean of these readings becomes W_base, the server's idle power draw.
- Lock. Acquire `/tmp/gos-measure.lock` to prevent concurrent measurements from overlapping. A FIFO queue manages waiting jobs.
- Execute task. Run the actual workload (ffmpeg, Ollama inference, SD-Turbo diffusion) while continuing to poll the P110 at 1-second intervals. Thermal sensors (CPU Tctl, GPU junction, GPU PPT) are read in parallel.
- Compute energy. Calculate delta power, total energy, and per-unit metrics (see formulas below).
- Persist. Write the full result to a JSON file: parameters, energy report, raw poll data, thermal readings, confidence flag. Every result is reproducible and exportable.
- Focus exit. Restart suppressed system timers in parallel (via ThreadPoolExecutor) to minimise downtime.
Between sequential runs (e.g., CPU vs GPU comparison), a configurable cooldown (currently 10 seconds — configurable in Settings) allows the system to return to thermal equilibrium.
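The core of the protocol (baseline capture, lock, polling while the task runs) can be sketched as follows. This is a minimal illustration, not OWL's actual code: `read_power_w` is a hypothetical callable standing in for a P110 read, `measure` and `poll_until` are illustrative names, and a plain blocking `flock` is used where OWL describes a FIFO queue. Focus mode, model unload, thermal sensors and the cooldown are left out.

```python
import fcntl
import statistics
import threading
import time


def poll_until(read_power_w, stop, samples, interval_s):
    """Append one power reading per interval until `stop` is set."""
    while not stop.is_set():
        samples.append(read_power_w())
        stop.wait(interval_s)


def measure(read_power_w, run_task, baseline_polls=5, interval_s=1.0,
            lock_path="/tmp/gos-measure.lock"):
    # Baseline capture: idle readings taken before the task starts.
    baseline = []
    for _ in range(baseline_polls):
        baseline.append(read_power_w())
        time.sleep(interval_s)

    # Lock: a blocking exclusive flock (assumption: stands in for the FIFO queue).
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)

        # Execute task while a background thread keeps polling the meter.
        task_samples, stop = [], threading.Event()
        poller = threading.Thread(
            target=poll_until,
            args=(read_power_w, stop, task_samples, interval_s),
        )
        start = time.monotonic()
        poller.start()
        try:
            run_task()
        finally:
            stop.set()
            poller.join()
        duration_s = time.monotonic() - start
    # Closing the file releases the lock.

    return {
        "w_base": statistics.fmean(baseline),
        "baseline_samples": baseline,
        "task_samples": task_samples,
        "duration_s": duration_s,
    }
```

Keeping the raw samples alongside the mean is what lets later steps compute per-run noise rather than relying on a single averaged figure.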
Energy Calculation
where Δt = task duration in seconds
All formulas use wall-power from the P110 (system-level), not component-level readings. The GPU’s self-reported power (its vendor sensor — amdgpu PPT or nvidia-smi power draw) is captured for reference but is not used in the primary energy calculation — it covers only the GPU die/board, not the full system delta (CPU, RAM, drives, fans, PSU losses).
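As a worked illustration of the arithmetic, the sketch below computes delta power, energy in watt-hours, and a per-unit figure from raw wall-power samples. It assumes that task power is the mean of the task-window polls and that energy is delta power multiplied by task duration; the function name `energy_report` and its fields are hypothetical, and the sample values are invented for the example.

```python
from statistics import fmean


def energy_report(baseline_w, task_w, duration_s, units=None):
    """Delta-power energy from wall samples (assumed form: mean power x duration)."""
    w_base = fmean(baseline_w)          # idle draw, in watts
    w_task = fmean(task_w)              # draw under load, in watts
    delta_w = w_task - w_base           # additional power attributed to the task
    energy_wh = delta_w * duration_s / 3600.0  # watt-seconds to watt-hours
    report = {
        "w_base": w_base,
        "w_task": w_task,
        "delta_w": delta_w,
        "energy_wh": energy_wh,
    }
    if units:
        # Per-unit metric, e.g. Wh per token or Wh per image.
        report["wh_per_unit"] = energy_wh / units
    return report


# Invented numbers: 80 W idle, 200 W under load, 90 s task, 3000 tokens.
example = energy_report([80, 80, 80], [200, 200, 200, 200], 90, units=3000)
```

With these invented inputs the task adds 120 W for 90 s, which is 3 Wh in total and about 1 mWh per token.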
Confidence Framework
Every OWL result carries a traffic-light confidence flag. Under the CR-028 Phase 2 model (designed with Tania Pouli), the flag answers one defensible question per run: can this run be distinguished from idle? It is a per-run confidence interval, not a fixed-watt rule of thumb.
We keep the raw per-poll power samples from both the baseline window and the task window, form a standard error on the measured power increase ΔW, then convert ΔW into a one-sided confidence that the task really draws above idle:
SE_calibrated = (variance_idle_pct / 100 · W_base) × √(1/n_base + 1/n_task)
SE_per-run = √(σ²_base / n_base + σ²_task / n_task)
SE_drift = (variance_idle_drift_pct / 100) · W_base
Confidence the task draws above idle:
confidence_positive = Φ(ΔW / SE_final)
| Flag | Meaning | Criteria (defaults) |
|---|---|---|
| 🟢 | Repeatable: the task is almost certainly above idle, with enough samples to be reliable. | confidence_positive ≥ 95% and ≥ 9 task polls |
| 🟡 | Early insight: directional evidence; a longer run would strengthen it. | confidence_positive ≥ 80% and ≥ 4 task polls |
| 🔴 | Need more data: cannot yet be distinguished from idle. | below the yellow threshold |
SE_final starts from the run's own sample noise (SE_per-run), takes the worst case against a calibrated idle floor (SE_calibrated), and adds a drift term for the time gap between the baseline and task windows, so it reflects real signal quality on the day, not an assumed noise floor. The minimum task-sample counts remain because 1 s power samples are autocorrelated: a very short task should not turn green on one or two lucky readings.
SE_calibrated uses variance_idle_pct as the calibrated idle noise floor. The per-codec calibration CVs (variance_cpu_pct / variance_gpu_pct) are run-to-run repeatability measures, reserved for a future aggregate-confidence layer rather than mixed into the single-run formula. The first pass uses raw sample counts and a 1.96 (95%) critical value; an autocorrelation correction (effective sample count) and a Student-t critical value are documented future refinements.
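The flag logic can be sketched in a few lines of Python. Two points here are assumptions, not statements of OWL's implementation: the text does not give the exact way the three terms combine, so this sketch takes the larger of the per-run and calibrated terms and adds the drift term linearly; and σ² is taken as the sample variance (n − 1 denominator). The function name `confidence_flag` is hypothetical.

```python
import math
from statistics import NormalDist, fmean, variance


def confidence_flag(baseline_w, task_w, variance_idle_pct, variance_idle_drift_pct):
    """Return (flag, confidence_positive) for one run."""
    n_base, n_task = len(baseline_w), len(task_w)
    if n_base < 2 or n_task < 2:
        return "red", 0.0  # too few samples to estimate a variance

    w_base = fmean(baseline_w)
    delta_w = fmean(task_w) - w_base

    se_per_run = math.sqrt(variance(baseline_w) / n_base + variance(task_w) / n_task)
    se_calibrated = (variance_idle_pct / 100 * w_base) * math.sqrt(1 / n_base + 1 / n_task)
    se_drift = variance_idle_drift_pct / 100 * w_base

    # ASSUMPTION: worst case of the two noise terms, plus drift added linearly.
    se_final = max(se_per_run, se_calibrated) + se_drift
    if se_final == 0:
        confidence = 1.0 if delta_w > 0 else 0.0
    else:
        confidence = NormalDist().cdf(delta_w / se_final)  # one-sided Phi

    if confidence >= 0.95 and n_task >= 9:
        return "green", confidence
    if confidence >= 0.80 and n_task >= 4:
        return "yellow", confidence
    return "red", confidence
```

Note how the sample-count gates act independently of the confidence value: a strong signal with only five task polls still lands on yellow, which is the behaviour the autocorrelation argument calls for.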
variance_pct × Wbase), so historical runs keep their badge.
tapo library’s reliability). The underlying instrument is more precise — ~1 mW resolution via direct device read — so future versions could lower the hardware noise floor by ~3 orders of magnitude if needed. In practice, however, the dominant noise sources are OS background processes (apt, cron, systemd timers) and thermal drift between runs, not hardware quantisation. Focus mode suppresses the worst offenders, but residual variance remains. The variance calibration process measures this combined noise empirically and stores it as the reference for all confidence calculations.
The confidence framework follows GoS’s broader principle: if it can’t be measured, it shouldn’t be asserted. A 🔴 result is not a failure — it’s an honest signal that the measurement instrument isn’t sensitive enough for that task. Publishing it transparently is more useful than hiding it.
Calibration integrity
The variance calibration runner (/variance/run) executes 12 pairs of H.264 CPU + H.265 GPU encodes with 70 seconds between them, and computes three coefficients of variation: idle (raw P110 baseline readings, captures system noise), CPU (run-to-run reproducibility of the CPU encode ΔW), GPU (same for GPU). Their mean becomes variance_pct.
The runner is hardened against silent encode failures: every ffmpeg invocation’s exit code is checked, only successful encodes contribute ΔW, and per-side failure counters are tracked. If ≥50% of either side fails, the runner refuses to update settings — the result JSON is still returned (with cpu_failed, gpu_failed, failure_stderr, abort_reason fields) for forensics, but variance_pct stays unchanged on disk. This protects against the failure mode where partial-encode ΔW values contaminate the calibration without the operator noticing.
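The guard described above can be illustrated with a short sketch. It assumes each encode is recorded as an `(exit_code, delta_w)` pair and that the coefficient of variation uses the sample standard deviation; `calibrate` and `cv_pct` are hypothetical names, and the `failure_stderr` field is omitted for brevity.

```python
from statistics import fmean, stdev


def cv_pct(values):
    """Coefficient of variation as a percentage (sample standard deviation)."""
    return 100.0 * stdev(values) / fmean(values)


def calibrate(idle_w, cpu_runs, gpu_runs, current_variance_pct):
    """cpu_runs / gpu_runs are lists of (exit_code, delta_w) pairs."""
    # Only encodes that exited cleanly contribute a delta-W value.
    cpu_ok = [dw for code, dw in cpu_runs if code == 0]
    gpu_ok = [dw for code, dw in gpu_runs if code == 0]
    result = {
        "cpu_failed": len(cpu_runs) - len(cpu_ok),
        "gpu_failed": len(gpu_runs) - len(gpu_ok),
        "abort_reason": None,
        "variance_pct": current_variance_pct,  # unchanged unless calibration succeeds
    }

    # Refuse to update if half or more of either side failed.
    if (2 * result["cpu_failed"] >= len(cpu_runs)
            or 2 * result["gpu_failed"] >= len(gpu_runs)):
        result["abort_reason"] = "at least 50% of encodes failed on one side"
        return result
    if min(len(idle_w), len(cpu_ok), len(gpu_ok)) < 2:
        result["abort_reason"] = "too few samples to compute a CV"
        return result

    # Mean of the idle, CPU and GPU coefficients of variation.
    result["variance_pct"] = fmean([cv_pct(idle_w), cv_pct(cpu_ok), cv_pct(gpu_ok)])
    return result
```

Returning the full result dictionary even on abort mirrors the forensics behaviour described above: the caller can inspect the failure counts while the stored setting stays untouched.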
Diagnostics & Pre-calibration
Two layers of measurement-discipline tooling sit alongside the calibration:
Thermal-recovery probe
Before trusting a calibration result, the system needs to know that variance_cooldown_s is long enough — the idle samples taken between encodes must come from a thermally recovered system, not from the tail of the previous workload. The bin/probe-thermal-recovery diagnostic characterises this empirically. For a sequence of distances d after each of a CPU and a GPU encode (defaults: 0, 2, 5, 8, 12, 18, 25, 35, 50, 70, 95, 120 seconds), the probe samples idle power for 8 polls and writes the mean / std / CV to a CSV under results/diagnostics/.
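A rough sketch of the probe loop for one encode type is shown below. It assumes one encode is run per distance, which the description does not state explicitly, and it writes to any file-like object instead of a timestamped file. `probe_recovery`, `read_power_w`, `run_encode` and the CSV column names are hypothetical.

```python
import csv
import time
from statistics import fmean, stdev

DEFAULT_DISTANCES_S = (0, 2, 5, 8, 12, 18, 25, 35, 50, 70, 95, 120)


def probe_recovery(read_power_w, run_encode, out,
                   distances_s=DEFAULT_DISTANCES_S, polls=8,
                   interval_s=1.0, sleep=time.sleep):
    """Write one CSV row (distance, mean, std, CV) per recovery distance."""
    writer = csv.writer(out)
    writer.writerow(["distance_s", "mean_w", "std_w", "cv_pct"])
    for distance in distances_s:
        run_encode()        # ASSUMPTION: a fresh encode before each distance
        sleep(distance)     # wait `distance` seconds after the encode ends
        samples = []
        for _ in range(polls):
            samples.append(read_power_w())
            sleep(interval_s)
        mean_w = fmean(samples)
        std_w = stdev(samples)
        writer.writerow([distance, round(mean_w, 3), round(std_w, 3),
                         round(100.0 * std_w / mean_w, 3)])
```

Passing `sleep` as a parameter makes the loop testable without waiting out the real delays; a real run would leave the default in place.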
On the GoS1 hardware the recovery is fast (see chart above): post-CPU and post-GPU baselines converge to the settled idle floor by d = 5–8 s with within-window CV around 1–2.5%. So the configured cooldown of 70 seconds is comfortably more than necessary — useful as a margin, not as a correction.
The same curve is also on the Settings page (lab access), where it refreshes live from the probe endpoint. Each probe run overwrites nothing — it leaves a fresh timestamped CSV pair under results/diagnostics/ so historical curves can be diffed if hardware or thermal conditions change.
Why the probe matters
The probe was the seam that exposed the scale_vaapi leak (the GPU encode failed within 90 seconds of starting the diagnostic) and the silent-failure path in the calibration loop. Generalisable lesson: measurement code should fail loudly, not interpolate around brokenness. The probe predates being a first-class server feature, so its on-server execution is currently CLI-only; a queue-aware /precalibration/run endpoint with an in-page “Re-run” button is captured as a follow-up.
Hardware Disclosure
All results are tied to specific hardware. Different CPUs, GPUs, RAM configurations, and PSU efficiencies will produce different numbers. OWL results should always be cited with their hardware context.
| Component | Details |
|---|---|
| Server | GoS1 — custom build, Ubuntu 24, kernel 6.17 |
| CPU | AMD Ryzen 9 7900, 12 cores / 24 threads, 65 W TDP |
| GPU | NVIDIA GeForce RTX 5080, NVENC + CUDA |
| RAM | 61 GB DDR5 |
| Storage | 500 GB NVMe SSD (OS + working set) + 4 TB NVMe SSD (test media & result archive, mounted /srv/data) |
| Idle power | ~79 W at the wall (settled, display-blanked). The mid-2026 RTX 5080 swap raised idle by roughly 20 W over the prior AMD 7800 XT (~57–59 W); this is intrinsic to the larger card, not a fault. The 5080 idle is display-state-sensitive: a blanked desktop sits at ~79 W, an active (non-blanked) desktop at ~101 W. GoS1 blanks about 15 minutes after the last input, so the like-for-like figure is ~79 W. |
| Measurement | Tapo P110, 1-second polling via local API (tapo 0.8.12) |
| Video | ffmpeg current master build (/usr/local/bin/ffmpeg-master — ships the NVENC encoders + scale_cuda filter) — libx264, libx265, libsvtav1 (CPU); h264_nvenc, hevc_nvenc, av1_nvenc (GPU, full NVENC/CUDA pipeline) |
| LLM | Ollama 0.20.2 — ladder of TinyLlama 1.1B, Qwen3 1.7B/4B/8B, Mistral-NeMo 12B, Phi-4 14B, GPT-OSS 20B (CPU + CUDA GPU); Qwen3 4B is the canonical RAG model |
| Image | PyTorch + diffusers — SD-Turbo (~1B), SDXL-Turbo (~3.5B, GPU only); CPU + CUDA GPU |
Test Types
Video transcoding
Transcode a source file (default: Netflix Meridian 4K, CC BY 4.0) to a target codec and 1080p. Measures the energy cost of the full encode pipeline — decode, colour-space conversion, scale, encode. Supports CPU vs GPU comparison: both paths are run sequentially with a cooldown between them, and results are presented side by side.
Six presets across three codecs: H.264 (libx264 / h264_nvenc, 4000 kbps), H.265 (libx265 / hevc_nvenc, 2000 kbps), AV1 (libsvtav1 / av1_nvenc, 1500 kbps). A seventh Compare all codecs preset runs all six in sequence and produces a cross-codec energy matrix. (Encoder names track the installed GPU — the live list is in the Hardware Disclosure table above.)
All presets use ABR (Average Bit Rate) rate control at a shared per-codec bitrate target, so CPU and GPU receive the identical encoding task — output file sizes match across devices as confirmation. All GPU presets use the full hardware pipeline: hardware decode (-hwaccel cuda) + scale_cuda + hardware encode, with frames GPU-resident throughout. This represents real live-encoding workflows (Harmonic, Ateme); an earlier partial pipeline (CPU decode + GPU encode) has been replaced because it was unrepresentative and bottlenecked on CPU decode overhead.
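To make the CPU and GPU paths concrete, the sketch below builds an ffmpeg argument list for each. The encoder names, bitrates, `-hwaccel cuda` and `scale_cuda` come from the text above; `-hwaccel_output_format cuda` (which keeps decoded frames on the GPU), `-y`, and the function and table names are assumptions added for illustration. The commands OWL actually runs are the ones logged in each result JSON.

```python
# Per-codec presets as listed above: encoder per device and shared ABR target.
PRESETS = {
    "h264": {"cpu": "libx264", "gpu": "h264_nvenc", "kbps": 4000},
    "h265": {"cpu": "libx265", "gpu": "hevc_nvenc", "kbps": 2000},
    "av1": {"cpu": "libsvtav1", "gpu": "av1_nvenc", "kbps": 1500},
}


def build_ffmpeg_args(src, dst, codec, device, ffmpeg="ffmpeg"):
    """Return an argv list for a 1080p ABR transcode (illustrative only)."""
    preset = PRESETS[codec]
    bitrate = f"{preset['kbps']}k"
    if device == "gpu":
        return [
            ffmpeg, "-y",                       # -y: overwrite output (assumption)
            "-hwaccel", "cuda",                 # hardware decode
            "-hwaccel_output_format", "cuda",   # keep frames GPU-resident (assumption)
            "-i", src,
            "-vf", "scale_cuda=1920:1080",      # scale on the GPU
            "-c:v", preset["gpu"],
            "-b:v", bitrate,
            dst,
        ]
    if device == "cpu":
        return [
            ffmpeg, "-y",
            "-i", src,
            "-vf", "scale=1920:1080",           # software scale
            "-c:v", preset["cpu"],
            "-b:v", bitrate,
            dst,
        ]
    raise ValueError(f"unknown device: {device!r}")
```

Because both branches read the same bitrate from the preset table, the CPU and GPU encodes share one target by construction, which is the property the file-size check is meant to confirm.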
The ffmpeg command used for each run is logged in the result JSON, editable from the page (signed-in GoS members and lab access), and reproduced in the result card for full transparency.
Perceptual quality (VMAF). Comparison runs (CPU vs GPU, or all codecs) also report VMAF — Netflix’s perceptual quality metric (0–100, higher is better) — so the energy figures sit next to a quality figure rather than an unstated assumption that the encodes are equivalent. It is computed at the delivered 1080p, comparing each encoded output against the source downscaled to 1080p (the distorted side is cropped to strip hardware-encoder padding, never upscaled). VMAF runs after the measurement window closes, so its compute cost is excluded from the reported energy. It is a quality cross-check, not a primary GoS measurement.
AI workloads — beta, exploratory
Video transcoding is OWL’s core benchmark. Three AI workloads run alongside it on the same protocol and confidence framework, but they are explicitly beta — useful for relative comparisons, with headline numbers still being hardened (see Open Questions). In brief:
- LLM inference — mWh/token across a model ladder (TinyLlama 1.1B, Qwen3 1.7B/4B/8B, Mistral-NeMo 12B, Phi-4 14B, up to GPT-OSS 20B), cold or warm, CPU or GPU, with an optional batch mode. Prompts are saved in the result JSON; output streams word-by-word as live-run proof.
- Image generation — Wh/image for the SD-Turbo (~1B) and SDXL-Turbo (~3.5B) distilled diffusion models, CPU or GPU, with a Compare-Models mode that fixes prompt, seed and resolution so model size is the only variable.
- RAG — the energy delta of retrieval: baseline (no retrieval) vs RAG with 3 context chunks vs 8, retrieved from a document corpus via ChromaDB + sentence-transformer embeddings, compared side by side.
Framing (GoS Language Lab position paper, Jan 2026): AI in streaming is neither inherently sustainable nor unsustainable — type, size and deployment decide net impact. The type matters enormously: streaming leans on small specialised CNNs (per-title encoding, scene classification, super-resolution) that are orders of magnitude cheaper than the general-purpose LLMs and diffusion models these tabs measure as an upper bound. OWL measures the energy AI adds (inference only); it does not measure the infrastructure energy AI avoids through better compression, caching or routing — both halves are needed for net impact, and OWL has the first. Each AI result is also shown as a multiple of a real video encode (the pinned canonical H.265 GPU encode of Meridian-120s) so the number stays anchored to a streaming workload rather than floating free. Full framing: Language Lab AI position paper →.
Known Limitations
From energy to CO2e — for reference only
OWL is a power meter, not a carbon calculator. The number OWL produces and stands behind is energy — watts at the wall and watt-hours per task, measured directly by the P110. Everything else on this page is about getting that energy number right. We lead with power because it is what we can measure at the wall with no modelling assumptions; carbon is one modelling layer removed.
Every result also carries a gCO2e figure, but only as a downstream convenience: we multiply the measured energy by a grid carbon-intensity factor (Wh × gCO2e/kWh) so the energy can be read against everyday activities. That makes it a reference estimate, never a GoS measurement. Carbon attribution — allocation, boundaries, double-counting, marginal vs average intensity — is a hard problem that GoS deliberately leaves to the bodies whose job it is. This follows the GoS principle directly: “if it can’t be measured, it shouldn’t be asserted” — and what OWL measures directly is energy. Read the energy figure as the result; the CO2e is a footnote.
🟢 Direct = the energy figure (P110 polling at the wall, validated method, GoS primary measurement — this is what we cite). 🟡 Indicative = the gCO2e figure (Wh × third-party grid intensity — context, not citable as GoS data). Vocabulary follows the Greening of Streaming Language Lab AI position paper (Jan 2026), which proposes this 🟢/🟡/🔴 traffic-light for the entire ICT-energy-measurement landscape and rates IEA top-down energy figures as 🟡 Amber. OWL applies the same framework to its own outputs — every result-card carbon block carries the 🟡 chip; the energy headline retains the green palette.
For what it’s worth, the intensity used is lifecycle-basis (IPCC AR6 factors): the live French grid mix via Eco2mix when reachable, ElectricityMaps as a backup, and Ember annual country means as the fallback (also used for the stable comparison cities). The value and which source produced it are recorded in every result JSON and CSV export (CSV header carries a leading comment marking the carbon columns indicative). A result’s carbon dropdown also shows the same energy on a few past French grids for context. Module status — live cache, source, age, fallback — is at /carbon.
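The conversion and the source fallback can be sketched as follows. The fetchers are hypothetical placeholders (no real Eco2mix, ElectricityMaps or Ember client is shown), and the function names are illustrative; the only arithmetic is Wh divided by 1000, multiplied by gCO2e/kWh.

```python
def pick_intensity(sources):
    """Return (source_name, gCO2e_per_kWh) from the first source that answers.

    `sources` is an ordered list of (name, fetch) pairs; each hypothetical
    `fetch` returns a number, returns None, or raises when unreachable.
    """
    for name, fetch in sources:
        try:
            value = fetch()
        except Exception:  # broad on purpose: any failure means "try the next one"
            continue
        if value is not None:
            return name, value
    raise RuntimeError("no carbon-intensity source available")


def indicative_gco2e(energy_wh, intensity_g_per_kwh):
    """Indicative carbon figure: measured Wh converted to kWh, times grid intensity."""
    return energy_wh / 1000.0 * intensity_g_per_kwh
```

Returning the source name together with the value matches the practice described above of recording which source produced the intensity in every result.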
Open Questions
These are questions OWL has surfaced but not yet answered. They are published here in the interest of transparency.