OWL Measurement Methodology

How OWL measures the energy cost of compute tasks — and what it doesn’t measure.



Scope

OWL measures what happens inside one machine when it performs a real task. This is intentionally narrow. The energy cost of streaming is distributed across data centres, networks, and consumer devices — each with different measurement challenges and attribution problems. We start with the layer we can measure directly, at the wall, with no modelling assumptions.

This scoping decision means OWL results are not lifecycle assessments and should not be cited as total-cost-of-delivery figures. They answer a specific question: how much additional energy does this server draw to perform this task, above its idle baseline?

Measurement Principle

OWL uses wall-power delta measurement: the difference between what the server draws at idle and what it draws under load, captured by an external smart plug.

The plug measures the entire system — not a model, not a software estimate, not a per-component reading. If the CPU fan spins faster, the PSU runs less efficiently, or the GPU draws from the 12V rail, it’s all in the number.

This follows the GoS REM (Remote Energy Measurement) approach: real devices, real workloads, measured externally, at polling intervals short enough to capture the task’s energy profile.

Measurement Protocol

Every test in OWL — video, LLM, image generation, RAG — follows the same core protocol:

Focus mode. Suppress background system tasks (apt, cron, man-db, fwupd, etc.) that would introduce energy noise. Managed via systemctl stop with dedicated sudoers rules.
Model unload (LLM/RAG only). Send keep_alive=0 to Ollama and wait 3 seconds for GPU memory release. Ensures a cold start when cold-inference mode is selected.
Baseline capture. Poll the Tapo P110 at 1-second intervals for a period set in Settings (currently 5 polls). The mean of these readings becomes W_base, the server’s idle power draw.
Lock. Acquire /tmp/gos-measure.lock to prevent concurrent measurements from overlapping. A FIFO queue manages waiting jobs.
Execute task. Run the actual workload (ffmpeg, Ollama inference, SD-Turbo diffusion) while continuing to poll the P110 at 1-second intervals. Thermal sensors (CPU Tctl, GPU junction, GPU PPT) are read in parallel.
Compute energy. Calculate delta power, total energy, and per-unit metrics (see formulas below).
Persist. Write the full result to a JSON file — parameters, energy report, raw poll data, thermal readings, confidence flag. Every result is reproducible and exportable.
Focus exit. Restart suppressed system timers in parallel (via ThreadPoolExecutor) to minimise downtime.
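
A minimal sketch of how the focus-mode enter and exit steps could wrap a measurement. The unit names and the `sudo -n systemctl` invocation are assumptions for illustration; OWL's actual unit list and sudoers rules are not reproduced here. The `run` parameter is injected so the logic can be exercised without touching systemd.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

# Hypothetical unit list; the real set of suppressed units is not given in this page.
NOISY_UNITS = ("apt-daily.timer", "apt-daily-upgrade.timer",
               "man-db.timer", "fwupd-refresh.timer")


def systemctl(action: str, unit: str) -> None:
    """Run `systemctl <action> <unit>` non-interactively (assumes a sudoers rule)."""
    subprocess.run(["sudo", "-n", "systemctl", action, unit], check=True)


@contextmanager
def focus_mode(units: Sequence[str] = NOISY_UNITS,
               run: Callable[[str, str], None] = systemctl) -> Iterator[None]:
    """Stop noisy units, yield to the measurement, then restart them in parallel."""
    for unit in units:
        run("stop", unit)
    try:
        yield
    finally:
        # Restart in parallel so the units are down for as short a time as possible.
        with ThreadPoolExecutor(max_workers=max(1, len(units))) as pool:
            list(pool.map(lambda u: run("start", u), units))
```

Using a context manager means the units are restarted even if the measured task raises, which keeps a failed run from leaving the server in focus mode.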

Between sequential runs (e.g., CPU vs GPU comparison), a cooldown set in Settings (currently 10 seconds) allows the system to return to thermal equilibrium.
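
The baseline-capture and execute-task steps can be sketched as a single polling loop. This is an illustration under stated assumptions, not OWL's code: `read_watts` is a hypothetical callable standing in for whatever reads the plug, and the task is run in a thread so polling can continue alongside it.

```python
import statistics
import threading
import time
from typing import Callable, List, Tuple


def measure(read_watts: Callable[[], float],
            task: Callable[[], None],
            baseline_polls: int = 5,
            interval_s: float = 1.0) -> Tuple[float, List[float], float]:
    """Return (W_base, task-window polls, task duration in seconds)."""
    # Baseline capture: mean of a short idle window before the task starts.
    baseline = []
    for _ in range(baseline_polls):
        baseline.append(read_watts())
        time.sleep(interval_s)
    w_base = statistics.fmean(baseline)

    # Execute task: poll once per interval until the worker thread finishes.
    # Note: an exception inside `task` is not re-raised here; a real runner
    # would need to capture and report it.
    worker = threading.Thread(target=task)
    started = time.monotonic()
    worker.start()
    task_polls = []
    while True:
        task_polls.append(read_watts())
        worker.join(timeout=interval_s)  # waits up to one polling interval
        if not worker.is_alive():
            break
    duration_s = time.monotonic() - started
    return w_base, task_polls, duration_s
```

Keeping the raw `task_polls` list, rather than only its mean, is what later allows a per-run standard error to be computed from the same data.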

Energy Calculation

Delta power (average above idle) ΔW = mean(W_polls) − W_base

Total energy consumed by task ΔE = ΔW × (Δt / 3600) [Wh]

where Δt = task duration in seconds

Per-token energy (LLM / RAG) E_token = 1000 × ΔE / N_tokens [mWh/token, with ΔE in Wh]

Per-image energy (image generation) E_image = ΔE / N_images [Wh/image]

All formulas use wall-power from the P110 (system-level), not component-level readings. The GPU’s self-reported power (its vendor sensor — amdgpu PPT or nvidia-smi power draw) is captured for reference but is not used in the primary energy calculation — it covers only the GPU die/board, not the full system delta (CPU, RAM, drives, fans, PSU losses).
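
A worked example of the formulas above, written as a small Python sketch. The function names and the input numbers are illustrative assumptions, not measured OWL results; the factor of 1000 converts Wh to mWh for the per-token figure.

```python
from statistics import fmean
from typing import Sequence


def delta_power_w(task_polls_w: Sequence[float], w_base: float) -> float:
    """ΔW: mean task-window power minus the idle baseline, in watts."""
    return fmean(task_polls_w) - w_base


def energy_wh(delta_w: float, duration_s: float) -> float:
    """ΔE: delta power times duration, converted from watt-seconds to Wh."""
    return delta_w * (duration_s / 3600)


def per_token_mwh(delta_e_wh: float, n_tokens: int) -> float:
    """E_token in mWh/token (1 Wh = 1000 mWh)."""
    return 1000 * delta_e_wh / n_tokens


def per_image_wh(delta_e_wh: float, n_images: int) -> float:
    """E_image in Wh/image."""
    return delta_e_wh / n_images


# Illustrative numbers only: 60 W above an 80 W baseline for 120 s.
dw = delta_power_w([139.0, 140.0, 141.0], w_base=80.0)  # 60.0 W
de = energy_wh(dw, duration_s=120)                       # 2.0 Wh
print(dw, de, per_token_mwh(de, 4000), per_image_wh(de, 4))
```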

Confidence Framework

Every OWL result carries a traffic-light confidence flag. Under the CR-028 Phase 2 model (designed with Tania Pouli), the flag answers one defensible question per run: can this run be distinguished from idle? It is a per-run confidence interval, not a fixed-watt rule of thumb.

We keep the raw per-poll power samples from both the baseline window and the task window, form a standard error on the measured power increase ΔW, then convert ΔW into a one-sided confidence that the task really draws above idle:

Standard error, conservative (worst case of the calibrated and per-run estimates, plus drift):
SE_final = max(SE_calibrated, SE_per-run) + SE_drift
SE_calibrated = (variance_idle_pct/100 · W_base) × √(1/n_base + 1/n_task)
SE_per-run = √(σ²_base/n_base + σ²_task/n_task)
SE_drift = (variance_idle_drift_pct/100) · W_base

Confidence the task draws above idle:
confidence_positive = Φ(ΔW / SE_final)

Flag	Meaning	Criteria (defaults)
🟢	Repeatable — the task is almost certainly above idle, with enough samples to be reliable.	confidence_positive ≥ 95% and ≥ `9` task polls
🟡	Early insight — directional evidence; a longer run would strengthen it.	confidence_positive ≥ 80% and ≥ `4` task polls
🔴	Need more data — cannot yet be distinguished from idle.	below the yellow threshold
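
The standard-error formulas and the default thresholds in the table can be sketched in Python as follows. This is a minimal illustration, not the OWL implementation: it assumes Φ is the standard normal CDF, uses the sample variance (n − 1 denominator) for σ², and the function name and return strings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance
from typing import Sequence, Tuple


def confidence_flag(base_w: Sequence[float],
                    task_w: Sequence[float],
                    variance_idle_pct: float,
                    variance_idle_drift_pct: float) -> Tuple[str, float]:
    """Return (flag, confidence_positive) for one run.

    Both windows need at least two samples so a variance can be computed.
    """
    n_base, n_task = len(base_w), len(task_w)
    w_base = fmean(base_w)
    delta_w = fmean(task_w) - w_base

    se_calibrated = (variance_idle_pct / 100 * w_base) * sqrt(1 / n_base + 1 / n_task)
    se_per_run = sqrt(variance(base_w) / n_base + variance(task_w) / n_task)
    se_drift = variance_idle_drift_pct / 100 * w_base
    se_final = max(se_calibrated, se_per_run) + se_drift

    if se_final > 0:
        confidence = NormalDist().cdf(delta_w / se_final)
    else:
        # Degenerate case (assumption): no noise estimate at all.
        confidence = 1.0 if delta_w > 0 else 0.0

    # Default cut-points from the table: 95% with >= 9 polls, 80% with >= 4.
    if confidence >= 0.95 and n_task >= 9:
        return "green", confidence
    if confidence >= 0.80 and n_task >= 4:
        return "yellow", confidence
    return "red", confidence
```

Note how a short run with a large, clean delta still lands on yellow rather than green: the minimum poll count is checked independently of the confidence value.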

Why a confidence interval, not a fixed-watt rule? The flag uses this run’s own observed noise (SE_per-run), takes the worst case against a calibrated idle floor (SE_calibrated), and adds a drift term for the time gap between the baseline and task windows — so it reflects real signal quality on the day, not an assumed noise floor. The minimum task-sample counts remain because 1 s power samples are autocorrelated: a very short task should not turn green on one or two lucky readings.

Inputs (CR-028 Phase 2, “option C”). The single-run flag uses only variance_idle_pct as the calibrated idle noise floor. The per-codec calibration CVs (variance_cpu_pct / variance_gpu_pct) are run-to-run repeatability measures, reserved for a future aggregate-confidence layer rather than mixed into the single-run formula. The first pass uses raw sample counts and a 1.96 (95%) critical value; an autocorrelation correction (effective sample count) and a Student-t critical value are documented future refinements.

Legacy results. Results saved before raw per-poll samples were persisted fall back to the earlier variance-threshold flag (ΔW against a multiple of variance_pct × W_base), so historical runs keep their badge.

P110 and total system noise: The Tapo P110 smart plug exposes power readings at 1 W resolution via its local API (the path OWL currently uses, chosen for portability and the Python tapo library’s reliability). The underlying instrument is more precise — ~1 mW resolution via direct device read — so future versions could lower the hardware noise floor by ~3 orders of magnitude if needed. In practice, however, the dominant noise sources are OS background processes (apt, cron, systemd timers) and thermal drift between runs, not hardware quantisation. Focus mode suppresses the worst offenders, but residual variance remains. The variance calibration process measures this combined noise empirically and stores it as the reference for all confidence calculations.

The confidence framework follows GoS’s broader principle: if it can’t be measured, it shouldn’t be asserted. A 🔴 result is not a failure — it’s an honest signal that the measurement instrument isn’t sensitive enough for that task. Publishing it transparently is more useful than hiding it.

Calibration integrity

The variance calibration runner (/variance/run) executes 12 pairs of H.264 CPU + H.265 GPU encodes with 70 seconds between them, and computes three coefficients of variation: idle (raw P110 baseline readings, captures system noise), CPU (run-to-run reproducibility of the CPU encode ΔW), GPU (same for GPU). Their mean becomes variance_pct.

The runner is hardened against silent encode failures: every ffmpeg invocation’s exit code is checked, only successful encodes contribute ΔW, and per-side failure counters are tracked. If ≥50% of either side fails, the runner refuses to update settings — the result JSON is still returned (with cpu_failed, gpu_failed, failure_stderr, abort_reason fields) for forensics, but variance_pct stays unchanged on disk. This protects against the failure mode where partial-encode ΔW values contaminate the calibration without the operator noticing.
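
A sketch of the refuse-to-update guard described above, assuming the coefficient of variation is the sample standard deviation over the mean, expressed as a percentage. The function signature is hypothetical; the `cpu_failed`, `gpu_failed` and `abort_reason` keys mirror the result fields named in the text.

```python
from statistics import fmean, stdev
from typing import Dict, Optional, Sequence, Union

Result = Dict[str, Union[int, float, str, None]]


def cv_pct(values: Sequence[float]) -> float:
    """Coefficient of variation in percent (needs at least two values)."""
    return 100 * stdev(values) / fmean(values)


def calibrate(idle_w: Sequence[float],
              cpu_delta_w: Sequence[float],
              gpu_delta_w: Sequence[float],
              cpu_failed: int,
              gpu_failed: int,
              n_pairs: int = 12) -> Result:
    """Compute variance_pct, or abort if either side failed too often.

    Only ΔW values from successful encodes should be passed in.
    """
    result: Result = {"cpu_failed": cpu_failed, "gpu_failed": gpu_failed,
                      "abort_reason": None, "variance_pct": None}
    if cpu_failed / n_pairs >= 0.5 or gpu_failed / n_pairs >= 0.5:
        # Still return the result for forensics; the caller must not persist it.
        result["abort_reason"] = "at least 50% of encodes failed on one side"
        return result
    cvs = [cv_pct(idle_w), cv_pct(cpu_delta_w), cv_pct(gpu_delta_w)]
    result["variance_pct"] = fmean(cvs)
    return result
```

A caller would write `variance_pct` to settings only when `abort_reason` is `None`, so a failed calibration leaves the stored value untouched.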

Diagnostics & Pre-calibration

Two layers of measurement-discipline tooling sit alongside the calibration:

Thermal-recovery probe

Before trusting a calibration result, the system needs to know that variance_cooldown_s is long enough — the idle samples taken between encodes must come from a thermally recovered system, not from the tail of the previous workload. The bin/probe-thermal-recovery diagnostic characterises this empirically. For a sequence of distances d after each of a CPU and a GPU encode (defaults: 0, 2, 5, 8, 12, 18, 25, 35, 50, 70, 95, 120 seconds), the probe samples idle power for 8 polls and writes the mean / std / CV to a CSV under results/diagnostics/.

Recovery curve from the latest probe run.

On the GoS1 hardware the recovery is fast (see chart above): post-CPU and post-GPU baselines converge to the settled idle floor by d = 5–8 s, with a within-window CV of around 1–2.5%. The configured cooldown of 70 seconds is therefore comfortably more than necessary: useful as a margin, not as a correction.

The same curve is also on the Settings page (lab access), where it refreshes live from the probe endpoint. Each probe run overwrites nothing — it leaves a fresh timestamped CSV pair under results/diagnostics/ so historical curves can be diffed if hardware or thermal conditions change.
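
The per-distance summary the probe writes (mean / std / CV for each idle window) could be computed along these lines. The column names and the in-memory CSV target are assumptions for illustration; the actual CSV layout under results/diagnostics/ is not specified here.

```python
import csv
import io
from statistics import fmean, stdev
from typing import Dict, Iterable, Sequence, TextIO, Union

Row = Dict[str, Union[str, float]]
FIELDS = ["after", "distance_s", "mean_w", "std_w", "cv_pct"]  # hypothetical columns


def summarise_window(after: str, distance_s: float, samples_w: Sequence[float]) -> Row:
    """Summarise one idle window taken `distance_s` seconds after an encode."""
    mean = fmean(samples_w)
    std = stdev(samples_w)
    return {"after": after, "distance_s": distance_s,
            "mean_w": round(mean, 2), "std_w": round(std, 2),
            "cv_pct": round(100 * std / mean, 2)}


def write_csv(rows: Iterable[Row], fh: TextIO) -> None:
    """Write the summary rows with a header line."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Illustrative 8-poll window, 5 s after a CPU encode.
row = summarise_window("cpu", 5, [79, 80, 81, 80, 79, 80, 81, 80])
buffer = io.StringIO()
write_csv([row], buffer)
print(buffer.getvalue())
```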

Why the probe matters

The probe was the seam that exposed the scale_vaapi leak (the GPU encode failed within 90 seconds of starting the diagnostic) and the silent-failure path in the calibration loop. The generalisable lesson: measurement code should fail loudly, not interpolate around brokenness. The probe has not yet been promoted to a first-class server feature, so on the server it can currently be run from the CLI only; a queue-aware /precalibration/run endpoint with an in-page “Re-run” button is captured as a follow-up.

Hardware Disclosure

All results are tied to specific hardware. Different CPUs, GPUs, RAM configurations, and PSU efficiencies will produce different numbers. OWL results should always be cited with their hardware context.

Server	GoS1 — custom build, Ubuntu 24, kernel 6.17
CPU	AMD Ryzen 9 7900, 12 cores / 24 threads, 65W TDP
GPU	NVIDIA GeForce RTX 5080, NVENC + CUDA
RAM	61 GB DDR5
Storage	500 GB NVMe SSD (OS + working set) + 4 TB NVMe SSD (test media & result archive, mounted `/srv/data`)
Idle power	~79W at the wall (settled, display-blanked). The mid-2026 RTX 5080 swap raised idle by ~20W over the prior AMD Radeon RX 7800 XT (~57–59W); this is intrinsic to the larger card, not a fault. The 5080 idle is display-state-sensitive: a blanked desktop sits at ~79W and an active (non-blanked) desktop at ~101W. GoS1 blanks ~15 min after the last input, so the like-for-like figure is ~79W
Measurement	Tapo P110, 1-second polling via local API (tapo 0.8.12)
Video	ffmpeg current master build (`/usr/local/bin/ffmpeg-master` — ships the NVENC encoders + `scale_cuda` filter) — libx264, libx265, libsvtav1 (CPU); h264_nvenc, hevc_nvenc, av1_nvenc (GPU, full NVENC/CUDA pipeline)
LLM	Ollama 0.20.2 — ladder of TinyLlama 1.1B, Qwen3 1.7B/4B/8B, Mistral-NeMo 12B, Phi-4 14B, GPT-OSS 20B (CPU + CUDA GPU); Qwen3 4B is the canonical RAG model
Image	PyTorch + diffusers — SD-Turbo (~1B), SDXL-Turbo (~3.5B, GPU only); CPU + CUDA GPU

Hardware change: GPU swap (mid-2026). GoS1’s AMD Radeon RX 7800 XT (VAAPI + ROCm) was replaced with an NVIDIA RTX 5080 (NVENC + CUDA). OWL’s vendor-abstraction layer auto-detected the new card with no code change, and results are stamped with the GPU they ran on. The motivation was tooling reach (CUDA-only partner workloads), not energy, and the swap has a real methodology consequence worth stating plainly: idle power rose by ~20W at the wall (~57–59W → ~79W), intrinsic to the larger card. Per encode, NVENC is more efficient than VAAPI at matched bitrate (measured n=10: H.264 −42%, H.265 −22%, AV1 −25% energy), but the higher idle floor means the swap yields a net energy saving only for H.264-heavy, near-saturated duty cycles; for H.265 the idle penalty is never repaid by transcode alone. We therefore treat the 5080 as a capability / quality / speed upgrade, not a same-workload energy win. The frozen pre-swap AMD baseline is preserved for comparison.

Test Types

Video transcoding

Transcode a source file (default: Netflix Meridian 4K, CC BY 4.0) to a target codec and 1080p. Measures the energy cost of the full encode pipeline — decode, colour-space conversion, scale, encode. Supports CPU vs GPU comparison: both paths are run sequentially with a cooldown between them, and results are presented side by side.

Six presets across three codecs: H.264 (libx264 / h264_nvenc, 4000 kbps), H.265 (libx265 / hevc_nvenc, 2000 kbps), AV1 (libsvtav1 / av1_nvenc, 1500 kbps). A seventh Compare all codecs preset runs all six in sequence and produces a cross-codec energy matrix. (Encoder names track the installed GPU — the live list is in the Hardware Disclosure table above.)

All presets use ABR (Average Bit Rate) rate control at a shared per-codec bitrate target, so CPU and GPU receive the identical encoding task — output file sizes match across devices as confirmation. All GPU presets use the full hardware pipeline: hardware decode (-hwaccel cuda) + scale_cuda + hardware encode, with frames GPU-resident throughout. This represents real live-encoding workflows (Harmonic, Ateme); an earlier partial pipeline (CPU decode + GPU encode) has been replaced because it was unrepresentative and bottlenecked on CPU decode overhead.
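
To make the two pipelines concrete, here is a sketch that builds an ffmpeg argument list for a CPU or GPU preset. The encoder names, bitrates and binary path come from this page; the remaining arguments (`-hwaccel_output_format cuda` to keep frames on the GPU, `-an` to drop audio, `-y`) are assumptions, and the authoritative command for any run is the one logged in its result JSON.

```python
from typing import Dict, List, Tuple

# codec -> (CPU encoder, GPU encoder, ABR target in kbps), as listed in the presets.
PRESETS: Dict[str, Tuple[str, str, int]] = {
    "h264": ("libx264", "h264_nvenc", 4000),
    "h265": ("libx265", "hevc_nvenc", 2000),
    "av1": ("libsvtav1", "av1_nvenc", 1500),
}


def build_ffmpeg_cmd(src: str, dst: str, codec: str, device: str,
                     ffmpeg: str = "/usr/local/bin/ffmpeg-master") -> List[str]:
    """Build a 1080p ABR transcode command for device 'cpu' or 'gpu'."""
    cpu_encoder, gpu_encoder, kbps = PRESETS[codec]
    if device == "gpu":
        # Full hardware pipeline: decode, scale and encode on the GPU.
        decode_args = ["-hwaccel", "cuda", "-hwaccel_output_format", "cuda"]
        scale_filter = "scale_cuda=1920:1080"
        encoder = gpu_encoder
    elif device == "cpu":
        decode_args = []
        scale_filter = "scale=1920:1080"
        encoder = cpu_encoder
    else:
        raise ValueError(f"unknown device: {device!r}")
    return [ffmpeg, "-y", *decode_args, "-i", src,
            "-vf", scale_filter, "-c:v", encoder,
            "-b:v", f"{kbps}k", "-an", dst]


print(" ".join(build_ffmpeg_cmd("meridian.mp4", "out.mp4", "h265", "gpu")))
```

Because both devices share the same `-b:v` target per codec, the only differences between the CPU and GPU commands are the decode path, the scale filter and the encoder name.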

The ffmpeg command used for each run is logged in the result JSON, editable from the page (signed-in GoS members and lab access), and reproduced in the result card for full transparency.

Perceptual quality (VMAF). Comparison runs (CPU vs GPU, or all codecs) also report VMAF — Netflix’s perceptual quality metric (0–100, higher is better) — so the energy figures sit next to a quality figure rather than an unstated assumption that the encodes are equivalent. It is computed at the delivered 1080p, comparing each encoded output against the source downscaled to 1080p (the distorted side is cropped to strip hardware-encoder padding, never upscaled). VMAF runs after the measurement window closes, so its compute cost is excluded from the reported energy. It is a quality cross-check, not a primary GoS measurement.

Open item (narrower than before): With ABR, the bitrate target is now equal across devices. GOP structure and profile level are not yet explicitly controlled and may differ between CPU and GPU encoder defaults — a working session with the measurement team is planned to confirm apples-to-apples output at the profile/GOP level. A second benchmark family at each codec’s natural operating point (CRF for CPU, QP for GPU) is also on the roadmap.

AI workloads — beta, exploratory

Video transcoding is OWL’s core benchmark. Three AI workloads run alongside it on the same protocol and confidence framework, but they are explicitly beta — useful for relative comparisons, with headline numbers still being hardened (see Open Questions). In brief:

LLM inference — mWh/token across a model ladder (TinyLlama 1.1B, Qwen3 1.7B/4B/8B, Mistral-NeMo 12B, Phi-4 14B, up to GPT-OSS 20B), cold or warm, CPU or GPU, with an optional batch mode. Prompts are saved in the result JSON; output streams word-by-word as live-run proof.
Image generation — Wh/image for the SD-Turbo (~1B) and SDXL-Turbo (~3.5B) distilled diffusion models, CPU or GPU, with a Compare-Models mode that fixes prompt, seed and resolution so model size is the only variable.
RAG — the energy delta of retrieval: baseline (no retrieval) vs RAG with 3 context chunks vs 8, retrieved from a document corpus via ChromaDB + sentence-transformer embeddings, compared side by side.

Framing (GoS Language Lab position paper, Jan 2026): AI in streaming is neither inherently sustainable nor unsustainable — type, size and deployment decide net impact. The type matters enormously: streaming leans on small specialised CNNs (per-title encoding, scene classification, super-resolution) that are orders of magnitude cheaper than the general-purpose LLMs and diffusion models these tabs measure as an upper bound. OWL measures the energy AI adds (inference only); it does not measure the infrastructure energy AI avoids through better compression, caching or routing — both halves are needed for net impact, and OWL has the first. Each AI result is also shown as a multiple of a real video encode (the pinned canonical H.265 GPU encode of Meridian-120s) so the number stays anchored to a streaming workload rather than floating free. Full framing: Language Lab AI position paper →.

Known Limitations

P110 temporal resolution. 1-second polling means tasks shorter than ~5 seconds produce few data points. Very fast models (e.g., TinyLlama single inference at 1–4 seconds) are at the edge of measurability. Batching mitigates this but changes what’s being measured (batch cost, not single-inference cost). The same constraint puts a floor on any artificially shortened encode: a workload that finishes in 3–4 seconds yields only 3–4 P110 polls, and the resulting per-run ΔW mean becomes noisy enough to inflate the coefficient of variation independently of any real measurement issue.

P110 power resolution. The Tapo P110 instrument itself reports at 1 mW resolution via direct device read, but the local API that this deployment polls exposes only 1 W. The effective ~±1 W noise floor is therefore an API limit, not a hardware limit: low-delta tasks (e.g., idle audio processing, lightweight network operations) cannot be reliably measured against it. A future direct-device path would unlock ~1000× finer resolution from the same plug.

Single server. All results are from one machine. Generalisability to other hardware configurations is unknown without cross-platform measurement.

Baseline drift. The server’s idle power drifts with thermal state, background processes, and (since the RTX 5080 swap) GPU display power state: a blanked vs active desktop alone moves the wall figure by ~20W (~79W → ~101W). The per-run baseline capture (re-measured immediately before each task) mitigates this, but it introduces variance between runs taken at different times.

PSU efficiency curve. Wall power includes PSU conversion losses, which are non-linear (PSUs are less efficient at low and very high loads). Two tasks that consume the same internal power may report different wall-power deltas depending on where they sit on the PSU efficiency curve.

From energy to CO₂e — for reference only

OWL is a power meter, not a carbon calculator. The number OWL produces and stands behind is energy — watts at the wall and watt-hours per task, measured directly by the P110. Everything else on this page is about getting that energy number right. We lead with power because it is what we can measure at the wall with no modelling assumptions; carbon is one modelling layer removed.

Every result also carries a gCO₂e figure, but only as a downstream convenience: we multiply the measured energy by a grid carbon-intensity factor (energy in kWh × intensity in gCO₂e/kWh) so the energy can be read against everyday activities. That makes it a reference estimate, never a GoS measurement. Carbon attribution (allocation, boundaries, double-counting, marginal vs average intensity) is a hard problem that GoS deliberately leaves to the bodies whose job it is. This follows the GoS principle directly: “if it can’t be measured, it shouldn’t be asserted”, and what OWL measures directly is energy. Read the energy figure as the result; the CO₂e is a footnote.
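
The conversion is a single multiplication once the energy is expressed in kWh. A small sketch, with an illustrative intensity value that is an assumption and not a figure from any of OWL's sources:

```python
def indicative_gco2e(energy_wh: float, intensity_g_per_kwh: float) -> float:
    """Indicative gCO2e: measured energy (Wh -> kWh) times grid intensity."""
    return (energy_wh / 1000) * intensity_g_per_kwh


# Illustrative only: a 2 Wh task on a hypothetical 50 gCO2e/kWh grid.
print(indicative_gco2e(2.0, 50.0))
```

The energy input is the measured quantity; the intensity is third-party data, which is why the product is labelled indicative rather than measured.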

🟢 Direct = the energy figure (P110 polling at the wall, validated method, GoS primary measurement — this is what we cite). 🟡 Indicative = the gCO₂e figure (Wh × third-party grid intensity — context, not citable as GoS data). Vocabulary follows the Greening of Streaming Language Lab AI position paper (Jan 2026), which proposes this 🟢/🟡/🔴 traffic-light for the entire ICT-energy-measurement landscape and rates IEA top-down energy figures as 🟡 Amber. OWL applies the same framework to its own outputs — every result-card carbon block carries the 🟡 chip; the energy headline retains the green palette.

For what it’s worth, the intensity used is lifecycle-basis (IPCC AR6 factors): the live French grid mix via Eco2mix when reachable, ElectricityMaps as a backup, and Ember annual country means as the fallback (also used for the stable comparison cities). The value and which source produced it are recorded in every result JSON and CSV export (CSV header carries a leading comment marking the carbon columns indicative). A result’s carbon dropdown also shows the same energy on a few past French grids for context. Module status — live cache, source, age, fallback — is at /carbon.

Open Questions

These are questions OWL has surfaced but not yet answered. They are published here in the interest of transparency.

Confidence thresholds. The live flag is the CR-028 Phase 2 confidence interval described above; its positive-confidence cut-points (95% / 80%) and minimum poll counts are still set by judgement, and the first pass uses a 1.96 critical value with raw sample counts. A working session with the measurement team is planned to ground these, and to add the autocorrelation (effective-n) and Student-t refinements, against repeated calibration runs across workloads and thermal states. (The legacy 5× / 2× variance multipliers now apply only to pre-CI historical results.)

Transcoding profile/GOP equivalence. ABR rate control now gives CPU and GPU the same bitrate target, and output file sizes match as confirmation. GOP structure and profile level are still default-per-encoder and have not been explicitly normalised. A working session is planned to confirm apples-to-apples at that level, and to add a second benchmark family at each codec’s natural operating point (CRF for CPU, QP for GPU).

AI-workload questions (beta). LLM: does mWh/token drift across a batch (thermal saturation, memory pressure)? Image / RAG: how much of each energy delta is fixed overhead (model load, embedding lookup) vs. work that scales with output or context length? Secondary to the video benchmark; not yet investigated in depth.

Cross-platform comparability. How should results from different hardware be compared? Normalisation by TDP? By performance tier? By workload-equivalent output quality?
