Greening of Streaming · Live energy measurement · GoS1
OWL measures the real energy cost of video transcoding and AI inference — using a calibrated smart plug, not estimates. Every number on this page comes from a live measurement on GoS1, a server in our lab in France.
GoS1 is an AMD Ryzen 9 workstation with an RX 7800 XT GPU. Power is sampled at 1-second intervals via a Tapo P110 smart plug connected to the mains supply. We measure the delta between idle baseline and task power — not estimated TDP or nameplate figures.
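As a minimal sketch of the delta described above, the snippet below subtracts the mean idle reading from the mean task reading. The function name and the sample values are hypothetical and only illustrate the arithmetic; they are not GoS1 measurements or OWL's actual code.

```python
from statistics import mean


def delta_watts(baseline_w, task_w):
    """Mean task power minus mean idle power, in watts."""
    if not baseline_w or not task_w:
        raise ValueError("need at least one baseline and one task sample")
    return mean(task_w) - mean(baseline_w)


# Hypothetical 1-second plug readings, for illustration only.
idle_samples = [60.0, 61.0, 59.0, 60.0, 60.0]
task_samples = [140.0, 150.0, 145.0, 145.0]

print(delta_watts(idle_samples, task_samples))
```

Working from the delta rather than the absolute wall reading keeps the machine's idle draw out of the task's energy figure.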
Scope: device layer only. Network, CDN, and CPE are explicitly excluded. Amortised embodied carbon and training cost are not included in LLM measurements.
Streaming accounts for a significant and growing share of global internet traffic. Codec choice, inference model size, and hardware path all affect real energy use — but most published figures are estimates or averages. OWL produces primary measurement data that operators and researchers can reproduce and cite.
→ Read the full measurement methodology: protocol, confidence framework, scope statements, calibration.
Whether transcoding to the same quality target uses more energy on CPU or GPU — and whether the faster path is also the more efficient one.
Encoding a 4K clip (Meridian, Netflix Open Content, CC BY 4.0) to 1080p H.264 — once in software (libx264, CPU only) and once as a full GPU pipeline (hardware decode + encode via h264_vaapi). Same source. Same quality target. P110 sampled every second throughout.
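The two encode paths could be expressed roughly as the ffmpeg argument lists below. This is a sketch under assumptions: the page does not state the exact quality settings, so the `crf` and `qp` values, the render-node path, the dropped audio track (`-an`), and the function names are all placeholders, and a CRF value and a QP value are not automatically the same quality target.

```python
def cpu_encode_cmd(src, dst, crf=23):
    """Software path: CPU decode, CPU scale, libx264 encode."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", "scale=1920:1080",          # 4K down to 1080p on the CPU
        "-c:v", "libx264", "-crf", str(crf),
        "-an", dst,                        # assumption: audio dropped
    ]


def gpu_encode_cmd(src, dst, qp=23, device="/dev/dri/renderD128"):
    """Full GPU path: VAAPI decode, VAAPI scale, h264_vaapi encode."""
    return [
        "ffmpeg", "-y",
        "-hwaccel", "vaapi",
        "-hwaccel_device", device,         # assumption: default render node
        "-hwaccel_output_format", "vaapi", # keep decoded frames on the GPU
        "-i", src,
        "-vf", "scale_vaapi=w=1920:h=1080",
        "-c:v", "h264_vaapi", "-qp", str(qp),
        "-an", dst,
    ]


print(" ".join(cpu_encode_cmd("meridian_4k.mp4", "cpu_1080p.mp4")))
print(" ".join(gpu_encode_cmd("meridian_4k.mp4", "gpu_1080p.mp4")))
```

Keeping frames in GPU memory between decode, scale, and encode is what distinguishes a full pipeline from one that only offloads the encode step.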
5 s idle baseline before each run. 10 s thermal cooldown between the CPU and GPU runs. Energy (Wh) = ΔW (W) × duration (s) / 3600. Confidence 🟢 = ΔW > 5× noise and ≥ 9 polls.
Source: 812 MB, 4K. Expected encode time: ~2–3 min on CPU, ~90 s on GPU (full pipeline). Previous runs (partial pipeline): CPU 174 s / 4.06 Wh · GPU 114 s / 4.42 Wh. Full-pipeline results are pending a first run.
Scope: device layer only (GoS1). Network, CDN, and CPE not included. A faster encode does not automatically mean less energy — this measures total Wh, not rate.
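The energy formula and the green rule quoted for this test can be sketched as follows. The function names are illustrative, and the noise value passed to `is_green` is a hypothetical placeholder; only the 174 s / 4.06 Wh and 114 s / 4.42 Wh figures come from the earlier partial-pipeline runs.

```python
def energy_wh(delta_w, duration_s):
    """Energy above idle in watt-hours: watts x seconds / 3600."""
    return delta_w * duration_s / 3600.0


def mean_delta_w(energy_wh_value, duration_s):
    """Invert the formula: average watts above idle for a run."""
    return energy_wh_value * 3600.0 / duration_s


def is_green(delta_w, noise_w, polls):
    """Rule quoted above: delta exceeds 5x noise and at least 9 polls."""
    return delta_w > 5 * noise_w and polls >= 9


# Earlier partial-pipeline runs quoted above.
cpu_delta = mean_delta_w(4.06, 174)
gpu_delta = mean_delta_w(4.42, 114)
print(round(cpu_delta, 1), round(gpu_delta, 1))

# noise_w=2.0 is a hypothetical noise level, not a calibrated GoS1 value.
print(is_green(cpu_delta, noise_w=2.0, polls=174))
```

In those earlier runs the GPU finished about a minute sooner but averaged roughly 140 W above idle against roughly 84 W for the CPU, which is why its total came out higher despite the shorter duration.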
How much energy each generated token costs — and how model size translates into energy use per unit of output.
Running a fixed prompt (T3 Long — network energy attribution briefing) through Mistral 7B cold: model unloaded before baseline so we capture the true first-request cost. GPU inference via Ollama ROCm.
Model unloaded from VRAM. 3s settle. 10s idle baseline. Single inference run. P110 at 1s intervals. Primary metric: mWh per output token.
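A cold run of this kind might be driven through Ollama's HTTP API roughly as sketched below. The helpers only build request bodies and parse a response, so nothing here contacts a server. It assumes the `/api/generate` endpoint with its `keep_alive`, `eval_count`, and `eval_duration` (nanoseconds) fields; the model tag, the helper names, and the sample response are hypothetical.

```python
def unload_payload(model):
    """Request body asking Ollama to evict the model (keep_alive = 0)."""
    return {"model": model, "keep_alive": 0}


def generate_payload(model, prompt):
    """Request body for a single non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}


def tokens_and_rate(response):
    """Output token count and tokens per second from a generate response."""
    tokens = response["eval_count"]
    seconds = response["eval_duration"] / 1e9  # reported in nanoseconds
    return tokens, tokens / seconds


# Hypothetical response fields, for illustration only.
sample_response = {"eval_count": 470, "eval_duration": 10_000_000_000}
print(unload_payload("mistral"))
print(tokens_and_rate(sample_response))
```

Sending the unload body before the idle baseline means the model's load from disk into VRAM falls inside the measured window, which is the first-request cost the test is after.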
Model: Mistral 7B (4.4 GB). Previous result: 0.94 mWh/tok, ~47 tok/s.
Token count varies between models and prompts, so raw Wh figures aren't comparable. Energy per token lets us place TinyLlama (0.06 mWh/tok) and Mistral 7B (0.94 mWh/tok) on the same axis: a difference of roughly 16×.
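The per-token arithmetic is small enough to sketch directly. The function names are illustrative; the 0.94 and 0.06 mWh/tok figures and the ~47 tok/s rate are the previous results quoted on this page.

```python
def mwh_per_token(energy_wh, output_tokens):
    """Inference energy in milliwatt-hours per generated token."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return energy_wh * 1000.0 / output_tokens


def implied_delta_w(mwh_per_tok, tok_per_s):
    """Average watts above idle implied by a per-token cost and a token rate.

    1 mWh = 3.6 J, so mWh/tok x tok/s x 3.6 gives joules per second (W).
    """
    return mwh_per_tok * tok_per_s * 3.6


# Ratio between the two previous results quoted above.
print(round(0.94 / 0.06, 1))

# Power above idle implied by Mistral 7B's previous result.
print(round(implied_delta_w(0.94, 47)))
```

The second figure is a useful sanity check: 0.94 mWh/tok at about 47 tok/s implies roughly 159 W above idle during generation.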
Scope: device layer only (GoS1). No amortised training cost included. mWh/token measures inference energy only — not the energy cost of training the model.
How much energy one AI-generated image costs — measured end to end on real hardware, not estimated from TDP or cloud benchmarks.
Running SD-Turbo (stabilityai/sd-turbo, CPU, 8 steps, 512×512) with a randomly modified prompt — the colour modifier changes each run to prove the image is generated live, not replayed from cache.
10s idle baseline. CPU diffusion run. P110 at 1s intervals. Metric: Wh per image = ΔW × generation_time / 3600.
Previous result: 0.21 Wh/image, 12s, ~30W delta above idle.
Scope: device layer only (GoS1). Network and storage excluded. This measures one image on one machine — not the energy cost of a hosted API call.
Whether retrieval-augmented generation (RAG), which searches a local corpus before answering, costs meaningfully more energy than plain inference, and how much it increases the context the model must process.
Running three modes back-to-back on Mistral 7B: baseline (no retrieval), RAG (small corpus), and RAG Large (with re-ranking). Same question, same model, same hardware — only the retrieval pipeline changes.
Each mode: 10s idle baseline, inference with P110 at 1s intervals. Metric: mWh per output token. ChromaDB embeddings via sentence-transformers. Corpus: academic papers on streaming energy.
Scope: device layer only (GoS1). Network excluded. RAG retrieval adds overhead but the dominant cost remains token generation.
Not every measurement we take is equally trustworthy. System noise — P110 quantisation, OS jitter, Wi-Fi polling variance — is real. A task that adds a small delta above baseline might be signal or artefact. We need a principled way to say which.
Every result carries a traffic light. As of CR-028 Phase 2 it is a per-run confidence interval ("can this run be told apart from idle?"), not a fixed watt rule.
confidence = Φ(ΔW / SE), where Φ is the standard normal CDF and SE comes from this run's noise and the calibrated idle floor
Fixed thresholds (e.g. "5W = green") don't adapt to the machine's actual noise level. Instead we take this run's own baseline + task power samples, form a standard error on ΔW (worst case of the run's observed noise and the calibrated idle floor, plus a drift term), and turn ΔW into a one-sided confidence that the task draws above idle. A short run can't go green on a couple of lucky readings — it also needs enough task polls.
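One way to read that description is sketched below. It is an interpretation, not OWL's actual implementation: how the noise terms are combined, the additive drift term, the 0.95 and 0.80 cut-offs, and the nine-poll minimum (borrowed from the transcoding rule) are all assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

MIN_TASK_POLLS = 9  # assumption, borrowed from the transcoding rule


def run_confidence(baseline_w, task_w, idle_floor_w, drift_w=0.0):
    """One-sided confidence that the task draws more power than idle."""
    delta = mean(task_w) - mean(baseline_w)
    # Worst case of this run's observed noise and the calibrated idle floor.
    run_noise = max(stdev(baseline_w), stdev(task_w))
    sigma = max(run_noise, idle_floor_w)
    # Standard error of a difference of two means, plus a drift allowance.
    se = sigma * sqrt(1 / len(baseline_w) + 1 / len(task_w)) + drift_w
    if se == 0:
        raise ValueError("standard error is zero; supply a non-zero idle floor")
    return NormalDist().cdf(delta / se)


def traffic_light(confidence, task_polls, green=0.95, yellow=0.80):
    """Map a confidence value to a badge; thresholds are illustrative."""
    if confidence >= green and task_polls >= MIN_TASK_POLLS:
        return "green"
    if confidence >= yellow:
        return "yellow"
    return "red"


# Hypothetical samples, for illustration only.
idle = [60.0, 61.0, 59.0, 60.0, 61.0, 59.0, 60.0, 60.0, 61.0, 59.0]
busy = [144.0, 146.0, 145.0, 147.0, 143.0, 145.0, 146.0, 144.0, 145.0, 145.0]
conf = run_confidence(idle, busy, idle_floor_w=1.5, drift_w=0.5)
print(traffic_light(conf, task_polls=len(busy)))
```

Because the poll count is checked separately from the confidence value, a run with only a few task samples cannot be rated green however large its delta looks.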
On any result page, click a 🟢 🟡 🔴 badge for a quick reminder of the formula.
From OWL's body of evidence — citable findings backed by stored measurements:
OWL has three access tiers. The numbers and methodology you've just seen are identical for all three — what changes is who can shape the inputs (custom prompts, custom ffmpeg, all-codecs sweeps, your own corpus, full settings access).
| Capability | Public | GoS member | Lab (operator) |
|---|---|---|---|
| Pre-baked workloads, live wall-power & CO2e | ✓ | ✓ | ✓ |
| Guided tour, methodology, recent-run history | ✓ | ✓ | ✓ |
| Custom video upload | — | ≤ 1024 MB | no cap |
| Custom prompts & custom ffmpeg commands | — | ✓ | ✓ |
| All-codecs sweeps, batch / compare-modes | — | ✓ | ✓ |
| RAG corpus upload (your own PDFs) | — | ✓ | ✓ |
| CSV / JSON export of your runs | — | ✓ | ✓ |
| Edit settings, run variance calibration, full results view | — | — | ✓ |
Lab tier is granted automatically on the GoS1 LAN (loopback / 192.168.x). There's no public sign-up for Lab — it's the operator surface for the bench itself.
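The tier rule and the upload caps from the table could be modelled as in the sketch below. It assumes "192.168.x" means the whole 192.168.0.0/16 range and that membership is established elsewhere; the function names and the `is_member` flag are hypothetical.

```python
import ipaddress

# Assumption: "192.168.x" is read as the full 192.168.0.0/16 private range.
LAB_LAN = ipaddress.ip_network("192.168.0.0/16")

# Upload caps in MB from the table: no upload, 1024 MB, or no cap (None).
UPLOAD_CAP_MB = {"public": 0, "member": 1024, "lab": None}


def resolve_tier(client_ip, is_member=False):
    """Lab for loopback or LAN clients, otherwise member or public."""
    addr = ipaddress.ip_address(client_ip)
    if addr.is_loopback or (addr.version == 4 and addr in LAB_LAN):
        return "lab"
    return "member" if is_member else "public"


def upload_allowed(tier, size_mb):
    """True if a custom video of this size fits within the tier's cap."""
    cap = UPLOAD_CAP_MB[tier]
    return cap is None or size_mb <= cap


print(resolve_tier("192.168.1.20"))
print(resolve_tier("203.0.113.5", is_member=True))
print(upload_allowed("member", 812))
```

Deriving the Lab tier from the client address rather than from an account matches the statement that there is no public sign-up for it.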
Same measurement quality on every tier. Members shape the inputs; everyone sees the results.