
RAG retrieval surfaced the correct GoS REM papers across three models; the smallest model (TinyLlama 1.1B) then generated a hallucinated answer combining the correct retrieval with an adjacent JRC chunk

🟡 Indicative · measured 2026-04-29 · v1
Same top-k=3 retrieval, three models: TinyLlama 1.1B hallucinated 'European Commission framework'; Gemma 3 12B and Phi-4 14B stayed faithful to the GoS source
SCOPE: Device layer only. RAG corpus = GoS REM whitepapers + adjacent IEA / BBC / industry energy papers (ChromaDB top-k=3 retrieval, all-MiniLM-L6-v2 embeddings).
OWL Finding: RAG retrieval surfaced the correct GoS REM papers across three models; the smallest model (TinyLlama 1.1B) then generated a hallucinated answer combining the correct retrieval with an adjacent JRC chunk. Measured 2026-04-29. https://wattlab.greeningofstreaming.org/findings/rag-faithfulness-rem-question Greening of Streaming, wattlab.greeningofstreaming.org
Source measurement
Measurement llm/5efb2079…

Caveats

What was measured

A single rag_compare run on the question "What is REM (Remote Energy Measurement)?" with three models (TinyLlama 1.1B, Gemma 3 12B, Phi-4 14B), each receiving the same top-3 retrieved chunks from the OWL corpus (ChromaDB, all-MiniLM-L6-v2 embeddings). Retrieval was identical for all three models: the GoS REM whitepapers plus an adjacent JRC-Commission energy paper. The difference between the runs is what each model did with those chunks at generation time.
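The "retrieve once, generate per model" setup described above can be sketched roughly as follows. This is a minimal illustration and not the OWL implementation: the function names (`retrieve_top_k`, `rag_compare` as written here), the prompt layout, and the model callables are all hypothetical, and a bag-of-words cosine similarity stands in for the ChromaDB + all-MiniLM-L6-v2 retrieval so the example needs only the standard library.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def _vec(text: str) -> Counter:
    # Assumption: bag-of-words counts stand in for a real embedding model.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(question: str, corpus: List[str], k: int = 3) -> List[str]:
    # Rank every chunk by similarity to the question and keep the best k.
    q = _vec(question)
    ranked = sorted(corpus, key=lambda c: _cosine(q, _vec(c)), reverse=True)
    return ranked[:k]


def rag_compare(
    question: str,
    corpus: List[str],
    models: Dict[str, Callable[[str], str]],
    k: int = 3,
) -> dict:
    # Retrieval runs once, so every model sees exactly the same context.
    chunks = retrieve_top_k(question, corpus, k)
    prompt = "Context:\n" + "\n---\n".join(chunks) + "\n\nQuestion: " + question
    # Only the generation step differs between models.
    answers = {name: generate(prompt) for name, generate in models.items()}
    return {"chunks": chunks, "answers": answers}
```

Because the prompt is built before the loop over models, any difference in the answers can be attributed to generation rather than retrieval, which is the comparison this finding relies on.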

Observed answers

What this measurement establishes

What this measurement does not establish

Why this matters for the bench

OWL's RAG measurement pipeline reports energy per inference cleanly. Without a faithfulness axis, the energy story alone would read as "small model is cheaper", but the small model is the one that hallucinated. Energy per correct answer is a more useful headline than energy per token for any model where correctness is the goal. See CR-039 (energy × quality axis for AI jobs) for the explicit treatment.
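One possible reading of "energy per correct answer" is total energy spent divided by the number of answers judged correct. The sketch below assumes that definition; CR-039 may define the metric differently. The function name and the numbers in the example are hypothetical placeholders chosen for easy arithmetic, not values from this measurement.

```python
import math
from typing import Sequence


def energy_per_correct_answer(
    energies_wh: Sequence[float], correct: Sequence[bool]
) -> float:
    """Total energy (Wh) divided by the count of correct answers.

    Returns math.inf when no answer is correct, since no amount of
    energy spent produced a usable result.
    """
    if len(energies_wh) != len(correct):
        raise ValueError("need one correctness flag per energy reading")
    n_correct = sum(1 for c in correct if c)
    if n_correct == 0:
        return math.inf
    return sum(energies_wh) / n_correct


# Hypothetical illustration only: a small model that is cheap per answer
# but right once in four, versus a larger model that is right every time.
small = energy_per_correct_answer([1.0, 1.0, 1.0, 1.0], [True, False, False, False])
large = energy_per_correct_answer([3.0, 3.0, 3.0, 3.0], [True, True, True, True])
print(small, large)  # 4.0 3.0
```

In this made-up case the small model costs less per inference (1.0 Wh against 3.0 Wh) yet more per correct answer (4.0 Wh against 3.0 Wh). A single run such as the one reported here cannot populate this metric on its own; it needs a set of questions with a correctness judgement for each answer.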

Methodology → (docs/wattlab_traffic_light_confidence.md)
rag · faithfulness · tinyllama · gemma3 · phi4 · pre-s30-panel · hallucination