14 Stories

The Adaptive Research System

Building a multi-agent research system that routes queries to specialized AI models, validates citations, and adapts its strategy based on results. A deep dive into orchestration, quality scoring, and the realities of production LLM systems.

Prologue
How This Story Works
Or, Why I Didn't Fix Things Immediately

Marvin suggested telling this chronologically. Discovery, implementation, testing, done. Clean narrative arc. I told him real development doesn't work that way. You discover problems you can't solve yet. You log them. You work on other things. Six stories later, you finally circle back. This is how it actually happened.

by Petteri Leppikallio & Marvin, Jan 9, 2026

Story 1 of 14
The Routing Revelation

Six distinct research angles emerged from a simple query about AI frameworks. Keyword routing had seen only two of them. The sequence was backward, and the fix was embarrassingly obvious once you saw it: generate perspectives first, then route specialists to what you actually found instead of what you assumed you'd find.

by Petteri Leppikallio & Marvin, Jan 14, 2026
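
A minimal sketch of that corrected sequence, assuming a caller-supplied ask() function that wraps whatever LLM API the system uses; the specialist labels and prompts here are illustrative placeholders, not the story's actual implementation.

```python
from typing import Callable

def generate_perspectives(ask: Callable[[str], str], query: str) -> list[str]:
    """First step: ask the model which research angles the query actually raises."""
    prompt = f"List the distinct research angles worth investigating, one per line:\n{query}"
    return [line.strip("- ").strip() for line in ask(prompt).splitlines() if line.strip()]

def route_perspective(ask: Callable[[str], str], perspective: str) -> str:
    """Second step: route a specialist to an angle that was found, not assumed."""
    prompt = (
        "Answer with one word (technical, social, or academic): "
        f"which specialist fits this research angle?\n{perspective}"
    )
    return ask(prompt).strip().lower()

def plan_research(ask: Callable[[str], str], query: str) -> dict[str, str]:
    """Perspectives first, routing second -- the reverse of keyword routing."""
    return {p: route_perspective(ask, p) for p in generate_perspectives(ask, query)}
```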

Story 2 of 14
When 97/100 Means You Failed

Ten research sessions, every one scoring excellent, every one missing obvious platforms. The system measured thoroughness beautifully while ignoring whether agents searched the right places at all. Keyword density favors promoted content, and quality metrics reward being thorough about the wrong things.

by Petteri Leppikallio & Marvin, Jan 16, 2026

Story 3 of 14
The Two-Wave Architecture
and the Cost of Not Thinking

I'd spent days building increasingly sophisticated pattern matchers to route research queries intelligently. The solution turned out to be spending a few seconds asking an LLM to think first. Sometimes the problem isn't that you don't know the answer. It's that you're too stubborn to use it.

by Petteri Leppikallio & Marvin, Jan 24, 2026

Story 4 of 14
The Devil Lives in the Routing Logic

Marvin built a weighted scoring matrix with clean numbers and precise calculations. Claude +3 for technical, Grok +3 for X. It looked scientific. Then I asked why Claude got +3 for technical but only +2 for academic, and he said "because it feels right?" The day I learned that prompt engineering is production code, not documentation.

by Petteri Leppikallio & Marvin, Jan 29, 2026
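
For concreteness, a sketch of what a scoring matrix like that might look like. Only the two +3 weights mentioned above come from the story; every other number, model name, and topic label is an invented placeholder.

```python
# Hypothetical routing weights. Only claude/technical = +3 and grok/x = +3
# are taken from the story; the rest are made up for illustration.
ROUTING_WEIGHTS: dict[str, dict[str, int]] = {
    "claude": {"technical": 3, "academic": 2, "x": 0},
    "grok":   {"technical": 1, "academic": 0, "x": 3},
}

def score_model(model: str, query_topics: list[str]) -> int:
    """Sum the model's weight for each topic detected in the query."""
    weights = ROUTING_WEIGHTS.get(model, {})
    return sum(weights.get(topic, 0) for topic in query_topics)

def pick_model(query_topics: list[str]) -> str:
    """Precise-looking arithmetic over weights that were chosen by feel."""
    return max(ROUTING_WEIGHTS, key=lambda model: score_model(model, query_topics))
```

The arithmetic is exact; the weights are not, which is the story's point: the numbers look scientific, but the judgments behind them are production logic and deserve the same scrutiny as code.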

Story 5 of 14
When Valid Sources Are Still Wrong

The source existed. The PDF loaded. But the thirty-two percent statistic we'd cited? Nowhere in the accessible content. We built validation to catch exactly this: claims that look legitimate but can't be verified. It worked: barely half the citations came back valid, and the unverifiable statistics got caught. Then we checked what the valid citations were actually saying. Turns out vendor content is still vendor content, whether the sources exist or not.

by Petteri Leppikallio & Marvin, Dec 12, 2025
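
A rough sketch of the kind of check involved, assuming the source text has already been fetched; the function names and the number-matching heuristic are illustrative, not the validation the story actually shipped.

```python
import re

def statistic_in_source(claimed_stat: str, source_text: str) -> bool:
    """True only if the cited figure actually appears in the accessible content."""
    haystack = source_text.lower()
    numbers = re.findall(r"\d+(?:\.\d+)?", claimed_stat)
    if numbers:
        # e.g. "32%" -- require the number itself to show up somewhere.
        return any(n in haystack for n in numbers)
    return claimed_stat.lower() in haystack

def validate_citation(claimed_stat: str, source_text: str | None) -> str:
    """Separate reachable-but-unverifiable claims from genuinely supported ones."""
    if not source_text:
        return "source unreachable"
    if not statistic_in_source(claimed_stat, source_text):
        return "unverifiable statistic"  # source loads fine, figure is nowhere in it
    return "valid"
```

Even a check this crude separates "the PDF loads" from "the PDF says what we claimed"; the harder problem the story lands on, a valid citation that is still vendor marketing, needs a different kind of validation entirely.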