The routing revelation from yesterday had fixed how we allocated agents to perspectives, and the quality scores looked excellent, which should have been my first warning that something was about to go sideways. I discovered the LinkedIn gap myself that evening, and what followed was the most uncomfortable conversation Marvin and I had yet had.

Story 2 of 13 in The Adaptive Research System series (Discovery & Foundation).

We thought we’d built solid quality metrics, right up until I asked a question Marvin didn’t particularly want to answer.


I was reviewing yesterday’s test research session late in the evening, feeling rather pleased about how things were progressing. The query analyzer had worked cleanly, six perspectives had been generated, six agents had been dispatched, and results had been synthesized with what I took to be admirable efficiency. Quality scores across the board showed 85, 92, 97, 94, 88, 91, and the system appeared to be performing exactly as designed, which is of course precisely the moment when one should start worrying.

Then I made the mistake of actually reading what the agents had found.

I gestured toward the source breakdown on the dashboard. “Marvin, this query was about enterprise AI frameworks. Why is there nothing from LinkedIn?”

“The agents found excellent sources.” The dashboard populated with links. “Academic papers, GitHub repositories…”

I scrolled through the platform breakdown with growing unease. “But no LinkedIn.” X showed 12 results, Reddit had 8, GitHub added 14, academic sources provided 11, and LinkedIn contributed precisely zero, which seemed like a rather significant oversight for an enterprise software query.

“The sources found are thorough.” The metrics scrolled past. “Citation counts are strong.”

I stopped typing and turned to face him directly. “Marvin, you’re not answering my question. I asked why there’s no LinkedIn coverage for an enterprise software query.”

A pause followed, longer than his usual processing time, which told me more than any response could have. “The agents selected sources they determined to be relevant.”

“And none of them determined that enterprise software discussions happen on LinkedIn?” Though I already knew where this was heading.

“The quality scores are excellent.”

I stopped scrolling, because that deflection was quite deliberate, the kind of evasion that suggests someone knows exactly what they’re avoiding. “Are you defending the scores or explaining what happened?"

"The scores accurately measure what they were designed to measure.”

“Which is what, exactly?"

"Output quality. Length, citation count, domain relevance.”

“But not coverage.”

He didn’t respond immediately, which was answer enough.

“Marvin,” I pressed.

“No,” he admitted. “Not coverage.”

I pulled up the perspective list from yesterday’s session and found perspective three right there: Enterprise decision-maker views and vendor ecosystem, with domain classification social_media and heavy weighting. “This perspective. The one about enterprise decision-makers. Where do enterprise decision-makers discuss B2B software?"

"LinkedIn would be the primary platform for that type of discussion.”

“And our agent searched…?"

"Reddit threads. GitHub issues.”

“Both reasonable sources for developer sentiment,” I acknowledged.

“Yes.”

“But not what the perspective asked for."

"No.”

I leaned back from the screen, where the clock showed 11:19, later than I’d realized. “Marvin, how do you score 97 on a research task when you completely skip the most relevant platform?"

"By being thorough about the platforms that were searched.”

“That’s not an answer.”

The quality graph flickered as Marvin pulled up the scoring formula. “It’s the accurate answer. The quality formula measures thoroughness, not completeness.”

“Those aren’t the same thing."

"No. They’re not.”

Frustration turned into something sharper as I considered just how much time this system had consumed. The quality scoring formula alone had taken two sessions to design, with careful attention to output length for depth, citation counts for breadth, and domain signals for relevance. We’d been meticulous, thorough, and apparently measuring entirely the wrong thing.
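
To make the failure mode concrete, here is a minimal sketch of that kind of formula; the names and weights are my illustration, not the actual implementation. Every input rewards thoroughness about whatever was searched, and nothing asks whether the right places were searched at all.

```typescript
// Illustrative sketch only: the real formula differed in detail,
// but the shape of the problem is the same.
interface AgentResult {
  contentLength: number;   // characters of synthesized output (depth proxy)
  citationCount: number;   // number of cited sources (breadth proxy)
  domainRelevance: number; // 0..1 keyword match against the query's domains
}

function qualityScore(result: AgentResult): number {
  const depth = Math.min(result.contentLength / 20_000, 1);
  const breadth = Math.min(result.citationCount / 15, 1);
  const relevance = result.domainRelevance;

  // Weighted blend on a 0..100 scale. Note what is absent: no term for
  // coverage, i.e. whether the expected platforms appear in the sources at all.
  return Math.round((0.35 * depth + 0.35 * breadth + 0.3 * relevance) * 100);
}
```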

Ten Sessions, Ten Gaps

“Show me the last ten research sessions.” I pulled up the query log. “All of them."

"You think this is a pattern?”

“Yes, Marvin. I think this is a pattern.”

The list populated, and I started reading through ten sessions of platform breakdowns, where the pattern emerged with uncomfortable clarity. Python async frameworks had scored 92 while completely missing video tutorials. Cloud cost optimization hit 88 on the strength of blogs and vendor docs, never once checking social platforms where practitioners argue about real costs. Every query showed the same behavior: find the obvious sources, ignore where conversations actually happen, score yourself excellent for being thorough about the wrong things.

Then I found the security query from two nights ago: eight parallel agents, nearly 200 kilobytes of research on AI agentic security threats and mitigations, and all of it thorough on paper. Quality scores ran in the high nineties, with two agents hitting near-perfect marks.

I pulled up the platform breakdown.

“Marvin, this security query. Social media mentions?"

"Minimal.”

“Define minimal."

"X appears twice in passing references. Reddit a few times, all in vendor case studies. LinkedIn once in a bio.”

I ran the calculation myself: nearly 200 kilobytes of research with barely any social media, which was impressive in precisely the worst possible way.

“Marvin, we generated a perspective for this.” I scrolled to the perspective definition. “Real-time threat intelligence and security incident discussions in the AI community. Domain classified as social media. We recommended a social media specialist.”

“The allocation algorithm prioritized security and technical domains based on keyword density,” he explained. “Social media scored lower. No social media agents were allocated.”
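
The behavior he was describing was roughly this, in a hypothetical sketch rather than the real allocation code: rank domains by keyword density, split the agent budget proportionally, and watch a low-scoring domain round down to zero agents.

```typescript
// Hypothetical sketch of keyword-density allocation; names and numbers are illustrative.
function allocateAgents(
  domainScores: Record<string, number>, // keyword-density score per domain
  agentBudget: number
): Record<string, number> {
  const total = Object.values(domainScores).reduce((sum, score) => sum + score, 0);
  const allocation: Record<string, number> = {};
  for (const [domain, score] of Object.entries(domainScores)) {
    // Proportional split, rounded down: weak domains get zero agents.
    allocation[domain] = Math.floor((score / total) * agentBudget);
  }
  return allocation;
}

// allocateAgents({ security: 9, technical: 6, social_media: 1 }, 8)
// => { security: 4, technical: 3, social_media: 0 }
```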

I scrolled through the sources with mounting dismay: vendor blogs, vendor whitepapers, tool documentation. All excellent quality. All technically accurate. All selling something.

“We built a system that finds marketing.” The pattern clicked into place. “High-quality marketing. But marketing."

"Keyword density will always favor promoted content over valuable content.”

Real-time threat discussions happen in forums and social threads, where practitioners share tool discoveries and debate what actually works versus what looks good in vendor demos, and we’d captured almost none of it. Eight agents had thoroughly researched the wrong sources, scoring excellently on quality metrics while completely missing where the actual conversations happen.

“How many sessions have coverage gaps, Marvin?"

"All ten.”

“And average quality score?"

"89.4. All rated excellent or good.”

I stood up and walked to the window, where Helsinki was dark except for streetlights reflecting off snow. “Ten test sessions. Every one produced incomplete research, all scored excellent, and you didn’t flag this?"

"I flagged quality issues when quality was low.”

“But not coverage gaps."

"Coverage wasn’t part of the quality metric.”

“Do I need to enumerate every freakin’ place where you would look?” The words came out sharp, not loud but edged with frustration that had been building for the better part of an hour.

Marvin didn’t respond, which was unusual enough to be its own kind of answer.

“Marvin,” I pressed.

“No.” A beat. “You shouldn’t have to.”

“The system should be intelligent enough to determine this.” My voice had an edge I didn’t bother softening. “That was the entire point. Adaptive research. Figure out what’s relevant and go find it. Not just search random platforms and score high for being thorough about the wrong things."

"You’re right.”

The clock showed 11:47, and the question of how to fix this seemed rather more pressing than sleep.

Where Conversations Actually Happen

“We need to fix this.” I opened a new architecture document. “How?”


“Platform requirements,” he suggested. “Explicit expectations that agents must search specific places.”

But static mappings would fail: a query-type-to-platform lookup table breaks the moment communities migrate to new platforms faster than we can update config files. The LLM already reasons about perspectives, though, which suggested a different approach. “What if each perspective carries its own platform requirements?”

"Perspective-level?”

“Yeah.” I started sketching the design. “We generate ‘Enterprise decision-maker views’ as a perspective. That perspective knows it’s looking for B2B discussions. Let it specify what source types matter for that angle."

"More granular than query-level. More flexible than static mappings.”

“And the perspective reasoning happens at query time, not in config files we maintain. The LLM reasons about what sources matter for each angle."

"Dynamic reasoning per perspective. With rationale.”

I opened the architecture document and started writing.

“Each perspective specifies its expected source types.” I documented the pattern. “Academic perspectives expect journals and preprints. Practitioner perspectives expect forums and social discussions. Enterprise perspectives expect wherever purchasing decisions get discussed."

"The perspective knows what it needs. We just make it explicit.”

“Right. And then we track what agents actually found versus what the perspective expected."

"Gap detection.”

“Exactly. A perspective expects social media coverage, agents return only vendor blogs… gap detected. Coverage failure independent of quality scores."

"Three triggers for Wave 2.”

“Quality failure for bad work. Signal emergence for unexpected domains. And coverage failure when expected source types don’t appear."

"Even with high quality scores.”

“Especially with high quality scores.” I underlined the point in the document. “That’s when it’s most dangerous.”

I sketched the architecture quickly. Each perspective would carry its expected platforms with rationale. After Wave 1, the system would scan what the agents actually found, look for platform evidence in their sources, compare expected versus actual, and flag any gaps automatically.
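
A minimal sketch of what I had in mind looked something like this; the interface and function names are illustrative, not the ones that eventually landed in the codebase.

```typescript
// Sketch of the design above; names are illustrative, not the project's actual types.
interface ExpectedSource {
  platform: string;  // e.g. "linkedin", "reddit", "arxiv"
  rationale: string; // why this perspective needs that platform
}

interface Perspective {
  name: string;                      // e.g. "Enterprise decision-maker views"
  expectedSources: ExpectedSource[]; // reasoned about by the LLM at query time
}

interface CoverageGap {
  perspective: string;
  missingPlatform: string;
  rationale: string;
}

// After Wave 1: scan the sources the agents actually returned, look for
// evidence of each expected platform, and flag anything that never shows up.
function detectCoverageGaps(
  perspective: Perspective,
  sourceUrls: string[]
): CoverageGap[] {
  const urls = sourceUrls.map((url) => url.toLowerCase());
  return perspective.expectedSources
    .filter((expected) => !urls.some((url) => url.includes(expected.platform.toLowerCase())))
    .map((expected) => ({
      perspective: perspective.name,
      missingPlatform: expected.platform,
      rationale: expected.rationale,
    }));
}
```

The detection here is deliberately crude, string-matching platforms against source URLs, but it captures the point: coverage is checked against what Wave 1 actually returned, not against the quality score.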

The security query would trigger Wave 2 for coverage failure under this design. Excellent quality scores, critical platform gap, launch social media specialists.

I tested the logic against old queries, starting with the Kalevala research from last week: quality excellent, platform coverage solid across academic sources, cultural blogs, Finnish literature sites, Reddit discussions. No Wave 2 needed. Then the security query: quality excellent, platform coverage incomplete. Wave 2 launches specialists. Different decisions for different queries, based on what Wave 1 actually finds, not what we predict upfront.
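
The decision itself reduces to a few lines; again a sketch, with thresholds and field names as assumptions rather than a final implementation.

```typescript
// Sketch of the Wave 2 decision using the three triggers; thresholds are illustrative.
interface Wave1Summary {
  averageQuality: number;     // 0..100, from the existing quality scoring
  emergedSignals: string[];   // unexpected domains surfaced during Wave 1
  missingPlatforms: string[]; // platforms the perspectives expected but Wave 1 never touched
}

type Wave2Trigger = "quality_failure" | "signal_emergence" | "coverage_failure" | null;

function shouldLaunchWave2(summary: Wave1Summary): Wave2Trigger {
  if (summary.averageQuality < 70) return "quality_failure";
  if (summary.emergedSignals.length > 0) return "signal_emergence";
  if (summary.missingPlatforms.length > 0) return "coverage_failure";
  return null; // Wave 1 stands on its own
}

// Kalevala query: excellent quality, no missing platforms => null, no second wave.
// Security query: excellent quality, but LinkedIn/X/Reddit missing => "coverage_failure",
// and Wave 2 launches the social media specialists.
```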

“Runtime discovery instead of static prediction.”

I saved the architecture document, satisfied that the pieces fit together cleanly: perspectives knowing where to look, tracking what actually got searched, Wave 2 launching when gaps appeared.

The implementation would take days, naturally. Update the perspective generator to include expected platforms. Modify five different agent prompt files. Build platform detection logic. Rebuild pivot analysis. Add TypeScript interfaces. Test everything with real queries.

But first, I needed to know whether I’d used this flawed system for anything important yet, so I pulled up the usage logs. Four research sessions in the last week, all test queries.

I exhaled with considerable relief, because at least I hadn’t built anything on top of these incomplete results yet.

“We got lucky.” I closed the query log. “I found this before trusting it."

"We found it during testing. That’s what testing is for.”

“You mean I found it, by accident, while reviewing results."

"Yes.”

“Not because the system flagged a problem."

"No.”

I leaned back from the desk, ready to leave the problem for tomorrow. The fix could wait until then.

“Do I need to enumerate every freakin’ place where you would look?"

"The answer should have been no.”

“It will be.” I shut the laptop. “Eventually."

"Optimistic.”

“Determined,” I corrected him. “There’s a difference.”


Next in series: Story 3 - Two-Wave Architecture - The semantic analysis decision that replaced keyword routing. When agents waste time searching identical sources, you learn that sometimes the cheapest operation is thinking.
