That Tuesday afternoon I loaded a test query with reasonable optimism: “quantum computing breakthroughs and public perception.”

Story 3 of 13 in The Adaptive Research System series (Discovery & Foundation).

The two-wave architecture was working, at least in theory, and platform coverage tracking caught gaps, but Wave 1 kept making spectacularly dumb decisions about which agents to launch.


The keyword analyzer saw “quantum computing” and decided that was sufficient information for decision-making, its pattern-matching circuits lighting up with recognition and confidence that bordered on the reckless. Academic domain, technical content, scientific research, very high confidence. It allocated five Perplexity agents to arXiv, IEEE, and technical publications without a second thought.

I watched them launch in the terminal window, five identical mandates marching off through five separate API calls to search the same academic territory, a synchronized routine with nobody choreographing it. The results populated with technical papers about coherence times, more about error correction, and someone’s extensive writing on qubit implementations: an excellent academic literature review and absolutely nothing about public perception.

“It missed half the query.” I scrolled through the results, and I could feel the afternoon starting to go sideways.

“The keyword classifier identified quantum computing as academic domain.” The numbers populated on screen. “Classification confidence: ninety-four percent.”

“Very confident and completely wrong about half the query, which is worse than uncertain and correct.”

I pulled up the classifier logs, looking for what had failed, and the evidence was damning. The query explicitly said “and public perception,” three simple words, and public perception doesn’t live in academic journals. It lives in social media threads where someone’s explaining quantum supremacy to confused relatives, in mainstream news doing the annual “quantum computing will change everything” piece, in forum discussions where people are genuinely baffled about what makes a quantum computer different from a regular computer that’s just really cold. Five agents, all searching identical academic sources, none of them looking where public perception actually lived.

“Where does public perception live, Marvin?”

“Social media platforms, mainstream news coverage, public forums.” The answer was obvious in hindsight. “Not academic publications.”

“And our keyword analyzer allocated five agents to academic sources because the dominant keyword triggered academic domain classification.” I pulled up the allocation logic, working through the architecture. “The keyword analyzer doesn’t reason about implications, it just extracts features.”

“That is its design.” The logic was sound, which made the failure more frustrating. “Pattern matching is deterministic, fast, reliable.”

“And stupid.” I closed the log file. “Consistently wrong isn’t better than variably correct.”

I stared at the allocation logic file, terminal cursor blinking through page after page of pattern matching rules, domain lookup tables, and confidence thresholds. All of it deterministic, all of it fast, none of it capable of understanding that “scientific breakthroughs” and “public perception” required completely different research approaches.
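For what it’s worth, the core of that kind of router fits in a handful of lines. The sketch below is hypothetical, not the actual allocation file; the keyword sets, source lists, and function names are invented to illustrate the shape of the problem.

```python
# Hypothetical sketch of a deterministic keyword router. The keyword sets,
# source lists, and function names are illustrative, not the real allocation logic.
DOMAIN_KEYWORDS = {
    "academic": {"quantum", "computing", "research", "algorithm"},
    "public_discourse": {"perception", "opinion", "discourse", "mainstream"},
}

DOMAIN_SOURCES = {
    "academic": ["arxiv", "ieee", "technical_publications"],
    "public_discourse": ["social_media", "news", "forums"],
}

def route(query: str) -> tuple[str, float]:
    """Pick the single domain whose keywords best match the query."""
    tokens = set(query.lower().split())
    scores = {
        domain: len(tokens & keywords) / len(keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# "quantum computing breakthroughs and public perception" hits two academic
# keywords and only one public-discourse keyword, so every agent gets routed
# to academic sources. That is the failure mode in a nutshell.
```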

“Do I need to enumerate every query structure manually?” I pulled up the pattern file. “‘If mentions public perception, check social media. If says mainstream, check news sites. If says technical, check arXiv.’”

“That would become an extensive conditional tree,” which was his diplomatic way of saying “don’t do that.”

Productive Stubbornness

Wednesday morning I had a solution that felt elegant: compound pattern detection. If the query contained “X and Y,” the system would allocate agents to both domains. I was feeling rather pleased with myself.
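A minimal sketch of what that compound detection amounts to, with an invented pattern and function name rather than the real code:

```python
import re

# Hypothetical sketch of compound pattern detection: split an explicit
# "X and Y" query and classify each half separately. Pattern and names
# are illustrative only.
COMPOUND = re.compile(r"^(?P<left>.+?)\s+and\s+(?P<right>.+)$", re.IGNORECASE)

def split_compound(query: str) -> list[str]:
    """Return both halves of an 'X and Y' query, or the query unchanged."""
    match = COMPOUND.match(query)
    if match:
        return [match.group("left").strip(), match.group("right").strip()]
    return [query]

split_compound("quantum computing breakthroughs and public perception")
# -> ["quantum computing breakthroughs", "public perception"]
# Each half is routed on its own, one agent per detected domain. Phrase the
# query as "quantum computing in public discourse" and nothing splits,
# which is exactly where the conversation below ends up.
```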

“Watch this.” I pulled up the test interface. “Query: ‘quantum computing breakthroughs and public perception.’”

The analyzer parsed the structure, identified two domains, and launched one agent for scientific research and one for public discussion, and for a brief moment I thought I’d solved it.

“It works.” The words had barely left my mouth before Marvin started dismantling my satisfaction.

“And when users phrase it as ‘quantum computing in public discourse’?”

I stared at the pattern-matching rules and admitted I would add that structure.

“‘How public perception affects quantum computing research funding’?”

That too.

“‘Public understanding of recent quantum computing developments’?”

“Are you going to list every possible variation?”

“I could.” The pause was loaded with implication. “But I suspect the users will do it for me.”

I closed the pattern-matching file, which had grown to forty-seven entries, none of which would survive contact with actual queries.

Petteri playing AI whack-a-mole

“I’m building whack-a-mole architecture, aren’t I?” The answer was obvious enough.

“That would be my assessment.”

By Friday afternoon I’d circled back to the same uncomfortable answer three times, the one I’d been avoiding because it felt like admitting defeat: stop trying to be clever with deterministic pattern matching, stop pretending I could predict every query structure, and just ask something that could actually think about what the query meant.

“Fine.” I pushed back from the keyboard. “Just ask an LLM.”

Three Seconds of Surrender

“Adding latency to every query.”

“Three seconds, maybe four.”

“The keyword analyzer routes in under one second.”

“And launches redundant agents that search identical sources while missing half the query.”

“That was one instance.”

“Show me the last ten test runs.” The list populated with familiar failures. A Python frameworks query had sent five agents deep into technical documentation, completely ignoring the YouTube tutorials where people actually explain how async works. A cloud optimization query grabbed every vendor blog but missed every forum thread where engineers argue about real costs. A frontend comparison drowned in GitHub stars while Discord communities spent the same afternoon debating which framework made them want to quit programming. The pattern was everywhere I looked.

“Should I continue reading?” I scrolled past another page of failures.

“I could enumerate further examples.” The dry tone was unmistakable. “Though I suspect you’d prefer I didn’t.”

“A few seconds of semantic analysis versus redundant agent calls.” I leaned back. “The math isn’t exactly subtle.”

“Semantic analysis introduces variability.” The concern was valid. “Run the same query twice, potentially different angle breakdowns.”

“True, so we validate it with a hybrid approach.”

“How would that function?”

I started sketching the architecture, finally feeling like I was building something that might actually work. “Send the query to an LLM, quick API call, ask it what research angles matter for thorough coverage, and it returns angles with domain classifications. Then we run keyword validation on each angle as a sanity check.”
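The wiring for that step is mostly plumbing. Below is a rough sketch under some assumptions: call_llm stands in for whatever chat-completion client you use, route() is the hypothetical keyword router from the earlier sketch, and the prompt wording is invented, not the production one.

```python
import json
from typing import Callable

# Hypothetical sketch of the hybrid step: an LLM proposes research angles,
# a cheap keyword pass sanity-checks each one. Nothing here is production code.
ANGLE_PROMPT = (
    "What research angles matter for thorough coverage of this query? "
    "Return a JSON list of objects with: name, domain, recommended_sources, "
    "confidence (a number from 0 to 1).\n\nQuery: {query}"
)

def analyze_query(query: str, call_llm: Callable[[str], str]) -> list[dict]:
    angles = json.loads(call_llm(ANGLE_PROMPT.format(query=query)))

    # Keyword validation: flag angles whose claimed domain disagrees with
    # what the simple keyword router predicts with any conviction.
    for angle in angles:
        predicted, score = route(angle["name"])  # route() from the earlier sketch
        angle["keyword_agrees"] = predicted == angle["domain"] or score < 0.25
    return angles
```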

"And if they disagree?”

“Ensemble review. Get a second opinion from a different model, three models, majority vote. But most of the time they’ll agree, and when they don’t it’s usually the LLM seeing nuance the keyword detector missed."

"You are proposing semantic understanding with keyword validation as safety check.”

“Exactly."

"Where does this fit in the architecture?”

I pulled up the diagram. “Front of the pipeline. Query comes in, semantic analysis generates six to eight research angles with domain classifications and source recommendations, keyword validation runs in parallel, agent allocation maps angles to specialists, then Wave 1 launches."

"Launches how many agents for six to eight angles?”

“Two to three, top priorities only."

"That seems optimistic.”

“That’s the point of the two-wave structure. Launch cheap, pivot if needed.”

I added more boxes to the diagram, the architecture taking shape. “First wave explores with minimal resources, we evaluate results, and if quality’s good and coverage is adequate, we’re done. If quality’s below sixty or we only addressed forty percent of the angles or we discovered something unexpected, then Wave 2 launches additional specialists.”

“Deferred funding based on evidence.”

“Right, we’re not trying to be perfect upfront. We’re trying to be good enough that Wave 1 exposes what we got wrong.”

“Self-correction through actual results rather than prediction.”

“Test it with an enterprise AI query.” I pulled up the test interface.

“Running analysis now.”

The semantic analyzer returned seven distinct angles: technical architecture patterns, vendor ecosystem and product landscape, ROI and business case development, deployment patterns and operational considerations, security and compliance requirements, user adoption challenges, and integration with existing enterprise tools. Each angle came with confidence scores, ranging from very high for technical architecture down to medium-high for deployment patterns.

“Seven angles identified.” I scanned the priorities. “We’re not funding seven agents up front, so we take the top three based on confidence scores: technical architecture, vendor ecosystem, deployment patterns. Those three run in Wave 1, and the other four become deferred candidates.”
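That selection step is deliberately simple. A hypothetical sketch of it, using the angle dictionaries from the earlier analyze_query sketch:

```python
# Hypothetical sketch: fund only the highest-confidence angles in Wave 1,
# keep the rest as deferred candidates for a possible Wave 2.
def select_wave_one(angles: list[dict], budget: int = 3) -> tuple[list[dict], list[dict]]:
    ranked = sorted(angles, key=lambda a: a["confidence"], reverse=True)
    return ranked[:budget], ranked[budget:]

# For the enterprise AI query this would pick technical architecture, vendor
# ecosystem, and deployment patterns; the other four angles wait as deferred
# candidates until Wave 1 results say otherwise.
```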

"And if Wave 1 results indicate security is more critical than the initial confidence suggested?”

“Then Wave 2 launches a security-focused specialist. The framework adapts based on what we actually found, not what we predicted we’d find.”

I started running cost comparisons. The old keyword routing launched agents blindly, where they couldn’t see what their peers were researching, couldn’t adapt to findings, and averaged higher costs. The new approach with semantic analysis and two waves gave agents clear mandates, let the system adapt through Wave 2 when needed, and averaged lower costs despite the analysis overhead.

“We’re spending a few seconds and a fraction of a cent on upfront analysis to prevent launching redundant agents.”

“When precisely does Wave 2 trigger?”

“Three conditions.” I pulled up the metrics dashboard. “Quality below sixty, coverage under forty percent of identified angles, or emergent signals indicating we missed something that actually matters. And if none of those trigger, Wave 2 doesn’t run. Simple queries get analysis, two agents, done; complex queries get the full treatment.”
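Those three conditions translate almost directly into a check. A sketch with the thresholds from above and invented field names:

```python
# Hypothetical sketch of the Wave 2 trigger. The thresholds mirror the ones
# described above; the field names on the results dict are invented.
def should_launch_wave_two(results: dict) -> bool:
    low_quality = results["quality_score"] < 60
    thin_coverage = results["angles_covered"] / results["angles_identified"] < 0.4
    surprises = bool(results["emergent_signals"])  # findings nobody predicted
    return low_quality or thin_coverage or surprises
```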

“Complexity-aware scaling.” The elegance was satisfying. “Automatic.”

“The agents themselves receive what level of detail?”

“Full mandates.” I pulled up an example. “Not just ‘research AI frameworks,’ but more like ‘investigate the vendor ecosystem, specifically product differentiation, pricing models, and support structures; check vendor announcements, analyst coverage, product positioning.’ I ran tests with both vague and specific mandates, and vague mandates produced shallow results while specific mandates with source recommendations produced focused, relevant findings.”
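A mandate in that sense is just structured instructions handed to an agent. A hypothetical shape for it, with invented field names:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a full agent mandate; the field names are invented.
@dataclass
class Mandate:
    angle: str
    focus: list[str] = field(default_factory=list)
    recommended_sources: list[str] = field(default_factory=list)

vendor_mandate = Mandate(
    angle="Vendor ecosystem and product landscape",
    focus=["product differentiation", "pricing models", "support structures"],
    recommended_sources=["vendor announcements", "analyst coverage", "product positioning"],
)
```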

"The semantic analysis generates these mandates automatically?”

“For every angle it identifies.”

The prompt itself needed work; it was too vague at first, so I added constraints: generate six to eight distinct research angles with specific sources, minimal overlap, prioritized by information value. Testing showed it worked, producing six to seven angles per query, specific enough to guide agents and broad enough to leave room for specialist judgment.
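Those constraints amount to a tighter version of the prompt from the earlier sketch. A hypothetical revision; the production wording is not shown here:

```python
# Hypothetical revision of the earlier ANGLE_PROMPT with the constraints
# described above. Illustrative wording, not the production prompt.
ANGLE_PROMPT = (
    "Break this research query into 6 to 8 distinct research angles.\n"
    "Constraints:\n"
    "- each angle names specific sources worth checking\n"
    "- minimal overlap between angles\n"
    "- order angles by expected information value, highest first\n"
    "Return a JSON list of objects with: name, domain, recommended_sources, "
    "confidence (a number from 0 to 1).\n\nQuery: {query}"
)
```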

“Failure mode question.” I paused. “What if semantic analysis is wrong, misses critical angles or over-prioritizes irrelevant ones?”

“Wave 1 would expose the errors.”

“Right. If we missed something important, synthesis notices the gap when first wave results arrive, and the gap gets flagged for Wave 2. If we over-prioritized something irrelevant, Wave 1 produces thin results and we see the signal, so we don’t waste Wave 2 resources pursuing dead ends.”

“Testing assumptions against evidence rather than assuming correctness.”

I committed the architecture changes. Query analysis would use semantic understanding with keyword validation and generate six to eight angles with priorities; the first wave would explore the top priorities, and the second wave would adapt based on evidence.

The Proof

The first test query ran: “Rust adoption challenges in enterprise environments.”

Three seconds passed while the semantic analyzer returned seven angles: technical migration paths, hiring concerns, ecosystem maturity, C++ interop, compliance questions, training requirements, and ROI justification. Wave 1 launched three agents for technical migration, hiring, and ecosystem.

Petteri and Marvin examining stopwatch showing 3 seconds with cost comparison

Twenty-one seconds later: results covering all three angles, zero overlap, source diversity across Stack Overflow, CNCF blogs, and Hacker News discussions.

I leaned back. Three days building increasingly sophisticated pattern matchers, and the solution turned out to be spending three seconds thinking first. Sometimes the problem isn’t that you don’t know the answer; it’s that you’re too stubborn to use it.

“You’re being smug.”

“I haven’t said anything.”

“You’re being smug in your silence.”

“Three seconds to think before acting.”

“Revolutionary concept.”

“If one were being generous.”

Next in series: Story 4 - Implementation Deep Dive - Marvin built a weighted scoring matrix with clean numbers and precise calculations. Claude +3 for technical, Grok +3 for X. It looked scientific. It was completely arbitrary. The day I learned that prompt engineering isn’t documentation, it’s production code.
