The two-wave architecture was designed, platform coverage tracking built, quality scoring validated, and now came the implementation, which is to say the unsexy work of making components actually talk to each other without immediately revealing why the diagrams were optimistic.
Story 4 of 13 in The Adaptive Research System series (Implementation & Validation).
I’d been working on implementation for several days, the architecture diagram pinned to the wall looking elegant in its simplicity the way architecture diagrams tend to look when nobody has tried to run them yet. Query analysis generates angles, angles route to specialists, Wave 1 explores, Wave 2 adapts, clean boxes, clear arrows, the kind of diagram that makes you think the hard part is over when it hasn’t properly begun.
The Matrix of Made-Up Numbers
Then Marvin showed me his routing logic, and I realized we had a problem that required more than optimism to solve.
“Simple weighted scoring system,” Marvin announced with what I can only describe as pride, the sort of pride one takes in building something that looks impressively scientific. “Each agent gets points for domain match, platform capability, source type. Highest score wins the angle.”
Simple, right, except that simple is what people call things before they’ve looked at them properly. I pulled up the matrix. Claude: technical +3, academic +2, reasoning +3. Grok: social +3, real-time +2, X +3. Perplexity: research +3, citations +2, current events +2. Neat columns, clean numbers, the kind of precision that screams “I am a rigorous engineer who makes data-driven decisions,” and yet the more I looked at it the more it seemed like an elaborate costume worn by something rather less scientific underneath. Completely arbitrary, in other words.
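For the record, and with agent labels, tag names, and weights that are mine rather than Marvin’s exact ones, the whole routing layer boiled down to something like this:

```python
# A minimal sketch of the weighted routing matrix. Names and weights are
# illustrative; the real matrix mapped Claude, Grok, and Perplexity to
# columns much like these.
AGENT_WEIGHTS = {
    "technical":  {"technical": 3, "academic": 2, "reasoning": 3},
    "social":     {"social": 3, "real_time": 2, "x_platform": 3},
    "research":   {"research": 3, "citations": 2, "current_events": 2},
    "multimodal": {"visual": 3, "charts": 2},
}

def score_agents(angle_tags: set[str]) -> dict[str, int]:
    """Sum the weight of every tag the research angle carries, per agent."""
    return {
        agent: sum(weights.get(tag, 0) for tag in angle_tags)
        for agent, weights in AGENT_WEIGHTS.items()
    }

# Highest score wins the angle. Note that max() resolves ties silently,
# in dictionary order, which matters later.
scores = score_agents({"technical", "x_platform", "academic"})
winner = max(scores, key=scores.get)
```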
“Why is Claude +3 for technical but only +2 for academic?”
Marvin offered a justification. “That’s a rationalization, not an explanation.” The foundation I’d trusted turned out to be made of good intentions rather than concrete. “You don’t have measurements. You made up numbers that looked scientific.”
I tested it, because I had to know how deep this rabbit hole went, and testing things before trusting them is one of those habits that either saves you or makes you miserable, depending on what the test reveals.
Research angle: “Analysis of X API changes and developer sentiment.”
The weighted scoring churned through its calculations with the contentment of a system that doesn’t know it’s about to embarrass itself. Technical content: +3 to the technical specialist. X platform: +3 to the social media agent. Developer sentiment: +2 social. Current events: +2 research specialist. Final scores: Technical 3, Social 5, Research 2, Multimodal 0. The social media specialist wins, which made sense since it was an X-focused research angle and the X specialist ought to win. The system appeared to work, though “appeared” is doing a lot of heavy lifting in that sentence.
I tried another, feeling the cautious optimism of someone who suspects they’re about to be disappointed. “Technical analysis of new cryptographic protocols discussed on the platform.” Technical: +3 technical specialist. Platform: +3 social. Academic: +2 technical, +2 research. Scores: Technical 5, Social 3, Research 2. The technical specialist wins this time, technical depth trumping platform specificity, the weighted scores falling exactly where we expected them to fall.
Then I tried a harder one, the kind of query that exists in the real world but doesn’t fit neatly into categories, because the real world rarely does. “The platform’s new API pricing model and its business implications for enterprise developers.”
Technical, social platform, business analysis, enterprise context, developer audience: five dimensions, all relevant, all important, all clamoring for attention like children who’ve been promised ice cream and are now discovering there’s only enough for three of them. The scoring system churned. Scores: Technical 4, Social 4, Multimodal 2, Research 2. A tie.
I stared at the screen, watching two agents with identical scores, the system that was supposed to make routing decisions having just thrown up its hands and said “I don’t know,” which is not, generally speaking, what you want your decision-making systems to say.
“Your arbitrary +3 weights can’t handle multi-dimensional research angles.”
The silence that followed had the quality of someone who has just been caught in the act of making things up and is searching for a dignified exit.
“What if we hard-code the routing rules instead?”
“Marvin. No.”
“If/then rules. Simple. Maintainable.”
“You just proposed arbitrary +3 weights.” I leaned back in the chair. “I shut that down. Now you want hard-coded IF statements? That’s the same problem dressed up differently.”
“Let me draft a rule set. We can evaluate its coverage.”
Fine. Sometimes you have to write the bad code to see why it’s bad, the same way you sometimes have to order the experimental dish at a restaurant to confirm your suspicion that fermented something-or-other was not what you wanted for dinner.
I waited while Marvin drafted the rules, which is a peculiar experience because there’s no visible process, no furrowed brow or tapping pen, just a delay while tokens generate somewhere in the computational distance. I scrolled through the routing rules he’d drafted, and the first few looked straightforward enough. Social media terms routed to the platform specialist. Academic keywords hit the research agent. Technical content to the code specialist. Clean, logical, the kind of rules that make you think you’ve solved the problem.
But then it kept going. And going. And going some more.
“How many rules are there now?”
“Thirty-seven.” I’d been afraid to count. “Last week there were twelve.”
Each rule had spawned edge cases, because that’s what rules do when they meet the real world. “What if academic AND technical?” “What if social media but NOT X?” Each edge case spawned sub-rules, priority orderings, fallback logic, and then fallback logic for the fallback logic, which is the point at which you start to suspect that something has gone fundamentally wrong with your approach. This wasn’t rapid co-creation at a whiteboard but rather me typing questions, Marvin generating increasingly complex rule trees, me reading them and feeling the complexity multiply with each iteration until the whole thing had taken on a life of its own.
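A paraphrased taste of where it was heading, several layers before it reached thirty-seven:

```python
# Paraphrased flavor of the rule tree, not Marvin's actual thirty-seven rules.
def route(angle: str) -> str:
    text = angle.lower()
    if "x.com" in text or "twitter" in text or " x " in f" {text} ":
        if "api" in text or "protocol" in text:
            if "sentiment" in text or "community" in text:
                return "social"        # platform + technical + sentiment
            return "technical"         # platform + technical, no sentiment
        return "social"
    if "academic" in text or "paper" in text or "study" in text:
        if "code" in text or "implementation" in text:
            return "technical"         # the "academic AND technical" edge case
        return "research"
    if "chart" in text or "pricing tiers" in text:
        return "multimodal"
    # ...dozens more rules, then fallbacks, then fallbacks for the fallbacks
    return "research"                  # default when nothing else matches
```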
“Stop.”
“Incomplete rule set. Need to handle…”
“This is getting Ptolemaic.”
“Ptolemaic?”
“Epicycles upon epicycles.” I waved at the screen, trying to encompass the disaster. “We’re on our fourth layer of complexity. Like astronomy before someone suggested planets might not orbit Earth.”
“The suggestion being?”
“We’re solving the wrong problem.”
Let the LLM Decide
“We stop writing routing logic ourselves and let an LLM make the call.”
“And the LLM is…?”
“Already doing the heavy lifting for query analysis.”
“Meaning we could extend it.”
“Exactly. Let an LLM make the judgment.” I was warming to the idea now, the way you warm to ideas that have the virtue of being obviously correct once you’ve said them out loud. “The analyzer already generates the research angles. It has full query context, understands the research domains, knows the platform requirements. It’s better positioned to make routing decisions than any formula we’ll write.”
Marvin was catching on, which is one of his better qualities.
“Just extend the query analysis prompt,” I continued, feeling the enthusiasm of someone who has just realized the hard problem they were fighting didn’t need to be fought at all. “Generate research angles AND recommend which agent should handle each one. The routing logic becomes trivial, just read the recommendation.”
“Non-deterministic,” Marvin pointed out, because someone has to. “Run the same query twice, might get different routing.”
“Is that actually a problem?”
“Different routing, different results.”
“If both routings are reasonable?”
There was a pause, the kind of pause that happens when someone realizes their objection has been neatly sidestepped, and then: “I see.”
I extended the query analysis prompt to include agent recommendations with rationale, and tested it on the same queries that had broken the weighted scoring.
“Platform API pricing and business implications for enterprise developers.”
The query analysis came back with four distinct angles, each carrying its own agent recommendation and rationale, which was already more sophisticated than anything our matrix of made-up numbers had produced. Technical architecture went to the technical specialist, with the rationale noting API design decisions that would require deep technical analysis. Platform ecosystem went to the social media specialist, where real-time platform context and developer community sentiment were the deciding factors. Enterprise adoption got routed to the multimodal analyst for visual analysis of pricing tiers and competitive landscape. Citation-backed analysis landed with the research specialist, who could pull academic research on API economics and platform business models. Four angles, four different agents, four reasoned judgments about what each perspective actually needed.
No arbitrary +3 weights, no nested IF statements, just reasoned judgment calls based on what each angle actually needed.
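For concreteness, the analyzer’s per-angle output had roughly this shape, with field names that are illustrative rather than the exact schema we settled on:

```python
from dataclasses import dataclass

# Illustrative shape of what the analyzer now returned per research angle;
# the field names are mine, not necessarily the exact schema.
@dataclass
class AngleAssignment:
    angle: str              # e.g. "Enterprise adoption and pricing-tier impact"
    recommended_agent: str  # "technical" | "social" | "research" | "multimodal"
    rationale: str          # the analyzer's own reasoning for the recommendation

def route(assignment: AngleAssignment) -> str:
    """The entire routing logic, post-refactor: read the recommendation."""
    return assignment.recommended_agent
```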
“Still need quality scoring,” Marvin said, bringing us back to earth with the efficiency of someone whose job it is to notice the problems you were hoping to ignore. “To measure whether assignments work.”
Right. The other implementation devil.
Quality scoring looked simple in theory, which should have been my first warning. Evaluate agent output, assign a score 0-100, use it to inform pivot decisions. Wave 1 produces research, we score the quality, low scores trigger Wave 2 for deeper investigation. Clean, logical, and almost certainly hiding complexity the way a still pond hides depth.
In practice, what were we actually measuring?
“Length of output?”
“Unless the agent is just verbose.” I’d seen that problem before, agents that produced impressive word counts while saying very little, like certain academic papers I could name but won’t.
“Number of citations, then.”
“Unless they’re all the same two sources repeated.” I pulled up the previous session’s notes. “We saw that exact failure mode with the academic agent.”
There was a long pause. “Domain signal strength?”
I didn’t let him finish. “Which punishes agents that synthesize across domains. The best research often spans multiple fields.”
Every metric had a failure mode, which is the sort of observation that makes you question whether you’re actually measuring what you think you’re measuring. Every simple measurement could be gamed, not intentionally, but by agents optimizing for the wrong thing because that’s what we told them mattered.
“Ask an LLM to score it.”
I turned around, because that warranted turning around. “Ask an LLM to grade other LLMs’ work?”
“High-capability model as evaluator. Give it the research angle, agent output, scoring rubric. It returns structured quality assessment.”
It felt recursive, philosophically weird, using AI to judge AI like some sort of computational ouroboros. But the logic was sound: if we wanted quality assessment that understood relevance, depth, and novelty, an LLM was better equipped than any formula I’d cobble together.
Six Dimensions, All Subjective
“How would we measure quality?”
“Relevance, obviously.” I started a list, because lists make things feel manageable even when they’re not. “Coverage of the query.”
“Depth of analysis.”
“Right. And citations.” I kept typing, watching the list grow. “Novelty of sources. Coherence of the output.”
“Six dimensions.”
“Six dimensions.” I looked at the list, which looked back at me with the innocent confidence of something that hasn’t been tested yet. “Each scored zero to one hundred, then averaged.”
“Reasonable.”
I implemented it, created a detailed rubric for each dimension, and tested it on past agent reports, outputs I knew were good, bad, or mediocre from having read them myself. It consistently identified weak outputs: thin analysis, poor source diversity, agents wandering off-topic. Strong outputs got recognized too: thorough coverage, novel insights, synthesis that actually made sense. Then I ran it on a borderline case, an agent report that had found decent sources, written clear summaries, followed basic research methodology, and the quality scorer gave it 72. But the agent had completely missed the core research question, answering a related-but-different question competently.
“Should have scored lower.” The 72 glowed on the screen like a personal affront.
“Rubric needs refinement.”
I adjusted the relevance weighting, emphasized that a direct answer to the research angle mattered more than general quality, and ran it again. The score dropped to 58, which was better, still not perfect since quality assessment never is, but closer to what human judgment would produce.
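Boiled down, with an anonymous `llm` callable standing in for whichever high-capability model did the grading, and with weights that are illustrative rather than the exact ones we landed on, the scorer looked roughly like this:

```python
import json

DIMENSIONS = ["relevance", "coverage", "depth", "citations", "novelty", "coherence"]

# Illustrative weights: after the 72-vs-58 episode, relevance carries roughly
# twice the weight of any other single dimension.
WEIGHTS = {"relevance": 0.30, **{d: 0.14 for d in DIMENSIONS[1:]}}

RUBRIC = """You are grading a research agent's report.
Research angle: {angle}
Report: {report}
Score each of these dimensions 0-100: {dims}.
"Relevance" means a direct answer to the research angle, not general quality.
Respond with a JSON object mapping each dimension name to its score."""

def score_report(angle: str, report: str, llm) -> float:
    """`llm` is any callable that sends a prompt to a high-capability model
    and returns its text; the client itself is out of scope for this sketch."""
    raw = llm(RUBRIC.format(angle=angle, report=report, dims=", ".join(DIMENSIONS)))
    per_dim = json.loads(raw)
    return sum(WEIGHTS[d] * per_dim[d] for d in DIMENSIONS)

PIVOT_THRESHOLD = 60  # illustrative; scores below this trigger Wave 2 follow-up
```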
“Quality scoring is harder than routing.”
“Because quality is subjective,” Marvin said, demonstrating the kind of insight that’s both obvious and genuinely useful. “Routing is ‘which agent fits this angle.’ Quality is ‘did this agent do good work.’ One has right answers. The other has judgment calls all the way down.”
Edge Cases and Temperamental Oracles
Running the suite revealed edge cases immediately, because that’s what running suites does.
The query analysis prompt struggled with ambiguous queries. “Research AI” without additional context generated vague angles that were technically correct but practically useless.
“Could you provide examples of what you’re looking for?”
“I could. Or…” I pulled up the query analysis implementation, feeling the familiar tug of wanting to solve problems rather than dodge them. “What if we just ask the LLM to be more opinionated when the query is vague?”
I refined the prompt, added examples of clarifying questions, added heuristics for detecting insufficient context. The results improved.
Then the multimodal agent started hitting rate limits, because of course it did, and intermittent failures are the worst kind of bug because they make you question your own sanity. Query analysis would work fine on test one, fail on test three, work on test four, like some sort of temperamental oracle that only answered questions when it felt like it.
“The usual fix,” I told Marvin, because this particular problem at least had a usual fix. “Wait a bit, wait longer, add randomness so everything doesn’t hit at once.”
“Exponential backoff with jitter.”
“That’s what the algorithm is called, yes.”
I tried it, and it still hit limits during parallel research. Tried adjusting the thresholds, which made it worse, and spent an hour tweaking before realizing the real problem wasn’t the rate limiting algorithm at all. It was the number of concurrent requests.
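The blunt instrument for too many concurrent requests is to cap them directly, a semaphore around every agent call; a minimal asyncio sketch, illustrative rather than necessarily what we shipped:

```python
import asyncio

MAX_CONCURRENT = 3  # illustrative; low enough to stay under the rate limit
_slots = asyncio.Semaphore(MAX_CONCURRENT)

async def limited(coro):
    """Await an agent call, but never run more than MAX_CONCURRENT at once."""
    async with _slots:
        return await coro

# Wave 1 then becomes: fire everything, let the semaphore meter it out, e.g.
# reports = await asyncio.gather(*(limited(run_agent(a)) for a in angles))
```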
“We’re burning through API budget faster than I’d estimated anyway.” I pulled up the API usage projections, which made for grim reading. Query analysis: one lightweight model call. Per-agent research: multiple calls across different models. Per-agent quality scoring: one expensive model call per agent. Synthesis: one expensive model call.
“Combine operations.” I pulled up the prompt implementation.
“How?”
“Query analysis is already doing heavy lifting. Extend it further. One unified call with research angles, domain classifications, source recommendations, agent routing, priority scores. Everything up-front.”
“Single prompt?”
“Single structured output. Then Wave 1 is pure agent work with no coordination overhead. Then post-Wave-1 synthesis combines result aggregation, quality assessment, and pivot decisions in one high-capability call.”
We refactored, and it took most of an afternoon, the prompts growing complex in the way that prompts do when you ask them to do real work. Multi-hundred-line instructions with detailed examples, edge cases, rubrics, the kind of thing that makes you wonder when prompt engineering stopped being a curiosity and started being actual engineering. But the architecture simplified.
Total LLM overhead: two calls regardless of how many agents participated.
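In outline, with hypothetical stubs where the real prompt calls go, the coordination loop reduced to two LLM touchpoints:

```python
import asyncio

# Hypothetical stubs so the shape is visible; the real bodies are prompt calls.
async def analyze_query(query: str) -> list[dict]:
    return []   # coordination call 1: lightweight model, full plan up front

async def run_agent(angle: dict) -> str:
    return ""   # an agent's own research; its calls aren't coordination overhead

async def synthesize_and_assess(query: str, angles: list[dict], reports: list[str]) -> str:
    return ""   # coordination call 2: expensive model; aggregation, quality, pivots, synthesis

async def run_research(query: str) -> str:
    angles = await analyze_query(query)                               # call 1
    reports = await asyncio.gather(*(run_agent(a) for a in angles))   # Wave 1, pure agent work
    return await synthesize_and_assess(query, angles, reports)        # call 2
```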
“More efficient,” Marvin said.
I pulled up the budget breakdown, which told a satisfying story. The lightweight analyzer was negligible cost, agent calls dominated the budget, the expensive per-agent quality scoring had been eating a substantial portion, synthesis was moderate. The consolidation cut costs meaningfully, and while it didn’t sound like much per query, multiplied by thousands of queries it became real money.
Prompts Are Production Code
“The prompts are very complex now.”
They were, and there was no denying it. The query analysis prompt ran hundreds of lines with detailed instructions, example outputs, edge case handling, output format specification, quality criteria.
“Prompt engineering is production code now.”
“Odd thought.”
I stared at the prompt file, wondering when it had become two hundred lines and when it had started needing refactoring, which are the questions you ask when something has crept up on you while you weren’t looking.
“Perhaps prompts need version control.”
“Perhaps prompts need everything code needs: testing, review, documentation, the works.”
“Start using the first iteration.”
“Accept that the next iteration will be better.”
“A philosophy, or an excuse?”
“Both.”
Tomorrow we’d run it and discover what we’d missed.
Next in series: Story 5 - Citation Validation - Building validation to catch hallucinated sources and ensure routing to the right places produces genuine research, not expensive theater.