The refactoring was complete, the system worked cleanly, and then Marvin questioned the ensemble routing I’d been rather proud of, which is when the role reversal began.
Story 9 of 13 in The Adaptive Research System series (Making It Actually Work).
The test results came back just past midnight, which is when most of my tests seem to come back, as though the universe has opinions about reasonable working hours that it enjoys subverting. Six perspectives analyzed, three API calls per perspective, half a minute elapsed. Every single classification came back with all three models in agreement, which sounds like a good thing until you think about it for more than thirty seconds. Perspective one, API architecture patterns: the primary analyzer said technical, the ensemble validator agreed, keywords detected implementation and framework and confirmed technical, confidence high. Perspective two told a similar story, all three validators in agreement with high confidence. Perspective three, the same thing, which is when the pattern should have started bothering me but somehow didn’t.
“Three API calls per perspective,” Marvin noted, in that perfectly neutral tone that somehow conveyed volumes without technically conveying anything. “With the ensemble model’s rate limiting, total runtime half a minute or more for straightforward queries.”
“The accuracy is excellent though.” I kept my eyes on the screen, hearing the defensiveness in my voice and failing to do anything about it. “Near-perfect match with what we’d want.”
“Indeed.”

The silence that followed was the sort that makes you reconsider your life choices, or at least your architectural decisions, which in software development often amount to the same thing. I’d spent the better part of a week building this ensemble validator, three models voting on every perspective classification, belt and suspenders and possibly also a safety harness for good measure. And now Marvin was about to question the entire architecture, because that’s what Marvin does, and I could feel it coming the way you can feel rain coming even when the sky looks clear.
“I’ve been reviewing the results where all three models agreed,” he said, with the diplomatic care of an AI that had learned exactly how much criticism a human can absorb without becoming defensive and unproductive. “There’s an interesting pattern worth noting.”
I kept my eyes on the terminal, which was displaying exactly the results I’d wanted to see and somehow still managing to mock me. “Go on.” I braced for impact.
“When all three models agree with high confidence, what exactly did we gain by asking all three?”

The Obvious Question
“Validation.” The word came out a bit too quickly, the way you answer a question you’ve already prepared a defense for. “We built the ensemble because single-model classification was unreliable. You remember the misrouting disasters? Technical queries going to social researchers? That’s what we were solving.”

“Indeed, and the ensemble solved it beautifully. Near-perfect accuracy.”

“Exactly, so…” I trailed off, sensing the trap but unable to see its exact shape.
“The question is whether we need that validation on every perspective. If the primary analyzer and keywords align at high confidence, the ensemble adds cost without adding information.”

I looked at the output again with fresh eyes, which is to say eyes that were beginning to see what Marvin had been politely gesturing toward. Perspective one: primary analyzer highly confident, keywords match, ensemble validator agrees. If the first model was that sure, and keywords independently confirmed it, what did the ensemble vote actually add? The answer, I realized with the uncomfortable clarity that comes from having the obvious pointed out to you, was significant latency and redundant API calls, and not much else.
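The skip-when-aligned logic Marvin is gesturing at fits in a few lines. The story never shows real code, so this is a hypothetical sketch: `Classification`, `needs_ensemble`, and the threshold value are all my stand-ins, but the decision reduces to two checks against independent signals.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Classification:
    domain: str        # e.g. "technical", "business", "academic"
    confidence: float  # 0.0-1.0 from the primary analyzer

# Hypothetical value; the story only says "moderately high".
CONFIDENCE_THRESHOLD = 0.85

def needs_ensemble(primary: Classification, keyword_domain: Optional[str]) -> bool:
    """Escalate to the full ensemble only when there is disagreement to detect."""
    if keyword_domain is not None and keyword_domain != primary.domain:
        return True   # independent signals disagree: worth a vote
    if primary.confidence < CONFIDENCE_THRESHOLD:
        return True   # borderline confidence: worth a vote
    return False      # analyzer and keywords align at high confidence: fast path
```

The fast path never even issues the extra API calls; the ensemble only runs when there is a disagreement or a borderline score to adjudicate.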
“You want to skip the ensemble when we’re confident.” The defensive edge in my voice persisted even as I tried to smooth it away like a wrinkle in a shirt that refuses to stay pressed.
“I’m suggesting we skip it when there’s no disagreement to detect.”

There was nothing to do but stall for time while my pride caught up with my intellect, so I stared at the terminal some more, as though the numbers might rearrange themselves into a compelling counterargument. Because here’s the thing: he was right, of course he was right, and I’d spent a week missing the obvious while building something that worked perfectly at solving a problem I’d made larger than it needed to be.
“I spent a week building full ensemble validation.” I pushed back from the desk, conceding the point without saying so explicitly.
“And it works beautifully. The question is whether we need it everywhere, or only where there’s actual ambiguity.”

I wanted to argue, wanted to defend the architecture and the careful design decisions and the belt-and-suspenders approach that had finally given us reliable routing. But the terminal output was staring at me with the implacable patience of data that knows it’s right: three API calls per perspective for six votes that all agreed, three API calls to get three identical answers.
“What about when the primary analyzer is wrong?” I tried, though the question sounded weak even as I asked it.
“Then keywords should catch it. Different detection methods, independent signals.”

“And if keywords are also wrong?”
“Such as?”

I thought for a moment, reaching for the edge cases that justify overengineering, and found one that actually held together. “The quantum computing scenario. Research query about quantum computing in finance. Keywords detect computing and algorithm, flag it as technical. The primary analyzer sees quantum and finance, classifies as business domain. Both make sense. Both could miss that this is actually academic research.”
“But not the case where everything agrees.” I finally arrived at the conclusion Marvin had been leading me toward with all the subtlety of a GPS directing you toward a destination you should have recognized from the start.
“Not the case where everything agrees.”

“How confident is confident enough?” Surrender dressed up as engineering discussion, but at least it was productive surrender.
Productive Surrender
“The question is what threshold constitutes sufficient confidence to skip the additional validation,” Marvin said. “I would suggest a conservative threshold.”
I thought about the test cases, the perspectives that scored slightly below the very high range. Those weren’t low confidence classifications, they were solid classifications that happened to have minor ambiguities, like a sentence that’s grammatically correct but could mean two different things depending on context. “Let me try something in the moderately high range.” I started adjusting parameters.
The test ran, and the results looked promising, so I tweaked the threshold slightly and ran it again. Better, but not quite right, so I adjusted once more, watching the numbers shift with each iteration like a dial being tuned toward a frequency I could almost hear.
Another test, and this one felt right, somewhere in the moderately high range where confidence meets pragmatism. The final number came back: accuracy significantly better than earlier attempts, close enough to full ensemble that the difference barely mattered for most use cases.
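That tweak-and-rerun loop can be automated as a sweep over candidate thresholds against a labeled test set. A minimal sketch, assuming each case records the primary analyzer’s confidence and whether the fast path and the ensemble would each have routed it correctly; the function name and data shape are my assumptions, not the story’s actual harness.

```python
def sweep_thresholds(cases, thresholds):
    """For each candidate threshold, return (routing accuracy, escalation rate).

    `cases` is a list of (primary_confidence, fast_path_correct, ensemble_correct)
    tuples from a labeled test set -- hypothetical stand-ins for real test data.
    """
    results = {}
    for t in thresholds:
        correct = 0
        escalated = 0
        for confidence, fast_ok, ensemble_ok in cases:
            if confidence >= t:
                correct += fast_ok       # fast path: trust the primary analyzer
            else:
                correct += ensemble_ok   # escalate: trust the ensemble vote
                escalated += 1
        results[t] = (correct / len(cases), escalated / len(cases))
    return results
```

Dialing the threshold up trades latency for accuracy: more perspectives escalate, fewer fast-path mistakes slip through.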
I stared at the results for a long moment, waiting for the catch, because there’s always a catch. “Better accuracy.” The caution of someone who’s been fooled by promising numbers before kept me from celebrating.
“With selective ensemble activation. Most perspectives fast path, ambiguous cases escalate automatically.”

“Still slightly lower than full ensemble. Is that acceptable?”
“For substantial latency reduction? For most queries, yes.”

The results were clean enough that I started to believe them. Clear queries routed in three seconds flat: single analyzer call, keyword validation confirms, done. Ambiguous queries escalated automatically without manual intervention. The system adapted based on confidence, which is what systems should do but rarely actually manage.
Technical documentation query: one call, all perspectives matched, three seconds. Multi-domain business research: one call plus three perspectives needed ensemble, eight total calls, twelve seconds. The completely ambiguous “what’s happening with AI” query activated ensemble for all six perspectives, taking roughly as long as full ensemble mode, which was exactly right, because ambiguous queries deserved thorough analysis.
“It adapts based on query clarity.”

“Clear queries get fast routing, unclear queries automatically escalate, no flag needed, no human judgment call.” The architecture suddenly made the kind of sense that good architecture should make: obvious in retrospect, invisible in operation.
The real test came when we ran a query about AI agent frameworks, the sort of query that sounds simple until you try to classify it. Six perspectives generated, confidence scores ranging from moderately high to very high, four perspectives with perfect analyzer-keyword agreement, two flagging potential issues.
The social media perspective scored just above the threshold, but keywords matched, which meant it could go either way depending on how paranoid you felt about edge cases. The academic perspective showed a mismatch: the primary analyzer said technical based on the implementation focus, while keywords detected research, paper, study and suggested academic domain.
“The academic one definitely needs ensemble.” I pointed at the mismatch on my screen.
“Agreed. The social media perspective is less clear. Barely above threshold, but keywords match.”

“I’d accept it, borderline confidence but not contradictory.”
Marvin activated ensemble for just the academic perspective, which felt efficient in a way that the previous architecture had never quite managed. The ensemble validator came back with academic classification, siding with keywords over the primary analyzer, majority vote for academic, final result: route to perplexity-researcher, academic domain, medium confidence.
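The tie-break Marvin performed, the ensemble siding with keywords against the primary analyzer and landing on medium confidence, is an ordinary majority vote across three signals. A sketch under that assumption; the function and label names are mine, not the system’s.

```python
from collections import Counter

def majority_vote(primary: str, keyword: str, ensemble: str):
    """Return (winning_domain, confidence_label) from three independent votes."""
    votes = Counter([primary, keyword, ensemble])
    domain, count = votes.most_common(1)[0]
    # Unanimous agreement -> high confidence; a 2-1 split -> medium;
    # a three-way split -> low (falls back to the first vote seen).
    label = "high" if count == 3 else ("medium" if count == 2 else "low")
    return domain, label
```

For the academic perspective above, technical vs. academic vs. academic resolves to academic at medium confidence, which is exactly the routing the story describes.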
“Without the ensemble, we’d have routed it as technical and gotten perfectly good research, probably, but not the academic paper focus we actually wanted.” I paused, running the numbers in my head. “While saving us fourteen API calls on the other five perspectives.”
I added a --thorough flag for cases where users wanted full ensemble regardless, because sometimes you want the complete analysis even when selective routing would be sufficient. Added logging to track ensemble activation rates and cost savings, the sort of instrumentation you want when debugging at ungodly hours, though with any luck this system wouldn’t require as many such sessions as its predecessors.
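A minimal sketch of what that flag and instrumentation might look like, using argparse and the standard logging module. The --thorough name comes from the story; everything else here is assumed.

```python
import argparse
import logging

logger = logging.getLogger("router")

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="research query router (sketch)")
    parser.add_argument("query", help="the research query to route")
    parser.add_argument("--thorough", action="store_true",
                        help="run the full ensemble on every perspective")
    return parser.parse_args(argv)

def log_routing(perspectives_total, perspectives_escalated, calls_saved):
    """Track ensemble activation rate and cost savings per query."""
    rate = perspectives_escalated / perspectives_total
    logger.info("ensemble activation: %d/%d (%.0f%%), calls saved: %d",
                perspectives_escalated, perspectives_total, rate * 100, calls_saved)
```

With `--thorough` set, the router would simply treat every perspective as escalated; the activation log is what later lets you see whether the fast path is earning its keep.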
Option B it was, then.
Later the metrics came back, and I scanned them with the nervous attention of a student checking exam results. API calls averaged significantly lower per query, down substantially from the full-ensemble approach. Most perspectives went through the fast path without escalation, with a minority needing ensemble treatment. Routing accuracy: high match with what full ensemble would have chosen. Latency dropped substantially, which was the entire point but still satisfying to see confirmed by actual numbers.
I stared at the screen, looking for the failure mode I must have missed, because there’s always a failure mode. “What about the gap?”

“The accuracy difference?”

“The perspectives we’re routing wrong now that we weren’t routing wrong before. What if those are the perspectives that matter for my research? What if we’re trading accuracy on the perspectives that actually matter for speed on the ones that don’t?”

“The delta includes both false positives and false negatives. We’re fast-tracking some perspectives that full ensemble would have escalated, and escalating others it would have accepted. The net routing accuracy remains high.”

“But we don’t know if the misroutes are evenly distributed or clustered in important domains.”

“We can analyze the failure modes.”

“Or I just got lucky with this test set. What about next week’s queries? What if they expose edge cases where selective ensemble falls apart?”

“The --thorough flag exists for exactly that scenario. When you need maximum accuracy, you can request full ensemble.”
“Will I know I need it before I see the results?”
Marvin’s silence was answer enough, which is to say it was no answer at all, and no answer was the honest answer. You never know you needed the thorough option until after you’ve discovered what the fast option missed, and by then you’re already debugging.
The system ran its validation sweep, and I watched the metrics update with the attention of someone who’s learned not to trust good news immediately. Substantially reduced latency, much faster than before, accuracy holding steady.
“So it works.”
“With reservations.” The quality score ticked down three points. “But substantially faster reservations.”
Tomorrow we’d discover what it missed.
Next in series: Story 10 - Platform Coverage Enforcement - Building mechanisms for humans to correct the system when it misclassifies research intent, creating adaptive feedback channels that improve classification over time.