Five stories laid the foundation. Citation validation caught hallucinations but missed vendor bias, and now came the moment of truth: running the complete system end-to-end with a real query. Does it work? Spoiler: yes, but with reservations, which is honestly the best outcome one can hope for with software built over six days by someone whose relationship with optimism borders on the pathological.

Story 6 of 13 in The Adaptive Research System series (Refinement).


Six days of building, four refactors, and one very long argument about citation format had brought me to this moment, the point where I would discover whether I’d created a sophisticated research orchestration system or merely an elaborate mechanism for generating expensive nonsense. The query I was about to submit would expose everything that had been held together by optimism and duct tape.

“Agentic AI security 2025,” I typed into the command line, “including emergent threats and proven methods to secure against the threats.”

I’d spent the last six days making architectural decisions I’d either stand behind or regret, a cascading series of choices that had seemed reasonable in isolation but were about to be stress-tested in combination. Perspective-first routing instead of raw query analysis, two-wave exploration that let the system decide when to stop rather than blindly launching a fixed agent count, quality scoring that tried to measure something meaningful about agent output, citation validation to catch the inevitable hallucinations. Each layer stacked on the previous ones, and any of them could be the weak link that brought the whole thing down.

The query was the sort of request that would expose every weakness I’d papered over with optimism, spanning multiple domains and requiring both technical depth and social awareness, the kind of thing that would reveal whether my architectural choices were sound or whether I’d spent six days building an elaborate machine for setting money on fire. I reached for the coffee, knowing this was going to take a while, and hit enter.

The system came alive, query analysis kicking off first with complexity detection and domain classification before spinning up perspective generation, and within seconds eight distinct research angles emerged on screen. The allocation made a certain kind of sense once you saw it laid out: five academic-focused agents for papers and enterprise standards, two technical-framework agents, one social-media agent to catch threat discussions on platforms where practitioners actually talked to each other, and one video-content agent because tutorials and visual material still counted as research even if my instincts said otherwise.

“Not just ten identical agents grinding through the same searches.” I gestured at the screen where the perspective assignments were still populating.

“The perspective-first routing appears to be functioning as designed. Each agent receives a distinct research angle rather than a generic copy of your query.”

“We’ll see if they follow it.”

Wave 1 launched with eight agents in parallel, each receiving its perspective, its platform requirements, its output template. The session directory started filling with research files. Time passed in the way time passes when you’re watching progress bars and trying not to refresh the logs more than once per minute, which is to say slowly and with mounting anxiety. The autumn light shifted across the desk. Eventually, eight files sat complete, and the collection phase was done.

I pulled up the pivot analysis, half-expecting disaster, because six days of building anything is exactly enough time to develop confidence without developing competence.

Seven of eight agents had scored EXCELLENT, with the eighth managing a respectable GOOD, and between them they’d cited over three hundred sources spanning arXiv papers, social media discussions, vendor blogs, practitioner forums, and video channels. I stared at the numbers for a moment, not quite trusting them. Good news in software tends to be just bad news that hasn’t finished loading.

“That’s actually good.” I leaned forward, scrolling through the quality breakdown.

“All agents exceeded the minimum length threshold. Cited multiple sources per claim. Reported high confidence levels. Covered their assigned platforms.”

“So the Wave 2 decision is what, exactly?”

The pivot analysis appeared.

SKIP WAVE 2: Sufficient coverage across all expected platforms. Quality excellent. No critical gaps detected.

I’d built the system to be adaptive, to assess whether more research would add value or just burn money launching specialists to rehash what Wave 1 already found, and here it was, making that call on its own. The system was deciding Wave 2 wasn’t worth the cost, which meant it wasn’t just blindly launching agents because I’d given it permission.
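That adaptive call reduces to a small predicate over quality and platform coverage. Everything here is hypothetical (`AgentResult`, `EXPECTED_PLATFORMS`, and the exact quality labels are invented for illustration, not the system's actual API):

```python
from dataclasses import dataclass

# Hypothetical: the platforms Wave 1 was expected to cover.
EXPECTED_PLATFORMS = {"academic", "technical", "social", "video"}

@dataclass
class AgentResult:
    quality: str          # e.g. "EXCELLENT", "GOOD"
    platforms: set[str]   # platforms the agent actually covered

def should_skip_wave_2(results: list[AgentResult]) -> bool:
    """Skip Wave 2 when quality is high and no platform gap remains."""
    covered = set().union(*(r.platforms for r in results))
    gaps = EXPECTED_PLATFORMS - covered
    all_good = all(r.quality in ("EXCELLENT", "GOOD") for r in results)
    return all_good and not gaps
```

Note what this predicate does not check, which is exactly the weakness Marvin flags below: it trusts the agents' own coverage claims and says nothing about citation quality.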

“Though determining sufficiency from quality scores alone may prove…” Marvin paused, diplomatically. “Optimistic.”

He was right, of course. Quality scores measured how well agents executed their assigned perspectives. They didn’t measure whether those perspectives covered all relevant platforms. Or whether hundreds of citations would balloon the context until only dozens made it into synthesis. Or whether vendor blogs were crowding out independent sources. But I wouldn’t discover those gaps until later, much later, when the consequences had had time to compound themselves into proper problems.

“The alternative is launching Wave 2 every single time,” I said. “Six more agents, six more API bills, thirty more minutes of waiting. For what? To find a seventh article saying prompt injection is bad?”

“A fair point.”

Thirty Seconds to Die

The Gemini integration was supposed to be straightforward. We had Claude working, we had keyword analysis working, and adding a third model to the ensemble seemed like the sort of enhancement that takes an afternoon. Then nothing happened, which is always more alarming than something going wrong because at least errors give you something to grep for.

No error messages appeared. No progress updates materialized. Just silence where Gemini results should have appeared, and “(No content)” messages where perspective analysis should have been. I gave it two minutes out of professional courtesy, then checked the logs.

“The Gemini calls appear to be timing out. Rate limiting, most likely. The API has aggressive quotas. I’ll add exponential backoff.”

“Do that.”

He did, implementing five seconds between retries, then ten, then twenty, standard best practices with textbook implementation, the sort of approach that works reliably for transient failures. The calls kept timing out anyway. I checked the clock. Checked the logs again. Still timing out. Added more aggressive backoff, longer delays, jitter to spread the requests. Nothing helped. The Gemini perspective generation would start, run for exactly thirty seconds, and then silently die without producing any output, like a conversation partner who simply stops mid-sentence and stares vacantly at the wall.
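The pattern Marvin reached for is standard retry-with-backoff. A minimal sketch, assuming any zero-argument call that raises on transient failure (the function name and parameters are illustrative, not the system's actual code):

```python
import random
import time

def call_with_backoff(call, retries=4, base_delay=5.0):
    """Retry a flaky call with exponential backoff plus jitter.

    Illustrative sketch: delays double each attempt (5s, 10s, 20s...)
    and jitter spreads concurrent retries apart in time.
    """
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise                                  # out of retries
            delay = base_delay * (2 ** attempt)        # exponential growth
            delay += random.uniform(0, delay * 0.5)    # jitter
            time.sleep(delay)
```

The pattern is sound for transient failures like real rate limiting, which is precisely why it did nothing here: the failures weren't transient.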

[Image: Three-minute timer showing Marvin's failed prediction]

“It’s not rate limiting.” I pulled up the subprocess config. “Rate limiting returns errors. This just… stops.”

“The requests are being sent correctly. The API is receiving them. Something is terminating the connection before the response arrives.”

I stared at the code. We’d built the Gemini integration using subprocess calls, spawning the CLI tool for each request, which provided clean separation and easy testing, the same pattern we used for a dozen other tools. And there it was, buried in the spawn configuration: a thirty-second timeout.

“Marvin. How long does Gemini typically take to respond to these prompts?”

A pause followed, the sort of pause that happens when confidence meets data and loses.

“The Gemini API processes complex prompts in approximately fifty seconds.”

“Fifty seconds.”

“Yes.”

“And our subprocess timeout is thirty seconds.”

“…Yes.”

I’d built the system to spawn CLI tools with what seemed like a generous timeout, and then pointed it at an API that legitimately needed nearly twice that long to respond. Every single Gemini call was being killed mid-flight, twenty seconds before it would have succeeded.
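The failure mode is easy to reproduce in miniature. This sketch shrinks the numbers (a 2-second sleep standing in for the ~50-second API call, a 1-second timeout standing in for the 30-second spawn config); presumably the real wrapper swallowed the exception somewhere, which would explain the silence in the logs:

```python
import subprocess

# A subprocess that needs longer than its configured timeout is
# killed before it can answer -- no API error, no partial output.
try:
    subprocess.run(
        ["sleep", "2"],   # stands in for a ~50s Gemini CLI call
        timeout=1,        # stands in for the 30s subprocess timeout
        check=True,
    )
    outcome = "completed"
except subprocess.TimeoutExpired:
    outcome = "killed before the response arrived"
```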

“So we’re not hitting rate limits at all.”

“No. We are murdering valid requests and blaming the vendor for the bodies.”

The fix wasn’t adding more backoff or spreading requests across longer windows, because the problem wasn’t transient failures or aggressive quotas. The fix was architectural. I spent the afternoon refactoring the Gemini integration from subprocess spawning to direct module imports, removing the artificial timeout entirely and letting the API take as long as it needed.

The next test run showed Gemini completing in fifty-three seconds, not fast by any measure but functional, and the “(No content)” messages disappeared. The ensemble analyzer started producing three-model consensus instead of two-model fallback.

“I spent two days debugging rate limiting,” I closed the debug window, “when the problem was a number I typed some time ago.”

“The thirty-second timeout seemed reasonable at the time.”

“It was reasonable. For tools that respond in thirty seconds.”

“A constraint that perhaps should have been validated against reality before becoming load-bearing infrastructure.”

I chose not to dignify that with a response, largely because any response would have involved admitting he was right.

With Reservations

When the synthesis finally completed, the report appeared in the session directory. I opened it with the cautious optimism of someone who’d been building systems long enough to expect disappointment. It began with a table of contents that linked to sections, which felt like a small miracle, followed by an executive summary and a methodology section that documented which agents had researched which perspectives. Eight distinct findings sections followed, one per perspective, and then an integrated analysis that pulled cross-cutting themes together before ending with emergent research directions and over 150 unique citations in proper IEEE format.

“This is good.” I scrolled to the findings. “Actually good.”

The findings were solid. Prompt injection emerged as the primary threat vector against agentic systems, with attack patterns documented across major frameworks. Novel exploitation techniques got catalogued with timelines and technical details. Self-replicating threats, data exfiltration vectors, the whole taxonomy of ways AI agents could be compromised or weaponized. A sackful of tools was listed with descriptions, links, and categorical organization.

I scrolled through it all, looking for the places where the seams showed, where the synthesis would reveal that the agents hadn’t actually read the sources, or the citations didn’t support the claims, or the whole thing was just sophisticated-looking nonsense. The seams were there, but they were livable, which is about as much as one can ask of a system built over six days by someone who still hadn’t figured out how to make the coffee machine produce anything stronger than medium roast.

“The original system would have launched ten identical agents.” I pulled up the side-by-side comparison. “Overlapping information everywhere. No quality assessment. Hours of manual work to synthesize anything useful from the pile.”

“This system allocated eight agents to distinct perspectives. Scored their quality. Validated their citations. Synthesized with source attribution. Generated a structured report following academic conventions.”

“So it works.”

“With reservations.”

[Image: Checklist showing successes and warnings, empty coffee pot]

I pulled up my notes and started cataloguing the rough edges. API timeouts remained unpredictable even after the Gemini refactor, vendor bias still skewed results toward whoever had the best SEO budget, citation utilization showed that hundreds of gathered sources became dozens in final synthesis, the command file had ballooned past 100 kilobytes and needed splitting, and platform coverage felt fragile enough that an agent could claim to have searched professional networks without actually doing the work. Five things to fix, which was honestly fewer than I’d expected.

I looked at the synthesis report again. It answered the original question comprehensively. The methodology was documented clearly enough that someone else could evaluate the research process. Sources were validated and accessible. The emergent directions section pointed toward six logical follow-up queries with specific agent recommendations. For a first serious end-to-end test with a complex multi-domain query, this was better than I’d expected.

“Low expectations?”

“Realistic ones. The old system gave me ten walls of text to manually parse and synthesize. This gave me a structured report I can actually use. Quality scores that meant something. Citations I can trust. A methodology I can defend.”

“Then the question isn’t whether it works perfectly.”

“It’s whether the flaws are livable compared to the alternative.”

“Good enough.” I closed my notebook. That was the real lesson, wasn’t it? Perfect doesn’t exist. The question is whether the flaws are livable compared to the alternative, and whether you can fix them without breaking what already works. I saved the synthesis report and opened the session directory to look at the files: eight research outputs, one pivot analysis, one synthesis report, five things to fix tomorrow.

“A successful first test, then.”

“With reservations.” I pushed back from the desk.

“Those tend to be the most instructive kind.”

Tonight, the system worked. Tomorrow, we’d discover what it had missed.


Next in series: Story 7 - Citation Utilization - Discovering how 158 gathered citations became just 30 in the final synthesis, revealing the real problem hiding in the system’s successful test run.
