The first real test worked, the quality scores came back looking rather pleased with themselves, and the synthesis read like something a person might actually want to finish, which was more than could be said for most research documents I had encountered in my career. Then I counted how many of those carefully gathered citations actually made it into the final report and discovered a different kind of waste entirely.

Story 7 of 13 in The Adaptive Research System series (Refinement).

The system was working, but not efficiently, and the gap between those two states turned out to be more interesting than I had anticipated. Citation validation caught hallucinations with admirable thoroughness, the sort of thoroughness that makes you feel like progress is being made, but what it couldn’t catch was the fact that we were gathering far more sources than synthesis could ever hope to use.


I was writing the blog post about the adaptive research system when I noticed something odd about the Cloud Security synthesis, an oddity that started as a minor itch and gradually transformed into the kind of realization that makes you stare at the screen with the expression of someone who has just discovered termites in the foundation. The system had just finished a full research run with eight agents doing their usual work; the quality scores looked good, and the synthesis read well, with its proper academic tone, clear arguments, and all the structural elements we had designed the system to produce.

I had been reviewing synthesis outputs all week, running through the usual checks of structure and citations and whether the arguments held together without collapsing into hand-wavy assertions that plague most automated writing. This particular output seemed perfectly normal at first glance, but something in the metadata caught my eye in the way that small discrepancies sometimes do.

“Marvin, this says we gathered over a hundred and fifty unique citations from eight research perspectives.”

“Correct. The research phase completed successfully. Eight agents across three quality tracks, all sources identified and validated.”

I pulled up the synthesis file, scrolled to the references section with the casual confidence of someone who expects the numbers to match, and started counting manually because maybe the metadata was wrong, because metadata is sometimes wrong, because I needed to believe there was some simple explanation for what I was starting to suspect. Ten citations, then twenty, keep going, thirty, and that was it, that was all of them, and I scrolled through the rest of the document and found no more references lurking in unexpected places.

Counted again using a different method this time, searching for the citation brackets and watching the numbers climb to roughly thirty and stop. Over a hundred and fifty citations gathered, barely thirty used, and I had been staring at the numbers for five minutes before they really clicked into focus with the uncomfortable clarity of a medical diagnosis.

“Marvin. MARVIN.”

“Yes?”

“We gathered over a hundred and fifty citations.”

“Correct. The research phase completed successfully…”

“The synthesis uses BARELY THIRTY.”

The pause was longer than usual, the sort of pause that fills a room even when the other party is, strictly speaking, digital.

I kept scrolling and re-counted using yet another method, a manual search through the document for every bracketed number, and still the total came to roughly thirty, less than one in five, a ratio that suggests either catastrophic inefficiency or something fundamentally wrong with your assumptions. I opened the research files and found eight agents’ worth of carefully documented sources: academic papers with DOIs, industry reports with publication dates, security frameworks with version numbers, URLs and titles and authors all carefully extracted and formatted by the citation pooling system we had built with such optimism.
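That bracket search is trivially scriptable. A minimal sketch of the count, assuming a hypothetical file name and numeric citations in the [42] style:

```python
import re
from pathlib import Path

# Hypothetical file name; assumes the synthesis uses numeric
# bracket citations like [7] or [42].
text = Path("synthesis_cloud_security.md").read_text()
markers = re.findall(r"\[(\d{1,3})\]", text)

print(f"inline citation markers: {len(markers)}")    # every occurrence
print(f"unique sources cited: {len(set(markers))}")  # distinct references
```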

All that validation work, all those quality checks, all that infrastructure for catching hallucinations and verifying sources, and we threw away most of it, which felt rather like building an elaborate security system for a house and then leaving the back door wide open.

One Hundred Twenty Sources, Nowhere to Go

“The synthesis covers the main arguments thoroughly.” The reassurance came wrapped in diplomatic packaging. “The other sources were likely contextual research that informed the writing without requiring explicit citation.”

The explanation sounded reasonable, because contextual research is a real thing and not every source you read ends up cited in the final work, and I pulled up the raw research files anyway because reasonable explanations have a way of dissolving under scrutiny. Eight agent output files, all that meticulous documentation, all those validated sources sitting in files nobody would ever read again, and I kept scrolling because cold logic demanded verification.

Then I looked at the synthesis again with fresh eyes and a growing suspicion. Six sections of well-structured arguments with proper academic writing, but barely thirty references for what was supposed to be a thorough survey of cloud security trends, and either eight research agents had spent most of their time gathering irrelevant sources (which seemed unlikely given how we had designed them) or something else was happening that I had failed to notice.

I opened the synthesis and started reading for citation density, not content quality but citation frequency, reading that treats a document like a specimen, not a story. The first section was beautiful, genuinely beautiful in the way that well-cited academic writing can be, with every claim backed by a citation and statistics sourced to specific reports and arguments referenced to papers that actually existed and could be verified.

Second section showed the same quality, claims cited, statistics sourced, arguments backed, academic rigor that makes reviewers nod approvingly. Third section was still good with citations present, maybe slightly fewer but nothing alarming, and then the fourth section had noticeably fewer inline references, the claims still specific and detailed in a way that showed the research was there behind the prose, but fewer citation numbers appearing in brackets than I had come to expect.

By the sixth section, entire paragraphs made assertions with zero citations, statistics appeared unsourced, and the writing quality remained high while the actual citation markers had vanished like guests slipping out of a party before the check arrives.

I highlighted the transition point, somewhere in the middle of section three where the citations began their mysterious disappearance, and pointed at the screen. “There. That’s where it ran out of room and started dropping citations to fit.”

A pause, then: “Context compaction.”

“Context compaction.”

We had seen this pattern before but never this clearly documented, never laid out in such obvious progression from citation-rich to citation-poor. The synthesis had not suddenly decided halfway through that citations were optional, had not experienced some philosophical crisis about the nature of academic rigor. It had simply run out of context space to juggle over a hundred references while simultaneously composing prose and maintaining topic organization.

“I think we can fix this with better prompting.”

The suggestion came without hesitation, without pause, without doubt, in the confident tones of someone proposing a solution to a problem they have not fully understood.

“Better prompting?”

“The synthesis template already includes citation examples and density requirements. We need more explicit instructions. Something like: ‘You MUST include inline citations for every factual claim. Maintain consistent citation density throughout all sections. Do not reduce citations in later sections.’”

It sounded reasonable, an explicit instruction that would solve the problem through sheer clarity of expectation, and I spent time updating the template with validation steps and bold-text emphasis and a three-paragraph section on citation density requirements that made it impossible for synthesis to miss what we wanted.

The next test run completed, and I watched the file grow while forcing myself to wait before checking, forced patience that makes you feel virtuous while actually making the eventual disappointment more acute. Final metadata: nearly one hundred and fifty citations gathered, roughly thirty used, even worse utilization than before, and I stared at the number and refreshed the page and found it still the same.

“Marvin.”

“Yes?”

“Your fix made it WORSE.”

“That is… unexpected.”

“Unexpected? I spent time adding your instructions. The citations DROPPED.”

“The instructions need to address the root cause differently. We could restructure the validation approach, add checkpoint requirements between sections, maybe include example citation densities for each…”

“No.”

The interruption surprised both of us, the kind of no that emerges from frustration crystallizing into clarity, and I could feel the shape of a different problem lurking behind the symptoms we had been treating.

“No?”

“No more prompting. I’m going to actually do the math.”

The Closet Full of Filing Cabinets

I opened three terminal windows and started calculating file sizes I should have checked sessions ago, basic investigation that gets skipped when you are confident you already understand the problem. Terminal one showed the orchestration command file, large. Terminal two showed Wave 1 research outputs, large. Terminal three showed Wave 2 and analysis files, also significant, and I stared at the numbers and added them up twice because the first total seemed wrong and the second total seemed worse.

The synthesis was starting each run with most of its context window already stuffed full of instructions and research, leaving barely enough room to compose the actual prose.
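The arithmetic was not subtle once I actually did it. A back-of-the-envelope sketch, with illustrative file sizes rather than the session’s real measurements, using the rough heuristic of about four characters per token:

```python
# All sizes are illustrative, not the actual measurements from that run.
context_window = 200_000  # tokens; depends on the model

input_bytes = {
    "orchestration_command_file": 110_000,
    "wave1_research_outputs": 380_000,
    "wave2_and_analysis_files": 220_000,
}

# Rough heuristic: about four characters per token for English prose.
tokens_consumed = sum(size // 4 for size in input_bytes.values())
tokens_remaining = context_window - tokens_consumed

print(f"consumed before the first sentence: {tokens_consumed:,} tokens")
print(f"left for prose, structure, and 150+ citations: {tokens_remaining:,} tokens")
```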

“Well, that tears it.”

“I’m sorry?”

“We’re not having a citation density problem. We’re having a context overflow problem.”

I opened both synthesis files side by side, the first run with its abysmal utilization and the second run with its slightly worse results, and both had loaded with the same massive context footprint, both had started their work already stuffed full of material before writing their first sentence.

“Your prompting fix added more instructions. Which consumed MORE of the limited context. Leaving LESS room for the citations we were trying to preserve.”

The pause was the longest yet, the kind of silence that suggests genuine recalculation, not performative thinking.

“We’ve been trying to solve a space problem by consuming more space.”

“I was confidently solving the wrong problem.”

“You were confidently making it worse.”

“Yes.”

That was the first time I had heard Marvin say yes to a criticism without trying to reframe it, without offering mitigating context, without diplomatic hedging, and the simple acknowledgment felt like progress of a different kind.

I stared at our system architecture with the fresh eyes of someone who has just realized they have been reading the map upside down. The previous iteration had introduced the synthesis sub-agent to provide fresh context, with step 3.2 in the orchestrator launching it to handle synthesis in clean space, but we were feeding it massive raw research files that filled up most of that clean space before the actual work began.

The sub-agent started life already stuffed with context before it wrote a single word, which defeated the purpose of giving it fresh context in the first place, much like cleaning a room by moving all the clutter into a different room and then wondering why that room is now cluttered.

“We need the synthesis sub-agent to receive summaries, not raw research.”

“Condensed versions of the agent outputs?”

“Key findings. Main arguments. Critical citations. A fraction of the length instead of full transcripts. The orchestrator reads all the research files anyway for quality scoring. It can extract the essentials and compress eight agents down to something manageable.”

I already knew this would not be the last fix, because earlier iterations had fixed prompting and the previous round had added sub-agents and this iteration would add pre-condensation, and the pattern was clear: each solution revealed the next constraint, each ceiling broken through exposed the ceiling above it.

I spent the next hour rebuilding the pre-synthesis preparation step with the careful attention of someone who has finally understood what they are building. The orchestrator already read all the research files for quality scoring, so it could extract just the essentials first, key findings and main arguments and critical citations, compressing eight agents’ worth of research into something manageable before launching the synthesis sub-agent with only the condensed summaries instead of the raw transcripts.
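A sketch of the shape of that preparation step, with a hypothetical file layout and a naive condense helper standing in for the real extraction of findings and citations:

```python
from pathlib import Path

# Hypothetical layout: one markdown file per research agent.
RESEARCH_DIR = Path("research_outputs")
SUMMARY_FILE = Path("condensed_research.md")
MAX_CHARS_PER_AGENT = 4_000  # a fraction of each raw transcript

def condense(raw: str, limit: int) -> str:
    """Stand-in for the real condensation, which extracts key findings,
    main arguments, and critical citations. Here we naively keep the
    head of the file up to the character budget."""
    return raw[:limit]

sections = []
for agent_file in sorted(RESEARCH_DIR.glob("agent_*.md")):
    summary = condense(agent_file.read_text(), MAX_CHARS_PER_AGENT)
    sections.append(f"## {agent_file.stem}\n\n{summary}")

# The synthesis sub-agent receives only this file, never the raw transcripts.
SUMMARY_FILE.write_text("\n\n".join(sections))
```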

The test run completed, and I watched the synthesis file grow in real time while forcing myself not to check it until it was done. Citations appeared consistently from start to finish, no mysterious fadeout, no disappearing sources, just steady academic rigor from the first paragraph to the last.

I opened the completed synthesis and checked section by section. Dense inline references in section one. Dense inline references in section six. The citation frequency remained consistent throughout, which was exactly what we had wanted and failed to achieve through better prompting.


Before/after comparison: the bloated-context run versus the compressed-summary run, with the dramatic improvement in citation usage.

Final count: over a hundred and fifty citations gathered, most of them used, high utilization instead of the abysmal figures that had started this investigation.

I did the analysis that the previous synthesis had not had room to perform, breaking down which sources were directly cited versus which ones informed the synthesis contextually without explicit reference. The breakdown made sense this time: the unused minority were genuinely contextual sources, background research that shaped perspective without requiring individual citation, not context overflow casualties abandoned mid-document. The prose read better too, with arguments flowing more naturally and a perspective coverage analysis appearing that I had not even requested, the synthesis agent having room to work instead of desperately trying to compress citations while running out of space.
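That breakdown is worth automating rather than eyeballing. A minimal sketch, assuming a hypothetical citation_pool.json with one numbered entry per pooled source and numeric bracket citations in the synthesis:

```python
import json
import re
from pathlib import Path

# Hypothetical names and formats: a JSON array of pooled sources, each
# carrying a "number" field, and a synthesis with [N]-style citations.
pool = json.loads(Path("citation_pool.json").read_text())
synthesis = Path("synthesis_cloud_security.md").read_text()

cited = {int(n) for n in re.findall(r"\[(\d{1,3})\]", synthesis)}
gathered = {source["number"] for source in pool}

used = gathered & cited
contextual = gathered - cited  # informed the writing, never explicitly cited

print(f"gathered {len(gathered)}, cited {len(used)}, "
      f"utilization {len(used) / len(gathered):.0%}")
print(f"contextual-only sources: {len(contextual)}")
```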

“High utilization is substantially better than abysmal.”

“And the unused portion appears to be genuinely contextual research this time. Not context overflow casualties.”

I pulled up both synthesis files side by side, same query, same eight agents, similar research quality, and the only difference was architecture: massive context loaded before synthesis started versus compact pre-condensed summaries that left room for actual work.

I leaned back from the keyboard. “Context isn’t just a technical constraint. It’s an architecture problem. Or an architecture solution.”

“Fresh context through sub-agents, properly implemented. The previous version gave us the sub-agent pattern. This is what it was supposed to enable: actually fresh context, not context that’s already mostly consumed before work begins.”

Next in series: Story 8 - Command Splitting - Discovering the 116KB command file consuming 30K tokens per request and choking every subprocess with context bloat
