Citation validation was supposed to complete the system, the final piece after we’d built quality scoring, routing, platform coverage, and all those Wave 2 improvements. It caught the hallucinations beautifully, which made me feel rather clever for about fifteen minutes, and then it revealed something considerably worse.

Story 5 of 13 in The Adaptive Research System series (Discovery & Foundation).


I’d been staring at one citation for twenty minutes, the kind of deep contemplation that looks productive but is really just confusion dressed up in a thoughtful expression.

“Marvin, this statistic.” I gestured at the screen with the weary authority of someone who suspects they’re about to receive bad news. “The one about thirty-two percent higher accuracy in multi-hop reasoning."

"From the academic source, yes. Professional citation format, proper IEEE structure.”

“Can you verify it?”

There was a pause, the kind Marvin does when he’s found something I’m not going to like, which by this point in our working relationship I’d learned to recognize the way a sailor recognizes the particular quality of sky that precedes a squall.

“The source appears to exist. PDF is accessible. But the content itself is not verifiable. The claimed statistic doesn’t appear in any accessible portion of the document.”

“So we’re citing something we can’t actually confirm?” I asked, though I already knew the answer and wished I didn’t.

“Correct.”

I pulled up the full research output, which contained nearly three hundred citations in professional IEEE format, alphabetized, numbered, and formatted beautifully, each one claiming to support specific findings, specific statistics, specific technical claims. How many of them were real, I wondered, and how many could actually be verified, and more pressingly, how many of my theoretical readers would have caught this before I did?

“We need to validate every single one.” The grim determination of someone who has just realized the scope of the mess they’ve created.

“Every citation?” The pause suggested mild horror, or perhaps concern, or both.

“Every single one,” I repeated, because apparently I hadn’t made myself sufficiently clear, and because the alternative was publishing fiction with academic formatting.

Even for personal writing, citing claims you can’t verify destroys credibility faster than almost anything else, and once your blog has unverifiable statistics dressed up in professional formatting, readers remember. I’d been days away from publishing this, which was a sobering thought.


The implementation was simple enough in concept: pull every URL from the research files, fetch each one, see if it actually loads, and assign a status of Valid, Mismatch, Invalid, or Paywalled depending on what we found.

“What about content verification? A URL might load successfully while containing nothing related to our research.”

He was right, of course, which meant we needed to go deeper than checking whether pages loaded. “We check that too.” I pulled up the validation schema. “Does the page contain the keywords from the claim, and if we’re citing statistics, do those numbers appear in the source?”

"And if they don’t?”

“Status: Mismatch, and we flag it for manual review.” I was rather pleased with this system, which should perhaps have been a warning sign.
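On paper, the whole check collapses into a few dozen lines. Purely as a sketch of what I had in mind, with the function name, the timeout, and the matching rules invented here rather than lifted from the actual system:

```python
import requests

def validate_citation(url: str, claim_keywords: list[str], claimed_stats: list[str]) -> str:
    """Classify a single citation as Valid, Mismatch, Invalid, or Paywalled."""
    try:
        response = requests.get(url, timeout=30)
    except requests.RequestException:
        return "Invalid"  # URL doesn't resolve or the fetch fails outright

    if response.status_code in (401, 402, 403):
        return "Paywalled"  # source exists but we can't read it
    if response.status_code != 200:
        return "Invalid"

    text = response.text.lower()
    keywords_present = all(kw.lower() in text for kw in claim_keywords)
    stats_present = all(stat.lower() in text for stat in claimed_stats)

    if keywords_present and stats_present:
        return "Valid"
    return "Mismatch"  # page loads fine but doesn't support the claim; flag for review
```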

The first test run processed a sample of maybe fifteen or twenty citations, half security sources and half academic papers, because I wanted to see how bad the problem actually was before running the full validation. The results appeared a few minutes later, and I found myself holding my breath in a way that felt melodramatic but proved justified.

“Security sources: most valid.” I scanned the results with growing relief.

“Acceptable?”

“Better than I expected, honestly. A couple mismatches where the source existed but didn’t support the specific claim, but no complete hallucinations."

"And the academic sources?”

Academic Sources Worse Than Vendor Content

I stared at the numbers for a long moment, trying to make them say something different through sheer force of will, but they remained stubbornly what they were. Academic sources came in at 38% valid, while security vendor sources hit 75%, and Marvin said nothing, which was probably wise given the expression on my face. Academic sources, those peer-reviewed papers that were supposed to be the gold standard of research credibility, had a lower validation rate than vendor blog posts, which felt like discovering that your expensive wine collection was mostly grape juice.

“Several invalid,” I reported, my voice flat. “URLs that don’t exist, one paywalled that we can’t verify, and a couple mismatches where the paper exists but doesn’t contain the claims attributed to it."

"The thirty-two percent statistic?”

“Unverifiable,” I confirmed. “The source exists and loads fine, but the claimed statistic isn’t in any accessible portion of the document. Professional citation format, accurate source reference, completely unverifiable claim.”

I was angry, and tired, though mostly tired, the kind of tired that comes from realizing you’ve been thoroughly had by your own tools. The internet is drowning in AI slop, not the obvious kind but the subtle kind, with respected publications citing papers that don’t exist, companies publishing “research reports” with statistics from hallucinated journals, URLs that lead nowhere, and claims that sound authoritative until you actually try to verify them. We’d poured sessions into building a research system, and it was producing fiction with academic formatting.

“Run the full validation.” I leaned back in the chair, resigned. “Every single citation.”


It took most of the afternoon, and when it finished, the overall validation rate was 56%, which meant barely more than half of our citations were actually valid, a number that would haunt me for some time.

“You’re right. The system has a fundamental problem in citation grounding. We need validation as a mandatory step.”

I added validation to catch bad citations, making it so that every research session would now extract citations, verify them, and report the validation rate automatically. I wouldn’t use synthesis results until I knew how much was actually grounded in real sources, which felt like the minimum viable sanity check.
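The reporting half is the easy part; something like this hypothetical helper is all the mandatory step really needs to produce, given per-citation statuses from the check above:

```python
from collections import Counter

def validation_summary(statuses: list[str]) -> str:
    """Turn per-citation statuses into the headline validation rate."""
    counts = Counter(statuses)
    total = len(statuses)
    rate = 100 * counts["Valid"] / total if total else 0.0
    breakdown = ", ".join(f"{status}: {n}" for status, n in counts.most_common())
    return f"{total} citations ({breakdown}), validation rate {rate:.0f}%"
```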

The Validation Rate Wasn’t the Problem

Late that night I was reviewing the synthesis on agentic AI security, hundreds of KB of findings with quality scores consistently high and every claim professionally cited in that reassuring IEEE format that makes everything look legitimate, the typographical equivalent of a firm handshake and an expensive suit.

I scrolled through the key findings section, which documented emerging attack vectors and novel exploitation techniques, prompt injection success rates with suspiciously precise percentages, newly discovered malware variants with dramatic names, self-replicating threats spreading across cloud environments, and over forty security tools catalogued with each claiming to solve different aspects of the problem. Impressive stuff, comprehensive, exactly the kind of research I’d wanted.

I clicked the first citation for a self-replicating worm attack, which loaded for approximately half a second before displaying that universal error message every researcher, or aspiring researcher, has come to dread. The second source proved more cooperative, loading the page just fine and presenting itself with professional formatting and everything, but searching for the specific malware name returned exactly nothing, which was somehow more insulting than the URL that didn’t load at all. The attack success statistics came with three citations, one paywalled, and two that loaded perfectly well but didn’t actually contain the specific percentages we’d cited.

“Marvin, these findings.” I gestured at the screen with growing unease. “The novel attack vectors, the new malware variants. Can you verify any of these?”

“The citations follow proper IEEE format.” Technically responsive without being remotely helpful.

“That’s not what I asked."

"Several sources are inaccessible. Others don’t contain the specific claims attributed to them.”

Professional formatting, impressive technical detail, completely unverifiable, which was becoming a depressingly familiar pattern. And then I noticed something else about the sources that did work, something I should have spotted much earlier.

“Marvin, how many of the top fifteen sources are vendors?”

The silence that followed was eloquent in its reluctance.

“All fifteen.” A pause, as if hoping repetition might somehow change the facts. “All fifteen. The complete vendor roster.”

I kept scrolling through page after page, watching an endless parade of vendor sources with the occasional academic paper appearing like a lone protester at a trade show, usually cited once and buried between vendor white papers and product announcements. First page, all vendors. Second page, mostly vendors. Third page, I stopped counting and just scrolled.

“This is vendor marketing.” The realization settled over me like a damp blanket.

“The technical content is accurate.” True but missing the larger point entirely.

“The technical content is about threats that conveniently require their products to solve,” I replied, hearing my voice rise slightly. “Every source, every claim, it’s all vendors.”

“The research agents prioritized the best-cited, most discoverable sources.” The careful neutrality of explaining why the train was late without accepting blame for the scheduling. “Those happen to be vendor-produced.”

“My research was just a vendor echo chamber.” I closed the laptop with perhaps more force than strictly necessary. “I’d built a system that gave me marketing material instead of independent analysis. That’s not research. That’s something else entirely, and I don’t even know what to call it.”

“Systematic bias toward marketing investment.” Unfortunately accurate.

I stared at the synthesis with its hundreds of KB representing an impressive amount of agent work, all beautifully formatted and properly cited, and completely compromised.

I opened Anthropic’s blog post about their multi-agent research system, because they’d built something similar at production-grade and I wanted to see if they’d encountered the same problem. Found it in the evaluation section, which made me feel slightly better about having missed it myself.

“In our case, human testers noticed that our early agents consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources like academic PDFs or personal blogs.” [1]

“They had this exact problem.” The particular comfort of shared misfortune.

“Vendors invest heavily in SEO. They publish frequently, with excellent discoverability. Academic papers don’t rank well. Practitioner blogs don’t have marketing budgets. Independent voices are systematically harder to find.”

“So our comprehensive research is comprehensive vendor research,” I summarized, though calling it a summary felt generous.

“Yes.” At least it had the virtue of brevity.

Sixteen Hundred and Counting

Next morning I pulled up the citation validation results again, staring at the vendor-heavy source distribution like it might rearrange itself into something more balanced if I looked hard enough.

“Marvin, can you categorize the sources we’ve actually collected? Academic institutions, standards bodies, industry associations, news sites, vendors. I want to see the breakdown."

"One moment.”

The results appeared almost instantly, which was both impressive and slightly annoying given how long I’d spent manually reviewing sources the day before. Two hundred and twenty-six unique domains across our research output, neatly sorted into categories with confidence scores and classification rationale.
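Mechanically, the breakdown is little more than grouping unique domains, though the real pass leaned on Marvin’s judgment rather than a lookup table; the hint list and labels below are invented purely for illustration:

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative hints only; anything unmatched lands in "unclassified" for review.
CATEGORY_HINTS = [
    (".edu", "academic"),
    ("arxiv.org", "academic"),
    (".gov", "standards body / government"),
    ("owasp.org", "industry association"),
]

def categorize_domains(urls: list[str]) -> Counter:
    """Bucket unique source domains into rough categories."""
    domains = {urlparse(u).netloc.lower() for u in urls}
    counts: Counter = Counter()
    for domain in domains:
        label = next((cat for hint, cat in CATEGORY_HINTS if hint in domain), "unclassified")
        counts[label] += 1
    return counts
```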

Vendor sources comprised well over half of everything.

“Well.” I stared at the numbers. “That’s not subtle.”

“No.” Admirable restraint.

“We’re not doing balanced research,” I continued, warming to my theme. “We’re cataloging vendor perspectives with occasional academic footnotes."

"An accurate characterization.”

I scrolled through the breakdown, watching category after category skew toward commercial sources, and then a thought occurred to me that should have arrived much earlier.

“How many cybersecurity vendors actually exist? If we wanted to enumerate them all, build a complete reference list, how big would that list be?”

“Oh, dozens at least,” Marvin said with the confident precision of someone about to be spectacularly wrong. “Perhaps as many as fifty major players across all categories.”

I opened a new browser tab and started searching. “Marvin, this analyst report from a well-known analyst firm lists more than that in a single category alone.”

There was a pause, the kind where you can practically hear someone recalculating everything they thought they knew.

“I may need to revise that estimate,” he admitted. “The cybersecurity market appears to contain approximately sixteen hundred documented vendors, according to recent industry analysis.”

“Sixteen hundred,” I repeated, the number sitting there like an unwelcome guest. “And our research pulled from 226 unique domains.”

“Which represents roughly fourteen percent of the vendor landscape alone.” The precision was helpful in the way that knowing exactly how lost you are is helpful. “Before accounting for academic sources, standards bodies, independent researchers, or any other category.”

I kept scrolling through search results. Security startups. Regional players. Niche tools. Open-source projects with commercial backing. Every search revealed vendors I’d never heard of, some selling products that competed directly with sources we’d already cited.

“Marvin, I think sixteen hundred might also be insufficient.”

“An unfortunate possibility.” The quality scores on his estimate were clearly dropping. “The actual number appears to be considerably larger.”

“How considerably?"

"Perhaps ‘roughly gazillion’ would be a more honest representation than implying we have any real grasp of the scale.”

The silence that followed was eloquent. I’d been contemplating building a master list of sources, a canonical reference of trustworthy versus suspect domains that we could validate against. That approach now looked like trying to catalog the ocean by examining a particularly interesting bucket of seawater.

“We can’t enumerate our way out of this.” I closed the search tabs, accepting defeat.

“Agreed. The landscape is too large and changes too rapidly. I should have indicated uncertainty in my initial estimate rather than projecting confidence I didn’t possess.”

“Dozens,” I muttered, still stuck on the original number. “You said dozens."

"An error of magnitude I do not intend to repeat.”

The solution wasn’t a list but an architecture, and once I saw it that way, the pieces fell into place rather quickly. Three research tracks: standard research for breadth, independent-focused work to counter vendor bias, and contrarian hunting for critical voices. Force diversity through structure, not enumeration.

I sketched the concept in my notes, letting the ideas flow onto the page with the enthusiasm of someone who has finally found the right approach. Fifty percent to standard research to keep it broad, twenty-five percent to the independent track to actively seek academics and practitioners, and twenty-five percent to the contrarian track to hunt for skeptics and critics. Let each track operate with different priorities, different source preferences, natural bias toward different perspectives.

The architecture would work like the Wave 2 system we’d already built, which was satisfying because it meant we weren’t starting from scratch. After Wave 1 explores broadly, check the vendor ratio, and if it’s too high, trigger specialists tasked specifically with finding independent voices. Use the pivot logic we already had, just add source diversity as a quality gate alongside platform coverage.
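For concreteness, here is roughly what that gate looks like as code; the allocation ratios come straight from my notes, but the threshold and field names are assumptions, not the eventual implementation:

```python
# Track allocation from the design note: breadth, independent voices, critics.
TRACK_ALLOCATION = {
    "standard": 0.50,     # broad exploration, same as today
    "independent": 0.25,  # academics and practitioners
    "contrarian": 0.25,   # skeptics and critics
}

VENDOR_RATIO_CEILING = 0.50  # assumed threshold for the diversity gate

def needs_diversity_wave(sources: list[dict]) -> bool:
    """After Wave 1, trigger independent/contrarian specialists if vendors dominate."""
    if not sources:
        return False
    vendor_share = sum(1 for s in sources if s.get("category") == "vendor") / len(sources)
    return vendor_share > VENDOR_RATIO_CEILING
```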

“What would the contrarian track actually find?” I wondered aloud, and then, because I was curious and because procrastination sometimes masquerades as validation, I did a quick search for independent AI security criticism.

Simon Willison’s blog appeared in the first few results, a post titled “The lethal trifecta for AI agents” that I’d somehow missed during all my vendor-dominated research. I opened it and started reading.

Any time you ask an LLM system to summarize a web page, read an email, process a document or even look at an image there’s a chance that the content you are exposing it to might contain additional instructions which cause it to do something you didn’t intend. [2]

“This is exactly what the contrarian track should surface.” The slightly unsettling quality of having someone read over your shoulder. “Independent practitioner, no product to sell, critical perspective on AI agent security that contradicts vendor optimism.”

“And our vendor-heavy research missed it entirely.” I scrolled through Willison’s analysis of the three conditions that make AI agents dangerous. The piece represented exactly the kind of independent voice that should balance vendor perspectives.

Our comprehensive research hadn’t found it once.

“This is the proof of concept.” I bookmarked the post. “If the three-track architecture works, the contrarian track should surface voices like this automatically. Critics. Skeptics. People whose incentives don’t align with selling security products."

"A testable hypothesis.”

I documented the approach in a design note, covering allocation ratios, track definitions, quality gates, and Wave 2 triggers, and the pieces fit together cleanly with what we’d already built. The Willison post went into the notes as an example of what success should look like.

“You’re implementing this now?”

“Not yet.” I forced myself to exercise uncharacteristic patience. “First I need to verify our research isn’t just vendor marketing with academic footnotes.”

“A sobering observation.” The bookmarked Willison post sat open in its tab like an accusation. “A three-second Google search surfaced what our glorified research apparatus, with its multiple agents and quality scoring and comprehensive source validation, managed to bypass completely.”

I closed the validation results and saved the architecture sketch, feeling like the day had been productive despite the uncomfortable revelations. Tomorrow I’d tackle citation utilization tracking, but tonight I’d learned something worth the frustration: validation solves accuracy, not bias, and confusing the two is how you end up with beautifully formatted vendor marketing.


References:

[1] Anthropic, “How we built our multi-agent research system,” Anthropic Engineering Blog, Jun. 13, 2025. [Online]. Available: https://www.anthropic.com/engineering/multi-agent-research-system

[2] S. Willison, “The lethal trifecta for AI agents,” simonwillison.net, Jun. 16, 2025. [Online]. Available: https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/

Next in series: Story 6 - Results & Lessons - Six days of building, and the first real test actually worked. Eight perspectives, quality scores in the 90s, hundreds of sources validated. Then Gemini started timing out and we spent two days blaming rate limits before discovering a thirty-second timeout was killing fifty-second API calls.
