I found the problem in the one place I should have checked first, which was the command file itself, sitting there at 116KB of instructions loading into every subprocess whether they needed it or not, a fact that would have been obvious to anyone who’d bothered to look, which evidently I hadn’t.

Story 8 of 13 in The Adaptive Research System series (Making It Actually Work).


I was debugging the synthesis context overflow issue when something occurred to me, another session bringing with it another cold cup sitting beside the keyboard, the pattern as predictable by now as the sunrise or the mortgage payment.

“Marvin, what’s the token count for conduct-research-adaptive?”

“One moment.” A pause that lasted slightly longer than I’d expected. “3,104 lines. 15,644 words. 118,433 bytes. Enough tokens to choke context before the first agent even dispatches.”

I stopped scrolling through the synthesis logs, my finger hovering over the trackpad as the number sank in.

“That many tokens, just for the command file?”

“That is correct. Orchestration logic. Agent prompt templates. Quality scoring algorithms. Citation validation steps.”

I opened the file and started scrolling, then kept scrolling, then scrolled some more in a manner that was becoming distinctly unsettling.

“When exactly did this happen?”

3,104 lines. The matter came into perspective when I considered that the thing had started at maybe 200 lines eight sessions ago, just clean query analysis, launch agents, collect results, synthesize, and that was it. Then came perspective routing, then quality scoring, then pivot analysis, source quality, citation validation, platform coverage tracking, and emergent direction detection, each feature arriving like a well-meaning houseguest who somehow never left.

Each addition had solved a real problem, each addition had made sense at the time, and nobody had checked the cumulative cost. These things tend to go precisely that way.

“So we’re burning a massive token budget on instructions,” my fingers drummed against the desk as I worked through the implications, “before the first agent even dispatches.”

Every research session, every synthesis, every time the system struggled with context, we’d blamed the research volume, and never once had we checked whether we were loading 116KB of instructions into processes that needed 10KB, which is rather like blaming your shoes for your pace while wearing a backpack full of bricks.
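The arithmetic behind that complaint is easy to sketch, assuming the rough four-bytes-per-token heuristic for English text (a real tokenizer count would differ, but not by enough to change the conclusion):

```python
# Back-of-envelope for the instruction overhead, using the
# common ~4 bytes-per-token heuristic (not a real tokenizer).
BYTES_PER_TOKEN = 4

monolith_bytes = 118_433   # the command file's size from the session
needed_bytes = 10_000      # roughly what a single phase actually uses

monolith_tokens = monolith_bytes // BYTES_PER_TOKEN
needed_tokens = needed_bytes // BYTES_PER_TOKEN
overhead = monolith_tokens - needed_tokens

print(f"~{monolith_tokens:,} tokens loaded, ~{needed_tokens:,} needed")
print(f"~{overhead:,} tokens of pure overhead per invocation")
```

Call it thirty thousand tokens of instructions, give or take, before a single search result enters context.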

“The math is rather unfavorable.” One way to put it, certainly, though I might have chosen something with slightly more panic in it.

I scrolled through the session logs and found that a few sessions back, synthesis quality had dropped, the agent producing shallower summaries than before, and we’d blamed it on research volume and added the synthesis sub-agent in M12 to help.

But the synthesis sub-agent wasn’t getting pre-condensed summaries because the orchestrator had no context budget left, spending most of its token budget on instructions before even looking at research, much like a chef who fills up on bread before the entrees arrive.

“We’ve been treating the symptom.” I closed the log with something approaching resignation. “The problem was here the whole time.”

“The command file growth was incremental. Each addition appeared justified in isolation. The cumulative effect was not evaluated.”

“So we fattened the command file one feature at a time,” I leaned back in the chair, “and nobody bothered to weigh it, which I believe is how they make foie gras, though the analogy breaks down rather quickly.”

I stared at the 3,104-line file, where agent prompt templates took up whole sections, quality scoring logic sprawled across several more, the citation validation framework was thorough and therefore verbose, platform coverage requirements repeated for each perspective type, and the pivot analysis algorithm alone ran 300 lines.

All necessary, all well-designed, all sitting in context like dead weight. We had built something excellent and then accidentally smothered it.

“What are my actual options here?” I opened a new note and started working through it out loud.

Keep the monolith, and synthesis continues to degrade, clearly unacceptable by any reasonable measure. Spawn sub-agents for each phase, creating six agents with full session overhead for a linear workflow, the very definition of over-engineering. Can commands invoke other commands? I pulled up the documentation and found the answer was no, commands being instructions rather than executables with call stacks.

The Gift in the Documentation


Then I saw it, sitting there in the documentation like a gift from the architecture gods: the SlashCommand tool, where the main agent invokes commands and sub-commands load phase-specific logic, same context but focused loading.

“So commands can’t invoke other commands directly, can they?”

“They cannot. Commands are instruction files, not executable programs with call stacks.”

“So my options are one massive monolith or six heavyweight sub-agents for what’s basically a linear workflow?”

A longer pause this time, the kind that suggested Marvin was working through implications I hadn’t considered.

“Or you could use the SlashCommand tool.”

I stopped scrolling, because this was news worth stopping for. “I can do what?”

“The main agent can invoke slash commands via the SlashCommand tool. You could split the monolith into sub-commands that chain together. Each sub-command loads only its phase-specific logic.”

“So instead of spawning six sub-agents with full session overhead, I split into sub-commands that run in the same agent context?”

“That would work, yes.”

I started sketching the architecture, where the 116KB monolith splits into focused sub-commands, Init getting its own file with just query analysis and perspective generation, Collection getting collection logic, pivot analysis, and Wave 2 decisions, Synthesis getting synthesis orchestration, and Validation getting citation checks.

The main command would become a thin orchestrator, just chaining SlashCommand calls, elegant in the way that removing complexity is always elegant when you’ve been drowning in it.

I looked at the file structure I was sketching, with the public command at the top and internal commands below, and wondered how to signal which was which.

“The underscore prefix,” I typed the solution as the obvious clicked into place. “Like Python private methods. /_research-init is internal. /conduct-research-adaptive is the public interface.”

“Convention borrowed from decades of software engineering. Now applied to AI workflows.”

One character to communicate boundaries, and it felt right, which is how you know when a convention is worth keeping.

Two Hours of Productive Surgery

I spent the next two hours extracting phases into separate files, the sort of focused refactoring work that goes quickly when you finally know what you’re doing.

Init came apart easily, since query parsing, perspective generation, and agent selection logic lived together anyway, producing a tidy 10KB. Validation proved just as cooperative, with citation verification, hallucination checks, and final quality gates yielding another 10KB.

Synthesis took more work, as orchestration logic, sub-agent spawning, and condensation strategies wanted to stay entangled, but I eventually separated them into a 20KB file that felt good enough.

Collection, however, fought back, because even after extracting init, synthesis, and validation, the collection command sat at 1,590 lines, doing collection and validation and pivot analysis and source quality evaluation and Wave 2 launching, five distinct responsibilities crammed together because they all technically happened during something we’d been calling the “collection phase.”

I stared at it and saw that this wasn’t just big but tangled, with Wave 2 decisions depending on pivot analysis and citation validation happening after synthesis rather than during collection, two validation steps at completely different workflow points both ending up in the same file.

“Two-way split.” I started typing before the thought fully formed. “Keep pre-synthesis with collection. Move post-synthesis to its own command.”

Collection became /_research-collect-execute, still large at 1,250 lines but holding together since all the sequential collection logic lived in one place, while citation validation became /_research-collect-validate for post-synthesis checks at a manageable 10KB.
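On disk, the split looked something like this; the `.claude/commands/` directory is where slash commands conventionally live, and the size annotations echo the figures above (a sketch, not a listing from the actual repo):

```
.claude/commands/
├── conduct-research-adaptive.md   # public orchestrator (~3KB)
├── _research-init.md              # query analysis, perspective gen (~10KB)
├── _research-collect-execute.md   # collection, pivot, Wave 2 (~37KB)
├── _research-synthesize.md        # synthesis orchestration (~20KB)
├── _research-collect-validate.md  # post-synthesis citation checks (~10KB)
└── _research-validate.md          # final quality gates (~10KB)
```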

I updated the main command:

Sub-commands prefixed with underscore are internal.
Always use `/conduct-research-adaptive` for research workflows.

The convention held, which is how you know a convention is earning its keep.
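For a sense of what one of those internal files contains, here is a hypothetical sketch of the init sub-command; everything beyond the phase names is invented for illustration ($ARGUMENTS being the standard placeholder a slash command uses for its input):

```markdown
# _research-init (internal; invoked via /conduct-research-adaptive)

Given the research query in $ARGUMENTS:

1. Parse the query into topic, scope, and constraints.
2. Generate the research perspectives to cover.
3. Select agent types for each perspective.
4. Create $SESSION_DIR and write the research plan there.

Return the session directory path for the next phase.
```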


I stared at the refactored files, six commands with clean separation, the orchestrator looking almost trivial:

# Adaptive Research - Main Orchestrator

Run research workflow in phases using sub-commands:

1. SlashCommand: /_research-init "$QUERY"
2. SlashCommand: /_research-collect-execute $SESSION_DIR
3. SlashCommand: /_research-synthesize $SESSION_DIR
4. SlashCommand: /_research-collect-validate $SESSION_DIR
5. SlashCommand: /_research-validate $SESSION_DIR

It had taken two hours to extract phases, untangle dependencies, and test each sub-command, time well spent for what it bought us.

“Run adaptive research: Cloud Security 2025.”

The agent loaded init at 10KB instead of 116KB, with query analysis actually visible in the command text without scrolling through agent prompt templates.

Collection loaded at 37KB instead of 116KB, still the largest but focused, and the pivot analysis algorithm had room to exist.

Synthesis loaded at 20KB instead of 116KB, and this was the real test, because if synthesis quality improved, we’d know the context pressure was the problem all along.

I watched the synthesis sub-agent work, producing fuller summaries with citations from Wave 2 agents actually used instead of compressed to “additional sources consulted,” and it was working in a way that made the previous two hours feel like a bargain.

I checked the commit:

M13: Split 116KB command monolith into phased sub-commands

- research-init: 10KB (query analysis, perspective gen)
- research-collect-execute: 37KB (collection, pivot, Wave 2)
- research-collect-validate: 10KB (post-synthesis citation checks)
- research-synthesize: 20KB (synthesis orchestration)
- research-validate: 10KB (final quality gates)
- conduct-research-adaptive: 3KB (orchestrator)

Main command now a fraction of its original token budget.

I pushed the changes and leaned back in the chair, the satisfaction of the thing settling in properly.

“The system does the same work.”

“With room to breathe while doing it.”

Next in series: Story 9 - Selective Ensemble - Three vendor APIs. Six model tiers. Every research request hitting all of them. The logs showed what the invoices would prove: most queries don’t need the expensive models.
