AI Agent Token Cost Optimization: How to Cut Spending by 65%
AI coding agents are transforming how software gets built. They are also transforming engineering budgets. A single developer running Claude Code full-time on a complex project can burn through $3,000-$13,000 per month in API costs. Multiply that by a team of five, and you are looking at $15,000-$65,000 monthly -- a line item that makes finance teams very uncomfortable.
The good news: most of that spending is waste. Redundant context loading, suboptimal model selection, bloated prompts, and unnecessary re-reads of unchanged files account for 60-70% of typical token consumption. With the right optimizations, you can cut costs by 65% without reducing output quality.
Where the Tokens Go
Before optimizing, you need to understand what drives cost. AI agent token consumption breaks down into four categories, and their relative weight surprises most developers.
Token Consumption Breakdown (Typical Session)
- Context loading (45%) -- every time you ask the agent a question, it re-reads your project files, system prompt, and conversation history. On a large project, this can exceed 100K tokens per interaction.
- Conversation history (25%) -- as your session progresses, every previous message is included in each new request. A 20-message conversation might carry 50K tokens of history.
- Output generation (20%) -- the actual code and explanations the agent produces. This is the part you are paying for, and it is the smallest portion.
- Retries and corrections (10%) -- when the agent makes a mistake and you ask it to fix the issue, all the context loads again plus the failed attempt.
The insight is clear: 70% of your spending -- context loading plus conversation history -- goes to repeatedly sending context that has not changed. This is the primary optimization target.
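The breakdown above can be turned into a quick back-of-the-envelope cost model. This is a sketch: the percentages are the typical-session figures listed here, while the session size and per-token price are illustrative assumptions, not measurements.

```python
# Rough per-session cost model using the four categories above.
# The shares come from the breakdown; the 2M-token session and the
# $3/MTok price are hypothetical inputs for illustration.

BREAKDOWN = {
    "context_loading": 0.45,
    "conversation_history": 0.25,
    "output_generation": 0.20,
    "retries_and_corrections": 0.10,
}

def session_cost(total_tokens: int, price_per_mtok: float) -> dict:
    """Split an estimated session token count across the four categories."""
    return {
        category: total_tokens * share * price_per_mtok / 1_000_000
        for category, share in BREAKDOWN.items()
    }

costs = session_cost(total_tokens=2_000_000, price_per_mtok=3.0)
repeat_context = costs["context_loading"] + costs["conversation_history"]
print(f"context-related spend: ${repeat_context:.2f} of ${sum(costs.values()):.2f}")
# → context-related spend: $4.20 of $6.00
```

Even at these modest numbers, 70 cents of every dollar goes to re-sending context, which is exactly the slice the strategies below attack.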
Strategy 1: Prompt Caching (Save 90% on Input Costs)
Prompt caching is the single highest-impact optimization available. Anthropic's prompt caching feature stores frequently-used context on their servers, reducing the cost of cached tokens by 90% on subsequent reads.
How it works: the first time your system prompt and project context are sent, they are written to the cache at a small premium (25% over the base input price). On subsequent requests in the same session, cached tokens are served at 10% of the base cost. For a 100K-token system prompt that gets sent 50 times in a session, you pay the cache-write price once and 10% for the other 49 reads.
Prompt Caching Math
Without caching: 100K tokens x 50 requests x $3/MTok = $15.00 per session
With caching: 1 cache write (100K tokens x $3.75/MTok = $0.38) + 49 cached reads (4.9M tokens x $0.30/MTok = $1.47) = $1.85 per session
Savings: roughly 88% on input costs alone
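The arithmetic is easy to reproduce, assuming Sonnet-class published rates: $3/MTok base input, a 25% premium on cache writes, and a 90% discount on cache reads.

```python
# Verify the prompt-caching math for a 100K-token prompt sent 50 times.
# Rates assume Sonnet-class pricing: base input $3/MTok, cache write
# at 1.25x base, cache read at 0.1x base.
PROMPT_TOKENS = 100_000
REQUESTS = 50
BASE = 3.00          # $/MTok, base input price
WRITE = BASE * 1.25  # $/MTok, first request writes the cache
READ = BASE * 0.10   # $/MTok, later requests read from the cache

uncached = PROMPT_TOKENS * REQUESTS * BASE / 1_000_000
cached = (PROMPT_TOKENS * WRITE + PROMPT_TOKENS * (REQUESTS - 1) * READ) / 1_000_000
savings = 1 - cached / uncached
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, savings {savings:.0%}")
```

The longer the session, the closer the savings climb toward the full 90% read discount, because the one-time write premium is amortized over more requests.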
Claude Code enables prompt caching automatically when using the Anthropic API. The key to maximizing cache hits is structuring your prompts so the static content (system prompt, project memory, file contents that have not changed) comes first, and the dynamic content (your current question) comes last. This way, the static prefix matches the cache on every request.
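If you call the API directly rather than through Claude Code, the same static-first ordering can be expressed with the Messages API's cache_control blocks. This is a sketch: the model id, prompt text, and project-context string are placeholders, not a prescribed setup.

```python
# Sketch: static context first (and marked cacheable), dynamic question last.
# The system blocks use the Anthropic Messages API's cache_control field;
# the prompt contents and project context here are placeholders.

def build_request(project_context: str, question: str) -> dict:
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "system": [
            # Static prefix: identical across requests in the session.
            {
                "type": "text",
                "text": "You are a coding assistant for this repository.",
            },
            {
                "type": "text",
                "text": project_context,
                "cache_control": {"type": "ephemeral"},  # cache everything up to here
            },
        ],
        # Dynamic suffix: changes every request, placed after the cached prefix.
        "messages": [{"role": "user", "content": question}],
    }

request = build_request("## Architecture\n...project memory...", "Why does login 500?")
```

Anything before the cache_control marker must be byte-identical across requests to hit the cache, which is why the current question belongs at the end, never interleaved with the static context.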
Strategy 2: Model Routing (Use the Right Model for the Job)
Not every task requires a frontier model. Asking Claude Opus to rename a variable or add a console.log statement is like hiring a senior architect to move a desk. It works, but you are dramatically overpaying.
Model routing means directing tasks to the appropriate model based on complexity:
- Frontier models (Claude Opus, GPT-4o) -- complex architecture decisions, multi-file refactors, debugging subtle race conditions, designing new systems. These tasks require deep reasoning and justify the higher token cost.
- Mid-tier models (Claude Sonnet, GPT-4o-mini) -- standard feature implementation, writing tests, code review, documentation. These are the bulk of daily tasks and mid-tier models handle them well at 5-10x lower cost.
- Lightweight models (Claude Haiku, GPT-3.5) -- code formatting, simple refactors, boilerplate generation, commit message writing, syntax fixes. These tasks do not benefit from deeper reasoning.
Cost Comparison by Model Tier
- Claude Opus 4: $15/MTok input, $75/MTok output -- reserve for complex reasoning
- Claude Sonnet 4: $3/MTok input, $15/MTok output -- daily workhorse
- Claude Haiku 3.5: $0.80/MTok input, $4/MTok output -- routine automation
A typical development day might involve 2 hours of complex architectural work (Opus), 5 hours of standard feature work (Sonnet), and 1 hour of routine tasks (Haiku). Routing appropriately reduces the daily cost from $80-120 (all Opus) to $25-40 (routed), a 65% reduction.
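A routing layer does not need to be sophisticated to capture most of the savings. The sketch below mirrors the three tiers above; the keyword lists are a deliberately naive stand-in for whatever classifier you actually use, and the returned names are shorthand tier labels, not exact API model ids.

```python
# Route a task description to a model tier. The keyword sets are
# illustrative assumptions; swap in a real classifier in practice.
# Returned names are shorthand labels, not exact API model ids.

ROUTES = {
    "opus": {"architecture", "refactor", "race", "design", "debug"},
    "haiku": {"format", "boilerplate", "commit", "rename", "syntax"},
}

def route(task: str) -> str:
    words = set(task.lower().split())
    if words & ROUTES["opus"]:
        return "claude-opus-4"    # frontier: deep reasoning
    if words & ROUTES["haiku"]:
        return "claude-haiku-3.5" # lightweight: routine automation
    return "claude-sonnet-4"      # default: the daily workhorse

print(route("design the caching architecture"))  # frontier tier
print(route("write a commit message"))           # lightweight tier
print(route("add pagination to the users list")) # mid-tier default
```

Defaulting to the mid-tier and escalating only on explicit signals keeps the expensive model as the exception rather than the rule.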
Strategy 3: Context Compression
Large codebases generate enormous context windows. When Claude Code reads a 500-line file to understand a function, you pay for all 500 lines even though only 30 lines were relevant. Context compression reduces what gets sent to the model.
The /compact command. Claude Code's built-in /compact command summarizes the current conversation into a condensed format, reducing token count by 50-80% while preserving the essential context. Use it whenever your conversation exceeds 20 messages or when you notice increasing latency.
# When your session gets long, compact the context
/compact
# You can also compact with a specific focus
/compact focus on the authentication module changes
Selective file reading. Instead of letting the agent read entire files, direct it to specific functions or line ranges. "Read the handleSubmit function in UserForm.tsx" costs far fewer tokens than "Read UserForm.tsx" when the file is 400 lines long.
Structured project memory. A well-organized CLAUDE.md file with clear section headers lets the agent find relevant information without reading irrelevant sections. Keep your project memory concise: architecture overview (20 lines), build commands (10 lines), conventions (15 lines), current priorities (10 lines). Total: under 60 lines.
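The token arithmetic behind selective reading is easy to estimate using the common rough heuristic of ~4 characters per token. The file below is synthetic and the function boundaries are hypothetical; the point is the ratio, not the absolute numbers.

```python
# Estimate the token cost of reading a whole file vs. one function.
# Uses the rough ~4 chars/token heuristic; the 400-line file is
# synthetic and the function's line range is a hypothetical example.

def rough_tokens(text: str) -> int:
    return len(text) // 4

file_lines = ["const x%d = compute(%d);" % (i, i) for i in range(400)]
whole_file = "\n".join(file_lines)
one_function = "\n".join(file_lines[180:210])  # e.g. a 30-line handleSubmit

print(rough_tokens(whole_file), "tokens for the whole file")
print(rough_tokens(one_function), "tokens for the function alone")
```

When the agent repeats a read like this dozens of times per session, a roughly 13x reduction per read compounds into a meaningful share of the context-loading bill.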
Strategy 4: Session Management
How you structure your work sessions has a direct impact on token consumption. Long, unfocused sessions are expensive. Short, targeted sessions are cheap.
Task-based sessions. Start a new Claude Code session for each distinct task. "Add pagination to the users list" is one session. "Fix the login redirect bug" is a separate session. This prevents conversation history from one task inflating the context of another.
Session checkpoints. When a session is going well, save the current state by asking the agent to summarize what has been accomplished and what remains. If you need to restart, you can paste the summary into a new session rather than replaying the entire conversation.
Avoid exploratory sessions on the API. If you are exploring a codebase or brainstorming architecture, use the flat-rate Claude Max subscription rather than pay-per-token API access. Exploration is inherently token-heavy and unpredictable. Reserve API usage for focused execution.
Real-World Cost Data
Here is what teams actually spend, before and after optimization, based on data from engineering teams running agentic workflows in production.
Solo Developer (Full-Time Agentic Workflow)
- Before optimization: $3,200/month (all Opus, no caching, long sessions)
- After optimization: $1,100/month (model routing + caching + compact)
- Savings: 66%
5-Person Engineering Team
- Before optimization: $13,500/month (mixed usage, no governance)
- After optimization: $4,700/month (routing + caching + session limits)
- Savings: 65%
20-Person Engineering Org
- Before optimization: $47,000/month
- After optimization: $16,500/month (full governance stack)
- Savings: 65%
The 65% savings number is remarkably consistent across team sizes. The optimizations scale linearly because the waste patterns are the same regardless of how many developers are involved.
Tracking Token Usage Across Multiple Agents
You cannot optimize what you do not measure. When running multiple AI agents simultaneously -- a common pattern in agentic engineering workflows -- tracking per-agent costs becomes critical for identifying which workflows are efficient and which are burning money.
Beam helps with this by organizing your agent sessions into labeled panes within workspaces. Each pane corresponds to a specific agent instance running a specific task. When you review your API usage dashboard, you can correlate cost spikes with specific panes and specific tasks, identifying which workflows need optimization.
For example, if your "test writer" agent consistently costs 3x more than your "implementer" agent, something is wrong. Maybe it is reading the entire test suite before writing each new test. Maybe it is using Opus when Haiku would suffice for test generation. Without per-agent visibility, you would never know where the waste lives.
The Optimization Checklist
Apply these in order. Each builds on the previous one.
- Enable prompt caching -- if using the Anthropic API, this happens automatically. Ensure your system prompt is stable within sessions. Expected savings: 30-40%.
- Implement model routing -- use frontier models for complex tasks only. Route standard work to mid-tier models. Route routine tasks to lightweight models. Expected savings: 20-30%.
- Use /compact regularly -- run the compact command every 15-20 messages or when you notice increased latency. Expected savings: 10-15%.
- Structure sessions by task -- one task per session. Avoid letting sessions drift into multiple unrelated topics. Expected savings: 5-10%.
- Optimize project memory -- keep CLAUDE.md under 100 lines. Remove stale information. Be precise, not verbose. Expected savings: 5%.
Combined, these five optimizations typically reduce token spending by 60-70%. The first two alone (caching and model routing) account for the majority of savings and take less than an hour to implement.
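The individual percentages compound on the remaining spend rather than adding up, which is why the combined figure is 60-70% and not their straight sum. A quick check with mid-range values from the checklist:

```python
# Savings compound: each step applies to what is left after the previous
# one. The values are mid-range figures from the checklist above.
steps = [0.35, 0.25, 0.12, 0.07, 0.05]

remaining = 1.0
for saving in steps:
    remaining *= 1 - saving
print(f"combined savings: {1 - remaining:.0%}")
# → combined savings: 62%
```

This also explains the ordering advice: the early, high-percentage steps do most of the work, and each later step chips away at a smaller remainder.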
AI agents are worth the investment. But there is no reason to pay 3x more than necessary. Optimize your token usage, and the ROI of agentic engineering becomes impossible to argue against.