We Replaced Our Sprint Planning With an AI Agent Team: Here's What Actually Happened
How Claude 4.6 Opus agent teams in Claude Code can automate multi-step workflows like infrastructure reviews and codebase exploration for your engineering team.
Last week, we tested Claude Code's new agent teams feature on a hypothetical product launch. The brief: research competitor pricing, draft technical documentation, review it against our style guide, and generate the SEO metadata. Four distinct steps, each depending on the one before it.
The entire pipeline ran in under four minutes. Four agents worked in parallel where they could, waited on dependencies where they had to, and produced output that needed only minor edits before going live.
This isn't a demo. It's the kind of workflow we're helping our clients build at Jelifish. And it all starts with Anthropic's release of Claude 4.6 Opus and the agent teams feature in Claude Code.
What Claude 4.6 Opus Actually Is
Claude 4.6 Opus is Anthropic's most capable model, released on 5 February 2026. The model ID is claude-opus-4-6. If you've been tracking the Claude model family, the numbering reflects Anthropic's shift away from the old 3.x versioning towards a cleaner scheme: Claude 4.5 Sonnet for the fast, cost-effective workhorse; Claude 4.6 Opus for maximum capability.
The headline specs: 1 million token context window (in beta, up from 200K), 128K output tokens, and pricing at $5/$25 per million tokens for input/output. There's a 50% discount on the batch API for async processing, which we use heavily for bulk content operations.
The improvements that matter for real-world engineering:
1M token context window means you can load an entire codebase into a single conversation. For a typical CDK project, that's the infrastructure code, the Lambda handlers, the test suite, and the deployment scripts — all in context at once. No more "sorry, I've lost track of that file from earlier."
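To make that concrete, here is a minimal sketch of gathering a project into a single prompt-ready string. The file layout is hypothetical, and the token count uses a crude chars-divided-by-four heuristic rather than a real tokenizer, so treat the estimate as a sanity check only:

```python
from pathlib import Path

def gather_project(root: str, extensions=(".ts", ".py", ".json")) -> str:
    """Concatenate every matching source file into one prompt-ready string."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            parts.append(f"=== {path} ===\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)

def rough_token_estimate(text: str) -> int:
    """Very rough heuristic: roughly 4 characters per token."""
    return len(text) // 4
```

Before sending, check that `rough_token_estimate(gather_project("."))` stays comfortably under the 1M window, leaving headroom for output tokens.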
Adaptive thinking is a new mode (`thinking: {type: "adaptive"}`) where Claude decides how much reasoning to apply based on problem complexity. There are four effort levels: low, medium, high, and max. For agent teams, this matters because the researcher agent can run at high effort for complex analysis while the SEO agent runs at low effort for metadata generation. You pay for what you need.

Context compaction (also in beta) automatically summarises older conversation tokens when the context approaches its limit. For long-running agent tasks, this is the difference between an agent that gracefully handles a four-hour investigation and one that falls apart after 20 minutes.
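Based on the parameter shapes described above, a request using adaptive thinking might be assembled like this. Treat the exact field names (`thinking`, `effort`) as assumptions about the beta API rather than a spec:

```python
# Sketch of an Anthropic Messages API request body using adaptive thinking.
# The "thinking" and "effort" fields follow the shapes described in this
# post; treat them as assumptions about the beta API, not documentation.

def build_request(prompt: str, effort: str = "medium") -> dict:
    assert effort in ("low", "medium", "high", "max")
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 4096,
        "thinking": {"type": "adaptive"},  # let the model choose reasoning depth
        "effort": effort,                  # cap on how hard it may think
        "messages": [{"role": "user", "content": prompt}],
    }

# A researcher agent can request deep reasoning...
researcher_req = build_request("Analyse competitor pricing pages", effort="high")
# ...while an SEO agent stays cheap.
seo_req = build_request("Generate a meta description", effort="low")
```

The point of the sketch: effort becomes a per-agent dial, not a global setting.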
Materially better code generation. On Terminal-Bench 2.0, Opus 4.6 scored 65.4% — the highest score ever recorded, up from 59.8% for Opus 4.5. In practice, that translates to fewer incorrect CDK constructs, better Lambda handler patterns, and more accurate TypeScript types.
One result that caught my attention: Anthropic's frontier red team turned Opus 4.6 loose on open-source libraries with standard vulnerability analysis tools. It found over 500 previously unknown high-severity vulnerabilities, including buffer overflows in OpenSC and CGIF. In the CGIF case, it wrote its own proof-of-concept exploit to validate the finding. That's not a parlour trick. That's meaningful security research capability.
But the model itself is only half the story. The real shift is in how you orchestrate multiple instances of it.
Agent Teams: Not What I Expected
When Anthropic introduced agent teams in Claude Code, I assumed it was a thin wrapper around running multiple prompts. I was wrong.
Agent teams are still experimental — you enable them with CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 in your environment or settings.json. But even in this early state, the architecture is surprisingly well-thought-out.
An agent team consists of a team lead and one or more specialist agents, each with a defined role. The team lead coordinates work, creates tasks, assigns them to specialists, and manages dependencies between tasks. Each specialist runs as an independent Claude instance with its own context window, tools, and instructions.
Here's what makes this different from simply running multiple prompts in parallel — and critically different from Claude Code's existing subagent system:
Task dependencies are enforced. If agent B's work depends on agent A's output, agent B will wait. Task claiming uses file locking to prevent race conditions when multiple agents try to claim the same task simultaneously. When a blocking task completes, dependent tasks unblock automatically. This sounds trivial, but it eliminates an entire class of coordination bugs.
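Claude Code's actual implementation isn't public, but the claiming pattern is straightforward to sketch: an atomic, exclusive file creation guarantees that exactly one agent wins a given task, even when several race for it. All names here are ours, for illustration only:

```python
import os

def try_claim(task_id: str, agent: str, lock_dir: str = "/tmp/team-locks") -> bool:
    """Attempt to claim a task. Creating the lock file with O_EXCL is atomic,
    so exactly one agent succeeds even if several race for the same task."""
    os.makedirs(lock_dir, exist_ok=True)
    lock_path = os.path.join(lock_dir, f"{task_id}.lock")
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent already claimed this task
    with os.fdopen(fd, "w") as f:
        f.write(agent)  # record which agent holds the claim
    return True
```

The first caller gets `True`; every later caller for the same task gets `False` and moves on to other work.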
Agents communicate peer-to-peer. This is the key difference from subagents. With subagents, results flow back to a central coordinator — they can't talk to each other. With agent teams, teammates message each other directly. They can share findings, challenge assumptions, and coordinate without the team lead mediating every interaction. When the researcher in our product launch test found conflicting information about competitor pricing, it flagged the issue to the team lead rather than silently choosing one.
Each agent has its own context window. This is architecturally significant. Instead of cramming everything into a single, increasingly confused context, each specialist focuses on its narrow task. The researcher fills its context with documentation and sources. The technical writer fills its context with style guides and drafts. Neither pollutes the other's working memory.
The filesystem is shared. All agents read from and write to the same project directory. The researcher saves notes to a markdown file; the writer reads that file and produces a draft; the reviewer reads the draft and flags issues. It's the same workflow a human team would use, just faster.
You can watch them work. In split-pane mode (requires tmux or iTerm2), each agent gets its own terminal pane. You can see all agents working simultaneously, or use Shift+Up/Down to select a specific agent when running in-process. You can also message any agent directly to redirect their approach mid-task.
Use Cases We're Building for Clients
At Jelifish, we build production AI systems on AWS using serverless architectures — Lambda, DynamoDB, Step Functions, and Bedrock. We've been integrating Claude Code agent teams into client workflows for specific categories of work where multi-agent coordination adds clear value.
Infrastructure Reviews
Before deploying CDK stacks to production, imagine an agent team that:
Reads the CDK code and identifies all resources being created or modified
Checks IAM policies against least-privilege principles
Estimates costs based on the resource configuration
Compares against our internal standards (tagging requirements, encryption settings, VPC configuration)
That would replace a manual review process that typically takes 30-45 minutes per deployment, with the agent team producing a structured report in under two minutes.
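The four checks above mostly don't depend on each other, which is exactly when a team pays off: the resource inventory must finish first, then the three analyses can run in parallel. A minimal sketch of that dependency graph (task names are ours, not Claude Code's):

```python
# Hypothetical task graph for the CDK review team. The resource inventory
# must complete first; the three analysis tasks then run in parallel.
TASKS = {
    "inventory":  {"deps": []},            # read CDK code, list resources
    "iam_review": {"deps": ["inventory"]}, # least-privilege check
    "cost_model": {"deps": ["inventory"]}, # estimate from resource config
    "standards":  {"deps": ["inventory"]}, # tagging, encryption, VPC rules
}

def ready_tasks(done: set) -> list:
    """Tasks whose dependencies are all complete and which aren't done yet."""
    return [t for t, spec in TASKS.items()
            if t not in done and all(d in done for d in spec["deps"])]
```

With nothing done, only `inventory` is ready; once it completes, all three analysis tasks unblock at once, which is where the parallel speedup comes from.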
Codebase Exploration
When onboarding to a new project, understanding an unfamiliar codebase quickly is essential. A team with an "explorer" agent maps the project structure, identifies key patterns, and produces an architectural summary, while a second agent specifically looks for security concerns and a third documents the API surface.
The Practical Reality of Working With Agent Teams
It's not all smooth. Here's what we've learned from building these workflows.
Context Windows Are Better, But Still a Constraint
Each agent gets its own context window. With the 1M token beta, that's substantially more room than before — but it's still finite, and performance can degrade on very long contexts. Context compaction helps by summarising older messages, but for the researcher role, you still need to be deliberate about what you ask it to gather. "Research everything about Claude 4.6" will produce a bloated, unfocused mess. "Find the release date, key capability improvements, and pricing changes for Claude 4.6 Opus" produces useful, targeted output.
Instructions Need to Be Specific
Vague role descriptions produce vague work. In one early experiment, we gave a documentation agent instructions that said "write in a professional tone." The output was generic corporate prose. When we changed it to "match the exact style of this specific technical document, use UK English, avoid these specific phrases," the quality improved dramatically.
The same principle applies here as with any prompt engineering: specificity wins.
Coordination Overhead Is Real
For tasks that a single agent could handle in one pass, spinning up a team of four agents is slower and more expensive. Agent teams make sense when:
The task naturally decomposes into sequential stages with clear handoffs
Different stages benefit from different contexts (research vs writing vs review)
You want independent verification (the reviewer hasn't seen the writer's source material, only the output)
Parallel execution of independent subtasks saves meaningful time
For a quick code change or a simple question, a single Claude instance is still the right tool.
Cost Considerations
Each agent in a team is a separate Claude instance with its own token consumption. At $5/$25 per million tokens (input/output) for Opus, a four-agent team costs roughly 3-4x what a single agent would. One useful feature: you can mix models across the team. Run your researcher and writer on Opus for maximum quality, but the SEO agent on Sonnet 4.5 since metadata generation doesn't need the same reasoning depth. This brings the total cost down meaningfully.
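A back-of-envelope cost model makes the mixed-team saving concrete. The Opus rates come from this post; the Sonnet figures and the token counts are illustrative placeholders, not quoted pricing or measured usage:

```python
# $/million tokens (input, output). Opus figures come from this post;
# the Sonnet numbers are illustrative placeholders, not quoted pricing.
PRICES = {"opus": (5.00, 25.00), "sonnet": (1.00, 5.00)}

def agent_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Four-agent run: researcher + writer on Opus, reviewer + SEO on Sonnet.
team = [("opus",   200_000,  8_000),
        ("opus",   150_000, 12_000),
        ("sonnet", 100_000,  4_000),
        ("sonnet",  20_000,  1_000)]
total = sum(agent_cost(m, i, o) for m, i, o in team)
```

Running the two cheap roles on Sonnet costs pennies against the Opus agents, which is why mixing models brings the team total down meaningfully.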
The quality improvement and time saving justify the cost for certain workflows. But it's worth tracking token usage carefully and not using teams where a single agent would suffice.
Known Rough Edges
Agent teams are experimental, and it shows in a few places:
No session resumption: You can't /resume or /rewind a team session. If your terminal crashes, you start over.
One team per session: You need to clean up the current team before starting a new one.
File conflicts: Two agents editing the same file can cause overwrites. Structure your tasks so agents work on separate files, with the team lead as the only one that touches shared resources.
Task status can lag: Agents sometimes forget to mark tasks as completed, which can block downstream work. The idle notification system mostly catches this, but it's worth monitoring.
Where This Fits in the Broader Picture
The industry has been talking about multi-agent systems for a while. Most implementations I've seen fall into two categories: academic demonstrations that don't survive contact with production, or thinly disguised sequential prompts marketed as "agents."
What makes Claude Code's agent teams different is the execution model. These aren't simulated agents passing messages through a Python script. They're independent processes with real tool access, real filesystem interaction, and real coordination primitives. The team lead can create tasks, track dependencies, and reassign work. Specialists can push back on instructions or flag blockers.
It's closer to how a small engineering team actually works than anything else I've used.
At the same time, I want to be clear about what this isn't. It's not replacing your engineers. It's replacing the tedious, well-defined portions of their workflow: the initial research pass, the first draft, the boilerplate review, the metadata generation. Your engineers still make the architectural decisions, review the output, and apply the judgment that comes from years of building production systems.
Getting Started
If you want to try agent teams in Claude Code, here's the practical setup:
1. Enable the feature. It's still experimental. Set the environment variable:
```bash
export CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
```

Or add it to your project's settings.json:

```json
{
  "env": {
    "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1"
  }
}
```

2. Start with a two-agent team. A researcher and a writer, or a coder and a reviewer. Adding more agents adds complexity and cost; start simple and scale up once you understand the coordination patterns.
3. Write detailed role instructions. The more specific your agent descriptions, the better the output. Include examples of good and bad output where possible; the "professional tone" lesson from earlier applies doubly when an agent is working unsupervised.
4. Use delegate mode. This restricts the team lead to coordination-only tools — spawning agents, messaging, task management — and prevents it from trying to do the actual work itself. Without delegate mode, the lead will sometimes start implementing tasks rather than assigning them.
5. Consider plan approval for high-stakes work. You can require agents to submit a plan before executing. The team lead reviews and approves before the agent proceeds. For infrastructure changes or anything touching production, this adds a useful safety check.
6. Review everything. Agent teams produce good first drafts, not finished products. Human review is still essential.
The tooling is maturing fast. Six months ago, getting a single AI agent to reliably complete a multi-step task was an achievement. Now we're coordinating teams of them on production workflows.
It's early days, but the trajectory is clear. The organisations that figure out how to effectively orchestrate AI agent teams will have a significant operational advantage. Not because the AI is smarter than their people, but because it handles the repetitive coordination work that slows good teams down.
Owen from the Jelifish team. We help organisations build and integrate AI agent workflows into their engineering processes using AWS serverless architectures. If this sounds like something your team could benefit from, we'd be glad to talk.