How Docker Built a Virtual Agent Fleet to Ship Faster: Inside the Coding Agent Sandboxes Team

Introduction: The Challenge of Testing and Triage

Shipping software reliably requires constant testing, issue triage, and release management. Docker’s Coding Agent Sandboxes (sbx) team faced this challenge head-on. Their product provides secure, microVM-based isolation for running AI coding agents—like Claude Code, Gemini, Codex, Docker Agent, and Kiro—giving each agent full autonomy inside its own sandbox (with its own Docker daemon, network, and filesystem) without touching the host system.

Source: www.docker.com

Maintaining quality across macOS, Linux, and Windows, handling upgrade paths, managing resource leaks under load, and keeping up with a growing issue backlog were becoming overwhelming. Rather than writing traditional test scripts and reporting tools, the team built a virtual team of seven AI agent roles that autonomously test the product, triage issues, post release notes, and even fix bugs. They call it The Fleet.

The Fleet: A Team of Seven AI Agent Roles

The Fleet is built on Claude Code skills—markdown files that define an agent’s persona, responsibilities, and allowed tools. Unlike a script that says “run these steps,” a skill acts as a role description: “You are the build engineer; here’s what you know and how you make decisions.” This distinction is crucial because agents need judgment, not just instructions. When a test fails unexpectedly, a script stops. An agent investigates.
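To make the idea of a skill as a role description concrete, here is a minimal Python sketch that models a role and renders it as a markdown skill file. The class, its field names, the frontmatter keys (name, description, allowed-tools), and the sample role text are all assumptions for illustration; the source does not show the team's actual skill files.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A role description for an agent (hypothetical model, not Docker's schema)."""
    name: str
    description: str
    persona: str
    responsibilities: list[str] = field(default_factory=list)
    allowed_tools: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the role as a markdown file with a small frontmatter header."""
        lines = [
            "---",
            f"name: {self.name}",
            f"description: {self.description}",
            f"allowed-tools: {', '.join(self.allowed_tools)}",  # assumed key name
            "---",
            "",
            self.persona,
            "",
            "## Responsibilities",
            "",
        ]
        lines += [f"- {item}" for item in self.responsibilities]
        return "\n".join(lines) + "\n"


# Sample role text invented for illustration.
cli_tester = Skill(
    name="cli-tester",
    description="Exploratory tester for the sbx CLI",
    persona="You are the exploratory tester. When a command fails, investigate before reporting.",
    responsibilities=[
        "Build the CLI binaries",
        "Exercise commands and look for unexpected behavior",
        "Report each issue with steps to reproduce",
    ],
    allowed_tools=["Bash", "Read", "Grep"],
)

print(cli_tester.to_markdown())
```

The persona line is where the judgment lives: it tells the agent how to react when something goes wrong, which a fixed list of steps cannot do.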

The same skill file produces the same behavior whether it runs on a developer’s laptop or in CI. This consistency is a cornerstone of the Fleet’s design.

Real-World Roles: The /cli-tester and Others

One of the first skills built was /cli-tester, the Fleet's exploratory tester. It compiles the CLI binaries, exercises commands, finds issues, and reports them. The other roles include a build engineer, a release note writer, an issue triager, a regression hunter, a documentation reviewer, and a pull request reviewer. Each skill file defines the tools available to its role (such as sbx commands, git, and the GitHub API) and the decision-making framework the role follows.

Local First, CI Second: The Development Philosophy

The design principle behind the Fleet is simple: every skill runs on your machine first. When building the /cli-tester skill, the team didn’t start with a GitHub workflow. They invoked it locally, watched it build binaries, exercise commands, find issues, and report them. They tweaked the skill until it did the right thing in their terminal—then wired it into CI.

Why Local-First Matters

Building CI-only agents leads to painful commit-push-wait-read-logs cycles. Each iteration takes minutes. When the skill runs locally, iteration takes seconds. You see the agent think. You see where it gets confused. You fix the skill file, re-invoke, and refine. This rapid feedback loop accelerates development dramatically.

CI as Another Runtime for the Same Skill

The /cli-tester that runs nightly on macOS, Linux, and Windows runners is the exact same skill invoked from a developer’s terminal. The CI workflow sets up the environment, checks out the code, and calls the skill. No separate “CI version.” No translation layer. One skill, two runtimes. This ensures consistency and eliminates duplication.
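The "one skill, two runtimes" idea can be sketched as a single launcher that both a developer and a CI job call. This sketch assumes the skill is invoked as a slash command through Claude Code's non-interactive claude -p mode and that the CI runner sets a CI environment variable to "true"; the function names are hypothetical, and the source does not show the team's actual workflow.

```python
import os
import subprocess


def skill_command(skill: str) -> list[str]:
    """Build the agent invocation; identical for every runtime (assumed CLI form)."""
    return ["claude", "-p", f"/{skill}"]


def detect_runtime(env: dict[str, str]) -> str:
    """Label the runtime for logging only; it never changes the command."""
    return "ci" if env.get("CI") == "true" else "local"


def run_skill(skill: str, dry_run: bool = True) -> list[str]:
    """Print the command and, unless dry_run is set, execute it."""
    command = skill_command(skill)
    print(f"[{detect_runtime(dict(os.environ))}] {' '.join(command)}")
    if not dry_run:
        # Requires the claude CLI on PATH; raises if the agent exits non-zero.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    run_skill("cli-tester")
```

The design point is that the runtime label is only used for logging. Environment setup differs between a laptop and a runner, but the command that reaches the agent does not.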

How the Fleet Ships Faster

By delegating repetitive tasks that still require judgment to agent roles, the human team can focus on high-level design and complex bugs. The Fleet:

Runs exploratory tests on macOS, Linux, and Windows nightly.
Triages issues and assigns priorities based on severity and frequency.
Generates release notes from git history and test results.
Fixes identified bugs autonomously (when within scope) and opens pull requests.
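The triage item above mentions severity and frequency. One way a skill's decision framework could express that as a starting rubric is sketched below. The weights, thresholds, priority labels, and sample issues are invented for illustration and are not taken from the team's triage skill; in the Fleet, the agent applies judgment rather than a fixed formula.

```python
from dataclasses import dataclass

# Illustrative severity scale; not from the source.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Issue:
    title: str
    severity: str  # one of SEVERITY_WEIGHT's keys
    reports: int   # how many times the issue was observed


def priority(issue: Issue) -> str:
    """Combine severity and frequency into a priority label (made-up thresholds)."""
    # Cap the frequency so a noisy minor issue cannot outrank a severe one forever.
    score = SEVERITY_WEIGHT[issue.severity] * min(issue.reports, 5)
    if issue.severity == "critical" or score >= 12:
        return "P0"
    if score >= 6:
        return "P1"
    return "P2"


# Sample backlog invented for illustration.
backlog = [
    Issue("sandbox leaks file descriptors under load", "high", 4),
    Issue("typo in help text", "low", 2),
    Issue("upgrade fails on Windows", "critical", 1),
]

# Labels sort lexically, so P0 issues come first.
for issue in sorted(backlog, key=priority):
    print(priority(issue), issue.title)
```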

This virtual team operates in CI without human supervision, but the human team can always step in by modifying skill files—no code changes needed.

Key Takeaways for Your Team

Think roles, not scripts. Give agents roles with judgment, not step-by-step instructions.
Develop locally first. Debugging an agent in your terminal takes seconds per iteration, compared with minutes per iteration in CI.
Use the same skill everywhere to ensure predictable behavior and minimize maintenance.
Start with one agent role that solves a real pain point, then expand.

The Fleet approach turns CI/CD from a pipeline into a collaborative environment where humans and AI agents work together. For Docker's Coding Agent Sandboxes team, the result is faster shipping with fewer manual interventions.
