Cobalt — Testing for AI Agents
By: Basalt AI
GitHub: basalt-ai/cobalt
npm: @basalt-ai/cobalt
License: MIT
Stack: TypeScript 5.7, Node.js 20+
Unit testing for AI Agents — Test, evaluate, and improve your AI systems.
Quickstart
npm install @basalt-ai/cobalt
npx cobalt init
npx cobalt run
Core Concepts
| Concept | Description |
|---|---|
| Dataset | Test data — load from JSON, JSONL, CSV, or platforms (Langfuse, LangSmith, Braintrust, Basalt). Immutable, chainable (filter(), map(), sample(), slice()) |
| Evaluator | Score outputs. 4 types: LLM judge (boolean or 0-1 scale), custom functions, semantic similarity (cosine/dot), Autoevals (11 battle-tested) |
| Experiment | Run agent against dataset, evaluate each output, produce structured report with per-evaluator stats (avg, min, max, p50, p95, p99). Supports parallel execution, multiple runs, timeouts, CI thresholds |
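The per-evaluator statistics named in the table (avg, min, max, p50, p95, p99) can be illustrated with a small self-contained sketch. This is not Cobalt's implementation: `summarize`, `percentile`, and `ScoreStats` are hypothetical names, and the nearest-rank percentile method is an assumption, since the source does not say which method Cobalt uses.

```typescript
// Hypothetical helper types and functions; not part of Cobalt's API.
interface ScoreStats {
  avg: number
  min: number
  max: number
  p50: number
  p95: number
  p99: number
}

// Nearest-rank percentile over an array sorted in ascending order (assumed method).
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}

// Reduce one evaluator's scores across all dataset items to summary statistics.
function summarize(scores: number[]): ScoreStats {
  if (scores.length === 0) {
    throw new Error('summarize() needs at least one score')
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...scores].sort((a, b) => a - b)
  const sum = sorted.reduce((acc, s) => acc + s, 0)
  return {
    avg: sum / sorted.length,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  }
}

const stats = summarize([0.2, 0.9, 1.0, 0.7, 0.8])
console.log(stats)
```

With only a handful of items, p95 and p99 collapse to the maximum score, so the tail percentiles become informative only on larger datasets or across multiple runs.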
Example
import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'
// Test cases: each item pairs an input with the output we expect.
const dataset = new Dataset({
  items: [
    { input: 'What is 2+2?', expectedOutput: '4' },
    { input: 'Capital of France?', expectedOutput: 'Paris' },
  ],
})
// LLM judge that checks the agent's output against the expected output.
const evaluators = [
  new Evaluator({
    name: 'Correctness',
    type: 'llm-judge',
    prompt: 'Is the output correct?\nExpected: {{expectedOutput}}\nActual: {{output}}',
  }),
]
// myAgent is a placeholder for your own agent function.
experiment('qa-agent', dataset, async ({ item }) => {
  const result = await myAgent(item.input)
  return { output: result }
}, { evaluators })
MCP Server
Built-in MCP server gives AI coding assistants (Claude Code, etc.) direct access:
{
  "mcpServers": {
    "cobalt": {
      "command": "npx",
      "args": ["cobalt", "mcp"]
    }
  }
}
Tools:
- cobalt_run: Run experiments
- cobalt_results: View results
- cobalt_compare: Diff two runs
- cobalt_generate: Generate experiments
Resources: cobalt://config, cobalt://experiments, cobalt://latest-results
Prompts: improve-agent (analyze failures), generate-tests (add test cases), regression-check (detect regressions)
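To make the idea of diffing two runs and flagging regressions concrete, here is a minimal sketch. It does not reflect how cobalt_compare or the regression-check prompt work internally. The `RunAverages` shape, the `compareRuns` function, and the 0.02 tolerance are all assumptions made for illustration.

```typescript
// Hypothetical shape: evaluator name -> average score for one run.
// Cobalt's real result format may differ.
type RunAverages = Record<string, number>

interface EvaluatorDiff {
  evaluator: string
  base: number
  candidate: number
  delta: number
  regressed: boolean
}

// Compare the evaluators present in both runs and flag any drop larger
// than the tolerance as a regression (the tolerance value is an assumption).
function compareRuns(
  base: RunAverages,
  candidate: RunAverages,
  tolerance = 0.02,
): EvaluatorDiff[] {
  return Object.keys(base)
    .filter((name) => name in candidate)
    .map((name) => {
      const delta = candidate[name] - base[name]
      return {
        evaluator: name,
        base: base[name],
        candidate: candidate[name],
        delta,
        regressed: delta < -tolerance,
      }
    })
}

const diffs = compareRuns(
  { Correctness: 0.8, Tone: 0.9 },
  { Correctness: 0.6, Tone: 0.91 },
)
for (const d of diffs) {
  const flag = d.regressed ? ' (regression)' : ''
  console.log(`${d.evaluator}: ${d.base} -> ${d.candidate}${flag}`)
}
```

A small tolerance keeps run-to-run noise from LLM-judged scores from being reported as a regression; a suitable value depends on the evaluator and the dataset size.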
AI-First
Cobalt integrates with AI instruction files (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md). After init, your AI assistant knows how to use Cobalt from day one. Run cobalt update to regenerate skills.
Example Prompts
- “Compare gpt 5.1 and 5.2 on my agent and tell me which one is the best”
- “Run my QA experiment and tell me which test cases are failing”
- “Generate a Cobalt experiment for my agent at src/agents/summarizer.ts”
- “Compare my last two runs and check for regressions”
- “My agent is scoring 60% on correctness. Analyze the failures and suggest code fixes”
CI/CD
GitHub Action
- uses: basalt-ai/cobalt@v1
  with:
    api_key: ${{ secrets.OPENAI_API_KEY }}
The action posts rich PR comments with score tables, auto-compares against the base branch, and optionally generates AI-powered analysis.
CLI
npx cobalt run --ci
# Exit code 1 if any threshold is violated
Integrations
- Langfuse
- LangSmith
- Braintrust
- Basalt
Results
Tracked in SQLite with built-in comparison tools, cost estimation, and CI/CD quality gates.
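A quality gate of this kind comes down to comparing each evaluator's score with a configured minimum and failing the build when any score falls short. The sketch below shows that logic under stated assumptions: `findViolations` and the `Thresholds` shape are hypothetical and are not Cobalt's configuration format.

```typescript
// Hypothetical shapes: minimum acceptable average per evaluator name,
// and the averages actually measured in a run.
type Thresholds = Record<string, number>
type Averages = Record<string, number>

// Return a human-readable message for every threshold that is not met.
function findViolations(averages: Averages, thresholds: Thresholds): string[] {
  const violations: string[] = []
  for (const [name, minimum] of Object.entries(thresholds)) {
    const actual = averages[name]
    if (actual === undefined) {
      // Treat a missing score as a failure rather than silently passing.
      violations.push(`${name}: no score recorded`)
    } else if (actual < minimum) {
      violations.push(`${name}: ${actual} is below the minimum ${minimum}`)
    }
  }
  return violations
}

const violations = findViolations({ Correctness: 0.6 }, { Correctness: 0.8 })
// A CI wrapper would exit with this code so the pipeline fails on any violation.
const exitCode = violations.length > 0 ? 1 : 0
console.log(violations, exitCode)
```

Treating a missing score as a violation is a deliberate choice in this sketch: an evaluator that was renamed or removed should fail the gate instead of passing unnoticed.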
Roadmap
- Datasets, evaluators, experiments
- MCP server
- CI/CD integration
- AI assistant integration (CLAUDE.md, AGENTS.md)
- Langfuse integration
- LangSmith integration
- Braintrust integration
- VSCode extension
- Web dashboard
- Experiment versioning