Cobalt — Testing for AI Agents

By: Basalt AI
GitHub: basalt-ai/cobalt
npm: @basalt-ai/cobalt
License: MIT
Stack: TypeScript 5.7, Node.js 20+

Unit testing for AI Agents — Test, evaluate, and improve your AI systems.

Quickstart

npm install @basalt-ai/cobalt
npx cobalt init
npx cobalt run

Core Concepts

Concept | Description
Dataset | Test data: load from JSON, JSONL, CSV, or platforms (Langfuse, LangSmith, Braintrust, Basalt). Immutable, chainable (filter(), map(), sample(), slice())
Evaluator | Score outputs. 4 types: LLM judge (boolean or 0-1 scale), custom functions, semantic similarity (cosine/dot), Autoevals (11 battle-tested)
Experiment | Run agent against dataset, evaluate each output, produce structured report with per-evaluator stats (avg, min, max, p50, p95, p99). Supports parallel execution, multiple runs, timeouts, CI thresholds
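
The per-evaluator stats listed for an experiment report (avg, min, max, p50, p95, p99) can be illustrated with a small standalone sketch. It does not use Cobalt's API: `ScoreStats`, `percentile`, and `summarize` are hypothetical names, and the nearest-rank percentile method is an assumption, since the source does not say how Cobalt computes percentiles.

```typescript
// Illustrative only: summarizes one evaluator's scores across dataset items.
// Names and the nearest-rank percentile method are assumptions, not Cobalt's API.
interface ScoreStats {
  avg: number
  min: number
  max: number
  p50: number
  p95: number
  p99: number
}

// Nearest-rank percentile over an ascending-sorted, non-empty array.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}

function summarize(scores: number[]): ScoreStats {
  if (scores.length === 0) throw new Error('summarize() needs at least one score')
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...scores].sort((a, b) => a - b)
  const sum = sorted.reduce((acc, s) => acc + s, 0)
  return {
    avg: sum / sorted.length,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  }
}

const stats = summarize([0.5, 1, 0.75, 0.25])
console.log(stats)
```

With only a handful of items, p95 and p99 collapse to the maximum score, so the tail percentiles become informative mainly on larger datasets or multiple runs.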

Example

import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'
 
const dataset = new Dataset({
  items: [
    { input: 'What is 2+2?', expectedOutput: '4' },
    { input: 'Capital of France?', expectedOutput: 'Paris' },
  ],
})
 
const evaluators = [
  new Evaluator({
    name: 'Correctness',
    type: 'llm-judge',
    prompt: 'Is the output correct?\nExpected: {{expectedOutput}}\nActual: {{output}}',
  }),
]
 
// myAgent is a placeholder for your own agent function; replace it with your agent's entry point.
experiment('qa-agent', dataset, async ({ item }) => {
  const result = await myAgent(item.input)
  return { output: result }
}, { evaluators })
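
The judge prompt in the example uses `{{expectedOutput}}` and `{{output}}` placeholders. How Cobalt fills them is not described here, so the following is only a sketch of the general idea: `renderPrompt` is a hypothetical helper, not part of the Cobalt API, and leaving unknown placeholders untouched is an assumed design choice.

```typescript
// Illustrative only: fills {{name}} placeholders in a judge prompt.
// renderPrompt is a hypothetical helper, not part of the Cobalt API.
function renderPrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) =>
    // Leave unknown placeholders untouched so missing values are easy to spot.
    Object.prototype.hasOwnProperty.call(values, key) ? values[key] : match,
  )
}

const prompt = renderPrompt(
  'Is the output correct?\nExpected: {{expectedOutput}}\nActual: {{output}}',
  { expectedOutput: 'Paris', output: 'Paris, France' },
)
console.log(prompt)
```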

MCP Server

The built-in MCP server gives AI coding assistants (Claude Code, etc.) direct access to Cobalt:

{
  "mcpServers": {
    "cobalt": {
      "command": "npx",
      "args": ["cobalt", "mcp"]
    }
  }
}

Tools:

  • cobalt_run — Run experiments
  • cobalt_results — View results
  • cobalt_compare — Diff two runs
  • cobalt_generate — Generate experiments

Resources: cobalt://config, cobalt://experiments, cobalt://latest-results

Prompts: improve-agent (analyze failures), generate-tests (add test cases), regression-check (detect regressions)
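
To make the idea behind cobalt_compare and the regression-check prompt concrete, here is a minimal standalone sketch of diffing two runs. It is not Cobalt's implementation: the `RunSummary` shape, the `findRegressions` function, and the 0.02 tolerance are all assumptions chosen for illustration.

```typescript
// Illustrative only: the RunSummary shape and the default tolerance are assumptions.
type RunSummary = Record<string, number> // evaluator name -> average score

interface Regression {
  evaluator: string
  before: number
  after: number
  delta: number
}

function findRegressions(base: RunSummary, candidate: RunSummary, tolerance = 0.02): Regression[] {
  const found: Regression[] = []
  for (const [evaluator, before] of Object.entries(base)) {
    const after = candidate[evaluator]
    if (after === undefined) continue // evaluator missing from the candidate run
    const delta = after - before
    // Only drops larger than the tolerance count as regressions.
    if (delta < -tolerance) found.push({ evaluator, before, after, delta })
  }
  return found
}

const regressions = findRegressions(
  { Correctness: 0.8, Conciseness: 0.7 },
  { Correctness: 0.6, Conciseness: 0.71 },
)
console.log(regressions.map((r) => r.evaluator))
```

A tolerance is used because LLM-judged scores tend to vary slightly between runs, and small fluctuations should not be reported as regressions.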

AI-First

Cobalt integrates with AI instruction files (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md). After init, your AI assistant knows how to use Cobalt from day one. Run cobalt update to regenerate skills.

Example Prompts

  • “Compare gpt 5.1 and 5.2 on my agent and tell me which one is the best”
  • “Run my QA experiment and tell me which test cases are failing”
  • “Generate a Cobalt experiment for my agent at src/agents/summarizer.ts”
  • “Compare my last two runs and check for regressions”
  • “My agent is scoring 60% on correctness. Analyze the failures and suggest code fixes”

CI/CD

GitHub Action

- uses: basalt-ai/cobalt@v1
  with:
    api_key: ${{ secrets.OPENAI_API_KEY }}

The action posts PR comments with score tables, automatically compares results against the base branch, and can optionally generate an AI-powered analysis.

CLI

npx cobalt run --ci
# Exit code 1 if any threshold is violated
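
The quality-gate behavior of the --ci flag (exit code 1 when any threshold is violated) can be sketched as follows. This is a standalone illustration under stated assumptions: the `Threshold` shape, `violatedThresholds`, and `exitCodeFor` are hypothetical and do not reflect Cobalt's actual configuration format or internals.

```typescript
// Illustrative only: the Threshold shape is hypothetical, not Cobalt's config format.
interface Threshold {
  evaluator: string
  minAvg: number
}

function violatedThresholds(averages: Record<string, number>, thresholds: Threshold[]): Threshold[] {
  // A missing score counts as a violation so a renamed evaluator cannot pass silently.
  return thresholds.filter((t) => {
    const avg = averages[t.evaluator]
    return avg === undefined || avg < t.minAvg
  })
}

// Maps the list of violations to a process exit code: 1 on any violation, 0 otherwise.
function exitCodeFor(violations: Threshold[]): number {
  return violations.length > 0 ? 1 : 0
}

const violations = violatedThresholds(
  { Correctness: 0.6 },
  [{ evaluator: 'Correctness', minAvg: 0.8 }],
)
console.log(exitCodeFor(violations))
```

In a real CI script the returned code would be passed to the process exit; it is kept as a plain return value here so the logic can be tested in isolation.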

Integrations

  • Langfuse
  • LangSmith
  • Braintrust
  • Basalt

Results

Results are tracked in SQLite, with built-in comparison tools, cost estimation, and CI/CD quality gates.

Roadmap

  • Datasets, evaluators, experiments
  • MCP server
  • CI/CD integration
  • AI assistant integration (CLAUDE.md, AGENTS.md)
  • Langfuse integration
  • LangSmith integration
  • Braintrust integration
  • VSCode extension
  • Web dashboard
  • Experiment versioning