Cobalt — Testing for AI Agents

By: Basalt AI
GitHub: basalt-ai/cobalt
npm: @basalt-ai/cobalt
License: MIT
Stack: TypeScript 5.7, Node.js 20+

Unit testing for AI Agents — Test, evaluate, and improve your AI systems.

Quickstart

npm install @basalt-ai/cobalt
npx cobalt init
npx cobalt run

Core Concepts

Concept	Description
Dataset	Test data — load from JSON, JSONL, CSV, or platforms (Langfuse, LangSmith, Braintrust, Basalt). Immutable, chainable (`filter()`, `map()`, `sample()`, `slice()`)
Evaluator	Score outputs. 4 types: LLM judge (boolean or 0-1 scale), custom functions, semantic similarity (cosine/dot), Autoevals (11 battle-tested)
Experiment	Run agent against dataset, evaluate each output, produce structured report with per-evaluator stats (avg, min, max, p50, p95, p99). Supports parallel execution, multiple runs, timeouts, CI thresholds
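
To make the per-evaluator stats in the Experiment row concrete, here is a minimal sketch of how such a summary could be computed from a list of scores. It is not Cobalt's implementation: `summarize`, `percentile`, and `ScoreStats` are hypothetical names, and the nearest-rank percentile method is an assumption, since the percentile method Cobalt uses is not stated here.

```typescript
// Illustrative sketch only; names and percentile method are assumptions.
interface ScoreStats {
  avg: number
  min: number
  max: number
  p50: number
  p95: number
  p99: number
}

// Nearest-rank percentile over an array sorted in ascending order.
function percentile(sorted: number[], p: number): number {
  const rank = Math.ceil((p * sorted.length) / 100)
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1))
  return sorted[index]
}

// Aggregate one evaluator's scores into the stats listed in the table.
function summarize(scores: number[]): ScoreStats {
  if (scores.length === 0) {
    throw new Error('summarize() needs at least one score')
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...scores].sort((a, b) => a - b)
  const sum = sorted.reduce((acc, s) => acc + s, 0)
  return {
    avg: sum / sorted.length,
    min: sorted[0],
    max: sorted[sorted.length - 1],
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  }
}

// Five scores from a hypothetical 0-1 evaluator.
const correctnessStats = summarize([1, 0, 1, 1, 0.5])
```

With only a handful of items, p95 and p99 collapse to the maximum, so the high percentiles become informative mainly on larger datasets or multiple runs.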

Example

import { experiment, Dataset, Evaluator } from '@basalt-ai/cobalt'
 
const dataset = new Dataset({
  items: [
    { input: 'What is 2+2?', expectedOutput: '4' },
    { input: 'Capital of France?', expectedOutput: 'Paris' },
  ],
})
 
const evaluators = [
  new Evaluator({
    name: 'Correctness',
    type: 'llm-judge',
    prompt: 'Is the output correct?\nExpected: {{expectedOutput}}\nActual: {{output}}',
  }),
]
 
// myAgent is a placeholder for your own agent function; it is not exported by Cobalt.
experiment('qa-agent', dataset, async ({ item }) => {
  const result = await myAgent(item.input)
  return { output: result }
}, { evaluators })
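
The judge prompt above uses `{{expectedOutput}}` and `{{output}}` placeholders. How Cobalt fills them in is not described here, so the following is only a sketch of the general idea: `renderTemplate` is a hypothetical helper, not part of the Cobalt API, and leaving unknown placeholders untouched is an assumption.

```typescript
// Hypothetical helper: replaces {{name}} placeholders with values from a map.
function renderTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match: string, key: string) => {
    // Only own properties count, so keys such as "toString" are not picked up by accident.
    const value = Object.prototype.hasOwnProperty.call(values, key) ? values[key] : undefined
    // Unknown placeholders are left as written (an assumption of this sketch).
    return value ?? match
  })
}

const judgePrompt = 'Is the output correct?\nExpected: {{expectedOutput}}\nActual: {{output}}'

// Values for one dataset item and one agent response.
const renderedPrompt = renderTemplate(judgePrompt, {
  expectedOutput: 'Paris',
  output: 'Paris, France',
})
```

Using a replacer function rather than a replacement string keeps characters like `$` in agent outputs from being interpreted as substitution patterns.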

MCP Server

The built-in MCP server gives AI coding assistants (Claude Code, etc.) direct access to Cobalt's tools, resources, and prompts. Register it in the assistant's MCP configuration:

{
  "mcpServers": {
    "cobalt": {
      "command": "npx",
      "args": ["cobalt", "mcp"]
    }
  }
}

Tools:

cobalt_run — Run experiments
cobalt_results — View results
cobalt_compare — Diff two runs
cobalt_generate — Generate experiments

Resources: cobalt://config, cobalt://experiments, cobalt://latest-results

Prompts: improve-agent (analyze failures), generate-tests (add test cases), regression-check (detect regressions)

AI-First

Cobalt integrates with AI instruction files (CLAUDE.md, AGENTS.md, .github/copilot-instructions.md). After init, your AI assistant knows how to use Cobalt from day one. Run cobalt update to regenerate skills.

Example Prompts

“Compare gpt 5.1 and 5.2 on my agent and tell me which one is the best”
“Run my QA experiment and tell me which test cases are failing”
“Generate a Cobalt experiment for my agent at src/agents/summarizer.ts”
“Compare my last two runs and check for regressions”
“My agent is scoring 60% on correctness. Analyze the failures and suggest code fixes”

CI/CD

GitHub Action

- uses: basalt-ai/cobalt@v1
  with:
    api_key: ${{ secrets.OPENAI_API_KEY }}

The action posts PR comments with score tables, automatically compares results against the base branch, and can optionally generate an AI-powered analysis.

CLI

npx cobalt run --ci
# Exit code 1 if any threshold is violated
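
The gate behind `--ci` can be pictured as a check of each evaluator's score against a minimum. The sketch below illustrates that logic under stated assumptions: the `Threshold` shape, the `minAvg` field, and both function names are hypothetical, and Cobalt's real threshold configuration may look different.

```typescript
// Hypothetical threshold shape: a minimum average score per evaluator.
interface Threshold {
  evaluator: string
  minAvg: number
}

// Returns one message per violated threshold; an empty array means the gate passes.
function findViolations(
  averages: Partial<Record<string, number>>,
  thresholds: Threshold[],
): string[] {
  const violations: string[] = []
  for (const t of thresholds) {
    const avg = averages[t.evaluator]
    if (avg === undefined) {
      // Treating a missing score as a failure is an assumption of this sketch.
      violations.push(`${t.evaluator}: no score recorded`)
    } else if (avg < t.minAvg) {
      violations.push(`${t.evaluator}: avg ${avg} is below ${t.minAvg}`)
    }
  }
  return violations
}

// 0 when every threshold passed, 1 when at least one was violated.
function exitCodeFor(violations: string[]): number {
  return violations.length > 0 ? 1 : 0
}

const gateViolations = findViolations(
  { Correctness: 0.6 },
  [{ evaluator: 'Correctness', minAvg: 0.8 }],
)
const gateExitCode = exitCodeFor(gateViolations)
```

In a Node.js script the returned code would typically be assigned to `process.exitCode` so the CI job fails without cutting off pending output.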

Integrations

Langfuse
LangSmith
Braintrust
Basalt

Results

Tracked in SQLite with built-in comparison tools, cost estimation, and CI/CD quality gates.
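
As a rough picture of what comparing two stored runs involves, the sketch below diffs per-evaluator averages and flags drops larger than a tolerance. It does not reflect Cobalt's storage schema or its comparison output: `compareRuns`, `RunAverages`, `EvaluatorDiff`, and the default tolerance of 0.01 are all assumptions made for illustration.

```typescript
// Hypothetical shape: evaluator name mapped to its average score for one run.
type RunAverages = Partial<Record<string, number>>

interface EvaluatorDiff {
  evaluator: string
  baseline: number
  candidate: number
  delta: number
  regressed: boolean
}

// Compare evaluators present in both runs; a drop larger than `tolerance` counts as a regression.
function compareRuns(
  baseline: RunAverages,
  candidate: RunAverages,
  tolerance = 0.01,
): EvaluatorDiff[] {
  const diffs: EvaluatorDiff[] = []
  for (const evaluator of Object.keys(baseline)) {
    const before = baseline[evaluator]
    const after = candidate[evaluator]
    // Skip evaluators that were not scored in both runs.
    if (before === undefined || after === undefined) continue
    const delta = after - before
    diffs.push({ evaluator, baseline: before, candidate: after, delta, regressed: delta < -tolerance })
  }
  return diffs
}

const runDiffs = compareRuns(
  { Correctness: 0.8, Tone: 0.9 },
  { Correctness: 0.6, Tone: 0.9 },
)
const regressedEvaluators = runDiffs.filter((d) => d.regressed).map((d) => d.evaluator)
```

The tolerance keeps small run-to-run noise, which is common with LLM judges, from being reported as a regression.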

description	TypeScript testing framework for AI agents and LLM-powered apps by Basalt AI. Define datasets, run agents, evaluate with LLM judges. Built-in MCP server, CI/CD gates, SQLite tracking.
tags	testing, ai-agents, evaluation, typescript, mcp, ci-cd, llm-judge
