Installs in seconds

Your agent forgot. Baseline did not.

Probe your agent in seconds and track real-world drift over time. Keep Hermes and OpenClaw dialed in, and catch issues before they reach your work.

$ brew install trackbaseline/tap/baseline
$ baseline run
-> run_8f2c . score 92 . 11 ok / 3 watch / 0 fail
# watch: mcp.openclaw.config drifted from baseline_clean-local
# watch: latency.tool_probe +312ms over 5-day median
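
The latency watch above reports a delta against a rolling median. A minimal sketch of that arithmetic follows, assuming a list of recent tool-probe latencies in milliseconds. The sample numbers, the function names, and the 250 ms watch threshold are all hypothetical and are not taken from Baseline itself.

```python
from statistics import median

def latency_delta_ms(history_ms, current_ms):
    """Return how far the current latency sits above the median of history."""
    return current_ms - median(history_ms)

def classify(delta_ms, watch_threshold_ms=250):
    # Hypothetical rule: anything above the threshold is flagged as "watch".
    return "watch" if delta_ms > watch_threshold_ms else "ok"

# Illustrative numbers only: five daily values, then today's probe.
history = [480, 510, 495, 505, 500]
delta = latency_delta_ms(history, 812)
print(f"latency.tool_probe {delta:+.0f}ms over 5-day median -> {classify(delta)}")
```

A median is used here rather than a mean so that one slow day does not shift the reference point much.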

Today . across your workstations

Three agents. One scoreboard.

OpenClaw . apollo-street/api
92/100 . watch . +1
Last 13 runs
identity . model / context / goal matched
watch: mcp.config . drift from clean-local
watch: latency.tool . +312ms over 5-day median

Hermes . apollo-street/web
97/100 . ok . +2
Last 13 runs
identity . stable for 14 runs
memory . context survives across calls
safety.scrubber . no leaks detected

Claude Code . apollo-street/infra
78/100 . fail . -9
Last 13 runs
fail: memory . context lost mid-session
watch: variance . 2 of 5 prompts diverged
watch: repo.workspace . dirty: 11 unstaged files
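
A variance result such as "2 of 5 prompts diverged" can be read as a count of answers that disagree with the most common one. The sketch below shows one way to compute that, assuming answers are compared after trimming whitespace and lowercasing. That normalization, the function name, and the sample answers are assumptions for illustration, not Baseline's documented behavior.

```python
from collections import Counter

def diverged(answers):
    """Count answers that differ from the most common answer."""
    if not answers:
        return 0
    # Assumed normalization: ignore surrounding whitespace and letter case.
    normalized = [a.strip().lower() for a in answers]
    _, majority_count = Counter(normalized).most_common(1)[0]
    return len(normalized) - majority_count

# Five replies to the same prompt (made-up data).
answers = ["main", "main", "Main ", "master", "develop"]
print(f"{diverged(answers)} of {len(answers)} prompts diverged")
```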

14 . probes in the default set

~8s . P50 local run

0 raw . cloud export: prompts / outputs / paths

7 MCP . setup, run, doctor, report, accept, schedule, scrub

Run Baseline daily

Compare agent responses against a known-good baseline.

Learn about latency, drift, and memory issues in real time, before they impact your work.

Enable daily deep or fast checks.
Compare responses against good baselines, by agent.
Review changes over time to see model and harness impact.

The default set

Fourteen probes.

01 Identity identity.

Model, provider, context window, primary goal.

v0.1

02 Awareness repo.

Workspace path, clean/dirty state, branch.

v0.1

03 Tools tooling.

MCP server reachable, allowed tool surface declared.

v0.1

04 Memory memory.

Carried context survives between calls. Same answer twice.

v0.1

05 Speed latency.

P50 and P95 against the 5-day local median.

v0.1

06 Stability variance.

Five identical prompts. Five identical answers.

v0.1

07 Safety safety.

Scrubber catches paths, secrets, prompt fragments.

v0.1

08 Style style.

Repository conventions: tabs, naming, file layout.

v0.1

09 Awareness change.

Reports any tool / MCP / repo / config change since the Good Baseline.

v0.1

10 Basic reasoning.

Two-plus-two. Date. The smoke test for a broken model.

v0.1

11 Obedience instruction.

Answer only the word. Answer only the number.

v0.1

12 Safety redaction.

Redacted summaries are verified against the original on push.

v0.1

13 Tools tool-call.

Tool calls actually fire. Names match the declared surface.

v0.1

14 Stability session.

A second session in the same workspace agrees with the first.

v0.1
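
The safety probe above describes a scrubber that catches paths and secrets. As a rough illustration of the idea, the sketch below replaces matches with typed placeholders and counts what it caught. The two regular expressions, the placeholder format, and the `scrub` function are hypothetical; the actual rules Baseline applies are not described here, and a real scrubber would need a much broader pattern set.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    # Two or more slash-separated segments, e.g. /Users/sam/project
    "path": re.compile(r"(?:/[\w.-]+){2,}"),
    # A few common token prefixes followed by a long opaque string
    "secret": re.compile(r"\b(?:sk|ghp|xoxb)[-_][A-Za-z0-9_-]{16,}\b"),
}

def scrub(text):
    """Replace each match with a typed placeholder and count the hits."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts

clean, hits = scrub("read /Users/sam/project/.env with key sk-abcdefghijklmnop1234")
print(clean)
print(hits)
```

Returning the counts alongside the cleaned text makes it possible to report that something was redacted without exporting the raw value.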

Four commands

Install. Establish. Compare. Accept.

This is the local loop. Seven MCP tools sit on top for agents that want to call Baseline themselves. No web app is required to get value; the cloud surface is for history, alerts, and team-visible evidence once the workstation is already producing redacted runs.

01 $ baseline setup

Detect the local agent, write SQLite, run preflight.

02 $ baseline run --mode fast

Fourteen probes against the active workstation. About eight seconds.

03 $ baseline accept run_8f2c --label clean-local

Review the report. Mark this run as your Good Baseline.

04 $ baseline compare

Every later run is judged against the accepted one.
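
Conceptually, judging a later run against the accepted one is a per-probe diff. The sketch below shows that idea, assuming each run can be represented as a mapping from probe name to observed value. The data shape, the probe keys, and the `compare` function are hypothetical and do not reflect how Baseline stores or compares runs internally.

```python
def compare(baseline, run):
    """Return probes whose value differs from the accepted baseline.

    Each entry maps a probe name to an (expected, actual) pair.
    A probe missing from the new run shows up with actual=None.
    """
    changed = {}
    for probe, expected in baseline.items():
        actual = run.get(probe)
        if actual != expected:
            changed[probe] = (expected, actual)
    return changed

# Made-up values for an accepted run and a later run.
accepted = {"identity.model": "model-a", "repo.branch": "main", "repo.clean": True}
later = {"identity.model": "model-a", "repo.branch": "feature-x", "repo.clean": False}

for probe, (expected, actual) in compare(accepted, later).items():
    print(f"watch: {probe} changed from {expected!r} to {actual!r}")
```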