Testing
How tests are structured and what to write
Minara Agent uses vitest for all testing. Tests live under
tests/ with a three-tier split and shared fakes for the
interesting dependencies. This page explains the conventions so
new tests fit in.
Tiers
tests/
├── unit/         # pure logic, no IO, < 50ms each
├── integration/  # real SQLite, real tool dispatch, fake LLM/network
├── e2e/          # spin up the real gateway, hit it with HTTP, real DB
├── fakes/        # shared test doubles (LLM, Minara client, filesystem)
└── setup.ts      # vitest global setup (env vars, temp dirs)

Run them:
npm test                  # everything
npm run test:unit         # unit only, fast feedback loop
npm run test:e2e          # slowest, last
npm run test -- --watch   # watch mode on a subset

Unit tests are the default. If you're adding a new pure function, a
new reducer, or a new router scoring rule, it belongs in
tests/unit/. Integration tests exist for anything that touches the
SQLite schema, the tool registry dispatch, or the agent loop end to
end. E2E tests exercise the HTTP gateway as a black box.
Fakes over mocks
tests/fakes/ holds hand-written doubles rather than jest-style
mocks. The two that matter most:
FakeLLMClient
Implements LLMClient with a scripted response queue:
const llm = new FakeLLMClient();
// First turn: the model asks for a tool call.
llm.enqueue({
  text: "I'll check the price.",
  tool_calls: [{ name: "price", args: { symbol: "BTC" } }],
  stop_reason: "tool_use",
});
// Second turn: the model answers and ends the turn.
llm.enqueue({
  text: "BTC is $65,000.",
  stop_reason: "end_turn",
});
const result = await agentLoop.run({ ... , llm });
expect(llm.callCount).toBe(2);
expect(result.finalText).toContain("65,000");

No real HTTP traffic, no token counting, no cache semantics. Tests
that specifically cover cache behavior use RecordingLLMClient
instead, which asserts on the cache_control markers in the system
prompt blocks without contacting an upstream.
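A scripted-queue fake of this kind can be very small. The following is a minimal sketch, not the actual code in tests/fakes/: the LLMClient interface, its complete method, and the LLMResponse shape are assumptions inferred from the usage above.

```typescript
// Hypothetical shapes, inferred from the usage above; the real
// LLMClient interface may differ.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface LLMResponse {
  text: string;
  tool_calls?: ToolCall[];
  stop_reason: "tool_use" | "end_turn";
}

interface LLMClient {
  complete(prompt: string): Promise<LLMResponse>;
}

class FakeLLMClient implements LLMClient {
  private queue: LLMResponse[] = [];
  callCount = 0;

  // Script the next response the fake will hand back.
  enqueue(response: LLMResponse): void {
    this.queue.push(response);
  }

  // Return responses in FIFO order; fail loudly if the script runs dry.
  async complete(_prompt: string): Promise<LLMResponse> {
    this.callCount += 1;
    const next = this.queue.shift();
    if (next === undefined) {
      throw new Error(`FakeLLMClient: no scripted response for call #${this.callCount}`);
    }
    return next;
  }
}
```

Throwing on an empty queue is a deliberate choice in this sketch: a test that makes more LLM calls than it scripted fails immediately instead of receiving a silent default.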
FakeMinaraClient
Returns canned responses for the price / balance / swap / perps endpoints. Its state is mutable, so integration tests can exercise round-trip flows:
const minara = new FakeMinaraClient();
minara.setPrice("BTC", 65000);
minara.setBalance("BTC", 0.5);
// run a tool that calls minara.swap(...)
minara.assertSwapHappened({ from: "USDC", to: "BTC", amount_usd: 1000 });

tempDataDir
tests/setup.ts gives every test file a fresh $dataDir under
os.tmpdir() and cleans it up on exit. Integration tests can open
a real SQLite file without polluting the repo or each other.
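A per-file temp directory like this can be built from Node's standard library alone. The sketch below shows the idea and is not the contents of tests/setup.ts: the helper name makeTempDataDir and the minara-test- prefix are assumptions.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper; the name and the directory prefix are
// assumptions, not the actual code in tests/setup.ts.
function makeTempDataDir(prefix = "minara-test-"): { dir: string; cleanup: () => void } {
  // mkdtempSync appends six random characters to the prefix, so
  // test files running in parallel get distinct directories.
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), prefix));
  const cleanup = (): void => {
    // force: true makes cleanup a no-op if the directory is already gone.
    fs.rmSync(dir, { recursive: true, force: true });
  };
  return { dir, cleanup };
}
```

In a vitest setup file, a helper like this would typically be called from beforeAll, with cleanup registered in afterAll, so each test file creates its directory on load and removes it when the file finishes.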
What to test
Accumulated conventions from the existing suite:
- Every BeforeToolCallHook gets its own unit test. Hooks are small, pure, and easy to test in isolation. The whole safety model rides on them. Write at least:
  - happy path (allowed call passes through)
  - block path (expected reason string)
  - context propagation (hook reads the right ToolCallContext fields)
- Every new tool factory gets a smoke test. At minimum: given required env vars, the factory returns a non-empty tool list; given missing env vars, it returns [] without throwing.
- Every new skill gets a registration test. Verify the skill registers, verify its requires_env actually gates it, verify the router scores it for an obvious query. Keep it one test file per skill.
- Router changes get table-driven tests. The scoring rules are small and deterministic. A single describe.each table is usually the right shape.
- Agent loop changes get integration tests. The loop has a lot of state. Test it with a real registry and a scripted fake LLM rather than a unit test over internal helpers.
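The three hook cases listed above fit naturally into one table. The sketch below is illustrative only: the ToolCallContext fields, the HookResult shape, and the blockLargeSwaps hook are all hypothetical, chosen to show the pattern rather than to mirror the real types.

```typescript
// Hypothetical types; the real ToolCallContext and hook signature
// may look different.
interface ToolCallContext {
  toolName: string;
  args: Record<string, unknown>;
  userId: string;
}

type HookResult = { allowed: true } | { allowed: false; reason: string };
type BeforeToolCallHook = (ctx: ToolCallContext) => HookResult;

// Example hook (an assumption for illustration): block swaps above
// a USD limit and let every other call through.
const blockLargeSwaps = (limitUsd: number): BeforeToolCallHook => (ctx) => {
  if (ctx.toolName !== "swap") return { allowed: true };
  const amount = Number(ctx.args["amount_usd"] ?? 0);
  return amount > limitUsd
    ? { allowed: false, reason: `swap of $${amount} exceeds limit of $${limitUsd}` }
    : { allowed: true };
};

// One row per case: happy path, block path, context propagation.
const cases: Array<{ name: string; ctx: ToolCallContext; expected: HookResult }> = [
  {
    name: "happy path: small swap passes through",
    ctx: { toolName: "swap", args: { amount_usd: 100 }, userId: "u1" },
    expected: { allowed: true },
  },
  {
    name: "block path: large swap is rejected with a reason",
    ctx: { toolName: "swap", args: { amount_usd: 5000 }, userId: "u1" },
    expected: { allowed: false, reason: "swap of $5000 exceeds limit of $1000" },
  },
  {
    name: "context propagation: toolName decides whether the limit applies",
    ctx: { toolName: "price", args: { amount_usd: 5000 }, userId: "u1" },
    expected: { allowed: true },
  },
];

// Run the hook once per row and keep actual next to expected.
const hook = blockLargeSwaps(1000);
const results = cases.map((c) => ({ name: c.name, actual: hook(c.ctx), expected: c.expected }));
```

In a vitest file, the same rows can be passed to describe.each or it.each, with one expect(hook(ctx)).toEqual(expected) per row. Asserting on the full reason string keeps the block path honest: a hook that blocks for the wrong reason fails the test.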
What NOT to test
- Don't test Anthropic / OpenAI round trips in CI. The wire layer has dedicated tests that hit a recorded fixture; the rest of the suite uses FakeLLMClient.
- Don't test tool schemas for "the right shape." The schema is the source of truth; a test that duplicates it is a maintenance tax that fires on every legitimate change.
- Don't use vi.mock() on internal modules. If a module is hard to test without mocking, it's probably badly factored. Fix the factoring instead.
- Don't hit real external APIs (Minara backend, Glassnode, OpenRouter, …). Every one of those has a corresponding fake. New providers must ship a fake alongside the production adapter.
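The last two rules work together: when a module receives its provider as a parameter, a test can pass in a fake and never needs vi.mock(). A minimal sketch follows, assuming a hypothetical PriceProvider interface; none of these names come from the codebase.

```typescript
// Hypothetical provider interface; names are illustrative only.
interface PriceProvider {
  getPrice(symbol: string): Promise<number>;
}

// The fake ships next to the production adapter and implements the
// same interface, with mutable state that tests can set up directly.
class FakePriceProvider implements PriceProvider {
  private prices = new Map<string, number>();
  readonly requested: string[] = [];

  setPrice(symbol: string, price: number): void {
    this.prices.set(symbol, price);
  }

  // Record every lookup so tests can assert on what was requested.
  async getPrice(symbol: string): Promise<number> {
    this.requested.push(symbol);
    const price = this.prices.get(symbol);
    if (price === undefined) {
      throw new Error(`FakePriceProvider: no price set for ${symbol}`);
    }
    return price;
  }
}

// Code under test takes the provider as an argument instead of
// importing a concrete client, so no module mocking is needed.
async function formatPrice(provider: PriceProvider, symbol: string): Promise<string> {
  const price = await provider.getPrice(symbol);
  return `${symbol} is $${price.toLocaleString("en-US")}.`;
}
```

Because the fake is an ordinary class behind the same interface, the compiler flags it whenever the interface changes, which a vi.mock() stub would not do.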
Coverage goals
Coverage numbers aren't a target, but the shape matters:
- Hooks and router scoring. 100%, period. These are where safety lives.
- Tool handlers. Enough to cover happy path plus one error path. The handlers are usually thin wrappers over typed clients, so exhaustive coverage has diminishing returns.
- Skill registration. One test per skill proving it loads and routes correctly.
- Agent loop. Integration tests for every branch in the turn machine (end_turn, tool_use, max_iterations, hook block, pending confirmation).
Debugging a failing test
- Run the single file: npx vitest run tests/unit/foo.test.ts
- Add test.only to isolate the one case.
- DEBUG=1 npx vitest run ... enables verbose logger output.
- For integration test failures, the test's tempDataDir is printed in the failure message. You can open the SQLite file post-mortem with sqlite3 /tmp/minara-test-xxx/minara.db.
Adding a new test tier
Don't. Three tiers is already more than most projects need. If you feel you need a fourth (contract tests? property tests?), that's a prompt to revisit whether the unit/integration line is drawn in the right place rather than a prompt to add directories.