MINARA

Learning System

How the agent gets better at recurring tasks over time

Minara Agent ships a learning loop that records successful tool sequences and surfaces them as suggestions on future turns. Unlike model fine-tuning, this is all SQLite rows. No training job, no model update, no offline pipeline. This page explains the pieces and how they cooperate with the skill system.

What "learning" means here. In Minara, "learning" is not fine-tuning or weight updates. The model stays the same. What changes is what the agent retrieves from SQLite before each turn: a library of {tool_name, args} sequences that worked, free-text guidance notes, and structured methodologies. A new turn that looks like a past successful one gets the past sequence as a suggestion. That's it. The design trades sophistication for auditability: every "learned" behavior is a row you can read, edit, or delete, and no behavior persists that the operator can't inspect.

See this in use: Features → Self-Improving covers the user-facing surface — how you nudge the agent into saving a lesson and how those lessons show up in later sessions.

For the parallel decision-reflection loop (two-stage LLM classification of whether a specific skill call was right or wrong, scoped per role), see Role Memory. That system runs alongside this one and answers a different question: not "how did I succeed" but "was this specific decision correct, and why".

What gets learned

Three distinct artifacts live in the learnings table (and its sister tables under apps/agent/src/learning/):

  1. Tool sequences. An ordered list of {tool_name, args} pairs the agent successfully ran to accomplish a task. Recorded when the agent explicitly calls skill_learn at the end of a turn.
  2. Guidance notes. Short free-text blurbs like "for Polymarket prices, use web_extract on the specific market URL. The API rate-limits at 10 rpm." These ride alongside tool sequences.
  3. Methodologies. Structured multi-step plans with success criteria, stored by learning/structured-methodology.ts. Used for deep-research workflows where a plain tool sequence isn't expressive enough.

The feedback loop

[Figure: learning-system diagram]

review-engine

learning/review-engine.ts is a lightweight LLM pass that runs at the end of a turn (via review-engine-hook.ts installed in app.ts). It:

  1. Inspects the turn's tool call sequence.
  2. Filters out turns with fewer than MINARA_LEARNING_MIN_CALLS tool calls (default 3) or with obvious failures.
  3. Calls a cheap model (the fast tier) with a structured prompt asking "was this task completed? how novel? how reusable?"
  4. Emits a ReviewResult with {score, summary, suggested_trigger, suggested_tool_sequence}.

If the score meets MINARA_LEARNING_SCORE_THRESHOLD (0–10, default 7), the result is handed to the skill manager.
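
The two gates around the LLM pass can be sketched as follows. This is a minimal illustration, not the code in review-engine.ts: the ToolCall shape, the ok flag used to detect "obvious failures", and the function names are assumptions. Only the ReviewResult fields and the default values (3 calls, score 7) come from this page.

```typescript
// Hypothetical shapes for illustration; only the ReviewResult field names
// are taken from the documentation above.
interface ToolCall {
  tool_name: string;
  args: Record<string, unknown>;
  ok: boolean; // assumed flag: did the call succeed?
}

interface ReviewResult {
  score: number; // 0-10
  summary: string;
  suggested_trigger: string;
  suggested_tool_sequence: ToolCall[];
}

// Pre-filter (step 2): skip short or failed turns before spending an LLM call.
function isReviewable(calls: ToolCall[], minCalls = 3): boolean {
  return calls.length >= minCalls && calls.every((c) => c.ok);
}

// Post-filter: only results at or above the threshold reach the skill manager.
// Treating the threshold as inclusive is an assumption.
function passesThreshold(result: ReviewResult, threshold = 7): boolean {
  return result.score >= threshold;
}
```

Keeping the pre-filter free of LLM calls means a turn that cannot qualify never touches the learning budget.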

skill-manager

learning/skill-manager.ts owns the learnings table. On a qualifying review it writes:

{
  id: uuid,
  name: "hyperliquid_open_long_with_tp_sl",
  trigger: "open long on hyperliquid with tp/sl",
  tool_sequence: [...],
  guidance: "always set TP before SL; Hyperliquid's 'reduce_only' flag...",
  created_at,
  success_count: 1,
  failure_count: 0,
  last_used_at: null,
}

It also deduplicates: if a near-identical trigger already exists (measured with the text-similarity helpers: TF-IDF cosine in learning/tfidf.ts and Jaccard 4-gram in learning/similarity.ts), the new observation updates the existing row's counters rather than creating a duplicate.

evaluation-loop

learning/evaluation-loop.ts runs at the start of every turn. It:

  1. Builds a TF-IDF query from the user message plus router context.
  2. Scores every learning against the query.
  3. Returns the top K matches (default 3).
  4. Hands them to the prompt builder, which appends them to the system prompt as a <learnings> block with the trigger, tool sequence summary, and guidance text.

The LLM is free to adopt or ignore the suggestion. Either choice updates the learning's counters. Adoption that led to a successful turn increments success_count; ignored learnings slowly decay.
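
The retrieval step (build a TF-IDF query, score every learning, keep the top K) can be sketched as below. This is an illustrative implementation under stated assumptions, not the contents of learning/tfidf.ts: the tokenizer, the smoothed IDF formula, and the names Learning and topKLearnings are all hypothetical, and the real query also folds in router context.

```typescript
// Hypothetical minimal shape; the real rows carry more fields.
interface Learning {
  name: string;
  trigger: string;
}

const tokenize = (text: string): string[] =>
  text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);

function topKLearnings(query: string, learnings: Learning[], k = 3): Learning[] {
  const docs = learnings.map((l) => ({ learning: l, tokens: tokenize(l.trigger) }));

  // Document frequency: in how many triggers does each term appear?
  const df = new Map<string, number>();
  for (const doc of docs) {
    new Set(doc.tokens).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  }
  // Smoothed IDF (an assumed variant); rarer terms weigh more.
  const idf = (term: string): number =>
    Math.log((1 + docs.length) / (1 + (df.get(term) ?? 0))) + 1;

  // Each occurrence adds the term's IDF, so the weight is tf * idf.
  const vectorize = (tokens: string[]): Map<string, number> => {
    const v = new Map<string, number>();
    for (const t of tokens) v.set(t, (v.get(t) ?? 0) + idf(t));
    return v;
  };

  const cosine = (a: Map<string, number>, b: Map<string, number>): number => {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    a.forEach((w, t) => {
      normA += w * w;
      dot += w * (b.get(t) ?? 0);
    });
    b.forEach((w) => {
      normB += w * w;
    });
    return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
  };

  const q = vectorize(tokenize(query));
  return docs
    .map((d) => ({ learning: d.learning, score: cosine(q, vectorize(d.tokens)) }))
    .filter((x) => x.score > 0) // never surface a learning with zero overlap
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.learning);
}
```

Because the corpus is a set of SQLite rows rather than an embedding index, a full scan per turn stays cheap at the sizes a single agent accumulates.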

Methodologies: structured plans

Deep-research turns produce a different artifact: a methodology. Where a tool sequence is a flat list, a methodology is a tree of phases with success criteria:

{
  id, name,
  phases: [
    {name: "Gather", criteria: [...], tools_used: [...]},
    {name: "Synthesize", criteria: [...], depends_on: ["Gather"]},
    {name: "Verify", criteria: [...], depends_on: ["Synthesize"]},
  ],
  asset_class: "crypto_alt",
  ...
}

The store is learning/methodology-store.ts and the deep-research skill reads from it to seed multi-phase research plans. Methodologies are a finer-grained learning artifact for tasks where "what tools to call" is less interesting than "what intermediate evidence to gather."
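
Because phases reference each other through depends_on, a consumer has to resolve them into an execution order. The sketch below shows one way to do that with a depth-first topological sort. The Phase type mirrors the record above, but orderPhases and its error behaviour are assumptions for illustration, not the deep-research skill's actual planner.

```typescript
// Mirrors the phase record shown above; optional fields are assumptions.
interface Phase {
  name: string;
  criteria: string[];
  tools_used?: string[];
  depends_on?: string[];
}

// Returns phase names so that every phase comes after its dependencies.
function orderPhases(phases: Phase[]): string[] {
  const byName = new Map(phases.map((p) => [p.name, p] as const));
  const done = new Set<string>();
  const visiting = new Set<string>();
  const order: string[] = [];

  const visit = (name: string): void => {
    if (done.has(name)) return;
    if (visiting.has(name)) throw new Error(`dependency cycle at phase "${name}"`);
    const phase = byName.get(name);
    if (!phase) throw new Error(`unknown phase "${name}"`);
    visiting.add(name);
    for (const dep of phase.depends_on ?? []) visit(dep);
    visiting.delete(name);
    done.add(name);
    order.push(name);
  };

  for (const p of phases) visit(p.name);
  return order;
}
```

For the Gather / Synthesize / Verify example, the result is the same order regardless of how the phases are listed in storage.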

Why this differs from a vector memory

A simple vector memory would just store facts and recall them. The learning system stores procedures: "how to accomplish this class of task." It then surfaces them as executable suggestions. The distinction matters:

  • Vector memory answers "what do I know about BTC?"
  • Learning system answers "how do I usually handle a request to long BTC on Hyperliquid with TP/SL?"

The two are complementary, and the agent uses both. Memory lookups happen in the skill layer via memory_search; learning lookups happen in the agent loop before the first LLM call, as part of prompt assembly.

Safety properties

Learnings are suggestions. They are never mandates. Specifically:

  1. A learning can never bypass the permission tier hook. A suggested tool_sequence that contains a tier-4 tool is still blocked when the turn source doesn't permit it.
  2. A learning can never bypass the L3 risk gate. If the suggested sequence requires activating a requires_user_confirmation skill, the normal confirmation flow applies.
  3. A learning cannot store secrets. The args recorded in a tool sequence go through the same redactor as the audit log.
  4. Failed turns never become learnings. The review engine filters them out before the skill manager ever sees them.

Inspecting and curating

# Top learnings by net successes (success_count - failure_count)
sqlite3 $dataDir/minara.db \
  "SELECT name, success_count, failure_count
     FROM learnings
    ORDER BY success_count - failure_count DESC LIMIT 20;"

# Recently used
sqlite3 $dataDir/minara.db \
  "SELECT name, last_used_at FROM learnings
    WHERE last_used_at IS NOT NULL
    ORDER BY last_used_at DESC LIMIT 10;"

# Delete a bad learning
sqlite3 $dataDir/minara.db "DELETE FROM learnings WHERE id = '...'"

There is no "demote" operation. If a learning is misleading, delete it. The agent will re-derive it if it was genuinely useful.

Configuration

Relevant env vars (see env-vars):

  • MINARA_LEARNING_ENABLED is the master switch (default true).
  • MINARA_LEARNING_MIN_CALLS is the minimum tool calls per turn before a review is considered (default 3).
  • MINARA_LEARNING_SCORE_THRESHOLD is the review score required to write a learning (0–10, default 7).
  • MINARA_LEARNING_TOP_K is how many learnings to surface per turn (default 3).

Setting MINARA_LEARNING_ENABLED=false completely disables the loop: no writes, no suggestions, no review passes. The agent still works; it just doesn't get faster over time.
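
As a rough sketch of how these four variables and their defaults fit together, the loader below reads them from an environment-like object. The variable names and defaults come from the list above; the LearningConfig type, the loadLearningConfig name, and the parsing rules (only the literal "false" disables, unparseable numbers fall back to the default) are assumptions.

```typescript
interface LearningConfig {
  enabled: boolean;
  minCalls: number;
  scoreThreshold: number;
  topK: number;
}

// Takes the environment as a plain object so the sketch has no runtime dependency.
function loadLearningConfig(env: Record<string, string | undefined>): LearningConfig {
  const num = (key: string, fallback: number): number => {
    const raw = env[key];
    if (raw === undefined || raw.trim() === "") return fallback;
    const parsed = Number(raw);
    return Number.isFinite(parsed) ? parsed : fallback;
  };
  const enabledRaw = env["MINARA_LEARNING_ENABLED"];
  return {
    // Assumed rule: anything other than "false" leaves the loop on.
    enabled: enabledRaw === undefined ? true : enabledRaw.trim().toLowerCase() !== "false",
    minCalls: num("MINARA_LEARNING_MIN_CALLS", 3),
    scoreThreshold: num("MINARA_LEARNING_SCORE_THRESHOLD", 7),
    topK: num("MINARA_LEARNING_TOP_K", 3),
  };
}
```

In a Node process this would typically be called as loadLearningConfig(process.env).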

Budget tracking

Every LLM call the learning system makes goes through learning/budget-tracker.ts, which enforces hard caps per category and per window. This was added after a review warned that the two-stage judge pass plus post-hoc probes could easily multiply LLM costs tenfold if a bug or adversarial prompt produced runaway reflection. A hard budget is the circuit breaker.

Four categories, each with independent daily + monthly caps:

| Category | Purpose |
|---|---|
| learning | Review engine, methodology extraction, role reflection, skill learning |
| agent | The main agent-loop turns themselves |
| workflow | Workflow and autopilot turns |
| experiment | Offline experiments, backtests, A/B tests; never touched in prod |

State is persisted to the llm_usage SQLite table:

CREATE TABLE llm_usage (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  category TEXT NOT NULL,
  task TEXT NOT NULL,
  model TEXT NOT NULL,
  input_tokens INTEGER NOT NULL,
  output_tokens INTEGER NOT NULL,
  cost_usd REAL NOT NULL,
  date TEXT NOT NULL,
  ts TEXT NOT NULL
);

On every LLM call the tracker:

  1. Projects the estimated cost against the current daily and monthly totals for the category.
  2. If the projected total would exceed the hard cap, throws BudgetExceededError before the call fires.
  3. If the projected total crosses a soft threshold (below hard cap), emits a warn-level structured log but allows the call.
  4. After the call completes, records the real token counts and cost back into llm_usage.

Budgets survive restarts because the state lives in SQLite. There's no in-memory counter to reset.
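
The pre-call check (steps 1 to 3) can be sketched as a pure function over the current totals. BudgetExceededError is named on this page; the BudgetWindow shape, the preflight name, and the choice of strict "greater than" comparisons are assumptions. In the real tracker the spentUsd figures would come from llm_usage.

```typescript
class BudgetExceededError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "BudgetExceededError";
  }
}

// Hypothetical view of one category's spend in one window.
interface BudgetWindow {
  label: "daily" | "monthly";
  spentUsd: number; // total already recorded for this category and window
  softCapUsd: number;
  hardCapUsd: number;
}

// Throws before the call fires if any hard cap would be exceeded;
// otherwise reports whether a soft threshold was crossed.
function preflight(estimatedUsd: number, windows: BudgetWindow[]): "allow" | "warn" {
  let verdict: "allow" | "warn" = "allow";
  for (const w of windows) {
    const projected = w.spentUsd + estimatedUsd;
    if (projected > w.hardCapUsd) {
      throw new BudgetExceededError(
        `${w.label} hard cap exceeded: projected ${projected.toFixed(4)} > ${w.hardCapUsd}`
      );
    }
    if (projected > w.softCapUsd) verdict = "warn";
  }
  return verdict;
}
```

A caller would log at warn level when the result is "warn", make the LLM call, and then record the real token counts and cost.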

To inspect spend without running the agent:

sqlite3 $dataDir/minara.db \
  "SELECT category, SUM(cost_usd) FROM llm_usage
    WHERE date = date('now') GROUP BY 1;"

The /budget REPL command exposes the same view interactively.

Methodology store

learning/methodology-store.ts is the core of Phase 2: the agent learns which analysis methods produce profitable signals for which asset classes, and retrieves them for future similar analysis.

Each methodology is stored with:

| Field | Meaning |
|---|---|
| id | UUID |
| asset_class | One of the known asset classes (major_crypto, layer_1, defi_blue_chip, meme_coin, stock, …) |
| methodology | Free-form text description of the method |
| evidence | Supporting evidence text |
| confidence | Wilson lower-bound confidence score in [0, 1] |
| times_used | Total applications with a verified outcome, successful or not |
| times_correct | Applications that passed outcome verification |
| quarantine | 1 until the method passes N successful uses without tripping anomaly detection |
| dedup_key | Hash of structured fields for O(1) semantic dedup |
| structured_json | Normalized StructuredMethodology (see below) |

Quarantine and injection defense

New methodologies start quarantined with confidence: 0.1. They are not injected into prompts until they pass enough successful uses. Anomaly detection runs against the methodology text on every write via scanMethodologyForInjection to catch prompt-injection patterns before they land in the store.

Confidence bumping

Every time a methodology is applied and the outcome is verified:

  • Success: increment times_correct and times_used, recompute confidence as the Wilson lower-bound on the binomial (which penalizes small samples).
  • Failure: increment only times_used, recompute confidence. Methods that fail often see their confidence drop below the injection threshold.
  • Graduation: once confidence >= INJECTION_THRESHOLD and times_used >= MIN_USES, flip quarantine = 0. The method is now eligible for prompt injection.

Wilson lower-bound beats raw times_correct / times_used because it doesn't let a 1-in-1 lucky hit outweigh a 15-in-30 consistent winner.
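
The comparison above can be checked with a small sketch of the Wilson lower bound. The formula is the standard Wilson score interval; the z value of 1.96 (a 95% interval) and the function names are assumptions, since this page does not state which confidence level the store uses. INJECTION_THRESHOLD and MIN_USES are passed in as parameters because their values are not given here.

```typescript
// Lower bound of the Wilson score interval for `correct` successes in `used` trials.
function wilsonLowerBound(correct: number, used: number, z = 1.96): number {
  if (used === 0) return 0;
  const p = correct / used;
  const z2 = z * z;
  const centre = p + z2 / (2 * used);
  const margin = z * Math.sqrt((p * (1 - p)) / used + z2 / (4 * used * used));
  return (centre - margin) / (1 + z2 / used);
}

// Graduation rule as described above; thresholds are supplied by the caller.
function shouldGraduate(
  timesCorrect: number,
  timesUsed: number,
  injectionThreshold: number,
  minUses: number
): boolean {
  return wilsonLowerBound(timesCorrect, timesUsed) >= injectionThreshold && timesUsed >= minUses;
}

const luckyHit = wilsonLowerBound(1, 1); // about 0.21, despite a raw rate of 1.0
const steady = wilsonLowerBound(15, 30); // about 0.33, from a raw rate of 0.5
```

With z = 1.96, the single lucky hit scores lower than the 15-in-30 record even though its raw success rate is twice as high, which is the small-sample penalty the store relies on.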

Institution mode: the reflection ladder

Institution Mode is the heaviest writer into the methodology store. Each run convenes several LLM roles (analysts, a bull/bear debate, a risk committee, a portfolio manager) and records their decisions. Those decisions feed a delayed reflection loop that scores each call against what actually happened and graduates lessons back into the store described above.

Why decision roles use forced structured tool-use

Every decision-producing role emits a tool call that has to match a Zod schema (AnalystReportSchema, TraderProposalSchema, PortfolioDecisionSchema, and so on). Free-form prose alongside the call is discarded. Two reasons, both about the reflection loop:

  • Determinism. Reflection scoring compares the same fields across runs. A rescue-parse of free text drifts under model upgrades; a fixed schema does not.
  • Comparability. A Buy at confidence 0.71 today is only directly comparable to a Buy at 0.62 last week if the schema is constant.

When INSTITUTION_LEARNING_ENABLED=1, the capture hook writes institution_runs (run metadata) and institution_role_outputs (each role's structured output) after every run. The reflection ladder writes institution_reflections later, as each window comes due.

The ladder

The runner in learning/institution/reflect.ts revisits each run on a fixed schedule:

| Window | Trigger | Question |
|---|---|---|
| 1d | 24h after run | Was the trigger valid? |
| 7d | 7d after run | Did the base case play out? |
| 30d | 30d after run | Was the horizon estimate right? |
| 90d / 180d / 365d | longer | Was the thesis durable? |
| lazy | next time the ticker is queried | Reused as the run's Phase 0 retrospect |

Each standard window scores the run against realised price action (price-source.ts) and, for every methodology cited in the run's role outputs, feeds the result into methodologyStore.recordOutcome(). A methodology graduates at its Wilson confidence edge (>= 0.55 to surface), the same gate the confidence bumping path above applies, so a handful of runs cannot promote an unstable rule. Lazy and manual reflections are snapshots and do not feed the loop.
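
Working out which standard windows are due for a given run is simple date arithmetic. The sketch below illustrates it; the window labels follow the ladder above, while dueWindows and the idea of passing in the already-written reflections are assumptions rather than the actual interface of reflect.ts. The lazy window is omitted because it is triggered by a query, not by elapsed time.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Standard (time-triggered) windows from the ladder.
const STANDARD_WINDOWS = [
  { label: "1d", days: 1 },
  { label: "7d", days: 7 },
  { label: "30d", days: 30 },
  { label: "90d", days: 90 },
  { label: "180d", days: 180 },
  { label: "365d", days: 365 },
] as const;

// Returns the labels of windows that have come due and have not been reflected yet.
function dueWindows(runAt: Date, now: Date, alreadyReflected: string[] = []): string[] {
  const elapsedMs = now.getTime() - runAt.getTime();
  const done = new Set<string>(alreadyReflected);
  return STANDARD_WINDOWS
    .filter((w) => elapsedMs >= w.days * DAY_MS && !done.has(w.label))
    .map((w) => w.label);
}
```

Tracking completed windows per run keeps the runner idempotent: a restart re-derives the same due list instead of double-counting an outcome.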

Structured methodology dedup

Free-form text is hard to dedup. "Buy BTC on RSI dip" and "Enter long when RSI oversold" are the same idea but share almost no tokens. Worse, Jaccard similarity can merge "Buy BTC at support" with "Sell BTC at support" (nearly identical tokens, opposite actions).

learning/structured-methodology.ts solves this by forcing the judge LLM to output normalized fields with finite vocabularies:

| Field | Allowed values |
|---|---|
| direction | bullish / bearish / neutral |
| primary_signal | momentum / mean_reversion / technical / fundamental / on_chain / sentiment / macro / event |
| timeframe | intraday / short / medium / long |
| indicators | Array of known indicators (rsi, macd, funding_rate, …) |

Dedup uses a hash of the structured fields (direction + primary_signal + timeframe + sorted indicators + asset_class). Two methodologies with the same hash are considered duplicates and the store increments the existing row's counters rather than inserting a new row.

Free-text descriptions are still stored for human readability and for injection into prompts. The structured fields are purely a dedup key.
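
A dedup key over the structured fields might be computed as in the sketch below. The field list and the sorting of indicators come from the description above; the use of SHA-256, the "|" separator, and the dedupKey name are assumptions, and the indicator vocabulary is left as a plain string array.

```typescript
import { createHash } from "node:crypto";

// Subset of StructuredMethodology relevant to dedup; exact types are assumed.
interface StructuredFields {
  direction: "bullish" | "bearish" | "neutral";
  primary_signal: string;
  timeframe: "intraday" | "short" | "medium" | "long";
  indicators: string[];
  asset_class: string;
}

// Canonicalize, then hash. Sorting indicators makes the key order-insensitive.
function dedupKey(m: StructuredFields): string {
  const canonical = [
    m.direction,
    m.primary_signal,
    m.timeframe,
    [...m.indicators].sort().join(","),
    m.asset_class,
  ].join("|");
  return createHash("sha256").update(canonical).digest("hex");
}
```

Because direction is part of the key, a bullish and a bearish method on the same signal can never collide, which is exactly the inversion that token-overlap measures get wrong.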

Similarity: Jaccard (legacy) vs TF-IDF

Before structured dedup, the fallback was text similarity. Two implementations exist:

  • Jaccard 4-gram (learning/similarity.ts) is the v1 legacy. Cheap to compute, language-agnostic, but brittle under paraphrasing and wrong on semantic inversions (the "Buy BTC / Sell BTC" trap).
  • TF-IDF cosine (learning/tfidf.ts) is the preferred replacement. Word-level, stop-word aware, still language-agnostic, better at handling paraphrase. findMostSimilarTfidf is the default path the methodology store takes.

The store falls back to Jaccard only if TF-IDF fails (rare: empty corpus, weird tokenization). Both are only used when the structured dedup hash misses, which means they see far fewer calls than they did in v1.

If you're writing a new learning artifact, use findMostSimilarTfidf directly. Do not invent a third similarity function.
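
To make the inversion trap concrete, here is a small character 4-gram Jaccard sketch. It is for illustration only and is not the code in learning/similarity.ts; in particular, treating "4-gram" as character-level and the whitespace normalization are assumptions.

```typescript
// Set of overlapping character n-grams after lowercasing and collapsing whitespace.
function charNgrams(text: string, n = 4): Set<string> {
  const s = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + n <= s.length; i++) grams.add(s.slice(i, i + n));
  return grams;
}

// Jaccard similarity: |intersection| / |union|.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((g) => {
    if (b.has(g)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

const inversion = jaccard(charNgrams("Buy BTC at support"), charNgrams("Sell BTC at support"));
const paraphrase = jaccard(charNgrams("Buy BTC on RSI dip"), charNgrams("Enter long when RSI oversold"));
```

Under these assumptions the two opposite instructions share 12 of 19 distinct 4-grams (about 0.63), while the two phrasings of the same idea share only a few, so a threshold tuned to catch paraphrases would merge the inverted pair first.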

Audit subsystem

The learning loop writes plenty of per-row forensic data (methodology_lifecycle_events, methodology_cases, methodology_cron_runs), but those tables answer one question at a time. The audit subsystem (learning/methodology-audit.ts) is the aggregate view: it reads the forensic rows, computes a 0-100 composite health score across six dimensions, and persists one row per pass to methodology_audit_reports with structured findings and operator-facing advisory actions.

The subsystem never mutates learning state. Its only write is the audit report row. The one coupling in the other direction is a heartbeat row that the learning cron itself writes and the audit reads (see Isolation invariants below).

Six scoring dimensions

Each dimension is a pure function in learning/methodology-audit-scoring.ts. Functions return { score: number | null, findings, advisory_actions }; a null score means "not enough data to score honestly" and gets dropped from the composite with weight redistribution.

| Dimension | Reads | What it measures |
|---|---|---|
| synthesis_quality | reflection_adjusted.reason_text parse | Penalises rapid oscillation between flag-side and recovery-side verdicts; stable runs (one-way flag or one-way recovery) score full marks. Surfaces market_stress_freeze and synthesis_auto_demote even on low-sample windows. |
| graduation_fp_rate | graduated → demoted/requantized_by_judge follow-up within 30 d | Wilson lower bound on the FP rate. Graduations younger than the post-graduation observation window that have not already reversed are excluded from both numerator and denominator, so a burst of fresh graduations cannot inflate the score. |
| attribution_integrity | methodology_cases.outcome_state in window | (0.7 × resolve_rate + 0.3 × (1 − backlog_share)) × 100. Returns null when there are no closed cases AND no 14-day-old pending backlog (healthy fresh install, nothing to score yet). No attribution_model drift detection: the recorded model is an intentional snapshot. |
| coverage_health | methodologies (graduated AND times_used ≥ 10) | Top-level group coverage per CLAUDE.md §13 Asset Class Standard (crypto / stock / index / commodity / forex). Full marks when all five groups have ≥ 3 active methodologies; deductions scale linearly below the 3-group minimum. |
| quarantine_churn | methodology_lifecycle_events state-change kinds | Excludes reflection_adjusted (fires up to 4×/day legitimately at 6 h synthesis cadence). High churn = ≥ 3 state-changing events on one methodology in window. |
| cron_health | methodology_cron_runs heartbeat + pending backlog | Lag vs. 2× expected interval + backlog penalty. Lag > 7 d scores 0 (loop appears dead). Empty heartbeat + non-zero pending backlog also scores 0 (loop demonstrably not draining work). |

Composite + bands

Default weights and bands:

composite = 0.22·synthesis_quality + 0.22·graduation_fp_rate + 0.22·attribution_integrity
          + 0.14·coverage_health   + 0.08·quarantine_churn   + 0.12·cron_health

band: ≥ 80 healthy · 60-79 watch · 40-59 degraded · < 40 alarm · disabled (off switch)

null dimensions are dropped and the remaining weights renormalise to sum to 1. The persisted report records dimension_weights_used so an operator reading the JSON sees exactly which dimensions contributed.
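
The renormalisation can be sketched as follows. The weights and band boundaries are the defaults listed above; the function names, the return shape, and the handling of an all-null report (composite null) are assumptions, and the disabled band is left out because it reflects the off switch rather than a score.

```typescript
const DIMENSIONS = [
  "synthesis_quality",
  "graduation_fp_rate",
  "attribution_integrity",
  "coverage_health",
  "quarantine_churn",
  "cron_health",
] as const;
type Dimension = (typeof DIMENSIONS)[number];

const DEFAULT_WEIGHTS: Record<Dimension, number> = {
  synthesis_quality: 0.22,
  graduation_fp_rate: 0.22,
  attribution_integrity: 0.22,
  coverage_health: 0.14,
  quarantine_churn: 0.08,
  cron_health: 0.12,
};

// Drops null dimensions and rescales the remaining weights to sum to 1.
function compositeScore(scores: Record<Dimension, number | null>): {
  composite: number | null;
  weightsUsed: Partial<Record<Dimension, number>>;
} {
  const present = DIMENSIONS.filter((d) => scores[d] !== null);
  const weightSum = present.reduce((sum, d) => sum + DEFAULT_WEIGHTS[d], 0);
  const weightsUsed: Partial<Record<Dimension, number>> = {};
  if (weightSum === 0) return { composite: null, weightsUsed };
  let composite = 0;
  for (const d of present) {
    const w = DEFAULT_WEIGHTS[d] / weightSum;
    weightsUsed[d] = w;
    composite += w * (scores[d] as number);
  }
  return { composite, weightsUsed };
}

// Maps a composite to its band; fractional scores fall into the band below the cut-off.
function band(composite: number): "healthy" | "watch" | "degraded" | "alarm" {
  if (composite >= 80) return "healthy";
  if (composite >= 60) return "watch";
  if (composite >= 40) return "degraded";
  return "alarm";
}
```

For example, if attribution_integrity is null, the other five weights are divided by 0.78, so synthesis_quality contributes about 0.28 instead of 0.22.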

Isolation invariants

The audit reads from the four learning tables (methodologies, methodology_lifecycle_events, methodology_cases, methodology_case_hints) and writes to none of them. Three guarantees stack:

  1. Cooperative idle scheduling. The audit cron (learning/methodology-audit-cron.ts) shares a BusyTracker (core/busy-tracker.ts) with AgentLoop.run(). Every tick checks inFlight > 0 (skip) and idle-since (defer). The orchestrator yields between every SQL stage via yieldIfBusy, pausing the pass when a user turn arrives mid-flight. A starvation guard forces the pass after N consecutive deferred ticks so a permanently busy agent does not lose audit coverage.
  2. Pure-function scoring boundary. The dimension scorers in methodology-audit-scoring.ts accept plain arrays of Methodology / MethodologyLifecycleEvent / MethodologyCase shapes. They never see the MemoryStore handle, so they cannot .prepare(...).run(...) even by accident.
  3. End-to-end table-hash invariant. tests/e2e/methodology-audit.test.ts SHA-256-hashes each learning table before and after every audit pass and asserts byte-level equality. The check goes beyond row-count parity: an UPDATE that touches updated_at would pass a row-count check but fail the hash.
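
The difference between a row-count check and the table-hash check in item 3 can be illustrated with a short sketch. It hashes in-memory row objects rather than a live SQLite table, and the Row type, the hashTable name, and the serialization format are assumptions; the actual test may hash differently. Rows are assumed to arrive in a stable order (for example, sorted by primary key).

```typescript
import { createHash } from "node:crypto";

type Row = Record<string, string | number | null>;

// SHA-256 over every row's columns, with keys sorted so column order cannot matter.
function hashTable(rows: Row[]): string {
  const hash = createHash("sha256");
  for (const row of rows) {
    const keys = Object.keys(row).sort();
    hash.update(JSON.stringify(keys.map((k) => [k, row[k]])));
    hash.update("\n");
  }
  return hash.digest("hex");
}

const before: Row[] = [{ id: "m1", confidence: 0.1, updated_at: "2024-01-01T00:00:00Z" }];
// Same row count, but a stray UPDATE touched updated_at.
const after: Row[] = [{ id: "m1", confidence: 0.1, updated_at: "2024-01-02T00:00:00Z" }];

const sameCount = before.length === after.length; // true: row-count parity holds
const sameHash = hashTable(before) === hashTable(after); // false: the hash catches the write
```

An assertion on sameHash fails for any mutation at all, which is the property the audit needs: it must be a pure reader of the learning tables.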

The only reverse touch point is the heartbeat row written to methodology_cron_runs at the tail of every learning-cron tick. Both the in-process scheduler (learning/methodology-cron.ts) AND the CLI cron path (gateway/learning-cli.ts runFullCronCli) write this row, so the audit's cron_health dimension works under the documented system-cron deployment.

Operating the audit

The cron is opt-in. Defaults are tuned for daily passive monitoring; set METHODOLOGY_AUDIT_CRON_ENABLED=1 once the learning loop has accumulated enough data to score (typically a week or two). The full env reference is at Environment variables → Methodology audit subsystem.

Four CLI commands cover the operator workflow:

minara learning audit run [--window-days N]              # one inline pass
minara learning audit show [--latest|--pass <id>]        # inspect a report
minara learning audit trend [--days N]                   # composite history + sparkline
minara learning audit findings [--severity high|medium|low]  # drill into findings

See CLI subcommands → audit for the full surface.

Deferred: active probes

An earlier design proposed a seventh dimension that randomly re-runs the agent on historical "confirmed-bad" trading cases to test whether the learning loop now produces a different decision. That work is deferred. An honest replay needs a frozen snapshot of the historical decision context (prices, news, sentiment, active methodologies, tool outputs) so the new run sees the same information the original did. Without that snapshot, asking the agent "should you buy ETH now?" measures current judgment, not whether learning corrected the past mistake. Two prerequisites gate the work: (a) a snapshot table populated by case-recorder at hint/case time, and (b) a separate ProbeAgentLoop that does not route through createApp() so the replay shares zero state (skill session, tool registry, hooks) with the live agent.
