
Lesson 7: Goals and Tests

Your email triage is deployed and running. It classifies 150 emails a day. But how do you know it is working? What if the AI starts routing newsletters as urgent? What if a new email format breaks the confidence score?

You need two things: a definition of success, and tests that verify it.

The achieves Section

Add a goal to your machine:

machine email_triage
  achieves
    goal "Route emails to the correct destination based on priority and confidence"
    succeeds when "urgent emails reach the team within 2 seconds"
    succeeds when "low-confidence emails go to human review"
    never "route an uncertain email without human review"
    never "expose email body content in Teams channel names"
    for example
      given {subject: "URGENT: Server down", sender: "ops@company.com", body: "Production is offline."}
      expect {routed_to: "notify_team"}
    for example
      given {subject: "Weekly deals!", sender: "noreply@store.com", body: "50% off everything."}
      expect {routed_to: "archive"}

achieves declares what success looks like. It has four parts:

  • goal: A plain-language description of the machine’s purpose. Koda uses this to explain the machine. Other machines can read it.
  • succeeds when: Conditions that define correct behavior. These are not code; they are human-readable statements that guide the AI and document intent.
  • never: Hard constraints. Things the machine must not do, regardless of input. These feed into governance checks.
  • for example: Concrete input/output pairs. These are both documentation and executable tests.
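The examples-as-tests idea is easy to see outside mashin. Here is a minimal Python sketch of it; `route_email` is a hypothetical stand-in for the deployed machine, not part of mashin itself:

```python
# Illustrative sketch (not mashin): "for example" pairs are plain
# input/output fixtures that a test runner can replay as assertions.
EXAMPLES = [
    ({"subject": "URGENT: Server down", "sender": "ops@company.com",
      "body": "Production is offline."},
     {"routed_to": "notify_team"}),
    ({"subject": "Weekly deals!", "sender": "noreply@store.com",
      "body": "50% off everything."},
     {"routed_to": "archive"}),
]

def run_examples(route_email):
    for given, expected in EXAMPLES:
        actual = route_email(given)
        # "expect" checks only the declared keys, not the full output
        for key, value in expected.items():
            assert actual.get(key) == value, (given, key, actual)
```

The point is that the same data serves both readers and the test runner: documentation that goes stale fails loudly.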

The verifies Section

Examples inside achieves serve double duty: they document intent and they run as tests. But for thorough testing, use the verifies section:

verifies
  test "routes urgent emails to team"
    given {subject: "URGENT: Server down", sender: "ops@company.com", body: "Systems offline"}
    assuming classify {priority: "urgent", confidence: 0.95, action: "notify", reason: "Server outage"}
    expect {routed_to: "notify_team"}
  test "archives newsletters"
    given {subject: "Weekly deals!", sender: "noreply@store.com", body: "Sale this week"}
    assuming classify {priority: "ignore", confidence: 0.99, action: "archive", reason: "Marketing email"}
    expect {routed_to: "archive"}
  test "flags low confidence for human review"
    given {subject: "Question about project", sender: "unknown@domain.com", body: "Can we discuss?"}
    assuming classify {priority: "today", confidence: 0.55, action: "create_task", reason: "Unclear intent"}
    expect {routed_to: "human_review"}
  test "creates tasks for actionable non-urgent emails"
    given {subject: "Please review the Q3 report", sender: "manager@company.com", body: "Attached."}
    assuming classify {priority: "today", confidence: 0.88, action: "create_task", reason: "Review request"}
    expect {routed_to: "create_task"}

The assuming Keyword

The key difference from for example is assuming. It mocks the AI step.

assuming classify {priority: "urgent", confidence: 0.95, ...} tells the test runner: “When the classify step runs, return this instead of calling the AI.” This means:

  • Tests are deterministic. The AI does not run. The test always produces the same result.
  • Tests are free. No API calls, no token costs.
  • Tests are fast. Milliseconds, not seconds.
  • Tests verify your logic, not the AI’s judgment. You are testing routing, not classification.

To test the AI’s classification quality, use /evaluate (Lesson 9). The verifies section tests the machine’s logic.
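The mocking mechanics are easier to see in ordinary code. A Python sketch, assuming a `route` function that mirrors the decide route logic and takes the classifier as an injectable dependency (names here are illustrative, not mashin APIs):

```python
# What "assuming" does, conceptually: replace the AI step with a canned
# result so the routing logic runs deterministically, offline, and fast.
def route(email, classify):
    result = classify(email)                 # normally an AI call
    if result["confidence"] < 0.7:
        return {"routed_to": "human_review"}
    if result["priority"] == "urgent":
        return {"routed_to": "notify_team"}
    if result["priority"] == "today":
        return {"routed_to": "create_task"}
    return {"routed_to": "archive"}

# "assuming classify {...}" == injecting a stub instead of the model
stub = lambda _email: {"priority": "today", "confidence": 0.55,
                       "action": "create_task", "reason": "Unclear intent"}
assert route({"subject": "Question about project"}, stub) == \
       {"routed_to": "human_review"}         # threshold fires before priority
```

Because the classifier is an input rather than a hard-wired call, the test pins its output and exercises only the branch logic.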

Running Tests

/test email_triage

You see:

email_triage
  [pass] routes urgent emails to team (2ms)
  [pass] archives newsletters (1ms)
  [pass] flags low confidence for human review (1ms)
  [pass] creates tasks for actionable non-urgent emails (2ms)
4/4 passed. 0 failed. 6ms total.

If a test fails, you see exactly what happened:

[FAIL] flags low confidence for human review
  Expected: {routed_to: "human_review"}
  Got:      {routed_to: "create_task"}
  Step trace:
    [1] classify (mocked) -> {priority: "today", confidence: 0.55, ...}
    [2] route decide -> {routed_to: "create_task"}
  Issue: The confidence threshold check is not triggering.

The trace shows every step, including the mocked values, so you can see where the logic diverged from your expectation.

Tests Run Before Deployment

When you run /deploy email_triage, mashin runs all tests first. If any fail, deployment is blocked:

/deploy email_triage
Running tests...
[FAIL] flags low confidence for human review
Deployment blocked: 1 test failure.
Fix the failing test before deploying.

This is not optional. You cannot deploy a machine with failing tests. The governance pipeline enforces it.
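The gate itself is conceptually simple. A hedged Python sketch, with hypothetical function names standing in for mashin's internal pipeline:

```python
# Sketch of a test-gated deploy: run every test first; any failure
# blocks deployment entirely. Names here are illustrative only.
def deploy(machine, run_tests, do_deploy):
    # run_tests yields (test_name, passed) pairs
    failures = [name for name, passed in run_tests(machine) if not passed]
    if failures:
        return {"deployed": False, "blocked_by": failures}
    do_deploy(machine)
    return {"deployed": True, "blocked_by": []}
```

The design choice worth noting is that the gate is unconditional: there is no flag to skip tests, so a green suite is a precondition of production, not a convention.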

The Full Machine with Goals and Tests

Here is the complete email triage with everything from lessons 1 through 7:

machine email_triage
  achieves
    goal "Route emails to the correct destination based on priority and confidence"
    succeeds when "urgent emails reach the team within 2 seconds"
    succeeds when "low-confidence emails go to human review"
    never "route an uncertain email without human review"
    for example
      given {subject: "URGENT: Server down", sender: "ops@company.com", body: "Offline."}
      expect {routed_to: "notify_team"}
  accepts
    subject as text, is required
    sender as text, is required
    body as text
  responds with
    routed_to as text
    action_taken as text
  implements
    ask classify, using: "anthropic:claude-haiku-4"
      with task "Classify this email.\n\nFrom: ${input.sender}\nSubject: ${input.subject}\nBody: ${input.body}"
      returns
        priority as text, is required, choices: ["urgent", "today", "later", "ignore"]
        confidence as number, is required, range: [0.0, 1.0]
        reason as text
    decide route
      if classify.confidence < 0.7
        run flow(flag_for_human)
      else if classify.priority == "urgent"
        run flow(notify_team)
      else if classify.priority == "today"
        run flow(create_task)
      else
        {routed_to: "archive", action_taken: "none"}
  flows
    flow notify_team
      ask send_alert, from: "@mashin/actions/microsoft/teams/send_message"
        channel: "ops-alerts"
        message: "Urgent: " + input.subject + " from " + input.sender
      {routed_to: "notify_team", action_taken: "Teams alert sent"}
    flow create_task
      ask make_task, from: "@mashin/actions/microsoft/planner/create_task"
        title: input.subject
        description: classify.reason
      {routed_to: "create_task", action_taken: "Planner task created"}
    flow flag_for_human
      ask flag, from: "@mashin/actions/microsoft/teams/send_message"
        channel: "email-review"
        message: "Review needed: " + input.subject
      {routed_to: "human_review", action_taken: "Flagged for review"}
  verifies
    test "routes urgent emails to team"
      given {subject: "URGENT: Server down", sender: "ops@company.com", body: "Offline"}
      assuming classify {priority: "urgent", confidence: 0.95, reason: "Server outage"}
      expect {routed_to: "notify_team"}
    test "archives newsletters"
      given {subject: "Weekly deals!", sender: "noreply@store.com", body: "Sale"}
      assuming classify {priority: "ignore", confidence: 0.99, reason: "Marketing"}
      expect {routed_to: "archive"}
    test "flags low confidence for human review"
      given {subject: "Question", sender: "unknown@domain.com", body: "Can we discuss?"}
      assuming classify {priority: "today", confidence: 0.55, reason: "Unclear"}
      expect {routed_to: "human_review"}
    test "creates tasks for actionable emails"
      given {subject: "Review Q3 report", sender: "manager@company.com", body: "Attached"}
      assuming classify {priority: "today", confidence: 0.88, reason: "Review request"}
      expect {routed_to: "create_task"}

Notice the section ordering: achieves (what it does), accepts/responds with (its contract), implements and flows (how it works), verifies (proof it works). This is the canonical mashin ordering: declare the goal, expose the contract, implement the logic, then prove it works.

What Goals Enable

Goals are not just documentation. Other parts of the system use them:

  • Koda reads goals to explain machines: “This machine routes emails based on priority and confidence.”
  • /verify checks that declared constraints are structurally enforceable.
  • /improve uses goals as the optimization target: “Make this machine better at routing urgent emails.”
  • Evolution ledger records goals alongside version diffs, so you can trace why a machine changed.

Goals connect intent to implementation. Tests verify the connection holds.

What Comes Next

Your machine is deployed, tested, and goal-driven. Next lesson: memory. Your machine will learn from patterns and human corrections, so it gets better over time without you changing the code.

Next: Memory →