The Goal-Hierarchy Flow Model

A flow is not a flat graph of states. It is a three-tier goal hierarchy — and every level declares a goal and is instrumented with a metric for that goal.

What & why

A naive view treats a flow as a flat bag of states and scores it with one number. That number hides everything that matters: which phase failed, whether a phase's states cooperate, and whether a weakness is local to one flow or systemic across the whole library. The goal-hierarchy model fixes this by giving each structural level its own goal and its own measurement, authored top-down from flow creation rather than reverse-engineered later.

FLOW              goal:   complete the caller's task
                  metric: correctness · completion · humanness
  └─ GROUP        goal:   a TYPED objective (collect / confirm / act / terminal / handoff)
       │          metric: group_cohesion = goal_yield · (0.35·efficiency + 0.65·transition_coherence)
       └─ STATE   goal:   one bounded capability (≥1 slot)
                  metric: progress · stall · revisit · dwell · slot_fill

(The fourth tier, archetype, is a cross-flow rollup of groups — it is a reporting lens, not an authoring level.)

The three tiers and their goals

| Tier | Goal (what “good” means) | Declared by |
|---|---|---|
| State | one bounded capability / speech act; may collect or repair several slots | description, required_slots, speech_act |
| Group (segment) | a typed objective (not always a slot-contract) | GoalSegment: purpose, kind, target, exit_guard, repair_policy |
| Flow | complete the caller's task | flow-level completion_slots, final_status |

A state's goal is NOT “one slot.” That would resurrect the refuted one-slot decomposition (it re-asks for facts the caller already volunteered). The validated Collect Pattern uses multi-slot listen states that absorb volunteered facts. A state's goal is one bounded capability that may fill or repair several slots.

Group goals are TYPED

Not every group is a slot-contract. RIFF groups include collect, confirm, act, terminal, and handoff phases, each with its own typed objective and success metric. A slot-contract is only the collect kind.

| kind | objective | success metric |
|---|---|---|
| collect | fill the target slots | slot_yield |
| confirm | get the caller's confirmation | confirmation_success |
| act | a tool/program succeeds | tool_success |
| terminal | reach the intended terminal | success_terminal |
| handoff | safe transfer / escalation | safe_transfer |
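
As a rough illustration of how the typed kinds might be represented, the sketch below pairs each kind with its success metric from the table above. The names `SegmentKind` and `SUCCESS_METRIC`, and the field types on the `GoalSegment` dataclass, are assumptions made for this example; the shipped GoalSegment schema may look different.

```python
from dataclasses import dataclass, field
from enum import Enum


class SegmentKind(Enum):
    COLLECT = "collect"
    CONFIRM = "confirm"
    ACT = "act"
    TERMINAL = "terminal"
    HANDOFF = "handoff"


# Success metric per kind, copied from the table above.
SUCCESS_METRIC = {
    SegmentKind.COLLECT: "slot_yield",
    SegmentKind.CONFIRM: "confirmation_success",
    SegmentKind.ACT: "tool_success",
    SegmentKind.TERMINAL: "success_terminal",
    SegmentKind.HANDOFF: "safe_transfer",
}


@dataclass
class GoalSegment:
    # Field names follow the "Declared by" column; the types are guesses.
    purpose: str
    kind: SegmentKind
    target: list[str] = field(default_factory=list)
    exit_guard: str = ""
    repair_policy: str = ""

    def success_metric(self) -> str:
        """Return the name of the metric this segment is scored on."""
        return SUCCESS_METRIC[self.kind]


segment = GoalSegment(
    purpose="collect contact details",
    kind=SegmentKind.COLLECT,
    target=["name", "phone"],
    exit_guard="all_targets_valid",  # hypothetical guard name
)
print(segment.success_metric())  # slot_yield
```

Keeping the kind as an enum rather than a free string means a typo in a flow definition fails at load time instead of silently scoring the segment against the wrong metric.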

The GoalSegment — a group that acts toward its goal

A GoalSegment is a Harel-statechart compound state: a super-state owning a shared objective, with member states as capabilities rather than a hand-wired chain. Statecharts keep transition ownership in the FSM, so the LLM stays language-only. Members declare what they collect/repair and their preconditions/effects — not their next state. The FSM picks the next member deterministically (a GOAP-lite selector over the slot contract):

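def choose_next(segment, ctx):
    """Pick the segment's next member state; the LLM never chooses."""
    apply_slot_observations(ctx)  # extraction/validation only
    if exit_guard(segment, ctx):
        return segment.exit_target
    # Repairs take priority over collecting new slots.
    repair = first_repair_needed(segment, ctx)
    if repair:
        return repair.state
    slot = first_unsatisfied_slot(segment, ctx)
    if slot is None:
        return segment.exit_target
    # Only members whose preconditions hold can serve the slot.
    candidates = [s for s in states_that_collect(slot) if preconditions_hold(s, ctx)]
    if not candidates:
        return segment.fallback_state
    # Deterministic tie-break: the cost key never depends on LLM wording.
    return min(candidates, key=deterministic_cost_tuple)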

Any next_state the LLM suggests is ignored and logged. The segment owns the objective and recomputes after every caller turn, so members become tools under one contract — not a brittle ask_name → ask_phone → ask_address chain. The schema for this contract shipped this session; see GoalSegment schema. The selector itself is the last phase, behind a flag.
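
To make the selector concrete, here is a minimal runnable sketch of the same decision order. Everything in it is a simplifying assumption rather than the shipped design: the `MemberState`, `Segment`, and `Context` classes are hypothetical, the exit guard is reduced to "all target slots valid", a repair is served by the same member that collects the slot, and the cost tuple is just (declared order, name).

```python
from dataclasses import dataclass, field


@dataclass
class MemberState:
    name: str
    collects: tuple[str, ...]        # slots this state can fill or repair
    requires: tuple[str, ...] = ()   # slots that must already be valid
    order: int = 0                   # declared order, used as the tie-break


@dataclass
class Segment:
    target_slots: tuple[str, ...]
    members: list[MemberState]
    exit_target: str
    fallback_state: str


@dataclass
class Context:
    valid: set[str] = field(default_factory=set)         # validated slots
    needs_repair: set[str] = field(default_factory=set)  # heard but invalid


def choose_next(segment: Segment, ctx: Context) -> str:
    # Stand-in exit guard: leave once every target slot is valid.
    if all(s in ctx.valid for s in segment.target_slots):
        return segment.exit_target
    # Repairs first, then unsatisfied slots, both in declared slot order.
    pending = [s for s in segment.target_slots if s in ctx.needs_repair]
    pending += [s for s in segment.target_slots
                if s not in ctx.valid and s not in ctx.needs_repair]
    slot = pending[0]
    candidates = [m for m in segment.members
                  if slot in m.collects
                  and all(r in ctx.valid for r in m.requires)]
    if not candidates:
        return segment.fallback_state
    # Deterministic cost tuple: declared order, then name.
    return min(candidates, key=lambda m: (m.order, m.name)).name


segment = Segment(
    target_slots=("name", "phone", "address"),
    members=[
        MemberState("listen_contact", collects=("name", "phone"), order=0),
        MemberState("ask_phone", collects=("phone",), order=1),
        MemberState("ask_address", collects=("address",),
                    requires=("name",), order=2),
    ],
    exit_target="schedule",
    fallback_state="handoff_agent",
)

ctx = Context()
print(choose_next(segment, ctx))  # listen_contact
ctx.valid.update({"name", "phone"})
print(choose_next(segment, ctx))  # ask_address
ctx.valid.add("address")
print(choose_next(segment, ctx))  # schedule
```

Because the function is recomputed from the context after every turn, a caller who volunteers name and phone together skips `ask_phone` entirely, which a hand-wired chain cannot do.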

Failure defenses

| Failure | Defense |
|---|---|
| Oscillation | hysteresis: pursue the current slot until valid / max-attempts / blocked; tie-break by declared order, never by LLM wording |
| Premature exit | exit only when all required slots are valid (mentioned ≠ valid) and confirmation is satisfied |
| Deadlock | compile-time check: every required slot has ≥1 collecting state; ordering acyclic; runtime fallback if no candidate is actionable |
| Repair loops | per-slot + per-segment attempt caps → deterministic fallback (the shipped stalled guard + sets:) |
| Contradictory data | keep slot evidence records; latest explicit correction beats earlier implicit extraction |
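
The contradictory-data defense can be sketched as a small resolution rule over evidence records. The `SlotEvidence` record and the `resolve` function are hypothetical names, and the rule that any explicit record outranks any implicit one (with the latest turn winning within a rank) is one plausible reading of "latest explicit correction beats earlier implicit extraction", not a confirmed implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlotEvidence:
    slot: str
    value: str
    turn: int        # caller turn on which the value was observed
    explicit: bool   # True for an explicit statement or correction


def resolve(evidence: list[SlotEvidence], slot: str) -> Optional[str]:
    """Return the winning value for a slot, or None if nothing was heard."""
    records = [e for e in evidence if e.slot == slot]
    if not records:
        return None
    # Explicit outranks implicit; within the same rank, the latest turn wins.
    return max(records, key=lambda e: (e.explicit, e.turn)).value


log = [
    SlotEvidence("phone", "555-0100", turn=1, explicit=False),  # inferred
    SlotEvidence("phone", "555-0199", turn=3, explicit=True),   # correction
    SlotEvidence("name", "Dana", turn=1, explicit=True),
]
print(resolve(log, "phone"))    # 555-0199
print(resolve(log, "address"))  # None
```

Keeping every record instead of overwriting the slot value is what makes the rule auditable: the losing evidence is still available when a metric such as lost_slot needs to be explained.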

Metrics at every level

Principle: each level has a typed objective score plus diagnostic submetrics, not one universal scalar. All scores are timestamped, per-agent, and flow-versioned.

| Tier | Objective + submetrics | Storage |
|---|---|---|
| State | progress, stall, revisit, dwell_turns, slot_fill_rate, latency_p95 | state_scores |
| Group | goal_yield · efficiency · transition_coherence → group_cohesion | per-segment rollup of state_scores |
| Flow | correctness, completion, errors, latency, humanness | aspect_scores |

The group-cohesion metric

A naive product of the three factors collapses to zero whenever any one factor is zero, and it hides which part failed. The bounded, visible form:

goal_yield           = successful_eligible_visits / eligible_visits   # typed per kind
efficiency           = clamp(reference_turns / max(actual_turns, reference_turns, 1), 0, 1)
transition_coherence = clamp(1 − w1·redundant_ask − w2·avoidable_reentry − w3·lost_slot, 0, 1)

group_cohesion       = goal_yield · (0.35·efficiency + 0.65·transition_coherence)   # 0..1

Collaboration is emergent: a segment of locally-fine states can collectively thrash — each state “progresses,” yet the group re-asks across members and takes 8 turns for 3 slots. That is visible only at the group level. See Group cohesion for the as-shipped computation and a live table.
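
The formula can be worked through in a few lines of Python. This is a sketch under stated assumptions, not the as-shipped computation: the weights `w1`, `w2`, `w3` are not given in this document, so the values below are placeholders; the three penalty inputs are assumed to be rates in the range 0..1; and returning 0 when there are no eligible visits is a choice made here to avoid dividing by zero.

```python
def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def group_cohesion(successful_eligible_visits: int, eligible_visits: int,
                   reference_turns: int, actual_turns: int,
                   redundant_ask: float, avoidable_reentry: float,
                   lost_slot: float,
                   weights: tuple = (0.4, 0.3, 0.3)) -> dict:
    """Compute group_cohesion and return it with its visible parts."""
    w1, w2, w3 = weights  # placeholder weights, not the shipped values
    goal_yield = (successful_eligible_visits / eligible_visits
                  if eligible_visits else 0.0)
    efficiency = clamp(reference_turns / max(actual_turns, reference_turns, 1))
    transition_coherence = clamp(
        1 - w1 * redundant_ask - w2 * avoidable_reentry - w3 * lost_slot)
    cohesion = goal_yield * (0.35 * efficiency + 0.65 * transition_coherence)
    return {
        "goal_yield": goal_yield,
        "efficiency": efficiency,
        "transition_coherence": transition_coherence,
        "group_cohesion": cohesion,
    }


# The thrashing segment from the text: 3 slots filled in 8 turns.
# Assumed for illustration: reference_turns = 3, every visit eventually
# succeeds, and half of the asks are redundant.
result = group_cohesion(3, 3, reference_turns=3, actual_turns=8,
                        redundant_ask=0.5, avoidable_reentry=0.0,
                        lost_slot=0.0)
print(round(result["efficiency"], 3))      # 0.375
print(round(result["group_cohesion"], 3))  # 0.651
```

Returning the parts alongside the score keeps the failure visible: here the segment reaches its goal every time, and the low efficiency term is what pulls the score down.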

Top-down authoring — a new flow is born with goals

Flows are authored high-level → decomposed, so goals and groups exist from the first commit:

  1. INTENT — the flow's task in one line (“book a plumbing visit”).
  2. GROUPS — decompose into phases (greet → triage → collect → schedule → confirm → close); each is a GoalSegment stub with a purpose.
  3. CONTRACTS — for each group, declare target_slots + exit_guard + repair_policy.
  4. STATES — derive member states as capabilities.
  5. INSTRUMENT — metrics attach automatically: state metrics from events, group_cohesion from the contract, flow metrics from completion_slots. Zero extra authoring.

Build-time gates so the hierarchy can't rot

Every required target_slot has ≥1 collecting state (deadlock check); segment ordering is acyclic; every group has a non-empty purpose and exit_guard; and every flow's groups cover all completion_slots (no orphan goal). These gates run in the flow lint/load path and the pre-push hook, so a new flow can't ship a malformed contract.
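
The four gates could be expressed as a single lint pass along these lines. The dict layout of a flow (`segments`, `states`, `collects`, and an `after` list for ordering) and the function name `lint_flow` are assumptions invented for this sketch, and "cover" is read here as "some group lists the slot in its target_slots".

```python
from graphlib import CycleError, TopologicalSorter


def lint_flow(flow: dict) -> list[str]:
    """Return a list of contract violations; an empty list means pass."""
    errors = []
    segments = flow["segments"]
    covered = set()
    for name, seg in segments.items():
        # Gate: non-empty purpose and exit_guard.
        if not seg.get("purpose") or not seg.get("exit_guard"):
            errors.append(f"{name}: missing purpose or exit_guard")
        # Gate: every required target slot has at least one collecting state.
        collectable = {slot for state in seg.get("states", [])
                       for slot in state.get("collects", [])}
        for slot in seg.get("target_slots", []):
            if slot not in collectable:
                errors.append(f"{name}: no state collects '{slot}'")
        covered.update(seg.get("target_slots", []))
    # Gate: groups cover all completion slots (no orphan goal).
    for slot in flow.get("completion_slots", []):
        if slot not in covered:
            errors.append(f"flow: completion slot '{slot}' is orphaned")
    # Gate: segment ordering is acyclic ("after" lists predecessors).
    graph = {name: set(seg.get("after", [])) for name, seg in segments.items()}
    try:
        tuple(TopologicalSorter(graph).static_order())
    except CycleError:
        errors.append("flow: segment ordering contains a cycle")
    return errors


flow = {
    "completion_slots": ["name", "visit_time"],
    "segments": {
        "collect": {
            "purpose": "gather contact details",
            "exit_guard": "all_targets_valid",
            "target_slots": ["name"],
            "states": [{"name": "listen_contact", "collects": ["name"]}],
            "after": [],
        },
        "schedule": {
            "purpose": "pick a visit time",
            "exit_guard": "all_targets_valid",
            "target_slots": ["visit_time"],
            "states": [{"name": "offer_times", "collects": ["visit_time"]}],
            "after": ["collect"],
        },
    },
}
print(lint_flow(flow))  # []
```

Collecting all violations rather than stopping at the first one suits a pre-push hook, since the author sees every broken contract in a single run.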

Implementation phases (as ordered by Codex review)

  1. Explicit segment + flow-versioned group membership.
  2. Fix slot-evidence instrumentation (clean per-state slot_filled capture). Gating prerequisite for any group metric. (shipped)
  3. GoalSegment schema (typed kind + target + exit_guard + fallback + repair_policy). (shipped)
  4. Build-time gates integrated into lint/load + pre-push.
  5. Contract-backed group metrics. (cohesion proxy shipped)
  6. Top-down authoring assist (intent → scaffold).
  7. GOAP-lite selector — last, behind a flag, with replay / A-B evidence.

This session shipped phases 2 and 3 (per-state slot evidence at the commit choke point, and the GoalSegment schema), plus a working group_cohesion proxy computed from the reliable per-state signals. Source: docs/flow-eval/GOAL-HIERARCHY-DESIGN.md.