The Goal-Hierarchy Flow Model

A flow is not a flat graph of states. It is a three-tier goal hierarchy — and every level declares a goal and is instrumented with a metric for that goal.

What & why

A naive view treats a flow as a flat bag of states and scores it with one number. That number hides everything that matters: which phase failed, whether a phase's states cooperate, and whether a weakness is local to one flow or systemic across the whole library. The goal-hierarchy model fixes this by giving each structural level its own goal and its own measurement, authored top-down from flow creation rather than reverse-engineered later.

FLOW      goal: complete the caller's task        metric: correctness · completion · humanness
  └─ GROUP   goal: a TYPED objective (collect/         metric: group_cohesion =
       │         confirm/act/terminal/handoff)               goal_yield·(0.35·efficiency+0.65·transition_coherence)
       └─ STATE goal: one bounded capability (≥1 slot)  metric: progress · stall · revisit · dwell · slot_fill

(A fourth level, archetype, is a cross-flow rollup of groups. It is a reporting lens, not an authoring level, so it is not counted among the three tiers.)

The three tiers and their goals

Tier	Goal — what “good” means	Declared by
State	one bounded capability / speech act — may collect or repair several slots	`description`, `required_slots`, `speech_act`
Group (segment)	a typed objective (not always a slot-contract)	`GoalSegment`: `purpose`, `kind`, `target`, `exit_guard`, `repair_policy`
Flow	complete the caller's task	flow-level `completion_slots`, `final_status`

A state's goal is NOT “one slot.” That would resurrect the refuted one-slot decomposition (it re-asks for facts the caller already volunteered). The validated Collect Pattern uses multi-slot listen states that absorb volunteered facts. A state's goal is one bounded capability that may fill or repair several slots.

Group goals are TYPED

Not every group is a slot-contract. RIFF groups include collect, confirm, act, terminal, and handoff phases, each with its own typed objective and success metric. A slot-contract is only the collect kind.

`kind`	objective	success metric
`collect`	fill the target slots	`slot_yield`
`confirm`	get the caller's confirmation	`confirmation_success`
`act`	a tool/program succeeds	`tool_success`
`terminal`	reach the intended terminal	`success_terminal`
`handoff`	safe transfer / escalation	`safe_transfer`
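
The table above can be read as a lookup from a segment's kind to the metric that scores it. The following is a minimal sketch of that lookup, not the shipped code: the names `SegmentKind`, `SUCCESS_METRIC`, and `success_metric_for` are hypothetical and do not come from riff/types.py; only the kind and metric strings are taken from the table.

```python
from enum import Enum


class SegmentKind(str, Enum):
    """Hypothetical enum mirroring the `kind` column of the table."""
    COLLECT = "collect"
    CONFIRM = "confirm"
    ACT = "act"
    TERMINAL = "terminal"
    HANDOFF = "handoff"


# Metric names come from the table; the mapping object itself is illustrative.
SUCCESS_METRIC = {
    SegmentKind.COLLECT: "slot_yield",
    SegmentKind.CONFIRM: "confirmation_success",
    SegmentKind.ACT: "tool_success",
    SegmentKind.TERMINAL: "success_terminal",
    SegmentKind.HANDOFF: "safe_transfer",
}


def success_metric_for(kind: str) -> str:
    """Return the success-metric name for a kind; unknown kinds raise ValueError."""
    return SUCCESS_METRIC[SegmentKind(kind)]
```

Rejecting unknown kinds at lookup time means a typo such as `action` fails loudly instead of silently scoring a segment with the wrong metric.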

The GoalSegment — a group that acts toward its goal

A GoalSegment is a Harel-statechart compound state: a super-state owning a shared objective, with member states as capabilities rather than a hand-wired chain. Statecharts keep transition ownership in the FSM, so the LLM stays language-only. Members declare what they collect/repair and their preconditions/effects — not their next state. The FSM picks the next member deterministically (a GOAP-lite selector over the slot contract):

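def choose_next(segment, ctx):
    # Record what the caller just said; this step never routes.
    apply_slot_observations(ctx)            # extraction/validation only
    if exit_guard(segment, ctx):            # objective met: leave the segment
        return segment.exit_target
    repair = first_repair_needed(segment, ctx)
    if repair:                              # a pending repair outranks new collection
        return repair.state
    slot = first_unsatisfied_slot(segment, ctx)
    if slot is None:                        # nothing left to collect
        return segment.exit_target
    # Only states whose preconditions currently hold are eligible.
    candidates = [s for s in states_that_collect(slot) if preconditions_hold(s, ctx)]
    if not candidates:                      # no actionable member: deterministic fallback
        return segment.fallback_state
    return min(candidates, key=deterministic_cost_tuple)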

Any next_state the LLM suggests is ignored and logged. The segment owns the objective and recomputes after every caller turn, so members become tools under one contract — not a brittle ask_name → ask_phone → ask_address chain. The schema for this contract shipped this session; see GoalSegment schema. The selector itself is the last phase, behind a flag.
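
To make the selection order concrete, here is a self-contained toy version of the selector. It is a sketch under stated assumptions, not the flagged implementation: the `State`, `Segment`, and `Ctx` classes and all their fields are hypothetical, the exit guard is reduced to "all target slots valid" (confirmation is left out), a repair is routed to any member that collects the slot, and the cost tuple is simply (declared order, name).

```python
from dataclasses import dataclass, field


@dataclass
class State:
    name: str
    collects: tuple[str, ...]        # slots this member can fill or repair
    requires: tuple[str, ...] = ()   # slots that must already be valid
    order: int = 0                   # declared order, used as the tie-break


@dataclass
class Segment:
    target_slots: tuple[str, ...]
    members: list[State]
    exit_target: str
    fallback_state: str
    max_attempts: int = 3            # assumed per-slot attempt cap


@dataclass
class Ctx:
    valid: set[str] = field(default_factory=set)          # validated slots
    needs_repair: set[str] = field(default_factory=set)   # extracted but invalid
    attempts: dict[str, int] = field(default_factory=dict)


def choose_next(segment: Segment, ctx: Ctx) -> str:
    """Return the name of the next state, using only declared data."""
    # Simplified exit guard: every target slot is valid.
    if all(s in ctx.valid for s in segment.target_slots):
        return segment.exit_target
    # Repairs first, then unsatisfied slots, both in declared slot order.
    pending = [s for s in segment.target_slots if s in ctx.needs_repair]
    pending += [s for s in segment.target_slots
                if s not in ctx.valid and s not in ctx.needs_repair]
    slot = pending[0]
    # Attempt cap reached: hand over to the deterministic fallback.
    if ctx.attempts.get(slot, 0) >= segment.max_attempts:
        return segment.fallback_state
    candidates = [m for m in segment.members
                  if slot in m.collects and all(r in ctx.valid for r in m.requires)]
    if not candidates:
        return segment.fallback_state
    # Deterministic cost tuple: declared order, then name.
    return min(candidates, key=lambda m: (m.order, m.name)).name


segment = Segment(
    target_slots=("name", "phone", "address"),
    members=[
        State("listen_contact", collects=("name", "phone"), order=0),
        State("ask_address", collects=("address",), requires=("name",), order=1),
    ],
    exit_target="schedule",
    fallback_state="handoff_agent",
)
```

Because the function reads only the segment declaration and the slot context, replaying the same context always yields the same next state, which is the property the replay / A-B evidence would depend on.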

Failure defenses

Failure	Defense
Oscillation	hysteresis: pursue the current slot until valid / max-attempts / blocked; tie-break by declared order, never by LLM wording
Premature exit	exit only when all required slots are valid (mentioned ≠ valid) and confirmation is satisfied
Deadlock	compile-time check: every required slot has ≥1 collecting state; ordering acyclic; runtime fallback if no candidate is actionable
Repair loops	per-slot + per-segment attempt caps → deterministic fallback (the shipped `stalled` guard + `sets:`)
Contradictory data	keep slot evidence records; latest explicit correction beats earlier implicit extraction
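
The contradictory-data defense in the last row can be sketched as a small resolution rule over evidence records. This is an illustration only: `SlotEvidence`, its fields, and `resolve_slot` are hypothetical names, and the rule that any explicit record outranks any implicit one (even a later implicit one) is an assumption that goes slightly beyond what the table states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlotEvidence:
    slot: str
    value: str
    turn: int        # caller turn on which the value was observed
    explicit: bool   # True when the caller stated or corrected the value directly


def resolve_slot(evidence: list[SlotEvidence], slot: str) -> Optional[str]:
    """Pick the winning value for a slot, or None if nothing was observed."""
    records = [e for e in evidence if e.slot == slot]
    if not records:
        return None
    # Explicit beats implicit; within the same class, the later turn wins.
    best = max(records, key=lambda e: (e.explicit, e.turn))
    return best.value


history = [
    SlotEvidence("phone", "555-0100", turn=1, explicit=False),  # implicit extraction
    SlotEvidence("phone", "555-0199", turn=3, explicit=True),   # caller's correction
    SlotEvidence("phone", "555-0100", turn=4, explicit=False),  # stale re-extraction
]
```

Keeping every record rather than overwriting the slot in place is what makes the rule auditable: the losing values stay available for debugging.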

Metrics at every level

Principle: each level has a typed objective score plus diagnostic submetrics, not one universal scalar. All metrics are timestamped, per-agent, and flow-versioned.

Tier	Objective + submetrics	Storage
State	progress, stall, revisit, dwell_turns, slot_fill_rate, latency_p95	`state_scores`
Group	goal_yield · efficiency · transition_coherence → group_cohesion	per-segment rollup of `state_scores`
Flow	correctness, completion, errors, latency, humanness	`aspect_scores`

The group-cohesion metric

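A naive product of the three terms collapses to zero as soon as any one term is zero, and the single number hides which part failed. The bounded form keeps each term visible and blends efficiency and transition_coherence as a weighted sum: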

goal_yield           = successful_eligible_visits / eligible_visits   # typed per kind
efficiency           = clamp(reference_turns / max(actual_turns, reference_turns, 1), 0, 1)
transition_coherence = clamp(1 − w1·redundant_ask − w2·avoidable_reentry − w3·lost_slot, 0, 1)

group_cohesion       = goal_yield · (0.35·efficiency + 0.65·transition_coherence)   # 0..1

Collaboration is emergent: a segment of locally-fine states can collectively thrash — each state “progresses,” yet the group re-asks across members and takes 8 turns for 3 slots. That is visible only at the group level. See Group cohesion for the as-shipped computation and a live table.
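
As a worked example of the formulas above, the sketch below scores the thrashing segment just described (3 slots in 8 turns). It is not the as-shipped computation in scripts/eval_db.py. The weights `w1`, `w2`, `w3` are not given in this document, so the values 0.5, 0.3, 0.2 are placeholders; treating the three penalty inputs as rates in 0..1, using `reference_turns = 3`, and returning 0.0 yield when there are no eligible visits are also assumptions.

```python
def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def group_cohesion(successful_eligible_visits: int, eligible_visits: int,
                   reference_turns: float, actual_turns: float,
                   redundant_ask: float, avoidable_reentry: float, lost_slot: float,
                   weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> dict:
    """Return every component, so a low score still shows which part failed."""
    w1, w2, w3 = weights  # placeholder weights, not from the source
    # Assumption: a segment with no eligible visits scores 0.0 yield.
    goal_yield = successful_eligible_visits / eligible_visits if eligible_visits else 0.0
    efficiency = clamp(reference_turns / max(actual_turns, reference_turns, 1))
    coherence = clamp(1 - w1 * redundant_ask - w2 * avoidable_reentry - w3 * lost_slot)
    return {
        "goal_yield": goal_yield,
        "efficiency": efficiency,
        "transition_coherence": coherence,
        "group_cohesion": goal_yield * (0.35 * efficiency + 0.65 * coherence),
    }


# Every visit eventually succeeds, but the group takes 8 turns for 3 slots
# and re-asks across members.
thrashing = group_cohesion(10, 10, reference_turns=3, actual_turns=8,
                           redundant_ask=0.4, avoidable_reentry=0.2, lost_slot=0.0)
# efficiency = 3/8 = 0.375, transition_coherence = 1 - 0.20 - 0.06 = 0.74
# group_cohesion = 1.0 * (0.35 * 0.375 + 0.65 * 0.74) = 0.61225
```

With these placeholder inputs the yield is a perfect 1.0 while cohesion is about 0.61, which is the gap between "every state progresses" and "the group cooperates" that the paragraph above describes.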

Top-down authoring — a new flow is born with goals

Flows are authored high-level → decomposed, so goals and groups exist from the first commit:

INTENT — the flow's task in one line (“book a plumbing visit”).
GROUPS — decompose into phases (greet → triage → collect → schedule → confirm → close); each is a GoalSegment stub with a purpose.
CONTRACTS — for each group, declare target_slots + exit_guard + repair_policy.
STATES — derive member states as capabilities.
INSTRUMENT — metrics attach automatically: state metrics from events, group_cohesion from the contract, flow metrics from completion_slots. Zero extra authoring.

Build-time gates so the hierarchy can't rot

Every required target_slot has ≥1 collecting state (deadlock check); segment ordering is acyclic; every group has a non-empty purpose + exit_guard; every flow's groups cover all completion_slots (no orphan goal). These run in the flow lint/load path and the pre-push hook, so a new flow can't ship a malformed contract.
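
The four gates can be sketched as one lint pass over a flow definition. This is a hedged illustration, not the project's lint code: the dict layout and the keys `segments`, `states`, `collects`, and `after` are hypothetical, while `purpose`, `exit_guard`, `target_slots`, and `completion_slots` are the names used in this document. The cycle check uses the standard-library `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter


def lint_flow(flow: dict) -> list[str]:
    """Return contract violations; an empty list means the flow passes."""
    errors: list[str] = []
    covered: set[str] = set()
    ordering: dict[str, set[str]] = {}

    for seg in flow["segments"]:
        name = seg["name"]
        ordering[name] = set(seg.get("after", []))  # hypothetical predecessor list
        if not seg.get("purpose"):
            errors.append(f"{name}: empty purpose")
        if not seg.get("exit_guard"):
            errors.append(f"{name}: missing exit_guard")
        collectable = {slot for st in seg.get("states", [])
                       for slot in st.get("collects", [])}
        for slot in seg.get("target_slots", []):
            covered.add(slot)
            if slot not in collectable:
                errors.append(f"{name}: no state collects '{slot}' (deadlock)")

    try:
        TopologicalSorter(ordering).prepare()  # raises CycleError on a cycle
    except CycleError:
        errors.append("segment ordering contains a cycle")

    for slot in flow.get("completion_slots", []):
        if slot not in covered:
            errors.append(f"flow: completion slot '{slot}' has no owning group")
    return errors


good_flow = {
    "completion_slots": ["name", "phone", "confirmed"],
    "segments": [
        {"name": "collect", "purpose": "gather contact details",
         "exit_guard": "all_valid", "target_slots": ["name", "phone"],
         "states": [{"name": "listen_contact", "collects": ["name", "phone"]}]},
        {"name": "confirm", "purpose": "read back and confirm", "after": ["collect"],
         "exit_guard": "confirmed", "target_slots": ["confirmed"],
         "states": [{"name": "read_back", "collects": ["confirmed"]}]},
    ],
}
```

Returning a list of messages instead of raising on the first failure lets a pre-push hook report every violation in one run.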

Implementation phases (as ordered by Codex review)

1. Explicit segment + flow-versioned group membership.
2. Fix slot-evidence instrumentation (clean per-state slot_filled capture). Gating prerequisite for any group metric. (shipped)
3. GoalSegment schema (typed kind + target + exit_guard + fallback + repair_policy). (shipped)
4. Build-time gates integrated into lint/load + pre-push.
5. Contract-backed group metrics. (cohesion proxy shipped)
6. Top-down authoring assist (intent → scaffold).
7. GOAP-lite selector: last, behind a flag, with replay / A-B evidence.

This session shipped phases 2 and 3 (per-state slot evidence at the commit choke point, and the GoalSegment schema), plus a working group_cohesion proxy computed from the reliable per-state signals.

Source: docs/flow-eval/GOAL-HIERARCHY-DESIGN.md, riff/types.py (class GoalSegment), scripts/eval_db.py (group_cohesion).