The Goal-Hierarchy Flow Model
A flow is not a flat graph of states. It is a three-tier goal hierarchy — and every level declares a goal and is instrumented with a metric for that goal.
What & why
A naive view treats a flow as a flat bag of states and scores it with one number. That number hides everything that matters: which phase failed, whether a phase's states cooperate, and whether a weakness is local to one flow or systemic across the whole library. The goal-hierarchy model fixes this by giving each structural level its own goal and its own measurement, authored top-down from flow creation rather than reverse-engineered later.
```
FLOW   goal:   complete the caller's task
       metric: correctness · completion · humanness
└─ GROUP  goal:   a TYPED objective (collect/confirm/act/terminal/handoff)
   │      metric: group_cohesion = goal_yield·(0.35·efficiency + 0.65·transition_coherence)
   └─ STATE  goal:   one bounded capability (≥1 slot)
             metric: progress · stall · revisit · dwell · slot_fill
```
(The fourth tier, archetype, is a cross-flow rollup of groups — it is a reporting lens, not an authoring level.)
The three tiers and their goals
| Tier | Goal — what “good” means | Declared by |
|---|---|---|
| State | one bounded capability / speech act — may collect or repair several slots | description, required_slots, speech_act |
| Group (segment) | a typed objective (not always a slot-contract) | GoalSegment: purpose, kind, target, exit_guard, repair_policy |
| Flow | complete the caller's task | flow-level completion_slots, final_status |
Listen states, for example, absorb volunteered facts, so a state's goal is one
bounded capability that may fill or repair several slots.
Group goals are TYPED
Not every group is a slot-contract. RIFF groups include collect, confirm, act, terminal, and
handoff phases, each with its own typed objective and success metric. A slot-contract is only the
collect kind.
| Kind | Objective | Success metric |
|---|---|---|
| collect | fill the target slots | slot_yield |
| confirm | get the caller's confirmation | confirmation_success |
| act | a tool/program succeeds | tool_success |
| terminal | reach the intended terminal | success_terminal |
| handoff | safe transfer / escalation | safe_transfer |
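To make the typed kinds concrete, here is a minimal Python sketch of how a segment might carry its kind and look up the matching success metric. The field names follow the "Declared by" column for groups (purpose, kind, target, exit_guard, repair_policy), but the field types, the `SegmentKind` enum, and the `SUCCESS_METRIC` mapping are illustrative assumptions, not the shipped schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class SegmentKind(Enum):
    """The five group kinds listed in the table above."""
    COLLECT = "collect"
    CONFIRM = "confirm"
    ACT = "act"
    TERMINAL = "terminal"
    HANDOFF = "handoff"


# Success metric per kind, copied from the table above.
SUCCESS_METRIC = {
    SegmentKind.COLLECT: "slot_yield",
    SegmentKind.CONFIRM: "confirmation_success",
    SegmentKind.ACT: "tool_success",
    SegmentKind.TERMINAL: "success_terminal",
    SegmentKind.HANDOFF: "safe_transfer",
}


@dataclass(frozen=True)
class GoalSegment:
    """Hypothetical in-memory shape; the real schema may differ."""
    purpose: str
    kind: SegmentKind
    target: Tuple[str, ...]  # assumption: slot names (or a terminal/tool name)
    exit_guard: str          # assumption: stored as a named guard
    repair_policy: str       # assumption: stored as a named policy

    @property
    def success_metric(self) -> str:
        # The metric is derived from the kind, never authored by hand.
        return SUCCESS_METRIC[self.kind]
```

Deriving the metric from the kind keeps authoring limited to the contract itself; a confirm segment cannot accidentally be scored as a slot-contract.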
The GoalSegment — a group that acts toward its goal
A GoalSegment is a Harel-statechart compound state: a super-state owning a shared
objective, with member states as capabilities rather than a hand-wired chain. Statecharts
keep transition ownership in the FSM, so the LLM stays language-only. Members declare what they
collect/repair and their preconditions/effects — not their next state. The FSM picks the
next member deterministically (a GOAP-lite selector over the slot contract):
```python
def choose_next(segment, ctx):
    apply_slot_observations(ctx)  # extraction/validation only
    if exit_guard(segment, ctx):
        return segment.exit_target
    repair = first_repair_needed(segment, ctx)
    if repair:
        return repair.state
    slot = first_unsatisfied_slot(segment, ctx)
    if slot is None:
        return segment.exit_target
    candidates = [s for s in states_that_collect(slot) if preconditions_hold(s, ctx)]
    if not candidates:
        return segment.fallback_state
    return min(candidates, key=deterministic_cost_tuple)
```
Any next_state the LLM suggests is ignored and logged. The segment
owns the objective and recomputes after every caller turn, so members become tools under one
contract — not a brittle ask_name → ask_phone → ask_address chain. The schema
for this contract shipped this session; see GoalSegment
schema. The selector itself is the last phase, behind a flag.
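The pseudocode above leans on helpers that are not defined here. As a rough, self-contained illustration of the same selection idea, the sketch below fills them in with deliberately simple assumptions: `MemberState`, `Segment`, and `Context` are hypothetical types, preconditions are modeled as "these slots must already be valid", and the cost tuple is just (declared order, name). The repair and slot-observation steps are omitted for brevity.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

log = logging.getLogger("goal_segment")


@dataclass(frozen=True)
class MemberState:
    name: str
    collects: Tuple[str, ...]
    requires: Tuple[str, ...] = ()  # assumption: preconditions are valid slots
    order: int = 0                  # declared order, used as the tie-break


@dataclass
class Segment:
    required_slots: Tuple[str, ...]
    members: List[MemberState]
    exit_target: str
    fallback_state: str


@dataclass
class Context:
    valid_slots: Dict[str, str] = field(default_factory=dict)


def choose_next(segment: Segment, ctx: Context,
                llm_suggestion: Optional[str] = None) -> str:
    """Pick the next member state deterministically from the slot contract."""
    if llm_suggestion is not None:
        # The LLM's suggestion is logged and never consulted.
        log.info("ignoring LLM-suggested next_state %r", llm_suggestion)
    # First required slot that is not yet valid, in declared order.
    slot = next((s for s in segment.required_slots
                 if s not in ctx.valid_slots), None)
    if slot is None:
        return segment.exit_target
    candidates = [m for m in segment.members
                  if slot in m.collects
                  and all(r in ctx.valid_slots for r in m.requires)]
    if not candidates:
        return segment.fallback_state
    # Deterministic cost tuple: declared order first, then name.
    return min(candidates, key=lambda m: (m.order, m.name)).name
```

Because the selection depends only on the contract and the validated slots, replaying the same context always yields the same next state, which is what makes replay and A-B evidence meaningful.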
Failure defenses
| Failure | Defense |
|---|---|
| Oscillation | hysteresis: pursue the current slot until valid / max-attempts / blocked; tie-break by declared order, never by LLM wording |
| Premature exit | exit only when all required slots are valid (mentioned ≠ valid) and confirmation is satisfied |
| Deadlock | compile-time check: every required slot has ≥1 collecting state; ordering acyclic; runtime fallback if no candidate is actionable |
| Repair loops | per-slot + per-segment attempt caps → deterministic fallback (the shipped stalled guard + sets:) |
| Contradictory data | keep slot evidence records; latest explicit correction beats earlier implicit extraction |
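The contradictory-data defense can be illustrated with a small sketch. The `SlotEvidence` record and `resolve_slot` function are hypothetical names, and the ranking rule goes one step beyond the table: it assumes any explicit statement outranks any implicit extraction, with the latest turn breaking ties.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SlotEvidence:
    """One observation of a slot value (hypothetical record shape)."""
    slot: str
    value: str
    turn: int
    explicit: bool  # True for an explicit caller statement or correction


def resolve_slot(records: List[SlotEvidence], slot: str) -> Optional[str]:
    """Return the winning value for a slot, or None if never observed."""
    matching = [r for r in records if r.slot == slot]
    if not matching:
        return None
    # Assumption: explicit evidence outranks implicit; latest turn wins ties.
    return max(matching, key=lambda r: (r.explicit, r.turn)).value
```

Keeping every record rather than overwriting the slot means the resolution rule can be changed later and replayed against old calls.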
Metrics at every level
Principle: each level has a typed objective score plus diagnostic submetrics, not one universal scalar. All metrics are timestamped, per-agent, and flow-versioned.
| Tier | Objective + submetrics | Storage |
|---|---|---|
| State | progress, stall, revisit, dwell_turns, slot_fill_rate, latency_p95 | state_scores |
| Group | goal_yield · efficiency · transition_coherence → group_cohesion | per-segment rollup of state_scores |
| Flow | correctness, completion, errors, latency, humanness | aspect_scores |
The group-cohesion metric
The naive product goal_yield · efficiency · transition_coherence collapses toward zero whenever any one factor is low, and hides which part failed. The bounded, visible form:
```
goal_yield           = successful_eligible_visits / eligible_visits   # typed per kind
efficiency           = clamp(reference_turns / max(actual_turns, reference_turns, 1), 0, 1)
transition_coherence = clamp(1 − w1·redundant_ask − w2·avoidable_reentry − w3·lost_slot, 0, 1)
group_cohesion       = goal_yield · (0.35·efficiency + 0.65·transition_coherence)   # 0..1
```
Collaboration is emergent: a segment of locally-fine states can collectively thrash — each state “progresses,” yet the group re-asks across members and takes 8 turns for 3 slots. That is visible only at the group level. See Group cohesion for the as-shipped computation and a live table.
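As a worked illustration of the formula, the sketch below computes the components for the thrashing case just described (3 slots, 8 turns). It is not the as-shipped computation: the weights `w1`, `w2`, `w3` are not given in this document, so the defaults here are placeholders, the three penalty inputs are assumed to be rates in 0..1, and returning 0.0 for a segment with no eligible visits is also an assumption.

```python
from typing import Dict


def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def group_cohesion(successful_eligible_visits: int, eligible_visits: int,
                   reference_turns: int, actual_turns: int,
                   redundant_ask: float, avoidable_reentry: float,
                   lost_slot: float,
                   w1: float = 0.4, w2: float = 0.3,
                   w3: float = 0.3) -> Dict[str, float]:
    """Return every component so a low score shows which part failed."""
    # Assumption: no eligible visits means no yield, not a division error.
    goal_yield = (successful_eligible_visits / eligible_visits
                  if eligible_visits else 0.0)
    efficiency = clamp(reference_turns / max(actual_turns, reference_turns, 1))
    transition_coherence = clamp(
        1 - w1 * redundant_ask - w2 * avoidable_reentry - w3 * lost_slot)
    cohesion = goal_yield * (0.35 * efficiency + 0.65 * transition_coherence)
    return {
        "goal_yield": goal_yield,
        "efficiency": efficiency,
        "transition_coherence": transition_coherence,
        "group_cohesion": cohesion,
    }


# Every visit succeeded, but 3 slots took 8 turns and half the asks
# repeated a question another member had already asked.
scores = group_cohesion(4, 4, reference_turns=3, actual_turns=8,
                        redundant_ask=0.5, avoidable_reentry=0.0,
                        lost_slot=0.0)
```

With these placeholder weights, efficiency is 3/8 = 0.375, transition_coherence is 0.8, and group_cohesion is 0.35·0.375 + 0.65·0.8 ≈ 0.651, even though goal_yield alone is a perfect 1.0.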
Top-down authoring — a new flow is born with goals
Flows are authored high-level → decomposed, so goals and groups exist from the first commit:
- INTENT: the flow's task in one line (“book a plumbing visit”).
- GROUPS: decompose into phases (greet → triage → collect → schedule → confirm → close); each is a GoalSegment stub with a purpose.
- CONTRACTS: for each group, declare target_slots + exit_guard + repair_policy.
- STATES: derive member states as capabilities.
- INSTRUMENT: metrics attach automatically: state metrics from events, group_cohesion from the contract, flow metrics from completion_slots. Zero extra authoring.
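The first two authoring steps can be pictured as a small scaffold function. This is only a sketch of the idea: `scaffold_flow`, the dict layout, and the `after` field used to record phase ordering are all hypothetical, and contracts and states are left empty for the author to fill in.

```python
from typing import Dict, List


def scaffold_flow(intent: str, phases: List[str]) -> Dict:
    """Turn a one-line intent and an ordered phase list into segment stubs."""
    segments = []
    for i, phase in enumerate(phases):
        segments.append({
            "name": phase,
            "purpose": "TODO: purpose of " + phase,  # author replaces this
            "target_slots": [],    # CONTRACTS step fills these in
            "exit_guard": "",
            "repair_policy": "",
            # Hypothetical ordering field: each phase follows the previous one.
            "after": [phases[i - 1]] if i > 0 else [],
        })
    return {"intent": intent, "segments": segments, "states": []}


flow = scaffold_flow(
    "book a plumbing visit",
    ["greet", "triage", "collect", "schedule", "confirm", "close"],
)
```

The stub deliberately leaves `exit_guard` empty, so a scaffold that is never completed is not a valid contract.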
Build-time gates check the contract before a flow can ship: every
target_slot has ≥1 collecting state (deadlock check); segment ordering
is acyclic; every group has a non-empty purpose + exit_guard; and every
flow's groups cover all completion_slots (no orphan goal). These run in the flow
lint/load path and the pre-push hook, so a new flow can't ship
a malformed contract.
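A minimal sketch of what such gates could look like as a lint function follows. The flow is modeled as a plain dict with hypothetical keys (`states[].collects`, `segments[].target_slots`, `segments[].after`, `completion_slots`); the real lint/load path and its data model are not shown in this document.

```python
from typing import Dict, List


def _has_cycle(after: Dict[str, List[str]]) -> bool:
    """Depth-first search for a cycle in the segment ordering graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {name: WHITE for name in after}

    def visit(name: str) -> bool:
        color[name] = GREY
        for dep in after.get(name, []):
            state = color.get(dep, WHITE)
            if state == GREY or (state == WHITE and visit(dep)):
                return True
        color[name] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in after)


def lint_flow(flow: Dict) -> List[str]:
    """Return a list of contract violations; an empty list means pass."""
    errors = []
    # Assumption: a collector anywhere in the flow satisfies the check.
    collected = {slot for st in flow.get("states", [])
                 for slot in st.get("collects", [])}
    covered = set()
    for seg in flow.get("segments", []):
        if not seg.get("purpose"):
            errors.append("segment %s: empty purpose" % seg["name"])
        if not seg.get("exit_guard"):
            errors.append("segment %s: empty exit_guard" % seg["name"])
        for slot in seg.get("target_slots", []):
            covered.add(slot)
            if slot not in collected:
                errors.append("deadlock: no state collects %s" % slot)
    for slot in flow.get("completion_slots", []):
        if slot not in covered:
            errors.append("orphan goal: no group targets %s" % slot)
    ordering = {s["name"]: s.get("after", []) for s in flow.get("segments", [])}
    if _has_cycle(ordering):
        errors.append("segment ordering is cyclic")
    return errors
```

Returning all violations at once, rather than raising on the first, suits a pre-push hook where the author wants the full list in one run.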
Implementation phases (as ordered by Codex review)
1. Explicit segment + flow-versioned group membership.
2. Fix slot-evidence instrumentation (clean per-state slot_filled capture). Gating prerequisite for any group metric. (shipped)
3. GoalSegment schema (typed kind + target + exit_guard + fallback + repair_policy). (shipped)
4. Build-time gates integrated into lint/load + pre-push.
5. Contract-backed group metrics. (cohesion proxy shipped)
6. Top-down authoring assist (intent → scaffold).
7. GOAP-lite selector: last, behind a flag, with replay / A-B evidence.
This session shipped phases 2 and 3 (per-state slot evidence at the commit choke point, and the
GoalSegment schema), plus a working group_cohesion proxy computed from the reliable
per-state signals. Source: docs/flow-eval/GOAL-HIERARCHY-DESIGN.md.