Stall Recovery — the stalled guard & sets: edges
An adversarial caller who withholds a required slot can drive a flow to escalation every time. This session shipped a deterministic recovery primitive: a declared edge that sets a default and advances, one turn before the engine would give up.
What & why
RIFF flows have a no-progress backstop: if the conversation sits in the same non-terminal state
making no progress for max_no_progress_loops turns (default 3), the engine forces a
graceful exit (a handoff) instead of looping to max_turns. That is correct as a last
resort — but for some flows there is a better answer than escalation. If a caller refuses to give a
name, a reservation flow could simply book under “Guest” and move on. The
stalled guard + sets: edges give the flow author a way to declare that
recovery deterministically, before the engine gives up (commit e22399d).
How it works — two primitives
1. The stalled guard
A registered guard in riff/state_manager.py that fires once the no-progress streak reaches
max_no_progress_loops - 1 (never less than 1), which is one turn before turn.py's
no-progress backstop would force a handoff. That timing is the whole point: it gives a
declared recovery edge a chance to fire first.
@register_guard("stalled")
def _guard_stalled(ctx):
    limit = getattr(ctx.flow, "max_no_progress_loops", 3) or 3
    return ctx.no_progress_streak >= max(1, limit - 1)
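To make the timing concrete, the sketch below restates the guard's threshold as a standalone function and compares it with the backstop. It is a sketch under stated assumptions, not the engine's code: the backstop is assumed to trip once the streak reaches max_no_progress_loops, and stalled_fires, backstop_fires, make_ctx and first_firing_streak are illustrative names that do not exist in riff.

```python
from types import SimpleNamespace


def stalled_fires(ctx) -> bool:
    """Same threshold as the registered guard, minus the decorator."""
    limit = getattr(ctx.flow, "max_no_progress_loops", 3) or 3
    return ctx.no_progress_streak >= max(1, limit - 1)


def backstop_fires(ctx) -> bool:
    """ASSUMPTION: the backstop trips when the streak reaches the limit."""
    limit = getattr(ctx.flow, "max_no_progress_loops", 3) or 3
    return ctx.no_progress_streak >= limit


def make_ctx(limit, streak):
    """Hypothetical stand-in for the real guard context."""
    return SimpleNamespace(
        flow=SimpleNamespace(max_no_progress_loops=limit),
        no_progress_streak=streak,
    )


def first_firing_streak(check, limit, max_streak=10):
    """Smallest streak at which `check` returns True, or None if it never does."""
    for streak in range(max_streak + 1):
        if check(make_ctx(limit, streak)):
            return streak
    return None
```

Under these assumptions the guard leads the backstop by one turn for any limit of 2 or more. With a limit of 1 both thresholds coincide, so whether the recovery edge still fires first depends on engine ordering that is not described here. A limit of 0 or None falls back to 3 because of the `or 3`.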
2. TransitionDef.sets
The mirror of the existing clears: field. Where clears pops stale slots
on an edge (the fix for the bare-denial loop), sets assigns default/recovery
values when the edge fires. It is stored as ((slot, value), …) pairs on the frozen
TransitionDef (riff/types.py), authored as a YAML mapping, and applied in
the single commit choke point StateManager.commit_transition — before the
slot_filled emit, so the recovery value is captured as per-state evidence and is visible
to the next hop's guards.
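The storage and commit order described above can be sketched as follows. This is a minimal illustration, not the shipped code: TransitionSketch, transition_from_yaml and the event tuple shape are hypothetical, and running clears before sets is an assumption (the text only states that sets is applied before the slot_filled emit).

```python
from dataclasses import dataclass
from typing import Any, Mapping, Tuple


@dataclass(frozen=True)
class TransitionSketch:
    """Hypothetical stand-in for TransitionDef; only the fields discussed here."""
    to: str
    when: str = ""
    clears: Tuple[str, ...] = ()
    sets: Tuple[Tuple[str, Any], ...] = ()


def transition_from_yaml(raw: Mapping[str, Any]) -> TransitionSketch:
    """Freeze an authored YAML mapping into immutable (slot, value) pairs."""
    return TransitionSketch(
        to=raw["to"],
        when=raw.get("when", ""),
        clears=tuple(raw.get("clears") or ()),
        sets=tuple((raw.get("sets") or {}).items()),
    )


def commit_transition(slots: dict, edge: TransitionSketch, events: list) -> str:
    """Apply clears, then sets, then emit; return the next state name.

    ASSUMPTION: clears run before sets. Only "sets before the emit" is
    stated in the surrounding text.
    """
    for slot in edge.clears:
        slots.pop(slot, None)
    for slot, value in edge.sets:
        slots[slot] = value
    # Emitting after assignment lets the event carry the recovery value.
    events.append(("slot_filled", dict(slots)))
    return edge.to
```

Storing the pairs as a tuple rather than a dict keeps the frozen dataclass hashable, and doing the assignment inside the single commit function means every path that takes the edge sees the same slot state.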
Authoring it — put the recovery edge LAST
The recovery edge must be the last transition in the state, so the normal
(slots-filled) edge wins whenever the caller actually cooperates. The stalled edge only
catches the case where every cooperative path failed. A real example from
flows/restaurant_reservation.yaml (the held-out gap this primitive closed):
transitions:
  # ... normal edges first (they win when the caller cooperates) ...
  - to: confirm_reservation
    when: stalled
    sets: {reservation_time: "7:00 PM"}  # withheld time → sensible default, advance

  # elsewhere, a name-collection state:
  - to: read_back
    when: stalled
    sets: {name: "Guest"}  # withheld name → book under Guest
The stalled check reads ctx.no_progress_streak, not an LLM. The value comes from the authored
edge, not from model reasoning. And because the edge is last, a cooperative caller never triggers
it. The flow recovers gracefully from an adversarial or withholding caller instead of dead-ending at
escalation.
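Why ordering matters can be shown with a small selection loop. The sketch assumes edges are evaluated in authored order and the first passing guard wins, which is what the "put it last" rule implies; GUARDS, has_time and pick_edge are hypothetical names, and the stalled lambda hard-codes the threshold for a limit of 3.

```python
from types import SimpleNamespace

# Hypothetical guard table; the real registry lives in riff/state_manager.py.
GUARDS = {
    "has_time": lambda ctx: "reservation_time" in ctx.slots,
    "stalled": lambda ctx: ctx.no_progress_streak >= 2,  # limit 3 -> threshold 2
}

# Normal edge first, recovery edge last, as the authoring rule requires.
TRANSITIONS = [
    {"to": "confirm_reservation", "when": "has_time"},
    {
        "to": "confirm_reservation",
        "when": "stalled",
        "sets": {"reservation_time": "7:00 PM"},
    },
]


def pick_edge(transitions, ctx):
    """Return the first transition whose guard passes (ASSUMED first-match-wins)."""
    for edge in transitions:
        if GUARDS[edge["when"]](ctx):
            return edge
    return None


# A caller who finally gives a time on the turn the streak hits the threshold.
late_cooperator = SimpleNamespace(
    slots={"reservation_time": "8:30 PM"}, no_progress_streak=2
)
# A caller who never gives a time.
withholder = SimpleNamespace(slots={}, no_progress_streak=2)
```

With the recovery edge last, the late cooperator takes the normal edge and keeps the time they gave, while the withholder falls through to the stalled edge. If the list were reversed, the stalled edge would match first for the late cooperator and its sets would overwrite the caller's own answer with the default.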
Use case — closing a held-out gap
The held-out personas skeptical and
privacy_guarded deliberately withhold information or stall. Against
restaurant_reservation, such a caller drove the flow into a no-progress loop and then
escalation every time, capping the flow's quality. Adding the stalled recovery edges let
the flow reach a success terminal under sensible defaults. The fix is deterministic and it
generalizes: it isn't tuned to one transcript; it handles the whole class of withholding caller.
The same stalled + sets: edge pattern later fixed a second, unrelated flow:
generic_inquiry was escalating when the caller's inquiry was answered conversationally but
the inquiry slot was never captured (held-out skeptical 2.5, privacy 5.0, bulk 7.5 →
all 10.0, commit eaf5249). Different surface cause, identical shape —
a required slot never gets captured → no-progress → escalate. That is the named
failure class this primitive closes, and it is the deterministic fallback_state branch of
the GOAP-lite selector.
Connection to the goal hierarchy
This primitive is exactly how a GoalSegment's
repair_policy.fallback_state is reached in practice. The design calls for “per-slot
+ per-segment attempt caps → deterministic fallback”; the stalled guard +
sets: edge is the shipped, concrete mechanism for that fallback. So a future
contract-driven segment doesn't need a new recovery engine — it reuses this one.
Example — verify it loads
python -c "from riff.loader import load_flow; \
f = load_flow('flows/restaurant_reservation.yaml'); \
print([(t.to, t.when, t.sets) for s in f.states.values() for t in s.transitions if t.when=='stalled'])"
This prints the recovery edges with their sets pairs — confirming the YAML mapping was
parsed into the frozen tuple the runtime applies.
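Loading confirms the edges parse, but not that they sit last in their state. A small check along the following lines could cover that authoring rule. It is a sketch that works on the plain dictionary a YAML parser would return and assumes a states → transitions layout; misplaced_stalled_edges is a hypothetical helper, not part of riff.

```python
def misplaced_stalled_edges(flow: dict) -> list:
    """Return (state, index) for every stalled edge that is not last in its state.

    ASSUMPTION: `flow` looks like {"states": {name: {"transitions": [...]}}}.
    """
    problems = []
    for state_name, state in (flow.get("states") or {}).items():
        transitions = state.get("transitions") or []
        last = len(transitions) - 1
        for index, edge in enumerate(transitions):
            if edge.get("when") == "stalled" and index != last:
                problems.append((state_name, index))
    return problems


# Illustrative input only; state and guard names here are made up.
EXAMPLE_FLOW = {
    "states": {
        "collect_time": {
            "transitions": [
                {"to": "confirm_reservation", "when": "has_time"},
                {"to": "confirm_reservation", "when": "stalled"},
            ]
        },
        "collect_name": {
            "transitions": [
                {"to": "read_back", "when": "stalled"},
                {"to": "read_back", "when": "has_name"},
            ]
        },
    }
}
```

Here collect_name would be reported because its stalled edge precedes the normal edge; an empty result means every recovery edge is in the last position.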
Where it fits
Stall recovery is an FSM-runtime primitive that the eval system measures: a flow with good
recovery edges shows lower escalation and stall in the
per-state layer, and higher
cohesion in its collect group. It is the deterministic
counterpart to the goal hierarchy's repair policy.