Stall Recovery — the stalled guard & sets: edges

An adversarial caller who withholds a required slot can loop a flow to escalation forever. This session shipped a deterministic recovery primitive: a declared edge that sets a default and advances — one turn before the engine would give up.

What & why

RIFF flows have a no-progress backstop: if the conversation sits in the same non-terminal state making no progress for max_no_progress_loops turns (default 3), the engine forces a graceful exit (a handoff) instead of looping to max_turns. That is correct as a last resort — but for some flows there is a better answer than escalation. If a caller refuses to give a name, a reservation flow could simply book under “Guest” and move on. The stalled guard + sets: edges give the flow author a way to declare that recovery deterministically, before the engine gives up (commit e22399d).

How it works — two primitives

1. The stalled guard

A registered guard in riff/state_manager.py that fires once the conversation has gone without progress for one turn fewer than the escalation limit, that is, one turn before turn.py's no-progress backstop would force a handoff. That timing is the whole point: it gives a declared recovery edge a chance to fire first.

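@register_guard("stalled")
def _guard_stalled(ctx):
    # Fall back to 3 when the flow omits the limit or sets it to None/0.
    limit = getattr(ctx.flow, "max_no_progress_loops", 3) or 3
    # Fire one turn before the backstop, but never below a streak of 1.
    return ctx.no_progress_streak >= max(1, limit - 1)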
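
To make the timing concrete, the sketch below reproduces the guard's threshold arithmetic outside the engine. The stalled_threshold and trace helpers are illustrative and not part of riff; only the max(1, limit - 1) expression and the default of 3 come from the guard itself. The backstop condition (streak >= limit) is an assumption about turn.py based on the description above.

```python
def stalled_threshold(max_no_progress_loops=3):
    """Streak at which the stalled guard starts returning True (same arithmetic as the guard)."""
    limit = max_no_progress_loops or 3   # None or 0 falls back to the default of 3
    return max(1, limit - 1)             # never below a streak of 1


def trace(max_no_progress_loops=3):
    """Hypothetical per-turn trace: (streak, stalled fires?, backstop fires?).

    Assumes the backstop triggers when streak >= limit; the real check lives in turn.py.
    """
    limit = max_no_progress_loops or 3
    threshold = stalled_threshold(max_no_progress_loops)
    return [(streak, streak >= threshold, streak >= limit)
            for streak in range(1, limit + 1)]


if __name__ == "__main__":
    for row in trace(3):
        print(row)
```

With the default limit of 3, the guard first fires at a streak of 2 and the assumed backstop at 3, which is the one-turn head start the recovery edge relies on. With a limit of 1 the two thresholds coincide in this sketch, so a flow that sets the limit that low may leave the recovery edge no room to fire first.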

2. TransitionDef.sets

The mirror of the existing clears: field. Where clears pops stale slots on an edge (the fix for the bare-denial loop), sets assigns default/recovery values when the edge fires. It is stored as ((slot, value), …) pairs on the frozen TransitionDef (riff/types.py), authored as a YAML mapping, and applied in the single commit choke point, StateManager.commit_transition, before the slot_filled emit, so the recovery value is captured as per-state evidence and is visible to the next hop's guards.
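
A minimal sketch of that storage-and-apply shape follows. It is not the riff implementation: TransitionSketch, from_yaml_mapping, the emit callback signature, and the clears-before-sets order are all assumptions made for illustration. What it takes from the description above is the tuple-of-pairs storage on a frozen dataclass and applying sets before the slot_filled emit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransitionSketch:
    """Hypothetical stand-in for TransitionDef; only the fields this example needs."""
    to: str
    when: str
    sets: tuple = ()     # ((slot, value), ...) pairs; tuples keep the frozen object hashable
    clears: tuple = ()   # slot names to pop when the edge fires


def from_yaml_mapping(to, when, sets=None, clears=None):
    """Normalize an authored mapping (e.g. {name: "Guest"}) into immutable tuples."""
    return TransitionSketch(to, when, tuple((sets or {}).items()), tuple(clears or ()))


def commit_transition(slots, edge, emit):
    """Apply clears, then sets, and only then emit, so listeners see the recovery value.

    The clears-then-sets order is an assumption of this sketch.
    """
    for name in edge.clears:
        slots.pop(name, None)
    for name, value in edge.sets:
        slots[name] = value
    emit("slot_filled", dict(slots))   # emit a snapshot taken after sets were applied
    return edge.to
```

Because the snapshot is emitted after the assignment, anything that records slot_filled events would see the default value in the same commit, which matches the "captured as per-state evidence" behavior described above.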

Authoring it — put the recovery edge LAST

The recovery edge must be the last transition in the state, so the normal (slots-filled) edge wins whenever the caller actually cooperates. The stalled edge only catches the case where every cooperative path failed. A real example from flows/restaurant_reservation.yaml (the held-out gap this primitive closed):

# in the state that collects the time:
transitions:
  # ... normal edges first (they win when the caller cooperates) ...
  - to: confirm_reservation
    when: stalled
    sets: {reservation_time: "7:00 PM"}     # withheld time → sensible default, advance

# elsewhere, in a name-collection state:
transitions:
  # ... normal edges first ...
  - to: read_back
    when: stalled
    sets: {name: "Guest"}                    # withheld name → book under Guest

Why this is deterministic and safe. The guard reads only ctx.no_progress_streak, with no LLM involved. The value comes from the authored edge, not from model reasoning. And because the edge is last, a cooperative caller never triggers it. The flow recovers gracefully from an adversarial/withholding caller instead of dead-ending at escalation.
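
The ordering rule can be illustrated with a small first-match selector. This is a sketch under the assumption that the engine evaluates a state's transitions in authored order and takes the first whose guard passes; pick_edge, the time_filled guard name, and the dict-shaped context are invented for the example and are not riff APIs.

```python
def pick_edge(transitions, guards, ctx):
    """Return the first transition whose guard passes (assumed first-match-wins order)."""
    for edge in transitions:
        if guards[edge["when"]](ctx):
            return edge
    return None


# Hypothetical guards; "time_filled" is an invented name standing in for the normal edge.
GUARDS = {
    "time_filled": lambda ctx: "reservation_time" in ctx["slots"],
    "stalled": lambda ctx: ctx["no_progress_streak"] >= 2,
}

# Normal edge first, recovery edge last.
TRANSITIONS = [
    {"to": "confirm_reservation", "when": "time_filled"},
    {"to": "confirm_reservation", "when": "stalled",
     "sets": {"reservation_time": "7:00 PM"}},
]

if __name__ == "__main__":
    cooperative = {"slots": {"reservation_time": "8:30 PM"}, "no_progress_streak": 2}
    withholding = {"slots": {}, "no_progress_streak": 2}
    print(pick_edge(TRANSITIONS, GUARDS, cooperative)["when"])
    print(pick_edge(TRANSITIONS, GUARDS, withholding)["when"])
```

Even when the streak is high enough for stalled to pass, the cooperative caller's filled slot lets the normal edge match first, so the default is never written over a real answer. Reversing the list order in this sketch would let the recovery edge shadow the normal one.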

Use case — closing a held-out gap

The held-out personas skeptical and privacy_guarded deliberately withhold information or stall. Against restaurant_reservation, such a caller drove the flow into a no-progress loop → escalation every time, capping the flow's quality. Adding the stalled recovery edges let the flow reach a success terminal under sensible defaults. The fix is deterministic and it generalizes: it isn't tuned to one transcript, it handles the whole class of withholding callers.

One primitive, a whole failure class. The same stalled + sets: edge later fixed a second, unrelated flow: generic_inquiry was escalating when the caller's inquiry was answered conversationally but the inquiry slot was never captured (held-out scores: skeptical 2.5, privacy 5.0, bulk 7.5 → all 10.0; commit eaf5249). The surface cause differs but the shape is identical: a required slot never gets captured → no-progress → escalate. That is the named failure class this primitive closes, and it is the deterministic fallback_state branch of the GOAP-lite selector.

Connection to the goal hierarchy

This primitive is exactly how a GoalSegment's repair_policy.fallback_state is reached in practice. The design calls for “per-slot + per-segment attempt caps → deterministic fallback”; the stalled guard + sets: edge is the shipped, concrete mechanism for that fallback. So a future contract-driven segment doesn't need a new recovery engine — it reuses this one.

Example — verify it loads

python -c "from riff.loader import load_flow; \
f = load_flow('flows/restaurant_reservation.yaml'); \
print([(t.to, t.when, t.sets) for s in f.states.values() for t in s.transitions if t.when=='stalled'])"

This prints the recovery edges with their sets pairs — confirming the YAML mapping was parsed into the frozen tuple the runtime applies.
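
A complementary check is that every stalled edge really is last in its state. The sketch below works on a plain mapping of state name to {"transitions": [...]}, such as the result of parsing the flow YAML directly; that shape, the function name, and the state names used in the example are assumptions, and it does not use the riff loader's object model.

```python
def misplaced_stalled_edges(states):
    """Return (state, index) for every stalled edge that is not the last transition.

    `states` is assumed to be a plain mapping: state name -> {"transitions": [...]}.
    """
    problems = []
    for name, state in states.items():
        edges = state.get("transitions") or []
        for index, edge in enumerate(edges):
            if edge.get("when") == "stalled" and index != len(edges) - 1:
                problems.append((name, index))
    return problems


if __name__ == "__main__":
    # Hypothetical state names, used only to exercise the check.
    flow = {
        "collect_time": {"transitions": [
            {"to": "confirm_reservation", "when": "time_filled"},
            {"to": "confirm_reservation", "when": "stalled"},
        ]},
        "collect_name": {"transitions": [
            {"to": "read_back", "when": "stalled"},      # misplaced: not last
            {"to": "read_back", "when": "name_filled"},
        ]},
    }
    print(misplaced_stalled_edges(flow))
```

An empty result means each recovery edge sits after the normal edges; any reported pair points at a state where the default could shadow a cooperative caller's path.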

Where it fits

Stall recovery is an FSM-runtime primitive that the eval system measures: a flow with good recovery edges shows lower escalation and stall in the per-state layer, and higher cohesion in its collect group. It is the deterministic counterpart to the goal hierarchy's repair policy.