Rewiring a Returned Daemon: the "Unheal" Path
=============================================

Status: **design draft**.  This document proposes a mechanism, not yet
implemented, for restoring a daemon to the routing tree after it has
disappeared and later returned.  It builds directly on the heal path
described in :doc:`bootstrap_plan` (Step 7b) and reuses the fault
machinery in ``src/rml``.

Problem statement
-----------------

In a bootstrapped DVM the compute nodes come up independently and are not
fanned out by a launcher.  A node can therefore *leave* the DVM without the
DVM being torn down — the node is powered off, loses power, or is rebooted —
and then *return* when it comes back up.  Its daemon boots again with the
**same rank** (rank is derived from the node's fixed position in the
``DVMNodes`` ordering, not assigned at launch).

Today the RML handles only half of this life cycle:

* **Heal (works).**  When the daemon disappears, ``lost_connection`` /
  ``failed_to_connect`` drive ``prte_rml_route_lost`` →
  ``prte_rml_repair_routing_tree``, which promotes the orphaned children to
  their grandparent and drives adoption/failure notices.  The tree closes
  over the hole.

* **Unheal (missing).**  When the daemon returns, nothing re-inserts it.  The
  departure was recorded as **permanent**: ``prte_rml_repair_routing_tree``
  sets the rank in both ``failed_dmns`` *and* ``dead_dmns``
  (``routed_radix.c``), and ``dead_dmns`` is never cleared and is restored
  into ``failed_dmns`` on every ``prte_rml_compute_routing_tree``.  From then
  on ``radix_is_living`` reports the rank dead forever, so ``get_route``, the
  children array, and the ancestor list never point back at it.  The returned
  daemon becomes a zombie the tree ignores; its former children stay attached
  to the grandparent.

The goal of this design is to make the return a first-class event: the
returned daemon rejoins its old slot in the tree, its children drop the
grandparent lifeline and re-home to it, and the DVM converges on the same
tree it would have computed had the daemon never left.

Scope and non-goals
-------------------

* **Bootstrap mode only.**  Unheal is gated on ``prte_bootstrap_setup``.  In a
  launcher-driven or elastic-shrink DVM a departed vpid is genuinely
  permanent (a shrink retires the vpid on purpose; #2491 depends on it), and
  those modes keep their current behavior unchanged.
* **Same rank, same identity.**  We handle a node that returns as *itself*
  (same nspace + rank).  We do **not** reuse a vpid for a different node; that
  invariant is preserved.
* **No change to the heal path's externally observed behavior.**  A daemon
  that leaves and never returns must behave exactly as it does today.

Design principle: revival is the inverse of repair, not a special case of it
----------------------------------------------------------------------------

The tree is a deterministic function of ``(radix, num_daemons, failed_set)``.
Death removes a rank from the live set and everyone recomputes; revival adds
it back and everyone recomputes.  The two are symmetric at the level of the
routing math, but **not** at the level of the notification code:

* ``prte_rml_repair_routing_tree`` and the adoption-inference logic in
  ``rml_fault_handler.c`` assume **depth only ever decreases** — a promotion.
  ``prte_rml_recv_adoption_notice`` explicitly treats an ancestor list that
  *grew* as an unrecoverable invariant violation
  (``rml_fault_handler.c``, the ``report.size > ancestors.size`` branch that
  raises ``PRTE_ERR_UNRECOVERABLE``).

* Revival is exactly the case that grows a daemon's ancestor list: a rank is
  re-inserted **above** the daemons in its former subtree, demoting them one
  level.  Feeding that through the death path would trip the invariant check.

Therefore revival needs its **own** recompute-and-notify entry point,
``prte_rml_revive_routing_tree``, parallel to ``prte_rml_repair_routing_tree``,
plus its own notice tags.  It must not be bolted onto the repair path.

Separating "absent" from "dead"
-------------------------------

The root cause of the missing behavior is that one bitmap
(``dead_dmns``) is asked to mean two different things.  Split them:

===================  ============  ==================================  ==================
Set                  Persistent?   Set by                              Cleared by
===================  ============  ==================================  ==================
``failed_dmns``      no (per       repair / revive / compute restore   compute re-init;
                     recompute)                                        revive
``global_failed``    no            global death xcast                  revive
``dead_dmns``        yes           shrink holes (``nidmap.c``);         **never**
                                   non-bootstrap faults
``absent_dmns``      yes           **bootstrap** faults (new)          revive (new)
(new)
===================  ============  ==================================  ==================

* ``absent_dmns`` is a new persistent bitmap constructed once in
  ``prte_rml_open`` alongside ``dead_dmns`` and, like it, **not** re-initialized
  by ``prte_rml_compute_routing_tree``.
* ``prte_rml_repair_routing_tree`` chooses the target set by mode: in a
  bootstrapped DVM a fault records the rank in ``absent_dmns``; otherwise it
  records it in ``dead_dmns`` exactly as today.  Both are restored into the
  freshly-initialized ``failed_dmns`` at the top of
  ``prte_rml_compute_routing_tree`` (so a grow still routes around an
  absent-but-not-yet-returned daemon).
* Revival clears the rank from ``failed_dmns``, ``global_failed_dmns``, and
  ``absent_dmns``.  ``dead_dmns`` is still never touched — a shrunk-out rank
  can never be revived, which is correct.

This keeps the #2491 fix and all launcher/elastic behavior byte-for-byte
identical (nothing outside bootstrap ever populates ``absent_dmns``), while
giving bootstrap a departure set that *can* be reversed.

The trigger: a returned daemon announces itself to the HNP
----------------------------------------------------------

When the node reboots, its daemon runs the normal bootstrap startup
(:doc:`bootstrap_plan`, Steps 4–7).  It computes a **healthy** tree — its own
``absent_dmns`` is empty, so it sees the full DVM — and connects up its
lifeline.  It has no way to know, on its own, that the rest of the DVM wrote
it off while it was gone.

Rather than infer the return from a stray inbound socket (routing policy does
not belong in the OOB accept path), make it explicit and route it through the
arbiter of global tree state, the **HNP**, mirroring how death is globally
xcast:

#. **Rejoin request (one hop up, then filtered).**  On bootstrap startup, when
   ``prte_bootstrap_setup`` is set, the daemon sends
   ``PRTE_RML_TAG_DAEMON_RETURNED{rank=self}`` **to its parent** (one hop up its
   lifeline), not to the HNP.  The parent is exactly the daemon that knows
   whether the rank was absent -- the global death broadcast marked it
   everywhere -- so the parent filters: if the rank is not in *its*
   ``absent_dmns`` (a first boot, a duplicate) it drops the notice, and only if
   the rank really was absent does it escalate one relayed message to the HNP.
   This deliberately avoids an N-to-root pattern: on the common first-boot path
   the root is never involved (it sees announcements only from its own few
   direct children, all dropped), and the notice rides the existing lifeline
   link so no daemon opens a socket to the root.  A real return costs one
   escalated message to the root, ``O(1)``.

#. **HNP validates and broadcasts (global).**  The HNP checks the rank against
   ``absent_dmns``.  If absent, it clears the rank locally and xcasts
   ``PRTE_RML_TAG_DAEMON_REVIVED{rank}`` to the whole DVM, exactly as
   ``send_failures_notice`` xcasts ``PRTE_RML_TAG_DAEMON_DIED`` from the master
   (``rml_fault_handler.c``).  If the rank is not absent (a genuine first boot,
   or a duplicate), the HNP drops the request — the operation is idempotent.

#. **Everyone converges (global).**  Each daemon's
   ``prte_rml_recv_revival_notice`` clears the rank from its failure sets and
   calls ``prte_rml_revive_routing_tree(rank)``.  Because the failed set is now
   globally consistent again, every daemon deterministically recomputes the
   same tree.

Centralizing at the HNP avoids the split-brain that per-subtree local revival
would invite, and reuses the existing global-xcast plumbing.

``prte_rml_revive_routing_tree`` — the recompute and its deltas
-----------------------------------------------------------------

Symmetric to ``prte_rml_repair_routing_tree``:

#. Clear ``rank`` from ``failed_dmns`` / ``global_failed_dmns`` /
   ``absent_dmns`` (the recv handler does this before calling in, matching how
   ``repair`` sets the bit before recomputing).
#. Snapshot ``prev_ancestors`` / ``prev_parent`` / ``prev_children`` into a
   ``prte_rml_recovery_status_t``.
#. Re-derive ancestors, promotion/**demotion**, and children.  Most of this is
   the existing helpers run against the updated (smaller) failed set:

   * ``prte_rml_update_ancestors`` already walks to the next *living* ancestor;
     with ``rank`` now living again it will re-appear in the lists of the
     daemons below it, **growing** their ancestor arrays.  This is the case the
     current code deliberately rejects, so ``update_ancestors`` needs an
     audited pass to confirm it produces the right list when depth increases
     (it may need a companion to the promotion path — a "demotion" fixup —
     analogous to ``handle_promotion``).
   * The daemon that had adopted ``rank``'s orphans (``rank``'s parent) loses
     them from its child list; ``rank`` regains them.  ``update_descendants``
     recomputes children from the live set, so the child arrays fall out
     correctly once the failed bit is cleared; the work is producing the
     *delta* for the notices.

#. Fill in ``parent_changed`` / ``children_changed`` / a new ``demoted`` flag
   (mirroring ``promoted``) and notify the components:
   ``prte_rml_fault_handler``, ``prte_grpcomm.fault_handler``,
   ``prte_filem.fault_handler``, ``prte_relm.fault_handler``.

Recovery-status and component impact
------------------------------------

The existing ``prte_rml_recovery_status_t`` is close but assumes promotion.
Two additions:

* Add a ``bool demoted`` flag (a daemon gained an ancestor / its subtree
  shrank), the mirror of ``promoted``.  Handlers that "treat all children as
  new when promoted" need the analogous rule: **treat the re-homing children
  as new when a neighbor is revived**, because a child that briefly had the
  grandparent as parent must discard that lineage.
* The RML's own reaction (``rml_fault_handler.c``) needs revival analogues of
  its two notices:

  - **Re-home notice (down)** — the inverse of the adoption notice.  ``rank``'s
    parent tells the affected promoted children "your ancestor list has
    changed; ``rank`` is back above you," so they drop the grandparent lifeline
    and re-open their lifeline to ``rank`` (or the closest revived ancestor in
    their path).  The receive handler is the inverse of
    ``prte_rml_recv_adoption_notice`` and must accept a *grown* ancestor list
    rather than rejecting it.
  - **Rejoin/rollup (up)** — RELM re-drives any messages that were in flight
    across the re-homing so nothing is lost, exactly as it re-drives across a
    heal.

``prte_grpcomm`` and ``prte_filem`` already receive the recovery status on
every tree change; they must tolerate a change whose net effect is a daemon
*appearing*.  The audit here is: does any collective/xcast accounting assume
membership only shrinks?  ``prte_rml_get_num_contributors`` counts live
children, so a revived child correctly re-enters the count once the failed bit
clears — but in-progress collectives that already excluded ``rank`` need the
same "save state between local and global scope" discipline the death path
uses (``rml_types.h`` documents this contract on the status struct).

Bringing the returned daemon up to date
----------------------------------------

The routing tree is only half the problem.  While it was gone the returned
daemon missed everything: jobs launched, other faults, nidmap growth from
elastic grows.  It boots with a stale world view.  Re-inserting it into the
tree without resynchronizing its state would let it route correctly but act on
stale data.

This is the same problem the **elastic grow** path already solves — "admit a
daemon into a running DVM and hand it the current state" — with one twist: the
vpid is the returned daemon's own, not a newly minted one.  Reuse that
machinery:

* The HNP, on processing the rejoin, drives the returned daemon through the
  grow-style wireup so it receives the current nidmap (with any holes) and the
  active job/proc data, rather than the boot-time snapshot.
* Because the returned rank is an existing hole rather than an extension of the
  vpid span, ``num_daemons`` does **not** change; only the returned rank's
  ``dead``/``absent`` state and the tree change.  The nidmap-hole bookkeeping
  in ``nidmap.c`` must not re-mark the returning rank as dead when it repopulates
  ``daemons->procs`` — clearing ``absent_dmns`` for the rank must precede, or be
  reconciled with, that scan.

Concurrency and correctness concerns
------------------------------------

* **Incarnation / stale-message hazard.**  The returned daemon is a *new
  process* wearing the *old rank*.  Messages queued to the old incarnation, or
  late death/adoption notices still in flight, could be mis-delivered to the
  new one.  **Decided:** tag each daemon with a **boot epoch** — a
  monotonically increasing incarnation counter — and carry it in the wire
  header (``prte_oob_tcp_hdr.h``) so a hop can drop a message addressed to a
  stale incarnation of a rank.  This is safe to add: the header is not an ABI
  (see the RML ``AGENTS.md`` — it is exchanged only among daemons of one DVM,
  which all run the same PRRTE build), so there is no cross-version concern,
  only the requirement that every daemon agree, which a single build
  guarantees.  The epoch is the daemon's **boot timestamp**, captured once at
  startup (a wall-clock time at ``prte_init``); no persisted on-disk counter is
  needed.  A reboot yields a later timestamp, so the returned incarnation
  always outranks the one the DVM wrote off.  (The one degenerate case — a
  reboot fast enough, or with a reset clock, to reproduce the prior timestamp —
  is bounded by timestamp resolution; use at least millisecond granularity, and
  the HNP can reject a ``DAEMON_RETURNED`` whose epoch is not strictly greater
  than the recorded one, forcing a retry.)  The epoch is announced in the
  ``DAEMON_RETURNED`` request so the HNP propagates it in the
  ``DAEMON_REVIVED`` xcast; peers record the current epoch per rank and reject
  header-stamped traffic from an older one.  See Stage 6.
* **Revive/again-die races.**  A node that flaps (returns, dies again before
  the revival xcast completes) must converge.  Because both death and revival
  are HNP-arbitrated global xcasts over the same rank, ordering them at the HNP
  (serialize per-rank; the last event wins) keeps every daemon consistent.  The
  recv handlers must be idempotent (clearing an already-clear bit, or setting an
  already-set one, is a no-op that produces an empty delta and no notices — the
  ``status.failed_ranks.size == 0`` early return in ``repair`` already models
  this).
* **Stale OOB peer object.**  The peer object for ``rank`` on its neighbors is
  in a closed/failed state from the original loss.  Revival must reset it (or
  drop it so the next send re-synthesizes the URI via
  ``prte_ess_base_bootstrap_peer_uri`` and reconnects, as the heal path already
  does for adopted parents).

Staged implementation plan
--------------------------

The stages are independently reviewable and ordered so the tree keeps building
and behaving at each step.

**Stage 1 — Split the departure sets.**  Add ``absent_dmns`` to
``prte_rml_base`` (construct in ``prte_rml_open``, restore into ``failed_dmns``
in ``prte_rml_compute_routing_tree``).  Route bootstrap faults to it instead of
``dead_dmns``.  No behavior change yet (an absent daemon still never returns);
this only reclassifies where the mark lives.  Verify launched/elastic behavior
is untouched (nothing populates ``absent_dmns`` outside bootstrap).

**Stage 2 — Revival recompute.**  Implement
``prte_rml_revive_routing_tree(rank)`` and the ``demoted`` status flag; audit
``update_ancestors`` for growing ancestor lists and add a demotion fixup if
needed.  Unit-exercise it by directly clearing a bit and calling it on a small
synthetic tree.

**Stage 3 — Global protocol.**  Add ``PRTE_RML_TAG_DAEMON_RETURNED`` and
``PRTE_RML_TAG_DAEMON_REVIVED``, the HNP validate-and-xcast, and
``prte_rml_recv_revival_notice``.  At this point a returned daemon that already
holds current state rejoins the tree and children re-home.

**Stage 4 — Component re-drive (done).**  ``prte_rml_revive_routing_tree`` now
notifies ``grpcomm``, ``filem``, and ``relm`` of the reshape (but *not* the
death-only ``prte_rml_fault_handler``).  Two simplifications fell out of the
xcast-driven design and are worth recording:

* **No separate re-home notice is needed.**  The inverse-adoption notice was
  meant to tell promoted children that a rank returned above them.  But a
  revival is driven entirely by the single ``DAEMON_REVIVED`` xcast, so every
  daemon recomputes from the same signal and re-homes locally; there is no
  local-detection-versus-broadcast race for an adoption-style notice to close,
  unlike a fault.

* **No revival-specific handler branch is needed.**  A revival is pure
  *shrinkage* from every reshaping daemon's view (the returned rank's former
  parent swaps orphans for the rank; the orphans re-home; deeper daemons only
  gain an ancestor).  That trips only the existing ``parent_changed`` /
  ``children_changed`` paths in the ``grpcomm`` and ``relm`` handlers; the
  ``promoted``-only paths (replay-pending, op-id-at-promotion) are for the
  growth direction and correctly stay dormant.  So the tested handlers are
  reused rather than forked.

One **watch item** remains for harness validation: RELM link updates are depth
stamped and ``update_link`` drops a mismatched one, while revival changes
depths and rides the xcast forward-first.  Static analysis argues it is safe --
each daemon recomputes synchronously right after forwarding, so both ends have
settled on their new depths before any link update (a later, separate message)
is processed -- but the multi-hop update gating is subtle enough to confirm on
the Docker harness (kill an interior node, restart it, then launch a job across
the DVM and check nothing was lost).

**Stage 5 — State resync.**  Wire the returned daemon through the grow-style
state handoff so it comes back with the current nidmap and job data; reconcile
with the ``nidmap.c`` hole scan.  **Harness evidence (2026-07-04):** the topology
unheal was verified end-to-end on the Docker swarm (a radix-2 bootstrap DVM;
killing the interior daemon healed its child up to the grandparent, restarting
it drove the return/revival broadcast and the child re-homed back).  But once
the returned daemon was asked to participate in a reliable xcast, it died with
``PRTE_ERR_OUT_OF_ORDER_MSG`` (``grpcomm_direct_xcast.c``, the
``op_id != op_id_completed + 1`` check): it rejoined with a fresh xcast op-id
counter while the DVM's broadcast stream was already at a higher op-id, so the
first op forwarded to it was out of order and it force-exited.

*The op-id half is now done and verified.*  ``xcast_recv`` recognizes a late
joiner (a daemon with ``op_id_inited == 0`` handed an op above one -- grown,
returned, or simply booted after the first broadcast) and adopts the
intervening ops as complete, so ordering holds for that op and every one after.
The harness confirms the ``OUT_OF_ORDER`` exit is gone, and the elastic suite
(16/16) confirms the normal xcast path is unaffected.  This also removes a
latent grow hazard.

*The nidmap/job-data half is now done and verified.*  The "Node has gone down"
force-exit turned out not to be missing job data but a nidmap **span** bug in
the handoff itself.  When the returned daemon's connection warms up, the HNP
builds the handoff nidmap (``prte_util_nidmap_create``) while
``prte_process_info.num_daemons`` still holds the *departed* count -- the error
manager decremented it on the death and the daemon's formal relaunch report,
which restores it, has not run yet.  The departed node's ``node->daemon`` entry
persists, though, so every vpid is still packed.  Encoding ``num_daemons`` as
the span therefore declared a span one short of the highest packed vpid.  The
returned daemon decoded the short span, reset its own ``num_daemons`` to it, and
recomputed a routing tree over a rank space that excluded the top daemon --
so its live child dropped out of its subtree and traffic bound for that child
was misrouted to the parent.  (The short span also under-sized the encode-side
vpid buffer by one entry.)  The fix derives the span from the pool instead:
``max(num_daemons, highest_packed_vpid + 1)``, which covers every packed daemon
while still preserving a legitimate top-of-range shrink hole, and sizes the
buffer to match.  The decode-side hole scan additionally records a bootstrap
hole as *absent* (clearable) rather than permanently *dead*.  Harness-verified:
the returned daemon now decodes the full span, its routing stays consistent, and
a job launched after the unheal runs across the returned daemon and its child
with every daemon surviving; the elastic suite (16/16) is unaffected.

**Stage 6 — Incarnation guard (done).**  Each process captures a boot epoch -- a
millisecond wall-clock timestamp taken once at RML startup -- and stamps it into
the OOB wire header (``prte_oob_tcp_hdr_t``) as the origin's epoch: a message
built locally carries this process's epoch, and a relayed message preserves the
original sender's epoch from the received header.  Every daemon records the
highest epoch it has learned per rank; in a bootstrapped DVM it drops
daemon-namespace traffic stamped with a strictly *older* epoch for a rank (the
check runs after the whole message is read, so the byte stream stays framed, and
only for the daemon namespace, since tool namespaces reuse rank numbers).  A
newer epoch passes but does not advance the table -- the arbitrated revival does
that.  The returning daemon announces its epoch in ``DAEMON_RETURNED``; the HNP
accepts the return only if that epoch is strictly greater than the one last
recorded for the rank (rejecting a stale or degenerate same-timestamp reboot and
forcing a retry), then carries it in the ``DAEMON_REVIVED`` broadcast so every
daemon records the new incarnation and drops any lingering traffic from the old
one.  The wire header is exchanged only among daemons of one DVM, all on the same
build, so the added field is no ABI concern.  The drop path is bootstrap-gated,
so launched and elastic DVMs are unaffected; harness-verified -- the elastic
suite (16/16) and the bootstrap unheal end-to-end (revival, ``get_route``
stability, a post-unheal job across the returned daemon) both pass with the guard
in place, and no legitimate traffic is dropped.  This closes the stale-message
window that the return of a same-rank/new-process daemon opens.

Testing
-------

The Docker multi-node bootstrap harness (``contrib/dockerswarm/``) already
drives launcher-less formation.  Extend it:

#. Form a bootstrapped DVM of enough nodes to have a non-trivial interior
   (radix small enough that some daemon has both a parent and children).
#. Kill an interior node's daemon (or ``docker stop`` the node); confirm the
   heal — children promote to the grandparent — via ``rml_base_verbose``.
#. Restart the node; confirm the unheal — the ``DAEMON_RETURNED`` /
   ``DAEMON_REVIVED`` exchange, the children re-homing to the returned rank,
   and the tree matching a never-failed run.
#. Launch a job across the DVM *after* the unheal to confirm the returned
   daemon carries current state and participates in collectives.
#. Flap test: kill and restart in quick succession to exercise the
   race/idempotence handling.

Resolved decisions
------------------

#. **Incarnation identity — boot-epoch in the wire header.**  Adopt a boot-epoch
   incarnation counter carried in ``prte_oob_tcp_hdr_t`` (Stage 6) rather than
   trying to close the stale-message window with xcast ordering and peer reset
   alone.  There is no ABI cost: every daemon of a DVM runs the same PRRTE
   build, so the header can change freely as long as all daemons agree.  The
   epoch value is the daemon's **boot timestamp** (captured at ``prte_init``),
   not a persisted counter — a reboot always produces a later value, so no
   on-disk state is required.
#. **Bootstrap-only — no launched/elastic re-launch.**  Unheal stays gated on
   ``prte_bootstrap_setup``.  Extending it to a launched or elastic DVM would
   require re-launching the returned daemon into its **original** vpid (an
   existing hole), and the bulk launchers cannot do that: SLURM, PALS, and
   similar RMs assign vpids sequentially over the node set they are handed and
   offer no way to force a *particular* vpid one-at-a-time, so there is no
   portable re-launch-into-hole primitive to build on.  Only bootstrap, where a
   returned node re-derives its own rank from static configuration and re-runs
   its own daemon, provides the returned-with-original-rank precondition the
   mechanism needs.  The RML core (the ``absent_dmns`` split,
   ``prte_rml_revive_routing_tree``, the revival protocol, re-home notices, the
   boot-epoch guard) is not itself launcher-specific, so this could be revisited
   if a launcher ever gains per-vpid placement — but it is out of scope now.
#. **Trigger source — announce to the parent, escalate to the HNP.**  A
   returning daemon announces one hop up to its *parent*, which filters on its
   own ``absent_dmns`` and escalates only a genuine return to the HNP; the HNP
   remains the sole arbiter that broadcasts the revival.  This is chosen over
   the daemon announcing straight to the HNP specifically to avoid an N-to-root
   pattern: funnelling every daemon's boot-time announcement onto the root makes
   the root aggregate ``N`` messages and, at the transport, sustain the fan-in
   that burdens the OS and hurts responsiveness at scale.  Parent-filtered
   escalation keeps the root out of the common (first-boot) path entirely while
   preserving single-arbiter global consistency.
#. **Partial returns — handled by the base-rebuild reduction.**  If several
   daemons in one subtree are absent and only some return, or a returned rank is
   itself below a still-absent ancestor, no special handling is needed.  After
   ``prte_rml_revive_routing_tree`` clears the returned rank's bit, the failed
   set is exactly what ``compute_routing_tree`` would hold for the same
   still-absent ranks, and both routines derive the tree through the same
   ``build_tree_from_base`` helper -- which starts from the full-depth base
   ancestor list and drops whatever is still failed.  A revival therefore
   produces the identical tree a fresh compute would for that failed set, so
   ``update_ancestors`` walks partial-return cases correctly by construction.

Open questions
--------------

*(none currently open — remaining work is the Stage 4 RELM watch item and
Stages 5–6, all tracked in the staged plan above.)*