8.4.3. Rewiring a Returned Daemon: the “Unheal” Path

Status: design draft. This document proposes a mechanism, not yet implemented, for restoring a daemon to the routing tree after it has disappeared and later returned. It builds directly on the heal path described in DVM Bootstrap Implementation Plan (Step 7b) and reuses the fault machinery in src/rml.

8.4.3.1. Problem statement

In a bootstrapped DVM the compute nodes come up independently and are not fanned out by a launcher. A node can therefore leave the DVM without the DVM being torn down — the node is powered off, loses power, or is rebooted — and then return when it comes back up. Its daemon boots again with the same rank (rank is derived from the node’s fixed position in the DVMNodes ordering, not assigned at launch).

Today the RML handles only half of this life cycle:

  • Heal (works). When the daemon disappears, lost_connection / failed_to_connect drive prte_rml_route_lost → prte_rml_repair_routing_tree, which promotes the orphaned children to their grandparent and drives adoption/failure notices. The tree closes over the hole.

  • Unheal (missing). When the daemon returns, nothing re-inserts it. The departure was recorded as permanent: prte_rml_repair_routing_tree sets the rank in both failed_dmns and dead_dmns (routed_radix.c). dead_dmns is never cleared, and it is restored into failed_dmns on every call to prte_rml_compute_routing_tree. From then on radix_is_living reports the rank dead forever, so get_route, the children array, and the ancestor list never point back at it. The returned daemon becomes a zombie the tree ignores; its former children stay attached to the grandparent.

The goal of this design is to make the return a first-class event: the returned daemon rejoins its old slot in the tree, its children drop the grandparent lifeline and re-home to it, and the DVM converges on the same tree it would have computed had the daemon never left.

8.4.3.2. Scope and non-goals

  • Bootstrap mode only. Unheal is gated on prte_bootstrap_setup. In a launcher-driven or elastic-shrink DVM a departed vpid is genuinely permanent (a shrink retires the vpid on purpose; #2491 depends on it), and those modes keep their current behavior unchanged.

  • Same rank, same identity. We handle a node that returns as itself (same nspace + rank). We do not reuse a vpid for a different node; that invariant is preserved.

  • No change to the heal path’s externally observed behavior. A daemon that leaves and never returns must behave exactly as it does today.

8.4.3.3. Design principle: revival is the inverse of repair, not a special case of it

The tree is a deterministic function of (radix, num_daemons, failed_set). Death removes a rank from the live set and everyone recomputes; revival adds it back and everyone recomputes. The two are symmetric at the level of the routing math, but not at the level of the notification code:

  • prte_rml_repair_routing_tree and the adoption-inference logic in rml_fault_handler.c assume depth only ever decreases — a promotion. prte_rml_recv_adoption_notice explicitly treats an ancestor list that grew as an unrecoverable invariant violation (rml_fault_handler.c, the report.size > ancestors.size branch that raises PRTE_ERR_UNRECOVERABLE).

  • Revival is exactly the case that grows a daemon’s ancestor list: a rank is re-inserted above the daemons in its former subtree, demoting them one level. Feeding that through the death path would trip the invariant check.

Therefore revival needs its own recompute-and-notify entry point, prte_rml_revive_routing_tree, parallel to prte_rml_repair_routing_tree, plus its own notice tags. It must not be bolted onto the repair path.
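The asymmetry can be made concrete with a minimal sketch that models the tree as a pure function of (radix, failed_set). It assumes a simple k-ary heap layout in which the base parent of rank r is (r - 1) / radix and the root never fails; that layout and every name in the block (base_parent, living_parent, ancestor_depth, NUM_DAEMONS, RADIX) are illustrative assumptions, not the routed_radix.c implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_DAEMONS 7   /* hypothetical DVM size: ranks 0..6 */
#define RADIX 2         /* hypothetical radix */

/* Base parent in an assumed k-ary heap layout; rank 0 is the root (HNP). */
static int base_parent(int rank)
{
    return (rank == 0) ? -1 : (rank - 1) / RADIX;
}

/* Nearest living ancestor of rank, skipping every failed rank on the way up.
 * The root is assumed never to fail in this sketch. */
static int living_parent(int rank, const bool failed[NUM_DAEMONS])
{
    assert(rank >= 0 && rank < NUM_DAEMONS);
    int p = base_parent(rank);
    while (p > 0 && failed[p]) {
        p = base_parent(p);
    }
    return p;
}

/* Number of living ancestors between rank and the top of the tree. */
static int ancestor_depth(int rank, const bool failed[NUM_DAEMONS])
{
    int depth = 0;
    for (int p = living_parent(rank, failed); p >= 0; p = living_parent(p, failed)) {
        depth++;
    }
    return depth;
}
```

In this model, marking rank 1 failed moves rank 3's living parent from 1 to 0 and drops its depth from 2 to 1 (a promotion). Clearing the bit restores parent 1 and depth 2. That second transition is the depth increase that the adoption-notice check treats as an invariant violation.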

8.4.3.4. Separating “absent” from “dead”

The root cause of the missing behavior is that one bitmap (dead_dmns) is asked to mean two different things. Split them:

Set                 Persistent?          Set by                                         Cleared by
------------------  -------------------  ---------------------------------------------  -----------------------
failed_dmns         no (per recompute)   repair / revive / compute restore              compute re-init; revive
global_failed       no                   global death xcast                             revive
dead_dmns           yes                  shrink holes (nidmap.c); non-bootstrap faults  never
absent_dmns (new)   yes                  bootstrap faults (new)                         revive (new)

  • absent_dmns is a new persistent bitmap constructed once in prte_rml_open alongside dead_dmns and, like it, not re-initialized by prte_rml_compute_routing_tree.

  • prte_rml_repair_routing_tree chooses the target set by mode: in a bootstrapped DVM a fault records the rank in absent_dmns; otherwise it records it in dead_dmns exactly as today. Both are restored into the freshly-initialized failed_dmns at the top of prte_rml_compute_routing_tree (so a grow still routes around an absent-but-not-yet-returned daemon).

  • Revival clears the rank from failed_dmns, global_failed_dmns, and absent_dmns. dead_dmns is still never touched — a shrunk-out rank can never be revived, which is correct.

This keeps the #2491 fix and all launcher/elastic behavior byte-for-byte identical (nothing outside bootstrap ever populates absent_dmns), while giving bootstrap a departure set that can be reversed.
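The split can be sketched as follows. This is a minimal model under stated assumptions: plain bool arrays stand in for the real bitmaps, global_failed_dmns is omitted for brevity, and the type and function names (departure_sets_t, record_fault, restore_failed, revive) are hypothetical, chosen only to mirror the roles described above.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DAEMONS 64  /* hypothetical upper bound for the sketch */

/* Hypothetical stand-in for the departure sets kept in prte_rml_base. */
typedef struct {
    bool failed[MAX_DAEMONS];  /* rebuilt on every recompute */
    bool dead[MAX_DAEMONS];    /* persistent, never cleared */
    bool absent[MAX_DAEMONS];  /* persistent, cleared by revival */
} departure_sets_t;

/* Record a fault in the persistent set that matches the DVM mode. */
static void record_fault(departure_sets_t *s, int rank, bool bootstrap)
{
    assert(rank >= 0 && rank < MAX_DAEMONS);
    s->failed[rank] = true;
    if (bootstrap) {
        s->absent[rank] = true;   /* reversible departure */
    } else {
        s->dead[rank] = true;     /* permanent, as today */
    }
}

/* Top of a recompute: rebuild failed from both persistent sets. */
static void restore_failed(departure_sets_t *s)
{
    for (int r = 0; r < MAX_DAEMONS; r++) {
        s->failed[r] = s->dead[r] || s->absent[r];
    }
}

/* Revival clears only the reversible marks. Returns false for a rank that is
 * not absent (first boot, duplicate, or permanently dead). */
static bool revive(departure_sets_t *s, int rank)
{
    assert(rank >= 0 && rank < MAX_DAEMONS);
    if (!s->absent[rank]) {
        return false;
    }
    s->absent[rank] = false;
    s->failed[rank] = false;
    return true;
}
```

Because revive never consults or clears dead, a rank recorded outside bootstrap mode stays failed across every later recompute, which is the property the non-bootstrap modes rely on.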

8.4.3.5. The trigger: a returned daemon announces itself to the HNP

When the node reboots, its daemon runs the normal bootstrap startup (DVM Bootstrap Implementation Plan, Steps 4–7). It computes a healthy tree — its own absent_dmns is empty, so it sees the full DVM — and connects up its lifeline. It has no way to know, on its own, that the rest of the DVM wrote it off while it was gone.

Rather than infer the return from a stray inbound socket (routing policy does not belong in the OOB accept path), make it explicit and route it through the arbiter of global tree state, the HNP, mirroring how death is globally xcast:

  1. Rejoin request (one hop up, then filtered). On bootstrap startup, when prte_bootstrap_setup is set, the daemon sends PRTE_RML_TAG_DAEMON_RETURNED{rank=self} to its parent (one hop up its lifeline), not to the HNP. The parent is exactly the daemon that knows whether the rank was absent – the global death broadcast marked it everywhere – so the parent filters: if the rank is not in its absent_dmns (a first boot, a duplicate) it drops the notice, and only if the rank really was absent does it escalate one relayed message to the HNP. This deliberately avoids an N-to-root pattern: on the common first-boot path the root is never involved (it sees announcements only from its own few direct children, all dropped), and the notice rides the existing lifeline link so no daemon opens a socket to the root. A real return costs one escalated message to the root, O(1).

  2. HNP validates and broadcasts (global). The HNP checks the rank against absent_dmns. If absent, it clears the rank locally and xcasts PRTE_RML_TAG_DAEMON_REVIVED{rank} to the whole DVM, exactly as send_failures_notice xcasts PRTE_RML_TAG_DAEMON_DIED from the master (rml_fault_handler.c). If the rank is not absent (a genuine first boot, or a duplicate), the HNP drops the request — the operation is idempotent.

  3. Everyone converges (global). Each daemon’s prte_rml_recv_revival_notice clears the rank from its failure sets and calls prte_rml_revive_routing_tree(rank). Because the failed set is now globally consistent again, every daemon deterministically recomputes the same tree.

Centralizing at the HNP avoids the split-brain that per-subtree local revival would invite, and reuses the existing global-xcast plumbing.
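The first two steps reduce to two small decisions, sketched below. The enum, the function names, and the bool-array representation of absent_dmns are all hypothetical; message transport and the xcast itself are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DAEMONS 64  /* hypothetical upper bound for the sketch */

typedef enum {
    RETURN_DROPPED,    /* first boot or duplicate: nothing to do */
    RETURN_ESCALATED,  /* parent relays one message to the HNP */
    RETURN_REVIVED     /* HNP cleared the rank and would xcast DAEMON_REVIVED */
} return_action_t;

/* Step 1: the parent filters DAEMON_RETURNED against its own absent set. */
static return_action_t parent_filter(const bool absent[MAX_DAEMONS], int rank)
{
    assert(rank >= 0 && rank < MAX_DAEMONS);
    return absent[rank] ? RETURN_ESCALATED : RETURN_DROPPED;
}

/* Step 2: the HNP validates, clears the rank locally, and decides to xcast. */
static return_action_t hnp_validate(bool absent[MAX_DAEMONS], int rank)
{
    assert(rank >= 0 && rank < MAX_DAEMONS);
    if (!absent[rank]) {
        return RETURN_DROPPED;   /* idempotent: a repeat request is a no-op */
    }
    absent[rank] = false;
    return RETURN_REVIVED;
}
```

A first boot never gets past parent_filter, so the root sees no traffic for it. A second escalation for the same rank reaches hnp_validate after the bit is already clear and is dropped.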

8.4.3.6. prte_rml_revive_routing_tree — the recompute and its deltas

Symmetric to prte_rml_repair_routing_tree:

  1. Clear rank from failed_dmns / global_failed_dmns / absent_dmns (the recv handler does this before calling in, matching how repair sets the bit before recomputing).

  2. Snapshot prev_ancestors / prev_parent / prev_children into a prte_rml_recovery_status_t.

  3. Re-derive ancestors, promotion/demotion, and children. Most of this is the existing helpers run against the updated (smaller) failed set:

    • prte_rml_update_ancestors already walks to the next living ancestor; with rank now living again it will re-appear in the lists of the daemons below it, growing their ancestor arrays. This is the case the current code deliberately rejects, so update_ancestors needs an audited pass to confirm it produces the right list when depth increases (it may need a companion to the promotion path — a “demotion” fixup — analogous to handle_promotion).

    • The daemon that had adopted rank’s orphans (rank’s parent) loses them from its child list; rank regains them. update_descendants recomputes children from the live set, so the child arrays fall out correctly once the failed bit is cleared; the work is producing the delta for the notices.

  4. Fill in parent_changed / children_changed / a new demoted flag (mirroring promoted) and notify the components: prte_rml_fault_handler, prte_grpcomm.fault_handler, prte_filem.fault_handler, prte_relm.fault_handler.
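Steps 2 and 4 amount to a before/after comparison. The sketch below shows one way the flags could be derived from two snapshots. The structs are trimmed-down, hypothetical stand-ins (not the real prte_rml_recovery_status_t), and the comparison assumes child arrays are kept in a stable order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the recovery status flags. */
typedef struct {
    bool parent_changed;
    bool children_changed;
    bool promoted;   /* ancestor list shrank (repair) */
    bool demoted;    /* ancestor list grew (revival) */
} status_delta_t;

/* Hypothetical snapshot of one daemon's view of the tree. */
typedef struct {
    int parent;
    size_t num_ancestors;
    const int *children;   /* assumed to be in a stable (sorted) order */
    size_t num_children;
} tree_view_t;

static bool same_children(const tree_view_t *a, const tree_view_t *b)
{
    if (a->num_children != b->num_children) {
        return false;
    }
    for (size_t i = 0; i < a->num_children; i++) {
        if (a->children[i] != b->children[i]) {
            return false;
        }
    }
    return true;
}

/* Compare the snapshot taken before the recompute with the view after it. */
static status_delta_t compute_delta(const tree_view_t *prev, const tree_view_t *now)
{
    assert(prev != NULL && now != NULL);
    status_delta_t d;
    d.parent_changed = prev->parent != now->parent;
    d.children_changed = !same_children(prev, now);
    d.promoted = now->num_ancestors < prev->num_ancestors;
    d.demoted = now->num_ancestors > prev->num_ancestors;
    return d;
}
```

For a re-homing child the result is parent_changed plus demoted. For the daemon that had adopted the orphans, only children_changed is set.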

8.4.3.7. Recovery-status and component impact

The existing prte_rml_recovery_status_t is close but assumes promotion. Two additions:

  • Add a bool demoted flag (a daemon gained an ancestor / its subtree shrank), the mirror of promoted. Handlers that “treat all children as new when promoted” need the analogous rule: treat the re-homing children as new when a neighbor is revived, because a child that briefly had the grandparent as parent must discard that lineage.

  • The RML’s own reaction (rml_fault_handler.c) needs revival analogues of its two notices:

    • Re-home notice (down) — the inverse of the adoption notice. rank’s parent tells the affected promoted children “your ancestor list has changed; rank is back above you,” so they drop the grandparent lifeline and re-open their lifeline to rank (or the closest revived ancestor in their path). The receive handler is the inverse of prte_rml_recv_adoption_notice and must accept a grown ancestor list rather than rejecting it.

    • Rejoin/rollup (up) — RELM re-drives any messages that were in flight across the re-homing so nothing is lost, exactly as it re-drives across a heal.

prte_grpcomm and prte_filem already receive the recovery status on every tree change; they must tolerate a change whose net effect is a daemon appearing. The audit here is: does any collective/xcast accounting assume membership only shrinks? prte_rml_get_num_contributors counts live children, so a revived child correctly re-enters the count once the failed bit clears — but in-progress collectives that already excluded rank need the same “save state between local and global scope” discipline the death path uses (rml_types.h documents this contract on the status struct).

8.4.3.8. Bringing the returned daemon up to date

The routing tree is only half the problem. While it was gone the returned daemon missed everything: jobs launched, other faults, nidmap growth from elastic grows. It boots with a stale world view. Re-inserting it into the tree without resynchronizing its state would let it route correctly but act on stale data.

This is the same problem the elastic grow path already solves — “admit a daemon into a running DVM and hand it the current state” — with one twist: the vpid is the returned daemon’s own, not a newly minted one. Reuse that machinery:

  • The HNP, on processing the rejoin, drives the returned daemon through the grow-style wireup so it receives the current nidmap (with any holes) and the active job/proc data, rather than the boot-time snapshot.

  • Because the returned rank is an existing hole rather than an extension of the vpid span, num_daemons does not change; only the returned rank’s dead/absent state and the tree change. The nidmap-hole bookkeeping in nidmap.c must not re-mark the returning rank as dead when it repopulates daemons->procs — clearing absent_dmns for the rank must precede, or be reconciled with, that scan.

8.4.3.9. Concurrency and correctness concerns

  • Incarnation / stale-message hazard. The returned daemon is a new process wearing the old rank. Messages queued to the old incarnation, or late death/adoption notices still in flight, could be mis-delivered to the new one. Decided: tag each daemon with a boot epoch — a monotonically increasing incarnation counter — and carry it in the wire header (prte_oob_tcp_hdr.h) so a hop can drop a message addressed to a stale incarnation of a rank. This is safe to add: the header is not an ABI (see the RML AGENTS.md — it is exchanged only among daemons of one DVM, which all run the same PRRTE build), so there is no cross-version concern, only the requirement that every daemon agree, which a single build guarantees. The epoch is the daemon’s boot timestamp, captured once at startup (a wall-clock time at prte_init); no persisted on-disk counter is needed. A reboot yields a later timestamp, so the returned incarnation always outranks the one the DVM wrote off. (The one degenerate case — a reboot fast enough, or with a reset clock, to reproduce the prior timestamp — is bounded by timestamp resolution; use at least millisecond granularity, and the HNP can reject a DAEMON_RETURNED whose epoch is not strictly greater than the recorded one, forcing a retry.) The epoch is announced in the DAEMON_RETURNED request so the HNP propagates it in the DAEMON_REVIVED xcast; peers record the current epoch per rank and reject header-stamped traffic from an older one. See Stage 6.

  • Revive-then-die races. A node that flaps (returns, then dies again before the revival xcast completes) must still converge. Both death and revival are HNP-arbitrated global xcasts over the same rank, so ordering them at the HNP (serialized per rank, last event wins) keeps every daemon consistent. The recv handlers must be idempotent: clearing an already-clear bit, or setting an already-set one, is a no-op that produces an empty delta and no notices. The status.failed_ranks.size == 0 early return in repair already models this.

  • Stale OOB peer object. The peer object for rank on its neighbors is in a closed/failed state from the original loss. Revival must reset it (or drop it so the next send re-synthesizes the URI via prte_ess_base_bootstrap_peer_uri and reconnects, as the heal path already does for adopted parents).
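The idempotence requirement on the receive handlers can be illustrated with a minimal sketch. It assumes events arrive already serialized per rank by the HNP; the enum and function name are hypothetical, and a bool array stands in for the failed bitmap.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DAEMONS 64  /* hypothetical upper bound for the sketch */

typedef enum { EVENT_DIED, EVENT_REVIVED } rank_event_t;

/* Apply one HNP-ordered event for a rank. Returns true only when the bit
 * actually changed, i.e. when a recompute and notices are warranted. */
static bool apply_rank_event(bool failed[MAX_DAEMONS], int rank, rank_event_t ev)
{
    assert(rank >= 0 && rank < MAX_DAEMONS);
    bool want_failed = (ev == EVENT_DIED);
    if (failed[rank] == want_failed) {
        return false;   /* duplicate event: empty delta, no notices */
    }
    failed[rank] = want_failed;
    return true;
}
```

Under a flap (died, revived, died), each daemon applies the same ordered sequence and ends with the rank failed; any duplicate delivered along the way returns false and triggers nothing.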

8.4.3.10. Staged implementation plan

The stages are independently reviewable and ordered so the tree keeps building and behaving at each step.

Stage 1 — Split the departure sets. Add absent_dmns to prte_rml_base (construct in prte_rml_open, restore into failed_dmns in prte_rml_compute_routing_tree). Route bootstrap faults to it instead of dead_dmns. No behavior change yet (an absent daemon still never returns); this only reclassifies where the mark lives. Verify launched/elastic behavior is untouched (nothing populates absent_dmns outside bootstrap).

Stage 2 — Revival recompute. Implement prte_rml_revive_routing_tree(rank) and the demoted status flag; audit update_ancestors for growing ancestor lists and add a demotion fixup if needed. Unit-exercise it by directly clearing a bit and calling it on a small synthetic tree.

Stage 3 — Global protocol. Add PRTE_RML_TAG_DAEMON_RETURNED and PRTE_RML_TAG_DAEMON_REVIVED, the HNP validate-and-xcast, and prte_rml_recv_revival_notice. At this point a returned daemon that already holds current state rejoins the tree and children re-home.

Stage 4 — Component re-drive (done). prte_rml_revive_routing_tree now notifies grpcomm, filem, and relm of the reshape (but not the death-only prte_rml_fault_handler). Two simplifications fell out of the xcast-driven design and are worth recording:

  • No separate re-home notice is needed. The inverse-adoption notice was meant to tell promoted children that a rank returned above them. But a revival is driven entirely by the single DAEMON_REVIVED xcast, so every daemon recomputes from the same signal and re-homes locally; there is no local-detection-versus-broadcast race for an adoption-style notice to close, unlike a fault.

  • No revival-specific handler branch is needed. A revival is pure shrinkage from every reshaping daemon’s view (the returned rank’s former parent swaps orphans for the rank; the orphans re-home; deeper daemons only gain an ancestor). That trips only the existing parent_changed / children_changed paths in the grpcomm and relm handlers; the promoted-only paths (replay-pending, op-id-at-promotion) are for the growth direction and correctly stay dormant. So the tested handlers are reused rather than forked.

One watch item remains for harness validation: RELM link updates are depth stamped and update_link drops a mismatched one, while revival changes depths and rides the xcast forward-first. Static analysis argues it is safe – each daemon recomputes synchronously right after forwarding, so both ends have settled on their new depths before any link update (a later, separate message) is processed – but the multi-hop update gating is subtle enough to confirm on the Docker harness (kill an interior node, restart it, then launch a job across the DVM and check nothing was lost).

Stage 5 — State resync. Wire the returned daemon through the grow-style state handoff so it comes back with the current nidmap and job data; reconcile with the nidmap.c hole scan. Harness evidence (2026-07-04): the topology unheal was verified end-to-end on the Docker swarm (a radix-2 bootstrap DVM; killing the interior daemon healed its child up to the grandparent, restarting it drove the return/revival broadcast and the child re-homed back). But once the returned daemon was asked to participate in a reliable xcast, it died with PRTE_ERR_OUT_OF_ORDER_MSG (grpcomm_direct_xcast.c, the op_id != op_id_completed + 1 check): it rejoined with a fresh xcast op-id counter while the DVM’s broadcast stream was already at a higher op-id, so the first op forwarded to it was out of order and it force-exited.

The op-id half is now done and verified. xcast_recv recognizes a late joiner (a daemon with op_id_inited == 0 handed an op above one – grown, returned, or simply booted after the first broadcast) and adopts the intervening ops as complete, so ordering holds for that op and every one after. The harness confirms the OUT_OF_ORDER exit is gone, and the elastic suite (16/16) confirms the normal xcast path is unaffected. This also removes a latent grow hazard.

The nidmap/job-data half is now done and verified. The “Node has gone down” force-exit turned out not to be missing job data but a nidmap span bug in the handoff itself. When the returned daemon’s connection warms up, the HNP builds the handoff nidmap (prte_util_nidmap_create) while prte_process_info.num_daemons still holds the departed count – the error manager decremented it on the death and the daemon’s formal relaunch report, which restores it, has not run yet. The departed node’s node->daemon entry persists, though, so every vpid is still packed. Encoding num_daemons as the span therefore declared a span one short of the highest packed vpid. The returned daemon decoded the short span, reset its own num_daemons to it, and recomputed a routing tree over a rank space that excluded the top daemon – so its live child dropped out of its subtree and traffic bound for that child was misrouted to the parent. (The short span also under-sized the encode-side vpid buffer by one entry.) The fix derives the span from the pool instead: max(num_daemons, highest_packed_vpid + 1), which covers every packed daemon while still preserving a legitimate top-of-range shrink hole, and sizes the buffer to match. The decode-side hole scan additionally records a bootstrap hole as absent (clearable) rather than permanently dead. Harness-verified: the returned daemon now decodes the full span, its routing stays consistent, and a job launched after the unheal runs across the returned daemon and its child with every daemon surviving; the elastic suite (16/16) is unaffected.
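The span rule, max(num_daemons, highest_packed_vpid + 1), can be sketched in a few lines. The function name and the flat array of packed vpids are hypothetical simplifications of what the handoff encoder iterates over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Span that covers every packed vpid while never shrinking below num_daemons. */
static uint32_t nidmap_span(uint32_t num_daemons, const uint32_t *packed_vpids, size_t n)
{
    assert(packed_vpids != NULL || n == 0);
    uint32_t span = num_daemons;
    for (size_t i = 0; i < n; i++) {
        if (packed_vpids[i] + 1 > span) {
            span = packed_vpids[i] + 1;
        }
    }
    return span;
}
```

With num_daemons decremented to 3 but vpids 0 through 3 still packed, the span comes out as 4 rather than 3. With num_daemons at 5 and only vpids 0 through 3 packed (a top-of-range hole), the span stays 5.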

Stage 6 — Incarnation guard (done). Each process captures a boot epoch – a millisecond wall-clock timestamp taken once at RML startup – and stamps it into the OOB wire header (prte_oob_tcp_hdr_t) as the origin’s epoch: a message built locally carries this process’s epoch, and a relayed message preserves the original sender’s epoch from the received header. Every daemon records the highest epoch it has learned per rank; in a bootstrapped DVM it drops daemon-namespace traffic stamped with a strictly older epoch for a rank (the check runs after the whole message is read, so the byte stream stays framed, and only for the daemon namespace, since tool namespaces reuse rank numbers). A newer epoch passes but does not advance the table – the arbitrated revival does that. The returning daemon announces its epoch in DAEMON_RETURNED; the HNP accepts the return only if that epoch is strictly greater than the one last recorded for the rank (rejecting a stale or degenerate same-timestamp reboot and forcing a retry), then carries it in the DAEMON_REVIVED broadcast so every daemon records the new incarnation and drops any lingering traffic from the old one. The wire header is exchanged only among daemons of one DVM, all on the same build, so the added field is no ABI concern. The drop path is bootstrap-gated, so launched and elastic DVMs are unaffected; harness-verified – the elastic suite (16/16) and the bootstrap unheal end-to-end (revival, get_route stability, a post-unheal job across the returned daemon) both pass with the guard in place, and no legitimate traffic is dropped. This closes the stale-message window that the return of a same-rank/new-process daemon opens.
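The three epoch rules in this stage (drop strictly older traffic, let newer traffic pass without advancing the table, and accept a return only on a strictly greater epoch) can be sketched as below. The table type and function names are hypothetical, epochs are modelled as millisecond counts in a uint64_t, and the bootstrap and daemon-namespace gating is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_DAEMONS 64  /* hypothetical upper bound for the sketch */

/* Highest boot epoch (milliseconds) learned so far for each rank. */
typedef struct {
    uint64_t epoch[MAX_DAEMONS];
} epoch_table_t;

/* Receive path: drop only strictly older traffic. A newer epoch passes,
 * but the table is deliberately left unchanged here. */
static bool epoch_accept_message(const epoch_table_t *t, int rank, uint64_t hdr_epoch)
{
    assert(rank >= 0 && rank < MAX_DAEMONS);
    return hdr_epoch >= t->epoch[rank];
}

/* HNP path: a DAEMON_RETURNED is valid only with a strictly greater epoch. */
static bool epoch_accept_return(const epoch_table_t *t, int rank, uint64_t announced)
{
    assert(rank >= 0 && rank < MAX_DAEMONS);
    return announced > t->epoch[rank];
}

/* DAEMON_REVIVED path: the arbitrated revival is what advances the table. */
static void epoch_record_revival(epoch_table_t *t, int rank, uint64_t epoch)
{
    assert(rank >= 0 && rank < MAX_DAEMONS);
    if (epoch > t->epoch[rank]) {
        t->epoch[rank] = epoch;
    }
}
```

Keeping the table update out of the receive path means a single early message from the new incarnation cannot cause one daemon to start rejecting the old incarnation ahead of its peers.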

8.4.3.11. Testing

The Docker multi-node bootstrap harness (contrib/dockerswarm/) already drives launcher-less formation. Extend it:

  1. Form a bootstrapped DVM of enough nodes to have a non-trivial interior (radix small enough that some daemon has both a parent and children).

  2. Kill an interior node’s daemon (or docker stop the node); confirm the heal — children promote to the grandparent — via rml_base_verbose.

  3. Restart the node; confirm the unheal — the DAEMON_RETURNED / DAEMON_REVIVED exchange, the children re-homing to the returned rank, and the tree matching a never-failed run.

  4. Launch a job across the DVM after the unheal to confirm the returned daemon carries current state and participates in collectives.

  5. Flap test: kill and restart in quick succession to exercise the race/idempotence handling.

8.4.3.12. Resolved decisions

  1. Incarnation identity — boot-epoch in the wire header. Adopt a boot-epoch incarnation counter carried in prte_oob_tcp_hdr_t (Stage 6) rather than trying to close the stale-message window with xcast ordering and peer reset alone. There is no ABI cost: every daemon of a DVM runs the same PRRTE build, so the header can change freely as long as all daemons agree. The epoch value is the daemon’s boot timestamp (captured at prte_init), not a persisted counter — a reboot always produces a later value, so no on-disk state is required.

  2. Bootstrap-only — no launched/elastic re-launch. Unheal stays gated on prte_bootstrap_setup. Extending it to a launched or elastic DVM would require re-launching the returned daemon into its original vpid (an existing hole), and the bulk launchers cannot do that: SLURM, PALS, and similar RMs assign vpids sequentially over the node set they are handed and offer no way to force a particular vpid one-at-a-time, so there is no portable re-launch-into-hole primitive to build on. Only bootstrap, where a returned node re-derives its own rank from static configuration and re-runs its own daemon, provides the returned-with-original-rank precondition the mechanism needs. The RML core (the absent_dmns split, prte_rml_revive_routing_tree, the revival protocol, re-home notices, the boot-epoch guard) is not itself launcher-specific, so this could be revisited if a launcher ever gains per-vpid placement — but it is out of scope now.

  3. Trigger source — announce to the parent, escalate to the HNP. A returning daemon announces one hop up to its parent, which filters on its own absent_dmns and escalates only a genuine return to the HNP; the HNP remains the sole arbiter that broadcasts the revival. This is chosen over the daemon announcing straight to the HNP specifically to avoid an N-to-root pattern: funnelling every daemon’s boot-time announcement onto the root makes the root aggregate N messages and, at the transport, sustain the fan-in that burdens the OS and hurts responsiveness at scale. Parent-filtered escalation keeps the root out of the common (first-boot) path entirely while preserving single-arbiter global consistency.

  4. Partial returns — handled by the base-rebuild reduction. If several daemons in one subtree are absent and only some return, or a returned rank is itself below a still-absent ancestor, no special handling is needed. After prte_rml_revive_routing_tree clears the returned rank’s bit, the failed set is exactly what compute_routing_tree would hold for the same still-absent ranks, and both routines derive the tree through the same build_tree_from_base helper – which starts from the full-depth base ancestor list and drops whatever is still failed. A revival therefore produces the identical tree a fresh compute would for that failed set, so update_ancestors walks partial-return cases correctly by construction.
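The partial-return argument can be checked on a toy model. The sketch below assumes a k-ary heap layout (base parent of rank r is (r - 1) / radix, root never failed) and a hypothetical build_ancestors_from_base that starts from the full-depth base list and drops whatever is still failed; it is modelled on the described behavior of build_tree_from_base, not on its code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_DAEMONS 15  /* hypothetical DVM size: ranks 0..14 */
#define RADIX 2         /* hypothetical radix */
#define MAX_DEPTH 8     /* enough for the base depth of this toy tree */

/* Walk the base ancestor chain of rank and keep only the living entries.
 * Writes nearest-first into out and returns the number kept. */
static size_t build_ancestors_from_base(int rank, const bool failed[NUM_DAEMONS],
                                        int out[MAX_DEPTH])
{
    assert(rank >= 0 && rank < NUM_DAEMONS);
    size_t n = 0;
    while (rank > 0) {
        rank = (rank - 1) / RADIX;      /* step to the base parent */
        if (!failed[rank]) {
            assert(n < MAX_DEPTH);
            out[n++] = rank;            /* living ancestor: keep it */
        }
    }
    return n;
}
```

Rank 7 has base ancestors 3, 1, 0. With 1 and 3 both absent, its list is just 0. If 3 returns while 1 is still absent, the list becomes 3, 0, which is exactly what a fresh compute over the failed set {1} produces.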

8.4.3.13. Open questions

(none currently open. The remaining work is the Stage 4 RELM watch item, tracked in the staged plan above; Stages 5 and 6 are recorded there as done.)