8.4.3. Rewiring a Returned Daemon: the “Unheal” Path
Status: design draft. This document proposes a mechanism, not yet
implemented, for restoring a daemon to the routing tree after it has
disappeared and later returned. It builds directly on the heal path
described in DVM Bootstrap Implementation Plan (Step 7b) and reuses the fault
machinery in src/rml.
8.4.3.1. Problem statement
In a bootstrapped DVM the compute nodes come up independently and are not
fanned out by a launcher. A node can therefore leave the DVM without the
DVM being torn down — the node is powered off, loses power, or is rebooted —
and then return when it comes back up. Its daemon boots again with the
same rank (rank is derived from the node’s fixed position in the
DVMNodes ordering, not assigned at launch).
Today the RML handles only half of this life cycle:
- Heal (works). When the daemon disappears, lost_connection/failed_to_connect drive prte_rml_route_lost → prte_rml_repair_routing_tree, which promotes the orphaned children to their grandparent and drives adoption/failure notices. The tree closes over the hole.
- Unheal (missing). When the daemon returns, nothing re-inserts it. The departure was recorded as permanent: prte_rml_repair_routing_tree sets the rank in both failed_dmns and dead_dmns (routed_radix.c), and dead_dmns is never cleared and is restored into failed_dmns on every prte_rml_compute_routing_tree. From then on radix_is_living reports the rank dead forever, so get_route, the children array, and the ancestor list never point back at it. The returned daemon becomes a zombie the tree ignores; its former children stay attached to the grandparent.
The goal of this design is to make the return a first-class event: the returned daemon rejoins its old slot in the tree, its children drop the grandparent lifeline and re-home to it, and the DVM converges on the same tree it would have computed had the daemon never left.
8.4.3.2. Scope and non-goals
- Bootstrap mode only. Unheal is gated on prte_bootstrap_setup. In a launcher-driven or elastic-shrink DVM a departed vpid is genuinely permanent (a shrink retires the vpid on purpose; #2491 depends on it), and those modes keep their current behavior unchanged.
- Same rank, same identity. We handle a node that returns as itself (same nspace + rank). We do not reuse a vpid for a different node; that invariant is preserved.
- No change to the heal path’s externally observed behavior. A daemon that leaves and never returns must behave exactly as it does today.
8.4.3.3. Design principle: revival is the inverse of repair, not a special case of it
The tree is a deterministic function of (radix, num_daemons, failed_set).
Death removes a rank from the live set and everyone recomputes; revival adds
it back and everyone recomputes. The two are symmetric at the level of the
routing math, but not at the level of the notification code:
- prte_rml_repair_routing_tree and the adoption-inference logic in rml_fault_handler.c assume depth only ever decreases, that is, a promotion. prte_rml_recv_adoption_notice explicitly treats an ancestor list that grew as an unrecoverable invariant violation (rml_fault_handler.c, the report.size > ancestors.size branch that raises PRTE_ERR_UNRECOVERABLE).
- Revival is exactly the case that grows a daemon’s ancestor list: a rank is re-inserted above the daemons in its former subtree, demoting them one level. Feeding that through the death path would trip the invariant check.
Therefore revival needs its own recompute-and-notify entry point,
prte_rml_revive_routing_tree, parallel to prte_rml_repair_routing_tree,
plus its own notice tags. It must not be bolted onto the repair path.
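To make the symmetry of the routing math concrete, the sketch below treats the tree as a pure function of the radix and the failed set. It assumes a hypothetical k-ary layout, parent(r) = (r - 1) / radix, chosen only for illustration; it is not PRRTE's actual radix layout, and living_parent and ancestor_depth are invented names, not functions in src/rml.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout for illustration only: rank 0 is the root and
 * parent(r) = (r - 1) / radix. PRRTE's real radix tree may differ. */
static int base_parent(int rank, int radix)
{
    return (rank == 0) ? -1 : (rank - 1) / radix;
}

/* failed is a bitmask: bit r set means rank r is currently not living. */
static bool is_failed(uint64_t failed, int rank)
{
    return (failed >> rank) & 1u;
}

/* Nearest living ancestor of rank, or -1 for the root.
 * The root is assumed never to fail in this sketch. */
static int living_parent(int rank, int radix, uint64_t failed)
{
    assert(rank >= 0 && rank < 64 && radix > 0);
    int p = base_parent(rank, radix);
    while (p > 0 && is_failed(failed, p)) {
        p = base_parent(p, radix);
    }
    return p;
}

/* Number of living ancestors between rank and the top of the tree. */
static int ancestor_depth(int rank, int radix, uint64_t failed)
{
    int depth = 0;
    for (int p = living_parent(rank, radix, failed); p >= 0;
         p = living_parent(p, radix, failed)) {
        depth++;
    }
    return depth;
}
```

With radix 2, setting the bit for rank 1 moves rank 3's parent from 1 to 0 and shortens its ancestor list; clearing the bit restores both. The second direction is the growth that the death-path notification code rejects.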
8.4.3.4. Separating “absent” from “dead”
The root cause of the missing behavior is that one bitmap
(dead_dmns) is asked to mean two different things. Split them:
| Set | Persistent? | Set by | Cleared by |
|---|---|---|---|
| failed_dmns | no (per recompute) | repair / revive / compute restore | compute re-init; revive |
| global_failed_dmns | no | global death xcast | revive |
| dead_dmns | yes | shrink holes | never |
| absent_dmns (new) | yes | bootstrap faults (new) | revive (new) |
- absent_dmns is a new persistent bitmap constructed once in prte_rml_open alongside dead_dmns and, like it, not re-initialized by prte_rml_compute_routing_tree. prte_rml_repair_routing_tree chooses the target set by mode: in a bootstrapped DVM a fault records the rank in absent_dmns; otherwise it records it in dead_dmns exactly as today. Both are restored into the freshly-initialized failed_dmns at the top of prte_rml_compute_routing_tree (so a grow still routes around an absent-but-not-yet-returned daemon).
- Revival clears the rank from failed_dmns, global_failed_dmns, and absent_dmns. dead_dmns is still never touched: a shrunk-out rank can never be revived, which is correct.
This keeps the #2491 fix and all launcher/elastic behavior byte-for-byte
identical (nothing outside bootstrap ever populates absent_dmns), while
giving bootstrap a departure set that can be reversed.
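The set semantics described in this section can be sketched with plain bitmasks. This is a minimal illustration, assuming at most 64 ranks; dmn_sets_t, record_fault, recompute_restore, and revive are hypothetical names, and the real code uses bitmap objects rather than a 64-bit word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the four sets; not the real data structure. */
typedef struct {
    uint64_t failed;         /* rebuilt on every recompute */
    uint64_t global_failed;  /* set by the global death xcast */
    uint64_t dead;           /* persistent, never cleared */
    uint64_t absent;         /* persistent, cleared by revive */
} dmn_sets_t;

#define RANK_BIT(r) ((uint64_t)1 << (r))

/* Fault: the mode picks which persistent set records the departure. */
static void record_fault(dmn_sets_t *s, int rank, bool bootstrap)
{
    assert(rank >= 0 && rank < 64);
    s->failed |= RANK_BIT(rank);
    if (bootstrap) {
        s->absent |= RANK_BIT(rank);
    } else {
        s->dead |= RANK_BIT(rank);
    }
}

/* Recompute: failed is re-initialized, then both persistent sets are
 * restored into it. */
static void recompute_restore(dmn_sets_t *s)
{
    s->failed = s->dead | s->absent;
}

/* Revive: refuses a dead rank; otherwise clears the reversible marks.
 * Returns true if the rank is living afterwards. */
static bool revive(dmn_sets_t *s, int rank)
{
    assert(rank >= 0 && rank < 64);
    if (s->dead & RANK_BIT(rank)) {
        return false;
    }
    s->failed &= ~RANK_BIT(rank);
    s->global_failed &= ~RANK_BIT(rank);
    s->absent &= ~RANK_BIT(rank);
    return true;
}
```

The point of the split shows up in recompute_restore: a revived bootstrap rank stays living across later recomputes because nothing persistent still names it, while a shrunk-out rank keeps coming back into the failed set.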
8.4.3.5. The trigger: a returned daemon announces itself to the HNP
When the node reboots, its daemon runs the normal bootstrap startup
(DVM Bootstrap Implementation Plan, Steps 4–7). It computes a healthy tree — its own
absent_dmns is empty, so it sees the full DVM — and connects up its
lifeline. It has no way to know, on its own, that the rest of the DVM wrote
it off while it was gone.
Rather than infer the return from a stray inbound socket (routing policy does not belong in the OOB accept path), make it explicit and route it through the arbiter of global tree state, the HNP, mirroring how death is globally xcast:
- Rejoin request (one hop up, then filtered). On bootstrap startup, when prte_bootstrap_setup is set, the daemon sends PRTE_RML_TAG_DAEMON_RETURNED{rank=self} to its parent (one hop up its lifeline), not to the HNP. The parent is exactly the daemon that knows whether the rank was absent, because the global death broadcast marked it everywhere, so the parent filters: if the rank is not in its absent_dmns (a first boot, a duplicate) it drops the notice, and only if the rank really was absent does it escalate one relayed message to the HNP. This deliberately avoids an N-to-root pattern: on the common first-boot path the root is never involved (it sees announcements only from its own few direct children, all dropped), and the notice rides the existing lifeline link so no daemon opens a socket to the root. A real return costs one escalated message to the root, O(1).
- HNP validates and broadcasts (global). The HNP checks the rank against absent_dmns. If absent, it clears the rank locally and xcasts PRTE_RML_TAG_DAEMON_REVIVED{rank} to the whole DVM, exactly as send_failures_notice xcasts PRTE_RML_TAG_DAEMON_DIED from the master (rml_fault_handler.c). If the rank is not absent (a genuine first boot, or a duplicate), the HNP drops the request; the operation is idempotent.
- Everyone converges (global). Each daemon’s prte_rml_recv_revival_notice clears the rank from its failure sets and calls prte_rml_revive_routing_tree(rank). Because the failed set is now globally consistent again, every daemon deterministically recomputes the same tree.
Centralizing at the HNP avoids the split-brain that per-subtree local revival would invite, and reuses the existing global-xcast plumbing.
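The two decision points of the protocol, the parent filter and the HNP validation, might look roughly like the sketch below. The enum and both function names are hypothetical, the absent set is modeled as a 64-bit mask, and the actual message send and xcast are left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    NOTICE_DROP,       /* first boot or duplicate: nothing to do */
    NOTICE_ESCALATE,   /* parent relays one message to the HNP */
    NOTICE_BROADCAST   /* HNP xcasts the revival to the DVM */
} notice_action_t;

/* Parent side: escalate only if this parent had marked the rank absent. */
static notice_action_t parent_on_returned(uint64_t absent, int rank)
{
    assert(rank >= 0 && rank < 64);
    return ((absent >> rank) & 1u) ? NOTICE_ESCALATE : NOTICE_DROP;
}

/* HNP side: validate, clear the rank locally, and tell the caller to
 * broadcast. A second request for the same rank is dropped. */
static notice_action_t hnp_on_returned(uint64_t *absent, int rank)
{
    assert(absent != NULL && rank >= 0 && rank < 64);
    uint64_t bit = (uint64_t)1 << rank;
    if (!(*absent & bit)) {
        return NOTICE_DROP;
    }
    *absent &= ~bit;
    return NOTICE_BROADCAST;
}
```

Because the HNP clears the bit before broadcasting, a duplicate rejoin request for the same rank falls into the drop branch, which is the idempotence the protocol relies on.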
8.4.3.6. prte_rml_revive_routing_tree — the recompute and its deltas
Symmetric to prte_rml_repair_routing_tree:
- Clear rank from failed_dmns / global_failed_dmns / absent_dmns (the recv handler does this before calling in, matching how repair sets the bit before recomputing).
- Snapshot prev_ancestors / prev_parent / prev_children into a prte_rml_recovery_status_t.
- Re-derive ancestors, promotion/demotion, and children. Most of this is done by the existing helpers, run against the updated (smaller) failed set:
  - prte_rml_update_ancestors already walks to the next living ancestor; with rank now living again it will re-appear in the lists of the daemons below it, growing their ancestor arrays. This is the case the current code deliberately rejects, so update_ancestors needs an audited pass to confirm it produces the right list when depth increases (it may need a companion to the promotion path, a “demotion” fixup analogous to handle_promotion).
  - The daemon that had adopted rank’s orphans (rank’s parent) loses them from its child list; rank regains them. update_descendants recomputes children from the live set, so the child arrays fall out correctly once the failed bit is cleared; the work is producing the delta for the notices.
- Fill in parent_changed / children_changed / a new demoted flag (mirroring promoted) and notify the components: prte_rml_fault_handler, prte_grpcomm.fault_handler, prte_filem.fault_handler, prte_relm.fault_handler.
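The snapshot, clear, recompute, and diff sequence can be sketched for a single daemon as follows. This assumes the same hypothetical k-ary layout, parent(r) = (r - 1) / radix, rather than PRRTE's real tree; revive_delta_t borrows field names from the text but is not prte_rml_recovery_status_t, and it tracks only the parent and ancestor depth, not children.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_DEPTH 64

/* Hypothetical status record for illustration. */
typedef struct {
    int  prev_parent;
    int  new_parent;
    bool parent_changed;
    bool demoted;          /* the ancestor list grew */
} revive_delta_t;

/* Living ancestors of rank, nearest first; the root always counts.
 * Illustrative layout only: parent(r) = (r - 1) / radix. */
static int list_ancestors(int rank, int radix, uint64_t failed, int *out)
{
    int n = 0;
    while (rank > 0) {
        rank = (rank - 1) / radix;
        if (rank == 0 || !((failed >> rank) & 1u)) {
            out[n++] = rank;
        }
    }
    return n;
}

/* Snapshot, clear the bit, recompute, and report what changed for self. */
static revive_delta_t revive_delta(int self, int revived, int radix,
                                   uint64_t *failed)
{
    assert(self >= 0 && self < 64 && revived > 0 && revived < 64);
    assert(radix > 0);
    int before[MAX_DEPTH], after[MAX_DEPTH];
    int nb = list_ancestors(self, radix, *failed, before);

    *failed &= ~((uint64_t)1 << revived);
    int na = list_ancestors(self, radix, *failed, after);

    revive_delta_t d;
    d.prev_parent = (nb > 0) ? before[0] : -1;
    d.new_parent = (na > 0) ? after[0] : -1;
    d.parent_changed = (d.prev_parent != d.new_parent);
    d.demoted = (na > nb);
    return d;
}
```

In a radix-2 tree where rank 1 returns, rank 3 sees both a parent change and a demotion, while rank 7 (two levels below) keeps its parent and only gains an ancestor. Calling the routine again on an already-clear bit yields an empty delta.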
8.4.3.7. Recovery-status and component impact
The existing prte_rml_recovery_status_t is close but assumes promotion.
Two additions:
- Add a bool demoted flag (a daemon gained an ancestor / its subtree shrank), the mirror of promoted. Handlers that “treat all children as new when promoted” need the analogous rule: treat the re-homing children as new when a neighbor is revived, because a child that briefly had the grandparent as parent must discard that lineage.
- The RML’s own reaction (rml_fault_handler.c) needs revival analogues of its two notices:
  - Re-home notice (down): the inverse of the adoption notice. rank’s parent tells the affected promoted children “your ancestor list has changed; rank is back above you,” so they drop the grandparent lifeline and re-open their lifeline to rank (or the closest revived ancestor in their path). The receive handler is the inverse of prte_rml_recv_adoption_notice and must accept a grown ancestor list rather than rejecting it.
  - Rejoin/rollup (up): RELM re-drives any messages that were in flight across the re-homing so nothing is lost, exactly as it re-drives across a heal.
prte_grpcomm and prte_filem already receive the recovery status on
every tree change; they must tolerate a change whose net effect is a daemon
appearing. The audit here is: does any collective/xcast accounting assume
membership only shrinks? prte_rml_get_num_contributors counts live
children, so a revived child correctly re-enters the count once the failed bit
clears — but in-progress collectives that already excluded rank need the
same “save state between local and global scope” discipline the death path
uses (rml_types.h documents this contract on the status struct).
8.4.3.8. Bringing the returned daemon up to date
The routing tree is only half the problem. While it was gone the returned daemon missed everything: jobs launched, other faults, nidmap growth from elastic grows. It boots with a stale world view. Re-inserting it into the tree without resynchronizing its state would let it route correctly but act on stale data.
This is the same problem the elastic grow path already solves — “admit a daemon into a running DVM and hand it the current state” — with one twist: the vpid is the returned daemon’s own, not a newly minted one. Reuse that machinery:
- The HNP, on processing the rejoin, drives the returned daemon through the grow-style wireup so it receives the current nidmap (with any holes) and the active job/proc data, rather than the boot-time snapshot.
- Because the returned rank is an existing hole rather than an extension of the vpid span, num_daemons does not change; only the returned rank’s dead/absent state and the tree change. The nidmap-hole bookkeeping in nidmap.c must not re-mark the returning rank as dead when it repopulates daemons->procs: clearing absent_dmns for the rank must precede, or be reconciled with, that scan.
8.4.3.9. Concurrency and correctness concerns
- Incarnation / stale-message hazard. The returned daemon is a new process wearing the old rank. Messages queued to the old incarnation, or late death/adoption notices still in flight, could be mis-delivered to the new one. Decided: tag each daemon with a boot epoch (a monotonically increasing incarnation counter) and carry it in the wire header (prte_oob_tcp_hdr.h) so a hop can drop a message addressed to a stale incarnation of a rank. This is safe to add: the header is not an ABI (see the RML AGENTS.md; it is exchanged only among daemons of one DVM, which all run the same PRRTE build), so there is no cross-version concern, only the requirement that every daemon agree, which a single build guarantees. The epoch is the daemon’s boot timestamp, captured once at startup (a wall-clock time at prte_init); no persisted on-disk counter is needed. A reboot yields a later timestamp, so the returned incarnation always outranks the one the DVM wrote off. (The one degenerate case, a reboot fast enough, or with a reset clock, to reproduce the prior timestamp, is bounded by timestamp resolution; use at least millisecond granularity, and the HNP can reject a DAEMON_RETURNED whose epoch is not strictly greater than the recorded one, forcing a retry.) The epoch is announced in the DAEMON_RETURNED request so the HNP propagates it in the DAEMON_REVIVED xcast; peers record the current epoch per rank and reject header-stamped traffic from an older one. See Stage 6.
- Revive/again-die races. A node that flaps (returns, dies again before the revival xcast completes) must converge. Because both death and revival are HNP-arbitrated global xcasts over the same rank, ordering them at the HNP (serialize per-rank; the last event wins) keeps every daemon consistent. The recv handlers must be idempotent: clearing an already-clear bit, or setting an already-set one, is a no-op that produces an empty delta and no notices (the status.failed_ranks.size == 0 early return in repair already models this).
- Stale OOB peer object. The peer object for rank on its neighbors is in a closed/failed state from the original loss. Revival must reset it (or drop it so the next send re-synthesizes the URI via prte_ess_base_bootstrap_peer_uri and reconnects, as the heal path already does for adopted parents).
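The boot-epoch rules above reduce to two comparisons, sketched here under stated assumptions: epoch_table_t, accept_return, and should_drop are hypothetical names, epochs are millisecond timestamps held in a fixed 64-entry table, and the sketch ignores namespaces and bootstrap gating.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANKS 64

/* Hypothetical per-rank table of the highest epoch learned so far. */
typedef struct {
    uint64_t epoch[MAX_RANKS];
} epoch_table_t;

/* HNP: accept a return only for a strictly newer incarnation, and record
 * it so the revival broadcast can carry the new epoch. */
static bool accept_return(epoch_table_t *t, int rank, uint64_t epoch)
{
    assert(t != NULL && rank >= 0 && rank < MAX_RANKS);
    if (epoch <= t->epoch[rank]) {
        return false;   /* stale or same-timestamp reboot: force a retry */
    }
    t->epoch[rank] = epoch;
    return true;
}

/* Any hop: drop traffic stamped with an epoch older than the recorded one.
 * A newer stamp passes but does not advance the table here. */
static bool should_drop(const epoch_table_t *t, int rank, uint64_t msg_epoch)
{
    assert(t != NULL && rank >= 0 && rank < MAX_RANKS);
    return msg_epoch < t->epoch[rank];
}
```

Keeping table updates in accept_return only means that a message with an unexpectedly new epoch cannot, by itself, cause a daemon to start rejecting the incarnation the rest of the DVM still considers current.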
8.4.3.10. Staged implementation plan
The stages are independently reviewable and ordered so the tree keeps building and behaving at each step.
Stage 1 — Split the departure sets. Add absent_dmns to
prte_rml_base (construct in prte_rml_open, restore into failed_dmns
in prte_rml_compute_routing_tree). Route bootstrap faults to it instead of
dead_dmns. No behavior change yet (an absent daemon still never returns);
this only reclassifies where the mark lives. Verify launched/elastic behavior
is untouched (nothing populates absent_dmns outside bootstrap).
Stage 2 — Revival recompute. Implement
prte_rml_revive_routing_tree(rank) and the demoted status flag; audit
update_ancestors for growing ancestor lists and add a demotion fixup if
needed. Unit-exercise it by directly clearing a bit and calling it on a small
synthetic tree.
Stage 3 — Global protocol. Add PRTE_RML_TAG_DAEMON_RETURNED and
PRTE_RML_TAG_DAEMON_REVIVED, the HNP validate-and-xcast, and
prte_rml_recv_revival_notice. At this point a returned daemon that already
holds current state rejoins the tree and children re-home.
Stage 4 — Component re-drive (done). prte_rml_revive_routing_tree now
notifies grpcomm, filem, and relm of the reshape (but not the
death-only prte_rml_fault_handler). Two simplifications fell out of the
xcast-driven design and are worth recording:
- No separate re-home notice is needed. The inverse-adoption notice was meant to tell promoted children that a rank returned above them. But a revival is driven entirely by the single DAEMON_REVIVED xcast, so every daemon recomputes from the same signal and re-homes locally; there is no local-detection-versus-broadcast race for an adoption-style notice to close, unlike a fault.
- No revival-specific handler branch is needed. A revival is pure shrinkage from every reshaping daemon’s view (the returned rank’s former parent swaps orphans for the rank; the orphans re-home; deeper daemons only gain an ancestor). That trips only the existing parent_changed/children_changed paths in the grpcomm and relm handlers; the promoted-only paths (replay-pending, op-id-at-promotion) are for the growth direction and correctly stay dormant. So the tested handlers are reused rather than forked.
One watch item remains for harness validation: RELM link updates are depth
stamped and update_link drops a mismatched one, while revival changes
depths and rides the xcast forward-first. Static analysis argues it is safe –
each daemon recomputes synchronously right after forwarding, so both ends have
settled on their new depths before any link update (a later, separate message)
is processed – but the multi-hop update gating is subtle enough to confirm on
the Docker harness (kill an interior node, restart it, then launch a job across
the DVM and check nothing was lost).
Stage 5 — State resync. Wire the returned daemon through the grow-style
state handoff so it comes back with the current nidmap and job data; reconcile
with the nidmap.c hole scan. Harness evidence (2026-07-04): the topology
unheal was verified end-to-end on the Docker swarm (a radix-2 bootstrap DVM;
killing the interior daemon healed its child up to the grandparent, restarting
it drove the return/revival broadcast and the child re-homed back). But once
the returned daemon was asked to participate in a reliable xcast, it died with
PRTE_ERR_OUT_OF_ORDER_MSG (grpcomm_direct_xcast.c, the
op_id != op_id_completed + 1 check): it rejoined with a fresh xcast op-id
counter while the DVM’s broadcast stream was already at a higher op-id, so the
first op forwarded to it was out of order and it force-exited.
The op-id half is now done and verified. xcast_recv recognizes a late
joiner (a daemon with op_id_inited == 0 handed an op above one – grown,
returned, or simply booted after the first broadcast) and adopts the
intervening ops as complete, so ordering holds for that op and every one after.
The harness confirms the OUT_OF_ORDER exit is gone, and the elastic suite
(16/16) confirms the normal xcast path is unaffected. This also removes a
latent grow hazard.
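The late-joiner rule can be illustrated with a small receiver-side check. This is a sketch, not the code in grpcomm_direct_xcast.c: xcast_state_t and xcast_accept are hypothetical, and it assumes op ids start at 1 and that an uninitialized receiver may adopt any first op it is handed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receiver state; op_id_inited and op_id_completed are the
 * names used in the text, the struct itself is illustrative. */
typedef struct {
    bool     op_id_inited;
    uint32_t op_id_completed;
} xcast_state_t;

/* Returns true if the op is in order (and marks it complete), false if it
 * would trip the op_id != op_id_completed + 1 check. */
static bool xcast_accept(xcast_state_t *s, uint32_t op_id)
{
    assert(s != NULL && op_id >= 1);
    if (!s->op_id_inited) {
        /* Late joiner: treat every op before this one as complete. */
        s->op_id_completed = op_id - 1;
        s->op_id_inited = true;
    }
    if (op_id != s->op_id_completed + 1) {
        return false;
    }
    s->op_id_completed = op_id;
    return true;
}
```

A fresh receiver handed op 5 accepts it and then enforces strict ordering from op 6 onward, so the adoption happens once and never masks a real gap later in the stream.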
The nidmap/job-data half is now done and verified. The “Node has gone down”
force-exit turned out not to be missing job data but a nidmap span bug in
the handoff itself. When the returned daemon’s connection warms up, the HNP
builds the handoff nidmap (prte_util_nidmap_create) while
prte_process_info.num_daemons still holds the departed count – the error
manager decremented it on the death and the daemon’s formal relaunch report,
which restores it, has not run yet. The departed node’s node->daemon entry
persists, though, so every vpid is still packed. Encoding num_daemons as
the span therefore declared a span one short of the highest packed vpid. The
returned daemon decoded the short span, reset its own num_daemons to it, and
recomputed a routing tree over a rank space that excluded the top daemon –
so its live child dropped out of its subtree and traffic bound for that child
was misrouted to the parent. (The short span also under-sized the encode-side
vpid buffer by one entry.) The fix derives the span from the pool instead:
max(num_daemons, highest_packed_vpid + 1), which covers every packed daemon
while still preserving a legitimate top-of-range shrink hole, and sizes the
buffer to match. The decode-side hole scan additionally records a bootstrap
hole as absent (clearable) rather than permanently dead. Harness-verified:
the returned daemon now decodes the full span, its routing stays consistent, and
a job launched after the unheal runs across the returned daemon and its child
with every daemon surviving; the elastic suite (16/16) is unaffected.
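The span fix, max(num_daemons, highest_packed_vpid + 1), is small enough to show directly. The function name and the flat array of packed vpids are assumptions made for this sketch; the real encoder walks the node pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Span to encode in the handoff nidmap: large enough for every packed
 * daemon, but never smaller than num_daemons, which preserves a
 * legitimate top-of-range shrink hole. */
static uint32_t nidmap_span(uint32_t num_daemons,
                            const uint32_t *packed_vpids, int num_packed)
{
    assert(num_packed == 0 || packed_vpids != NULL);
    uint32_t span = num_daemons;
    for (int i = 0; i < num_packed; i++) {
        if (packed_vpids[i] + 1 > span) {
            span = packed_vpids[i] + 1;
        }
    }
    return span;
}
```

With num_daemons still decremented to 3 but vpids 0 through 3 packed, the span comes out as 4 rather than 3, so the top daemon stays inside the rank space the returned daemon builds its tree over.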
Stage 6 — Incarnation guard (done). Each process captures a boot epoch – a
millisecond wall-clock timestamp taken once at RML startup – and stamps it into
the OOB wire header (prte_oob_tcp_hdr_t) as the origin’s epoch: a message
built locally carries this process’s epoch, and a relayed message preserves the
original sender’s epoch from the received header. Every daemon records the
highest epoch it has learned per rank; in a bootstrapped DVM it drops
daemon-namespace traffic stamped with a strictly older epoch for a rank (the
check runs after the whole message is read, so the byte stream stays framed, and
only for the daemon namespace, since tool namespaces reuse rank numbers). A
newer epoch passes but does not advance the table – the arbitrated revival does
that. The returning daemon announces its epoch in DAEMON_RETURNED; the HNP
accepts the return only if that epoch is strictly greater than the one last
recorded for the rank (rejecting a stale or degenerate same-timestamp reboot and
forcing a retry), then carries it in the DAEMON_REVIVED broadcast so every
daemon records the new incarnation and drops any lingering traffic from the old
one. The wire header is exchanged only among daemons of one DVM, all on the same
build, so the added field is no ABI concern. The drop path is bootstrap-gated,
so launched and elastic DVMs are unaffected; harness-verified – the elastic
suite (16/16) and the bootstrap unheal end-to-end (revival, get_route
stability, a post-unheal job across the returned daemon) both pass with the guard
in place, and no legitimate traffic is dropped. This closes the stale-message
window that the return of a same-rank/new-process daemon opens.
8.4.3.11. Testing
The Docker multi-node bootstrap harness (contrib/dockerswarm/) already
drives launcher-less formation. Extend it:
- Form a bootstrapped DVM of enough nodes to have a non-trivial interior (radix small enough that some daemon has both a parent and children).
- Kill an interior node’s daemon (or docker stop the node); confirm the heal (children promote to the grandparent) via rml_base_verbose.
- Restart the node; confirm the unheal: the DAEMON_RETURNED/DAEMON_REVIVED exchange, the children re-homing to the returned rank, and the tree matching a never-failed run.
- Launch a job across the DVM after the unheal to confirm the returned daemon carries current state and participates in collectives.
- Flap test: kill and restart in quick succession to exercise the race/idempotence handling.
8.4.3.12. Resolved decisions
- Incarnation identity: boot epoch in the wire header. Adopt a boot-epoch incarnation counter carried in prte_oob_tcp_hdr_t (Stage 6) rather than trying to close the stale-message window with xcast ordering and peer reset alone. There is no ABI cost: every daemon of a DVM runs the same PRRTE build, so the header can change freely as long as all daemons agree. The epoch value is the daemon’s boot timestamp (captured at prte_init), not a persisted counter; a reboot always produces a later value, so no on-disk state is required.
- Bootstrap-only: no launched/elastic re-launch. Unheal stays gated on prte_bootstrap_setup. Extending it to a launched or elastic DVM would require re-launching the returned daemon into its original vpid (an existing hole), and the bulk launchers cannot do that: SLURM, PALS, and similar RMs assign vpids sequentially over the node set they are handed and offer no way to force a particular vpid one-at-a-time, so there is no portable re-launch-into-hole primitive to build on. Only bootstrap, where a returned node re-derives its own rank from static configuration and re-runs its own daemon, provides the returned-with-original-rank precondition the mechanism needs. The RML core (the absent_dmns split, prte_rml_revive_routing_tree, the revival protocol, re-home notices, the boot-epoch guard) is not itself launcher-specific, so this could be revisited if a launcher ever gains per-vpid placement, but it is out of scope now.
- Trigger source: announce to the parent, escalate to the HNP. A returning daemon announces one hop up to its parent, which filters on its own absent_dmns and escalates only a genuine return to the HNP; the HNP remains the sole arbiter that broadcasts the revival. This is chosen over the daemon announcing straight to the HNP specifically to avoid an N-to-root pattern: funnelling every daemon’s boot-time announcement onto the root makes the root aggregate N messages and, at the transport, sustain the fan-in that burdens the OS and hurts responsiveness at scale. Parent-filtered escalation keeps the root out of the common (first-boot) path entirely while preserving single-arbiter global consistency.
- Partial returns: handled by the base-rebuild reduction. If several daemons in one subtree are absent and only some return, or a returned rank is itself below a still-absent ancestor, no special handling is needed. After prte_rml_revive_routing_tree clears the returned rank’s bit, the failed set is exactly what compute_routing_tree would hold for the same still-absent ranks, and both routines derive the tree through the same build_tree_from_base helper, which starts from the full-depth base ancestor list and drops whatever is still failed. A revival therefore produces the identical tree a fresh compute would for that failed set, so update_ancestors walks partial-return cases correctly by construction.
8.4.3.13. Open questions
(none currently open — remaining work is the Stage 4 RELM watch item and Stages 5–6, all tracked in the staged plan above.)