8.3.5. Collective Shrink Completion — Repair the Routing Tree Once per Campaign

This document plans the optimization tracked in openpmix/prrte#2492: draining a shrink campaign as a single collective event rather than one daemon at a time. It is a revision of the shrink path described in DVM Shrink-Campaign Fence Tracking; that document remains authoritative for the fence mechanism, the held-job arrays, the second hold point at LAUNCH_APPS, and the campaign type. Only the pieces this revision changes are restated here.

This is an optimization, not a correctness fix. The per-daemon completion path shipped in DVM Shrink-Campaign Fence Tracking is correct; it was deliberately kept simple and the collective scheme was deferred. Nothing here changes the externally observable contract in Elastic DVM: Specification. The same PMIX_DVM_IS_READY / PMIX_ERR_DVM_MOD events fire for the same requests; the only change is when, and how many times, the internal routing-tree repair runs.

The state machine is single-threaded on the progress thread, so no locking is required anywhere in this plan.

8.3.5.1. Background — the cost being removed

The shipped design drains a campaign one daemon at a time. Every targeted daemon exits on its own in response to PRTE_DAEMON_SHRINK_CMD (src/prted/prted_comm.c, the PRTE_DAEMON_SHRINK_CMD case), and the HNP discovers each departure independently through the errmgr comm-failure path (src/mca/errmgr/dvm/errmgr_dvm.c, proc_errors()), decrementing the campaign’s pending count and the shared fence once per death.

Each of those independent departures drives a separate routing-tree repair. A daemon that exits triggers prte_rml_route_lost() (src/rml/routed_radix.c), which calls prte_rml_repair_routing_tree(&failed_ranks, /*global=*/false) with a single rank; that in turn runs handle_promotion() and update_descendants() for that one rank. Shrinking m daemons that sit along one branch of the radix routing tree therefore triggers up to m sequential promotions/descendant rewrites. Review of PR #2472 flagged this as potentially expensive for a large single-branch shrink (unprofiled).

Crucially, prte_rml_repair_routing_tree() already accepts a rank array (pmix_data_array_t *failed_ranks) and performs a single promotion/descendant pass for the whole set. The optimization is to feed it the whole campaign at once instead of one rank at a time.
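To make the cost difference concrete, here is a minimal toy model of the two strategies. It is not the code in routed_radix.c: the names (toy_tree_t, toy_repair, toy_living_parent), the fixed ten-rank size, and the positional parent formula (r - 1) / radix are all assumptions made for illustration. Each call to toy_repair stands for one promotion/descendant pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NDAEMONS 10
#define TOY_RADIX 2

/* Toy state: which ranks are gone, each rank's effective parent, and how
 * many repair passes have run. All names here are hypothetical. */
typedef struct {
    bool failed[TOY_NDAEMONS];
    int parent[TOY_NDAEMONS]; /* -1 for the root */
    int passes;
} toy_tree_t;

/* Nearest living ancestor of rank r in a positional radix tree. */
static int toy_living_parent(const toy_tree_t *t, int r)
{
    while (r > 0) {
        r = (r - 1) / TOY_RADIX;
        if (!t->failed[r]) {
            return r;
        }
    }
    return -1;
}

static void toy_init(toy_tree_t *t)
{
    for (int r = 0; r < TOY_NDAEMONS; r++) {
        t->failed[r] = false;
    }
    t->passes = 0;
    for (int r = 0; r < TOY_NDAEMONS; r++) {
        t->parent[r] = toy_living_parent(t, r);
    }
}

/* One repair pass: mark the whole set failed, then rewrite every parent. */
static void toy_repair(toy_tree_t *t, const int *ranks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        t->failed[ranks[i]] = true;
    }
    for (int r = 0; r < TOY_NDAEMONS; r++) {
        t->parent[r] = toy_living_parent(t, r);
    }
    t->passes++;
}
```

Feeding the single-branch set {1, 3, 7} to toy_repair one rank at a time costs three passes; feeding it as one array costs one pass and yields the same final parent table. Whether the real saving is significant is exactly what the profiling question below leaves open.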

8.3.5.2. Design overview

Repair the tree once per campaign:

  1. The HNP broadcasts PRTE_DAEMON_SHRINK_CMD via the reliable xcast, exactly as today. The broadcast payload already carries the full target-rank list, so every daemon learns the complete set of departing ranks from the broadcast itself — no separate failure notice is needed to inform survivors.

  2. Each targeted daemon records that it is leaving and does its local processing, but does not exit yet (today it self-exits).

  3. The HNP hooks the broadcast’s completion. The reliable xcast in src/mca/grpcomm/direct/grpcomm_direct_xcast.c already tracks completion via ACKs flowing up the tree: when finish_op() runs on the master, every daemon in the DVM has received the op. A per-op completion callback is added (or the shrink op special-cased) to fire a handler at that point.

  4. That handler reports all of the campaign’s targets as failed in a single batch via prte_rml_repair_routing_tree(failed_ranks, /*global=*/false) — one promotion/descendant pass for the whole set — and performs the HNP-side teardown bookkeeping (num_daemons, node state, fence) for the batch.

  5. Each doomed daemon exits once its lifeline disconnects as a consequence of the rewire, rather than self-exiting — but only because processing the shrink command put it into a new leaving mode that converts lifeline loss into termination (normally lifeline loss triggers recovery, not exit). Because that mode rides in the broadcast itself and reaches each doomed daemon through its own lifeline, there is no race between learning “you are leaving” and the lifeline failing (see Step 3’s design decision). The completion event (PMIX_DVM_IS_READY) fires from this single batch point rather than from the last individual departure.

8.3.5.3. Why this is not the rejected per-daemon ACK

DVM Shrink-Campaign Fence Tracking rejected an earlier design in which each daemon sent a PRTE_PLM_SHRINK_ACK_CMD announcing its intent to leave, and the HNP decremented the campaign on receipt of that ACK. That was wrong because the ACK arrived while the daemon was still a live participant — acting on it could release held jobs into a DVM that still believed the departing daemon present — and because two decrement paths (ACK plus errmgr fallback) double-counted.

The collective scheme is not that design. The authoritative HNP-side teardown — route removal, num_daemons, node state, fence — still happens at the batch-repair callback, and held jobs are released only after it. The signal is not a daemon announcing intent; it is the xcast-completion fact that every daemon has received the shrink order, at which point the HNP itself performs the teardown. The invariant “act once teardown has occurred, not on intent” is preserved. Because completion collapses to a single event per campaign, the per-rank PMIX_RANK_INVALID idempotency stamping and the double-count analysis that the per-death path required are retired: there is now exactly one teardown event per campaign, so there is nothing to make idempotent.

8.3.5.4. Required revisions

8.3.5.4.1. Step 1 — Add an xcast completion callback (grpcomm/direct)

Today prte_grpcomm.xcast(tag, msg) is fire-and-forget (src/mca/grpcomm/grpcomm.h, prte_grpcomm_base_module_xcast_fn_t). The reliable xcast already knows when the whole DVM has received an op: on the master, finish_op() in grpcomm_direct_xcast.c runs when the last child ACK arrives, at which point the op identified by op->sig.op_id is complete for the entire subtree, which for the master is the entire DVM.

Add a mechanism to run a caller-supplied callback at that point. Two options, in preference order:

  • Per-op completion callback (preferred). Extend the xcast entry so the caller may pass a completion function and an opaque cbdata, cache them on the op_t, and invoke them from finish_op() only on the master (PRTE_PROC_IS_MASTER) — the point at which whole-DVM receipt is known. This is a general facility, useful beyond shrink.

  • Special-case the shrink op. If a full callback API is deemed too broad for this change, have finish_op() on the master recognize the shrink op and call the shrink-completion handler directly. Cheaper to write, less reusable; a follow-up would still likely generalize it.

Whichever is chosen, the callback fires on the progress thread inside finish_op(), so it may touch state-machine globals directly.

Note

finish_op() also runs on non-master daemons (it ACKs to the parent). The completion callback must fire only where PRTE_PROC_IS_MASTER is true, because only there does op completion mean every daemon received the op. A non-master’s finish_op() means only its own subtree completed.
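A minimal sketch of the preferred per-op callback follows. The types and names (sketch_op_t, sketch_xcast_cbfunc_t, sketch_finish_op, sketch_child_ack) are hypothetical stand-ins, not the actual op_t or grpcomm interface, and the is_master flag stands in for the PRTE_PROC_IS_MASTER test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion signature: a status plus the caller's cbdata. */
typedef void (*sketch_xcast_cbfunc_t)(int status, void *cbdata);

/* Hypothetical stand-in for the tracked op. */
typedef struct {
    int nexpected;                /* child ACKs still outstanding */
    sketch_xcast_cbfunc_t cbfunc; /* NULL when the caller passed none */
    void *cbdata;
} sketch_op_t;

/* Runs when the op is complete. Returns true if the callback fired. */
static bool sketch_finish_op(sketch_op_t *op, bool is_master)
{
    if (!is_master) {
        /* A non-master would ACK its parent here: only its own subtree
         * has completed, so the callback must not fire. */
        return false;
    }
    if (NULL == op->cbfunc) {
        return false;
    }
    op->cbfunc(0, op->cbdata);
    op->cbfunc = NULL; /* fire at most once per op */
    return true;
}

/* Account for one child ACK; finish the op when the last one arrives. */
static bool sketch_child_ack(sketch_op_t *op, bool is_master)
{
    if (op->nexpected > 0 && 0 == --op->nexpected) {
        return sketch_finish_op(op, is_master);
    }
    return false;
}

/* Example callback: count successful completions. */
static void sketch_count_cb(int status, void *cbdata)
{
    if (0 == status) {
        ++*(int *) cbdata;
    }
}
```

With two children outstanding on the master, the first ACK does nothing and the second fires the callback once. The same sequence on a non-master never fires it, which is the property the note above requires.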

8.3.5.4.2. Step 2 — HNP shrink-completion handler (new)

Register the Step-1 callback when the shrink xcast is issued in src/mca/ras/base/ras_base_allocate.c (the PMIX_ALLOC_RELEASE branch of prte_ras_base_complete_request() and the reservation-teardown xcast in prte_ras_base_teardown_reservation() — both send PRTE_DAEMON_SHRINK_CMD). Carry the prte_shrink_campaign_t * as the callback cbdata so the handler has the target list and requester in hand.

The handler, running once per campaign on the master, must do everything the per-death errmgr path currently does across m invocations — but for the whole batch, and exactly once:

  1. Batch routing-tree repair. Build a pmix_data_array_t from camp->targets and call prte_rml_repair_routing_tree(&failed, /*global=*/false) once. This is the single promotion/descendant pass that replaces the per-daemon repairs.

  2. Per-target HNP bookkeeping. For each target rank, apply the same teardown the comm-failure block applies today (errmgr_dvm.c lines 269-274): unset PRTE_PROC_FLAG_ALIVE, set the proc state, and decrement prte_process_info.num_daemons. This bookkeeping currently rides on the comm-failure event; when the loss is declared proactively it must be done here instead. This is the highest-risk part of the change — see Validation below.

  3. Reset the node’s launch state for re-grow. Detaching the daemon (node->daemon = NULL, node->session = NULL) is not enough to make the node re-growable: the node object persists in the pool carrying the PRTE_NODE_FLAG_DAEMON_LAUNCHED flag every plm launcher checks, and it stays in the daemon-job map. Left as-is, a later grow onto the same node is skipped (“daemon already exists”) and its prted never relaunches, and the stale map entry lets setup_vm add the node a second time. Call the shared helper prte_plm_base_reset_dvm_node() for each detached node — it clears PRTE_NODE_FLAG_DAEMON_LAUNCHED/PRTE_NODE_FLAG_LOC_VERIFIED and drops the node from the daemon-job map. This is launcher-agnostic and is a prerequisite for re-growing a previously shrunk node (see #2491); it does not by itself complete the re-grow, which additionally needs the daemon vpid space left dense enough for the positional radix routing tree.

  4. Fence and completion. Decrement prte_dvm_launch_fence by camp->pending (all at once), invoke prte_ras_base_shrink_complete(camp) to give the RAS modules their release hook, emit PMIX_DVM_IS_READY to the requester via prte_plm_base_dvm_mod_notify() when camp->have_requester, remove the campaign from prte_shrink_campaigns, and call prte_plm_base_fence_release() when prte_dvm_launch_fence reaches zero. These are the same calls the errmgr path makes; they simply move here and run once for the batch.
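The four handler steps above can be sketched against toy state. Everything here is an assumption for illustration: sketch_campaign_t and sketch_hnp_t are simplified stand-ins for prte_shrink_campaign_t and the HNP globals, the repair is modeled as a pass counter, and the node-reset step is only marked by a comment.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_RANKS 16

/* Hypothetical, simplified campaign record. */
typedef struct {
    int targets[SKETCH_MAX_RANKS];
    int ntargets;
    int pending;         /* targets this campaign still holds on the fence */
    bool have_requester;
} sketch_campaign_t;

/* Hypothetical, simplified HNP-side state. */
typedef struct {
    bool alive[SKETCH_MAX_RANKS];
    int num_daemons;
    int launch_fence;    /* shared by all concurrent campaigns */
    int repair_passes;
    int ready_events;    /* stand-in for PMIX_DVM_IS_READY emissions */
    bool jobs_released;
} sketch_hnp_t;

static void sketch_shrink_complete(sketch_hnp_t *hnp, sketch_campaign_t *camp)
{
    /* 1. One routing repair for the whole target set. */
    hnp->repair_passes++;

    /* 2. Per-target bookkeeping the comm-failure path used to do. */
    for (int i = 0; i < camp->ntargets; i++) {
        hnp->alive[camp->targets[i]] = false;
        hnp->num_daemons--;
    }

    /* 3. The node launch-state reset would go here (not modeled). */

    /* 4. Drain this campaign's share of the fence in one step, notify the
     *    requester, and release held jobs only when the fence is empty. */
    hnp->launch_fence -= camp->pending;
    camp->pending = 0;
    if (camp->have_requester) {
        hnp->ready_events++;
    }
    if (0 == hnp->launch_fence) {
        hnp->jobs_released = true;
    }
}
```

Running two concurrent campaigns through the handler shows the intended split: each campaign gets its own completion event as soon as it drains, while held jobs wait until the shared fence reaches zero.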

8.3.5.4.3. Step 3 — Daemon side: record-and-wait instead of self-exit

In src/prted/prted_comm.c, the PRTE_DAEMON_SHRINK_CMD case currently, for a daemon that finds its own rank in the target list, fires the PMIX_EVENT_JOB_END notification and immediately activates PRTE_JOB_STATE_DAEMONS_TERMINATED (self-exit). Revise so that:

  • Every daemon (target or survivor) uses the target list carried in the broadcast to repair its own routing tree locally — prte_rml_repair_routing_tree(targets, /*global=*/false) — so survivors drop the departing ranks from their children/parent sets. Here global describes the source of the failure information rather than the intended response: false marks the departures as learned locally (from the broadcast list), so each daemon repairs its own tree without re-raising a redundant global failure notice for ranks the whole DVM already knows are leaving.

  • A daemon that finds its own rank among the targets records that it is leaving by entering leaving mode (a flag set as it processes PRTE_DAEMON_SHRINK_CMD itself, not in response to any separate order; see the design decision below), fires its JOB_END notification, and then waits for its lifeline to disconnect rather than self-exiting. Entering this mode is what changes the daemon’s response to lifeline loss from recover to terminate; see the warning below.

Warning

This code path does not exist today and must be created as part of this effort. A daemon that loses its lifeline (its parent) normally attempts recovery — it promotes and reconnects to an ancestor (prte_rml_route_lost() and the promotion path in routed_radix.c); it does not exit. The collective scheme needs a doomed daemon to terminate on lifeline loss instead, and that new behavior must be gated strictly on the leaving mode set above: termination on lifeline failure may occur only when this daemon has processed the shrink command naming its own rank. A daemon that loses its lifeline for any other reason (a genuine fault) must still take the normal recover/promote path — never this exit. The complementary half — actually dropping the doomed daemon’s lifeline — comes from the survivor-side rewire (driven off the same broadcast) closing the connection to each departing child; for a single-branch shrink this cascades from the top of the branch down. Prove both halves on the testbed before removing the daemon self-exit; a bounded self-exit fallback is the safe interim so a daemon that never sees its lifeline drop still terminates.
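The gate described in this warning reduces to a single decision. The sketch below is a minimal illustration with hypothetical names (sketch_daemon_t, sketch_on_shrink_cmd, sketch_on_lifeline_lost); it models only the decision, not the recovery or termination machinery.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of a lifeline loss. */
typedef enum {
    SKETCH_RECOVER,  /* promote and reconnect to an ancestor */
    SKETCH_TERMINATE /* planned departure: exit */
} sketch_action_t;

typedef struct {
    int my_rank;
    bool leaving; /* set only by a shrink command naming my_rank */
} sketch_daemon_t;

/* Scan the broadcast's target list; enter leaving mode if we are named. */
static void sketch_on_shrink_cmd(sketch_daemon_t *d, const int *targets, int n)
{
    for (int i = 0; i < n; i++) {
        if (targets[i] == d->my_rank) {
            d->leaving = true;
            return;
        }
    }
}

/* Lifeline loss terminates only a daemon already in leaving mode. */
static sketch_action_t sketch_on_lifeline_lost(const sketch_daemon_t *d)
{
    return d->leaving ? SKETCH_TERMINATE : SKETCH_RECOVER;
}
```

A daemon that has not yet processed a shrink command naming it, or a survivor that was never named, still takes the recover path. Only a named target that has processed the command terminates.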

8.3.5.4.3.1. Design Decision — Leaving mode rides in the shrink command (race-free by construction)

A tempting but wrong shape is to make “enter leaving mode” a second, separate order the HNP sends after the shrink command. That opens a race: a doomed daemon’s lifeline could fail (because a survivor already rewired and dropped it) before the separate leaving-mode order arrived, and the daemon — not yet in leaving mode — would take the recovery path and try to promote instead of exiting. The fix is to not have a second order at all: leaving mode is set by the daemon as it processes PRTE_DAEMON_SHRINK_CMD, so the “you are leaving” fact travels in the same broadcast that will ultimately tear the lifeline down.

This is race-free by construction because the shrink broadcast propagates down the routing tree, which is the very set of lifelines:

  • A doomed daemon receives PRTE_DAEMON_SHRINK_CMD through its own lifeline (its parent forwards the xcast to it over exactly the connection whose loss will later trigger termination).

  • The reliable xcast forwards to children before processing locally for PRTE_RML_TAG_DAEMON — the tag the shrink command uses. In grpcomm_direct_xcast.c (prte_grpcomm_direct_xcast_recv()), only PRTE_RML_TAG_WIREUP and PRTE_RML_TAG_DAEMON_DIED are in the process_first set; every other tag runs forward_op() then process_msg(). So a parent hands the command to each child before it runs its own handler and rewires/drops that child.

  • TCP in-order delivery on the lifeline then guarantees the doomed daemon reads the command (and enters leaving mode) before it can observe that same connection failing. For a whole branch the cascade is inductive: each doomed daemon forwards the command down to its own children before its later death drops their lifelines, so every daemon on the branch is in leaving mode before its lifeline dies.

Two consequences for the implementation:

  1. The shrink command must stay on PRTE_RML_TAG_DAEMON (forward-first). It must not be moved into the process_first set, or a parent could drop a child before forwarding the command to it — reopening the race.

  2. The only remaining way a doomed daemon can see its lifeline fail before entering leaving mode is a genuine, independent fault that races the broadcast. That is precisely the case the leaving-mode gate handles correctly: not yet in leaving mode ⇒ recover/promote (correct — it really is a fault), and the reliable xcast re-propagates the command through the repaired tree, so the daemon still enters leaving mode and exits once the command reaches it.
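The forward-first argument can be checked with a small ordering model. The enum values below are hypothetical stand-ins for the PRTE_RML_TAG_* tags named above, and sketch_recv_order is an assumption about the shape of the receive path, not a copy of prte_grpcomm_direct_xcast_recv().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the tags discussed in the text. */
typedef enum {
    SKETCH_TAG_WIREUP,
    SKETCH_TAG_DAEMON_DIED,
    SKETCH_TAG_DAEMON
} sketch_tag_t;

typedef enum { SKETCH_STEP_FORWARD, SKETCH_STEP_PROCESS } sketch_step_t;

/* Only these two tags are handled locally before being forwarded. */
static bool sketch_is_process_first(sketch_tag_t tag)
{
    return SKETCH_TAG_WIREUP == tag || SKETCH_TAG_DAEMON_DIED == tag;
}

/* Record the two receive-side steps in the order they would run. */
static void sketch_recv_order(sketch_tag_t tag, sketch_step_t order[2])
{
    if (sketch_is_process_first(tag)) {
        order[0] = SKETCH_STEP_PROCESS;
        order[1] = SKETCH_STEP_FORWARD;
    } else {
        order[0] = SKETCH_STEP_FORWARD;
        order[1] = SKETCH_STEP_PROCESS;
    }
}

/* Processing is where a parent rewires and drops a departing child. With
 * in-order delivery on the lifeline, the child reads the command before it
 * can see the drop only if the parent forwarded first. */
static bool sketch_child_sees_cmd_first(sketch_tag_t tag)
{
    sketch_step_t order[2];
    sketch_recv_order(tag, order);
    return SKETCH_STEP_FORWARD == order[0];
}
```

For the daemon-command tag the property holds; for either process_first tag it does not, which is why the first consequence above forbids moving the shrink command into that set.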

8.3.5.4.4. Step 4 — Remove the per-death completion logic from the errmgr

Once completion is driven by Step 2, the shrink-campaign block in proc_errors() (errmgr_dvm.c lines 286-323 — the PMIX_LIST_FOREACH_SAFE over prte_shrink_campaigns with the PMIX_RANK_INVALID stamping, the per-death pending/fence decrement, prte_ras_base_shrink_complete(), and prte_plm_base_dvm_mod_notify()) is removed. The comm-failure event for a doomed daemon must then be harmless: because the daemon was already stamped gone and removed from routing in Step 2, a later comm-failure for the same rank should fall through the general daemon-loss handling without re-aborting the DVM.

Warning

This reverses the current ordering of cause and effect. Today the daemon dies first, and the comm-failure event does the teardown. In the collective scheme the HNP does the teardown first, and the daemon dies afterward. The general daemon-loss handling below the removed block (errmgr_dvm.c from line 324 on) must tolerate a comm-failure for a rank the HNP has already torn down — otherwise the doomed daemons’ eventual deaths will trip DVM abort logic. Auditing that fall-through is mandatory, not optional.

8.3.5.5. Interaction with the grow rollback path

The comm-failure block already special-cases grow targets via prte_plm_base_grow_target_failed() (errmgr_dvm.c line 283), which returns early. That check must remain ahead of any shrink handling and is unaffected: a rank cannot be simultaneously a grow target and a shrink target, and the grow path still relies on the real comm-failure event. Only the shrink branch moves to the collective callback.

The grow-failure rollback (prte_plm_base_grow_rollback) tears nodes out of the DVM for the same reason a shrink does, so it shares the same prte_plm_base_reset_dvm_node() step: a node whose grow was rolled back must also return to a pristine, never-launched state so a subsequent grow can reuse it. Relatedly, the errmgr must treat PRTE_PROC_STATE_FAILED_TO_CONNECT like the other daemon comm-failures (COMM_FAILED, HEARTBEAT_FAILED, UNABLE_TO_SEND_MSG, FAILED_TO_START) so it flows into the grow_target_failed rollback rather than the fatal “UNSUPPORTED DAEMON ERROR STATE” path — otherwise a daemon that comes up during a grow but cannot complete its connect-back takes the whole DVM down instead of failing just that grow.
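The state classification described above can be sketched as a single switch. The enum values are hypothetical stand-ins for the PRTE_PROC_STATE_* constants, and the three routes are a simplification of the errmgr's control flow, assumed here for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the daemon error states named in the text. */
typedef enum {
    SKETCH_STATE_COMM_FAILED,
    SKETCH_STATE_HEARTBEAT_FAILED,
    SKETCH_STATE_UNABLE_TO_SEND_MSG,
    SKETCH_STATE_FAILED_TO_START,
    SKETCH_STATE_FAILED_TO_CONNECT,
    SKETCH_STATE_OTHER
} sketch_state_t;

typedef enum {
    SKETCH_ROUTE_GROW_ROLLBACK,   /* fail just this grow */
    SKETCH_ROUTE_DAEMON_LOSS,     /* general daemon-loss handling */
    SKETCH_ROUTE_FATAL_UNSUPPORTED
} sketch_route_t;

static sketch_route_t sketch_route_daemon_error(sketch_state_t st,
                                                bool is_grow_target)
{
    switch (st) {
    case SKETCH_STATE_COMM_FAILED:
    case SKETCH_STATE_HEARTBEAT_FAILED:
    case SKETCH_STATE_UNABLE_TO_SEND_MSG:
    case SKETCH_STATE_FAILED_TO_START:
    case SKETCH_STATE_FAILED_TO_CONNECT: /* grouped with the others */
        return is_grow_target ? SKETCH_ROUTE_GROW_ROLLBACK
                              : SKETCH_ROUTE_DAEMON_LOSS;
    default:
        return SKETCH_ROUTE_FATAL_UNSUPPORTED;
    }
}
```

The point of the grouping is that a grow target that cannot connect back lands in the rollback route instead of the fatal default.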

8.3.5.6. Design invariants preserved

  • Teardown before release. Held jobs are released only after the HNP-side teardown (routes, num_daemons, node state, fence) has run — now in the Step-2 batch handler rather than the per-death path.

  • Per-campaign completion event. PMIX_DVM_IS_READY still fires once per request, when this campaign drains; other concurrent campaigns keeping the shared fence nonzero do not delay it. Held-job release still waits for the global fence to reach zero.

  • Clean exit and crash are indistinguishable. A daemon that crashes mid-shrink is still handled: its rank is in the batch, so it is torn down by Step 2 regardless of whether it would have exited cleanly.

  • Failure path unchanged. PMIX_ERR_DVM_MOD is still emitted only on the xcast-failure cleanup in DVM Shrink-Campaign Fence Tracking Step 1; once the shrink command is on the wire, every departure is a success for the campaign.

8.3.5.7. Implementation status

This plan has been implemented and validated on the ten-node Docker testbed. The change spans grpcomm/grpcomm.h, grpcomm/direct/ (the xcast_nb entry point plus the completion FIFO), ras/base/ras_base_allocate.c (the collective completion handler), prted/prted_comm.c and rml/routed_radix.c (the daemon leaving mode), errmgr/dvm/errmgr_dvm.c (the already-departed guard), and runtime/prte_globals.{h,c}. It builds warning-free under --enable-devel-check (-Werror plus the full picky set).

A follow-on commit adds the shared prte_plm_base_reset_dvm_node() helper (plm/base/plm_base_launch_support.c, declared in plm/base/plm_private.h), called from the shrink-completion teardown and the grow-failure rollback, plus the PRTE_PROC_STATE_FAILED_TO_CONNECT routing in errmgr/dvm/errmgr_dvm.c. Both are launcher-agnostic and build warning-free under the same picky set.

The entire machinery is gated behind prte_elastic_mode. The master only enqueues an xcast_nb completion (and pops it on the relay-to-self) when elastic mode is active; the daemon only enters leaving mode — the bounded departure timer and the lifeline-loss fast path — when prte_dvm_leaving is set, which happens exclusively on an elastic shrink; and the already-departed guard in errmgr/dvm is skipped entirely otherwise. A default (non-elastic) prterun and a persistent DVM plus prun were both re-validated after the gating went in: the launch and fault-handling paths are byte-for-byte the prior behavior when elastic mode is off.

Threading the completion callback through xcast_nb takes one piece of care. The op the master initiates is discarded once it has been relayed; the op the master actually tracks and completes is a fresh one it builds when the broadcast loops back to itself. The callback therefore rides a small FIFO of pending completions rather than the initiating op: one entry is queued per master-originated broadcast and popped, in order, as the master builds each tracked op on receipt. The entry is enqueued in begin_xcast immediately before the reliable send that emits the broadcast, and unwound if that send fails, so the FIFO tracks exactly the broadcasts that reached the wire — a dropped send cannot shift the alignment onto the wrong op. Both the enqueue and the receive-side op-id stamping run on the single progress thread with in-order delivery to self, so the ordering the FIFO relies on holds.
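The enqueue-then-send discipline can be sketched with a small ring buffer. This is an illustration under stated assumptions: sketch_fifo_t, its fixed capacity, and the send_ok flag (standing in for the result of the reliable send) are hypothetical, and the real FIFO is a list owned by the xcast-ops object rather than an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_FIFO_CAP 8

typedef void (*sketch_cbfunc_t)(void *cbdata);

/* One pending completion: the caller's callback and opaque data. */
typedef struct {
    sketch_cbfunc_t cbfunc;
    void *cbdata;
} sketch_completion_t;

typedef struct {
    sketch_completion_t slots[SKETCH_FIFO_CAP];
    size_t head;  /* index of the next entry to pop */
    size_t count; /* entries currently queued */
} sketch_fifo_t;

static bool sketch_fifo_push(sketch_fifo_t *f, sketch_completion_t c)
{
    if (SKETCH_FIFO_CAP == f->count) {
        return false;
    }
    f->slots[(f->head + f->count) % SKETCH_FIFO_CAP] = c;
    f->count++;
    return true;
}

/* Drop the most recently pushed entry (used when its send failed). */
static void sketch_fifo_unwind_last(sketch_fifo_t *f)
{
    if (f->count > 0) {
        f->count--;
    }
}

/* Pop the oldest entry as the master builds the tracked op on receipt. */
static bool sketch_fifo_pop(sketch_fifo_t *f, sketch_completion_t *out)
{
    if (0 == f->count) {
        return false;
    }
    *out = f->slots[f->head];
    f->head = (f->head + 1) % SKETCH_FIFO_CAP;
    f->count--;
    return true;
}

/* Originating side: queue first, send second, unwind if the send fails. */
static bool sketch_begin_xcast(sketch_fifo_t *f, sketch_completion_t c,
                               bool send_ok)
{
    if (!sketch_fifo_push(f, c)) {
        return false;
    }
    if (!send_ok) {
        sketch_fifo_unwind_last(f);
        return false;
    }
    return true;
}
```

If three broadcasts are started and the second send fails, the FIFO holds exactly the first and third entries, so the two loop-back receipts pop them in order and the failed send cannot shift a callback onto the wrong op.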

Two implementation choices settled the open questions the plan had flagged:

  • The completion hook is the general xcast_nb facility (open question #2’s preferred option), not a shrink special case.

  • The daemon departs on a bounded timer with a lifeline-loss fast path (the plan’s endorsed fallback); the “survivor actively closes the connection” half of open question #1 was not built — the timer covers it.

One point on scope is worth recording. This PR explicitly collapses the HNP-side repair — the cost issue #2492 actually names — into a single pass. The survivors were not separately reworked, but per the reliable xcast’s author the batching effectively falls out of the existing mechanism rather than needing new work: batching the repair at the root drives a batched repair on the rest as well. The reliable xcast’s ACK bookkeeping is designed to absorb repairs that happen mid-broadcast, so a survivor repairing from the broadcast target list does not race it. The scheme stays two-phased — a lifeline first reports adoption status plus any failures within its new subtree, then the global broadcast conveys the full list to keep every daemon’s view of the tree consistent — but the first phase already covers every failure that could require a repair, and the second is a follow-up that only informs about unrelated failures.

8.3.5.8. Validation results

The collective / whole-branch failure path is far less exercised than individual daemon deaths, so it was tested on the in-repo Docker testbed (contrib/dockerswarm/), sized to ten nodes and driven with --prtemca rml_radix 2 to force a real multi-level tree (0 → 1,2; 1 → 3,4; 2 → 5,6; 3 → 7,8) rather than the default flat fan-out.

  1. Single-branch multi-daemon shrink (subtree {3,7,8}): completed with a single PMIX_DVM_IS_READY; the HNP survived and prun still worked. With routed_base_verbose on the flat tree, the collapse is visible directly: a single repair pass marks children 4, 5, and 6 INVALID in one shot, and every one of the departing daemons’ comm-failures is absorbed by the already-departed guard (errmgr_base_verbose shows the “ignoring it” line), driving zero per-death repairs.

  2. Multi-branch shrink (ranks 4 and 6, one leaf under each of the HNP’s two children): one completion event, correct survivors, prun works.

  3. Crash during shrink (pkill -9 a target’s prted inside the departure window): the campaign still drained to a single completion event and the HNP survived — confirming clean exit and crash are indistinguishable.

  4. Fence under load (forty rapid prun launches spanning a shrink): all forty succeeded and the DVM stayed healthy, so the fence raised during a shrink does not wedge concurrent traffic.

The in-flight-job remap-onto-survivors path (a job held at the LAUNCH_APPS hold point during a shrink, then remapped) could not be exercised in this harness: a plain prun maps only onto the head node’s base pool, not the reservation the grown/shrunk nodes belong to, so a normal job is never held for a reservation-node shrink; and the elastic tool cannot connect while concurrent prun sessions litter $TMPDIR with rendezvous files (it fails with PMIX_ERR_UNREACH: “multiple possible servers”). The hold/remap machinery is inherited unchanged from DVM Shrink-Campaign Fence Tracking; this plan only moves when the fence releases, which the tests above confirm fires correctly. A reservation-targeted in-flight shrink remains to be validated.

Two bugs surfaced and were fixed during validation, both worth noting because they are easy to reintroduce:

  • The completion callback was lost across the master’s relay-to-self: the op created in xcast_nb is discarded by begin_xcast and the master rebuilds a fresh op on receipt, so the callback has to be carried in the FIFO and re-attached when the master relays its own broadcast back.

  • The FIFO was first declared with PMIX_LIST_STATIC_INIT, whose sentinel next/prev are NULL; appending to such a list corrupts memory and silently wedged normal launches. The list must be PMIX_CONSTRUCT-ed — it now lives in the xcast-ops object and is constructed in xcast_con.

See contrib/dockerswarm/README.md for the elastic-mode flag and the cleanup loop between runs.

Re-grow of a just-shrunk node (#2491) is partially addressed here. The prte_plm_base_reset_dvm_node() step above fixes the launcher-agnostic half — a shrunk node is no longer skipped (“daemon already exists”) or duplicated on a later grow, verified on the testbed — and the prerequisite that landed in #2497 also routes PRTE_PROC_STATE_FAILED_TO_CONNECT into the grow-failure rollback instead of the fatal “unsupported daemon error state” abort. The remaining half is the daemon vpid space. A shrink leaves a dead vpid in the daemon job array (the proc is marked terminated and prte_process_info.num_daemons is decremented, but it is not removed from daemons->procs and daemons->num_procs is not decremented), and a later grow always appends (proc->name.rank = daemons->num_procs), so the hole becomes permanent. The positional radix routing tree is pure vpid arithmetic over [0, num_daemons): the shrink’s prte_rml_repair_routing_tree() correctly marks the dead rank as failed, but the next grow’s prte_rml_compute_routing_tree() re-inits its failed-daemon bitmaps and wipes those marks, so it re-treats the dead rank as live and the re-grown daemon fails wireup. Reusing the vacated vpid would fix it for the ssh launcher but breaks the non-ssh launchers (slurm/pals/lsf), which launch daemons as a sequential vpid range; the launcher-agnostic fix is therefore to route the tree around the dead rank rather than reuse the vpid.

This routing-side fix is now implemented. prte_rml_base carries a persistent dead_dmns bitmap that prte_rml_repair_routing_tree() sets whenever a rank departs and that prte_rml_compute_routing_tree() — unlike the per-event failed_dmns set — never re-initializes. On every recompute the departed ranks are restored into failed_dmns and the freshly-built base tree is repaired around them (prte_rml_update_ancestors / handle_promotion / update_descendants), so radix_is_living keeps the hole out of every survivor’s ancestors, lifeline, and children. One of the two supporting bugs is fixed alongside: the grow path into setup_vm now resets map->num_new_daemons and map->daemon_vpid_start on entry rather than accumulating them across successive grows. The other — the VM-ready gate (num_reported == num_procs) after a hole — is expected to be moot with append-only vpids plus the routing fix and is left tracked in #2491.
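The difference between the per-event set and the persistent set can be shown with two boolean arrays. This is a toy model: sketch_rml_t and its functions are hypothetical, real bitmaps are replaced by arrays, and the restore_dead flag exists only to contrast the old recompute behavior with the new one.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_RANKS 16

typedef struct {
    bool failed[SKETCH_MAX_RANKS]; /* per-event view, rebuilt on recompute */
    bool dead[SKETCH_MAX_RANKS];   /* persistent: recompute never clears it */
} sketch_rml_t;

/* A repair records each departed rank in both sets. */
static void sketch_repair(sketch_rml_t *rml, const int *ranks, int n)
{
    for (int i = 0; i < n; i++) {
        rml->failed[ranks[i]] = true;
        rml->dead[ranks[i]] = true;
    }
}

/* A recompute re-initializes the per-event set. With restore_dead true
 * (the fixed behavior) the departed ranks are put back afterwards; with
 * false (the old behavior) the marks are simply lost. */
static void sketch_compute(sketch_rml_t *rml, bool restore_dead)
{
    for (int r = 0; r < SKETCH_MAX_RANKS; r++) {
        rml->failed[r] = restore_dead && rml->dead[r];
    }
}
```

After a shrink of rank 3, a recompute without the restore forgets the hole, while a recompute with it keeps rank 3 excluded and leaves every other rank untouched.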

A brand-new daemon starts with an empty dead_dmns set, so it could not learn the holes from the shrink events it never saw. That gap is closed in the nidmap. prte_util_nidmap_create now packs the true daemon vpid-space size (prte_process_info.num_daemons) in addition to the live daemon list, and prte_util_decode_nidmap sets num_daemons from it and marks every vpid in [0, num_daemons) that has no live daemon entry into dead_dmns before (re)computing the routing tree. A shrunk-out rank is exactly such a gap, so every daemon — freshly grown or long-standing — converges on the same vpid span and the same dead set as the HNP, and the tree is routed around the hole even in a deep (multi-level) tree where a new daemon’s own computed ancestor is a departed rank. On an unshrunk DVM the packed span equals the live count, nothing is marked dead, and behavior is unchanged.
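The gap-marking rule is simple enough to state in a few lines. The function below is a hypothetical sketch (sketch_mark_gaps, vpid_span, and the plain arrays are illustrative names, not the nidmap decoder): every vpid in [0, vpid_span) with no live entry is marked dead.

```c
#include <assert.h>
#include <stdbool.h>

/* Mark every vpid in [0, vpid_span) that has no live daemon entry.
 * `dead` must have room for vpid_span entries. Returns the hole count. */
static int sketch_mark_gaps(const int *live, int nlive, int vpid_span,
                            bool *dead)
{
    int ndead = 0;

    /* Start by assuming every vpid in the span is a hole. */
    for (int r = 0; r < vpid_span; r++) {
        dead[r] = true;
    }
    /* Clear the mark for each daemon that is actually listed as live. */
    for (int i = 0; i < nlive; i++) {
        if (0 <= live[i] && live[i] < vpid_span) {
            dead[live[i]] = false;
        }
    }
    for (int r = 0; r < vpid_span; r++) {
        if (dead[r]) {
            ndead++;
        }
    }
    return ndead;
}
```

For a ten-vpid span whose live list omits 3, 7, and 8, those three ranks are marked dead. When the live list covers the whole span, nothing is marked, matching the unshrunk case described above.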

8.3.5.9. Open questions

  1. Terminate-on-lifeline mode — resolved (with a fallback). Implemented: a target sets prte_dvm_leaving as it processes the shrink command naming its own rank, and departs on a bounded timer or, sooner, on the first lost connection (prte_rml_route_lost takes the early-exit path on a route loss only while prte_dvm_leaving is set, so a genuine fault on a daemon that is not leaving still recovers). A leaving daemon exits on any lost route, not just the lifeline: it can see a child connection drop before its lifeline, and treating that as a child failure would emit an adoption notice that could be misread as a real fault and propagated up the tree, so a leaving daemon never lets a disconnect be mistaken for a fault. The race is closed by construction as the design decision argues. The half that was not built is the survivor actively closing the connection to each departing child; the timer makes that unnecessary for correctness, at the cost of the doomed daemons lingering a second or two after completion. Building the active close would let the fast path fire deterministically and retire the timer.

  2. General callback vs. shrink special case — resolved. Implemented as the general xcast_nb facility, usable beyond shrink.

  3. Comm-failure fall-through — resolved. A daemon comm-failure is ignored when the daemon is already not-alive and its recorded state is at or past PRTE_PROC_STATE_TERMINATED (the state the completion handler stamps). The state test is what keeps the guard from swallowing a FAILED_TO_START daemon (never alive, but its state is still below TERMINATED at that point).

  4. Survivor-side batching — resolved by the existing mechanism. Batching the repair at the root drives a batched repair on the survivors as well, and the reliable xcast’s ACK bookkeeping is designed to absorb the mid-broadcast repair rather than race it, so no separate survivor-side work is required. See Implementation status for the two-phase description.

  5. Profiling. The original concern was unprofiled. Worth measuring the per-daemon vs. batch repair cost on a large single-branch shrink to confirm the optimization earns its complexity — especially since only the HNP side is batched so far.
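The comm-failure guard from item 3 above can be sketched as a two-part test. The state values are made up for illustration; the only property assumed from the text is their ordering, with the pre-launch state below TERMINATED. sketch_proc_t and sketch_ignore_comm_failure are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state values; only their relative order matters here. */
typedef enum {
    SKETCH_STATE_INIT = 1,
    SKETCH_STATE_RUNNING = 4,
    SKETCH_STATE_TERMINATED = 20
} sketch_state_t;

typedef struct {
    bool alive;           /* stand-in for the PRTE_PROC_FLAG_ALIVE flag */
    sketch_state_t state; /* recorded state when the failure arrives */
} sketch_proc_t;

/* Ignore a comm-failure only for a daemon the completion handler already
 * tore down: not alive AND recorded state at or past TERMINATED. */
static bool sketch_ignore_comm_failure(const sketch_proc_t *daemon)
{
    return !daemon->alive && daemon->state >= SKETCH_STATE_TERMINATED;
}
```

The state half of the test is what separates a torn-down shrink target from a daemon that never started: both are not alive, but only the former has been stamped at or past TERMINATED.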

8.3.5.10. Summary of files changed

| File | Change |
| --- | --- |
| src/mca/grpcomm/grpcomm.h | Extend the xcast module interface with an optional per-op completion callback + cbdata (Step 1, preferred option). |
| src/mca/grpcomm/direct/grpcomm_direct_xcast.c | Cache the completion callback on op_t; invoke it from finish_op() only when PRTE_PROC_IS_MASTER. |
| src/mca/ras/base/ras_base_allocate.c | Register the completion callback (carrying the prte_shrink_campaign_t) on both PRTE_DAEMON_SHRINK_CMD xcasts. Add the Step-2 handler: batch prte_rml_repair_routing_tree(), per-target HNP bookkeeping, fence decrement, prte_ras_base_shrink_complete(), dvm_mod_notify, campaign removal, prte_plm_base_fence_release(). |
| src/prted/prted_comm.c | PRTE_DAEMON_SHRINK_CMD handler: every daemon repairs its own tree from the broadcast target list; a targeted daemon records-and-waits for lifeline loss instead of self-exiting (with a bounded fallback). |
| src/mca/errmgr/dvm/errmgr_dvm.c | Remove the per-death shrink-campaign block (lines 286-323) including the PMIX_RANK_INVALID stamping. Ensure the general daemon-loss handling tolerates a comm-failure for a rank already torn down by Step 2. |
| src/rml/routed_radix.c | No change expected: prte_rml_repair_routing_tree() already accepts a rank array and does one pass. Listed for reference. |
| contrib/dockerswarm/ | Testbed grown to ten nodes to exercise single-branch multi-daemon shrink (already done alongside this plan). |