.. _dvm-collective-shrink-completion-label:

Collective Shrink Completion — Repair the Routing Tree Once per Campaign
========================================================================

This document plans the **optimization** tracked in
`openpmix/prrte#2492 <https://github.com/openpmix/prrte/issues/2492>`_: draining
a shrink campaign as a single collective event rather than one daemon at a time.
It is a *revision* of the shrink path described in
:ref:`dvm-shrink-campaign-label`; that document remains authoritative for the
fence mechanism, the held-job arrays, the second hold point at ``LAUNCH_APPS``,
and the campaign type.  Only the pieces this revision changes are restated here.

This is an **optimization, not a correctness fix**.  The per-daemon completion
path shipped in :ref:`dvm-shrink-campaign-label` is correct; it was deliberately
kept simple and the collective scheme was deferred.  Nothing here changes the
externally observable contract in :ref:`elastic-dvm-spec-label` — the same
``PMIX_DVM_IS_READY`` / ``PMIX_ERR_DVM_MOD`` events fire for the same requests;
only *when and how many times* the internal routing-tree repair runs changes.

The state machine is single-threaded on the progress thread, so no locking is
required anywhere in this plan.

Background — the cost being removed
-----------------------------------

The shipped design drains a campaign **one daemon at a time**.  Every targeted
daemon exits on its own in response to ``PRTE_DAEMON_SHRINK_CMD``
(``src/prted/prted_comm.c``, the ``PRTE_DAEMON_SHRINK_CMD`` case), and the HNP
discovers each departure independently through the errmgr comm-failure path
(``src/mca/errmgr/dvm/errmgr_dvm.c``, ``proc_errors()``), decrementing the
campaign's ``pending`` count and the shared fence once per death.

Each of those independent departures drives a **separate** routing-tree repair.
A daemon that exits triggers ``prte_rml_route_lost()``
(``src/rml/routed_radix.c``), which calls
``prte_rml_repair_routing_tree(&failed_ranks, /*global=*/false)`` with a
**single** rank; that in turn runs ``handle_promotion()`` and
``update_descendants()`` for that one rank.  Shrinking ``m`` daemons that sit
along one branch of the radix routing tree therefore triggers up to ``m``
sequential promotions/descendant rewrites.  Review of PR #2472 flagged this as
potentially expensive for a large single-branch shrink, although the cost has
not been profiled.

Crucially, ``prte_rml_repair_routing_tree()`` **already accepts a rank array**
(``pmix_data_array_t *failed_ranks``) and performs a single promotion/descendant
pass for the whole set.  The optimization is to feed it the whole campaign at
once instead of one rank at a time.
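
The difference in pass count can be illustrated with a toy model.  This is only
a sketch: ``toy_repair()`` and ``repair_stats_t`` are hypothetical stand-ins
that count calls, not the PRRTE repair function, and the model assumes one
promotion/descendant pass per call regardless of how many ranks are passed.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   /* Hypothetical bookkeeping: how many passes ran, how many ranks left. */
   typedef struct {
       size_t passes;        /* promotion/descendant passes performed */
       size_t ranks_removed; /* total ranks handed to the repair */
   } repair_stats_t;

   /* Stand-in for the repair entry point: one pass per call. */
   void toy_repair(repair_stats_t *st, const int *failed, size_t nfailed)
   {
       assert(NULL != st && NULL != failed);
       st->passes += 1;
       st->ranks_removed += nfailed;
   }

   /* Shipped shape: one call, and so one pass, per departing daemon. */
   void repair_per_daemon(repair_stats_t *st, const int *targets, size_t m)
   {
       for (size_t i = 0; i < m; i++) {
           toy_repair(st, &targets[i], 1);
       }
   }

   /* Collective shape: the whole campaign in a single call. */
   void repair_batched(repair_stats_t *st, const int *targets, size_t m)
   {
       toy_repair(st, targets, m);
   }

For a three-daemon campaign the per-daemon shape records three passes and the
batched shape records one, while both remove the same three ranks.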

Design overview
---------------

Repair the tree **once per campaign**:

#. The HNP broadcasts ``PRTE_DAEMON_SHRINK_CMD`` via the reliable xcast, exactly
   as today.  The broadcast payload already carries the full target-rank list,
   so **every** daemon learns the complete set of departing ranks from the
   broadcast itself — no separate failure notice is needed to inform survivors.
#. Each targeted daemon records that it is leaving and does its local
   processing, but **does not exit yet** (today it self-exits).
#. The HNP hooks the **broadcast's completion**.  The reliable xcast in
   ``src/mca/grpcomm/direct/grpcomm_direct_xcast.c`` already tracks completion
   via ACKs flowing up the tree: when ``finish_op()`` runs on the master, every
   daemon in the DVM has received the op.  A per-op completion callback is added
   (or the shrink op special-cased) to fire a handler at that point.
#. That handler reports **all** of the campaign's targets as failed in a single
   batch via ``prte_rml_repair_routing_tree(failed_ranks, /*global=*/false)`` —
   one promotion/descendant pass for the whole set — and performs the HNP-side
   teardown bookkeeping (``num_daemons``, node state, fence) for the batch.
#. Each doomed daemon exits once its lifeline disconnects as a consequence of
   the rewire, rather than self-exiting — but only because processing the shrink
   command put it into a **new leaving mode** that converts lifeline loss into
   termination (normally lifeline loss triggers *recovery*, not exit).  Because
   that mode rides in the broadcast itself and reaches each doomed daemon through
   its own lifeline, there is no race between learning "you are leaving" and the
   lifeline failing (see Step 3's design decision).  The completion event
   (``PMIX_DVM_IS_READY``) fires from this single batch point rather than from
   the last individual departure.

Why this is **not** the rejected per-daemon ACK
-----------------------------------------------

:ref:`dvm-shrink-campaign-label` rejected an earlier design in which each daemon
sent a ``PRTE_PLM_SHRINK_ACK_CMD`` announcing its *intent* to leave, and the HNP
decremented the campaign on receipt of that ACK.  That was wrong because the ACK
arrived while the daemon was still a live participant — acting on it could
release held jobs into a DVM that still believed the departing daemon present —
and because two decrement paths (ACK plus errmgr fallback) double-counted.

The collective scheme is **not** that design.  The authoritative HNP-side
teardown — route removal, ``num_daemons``, node state, fence — still happens at
the batch-repair callback, and held jobs are released only *after* it.  The
signal is not a daemon announcing intent; it is the xcast-completion fact that
every daemon has received the shrink order, at which point the HNP itself
performs the teardown.  The invariant "act once teardown has occurred, not on
intent" is preserved.  Because completion collapses to a single event per
campaign, the per-rank ``PMIX_RANK_INVALID`` idempotency stamping and the
double-count analysis that the per-death path required are retired: there is now
exactly one teardown event per campaign, so there is nothing to make idempotent.

Required revisions
------------------

Step 1 — Add an xcast completion callback (grpcomm/direct)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Today ``prte_grpcomm.xcast(tag, msg)`` is fire-and-forget
(``src/mca/grpcomm/grpcomm.h``, ``prte_grpcomm_base_module_xcast_fn_t``).  The
reliable xcast already *knows* when the whole DVM has received an op: on the
master, ``finish_op()`` in ``grpcomm_direct_xcast.c`` runs when the last child
ACK arrives, at which point the op identified by ``op->sig.op_id`` is complete
for the master's entire subtree, which is the entire DVM.

Add a mechanism to run a caller-supplied callback at that point.  Two options,
in preference order:

* **Per-op completion callback (preferred).**  Extend the xcast entry so the
  caller may pass a completion function and an opaque ``cbdata``, cache them on
  the ``op_t``, and invoke them from ``finish_op()`` **only on the master**
  (``PRTE_PROC_IS_MASTER``) — the point at which whole-DVM receipt is known.
  This is a general facility, useful beyond shrink.
* **Special-case the shrink op.**  If a full callback API is deemed too broad
  for this change, have ``finish_op()`` on the master recognize the shrink op
  and call the shrink-completion handler directly.  Cheaper to write, less
  reusable; a follow-up would still likely generalize it.

Whichever is chosen, the callback fires on the progress thread inside
``finish_op()``, so it may touch state-machine globals directly.

.. note::

   ``finish_op()`` also runs on non-master daemons (it ACKs to the parent).  The
   completion callback must fire **only** where ``PRTE_PROC_IS_MASTER`` is true,
   because only there does op completion mean *every* daemon received the op.  A
   non-master's ``finish_op()`` means only its own subtree completed.
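
A minimal sketch of the preferred option follows.  The names ``toy_op_t``,
``toy_xcast_cbfunc_t`` and ``toy_finish_op()`` are hypothetical, and the
``is_master`` argument stands in for ``PRTE_PROC_IS_MASTER``; the real
``op_t`` and callback signature may well look different.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical completion signature. */
   typedef void (*toy_xcast_cbfunc_t)(int status, void *cbdata);

   /* Hypothetical stand-in for the tracked op. */
   typedef struct {
       int nexpected;             /* child ACKs still outstanding */
       toy_xcast_cbfunc_t cbfunc; /* NULL when no callback was requested */
       void *cbdata;              /* opaque pointer handed back to cbfunc */
   } toy_op_t;

   /* Models finish_op(): returns true when the completion callback ran. */
   bool toy_finish_op(toy_op_t *op, bool is_master)
   {
       assert(NULL != op && 0 == op->nexpected);
       if (!is_master) {
           /* Only this daemon's subtree is complete.  A real daemon would
            * ACK to its parent here; the callback must not fire. */
           return false;
       }
       if (NULL == op->cbfunc) {
           return false;
       }
       op->cbfunc(0, op->cbdata);
       return true;
   }

   /* Example callback: counts successful completions. */
   void toy_count_completion(int status, void *cbdata)
   {
       if (0 == status) {
           *(int *) cbdata += 1;
       }
   }

Calling ``toy_finish_op()`` with ``is_master`` false leaves the counter
untouched; only the master call increments it, which is the gating the note
above requires.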

Step 2 — HNP shrink-completion handler (new)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Register the Step-1 callback when the shrink xcast is issued in
``src/mca/ras/base/ras_base_allocate.c`` (the ``PMIX_ALLOC_RELEASE`` branch of
``prte_ras_base_complete_request()`` and the reservation-teardown xcast in
``prte_ras_base_teardown_reservation()`` — both send ``PRTE_DAEMON_SHRINK_CMD``).
Carry the ``prte_shrink_campaign_t *`` as the callback ``cbdata`` so the handler
has the target list and requester in hand.

The handler, running once per campaign on the master, must do everything the
per-death errmgr path currently does across ``m`` invocations — but for the
whole batch, and exactly once:

#. **Batch routing-tree repair.**  Build a ``pmix_data_array_t`` from
   ``camp->targets`` and call
   ``prte_rml_repair_routing_tree(&failed, /*global=*/false)`` **once**.  This is
   the single promotion/descendant pass that replaces the per-daemon repairs.
#. **Per-target HNP bookkeeping.**  For each target rank, apply the same
   teardown the comm-failure block applies today (``errmgr_dvm.c`` lines
   269-274): unset ``PRTE_PROC_FLAG_ALIVE``, set the proc state, and decrement
   ``prte_process_info.num_daemons``.  This bookkeeping currently *rides on the
   comm-failure event*; when the loss is declared proactively it must be done
   here instead.  **This is the highest-risk part of the change** — see
   *Validation* below.
#. **Reset the node's launch state for re-grow.**  Detaching the daemon
   (``node->daemon = NULL``, ``node->session = NULL``) is not enough to make the
   node re-growable: the node object persists in the pool carrying the
   ``PRTE_NODE_FLAG_DAEMON_LAUNCHED`` flag every plm launcher checks, and it
   stays in the daemon-job map.  Left as-is, a later grow onto the same node is
   skipped ("daemon already exists") and its prted never relaunches, and the
   stale map entry lets ``setup_vm`` add the node a second time.  Call the shared
   helper ``prte_plm_base_reset_dvm_node()`` for each detached node — it clears
   ``PRTE_NODE_FLAG_DAEMON_LAUNCHED``/``PRTE_NODE_FLAG_LOC_VERIFIED`` and drops
   the node from the daemon-job map.  This is launcher-agnostic and is a
   prerequisite for re-growing a previously shrunk node (see #2491); it does not
   by itself complete the re-grow, which additionally needs the daemon vpid space
   left dense enough for the positional radix routing tree.
#. **Fence and completion.**  Decrement ``prte_dvm_launch_fence`` by
   ``camp->pending`` (all at once), invoke
   ``prte_ras_base_shrink_complete(camp)`` to give the RAS modules their release
   hook, emit ``PMIX_DVM_IS_READY`` to the requester via
   ``prte_plm_base_dvm_mod_notify()`` when ``camp->have_requester``, remove the
   campaign from ``prte_shrink_campaigns``, and call
   ``prte_plm_base_fence_release()`` when ``prte_dvm_launch_fence`` reaches zero.
   These are the same calls the errmgr path makes; they simply move here and run
   once for the batch.
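
The fence arithmetic in the last item can be sketched as follows.  This is a
toy model under stated assumptions: ``toy_hnp_t`` and ``toy_campaign_t`` are
hypothetical, the ``fence`` field stands in for ``prte_dvm_launch_fence``, and
the counters merely record where the real handler would call
``prte_plm_base_dvm_mod_notify()`` and ``prte_plm_base_fence_release()``.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   typedef struct {
       int pending;         /* targets not yet accounted for */
       bool have_requester; /* someone is waiting for the ready event */
   } toy_campaign_t;

   typedef struct {
       int fence;           /* stands in for prte_dvm_launch_fence */
       int ready_events;    /* ready notifications emitted */
       int fence_releases;  /* times held jobs were released */
   } toy_hnp_t;

   /* Batch completion: runs once per campaign. */
   void toy_shrink_complete(toy_hnp_t *hnp, toy_campaign_t *camp)
   {
       assert(NULL != hnp && NULL != camp);
       assert(hnp->fence >= camp->pending);

       hnp->fence -= camp->pending; /* the whole batch at once */
       camp->pending = 0;

       if (camp->have_requester) {
           hnp->ready_events += 1;  /* per-campaign completion event */
       }
       if (0 == hnp->fence) {
           hnp->fence_releases += 1; /* held jobs go only at zero */
       }
   }

With two concurrent campaigns of three and two targets, completing the first
emits its ready event but releases nothing, because the shared fence is still
at two; completing the second emits the second event and releases the held
jobs.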

Step 3 — Daemon side: record-and-wait instead of self-exit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In ``src/prted/prted_comm.c``, when a daemon finds its own rank in the target
list, the ``PRTE_DAEMON_SHRINK_CMD`` case currently fires the
``PMIX_EVENT_JOB_END`` notification and immediately activates
``PRTE_JOB_STATE_DAEMONS_TERMINATED`` (self-exit).  Revise so that:

* **Every** daemon (target or survivor) uses the target list carried in the
  broadcast to repair its *own* routing tree locally —
  ``prte_rml_repair_routing_tree(targets, /*global=*/false)`` — so survivors
  drop the departing ranks from their children/parent sets.  Here ``global``
  describes the *source* of the failure information rather than the intended
  response: ``false`` marks the departures as learned locally (from the
  broadcast list), so each daemon repairs its own tree without re-raising a
  redundant global failure notice for ranks the whole DVM already knows are
  leaving.
* A daemon that finds **its own** rank among the targets records that it is
  leaving by entering **leaving mode** — a flag set as it *processes the
  ``PRTE_DAEMON_SHRINK_CMD`` itself*, not in response to any separate order (see
  the design decision below) — fires its ``JOB_END`` notification, and then
  **waits for its lifeline to disconnect** rather than self-exiting.  Entering
  this mode is what changes the daemon's response to lifeline loss from
  *recover* to *terminate*; see the warning below.

.. warning::

   **This code path does not exist today and must be created as part of this
   effort.**  A daemon that loses its lifeline (its parent) normally attempts
   **recovery** — it promotes and reconnects to an ancestor
   (``prte_rml_route_lost()`` and the promotion path in ``routed_radix.c``); it
   does **not** exit.  The collective scheme needs a doomed daemon to
   *terminate* on lifeline loss instead, and that new behavior must be gated
   strictly on the leaving mode set above: **termination on lifeline failure may
   occur only when this daemon has processed the shrink command naming its own
   rank.**  A daemon that loses its lifeline for any other reason (a genuine
   fault) must still take the normal recover/promote path — never this exit.
   The complementary half — actually *dropping* the doomed daemon's lifeline —
   comes from the survivor-side rewire (driven off the same broadcast) closing
   the connection to each departing child; for a single-branch shrink this
   cascades from the top of the branch down.  Prove both halves on the testbed
   before removing the daemon self-exit; a bounded self-exit fallback is the safe
   interim so a daemon that never sees its lifeline drop still terminates.
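
The gate described in the warning can be sketched as a small decision function.
Everything here is hypothetical (``toy_daemon_t``, ``toy_process_shrink_cmd()``,
``toy_on_lifeline_lost()``); the sketch only shows that the terminate branch is
reachable solely through processing a shrink command that names the daemon's
own rank.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   typedef enum { TOY_RECOVER, TOY_TERMINATE } toy_action_t;

   typedef struct {
       int my_rank;
       bool leaving; /* models the leaving-mode flag */
   } toy_daemon_t;

   /* Models processing the shrink command: leaving mode is entered only
    * when this daemon's own rank is in the target list.  Returns the
    * resulting leaving state. */
   bool toy_process_shrink_cmd(toy_daemon_t *d, const int *targets, size_t n)
   {
       assert(NULL != d && NULL != targets);
       for (size_t i = 0; i < n; i++) {
           if (targets[i] == d->my_rank) {
               d->leaving = true;
           }
       }
       return d->leaving;
   }

   /* Lifeline loss: terminate only in leaving mode, otherwise recover. */
   toy_action_t toy_on_lifeline_lost(const toy_daemon_t *d)
   {
       return d->leaving ? TOY_TERMINATE : TOY_RECOVER;
   }

A survivor that processes the same command never sets the flag, so a genuine
parent failure still sends it down the recovery path.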

Design Decision — Leaving mode rides in the shrink command (race-free by construction)
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

A tempting but wrong shape is to make "enter leaving mode" a *second*, separate
order the HNP sends after the shrink command.  That opens a race: a doomed
daemon's lifeline could fail (because a survivor already rewired and dropped it)
**before** the separate leaving-mode order arrived, and the daemon — not yet in
leaving mode — would take the recovery path and try to promote instead of
exiting.  The fix is to **not** have a second order at all: leaving mode is set
by the daemon as it processes ``PRTE_DAEMON_SHRINK_CMD``, so the "you are
leaving" fact travels in the *same* broadcast that will ultimately tear the
lifeline down.

This is race-free by construction because the shrink broadcast propagates **down
the routing tree, which is the very set of lifelines**:

* A doomed daemon receives ``PRTE_DAEMON_SHRINK_CMD`` *through its own lifeline*
  (its parent forwards the xcast to it over exactly the connection whose loss
  will later trigger termination).
* The reliable xcast **forwards to children before processing locally** for
  ``PRTE_RML_TAG_DAEMON`` — the tag the shrink command uses.  In
  ``grpcomm_direct_xcast.c`` (``prte_grpcomm_direct_xcast_recv()``), only
  ``PRTE_RML_TAG_WIREUP`` and ``PRTE_RML_TAG_DAEMON_DIED`` are in the
  ``process_first`` set; every other tag runs ``forward_op()`` then
  ``process_msg()``.  So a parent hands the command to each child **before** it
  runs its own handler and rewires/drops that child.
* TCP in-order delivery on the lifeline then guarantees the doomed daemon reads
  the command (and enters leaving mode) *before* it can observe that same
  connection failing.  For a whole branch the cascade is inductive: each doomed
  daemon forwards the command down to its own children before its later death
  drops their lifelines, so every daemon on the branch is in leaving mode before
  its lifeline dies.

Two consequences for the implementation:

#. The shrink command **must stay on** ``PRTE_RML_TAG_DAEMON`` (forward-first).
   It must *not* be moved into the ``process_first`` set, or a parent could drop
   a child before forwarding the command to it — reopening the race.
#. The only remaining way a doomed daemon can see its lifeline fail before
   entering leaving mode is a genuine, independent fault that races the
   broadcast.  That is precisely the case the leaving-mode gate handles
   correctly: not yet in leaving mode ⇒ recover/promote (correct — it really is
   a fault), and the reliable xcast re-propagates the command through the
   repaired tree, so the daemon still enters leaving mode and exits once the
   command reaches it.
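
The ordering argument can be made concrete with a toy dispatcher.  The tag and
event enums are hypothetical, and the sketch assumes that a tag in the
``process_first`` set is processed and then forwarded, as the name suggests;
only the forward-then-process order for every other tag is stated above.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   typedef enum {
       TOY_TAG_WIREUP,
       TOY_TAG_DAEMON_DIED,
       TOY_TAG_DAEMON       /* the tag the shrink command uses */
   } toy_tag_t;

   typedef enum { TOY_EV_FORWARD, TOY_EV_PROCESS } toy_event_t;

   /* Membership test for the process-first set. */
   bool toy_process_first(toy_tag_t tag)
   {
       return TOY_TAG_WIREUP == tag || TOY_TAG_DAEMON_DIED == tag;
   }

   /* Records the order in which a receiving daemon performs the two
    * steps for a given tag. */
   void toy_xcast_recv(toy_tag_t tag, toy_event_t order[2])
   {
       if (toy_process_first(tag)) {
           order[0] = TOY_EV_PROCESS;
           order[1] = TOY_EV_FORWARD;
       } else {
           order[0] = TOY_EV_FORWARD; /* children get the command first */
           order[1] = TOY_EV_PROCESS; /* local rewire happens afterwards */
       }
   }

For ``TOY_TAG_DAEMON`` the forward step comes first, so a child has the command
in hand before its parent's handler can drop it.  Moving the shrink command
into the process-first set would flip the order and reopen the race.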

Step 4 — Remove the per-death completion logic from the errmgr
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once completion is driven by Step 2, the shrink-campaign block in
``proc_errors()`` (``errmgr_dvm.c`` lines 286-323 — the ``PMIX_LIST_FOREACH_SAFE``
over ``prte_shrink_campaigns`` with the ``PMIX_RANK_INVALID`` stamping, the
per-death ``pending``/fence decrement, ``prte_ras_base_shrink_complete()``, and
``prte_plm_base_dvm_mod_notify()``) is removed.  The comm-failure event for a
doomed daemon must then be **harmless**: because the daemon was already stamped
gone and removed from routing in Step 2, a later comm-failure for the same rank
should fall through the general daemon-loss handling without re-aborting the DVM.

.. warning::

   This reverses the current *ordering of cause and effect*.  Today the daemon
   dies first, and the comm-failure event does the teardown.  In the collective
   scheme the HNP does the teardown first, and the daemon dies afterward.  The
   general daemon-loss handling below the removed block (``errmgr_dvm.c`` from
   line 324 on) must tolerate a comm-failure for a rank the HNP has *already*
   torn down — otherwise the doomed daemons' eventual deaths will trip DVM abort
   logic.  Auditing that fall-through is mandatory, not optional.

Interaction with the grow rollback path
----------------------------------------

The comm-failure block already special-cases grow targets via
``prte_plm_base_grow_target_failed()`` (``errmgr_dvm.c`` line 283), which returns
early.  That check must remain **ahead of** any shrink handling and is
unaffected: a rank cannot be simultaneously a grow target and a shrink target,
and the grow path still relies on the real comm-failure event.  Only the shrink
branch moves to the collective callback.

The grow-failure rollback (``prte_plm_base_grow_rollback``) tears nodes out of
the DVM for the same reason a shrink does, so it shares the same
``prte_plm_base_reset_dvm_node()`` step: a node whose grow was rolled back must
also return to a pristine, never-launched state so a subsequent grow can reuse
it.  Relatedly, the errmgr must treat ``PRTE_PROC_STATE_FAILED_TO_CONNECT`` like
the other daemon comm-failures (``COMM_FAILED``, ``HEARTBEAT_FAILED``,
``UNABLE_TO_SEND_MSG``, ``FAILED_TO_START``) so it flows into the
``grow_target_failed`` rollback rather than the fatal
"UNSUPPORTED DAEMON ERROR STATE" path — otherwise a daemon that comes up during
a grow but cannot complete its connect-back takes the whole DVM down instead of
failing just that grow.
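
The routing of daemon error states described above can be sketched as a single
classification function.  The enums and ``toy_classify()`` are hypothetical;
they mirror only the state names listed in this section, not the actual
``proc_errors()`` control flow.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>

   typedef enum {
       TOY_COMM_FAILED,
       TOY_HEARTBEAT_FAILED,
       TOY_UNABLE_TO_SEND_MSG,
       TOY_FAILED_TO_START,
       TOY_FAILED_TO_CONNECT,
       TOY_SOME_OTHER_STATE
   } toy_daemon_err_t;

   typedef enum {
       TOY_GROW_ROLLBACK,     /* fail just that grow */
       TOY_DAEMON_LOSS,       /* general daemon-loss handling */
       TOY_FATAL_UNSUPPORTED  /* unsupported daemon error state */
   } toy_disposition_t;

   /* FAILED_TO_CONNECT is grouped with the other comm-failures so a grow
    * target that cannot connect back reaches the rollback. */
   toy_disposition_t toy_classify(toy_daemon_err_t err, bool is_grow_target)
   {
       switch (err) {
       case TOY_COMM_FAILED:
       case TOY_HEARTBEAT_FAILED:
       case TOY_UNABLE_TO_SEND_MSG:
       case TOY_FAILED_TO_START:
       case TOY_FAILED_TO_CONNECT:
           return is_grow_target ? TOY_GROW_ROLLBACK : TOY_DAEMON_LOSS;
       default:
           return TOY_FATAL_UNSUPPORTED;
       }
   }

The grow-target check sits ahead of everything else in the comm-failure branch,
matching the requirement that it remain ahead of any shrink handling.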

Design invariants preserved
----------------------------

* **Teardown before release.**  Held jobs are released only after the HNP-side
  teardown (routes, ``num_daemons``, node state, fence) has run — now in the
  Step-2 batch handler rather than the per-death path.
* **Per-campaign completion event.**  ``PMIX_DVM_IS_READY`` still fires once per
  request, when *this* campaign drains; other concurrent campaigns keeping the
  shared fence nonzero do not delay it.  Held-job release still waits for the
  global fence to reach zero.
* **Clean exit and crash are indistinguishable.**  A daemon that crashes
  mid-shrink is still handled: its rank is in the batch, so it is torn down by
  Step 2 regardless of whether it would have exited cleanly.
* **Failure path unchanged.**  ``PMIX_ERR_DVM_MOD`` is still emitted only on the
  xcast-failure cleanup in :ref:`dvm-shrink-campaign-label` Step 1; once the
  shrink command is on the wire, every departure is a success for the campaign.

Implementation status
---------------------

This plan has been implemented and validated on the ten-node Docker testbed.
The change spans ``grpcomm/grpcomm.h``, ``grpcomm/direct/`` (the ``xcast_nb``
entry point plus the completion FIFO), ``ras/base/ras_base_allocate.c`` (the
collective completion handler), ``prted/prted_comm.c`` and ``rml/routed_radix.c``
(the daemon leaving mode), ``errmgr/dvm/errmgr_dvm.c`` (the already-departed
guard), and ``runtime/prte_globals.{h,c}``.  It builds warning-free under
``--enable-devel-check`` (``-Werror`` plus the full picky set).

A follow-on commit adds the shared ``prte_plm_base_reset_dvm_node()`` helper
(``plm/base/plm_base_launch_support.c``, declared in ``plm/base/plm_private.h``),
called from the shrink-completion teardown and the grow-failure rollback, plus
the ``PRTE_PROC_STATE_FAILED_TO_CONNECT`` routing in ``errmgr/dvm/errmgr_dvm.c``.
Both are launcher-agnostic and build warning-free under the same picky set.

The entire machinery is gated behind ``prte_elastic_mode``.  The master only
enqueues an ``xcast_nb`` completion (and pops it on the relay-to-self) when
elastic mode is active; the daemon only enters leaving mode — the bounded
departure timer and the lifeline-loss fast path — when ``prte_dvm_leaving`` is
set, which happens exclusively on an elastic shrink; and the already-departed
guard in ``errmgr/dvm`` is skipped entirely otherwise.  A default (non-elastic)
``prterun`` and a persistent DVM plus ``prun`` were both re-validated after the
gating went in: the launch and fault-handling paths are byte-for-byte the prior
behavior when elastic mode is off.

Threading the completion callback through ``xcast_nb`` takes one piece of care.
The op the master initiates is discarded once it has been relayed; the op the
master actually tracks and completes is a fresh one it builds when the broadcast
loops back to itself.  The callback therefore rides a small FIFO of pending
completions rather than the initiating op: one entry is queued per
master-originated broadcast and popped, in order, as the master builds each
tracked op on receipt.  The entry is enqueued in ``begin_xcast`` immediately
before the reliable send that emits the broadcast, and unwound if that send
fails, so the FIFO tracks exactly the broadcasts that reached the wire — a
dropped send cannot shift the alignment onto the wrong op.  Both the enqueue and
the receive-side op-id stamping run on the single progress thread with in-order
delivery to self, so the ordering the FIFO relies on holds.
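
The enqueue, unwind and pop discipline can be sketched with a small ring
buffer.  This is an illustration only: the implementation uses a PMIx list in
the xcast-ops object, whereas ``toy_fifo_t`` and the ``toy_`` functions here
are hypothetical, with a ``send_ok`` flag standing in for the outcome of the
reliable send.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   #define TOY_FIFO_CAP 8

   typedef struct {
       void *cbdata[TOY_FIFO_CAP];
       size_t head;  /* index of the oldest entry */
       size_t count; /* number of queued entries */
   } toy_fifo_t;

   bool toy_fifo_push(toy_fifo_t *f, void *cbdata)
   {
       if (TOY_FIFO_CAP == f->count) {
           return false;
       }
       f->cbdata[(f->head + f->count) % TOY_FIFO_CAP] = cbdata;
       f->count++;
       return true;
   }

   /* Drop the newest entry; used when the send that followed it failed. */
   void toy_fifo_unwind(toy_fifo_t *f)
   {
       assert(f->count > 0);
       f->count--;
   }

   /* Pop the oldest entry, or NULL when nothing is pending. */
   void *toy_fifo_pop(toy_fifo_t *f)
   {
       if (0 == f->count) {
           return NULL;
       }
       void *c = f->cbdata[f->head];
       f->head = (f->head + 1) % TOY_FIFO_CAP;
       f->count--;
       return c;
   }

   /* Models begin_xcast: enqueue immediately before the send and unwind
    * if the send fails, so the FIFO tracks only what reached the wire. */
   bool toy_begin_xcast(toy_fifo_t *f, void *cbdata, bool send_ok)
   {
       if (!toy_fifo_push(f, cbdata)) {
           return false;
       }
       if (!send_ok) {
           toy_fifo_unwind(f);
           return false;
       }
       return true;
   }

If the second of three broadcasts fails to send, the two pops that follow
return the first and third callbacks in order, so the failed send does not
shift the alignment.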

Two implementation choices settled the open questions the plan had flagged:

* The completion hook is the **general** ``xcast_nb`` facility (open question #2's
  preferred option), not a shrink special case.
* The daemon departs on a **bounded timer** with a lifeline-loss fast path (the
  plan's endorsed fallback); the "survivor actively closes the connection" half
  of open question #1 was *not* built — the timer covers it.

One point on scope is worth recording.  This PR explicitly collapses the
**HNP-side** repair — the cost issue #2492 actually names — into a single pass.
The survivors were *not* separately reworked, but per the reliable xcast's
author the batching effectively falls out of the existing mechanism rather than
needing new work: batching the repair at the root drives a batched repair on the
rest as well.  The reliable xcast's ACK bookkeeping is designed to absorb repairs
that happen *mid-broadcast*, so a survivor repairing from the broadcast target
list does not race it.  The scheme stays two-phased — a lifeline first reports
adoption status plus any failures within its new subtree, then the global
broadcast conveys the full list to keep every daemon's view of the tree
consistent — but the first phase already covers every failure that could require
a repair, and the second is a follow-up that only informs about unrelated
failures.

Validation results
------------------

The collective / whole-branch failure path sees far less use than individual
daemon deaths, so it was tested on the in-repo Docker testbed
(``contrib/dockerswarm/``), sized to **ten** nodes and driven with
``--prtemca rml_radix 2`` to force a real multi-level tree
(``0 → 1,2   1 → 3,4   2 → 5,6   3 → 7,8``) rather than the default flat fan-out.

#. **Single-branch multi-daemon shrink** (subtree ``{3,7,8}``): completed with a
   single ``PMIX_DVM_IS_READY``; the HNP survived and ``prun`` still worked.  With
   ``routed_base_verbose`` on the flat tree the collapse is visible directly — a
   **single** repair pass takes children ``4,5,6 → INVALID`` in one shot, and
   every one of the departing daemons' comm-failures is absorbed by the
   already-departed guard (``errmgr_base_verbose`` shows the "ignoring it" line),
   driving **zero** per-death repairs.
#. **Multi-branch shrink** (ranks ``4`` and ``6``, one leaf under each of the
   HNP's two children): one completion event, correct survivors, ``prun`` works.
#. **Crash during shrink** (``pkill -9`` a target's ``prted`` inside the
   departure window): the campaign still drained to a single completion event and
   the HNP survived — confirming clean exit and crash are indistinguishable.
#. **Fence under load** (forty rapid ``prun`` launches spanning a shrink): all
   forty succeeded and the DVM stayed healthy, so the fence raised during a shrink
   does not wedge concurrent traffic.

The **in-flight-job remap-onto-survivors** path (a job held at the ``LAUNCH_APPS``
hold point during a shrink, then remapped) could **not** be exercised in this
harness: a plain ``prun`` maps only onto the head node's base pool, not the
reservation the grown/shrunk nodes belong to, so a normal job is never held for a
reservation-node shrink; and the ``elastic`` tool cannot connect while concurrent
``prun`` sessions litter ``$TMPDIR`` with rendezvous files (it fails with
``PMIX_ERR_UNREACH``, "multiple possible servers").  The hold/remap machinery is
inherited unchanged from :ref:`dvm-shrink-campaign-label`; this plan only moves
*when* the fence releases, which the tests above confirm fires correctly.  A
reservation-targeted in-flight shrink remains to be validated.

Two bugs surfaced and were fixed during validation, both worth noting because
they are easy to reintroduce:

* The completion callback was **lost across the master's relay-to-self**: the
  op created in ``xcast_nb`` is discarded by ``begin_xcast`` and the master
  rebuilds a fresh op on receipt, so the callback has to be carried in the FIFO
  and re-attached when the master relays its own broadcast back.
* The FIFO was first declared with ``PMIX_LIST_STATIC_INIT``, whose sentinel
  ``next``/``prev`` are ``NULL``; appending to such a list corrupts memory and
  silently wedged normal launches.  The list must be ``PMIX_CONSTRUCT``-ed — it
  now lives in the xcast-ops object and is constructed in ``xcast_con``.
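
The second bug is easier to see in a generic sentinel-based list.  The sketch
below is not PMIx code; ``toy_list_t`` and its functions are hypothetical and
only illustrate why an append needs a constructed sentinel whose ``next`` and
``prev`` point back at itself.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   typedef struct toy_item {
       struct toy_item *next;
       struct toy_item *prev;
   } toy_item_t;

   typedef struct {
       toy_item_t sentinel; /* marks both ends of the circular list */
       size_t length;
   } toy_list_t;

   /* Constructor: an empty list's sentinel points at itself. */
   void toy_list_construct(toy_list_t *l)
   {
       l->sentinel.next = &l->sentinel;
       l->sentinel.prev = &l->sentinel;
       l->length = 0;
   }

   /* Append dereferences sentinel.prev.  On a list whose sentinel links
    * are NULL that dereference is invalid, which is the failure mode the
    * assertion guards against here. */
   void toy_list_append(toy_list_t *l, toy_item_t *it)
   {
       assert(NULL != l->sentinel.prev);
       it->prev = l->sentinel.prev;
       it->next = &l->sentinel;
       l->sentinel.prev->next = it;
       l->sentinel.prev = it;
       l->length++;
   }

After construction, appending two items links them between the sentinel's
``next`` and ``prev`` as expected; skipping the constructor trips the
assertion instead of silently writing through a null pointer.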

See ``contrib/dockerswarm/README.md`` for the elastic-mode flag and the cleanup
loop between runs.

Re-grow of a just-shrunk node (#2491) is **partially** addressed here.  The
``prte_plm_base_reset_dvm_node()`` step above fixes the launcher-agnostic half —
a shrunk node is no longer skipped ("daemon already exists") or duplicated on a
later grow, verified on the testbed — and the prerequisite that landed in #2497
also routes ``PRTE_PROC_STATE_FAILED_TO_CONNECT`` into the grow-failure rollback
instead of the fatal "unsupported daemon error state" abort.  The remaining half
is the daemon vpid space.  A shrink leaves a dead vpid in the daemon job array
(the proc is marked terminated and ``prte_process_info.num_daemons`` is
decremented, but it is *not* removed from ``daemons->procs`` and
``daemons->num_procs`` is not decremented), and a later grow always **appends**
(``proc->name.rank = daemons->num_procs``), so the hole becomes permanent.  The
positional radix routing tree is pure vpid arithmetic over ``[0, num_daemons)``:
the shrink's ``prte_rml_repair_routing_tree()`` correctly marks the dead rank as
failed, but the next grow's ``prte_rml_compute_routing_tree()`` re-inits its
failed-daemon bitmaps and **wipes those marks**, so it re-treats the dead rank
as live and the re-grown daemon fails wireup.  Reusing the vacated vpid would
fix it for the ssh launcher but breaks the non-ssh launchers (slurm/pals/lsf),
which launch daemons as a **sequential** vpid range; the launcher-agnostic fix
is therefore to route the tree *around* the dead rank rather than reuse the vpid.

**This routing-side fix is now implemented.**  ``prte_rml_base`` carries a
persistent ``dead_dmns`` bitmap that ``prte_rml_repair_routing_tree()`` sets
whenever a rank departs and that ``prte_rml_compute_routing_tree()`` — unlike the
per-event ``failed_dmns`` set — never re-initializes.  On every recompute the
departed ranks are restored into ``failed_dmns`` and the freshly-built base tree
is repaired around them (``prte_rml_update_ancestors`` / ``handle_promotion`` /
``update_descendants``), so ``radix_is_living`` keeps the hole out of every
survivor's ancestors, lifeline, and children.  One of the two supporting bugs is
fixed alongside: the grow path into ``setup_vm`` now resets
``map->num_new_daemons`` and ``map->daemon_vpid_start`` on entry rather than
accumulating them across successive grows.  The other — the VM-ready gate
(``num_reported == num_procs``) after a hole — is expected to be moot with
append-only vpids plus the routing fix and is left tracked in #2491.
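
The effect of the persistent set can be sketched on a toy positional tree.
The assumptions are significant: ``toy_routes_t`` is hypothetical, the parent
formula ``(rank - 1) / radix`` is chosen because it reproduces the radix-2
testbed tree quoted in this document, and the repair is simplified to walking
up to the nearest living ancestor, which is not necessarily what
``handle_promotion`` does.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   #define TOY_MAX_DAEMONS 64

   typedef struct {
       int radix;                    /* fan-out of the positional tree */
       bool dead[TOY_MAX_DAEMONS];   /* persistent: survives recomputes */
       bool failed[TOY_MAX_DAEMONS]; /* per-event view used for routing */
   } toy_routes_t;

   /* Positional parent: pure vpid arithmetic, rank 0 is the root. */
   int toy_base_parent(const toy_routes_t *r, int rank)
   {
       assert(r->radix >= 1);
       return (rank <= 0) ? -1 : (rank - 1) / r->radix;
   }

   /* A departure is recorded in both sets. */
   void toy_mark_departed(toy_routes_t *r, int rank)
   {
       assert(rank > 0 && rank < TOY_MAX_DAEMONS);
       r->dead[rank] = true;
       r->failed[rank] = true;
   }

   /* Recompute: the per-event set is rebuilt from the persistent holes,
    * so the fresh tree is still routed around them. */
   void toy_recompute(toy_routes_t *r)
   {
       for (size_t i = 0; i < TOY_MAX_DAEMONS; i++) {
           r->failed[i] = r->dead[i];
       }
   }

   /* Nearest living ancestor of rank (the toy's notion of a lifeline). */
   int toy_lifeline(const toy_routes_t *r, int rank)
   {
       assert(rank >= 0 && rank < TOY_MAX_DAEMONS);
       int p = toy_base_parent(r, rank);
       while (p > 0 && r->failed[p]) {
           p = toy_base_parent(r, p);
       }
       return p;
   }

With radix 2, rank 7's base parent is rank 3.  After rank 3 departs and the
tree is recomputed, rank 7's lifeline resolves to rank 1.  A recompute that
cleared ``failed`` without consulting ``dead`` would hand rank 7 back to the
departed rank 3, which is the wireup failure described above.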

A brand-new daemon starts with an empty ``dead_dmns`` set, so on its own it
cannot learn the holes left by shrink events it never saw.  That gap is closed
in the nidmap.
``prte_util_nidmap_create`` now packs the true daemon vpid-space size
(``prte_process_info.num_daemons``) in addition to the live daemon list, and
``prte_util_decode_nidmap`` sets ``num_daemons`` from it and marks every vpid in
``[0, num_daemons)`` that has no live daemon entry into ``dead_dmns`` before
(re)computing the routing tree.  A shrunk-out rank is exactly such a gap, so
every daemon — freshly grown or long-standing — converges on the same vpid span
and the same dead set as the HNP, and the tree is routed around the hole even in
a deep (multi-level) tree where a new daemon's own computed ancestor is a
departed rank.  On an unshrunk DVM the packed span equals the live count, nothing
is marked dead, and behavior is unchanged.
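
The gap-marking rule can be sketched in a few lines.  ``toy_mark_holes()`` is a
hypothetical helper operating on plain arrays; it shows only the rule stated
above (every vpid in the span with no live entry is dead), not the nidmap
encoding itself.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>

   /* Marks every vpid in [0, span) that has no live entry as dead and
    * returns the number of dead vpids.  dead must hold span entries. */
   size_t toy_mark_holes(size_t span, const size_t *live, size_t nlive,
                         bool *dead)
   {
       size_t ndead = 0;

       for (size_t v = 0; v < span; v++) {
           dead[v] = true;            /* assume a hole until seen live */
       }
       for (size_t i = 0; i < nlive; i++) {
           assert(live[i] < span);
           dead[live[i]] = false;
       }
       for (size_t v = 0; v < span; v++) {
           if (dead[v]) {
               ndead++;
           }
       }
       return ndead;
   }

For a span of nine with rank 3 missing from the live list, exactly one vpid is
marked dead.  When the span equals the live count, nothing is marked, which
corresponds to the unshrunk case.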

Open questions
--------------

#. **Terminate-on-lifeline mode — resolved (with a fallback).**  Implemented: a
   target sets ``prte_dvm_leaving`` as it processes the shrink command naming its
   own rank, and departs on a bounded timer or, sooner, on the **first** lost
   connection.  ``prte_rml_route_lost`` takes the early exit only while
   ``prte_dvm_leaving`` is set, so a daemon that is not leaving still recovers
   from a genuine, unrelated fault.  A leaving daemon exits on any lost route,
   not just the lifeline: it can see a child connection drop before its
   lifeline, and treating that as a child failure would emit an adoption notice
   that could be misread as a real fault and propagated up the tree.  The race
   is closed by construction, as the design decision argues.  The half that was
   **not** built is the
   survivor *actively closing* the connection to each departing child; the timer
   makes that unnecessary for correctness, at the cost of the doomed daemons
   lingering a second or two after completion.  Building the active close would
   let the fast path fire deterministically and retire the timer.
#. **General callback vs. shrink special case — resolved.**  Implemented as the
   general ``xcast_nb`` facility, usable beyond shrink.
#. **Comm-failure fall-through — resolved.**  A daemon comm-failure is ignored
   when the daemon is already not-alive **and** its recorded state is at or past
   ``PRTE_PROC_STATE_TERMINATED`` (the state the completion handler stamps).  The
   state test is what keeps the guard from swallowing a ``FAILED_TO_START`` daemon
   (never alive, but its state is still below ``TERMINATED`` at that point).
#. **Survivor-side batching — resolved by the existing mechanism.**  Batching the
   repair at the root drives a batched repair on the survivors as well, and the
   reliable xcast's ACK bookkeeping is designed to absorb the mid-broadcast repair
   rather than race it, so no separate survivor-side work is required.  See
   *Implementation status* for the two-phase description.
#. **Profiling.**  The original concern was never profiled.  It is worth
   measuring the per-daemon vs. batch repair cost on a large single-branch
   shrink to confirm the optimization earns its complexity, especially since
   only the HNP side is batched so far.
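
The comm-failure guard from the third item above can be sketched as a
predicate.  The enum values are hypothetical and only their ordering matters;
the sketch assumes, as that item states, that a daemon which never started
still carries a recorded state below the terminated one when the failure
arrives.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>

   /* Hypothetical state values; only the relative order is meaningful. */
   typedef enum {
       TOY_STATE_INIT = 1,
       TOY_STATE_RUNNING = 4,
       TOY_STATE_TERMINATED = 20
   } toy_state_t;

   typedef struct {
       bool alive;        /* models PRTE_PROC_FLAG_ALIVE */
       toy_state_t state; /* last recorded state of the daemon */
   } toy_proc_t;

   /* Ignore the failure only when the daemon is already not alive AND
    * its recorded state is at or past terminated. */
   bool toy_ignore_comm_failure(const toy_proc_t *p)
   {
       return !p->alive && p->state >= TOY_STATE_TERMINATED;
   }

A target stamped by the completion handler is ignored; a daemon that never
came up (not alive, state still early) and a live daemon suffering a real
fault both fall through to the normal handling.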

Summary of files changed
------------------------

.. list-table::
   :widths: 40 60
   :header-rows: 1

   * - File
     - Change
   * - ``src/mca/grpcomm/grpcomm.h``
     - Extend the xcast module interface with an optional per-op completion
       callback + ``cbdata`` (Step 1, preferred option).
   * - ``src/mca/grpcomm/direct/grpcomm_direct_xcast.c``
     - Cache the completion callback on ``op_t``; invoke it from ``finish_op()``
       **only** when ``PRTE_PROC_IS_MASTER``.
   * - ``src/mca/ras/base/ras_base_allocate.c``
     - Register the completion callback (carrying the ``prte_shrink_campaign_t``)
       on both ``PRTE_DAEMON_SHRINK_CMD`` xcasts.  Add the Step-2 handler:
       batch ``prte_rml_repair_routing_tree()``, per-target HNP bookkeeping,
       fence decrement, ``prte_ras_base_shrink_complete()``, ``dvm_mod_notify``,
       campaign removal, ``prte_plm_base_fence_release()``.
   * - ``src/prted/prted_comm.c``
     - ``PRTE_DAEMON_SHRINK_CMD`` handler: every daemon repairs its own tree from
       the broadcast target list; a targeted daemon records-and-waits for
       lifeline loss instead of self-exiting (with a bounded fallback).
   * - ``src/mca/errmgr/dvm/errmgr_dvm.c``
     - Remove the per-death shrink-campaign block (lines 286-323) including the
       ``PMIX_RANK_INVALID`` stamping.  Ensure the general daemon-loss handling
       tolerates a comm-failure for a rank already torn down by Step 2.
   * - ``src/rml/routed_radix.c``
     - No change was expected at plan time, since
       ``prte_rml_repair_routing_tree()`` already accepts a rank array and does
       one pass.  As implemented, the daemon leaving mode touches this file (see
       *Implementation status*).
   * - ``contrib/dockerswarm/``
     - Testbed grown to ten nodes to exercise single-branch multi-daemon shrink
       (already done alongside this plan).