.. _dvm-shrink-campaign-label: DVM Shrink-Campaign Fence Tracking ================================== This document describes the implementation of the DVM **shrink** (node-removal) path: how a ``PMIX_ALLOC_RELEASE`` that removes daemons is tracked against the launch fence, the second hold point that protects in-flight jobs, how campaign completion is detected, and the completion event delivered to the requester. For the shared fence mechanism it builds on — the counter, the held-job arrays, the ``VM_READY → MAP`` hold point, the fence-release helper, and the completion-event helper — see the parent plan :ref:`elastic-dvm-plan-label`. The externally observable contract is specified in :ref:`elastic-dvm-spec-label`, which is authoritative for observable behavior. The state machine is single-threaded on the progress thread, so no locking is required anywhere in this plan. Background ---------- The ``PRTE_DAEMON_SHRINK_CMD`` xcast is fire-and-forget: daemons exit asynchronously and the HNP has no built-in notification when all targeted daemons have terminated. Two race windows must be closed: **Race 1 — new job maps onto a shrinking node.** A job that checks the ``VM_READY → MAP`` fence while a shrink is in progress may pass the fence (if it was raised after the check), get mapped to a node whose daemon is dying, and then send a launch message to a daemon that has already exited. **Race 2 — in-flight job at LAUNCH_APPS.** A job that completed MAP before the shrink started and then enters ``prte_plm_base_launch_apps()`` may pack and send launch data to a daemon that dies between MAP and the send. The existing VM_READY fence does not protect this window because the job already passed the fence before the shrink was initiated. Race 1 is covered by the shared fence (the fence is incremented for shrink — see Step 1). Race 2 requires a second hold point guarded by checking the shrink campaign list (nonempty only during shrink) so that a concurrent grow does not unnecessarily hold jobs that have already been mapped to surviving nodes. Multiple concurrent shrink campaigns are supported: each campaign tracks its own count of still-living targets and is removed from the list when all of them have departed the DVM. Design Decision — Complete on Death, Not on Acknowledgement ----------------------------------------------------------- An earlier revision of this plan had each targeted daemon send an explicit ``PRTE_PLM_SHRINK_ACK_CMD`` to the HNP just before it exited, and the HNP decremented the campaign on *receipt of the ACK*. The errmgr comm-failure path existed only as a *fallback* for a daemon that crashed before it could send its ACK. That design was abandoned for the following reasons. **The ACK is the wrong signal.** An ACK announces a daemon's *intent* to leave; it is sent while the daemon is still alive and still a participant in the DVM. But the state the fence protects against — a job being mapped onto, or having launch data sent to, a departing daemon — is only safe once the daemon's routes, its ``num_daemons`` count, and its node state have actually been torn down. That teardown happens on the comm-failure path (``errmgr_dvm.c``), *not* when the ACK is sent. Releasing held jobs on ACK receipt could therefore unpark them into a DVM that still believed the departing daemon was present. **The reason for departure carries no information.** The HNP only needs to know that a target is gone, not *why*. A clean shrink exit and a crash have identical consequences for the campaign: the node is being removed either way, and the application processes beneath the daemon are killed (or die) when it terminates. Distinguishing the two cases buys nothing, so the "clean ACK vs. crash fallback" split was pure complexity. **Two decrement paths caused double-counting.** With both the ACK handler and the errmgr fallback live, each target could be counted twice — once when its ACK arrived (daemon still alive) and again when its subsequent death was detected — because nothing marked a target as already counted and the campaign was only removed once ``pending`` hit zero. Worked through for a two-target campaign: .. code-block:: text camp: ntargets=2, pending=2 daemon A acks -> pending=1, fence-=1 (A still alive) daemon A dies -> errmgr matches A (still in targets, camp still listed) -> pending=0 -> camp removed+released, fence-=1 daemon B -> never counted; campaign already "complete" The campaign completed and the fence released while daemon B was still present, re-opening exactly the race the fence was meant to close. **Resolution.** The ACK was removed entirely (the daemon-side send, the ``PRTE_PLM_SHRINK_ACK_CMD`` constant, and the HNP-side handler). Campaign completion is driven solely by actual daemon departure on the comm-failure path, which is both the authoritative event and the point at which the relevant cleanup has occurred. To make the single decrement idempotent against a daemon that emits more than one failure event, each matched target slot is stamped ``PMIX_RANK_INVALID`` once counted (Step 5). Step 1 — Shrink campaign type, list, and fence increment --------------------------------------------------------- **1a — Campaign type** Add the following type to ``src/runtime/prte_globals.h`` and define the class instance in ``src/runtime/prte_globals.c``: .. code-block:: c /* one entry per in-progress shrink campaign */ typedef struct { pmix_list_item_t super; pmix_rank_t *targets; /* daemon ranks being terminated */ int ntargets; /* initial count */ int pending; /* targets not yet known to have departed */ /* requester recorded for the spec's phase-two completion event */ pmix_proc_t requester; /* who issued the PMIX_ALLOC_RELEASE */ char *alloc_id; /* PMIX_ALLOC_ID of the allocation */ char *req_id; /* PMIX_ALLOC_REQ_ID, or NULL */ bool have_requester; /* false for a scheduler-driven release */ } prte_shrink_campaign_t; PMIX_CLASS_DECLARATION(prte_shrink_campaign_t); In ``src/runtime/prte_globals.c``: .. code-block:: c static void campaign_destruct(prte_shrink_campaign_t *p) { free(p->targets); free(p->alloc_id); free(p->req_id); } PMIX_CLASS_INSTANCE(prte_shrink_campaign_t, pmix_list_item_t, NULL, campaign_destruct); **1b — Global list** In ``src/runtime/prte_globals.h`` declare, and in ``src/runtime/prte_globals.c`` define: .. code-block:: c pmix_list_t prte_shrink_campaigns; Initialize in ``src/runtime/prte_init.c``: .. code-block:: c PMIX_CONSTRUCT(&prte_shrink_campaigns, pmix_list_t); Destruct in ``src/runtime/prte_finalize.c``: .. code-block:: c PMIX_LIST_DESTRUCT(&prte_shrink_campaigns); **1c — Populate campaign and increment fence** In ``src/mca/ras/base/ras_base_allocate.c``, the ``PMIX_ALLOC_RELEASE`` branch of ``prte_ras_base_complete_request()`` builds the daemon rank array (``ranks``, count ``m``) and then calls ``free(ranks)`` before the xcast that carries ``PRTE_DAEMON_SHRINK_CMD`` to the daemons. Insert the campaign setup **before** ``free(ranks)``, recording the requester directly from the request object (``req``) so the completion event can be directed at it. Guard the whole setup on ``0 < m``: a release that removes no daemons creates no campaign, exactly as the grow path creates none when ``map->num_new_daemons == 0``. The file must ``#include "src/mca/plm/base/plm_private.h"`` to see ``prte_plm_base_dvm_mod_notify()``. .. code-block:: c /* record the campaign — must be before free(ranks). Skip entirely when the * release removes no daemons (m == 0): an empty campaign would never drain * (no target ever departs on the comm-failure path), so it would leave * prte_shrink_campaigns non-empty forever — wedging every later job at the * LAUNCH_APPS hold — and would emit no completion event. Mirrors the grow * path's `map->num_new_daemons > 0` guard and the spec's "no event when * nothing changes" clause. */ if (0 < m) { prte_shrink_campaign_t *_camp = PMIX_NEW(prte_shrink_campaign_t); _camp->targets = (pmix_rank_t *) malloc(m * sizeof(pmix_rank_t)); memcpy(_camp->targets, ranks, m * sizeof(pmix_rank_t)); _camp->ntargets = m; _camp->pending = m; /* this path always has a requesting process (req->tproc); a * scheduler-driven release that has no requester does not pass through * here. Capture the requester and the allocation ids from the request. */ PMIX_XFER_PROCID(&_camp->requester, &req->tproc); for (n = 0; n < req->ninfo; n++) { if (PMIx_Check_key(req->info[n].key, PMIX_ALLOC_ID)) { _camp->alloc_id = strdup(req->info[n].value.data.string); } else if (PMIx_Check_key(req->info[n].key, PMIX_ALLOC_REQ_ID)) { _camp->req_id = strdup(req->info[n].value.data.string); } } _camp->have_requester = true; pmix_list_append(&prte_shrink_campaigns, &_camp->super); prte_dvm_launch_fence += m; } free(ranks); /* existing xcast */ if (PRTE_SUCCESS != (rc = prte_grpcomm.xcast(PRTE_RML_TAG_DAEMON, &msg))) { PRTE_ERROR_LOG(rc); /* clean up the campaign we just added (only if one was created), and * tell the requester the DVM modification failed (spec phase-two * failure event). rc is a PRTE code, so convert it to the * pmix_status_t the event carries. */ if (0 < m) { prte_shrink_campaign_t *_camp = (prte_shrink_campaign_t *) pmix_list_remove_last(&prte_shrink_campaigns); prte_dvm_launch_fence -= _camp->pending; if (_camp->have_requester) { prte_plm_base_dvm_mod_notify(&_camp->requester, _camp->alloc_id, _camp->req_id, false, prte_pmix_convert_rc(rc)); } PMIX_RELEASE(_camp); } } Because the campaign is appended before the xcast, any ``VM_READY`` event that fires on the progress thread after this point will see a nonzero fence and park the job. Step 2 — Daemon exit (no acknowledgement) ------------------------------------------ A daemon that decides to exit in response to ``PRTE_DAEMON_SHRINK_CMD`` does **not** send any acknowledgement to the HNP. After firing its ``PMIX_EVENT_JOB_END`` notification it simply activates ``PRTE_JOB_STATE_DAEMONS_TERMINATED`` and exits. The HNP tracks campaign completion through the daemon's *actual departure*, not through a message announcing its intent to leave. An acknowledgement sent before the daemon dies would be premature: the HNP cares only that the daemon is gone — the reason is irrelevant, and the application processes under it are killed when it terminates regardless. More importantly, the acknowledgement would arrive *before* the daemon's routes, ``num_daemons`` count, and node state have been torn down, so acting on it could release held jobs into a DVM that still believes the departing daemon is present. The comm-failure event (Step 5) is the only signal that coincides with that cleanup, so it is the sole completion trigger. Step 3 — Second hold point at LAUNCH_APPS ------------------------------------------ In ``src/mca/plm/base/plm_base_launch_support.c``, ``prte_plm_base_launch_apps()`` (line 817), add a check after the job-state guard but before packing any data: .. code-block:: c /* if a shrink is in progress, hold this job until all targeted * daemons have departed the DVM, to prevent sending launch data to * a dying daemon */ if (!pmix_list_is_empty(&prte_shrink_campaigns)) { jdata->state = PRTE_JOB_STATE_WAITING_FOR_DAEMONS; PMIX_RETAIN(jdata); pmix_pointer_array_add(prte_prelaunch_held_jobs, jdata); PMIX_RELEASE(caddy); return; } Using ``!pmix_list_is_empty(...)`` rather than a counter means the check automatically handles concurrent campaigns: the list is nonempty as long as any shrink is in progress. Keying on the **shrink** list specifically (not the shared fence counter) ensures a concurrent grow does not stall a job that has already been mapped onto surviving nodes. Step 4 — Remap helpers ---------------------- The pre-launch branch of the shared ``prte_plm_base_fence_release()`` (parent plan, Step 4) calls two shrink-specific helpers, both defined in ``plm_base_launch_support.c`` and declared in ``src/mca/plm/base/plm_private.h``. **``prte_plm_base_job_needs_remap(jdata)``** iterates over ``jdata->procs`` and returns ``true`` if any proc's assigned node has a daemon rank appearing in any active campaign: .. code-block:: c bool prte_plm_base_job_needs_remap(prte_job_t *jdata) { prte_shrink_campaign_t *camp; prte_proc_t *proc; int p, t; PMIX_LIST_FOREACH(camp, &prte_shrink_campaigns, prte_shrink_campaign_t) { for (p = 0; p < jdata->procs->size; p++) { proc = (prte_proc_t *) pmix_pointer_array_get_item(jdata->procs, p); if (NULL == proc || NULL == proc->node || NULL == proc->node->daemon) continue; for (t = 0; t < camp->ntargets; t++) { if (camp->targets[t] == proc->node->daemon->name.rank) { return true; } } } } return false; } **``prte_plm_base_reset_proc_map(jdata)``** un-claims all slot assignments made during the previous MAP pass so that the job can be remapped cleanly. Mirror the mapper's ``prte_rmaps_base_claim_slot()`` accounting, which does ``node->num_procs++`` and ``++node->slots_inuse`` for each non-tool proc: .. code-block:: c void prte_plm_base_reset_proc_map(prte_job_t *jdata) { int p, np; prte_proc_t *proc; prte_node_t *node; prte_app_context_t *app; for (p = 0; p < jdata->procs->size; p++) { proc = (prte_proc_t *) pmix_pointer_array_get_item(jdata->procs, p); if (NULL == proc) continue; node = proc->node; if (NULL != node) { /* remove from node's proc list */ for (np = 0; np < node->procs->size; np++) { if (pmix_pointer_array_get_item(node->procs, np) == proc) { pmix_pointer_array_set_item(node->procs, np, NULL); node->num_procs--; /* mirror claim_slot: tool procs do not count * against slots_inuse */ app = (prte_app_context_t *) pmix_pointer_array_get_item(jdata->apps, proc->app_idx); if (NULL == app || !PRTE_FLAG_TEST(app, PRTE_APP_FLAG_TOOL)) { node->slots_inuse--; } break; } } } pmix_pointer_array_set_item(jdata->procs, p, NULL); PMIX_RELEASE(proc); } jdata->num_procs = 0; jdata->num_launched = 0; } After remapping, the job re-enters ``prte_rmaps_base_map_job()`` which re-creates proc objects on the surviving nodes using the original ``app->num_procs`` counts. Step 5 — Detect target departure in the errmgr and notify completion -------------------------------------------------------------------- Campaign completion is driven entirely by the daemon-loss path: when a targeted daemon leaves the DVM, the HNP's comm-failure handler matches its rank against the active campaigns and drives the fence down. This is the same event whether the daemon exited cleanly in response to the shrink command or crashed, so a single code path covers both — there is no separate "acknowledgement" message and no fallback to reconcile. In ``src/mca/errmgr/dvm/errmgr_dvm.c``, inside the ``PMIX_CHECK_NSPACE`` daemon-proc block of ``proc_errors()`` (line 252), within the ``PRTE_PROC_STATE_COMM_FAILED`` / heartbeat-failed handler, add after the "mark daemon as gone" logic: .. code-block:: c /* check if this daemon was a pending shrink target */ { prte_shrink_campaign_t *_camp, *_next; int _t; PMIX_LIST_FOREACH_SAFE(_camp, _next, &prte_shrink_campaigns, prte_shrink_campaign_t) { for (_t = 0; _t < _camp->ntargets; _t++) { if (_camp->targets[_t] != proc->rank) continue; /* stamp this slot so a repeated comm event for the same * daemon cannot decrement the campaign twice */ _camp->targets[_t] = PMIX_RANK_INVALID; _camp->pending--; prte_dvm_launch_fence--; if (0 == _camp->pending) { /* this request's shrink is complete — first let the * active RAS modules release the freed resources back * to their resource manager(s), then notify the * requester that the DVM now reflects the new size */ prte_ras_base_shrink_complete(_camp); if (_camp->have_requester) { /* success == true => PMIX_DVM_IS_READY */ prte_plm_base_dvm_mod_notify(&_camp->requester, _camp->alloc_id, _camp->req_id, true, PMIX_SUCCESS); } pmix_list_remove_item(&prte_shrink_campaigns, &_camp->super); PMIX_RELEASE(_camp); } if (0 == prte_dvm_launch_fence) { prte_plm_base_fence_release(); } goto errmgr_shrink_done; } } errmgr_shrink_done: ; } Because the progress thread is single-threaded, the counter decrements and the list manipulation are atomic with respect to all other state machine callbacks. Stamping the matched slot ``PMIX_RANK_INVALID`` makes the decrement idempotent: should the daemon generate more than one failure event, only the first is counted. A daemon that crashes during a shrink is handled identically to one that exits cleanly — the node was being removed anyway, and jobs mapped to it are detected by ``prte_plm_base_job_needs_remap()`` and re-routed to surviving nodes. Before the completion event is emitted, ``prte_ras_base_shrink_complete()`` cycles across every active RAS module (``prte_ras_base.selected_modules``) and invokes the optional ``shrink_complete`` entry point on each, passing the completed ``prte_shrink_campaign_t``. This is the component's opportunity to release the freed resources back to its resource manager; what it does with that opportunity is up to the component. Unlike ``modify``, the cycle is not keyed to a single component: a single ``PMIX_ALLOC_RELEASE`` may remove nodes drawn from more than one allocation (see the "Resource release at shrink completion" section of :ref:`elastic-dvm-spec-label`), so every module is offered the campaign and each handles only the share that belongs to it. A module with no ``shrink_complete`` pointer, or with no stake in the operation, is a no-op. The runtime guarantees only the ordering: the release cycle runs ahead of ``prte_plm_base_dvm_mod_notify()``, so by the time the requester sees ``PMIX_DVM_IS_READY`` every component has been given its chance to act — not that any particular resource was in fact returned, which is the component's decision. The ``PMIX_DVM_IS_READY`` notification is **per campaign**: it fires when *this* request's last target departs, regardless of whether other (grow or shrink) campaigns keep the shared fence nonzero. The fence-release of held jobs, by contrast, waits for the global fence to reach zero. The failure counterpart (``PMIX_ERR_DVM_MOD``) is emitted only on the xcast-failure cleanup in Step 1 — once the shrink command is on the wire, every targeted daemon's departure is a *success* for the campaign, since clean exit and crash are indistinguishable and both remove the node as requested. Summary of Files Changed (Shrink Fence) ----------------------------------------- .. list-table:: :widths: 50 50 :header-rows: 1 * - File - Change * - ``src/runtime/prte_globals.h`` - Declare ``prte_shrink_campaign_t`` (type + ``PMIX_CLASS_DECLARATION``, including the requester fields) and ``prte_shrink_campaigns`` (``pmix_list_t``). * - ``src/runtime/prte_globals.c`` - Define ``PMIX_CLASS_INSTANCE`` for ``prte_shrink_campaign_t`` (destructor frees ``targets``, ``alloc_id``, ``req_id``). Define ``prte_shrink_campaigns``. * - ``src/runtime/prte_init.c`` - ``PMIX_CONSTRUCT(&prte_shrink_campaigns, pmix_list_t)`` alongside ``prte_held_jobs`` initialization. * - ``src/runtime/prte_finalize.c`` - ``PMIX_LIST_DESTRUCT(&prte_shrink_campaigns)``. * - ``src/mca/ras/base/ras_base_allocate.c`` - Add ``#include "src/mca/plm/base/plm_private.h"`` for ``prte_plm_base_dvm_mod_notify()``. In the ``PMIX_ALLOC_RELEASE`` branch of ``prte_ras_base_complete_request()``, guarded on ``0 < m``: create a ``prte_shrink_campaign_t``, copy the rank array into it, record the requester from ``req->tproc`` and ``PMIX_ALLOC_ID`` / ``PMIX_ALLOC_REQ_ID`` from ``req->info``, append to ``prte_shrink_campaigns``, and increment ``prte_dvm_launch_fence`` by ``m`` — all before ``free(ranks)``. Add xcast-failure cleanup that removes the campaign, decrements the fence, and emits ``PMIX_ERR_DVM_MOD`` (carrying ``prte_pmix_convert_rc(rc)``) to the requester. * - ``src/prted/prted_comm.c`` - In ``PRTE_DAEMON_SHRINK_CMD`` handler: after the ``JOB_END`` notification wait, activate ``PRTE_JOB_STATE_DAEMONS_TERMINATED`` and exit. No acknowledgement is sent; the HNP detects departure via the comm-failure path. * - ``src/mca/plm/base/plm_base_launch_support.c`` - Add ``prte_plm_base_job_needs_remap()`` and ``prte_plm_base_reset_proc_map()``. Add hold check in ``prte_plm_base_launch_apps()`` on ``!pmix_list_is_empty(&prte_shrink_campaigns)``. * - ``src/mca/plm/base/plm_private.h`` - Declare the two remap helpers. * - ``src/mca/ras/ras.h`` - Add the ``shrink_complete`` module entry point: a ``prte_ras_base_module_shrink_complete_fn_t`` taking the completed ``prte_shrink_campaign_t``, and a field for it in ``prte_ras_base_module_t`` (after ``modify``). Components that hand resources back to a scheduler implement it; others leave it ``NULL``. * - ``src/mca/ras/base/base.h`` - Declare ``prte_ras_base_shrink_complete(prte_shrink_campaign_t *)``. * - ``src/mca/ras/base/ras_base_allocate.c`` - Define ``prte_ras_base_shrink_complete()``: cycle across ``prte_ras_base.selected_modules`` and invoke each module's ``shrink_complete`` (when non-NULL), passing the campaign. Not keyed to one component, since a release may span multiple allocations. * - ``src/mca/errmgr/dvm/errmgr_dvm.c`` - Add ``#include "src/mca/ras/base/base.h"``. In ``proc_errors()``, daemon-comm-failure block: search ``prte_shrink_campaigns`` for the dead daemon's rank; if found, stamp the matched target slot ``PMIX_RANK_INVALID``, decrement campaign ``pending`` and fence; when ``pending`` hits zero call ``prte_ras_base_shrink_complete()`` to release the resources RM-side, then emit ``PMIX_DVM_IS_READY`` to the requester and remove the campaign; call ``prte_plm_base_fence_release()`` when the fence hits zero. This is the sole shrink-completion trigger. The shared infrastructure this path relies on — the fence counter, held-job arrays, ``VM_READY → MAP`` hold point, ``prte_plm_base_fence_release()``, and the ``prte_plm_base_dvm_mod_notify()`` completion-event helper — is listed in the "Shared Fence Infrastructure" table in :ref:`elastic-dvm-plan-label`. Design Invariants ----------------- * ``prte_shrink_campaigns`` is a ``pmix_list_t``; each entry covers exactly one ``PMIX_ALLOC_RELEASE`` request. Multiple concurrent shrink campaigns are supported. * The fence is incremented by exactly ``m`` at campaign creation and decremented by 1 for each targeted daemon whose departure is detected on the errmgr comm-failure path (clean exit and crash are indistinguishable and handled identically). Each target slot is stamped ``PMIX_RANK_INVALID`` once counted, so a repeated comm event cannot decrement twice. A campaign is removed from the list when its ``pending`` count reaches zero. * The ``LAUNCH_APPS`` hold uses ``!pmix_list_is_empty(&prte_shrink_campaigns)``, not ``prte_dvm_launch_fence > 0``, so a concurrent grow does not stall already-mapped jobs on surviving nodes. * ``prte_shrink_campaigns`` is stable throughout each campaign: the targets array for a given campaign is valid from creation through removal, so ``prte_plm_base_job_needs_remap()`` can safely iterate it during release. * Jobs in ``prte_prelaunch_held_jobs`` hold a ``PMIX_RETAIN`` reference; ``prte_plm_base_fence_release()`` releases it after re-activating the job. These jobs wait only on a shrink and are never aborted by a concurrent grow failure (the grow-failure abort touches only ``prte_held_jobs``); since shrink completion is success-only, they are always re-activated, not failed. * The completion event is per campaign and fires from the campaign-removal point (``pending == 0``), so each accepted release yields exactly one ``PMIX_DVM_IS_READY`` (success) or, on an xcast failure at creation, exactly one ``PMIX_ERR_DVM_MOD`` — never both, and never for a scheduler-driven release with no requester. * The RM-side release cycle runs strictly *before* the completion event. At ``pending == 0`` the campaign-removal point calls ``prte_ras_base_shrink_complete()``, which offers the campaign to every active RAS module, and only then emits ``PMIX_DVM_IS_READY``. Because the departing nodes may span multiple allocations, all modules are cycled — each handling only its own share. The invariant is the *ordering* (every component is offered the campaign before the event fires), not that any resource was actually returned: that is the component's decision, outside the runtime's control. Follow-up — collective shrink completion ----------------------------------------- A possible optimization — repairing the routing tree once per shrink campaign (a collective completion scheme) rather than once per departing daemon — has been deferred out of the launch-fence work and tracked separately as `openpmix/prrte#2492 `_.