.. _dvm-grow-campaign-label: DVM Grow-Campaign Fence Tracking ================================ This document describes the implementation that makes the DVM **grow** (daemon-launch) path account for the launch fence on a per-daemon, rank-tracked basis, mirroring the design already used by the DVM **shrink** path (:ref:`dvm-shrink-campaign-label`). For the shared fence mechanism itself and the race it closes, see the parent plan :ref:`elastic-dvm-plan-label` and :ref:`state-machine-label`, section *DVM Extension and the Daemon-Launch Race*. The state machine is single-threaded on the progress thread, so no locking is required anywhere in this plan. The observable job-admission and placement guarantees that the grow path upholds are specified in :ref:`elastic-dvm-spec-label`, which is authoritative for observable behavior; this document describes the implementation that delivers them. Motivation ---------- The launch fence (``prte_dvm_launch_fence``) holds application jobs at the ``VM_READY → MAP`` boundary while a daemon-launch campaign is in progress, so that no job is mapped onto a node whose daemon is not yet up and wired. The shrink path tracks the specific daemon ranks it is removing in a ``prte_shrink_campaign_t`` and resolves the fence one rank at a time as each targeted daemon actually departs. The grow path, by contrast, originally encoded "a grow is in progress" as a single boolean — ``PRTE_JOB_LAUNCHED_DAEMONS`` — set on the one daemon job, together with a ``prte_dvm_launch_fence++`` performed once per campaign in ``prte_plm_base_setup_virtual_machine()``. The single decrement happened in ``vm_ready`` on success, or in the ``errmgr/dvm`` comm-failure handler if a daemon died first. Because that boolean carries no identity, two defects followed: #. **An unrelated daemon death consumed the campaign's token.** The comm-failure handler decremented the fence and cleared the boolean whenever *any* daemon died while a grow was in progress — there is only one daemon job, and it carried the token. A pre-existing daemon dying mid-grow would therefore release the held jobs early (reopening the very race the fence exists to close) and clear the token, after which ``vm_ready`` skipped the WIREUP xcast (it is gated on the same attribute), so the genuinely new daemons could come up without ever receiving the nidmap/wireup buffer. #. **Concurrent campaigns could wedge the fence.** Two overlapping grows would raise the fence to two but share the single boolean token, which can only be cleared once. A daemon failure would clear it, leaving the fence stuck above zero and the held jobs parked indefinitely. Both defects trace to the same root cause: the grow path tracked *that* a grow was happening, not *which* daemons it was launching. Design ------ Track each grow campaign explicitly, recording the ranks being launched, and hold the whole campaign's fence contribution until a single safe drain point. New campaign object ~~~~~~~~~~~~~~~~~~~~ In ``src/runtime/prte_globals.h`` / ``prte_globals.c``: .. code-block:: c typedef struct { pmix_list_item_t super; pmix_rank_t *targets; /* daemon ranks being launched */ int ntargets; /* == this campaign's fence contribution */ /* requester recorded for the spec's phase-two completion event */ pmix_proc_t requester; /* who requested the grow */ char *alloc_id; /* PMIX_ALLOC_ID of the allocation */ char *req_id; /* PMIX_ALLOC_REQ_ID, or NULL */ bool have_requester; /* false for a scheduler-driven push */ } prte_grow_campaign_t; PMIX_CLASS_DECLARATION(prte_grow_campaign_t); PRTE_EXPORT extern pmix_list_t prte_grow_campaigns; The campaign's destructor frees ``targets``, ``alloc_id``, and ``req_id``. The list is constructed in ``prte_init.c`` and destructed in ``prte_finalize.c`` alongside ``prte_shrink_campaigns``. A separate list (as opposed to unifying with the shrink list) is used deliberately: the ``LAUNCH_APPS`` hold and the remap-on-release logic key off the *shrink* list's non-emptiness, and a grow must **not** stall jobs that are already mapped onto existing nodes. Keeping the lists separate leaves the working shrink path untouched. Fence is campaign-granular, not per-rank ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Unlike shrink, the grow fence contribution is **held in full until the campaign is drained as a unit**. This is the key correctness point: the fence must not reach zero (for a successful grow) until *after* the WIREUP xcast in ``vm_ready``, otherwise an application job arriving in the window between "last daemon reported" and "wireup sent" would see a zero fence and map onto daemons that are up but not yet wired. A naive per-rank decrement at daemon-report time would reopen exactly that window. Holding the contribution until ``vm_ready`` drains it preserves the original ordering guarantee. The per-rank ``targets`` array serves two purposes: to decide whether a *failure* event belongs to this grow, and — when one does — to enumerate the daemons that must be torn down to roll the DVM back to its pre-grow membership (see `Rollback on failure`_). Lifecycle ~~~~~~~~~ #. **Create** — in ``prte_plm_base_setup_virtual_machine()``, when ``map->num_new_daemons > 0``: build a ``prte_grow_campaign_t`` recording the ``num_new_daemons`` consecutive vpids starting at ``map->daemon_vpid_start``, record the requester / ``PMIX_ALLOC_ID`` / ``PMIX_ALLOC_REQ_ID`` for the phase-two completion event, append it to ``prte_grow_campaigns``, and add ``num_new_daemons`` to the fence. The requester is taken from the first new daemon's ``node->session`` — the RAS reservation machinery back-points each reserved node at the session that records the driving request (``requestor``, ``alloc_refid``, ``user_refid``). When the grow was not driven by an allocation request (the initial DVM bring-up, or a scheduler push — the default session, or a session whose ``requestor`` rank is ``PMIX_RANK_INVALID``), ``have_requester`` stays false and no event is emitted. ``PRTE_JOB_LAUNCHED_DAEMONS`` is still set on the daemon job for its unrelated uses (the WIREUP gate in ``vm_ready`` and the odls path); it is no longer consulted for fence accounting. #. **Success drain** — ``vm_ready`` fires only once every expected daemon has reported (``num_reported == num_procs``), which means any in-progress grow campaigns have fully succeeded. After performing the WIREUP xcast, it calls ``prte_plm_base_grow_drain(true)``, which removes every grow campaign, subtracts each ``ntargets`` from the fence, emits a ``PMIX_DVM_IS_READY`` completion event to each drained campaign's requester (via ``prte_plm_base_dvm_mod_notify()`` — see :ref:`elastic-dvm-plan-label`, Step 5), and — if the fence has reached zero — admits the held jobs by calling ``prte_plm_base_fence_release()``. #. **Failure drain and rollback** — in the ``errmgr/dvm`` comm-failure / ``FAILED_TO_START`` handler, the dead daemon's rank is passed to ``prte_plm_base_grow_target_failed()``, which returns ``true`` iff the rank belonged to an in-progress grow campaign. An unrelated daemon loss matches nothing, returns ``false``, and is left to the errmgr's normal handling (fixing defect 1). When the rank *is* a grow target, the function handles the loss completely — it removes **that** campaign from the list, drops its ``ntargets`` from the fence, rolls it back out of the DVM (see `Rollback on failure`_), emits a ``PMIX_ERR_DVM_MOD`` completion event to its requester, and aborts the pre-map held jobs via ``prte_plm_base_abort_premap_held()`` (see :ref:`elastic-dvm-plan-label`, Step 4) — and the errmgr, seeing the ``true`` return, ``goto``\ s its cleanup so the general daemon-loss path (which would otherwise abort the whole DVM) is skipped. The failure is **campaign-scoped**: only the matched campaign is torn down, so a concurrent grow keeps its daemons and completes normally. Mirroring the original single-token behavior, any grow failure fails the whole **pre-map** held-job set — immediately, regardless of the fence value, so a concurrent shrink cannot later admit a job whose grow dependency has failed. It deliberately does **not** disturb the pre-launch (``LAUNCH_APPS``) held jobs: those wait only on a shrink, not on the grow, so per the spec's conformance guarantee #4 a grow failure must leave them parked. #. **Safety net** — ``check_job_complete``'s "received NULL job" branch calls ``prte_plm_base_grow_drain(false)`` to drain any still-pending grow campaigns as failures, so pre-map held jobs are never parked across a daemon-job teardown. (No rollback is needed there — the whole DVM is force-exiting.) The success drain (``grow_drain(true)`` from ``vm_ready``) still removes every grow campaign in one pass and zeroes the fence's entire grow contribution, independent of how many concurrent campaigns exist (fixing defect 2); the *failure* path, by contrast, is per-campaign so an unrelated concurrent grow is not dragged down with the failed one. Rollback on failure ~~~~~~~~~~~~~~~~~~~~ The spec (:ref:`elastic-dvm-spec-label`) requires that a failed grow leave the DVM in its pre-grow state rather than half-extended. Failing the held jobs is therefore necessary but not sufficient: the campaign's already-started daemons and the nodes it was adding must also be removed. ``grow_target_failed()`` performs this teardown (in the static helper ``grow_rollback()``) for the matched campaign before notifying the requester and aborting the held jobs. The campaign's ``targets`` array enumerates every daemon rank the grow launched. One of them is the rank whose loss triggered the failure; the remainder may be in any state from "not yet reported" through "reported and wired". Routing for the triggering rank is repaired here with ``prte_rml_route_lost()`` (the errmgr's own ``route_lost`` call is on the path that the ``true`` return skips). Each *other* target is handled according to whether a daemon actually came up: * **A target that started** (``PRTE_PROC_FLAG_ALIVE`` — it reported in) is terminated using the same ``PRTE_DAEMON_SHRINK_CMD`` xcast the DVM shrink path uses. It self-exits, and its departure is then reconciled on the normal daemon-loss path (``route_lost`` succeeds, ``num_daemons`` is decremented) as for any shrink — and because the campaign is already gone, that later event returns ``false`` and is handled without a second rollback. * **A target that never started** (the ``FAILED_TO_START`` case — e.g. the remote ``exec`` failed) has no daemon to signal, so no comm-failure event will arrive for it; its launch-time ``num_daemons`` bump is reverted directly in ``grow_rollback()``. In every case the node's daemon backpointer is cleared (``node->daemon = NULL``, releasing the retain taken at assignment, and detaching any reservation session), which removes the node from the mapper's usable set — the new nodes carry no application procs, since the jobs that would have used them were held at the fence and never launched, so clearing ``node->daemon`` is sufficient to keep any later job off them. The rollback is strictly campaign-scoped: it touches only the ranks in the failed campaign's ``targets`` array. A concurrently-running grow campaign keeps its own daemons and completes normally, and no pre-existing daemon or node is disturbed — the same identity-based discrimination that keeps an unrelated daemon death from consuming the fence (defect 1) also keeps it out of the rollback set. .. note:: Two edges remain, both within the rarely-exercised daemon-launch-failure path and neither yet validated against a real multi-node allocation: a target that is **slow to start** (neither ``ALIVE`` nor yet failed when the rollback runs) is treated as never-started, so a later report-in or failure for it is not specially handled; and node objects are detached via ``node->daemon`` rather than physically removed from ``prte_node_pool`` (matching how the shrink path leaves the pool), so ``num_nodes`` is not decremented. Why this is correct ------------------- * **Unrelated daemon death during a grow.** ``grow_target_failed()`` scans the campaign target arrays; a non-target rank matches nothing, so the fence is not touched, the held jobs are not released early, and the WIREUP xcast is not skipped. * **Concurrent campaigns.** Each campaign is an independent object with its own contribution. On success ``grow_drain()`` removes them all and the fence reaches zero only when no grow contribution remains; on failure only the matched campaign is removed. Either way there is no single token to exhaust. * **Wireup ordering.** The fence stays at its full value throughout the grow and is dropped only when ``vm_ready`` drains it after the WIREUP xcast (on success) or when a target dies (on failure). Jobs held at ``VM_READY → MAP`` are thus admitted only once the new daemons are wired up. * **Partial failure.** A grow in which any target dies is failed as a whole: the dying daemon triggers ``grow_target_failed()``, which rolls the matched campaign back out of the DVM — terminating its started daemons via the shrink command and detaching its nodes — and aborts the pre-map held jobs to ``NEVER_LAUNCHED`` (the pre-launch held jobs, which wait only on a shrink, are left untouched). This matches the original first-failure semantics for the held jobs and, per the spec, leaves the DVM at its pre-grow membership rather than half-extended; the errmgr skips its DVM-wide abort because the loss was reported as handled. Touched files ------------- .. list-table:: :widths: 45 55 :header-rows: 1 * - File - Change * - ``src/runtime/prte_globals.{h,c}`` - Add ``prte_grow_campaign_t`` (including the requester fields), ``prte_grow_campaigns`` list, and class (destructor frees ``targets``, ``alloc_id``, ``req_id``). * - ``src/runtime/prte_init.c`` / ``prte_finalize.c`` - Construct / destruct ``prte_grow_campaigns``. * - ``src/mca/plm/base/plm_base_launch_support.c`` - Create the campaign in ``setup_virtual_machine`` (recording the requester from the first new daemon's ``node->session`` — its ``requestor`` / ``alloc_refid`` / ``user_refid``); add ``prte_plm_base_grow_drain()``, the static ``grow_rollback()``, and ``prte_plm_base_grow_target_failed()``. ``grow_drain()`` (success drain / teardown safety net) emits the per-campaign completion event via the shared ``prte_plm_base_dvm_mod_notify()`` helper and, on success, admits the held jobs via ``prte_plm_base_fence_release()`` when the fence reaches zero. ``grow_target_failed()`` returns ``bool`` and, for the matched campaign, removes it, drops its fence contribution, calls ``grow_rollback()`` (terminate started daemons via the shrink command, revert never-started daemon counts, detach nodes), emits ``PMIX_ERR_DVM_MOD``, and aborts the pre-map held jobs. * - ``src/mca/plm/base/plm_private.h`` - Declare ``prte_plm_base_grow_drain()`` and the now-``bool``-returning ``prte_plm_base_grow_target_failed()`` (alongside the shared ``fence_release`` / ``abort_premap_held`` / ``dvm_mod_notify`` helpers, so all the ``prte_plm_base_*`` launch-fence helpers live in the one header the errmgr already includes). * - ``src/mca/errmgr/dvm/errmgr_dvm.c`` - In the daemon comm-failure block, ``goto cleanup`` when ``prte_plm_base_grow_target_failed()`` returns ``true``: the grow rollback has fully absorbed the loss, so the general daemon-loss handling (which would otherwise abort the whole DVM) must be skipped. * - ``src/mca/state/dvm/state_dvm.c`` - Drain on success in ``vm_ready`` after WIREUP; drop the per-error fence manipulation (the DVM is force-exiting); convert the ``check_job_complete`` safety net to a drain. Follow-up --------- * **Campaign-object unification.** The grow and shrink campaign objects are structurally similar and could be unified into a single ``prte_launch_campaign_t`` with a ``kind`` discriminator in a future cleanup. That was intentionally deferred here to avoid disturbing the ``LAUNCH_APPS`` hold and remap-on-release logic, which must remain shrink-only.