8.3.2. Elastic DVM Implementation Plan

This document describes the implementation of the launch fence — the shared mechanism that serialises application-job dispatch against in-progress DVM grow and shrink campaigns, closing the race between a DVM size change and concurrently-running application jobs. For background on the race itself see Job Launch State Machine, section DVM Extension and the Daemon-Launch Race.

The externally observable contract this implementation delivers — the job-admission and placement guarantees, and the two-phase completion notification — is specified in Elastic DVM: Specification, which is authoritative for observable behavior. Where this plan and that specification disagree about observable behavior, the specification wins and this plan must be corrected.

This plan is the parent of two campaign-specific plans:

  • DVM Grow-Campaign Fence Tracking — the grow (daemon-launch) path’s per-campaign fence accounting, failure rollback, and success/failure completion events.

  • DVM Shrink-Campaign Fence Tracking — the shrink (node-removal) path’s campaign tracking, the second (LAUNCH_APPS) hold point, completion detection, the RM-side resource release that runs at completion (a release may span multiple allocations, so every active RAS module is offered the completed campaign), and completion events.

It covers the shared infrastructure both paths build on: the fence counter, the held-job arrays, the VM_READY → MAP hold point, the fence-release helper, and the completion-event emission common to both.

The mechanism is a global launch fence: a counter (prte_dvm_launch_fence) that tracks the number of in-progress grow and shrink campaigns. An app job that reaches the VM_READY → MAP transition checks the fence; if it is nonzero, the job parks itself in a held-job array (prte_held_jobs) and is released when the fence reaches zero.

The state machine is single-threaded on the progress thread, so no locking is required anywhere in this plan.
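
The park-and-release rule can be modelled in a few lines. The following is a self-contained toy, not PRRTE code: every fence_ name, the fixed-size array, and the integer job IDs are assumptions made for illustration. Only the counting rule is taken from the text above. No lock appears because, as in the real design, everything is assumed to run on one thread.

```c
#include <assert.h>
#include <stddef.h>

#define FENCE_MAX_HELD 8

/* Hypothetical stand-ins for prte_dvm_launch_fence and prte_held_jobs. */
static int    fence_count = 0;
static int    fence_held[FENCE_MAX_HELD];
static size_t fence_nheld = 0;
static size_t fence_nmapped = 0;   /* jobs that went on toward MAP */

/* A campaign starts: raise the fence. */
static void fence_campaign_begin(void)
{
    fence_count++;
}

/* A job reaches the VM_READY -> MAP boundary. */
static void fence_job_at_vm_ready(int jobid)
{
    if (0 < fence_count) {
        assert(fence_nheld < FENCE_MAX_HELD);
        fence_held[fence_nheld++] = jobid;   /* park the job */
        return;
    }
    fence_nmapped++;                         /* nothing in flight: proceed */
}

/* A campaign completes: lower the fence and, at zero, admit every
 * parked job (the toy simply counts them as having proceeded). */
static void fence_campaign_end(void)
{
    assert(0 < fence_count);
    if (0 == --fence_count) {
        fence_nmapped += fence_nheld;
        fence_nheld = 0;
    }
}
```

With two campaigns in flight, a job that arrives stays parked after the first campaign ends and proceeds only when the second one ends.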

Gated on elastic mode. Every piece of this machinery is active only when the DVM is in elastic mode (the pre-existing prte_elastic_mode MCA parameter, off by default). The gate is applied at the two points that raise the fence — grow-campaign creation in setup_virtual_machine() and shrink-campaign creation in the PMIX_ALLOC_RELEASE handler — so that outside elastic mode the fence is never raised, the campaign lists stay empty, and every downstream check (the VM_READY and LAUNCH_APPS holds, the errmgr campaign matching, the drains) is naturally inert. The consumer sites also carry an explicit prte_elastic_mode guard so the non-elastic path is provably identical to the pre-feature behavior — in particular, prte_plm_base_grow_target_failed() returns false immediately, so a daemon loss on a fixed-size DVM is handled by the ordinary errmgr abort path exactly as before.
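
A minimal sketch of the two layers of gating, assuming a boolean flag that mirrors prte_elastic_mode. The gate_ names are hypothetical and the functions are reduced to the decision itself; the real raise points and consumer sites do considerably more.

```c
#include <stdbool.h>

static bool gate_elastic_mode = false;  /* stand-in for prte_elastic_mode */
static int  gate_fence = 0;             /* stand-in for the launch fence */

/* Raise point (grow or shrink campaign creation): outside elastic mode
 * no campaign is created and the fence is never raised.
 * Returns true if a campaign was started. */
static bool gate_begin_campaign(void)
{
    if (!gate_elastic_mode) {
        return false;
    }
    gate_fence++;
    return true;
}

/* Consumer site (for example the VM_READY hold): the fence test alone
 * would already be inert, but the explicit guard makes the non-elastic
 * path obviously identical to the pre-feature behavior. */
static bool gate_should_park(void)
{
    return gate_elastic_mode && 0 < gate_fence;
}
```

The explicit guard at the consumer is redundant by construction, which is the point: the non-elastic behavior can be verified by reading one line rather than by tracing every place the fence is raised.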

Note

The app-triggered expansion path (--add-host / --add-hostfile) already sets prte_dvm_ready = false in add_hosts() before posting the asynchronous RAS modify request, which causes newly-arriving jobs to be stashed in prte_cache rather than dispatched immediately. The launch fence is still required for the scheduler-push path (e.g., Slurm firing LAUNCH_DAEMONS directly) where prte_dvm_ready is never cleared, and to ensure full correctness when both paths can interleave.

8.3.2.1. Step 1 — New state constant

In src/mca/plm/plm_types.h, add:

/* value 17 is currently unused in the running-state band */
#define PRTE_JOB_STATE_WAITING_FOR_DAEMONS  17

Add a corresponding string to src/util/error_strings.c.

This state is used purely as a marker so that debugging tools and verbose output show clearly why a job is parked; no callback is registered for it.
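
The string mapping might look like the sketch below. It is self-contained and uses hypothetical sk_ names; the value 17 comes from this step, while the wording of the string and the shape of the lookup function are assumptions, not the contents of error_strings.c.

```c
/* Local copy of the new constant so that the sketch compiles on its own. */
#define SK_JOB_STATE_WAITING_FOR_DAEMONS 17

typedef int sk_job_state_t;

/* Map a job state to the text shown in verbose output and debugging tools. */
static const char *sk_job_state_to_str(sk_job_state_t state)
{
    switch (state) {
    case SK_JOB_STATE_WAITING_FOR_DAEMONS:
        return "WAITING FOR DAEMONS";   /* wording is an assumption */
    default:
        return "UNKNOWN STATE";
    }
}
```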

8.3.2.2. Step 2 — New global fence and held-job arrays

In src/runtime/prte_globals.c and src/runtime/prte_globals.h, add:

/* counts in-progress grow and shrink campaigns */
int prte_dvm_launch_fence = 0;

/* jobs parked at the VM_READY → MAP boundary */
pmix_pointer_array_t *prte_held_jobs;

/* jobs parked at the LAUNCH_APPS boundary during a shrink */
pmix_pointer_array_t *prte_prelaunch_held_jobs;

Initialize both arrays in src/runtime/prte_init.c alongside the existing prte_cache initialization:

prte_held_jobs = PMIX_NEW(pmix_pointer_array_t);
pmix_pointer_array_init(prte_held_jobs, 1, INT_MAX, 1);

prte_prelaunch_held_jobs = PMIX_NEW(pmix_pointer_array_t);
pmix_pointer_array_init(prte_prelaunch_held_jobs, 1, INT_MAX, 1);

Release both (PMIX_RELEASE, since they are allocated with PMIX_NEW) in src/runtime/prte_finalize.c.

The grow and shrink campaign lists (prte_grow_campaigns and prte_shrink_campaigns) that drive the fence are declared, constructed, and destructed alongside these globals; their types and lifecycles are specified in the two child plans.

8.3.2.3. Step 3 — Park jobs at the VM_READY → MAP boundary

This is the first of the two hold points and is shared by both paths: it stops any in-progress campaign (grow or shrink) from letting a freshly-arriving job map onto a node whose daemon is not ready.

In vm_ready(), the code at line 360 is reached only by app jobs (the daemon-job branch returns at line 357). This is immediately before prte_filem.preposition_files(), which leads to files_ready → MAP. Add the hold check here:

/* explicit elastic-mode guard (see "Gated on elastic mode"); outside
 * elastic mode the fence is never raised, so this is a no-op there */
if (prte_elastic_mode && 0 < prte_dvm_launch_fence) {
    /* a grow or shrink campaign is in progress: park this job */
    caddy->jdata->state = PRTE_JOB_STATE_WAITING_FOR_DAEMONS;
    /* the held-job array keeps its own reference to the job */
    PMIX_RETAIN(caddy->jdata);
    pmix_pointer_array_add(prte_held_jobs, caddy->jdata);
    PMIX_RELEASE(caddy);
    return;
}
/* position any required files */
if (PRTE_SUCCESS !=
        prte_filem.preposition_files(caddy->jdata, files_ready, caddy->jdata)) {
    PRTE_ACTIVATE_JOB_STATE(caddy->jdata, PRTE_JOB_STATE_FILES_POSN_FAILED);
}
PMIX_RELEASE(caddy);

The second hold point — at LAUNCH_APPS, guarded by the shrink campaign list rather than the fence counter — is shrink-specific and is described in DVM Shrink-Campaign Fence Tracking.
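
The reference counting in the VM_READY hold check is easy to get wrong, so here is a toy model of it. It assumes that releasing the caddy drops the caddy's own reference to the job, which is why the job is retained before it is stored. All park_ names are hypothetical and a single slot stands in for the held-job array.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int refs; } park_job_t;             /* toy job object */
typedef struct { park_job_t *jdata; } park_caddy_t;  /* toy state caddy */

static park_job_t *park_slot = NULL;  /* one-entry stand-in for prte_held_jobs */

/* Stand-in for PMIX_RELEASE(caddy): assumed to drop the caddy's job reference. */
static void park_caddy_release(park_caddy_t *caddy)
{
    caddy->jdata->refs--;
    caddy->jdata = NULL;
}

/* Same order as the hold check: retain, store, then release the caddy. */
static void park_job(park_caddy_t *caddy)
{
    caddy->jdata->refs++;        /* reference now owned by the array */
    park_slot = caddy->jdata;
    park_caddy_release(caddy);
    assert(0 < park_slot->refs); /* the parked job outlives the caddy */
}
```

If the retain were omitted, a job whose only reference was the caddy's would drop to zero references while still sitting in the array.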

8.3.2.4. Step 4 — Held-job release helpers

There are two distinct ways a held job leaves its parked state, and they are not symmetric, so they are handled by two separate helpers rather than a single bool success flag:

  • Global success. When the global fence reaches zero — every grow and shrink campaign has completed successfully — both classes of held job are admitted. This is prte_plm_base_fence_release().

  • Grow failure. When a grow campaign fails (see DVM Grow-Campaign Fence Tracking, Failure drain and rollback), the spec requires the whole pre-map held-job set to be aborted — the first-failure semantics of a non-elastic launch. But a grow failure must not touch the pre-launch held jobs: those are parked solely on account of an in-progress shrink (the LAUNCH_APPS hold is gated on the shrink list, not the fence), so they do not wait on the grow, and the spec’s conformance guarantee #4 states that a daemon failure may affect only the jobs waiting on the campaign it belongs to. This asymmetric abort is prte_plm_base_abort_premap_held().

Folding both paths into one fence_release(bool success) was the original shape of this plan, but it could not honor the spec when a grow failed while a shrink was still in progress. A single global success flag cannot express “abort the grow’s waiters but leave the shrink’s waiters parked,” and gating the failure abort on the fence reaching zero would let a later shrink-success release admit a pre-map job whose grow dependency had already failed (the last campaign to drain — the shrink — would call the release with success == true). Splitting the two release paths closes that gap: the grow-failure abort fires immediately on the pre-map array regardless of the fence value, and the success release is reached only when no campaign has failed.
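
The asymmetry can be seen in a toy model that replaces the two arrays with counters. The ab_ names are hypothetical, and the sketch covers only the two dispositions described above; it does not model how or when the fence itself is lowered.

```c
static int ab_premap_held = 0;     /* jobs parked at VM_READY */
static int ab_prelaunch_held = 0;  /* jobs parked at LAUNCH_APPS */
static int ab_admitted = 0;        /* jobs allowed to continue */
static int ab_aborted = 0;         /* jobs failed to NEVER_LAUNCHED */

/* Grow failure: abort the pre-map waiters only, whatever the fence value. */
static void ab_abort_premap_held(void)
{
    ab_aborted += ab_premap_held;
    ab_premap_held = 0;
    /* ab_prelaunch_held is deliberately left untouched */
}

/* Global success (fence reached zero): admit both classes of held job. */
static void ab_fence_release(void)
{
    ab_admitted += ab_premap_held + ab_prelaunch_held;
    ab_premap_held = 0;
    ab_prelaunch_held = 0;
}
```

Starting from two pre-map waiters and one pre-launch waiter, a grow failure aborts the two and leaves the one parked; a later success release then admits only that remaining job. A single release function with a success flag has no way to produce this outcome.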

Both helpers are declared in src/mca/plm/base/plm_private.h (the header the errmgr, ras, and state callers already include for the existing prte_plm_base_* launch helpers) and defined in plm_base_launch_support.c:

/* SUCCESS release — invoked only when the global fence reaches zero, i.e.
 * every grow and shrink campaign has completed successfully.  Admits both
 * classes of held job and defensively sweeps any residual campaigns. */
void prte_plm_base_fence_release(void)
{
    int _hi;
    prte_job_t *_held;

    /* --- pre-map held jobs (parked at VM_READY) --- */
    for (_hi = 0; _hi < prte_held_jobs->size; _hi++) {
        _held = (prte_job_t *)
            pmix_pointer_array_get_item(prte_held_jobs, _hi);
        if (NULL == _held) {
            continue;
        }
        pmix_pointer_array_set_item(prte_held_jobs, _hi, NULL);
        PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_VM_READY);
        PMIX_RELEASE(_held);
    }

    /* --- pre-launch held jobs (parked at LAUNCH_APPS) --- */
    for (_hi = 0; _hi < prte_prelaunch_held_jobs->size; _hi++) {
        _held = (prte_job_t *)
            pmix_pointer_array_get_item(prte_prelaunch_held_jobs, _hi);
        if (NULL == _held) {
            continue;
        }
        pmix_pointer_array_set_item(prte_prelaunch_held_jobs, _hi, NULL);
        if (prte_plm_base_job_needs_remap(_held)) {
            prte_plm_base_reset_proc_map(_held);
            PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_MAP);
        } else {
            PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_LAUNCH_APPS);
        }
        PMIX_RELEASE(_held);
    }

    /* Campaigns are removed individually as their last target drains, so
     * both lists should be empty here.  Sweep each defensively anyway — so
     * a future change that can leave a residual campaign behind cannot wedge
     * the fence — and sweep *both* kinds for symmetry, not just shrink. */
    {
        prte_shrink_campaign_t *_sc, *_sn;
        PMIX_LIST_FOREACH_SAFE(_sc, _sn,
                               &prte_shrink_campaigns, prte_shrink_campaign_t) {
            pmix_list_remove_item(&prte_shrink_campaigns, &_sc->super);
            PMIX_RELEASE(_sc);
        }
    }
    {
        prte_grow_campaign_t *_gc, *_gn;
        PMIX_LIST_FOREACH_SAFE(_gc, _gn,
                               &prte_grow_campaigns, prte_grow_campaign_t) {
            pmix_list_remove_item(&prte_grow_campaigns, &_gc->super);
            PMIX_RELEASE(_gc);
        }
    }
}

/* GROW-FAILURE abort — fails every job parked at the VM_READY -> MAP
 * boundary to NEVER_LAUNCHED.  Called from the grow failure drain only, and
 * independent of the fence value, so a grow failure aborts its pre-map
 * waiters even while a concurrent shrink keeps the fence nonzero.  It
 * deliberately leaves prte_prelaunch_held_jobs untouched: those jobs wait
 * only on a shrink, never on the grow (conformance #4). */
void prte_plm_base_abort_premap_held(void)
{
    int _hi;
    prte_job_t *_held;

    for (_hi = 0; _hi < prte_held_jobs->size; _hi++) {
        _held = (prte_job_t *)
            pmix_pointer_array_get_item(prte_held_jobs, _hi);
        if (NULL == _held) {
            continue;
        }
        pmix_pointer_array_set_item(prte_held_jobs, _hi, NULL);
        PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_NEVER_LAUNCHED);
        PMIX_RELEASE(_held);
    }
}

The pre-launch branch of fence_release() calls two shrink-specific helpers — prte_plm_base_job_needs_remap() (does any held proc sit on a departing daemon?) and prte_plm_base_reset_proc_map() (un-claim the previous mapping so the job can be remapped onto survivors) — specified in DVM Shrink-Campaign Fence Tracking. Because shrink completion treats a clean exit and a crash identically (a targeted daemon’s departure is always a success for its campaign), neither held-job array is ever failed on the shrink path; the only failure disposition of a held job is the grow-failure abort above.
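
The remap decision in the pre-launch branch reduces to a set-intersection test. The sketch below is illustrative only: it represents nodes as plain integers and uses hypothetical rm_ names, whereas the real helpers operate on PRRTE job, proc, and node objects as specified in the shrink plan.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { RM_NEXT_MAP, RM_NEXT_LAUNCH_APPS } rm_next_t;

/* True if any of the job's procs is placed on a departing node. */
static bool rm_job_needs_remap(const int *proc_nodes, size_t nprocs,
                               const int *departing, size_t ndeparting)
{
    for (size_t p = 0; p < nprocs; p++) {
        for (size_t d = 0; d < ndeparting; d++) {
            if (proc_nodes[p] == departing[d]) {
                return true;
            }
        }
    }
    return false;
}

/* State a released pre-launch job is sent to: back to MAP when its old
 * mapping is stale, otherwise straight on to LAUNCH_APPS. */
static rm_next_t rm_next_state(const int *proc_nodes, size_t nprocs,
                               const int *departing, size_t ndeparting)
{
    return rm_job_needs_remap(proc_nodes, nprocs, departing, ndeparting)
               ? RM_NEXT_MAP
               : RM_NEXT_LAUNCH_APPS;
}
```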

prte_plm_base_fence_release() acts when the global fence reaches zero, which requires all grow and shrink campaigns to have completed. The per-campaign completion event (Step 5) is distinct: it fires for an individual request’s campaign when that campaign drains, independent of whether other campaigns are still in flight.
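
The difference between the per-campaign event and the global release can be sketched as two counters updated by the same drain step. This is a toy for successful drains only, with hypothetical cf_ names; it is not the drain logic of either child plan.

```c
#include <assert.h>
#include <stdbool.h>

static int cf_fence = 0;     /* campaigns still in flight */
static int cf_events = 0;    /* per-campaign completion events sent */
static int cf_releases = 0;  /* times the global release ran */

/* One campaign drains successfully. */
static void cf_campaign_drained(bool have_requester)
{
    if (have_requester) {
        cf_events++;          /* fires now, whatever the fence value */
    }
    assert(0 < cf_fence);
    if (0 == --cf_fence) {
        cf_releases++;        /* only the last campaign triggers this */
    }
}
```

With two campaigns in flight, the first drain produces an event but no release; the release happens only on the second drain.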

8.3.2.5. Step 5 — Completion-event emission (shared helper)

The spec’s two-phase contract (see Elastic DVM: Specification, Asynchronous size-change completion) requires that, when an accepted DVM operation finishes, the runtime deliver a directed event to the process that requested the size change: PMIX_DVM_IS_READY on success or PMIX_ERR_DVM_MOD (carrying the underlying cause) on failure.

Both campaign objects therefore record the requester so the event can be directed once the campaign drains:

pmix_proc_t  requester;       /* who requested the size change */
char        *alloc_id;        /* PMIX_ALLOC_ID of the affected allocation */
char        *req_id;          /* requester's PMIX_ALLOC_REQ_ID, or NULL */
bool         have_requester;  /* false for a scheduler push */

These fields are populated where the campaign is created, from the allocation request that drove the operation:

  • Shrink — directly in the PMIX_ALLOC_RELEASE handler, from the request object: requester is the request’s tproc, and alloc_id / req_id are read from its PMIX_ALLOC_ID / PMIX_ALLOC_REQ_ID info keys.

  • Grow — in setup_virtual_machine(), indirectly through the session. The RAS reservation machinery already records the driving request on the session and back-points every reserved node at it (add_nodes_to_session() sets node->session; the session carries requestor, alloc_refid, and user_refid). The grow campaign reads those from the first new daemon’s node->session, so the originating request need not be threaded explicitly into the launch path.

A size change initiated with no PMIx requester (a scheduler push, or the initial DVM bring-up, where the session is the default one or its requestor rank is PMIX_RANK_INVALID) leaves have_requester false and emits no event.
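
The have_requester rule for a requester-less size change can be written as a small predicate. The sketch assumes a simplified session record: the hr_ names and fields are hypothetical, HR_RANK_INVALID is an arbitrary sentinel standing in for PMIX_RANK_INVALID (whose real value comes from the PMIx headers), and treating a missing session as "no requester" is an additional assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HR_RANK_INVALID UINT32_MAX   /* arbitrary sentinel for the sketch */

typedef struct {
    bool     is_default;       /* the default session (initial bring-up) */
    uint32_t requestor_rank;   /* rank of the requesting process */
} hr_session_t;

/* True only when a real PMIx process drove the size change. */
static bool hr_have_requester(const hr_session_t *session)
{
    return NULL != session
        && !session->is_default
        && HR_RANK_INVALID != session->requestor_rank;
}
```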

A single shared helper, declared in src/mca/plm/base/plm_private.h and defined in plm_base_launch_support.c, performs the emission. Its prototype takes a bool success rather than an event code, so that the two new status codes are named only inside the helper body — call sites pass a bool and a cause and never reference PMIX_DVM_IS_READY / PMIX_ERR_DVM_MOD themselves:

/* success => emit PMIX_DVM_IS_READY; otherwise emit PMIX_ERR_DVM_MOD
 * carrying `cause` (the underlying failure pmix_status_t) */
void prte_plm_base_dvm_mod_notify(const pmix_proc_t *requester,
                                  const char *alloc_id,
                                  const char *req_id,
                                  bool success,
                                  pmix_status_t cause);

It packs PMIX_ALLOC_ID (always), PMIX_ALLOC_REQ_ID (when req_id is non-NULL), and, on failure, the underlying cause status. The cause is carried under PMIX_JOB_TERM_STATUS, the standard pmix_status_t-typed info key; the PMIx contract for PMIX_ERR_DVM_MOD asks only for “any available information describing the cause”. The helper then delivers the event only to requester as a directed, custom-range notification, the same PMIX_RANGE_CUSTOM mechanism used for PMIX_ALLOC_TIMEOUT_WARNING.
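
Which keys end up in the event can be sketched without any PMIx calls. The function below is hypothetical (pk_pack is not a PRRTE or PMIx API) and records the names of the key macros as plain strings, not the key values those macros expand to.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fill keys[] with the names of the info keys the event would carry and
 * return how many were packed (at most 3). */
static size_t pk_pack(const char *req_id, bool success, const char *keys[3])
{
    size_t n = 0;

    keys[n++] = "PMIX_ALLOC_ID";              /* always present */
    if (NULL != req_id) {
        keys[n++] = "PMIX_ALLOC_REQ_ID";      /* only if the requester gave one */
    }
    if (!success) {
        keys[n++] = "PMIX_JOB_TERM_STATUS";   /* carries the failure cause */
    }
    return n;
}
```

A success with no request ID therefore carries one key, and a failure with a request ID carries all three.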

The grow and shrink plans call this helper at their respective drain points: the grow path inside prte_plm_base_grow_drain() (reached on success from vm_ready after the WIREUP xcast, and on failure from prte_plm_base_grow_target_failed() and the check_job_complete safety net); the shrink path when a campaign’s last target departs (success, in the errmgr) and on the xcast-failure cleanup at campaign creation (failure).

PMIX_DVM_IS_READY and PMIX_ERR_DVM_MOD are plain #defined pmix_status_t values (PMIx status codes are preprocessor macros, not enum constants), so their availability is decided entirely by whether the installed PMIx headers define the symbols — no PMIx capability flag is involved (the PRTE_CHECK_PMIX_CAP machinery is for PMIX_CAP_* behavioral flags and does not apply here).

To keep the project’s #if FOO discipline (the guard macro is always defined, so a mistyped name can be flagged by the compiler, for example through -Wundef, instead of being a silently false #ifdef), a probe in config/prte_setup_pmix.m4 defines PRTE_HAVE_DVM_MOD_EVENTS to 0 or 1 from the presence of the two symbols:

AC_MSG_CHECKING([for PMIx DVM modification event codes])
AC_PREPROC_IFELSE(
    [AC_LANG_PROGRAM([[#include <pmix.h>
#if !defined(PMIX_DVM_IS_READY) || !defined(PMIX_ERR_DVM_MOD)
#error DVM modification event codes not present
#endif
]], [[]])],
    [AC_MSG_RESULT([yes])
     AC_DEFINE([PRTE_HAVE_DVM_MOD_EVENTS], [1],
               [PMIx defines the DVM modification event codes])],
    [AC_MSG_RESULT([no])
     AC_DEFINE([PRTE_HAVE_DVM_MOD_EVENTS], [0],
               [PMIx defines the DVM modification event codes])])

The macro is defined to 0 or 1 on both branches (never #undefed), so it can be tested with #if PRTE_HAVE_DVM_MOD_EVENTS. Because the helper’s bool-based prototype keeps the two codes out of every call site, only the helper body needs the guard: when PRTE_HAVE_DVM_MOD_EVENTS is 0 the body compiles to a no-op (the bool/pmix_status_t prototype still compiles), the call sites are unchanged, and no completion event is delivered — exactly as the spec’s backward-compatibility clause requires. Because this touches *.m4, ./autogen.pl must be re-run before configuring.
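
A sketch of how the guard confines the two codes to the helper body. The gd_ names and the event counter are hypothetical, and the fallback definition of PRTE_HAVE_DVM_MOD_EVENTS exists only so that the sketch builds without configure; in the real tree the macro always comes from the probe.

```c
#include <stdbool.h>

/* configure defines this to 0 or 1; default to 0 so the sketch builds alone */
#ifndef PRTE_HAVE_DVM_MOD_EVENTS
#define PRTE_HAVE_DVM_MOD_EVENTS 0
#endif

static int gd_events_emitted = 0;

/* Call sites pass only a bool and a cause, so they compile unchanged
 * whether or not the event codes exist. */
static void gd_dvm_mod_notify(bool success, int cause)
{
#if PRTE_HAVE_DVM_MOD_EVENTS
    /* the real helper would select PMIX_DVM_IS_READY or PMIX_ERR_DVM_MOD
     * here and deliver the event; the sketch only counts the emission */
    (void) success;
    (void) cause;
    gd_events_emitted++;
#else
    /* codes unavailable: the helper is a no-op */
    (void) success;
    (void) cause;
#endif
}
```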

8.3.2.6. Summary of Files Changed (Shared Fence Infrastructure)

File
    Change

src/mca/plm/plm_types.h
    Add PRTE_JOB_STATE_WAITING_FOR_DAEMONS = 17.

src/util/error_strings.c
    Add string for PRTE_JOB_STATE_WAITING_FOR_DAEMONS.

src/runtime/prte_globals.h
    Declare prte_dvm_launch_fence, prte_held_jobs, and prte_prelaunch_held_jobs.

src/runtime/prte_globals.c
    Define and initialize prte_dvm_launch_fence = 0; define the prte_held_jobs and prte_prelaunch_held_jobs pointers.

src/runtime/prte_init.c
    Allocate and init prte_held_jobs and prte_prelaunch_held_jobs.

src/runtime/prte_finalize.c
    Release prte_held_jobs and prte_prelaunch_held_jobs.

src/mca/plm/base/plm_base_launch_support.c
    Define prte_plm_base_fence_release() and prte_plm_base_abort_premap_held() (Step 4) and prte_plm_base_dvm_mod_notify() (Step 5).

src/mca/plm/base/plm_private.h
    Declare prte_plm_base_fence_release(), prte_plm_base_abort_premap_held(), and prte_plm_base_dvm_mod_notify() (and, for the grow path, prte_plm_base_grow_drain() / prte_plm_base_grow_target_failed(); see DVM Grow-Campaign Fence Tracking). This is the header the errmgr, ras, and state callers already include.

src/mca/state/dvm/state_dvm.c
    In vm_ready: add the VM_READY → MAP hold-check before preposition_files (Step 3).

config/prte_setup_pmix.m4
    Add the AC_PREPROC_IFELSE probe that defines PRTE_HAVE_DVM_MOD_EVENTS (0/1) from the presence of PMIX_DVM_IS_READY / PMIX_ERR_DVM_MOD (Step 5). Re-run autogen.pl afterward.

For the grow path’s file changes see the “Touched files” table in DVM Grow-Campaign Fence Tracking; for the shrink path’s, the “Touched files” table in DVM Shrink-Campaign Fence Tracking.

8.3.2.7. Design Invariants

Shared fence

  • The fence is a single int accessed only on the progress thread, so all increments, decrements, and the zero test are race-free without locking.

  • A job is parked iff the fence is nonzero at the VM_READY → MAP boundary (Step 3); the LAUNCH_APPS hold (shrink plan) is gated on the shrink campaign list, not the fence, so a concurrent grow does not stall an already-mapped job on surviving nodes.

  • prte_plm_base_fence_release() is the success-only release; it is called only when the fence reaches zero, which requires all grow and shrink campaigns to have completed successfully. The campaign lists are therefore empty (or nearly so — fence_release does a defensive sweep of both the grow and shrink lists for the degenerate case where some future change leaves a partially-setup campaign behind).

  • A grow failure is the only path that fails a held job. It calls prte_plm_base_abort_premap_held(), which aborts the pre-map held jobs (prte_held_jobs → NEVER_LAUNCHED) immediately and independently of the fence, and never touches prte_prelaunch_held_jobs: those jobs wait only on a shrink, so per conformance #4 a grow failure must leave them parked until the shrink completes.

  • The per-campaign completion event (Step 5) is independent of the global fence: it fires when an individual request’s campaign drains, even if other campaigns keep the fence nonzero.

Grow fence — see the “Why this is correct” and “Design” sections of DVM Grow-Campaign Fence Tracking.

Shrink fence — see the “Design Invariants” section of DVM Shrink-Campaign Fence Tracking.