8.3.2. Elastic DVM Implementation Plan
This document describes the implementation of the launch fence — the shared mechanism that serialises application-job dispatch against in-progress DVM grow and shrink campaigns, closing the race between a DVM size change and concurrently-running application jobs. For background on the race itself see Job Launch State Machine, section DVM Extension and the Daemon-Launch Race.
The externally observable contract this implementation delivers — the job-admission and placement guarantees, and the two-phase completion notification — is specified in Elastic DVM: Specification, which is authoritative for observable behavior. Where this plan and that specification disagree about observable behavior, the specification wins and this plan must be corrected.
This plan is the parent of two campaign-specific plans:
DVM Grow-Campaign Fence Tracking: the grow (daemon-launch) path’s per-campaign fence accounting, failure rollback, and success/failure completion events.
DVM Shrink-Campaign Fence Tracking: the shrink (node-removal) path’s campaign tracking, the second (LAUNCH_APPS) hold point, completion detection, the RM-side resource release that runs at completion (a release may span multiple allocations, so every active RAS module is offered the completed campaign), and completion events.
It covers the shared infrastructure both paths build on: the fence counter,
the held-job arrays, the VM_READY → MAP hold point, the fence-release
helper, and the completion-event emission common to both.
The mechanism is a global launch fence: a counter
(prte_dvm_launch_fence) that tracks the number of in-progress grow and
shrink campaigns. An app job that reaches the VM_READY → MAP transition
checks the fence; if it is nonzero the job parks itself in a held-job array
(prte_held_jobs) and is released when the fence reaches zero.
The state machine is single-threaded on the progress thread, so no locking is required anywhere in this plan.
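The counter-and-park behavior can be modelled in a few lines. The following is a minimal sketch, not PRRTE code: the names fence, held, mapped, campaign_begin(), campaign_end(), and job_reaches_vm_ready() are hypothetical stand-ins, jobs are reduced to integer ids, and the failure path is ignored.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_JOBS 16

/* Toy stand-ins for the real globals; all names here are illustrative. */
static int fence = 0;            /* models prte_dvm_launch_fence */
static int held[MAX_JOBS];       /* models prte_held_jobs (job ids only) */
static size_t num_held = 0;
static int mapped[MAX_JOBS];     /* jobs that were allowed on toward MAP */
static size_t num_mapped = 0;

/* A grow or shrink campaign starts: raise the fence. */
static void campaign_begin(void)
{
    fence++;
}

/* A job reaches the VM_READY -> MAP boundary. Returns 1 if it was parked. */
static int job_reaches_vm_ready(int job_id)
{
    if (fence > 0) {
        assert(num_held < MAX_JOBS);
        held[num_held++] = job_id;
        return 1;
    }
    assert(num_mapped < MAX_JOBS);
    mapped[num_mapped++] = job_id;
    return 0;
}

/* A campaign completes: lower the fence and, at zero, admit every parked job. */
static void campaign_end(void)
{
    assert(fence > 0);
    if (--fence == 0) {
        for (size_t i = 0; i < num_held; i++) {
            assert(num_mapped < MAX_JOBS);
            mapped[num_mapped++] = held[i];
        }
        num_held = 0;
    }
}
```

With two campaigns in flight, a job that arrives stays parked after the first campaign_end() and is admitted only by the second. No lock appears anywhere because, as in the plan, every call is assumed to run on one thread.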
Gated on elastic mode. Every piece of this machinery is active only when
the DVM is in elastic mode (the pre-existing prte_elastic_mode MCA
parameter, off by default). The gate is applied at the two points that raise
the fence — grow-campaign creation in setup_virtual_machine() and
shrink-campaign creation in the PMIX_ALLOC_RELEASE handler — so that outside
elastic mode the fence is never raised, the campaign lists stay empty, and every
downstream check (the VM_READY and LAUNCH_APPS holds, the errmgr
campaign matching, the drains) is naturally inert. The consumer sites also
carry an explicit prte_elastic_mode guard so the non-elastic path is
provably identical to the pre-feature behavior — in particular,
prte_plm_base_grow_target_failed() returns false immediately, so a
daemon loss on a fixed-size DVM is handled by the ordinary errmgr abort path
exactly as before.
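The two kinds of gate can be illustrated with a small model. This is a sketch under stated assumptions: elastic_mode and launch_fence stand in for prte_elastic_mode and prte_dvm_launch_fence, try_begin_campaign() is a hypothetical producer-side helper, and the two-integer signature of grow_target_failed() is invented for illustration (the real prte_plm_base_grow_target_failed() signature is defined in the grow plan, not here).

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins, not the PRRTE definitions. */
static bool elastic_mode = false;   /* models the prte_elastic_mode parameter */
static int  launch_fence = 0;       /* models prte_dvm_launch_fence */

/* Producer-side gate: a campaign raises the fence only in elastic mode. */
static bool try_begin_campaign(void)
{
    if (!elastic_mode) {
        return false;   /* fixed-size DVM: the fence stays at zero */
    }
    launch_fence++;
    return true;
}

/* Consumer-side guard (hypothetical signature): outside elastic mode it
 * reports "not a campaign failure", so the caller falls through to the
 * ordinary abort path. */
static bool grow_target_failed(int lost_daemon, int campaign_target)
{
    if (!elastic_mode) {
        return false;
    }
    return lost_daemon == campaign_target;
}
```

The producer gate alone would keep the fence at zero; the explicit consumer guard is what makes the non-elastic path checkable by inspection rather than by reasoning about empty lists.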
Note
The app-triggered expansion path (--add-host / --add-hostfile)
already sets prte_dvm_ready = false in add_hosts() before posting
the asynchronous RAS modify request, which causes newly-arriving jobs to be
stashed in prte_cache rather than dispatched immediately. The launch
fence is still required for the scheduler-push path (e.g., Slurm firing
LAUNCH_DAEMONS directly) where prte_dvm_ready is never cleared, and
to ensure full correctness when both paths can interleave.
8.3.2.1. Step 1 — New state constant
In src/mca/plm/plm_types.h, add:
/* value 17 is currently unused in the running-state band */
#define PRTE_JOB_STATE_WAITING_FOR_DAEMONS 17
Add a corresponding string to src/util/error_strings.c.
This state is used purely as a marker so that debugging tools and verbose output show clearly why a job is parked; no callback is registered for it.
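As a rough illustration of the marker role, the sketch below maps the new constant to a display string and formats a verbose line from it. The helper names example_job_state_str() and example_describe_job() and the exact string text are assumptions made for this example; the real table lives in src/util/error_strings.c.

```c
#include <assert.h>
#include <stdio.h>

/* Value taken from the plan above. */
#define PRTE_JOB_STATE_WAITING_FOR_DAEMONS 17

/* Hypothetical lookup helper standing in for the real string table. */
static const char *example_job_state_str(int state)
{
    switch (state) {
    case PRTE_JOB_STATE_WAITING_FOR_DAEMONS:
        return "WAITING FOR DAEMONS";
    default:
        return "UNKNOWN STATE";
    }
}

/* Format the kind of line a debugging tool might print for a parked job.
 * Returns the value of snprintf(). */
static int example_describe_job(char *buf, size_t len, int job_id, int state)
{
    assert(NULL != buf && len > 0);
    return snprintf(buf, len, "job %d: %s", job_id,
                    example_job_state_str(state));
}
```

Because no callback is registered for the state, nothing in this sketch dispatches on it; the constant only has to be printable.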
8.3.2.2. Step 2 — New global fence and held-job arrays
In src/runtime/prte_globals.c and src/runtime/prte_globals.h, add:
/* counts in-progress grow and shrink campaigns */
int prte_dvm_launch_fence = 0;
/* jobs parked at the VM_READY → MAP boundary */
pmix_pointer_array_t *prte_held_jobs;
/* jobs parked at the LAUNCH_APPS boundary during a shrink */
pmix_pointer_array_t *prte_prelaunch_held_jobs;
Initialize both arrays in src/runtime/prte_init.c alongside the existing
prte_cache initialization:
prte_held_jobs = PMIX_NEW(pmix_pointer_array_t);
pmix_pointer_array_init(prte_held_jobs, 1, INT_MAX, 1);
prte_prelaunch_held_jobs = PMIX_NEW(pmix_pointer_array_t);
pmix_pointer_array_init(prte_prelaunch_held_jobs, 1, INT_MAX, 1);
Destruct both in src/runtime/prte_finalize.c.
The grow and shrink campaign lists (prte_grow_campaigns and
prte_shrink_campaigns) that drive the fence are declared, constructed, and
destructed alongside these globals; their types and lifecycles are specified in
the two child plans.
8.3.2.3. Step 3 — Park jobs at the VM_READY → MAP boundary
This is the first of the two hold points and is shared by both paths: it stops any in-progress campaign (grow or shrink) from letting a freshly-arriving job map onto a node whose daemon is not ready.
In vm_ready(), the code at line 360 is reached only by app jobs (the
daemon-job branch returns at line 357). This is immediately before
prte_filem.preposition_files() which leads to files_ready → MAP.
Add the hold check here:
if (0 < prte_dvm_launch_fence) {
    /* a campaign is in progress: park this job */
    caddy->jdata->state = PRTE_JOB_STATE_WAITING_FOR_DAEMONS;
    PMIX_RETAIN(caddy->jdata);
    pmix_pointer_array_add(prte_held_jobs, caddy->jdata);
    PMIX_RELEASE(caddy);
    return;
}
/* position any required files */
if (PRTE_SUCCESS !=
    prte_filem.preposition_files(caddy->jdata, files_ready, caddy->jdata)) {
    PRTE_ACTIVATE_JOB_STATE(caddy->jdata, PRTE_JOB_STATE_FILES_POSN_FAILED);
}
PMIX_RELEASE(caddy);
The second hold point — at LAUNCH_APPS, guarded by the shrink campaign list
rather than the fence counter — is shrink-specific and is described in
DVM Shrink-Campaign Fence Tracking.
8.3.2.4. Step 4 — Held-job release helpers
There are two distinct ways a held job leaves its parked state, and they are
not symmetric, so they are handled by two separate helpers rather than a
single bool success flag:
Global success. When the global fence reaches zero (every grow and shrink campaign has completed successfully), both classes of held job are admitted. This is prte_plm_base_fence_release().
Grow failure. When a grow campaign fails (see DVM Grow-Campaign Fence Tracking, Failure drain and rollback), the spec requires the whole pre-map held-job set to be aborted, matching the first-failure semantics of a non-elastic launch. But a grow failure must not touch the pre-launch held jobs: those are parked solely on account of an in-progress shrink (the LAUNCH_APPS hold is gated on the shrink list, not the fence), so they do not wait on the grow, and the spec’s conformance guarantee #4 states that a daemon failure may affect only the jobs waiting on the campaign it belongs to. This asymmetric abort is prte_plm_base_abort_premap_held().
Folding both paths into one fence_release(bool success) was the original
shape of this plan, but it could not honor the spec when a grow failed while a
shrink was still in progress. A single global success flag cannot express
“abort the grow’s waiters but leave the shrink’s waiters parked,” and gating the
failure abort on the fence reaching zero would let a later shrink-success
release admit a pre-map job whose grow dependency had already failed (the
last campaign to drain — the shrink — would call the release with
success == true). Splitting the two release paths closes that gap: the
grow-failure abort fires immediately on the pre-map array regardless of the
fence value, and the success release is reached only when no campaign has
failed.
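The scenario that motivated the split can be traced in a toy model: one job parked pre-map, one parked pre-launch, and a grow and a shrink campaign both in flight. This is a sketch only. The enum, the jobs array, and campaign_done() are hypothetical, and the model assumes that a failed grow campaign still lowers the fence when it drains; the child plans define the real accounting.

```c
#include <assert.h>
#include <stdbool.h>

enum job_status { PARKED_PREMAP, PARKED_PRELAUNCH, ADMITTED, ABORTED };

#define NJOBS 2
/* jobs[0] waits at VM_READY, jobs[1] waits at LAUNCH_APPS on the shrink. */
static enum job_status jobs[NJOBS] = { PARKED_PREMAP, PARKED_PRELAUNCH };
static int fence = 2;   /* one grow campaign plus one shrink campaign */

/* Grow-failure path: abort pre-map waiters at once, whatever the fence value. */
static void abort_premap_held(void)
{
    for (int i = 0; i < NJOBS; i++) {
        if (jobs[i] == PARKED_PREMAP) {
            jobs[i] = ABORTED;
        }
    }
}

/* Success path: admit every job that is still parked in either class. */
static void fence_release(void)
{
    for (int i = 0; i < NJOBS; i++) {
        if (jobs[i] == PARKED_PREMAP || jobs[i] == PARKED_PRELAUNCH) {
            jobs[i] = ADMITTED;
        }
    }
}

/* A campaign drains. On a grow failure the abort runs before the fence is
 * lowered, so a later release can no longer find the failed grow's waiters. */
static void campaign_done(bool is_grow, bool ok)
{
    assert(fence > 0);
    if (is_grow && !ok) {
        abort_premap_held();
    }
    if (--fence == 0) {
        fence_release();
    }
}
```

Calling campaign_done(true, false) and then campaign_done(false, true) leaves the pre-map job aborted and the pre-launch job admitted. With a single release keyed on the last campaign's outcome, the second call would have admitted both.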
Both helpers are declared in src/mca/plm/base/plm_private.h (the header the
errmgr, ras, and state callers already include for the existing
prte_plm_base_* launch helpers) and defined in
plm_base_launch_support.c:
/* SUCCESS release: invoked only when the global fence reaches zero, i.e.
 * every grow and shrink campaign has completed successfully. Admits both
 * classes of held job and defensively sweeps any residual campaigns. */
void prte_plm_base_fence_release(void)
{
    int _hi;
    prte_job_t *_held;

    /* --- pre-map held jobs (parked at VM_READY) --- */
    for (_hi = 0; _hi < prte_held_jobs->size; _hi++) {
        _held = (prte_job_t *)
            pmix_pointer_array_get_item(prte_held_jobs, _hi);
        if (NULL == _held) {
            continue;
        }
        pmix_pointer_array_set_item(prte_held_jobs, _hi, NULL);
        PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_VM_READY);
        PMIX_RELEASE(_held);
    }

    /* --- pre-launch held jobs (parked at LAUNCH_APPS) --- */
    for (_hi = 0; _hi < prte_prelaunch_held_jobs->size; _hi++) {
        _held = (prte_job_t *)
            pmix_pointer_array_get_item(prte_prelaunch_held_jobs, _hi);
        if (NULL == _held) {
            continue;
        }
        pmix_pointer_array_set_item(prte_prelaunch_held_jobs, _hi, NULL);
        if (prte_plm_base_job_needs_remap(_held)) {
            prte_plm_base_reset_proc_map(_held);
            PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_MAP);
        } else {
            PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_LAUNCH_APPS);
        }
        PMIX_RELEASE(_held);
    }

    /* Campaigns are removed individually as their last target drains, so
     * both lists should be empty here. Sweep each defensively anyway, so
     * a future change that can leave a residual campaign behind cannot wedge
     * the fence, and sweep *both* kinds for symmetry, not just shrink. */
    {
        prte_shrink_campaign_t *_sc, *_sn;
        PMIX_LIST_FOREACH_SAFE(_sc, _sn,
                               &prte_shrink_campaigns, prte_shrink_campaign_t) {
            pmix_list_remove_item(&prte_shrink_campaigns, &_sc->super);
            PMIX_RELEASE(_sc);
        }
    }
    {
        prte_grow_campaign_t *_gc, *_gn;
        PMIX_LIST_FOREACH_SAFE(_gc, _gn,
                               &prte_grow_campaigns, prte_grow_campaign_t) {
            pmix_list_remove_item(&prte_grow_campaigns, &_gc->super);
            PMIX_RELEASE(_gc);
        }
    }
}
/* GROW-FAILURE abort: fails every job parked at the VM_READY -> MAP
 * boundary to NEVER_LAUNCHED. Called from the grow failure drain only, and
 * independent of the fence value, so a grow failure aborts its pre-map
 * waiters even while a concurrent shrink keeps the fence nonzero. It
 * deliberately leaves prte_prelaunch_held_jobs untouched: those jobs wait
 * only on a shrink, never on the grow (conformance #4). */
void prte_plm_base_abort_premap_held(void)
{
    int _hi;
    prte_job_t *_held;

    for (_hi = 0; _hi < prte_held_jobs->size; _hi++) {
        _held = (prte_job_t *)
            pmix_pointer_array_get_item(prte_held_jobs, _hi);
        if (NULL == _held) {
            continue;
        }
        pmix_pointer_array_set_item(prte_held_jobs, _hi, NULL);
        PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_NEVER_LAUNCHED);
        PMIX_RELEASE(_held);
    }
}
The pre-launch branch of fence_release() calls two shrink-specific
helpers — prte_plm_base_job_needs_remap() (does any held proc sit on a
departing daemon?) and prte_plm_base_reset_proc_map() (un-claim the previous
mapping so the job can be remapped onto survivors) — specified in
DVM Shrink-Campaign Fence Tracking. Because shrink completion treats a clean exit
and a crash identically (a targeted daemon’s departure is always a success for
its campaign), neither held-job array is ever failed on the shrink path; the
only failure disposition of a held job is the grow-failure abort above.
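The remap decision on the pre-launch branch reduces to one question per job: does any of its procs sit on a departing node? The sketch below models that test with plain arrays. The struct toy_job layout, the node_departing flags, and both function names are hypothetical; the real helpers operate on PRRTE job and node objects and are specified in the shrink plan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified job: each proc records the index of its node. */
struct toy_job {
    const int *proc_node;   /* node index for each proc */
    size_t     nprocs;
};

/* True if any proc of the job sits on a node whose daemon is departing. */
static bool toy_job_needs_remap(const struct toy_job *job,
                                const bool *node_departing, size_t nnodes)
{
    for (size_t i = 0; i < job->nprocs; i++) {
        int n = job->proc_node[i];
        assert(n >= 0 && (size_t) n < nnodes);
        if (node_departing[n]) {
            return true;
        }
    }
    return false;
}

enum toy_next_state { TOY_STATE_MAP, TOY_STATE_LAUNCH_APPS };

/* Mirror of the release branch: remap if needed, otherwise launch as mapped. */
static enum toy_next_state toy_release_target(const struct toy_job *job,
                                              const bool *node_departing,
                                              size_t nnodes)
{
    return toy_job_needs_remap(job, node_departing, nnodes)
               ? TOY_STATE_MAP
               : TOY_STATE_LAUNCH_APPS;
}
```

A job whose procs all sit on surviving nodes keeps its existing map and goes straight to launch; a single proc on a departing node sends the whole job back through mapping.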
prte_plm_base_fence_release() acts when the global fence reaches zero,
which requires all grow and shrink campaigns to have completed. The
per-campaign completion event (Step 5) is distinct: it fires for an
individual request’s campaign when that campaign drains, independent of whether
other campaigns are still in flight.
8.3.2.7. Design Invariants
Shared fence
The fence is a single int accessed only on the progress thread, so all increments, decrements, and the zero test are race-free without locking.
A job is parked iff the fence is nonzero at the VM_READY → MAP boundary (Step 3); the LAUNCH_APPS hold (shrink plan) is gated on the shrink campaign list, not the fence, so a concurrent grow does not stall an already-mapped job on surviving nodes.
prte_plm_base_fence_release() is the success-only release; it is called only when the fence reaches zero, which requires all grow and shrink campaigns to have completed successfully. The campaign lists are therefore expected to be empty; fence_release still does a defensive sweep of both the grow and shrink lists for the degenerate case where some future change leaves a partially-setup campaign behind.
A grow failure is the only path that fails a held job. It calls prte_plm_base_abort_premap_held(), which aborts the pre-map held jobs (prte_held_jobs → NEVER_LAUNCHED) immediately and independently of the fence, and never touches prte_prelaunch_held_jobs: those jobs wait only on a shrink, so per conformance #4 a grow failure must leave them parked until the shrink completes.
The per-campaign completion event (Step 5) is independent of the global fence: it fires when an individual request’s campaign drains, even if other campaigns keep the fence nonzero.
Grow fence — see the “Why this is correct” and “Design” sections of DVM Grow-Campaign Fence Tracking.
Shrink fence — see the “Design Invariants” section of DVM Shrink-Campaign Fence Tracking.