8.3.4. DVM Shrink-Campaign Fence Tracking

This document describes the implementation of the DVM shrink (node-removal) path: how a PMIX_ALLOC_RELEASE that removes daemons is tracked against the launch fence, the second hold point that protects in-flight jobs, how campaign completion is detected, and the completion event delivered to the requester. For the shared fence mechanism it builds on — the counter, the held-job arrays, the VM_READY MAP hold point, the fence-release helper, and the completion-event helper — see the parent plan Elastic DVM Implementation Plan. The externally observable contract is specified in Elastic DVM: Specification, which is authoritative for observable behavior.

The state machine is single-threaded on the progress thread, so no locking is required anywhere in this plan.

8.3.4.1. Background

The PRTE_DAEMON_SHRINK_CMD xcast is fire-and-forget: daemons exit asynchronously and the HNP has no built-in notification when all targeted daemons have terminated. Two race windows must be closed:

Race 1 — new job maps onto a shrinking node. A job that checks the VM_READY MAP fence while a shrink is in progress may pass the fence (if it was raised after the check), get mapped to a node whose daemon is dying, and then send a launch message to a daemon that has already exited.

Race 2 — in-flight job at LAUNCH_APPS. A job that completed MAP before the shrink started and then enters prte_plm_base_launch_apps() may pack and send launch data to a daemon that dies between MAP and the send. The existing VM_READY fence does not protect this window because the job already passed the fence before the shrink was initiated.

Race 1 is covered by the shared fence (the fence is incremented for shrink — see Step 1). Race 2 requires a second hold point guarded by checking the shrink campaign list (nonempty only during shrink) so that a concurrent grow does not unnecessarily hold jobs that have already been mapped to surviving nodes.

Multiple concurrent shrink campaigns are supported: each campaign tracks its own count of still-living targets and is removed from the list when all of them have departed the DVM.

8.3.4.2. Design Decision — Complete on Death, Not on Acknowledgement

An earlier revision of this plan had each targeted daemon send an explicit PRTE_PLM_SHRINK_ACK_CMD to the HNP just before it exited, and the HNP decremented the campaign on receipt of the ACK. The errmgr comm-failure path existed only as a fallback for a daemon that crashed before it could send its ACK. That design was abandoned for the following reasons.

The ACK is the wrong signal. An ACK announces a daemon’s intent to leave; it is sent while the daemon is still alive and still a participant in the DVM. But the state the fence protects against — a job being mapped onto, or having launch data sent to, a departing daemon — is only safe once the daemon’s routes, its num_daemons count, and its node state have actually been torn down. That teardown happens on the comm-failure path (errmgr_dvm.c), not when the ACK is sent. Releasing held jobs on ACK receipt could therefore unpark them into a DVM that still believed the departing daemon was present.

The reason for departure carries no information. The HNP only needs to know that a target is gone, not why. A clean shrink exit and a crash have identical consequences for the campaign: the node is being removed either way, and the application processes beneath the daemon are killed (or die) when it terminates. Distinguishing the two cases buys nothing, so the “clean ACK vs. crash fallback” split was pure complexity.

Two decrement paths caused double-counting. With both the ACK handler and the errmgr fallback live, each target could be counted twice — once when its ACK arrived (daemon still alive) and again when its subsequent death was detected — because nothing marked a target as already counted and the campaign was only removed once pending hit zero. Worked through for a two-target campaign:

camp: ntargets=2, pending=2
  daemon A acks   -> pending=1, fence-=1     (A still alive)
  daemon A dies   -> errmgr matches A (still in targets, camp still listed)
                  -> pending=0 -> camp removed+released, fence-=1
  daemon B        -> never counted; campaign already "complete"

The campaign completed and the fence released while daemon B was still present, re-opening exactly the race the fence was meant to close.
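The trace above can be replayed with a toy model. This is only a sketch under stated assumptions: the `toy_` names are hypothetical and stand in for the campaign and the shared fence, and the single counting routine represents the logic that both the ACK handler and the errmgr fallback ran in the abandoned design.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the abandoned design; every name here is hypothetical. */
typedef struct {
    int  pending;  /* targets not yet counted */
    bool listed;   /* campaign still on the active list */
} toy_campaign_t;

static int toy_fence;  /* stand-in for the shared launch fence */

/* The ACK handler and the errmgr fallback both ran this logic, and
 * nothing recorded which target had already been counted. */
static void toy_count_target(toy_campaign_t *camp)
{
    if (!camp->listed) {
        return;
    }
    camp->pending--;
    toy_fence--;
    if (0 == camp->pending) {
        camp->listed = false;  /* campaign declared complete */
    }
}

/* Replays the trace: A acks, then A dies, and B never reports.
 * Returns true if the campaign completed and the fence reached zero
 * even though daemon B was never counted. */
static bool toy_replay_double_count(void)
{
    toy_campaign_t camp = { .pending = 2, .listed = true };
    toy_fence = 2;
    toy_count_target(&camp);  /* daemon A acks (still alive) */
    toy_count_target(&camp);  /* daemon A dies: counted a second time */
    return !camp.listed && 0 == toy_fence;
}
```

In this model the replay returns true, which is the premature completion described above.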

Resolution. The ACK was removed entirely (the daemon-side send, the PRTE_PLM_SHRINK_ACK_CMD constant, and the HNP-side handler). Campaign completion is driven solely by actual daemon departure on the comm-failure path, which is both the authoritative event and the point at which the relevant cleanup has occurred. To make the single decrement idempotent against a daemon that emits more than one failure event, each matched target slot is stamped PMIX_RANK_INVALID once counted (Step 5).

8.3.4.3. Step 1 — Shrink campaign type, list, and fence increment

1a — Campaign type

Add the following type to src/runtime/prte_globals.h and define the class instance in src/runtime/prte_globals.c:

/* one entry per in-progress shrink campaign */
typedef struct {
    pmix_list_item_t super;
    pmix_rank_t     *targets;        /* daemon ranks being terminated */
    int              ntargets;       /* initial count */
    int              pending;        /* targets not yet known to have departed */
    /* requester recorded for the spec's phase-two completion event */
    pmix_proc_t      requester;      /* who issued the PMIX_ALLOC_RELEASE */
    char            *alloc_id;       /* PMIX_ALLOC_ID of the allocation */
    char            *req_id;         /* PMIX_ALLOC_REQ_ID, or NULL */
    bool             have_requester; /* false for a scheduler-driven release */
} prte_shrink_campaign_t;
PMIX_CLASS_DECLARATION(prte_shrink_campaign_t);

In src/runtime/prte_globals.c:

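static void campaign_construct(prte_shrink_campaign_t *p)
{
    /* PMIX_NEW does not zero the object; NULL what the destructor frees */
    p->targets = NULL;
    p->alloc_id = NULL;
    p->req_id = NULL;
    p->have_requester = false;
}
static void campaign_destruct(prte_shrink_campaign_t *p)
{
    free(p->targets);
    free(p->alloc_id);
    free(p->req_id);
}
PMIX_CLASS_INSTANCE(prte_shrink_campaign_t, pmix_list_item_t,
                    campaign_construct, campaign_destruct);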

1b — Global list

In src/runtime/prte_globals.h declare, and in src/runtime/prte_globals.c define:

pmix_list_t prte_shrink_campaigns;

Initialize in src/runtime/prte_init.c:

PMIX_CONSTRUCT(&prte_shrink_campaigns, pmix_list_t);

Destruct in src/runtime/prte_finalize.c:

PMIX_LIST_DESTRUCT(&prte_shrink_campaigns);

1c — Populate campaign and increment fence

In src/mca/ras/base/ras_base_allocate.c, the PMIX_ALLOC_RELEASE branch of prte_ras_base_complete_request() builds the daemon rank array (ranks, count m) and then calls free(ranks) before the xcast that carries PRTE_DAEMON_SHRINK_CMD to the daemons. Insert the campaign setup before free(ranks), recording the requester directly from the request object (req) so the completion event can be directed at it. Guard the whole setup on 0 < m: a release that removes no daemons creates no campaign, exactly as the grow path creates none when map->num_new_daemons == 0. The file must #include "src/mca/plm/base/plm_private.h" to see prte_plm_base_dvm_mod_notify().

/* record the campaign — must be before free(ranks).  Skip entirely when the
 * release removes no daemons (m == 0): an empty campaign would never drain
 * (no target ever departs on the comm-failure path), so it would leave
 * prte_shrink_campaigns non-empty forever — wedging every later job at the
 * LAUNCH_APPS hold — and would emit no completion event.  Mirrors the grow
 * path's `map->num_new_daemons > 0` guard and the spec's "no event when
 * nothing changes" clause. */
if (0 < m) {
    prte_shrink_campaign_t *_camp = PMIX_NEW(prte_shrink_campaign_t);
    _camp->targets = (pmix_rank_t *) malloc(m * sizeof(pmix_rank_t));
    memcpy(_camp->targets, ranks, m * sizeof(pmix_rank_t));
    _camp->ntargets = m;
    _camp->pending  = m;
    /* this path always has a requesting process (req->tproc); a
     * scheduler-driven release that has no requester does not pass through
     * here.  Capture the requester and the allocation ids from the request. */
    PMIX_XFER_PROCID(&_camp->requester, &req->tproc);
    for (n = 0; n < req->ninfo; n++) {
        if (PMIx_Check_key(req->info[n].key, PMIX_ALLOC_ID)) {
            _camp->alloc_id = strdup(req->info[n].value.data.string);
        } else if (PMIx_Check_key(req->info[n].key, PMIX_ALLOC_REQ_ID)) {
            _camp->req_id = strdup(req->info[n].value.data.string);
        }
    }
    _camp->have_requester = true;
    pmix_list_append(&prte_shrink_campaigns, &_camp->super);
    prte_dvm_launch_fence += m;
}
free(ranks);

/* existing xcast */
if (PRTE_SUCCESS != (rc = prte_grpcomm.xcast(PRTE_RML_TAG_DAEMON, &msg))) {
    PRTE_ERROR_LOG(rc);
    /* clean up the campaign we just added (only if one was created), and
     * tell the requester the DVM modification failed (spec phase-two
     * failure event).  rc is a PRTE code, so convert it to the
     * pmix_status_t the event carries. */
    if (0 < m) {
        prte_shrink_campaign_t *_camp =
            (prte_shrink_campaign_t *) pmix_list_remove_last(&prte_shrink_campaigns);
        prte_dvm_launch_fence -= _camp->pending;
        if (_camp->have_requester) {
            prte_plm_base_dvm_mod_notify(&_camp->requester, _camp->alloc_id,
                                         _camp->req_id, false,
                                         prte_pmix_convert_rc(rc));
        }
        PMIX_RELEASE(_camp);
    }
}

Because the fence is incremented and the campaign appended before the xcast, any VM_READY event that fires on the progress thread after this point sees a nonzero fence and parks the job.
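The accounting in this step can be checked with a small model. It is a sketch only: `sk_fence` and `sk_ncampaigns` are hypothetical stand-ins for prte_dvm_launch_fence and the length of prte_shrink_campaigns, and `sk_release()` compresses the release branch into the two decisions that matter here (the `0 < m` guard and the xcast-failure rollback).

```c
#include <assert.h>
#include <stdbool.h>

static int sk_fence;       /* stand-in for prte_dvm_launch_fence */
static int sk_ncampaigns;  /* stand-in for the number of listed campaigns */

/* Hypothetical model of the release branch.  m is the number of daemons
 * being removed; xcast_ok says whether the shrink command was sent.
 * Returns the number of failure events emitted (0 or 1). */
static int sk_release(int m, bool xcast_ok)
{
    int pending = 0;

    if (0 < m) {               /* an empty release creates no campaign */
        pending = m;
        sk_ncampaigns++;
        sk_fence += m;
    }
    if (!xcast_ok && 0 < m) {  /* roll back exactly what was added */
        sk_ncampaigns--;
        sk_fence -= pending;
        return 1;
    }
    return 0;
}
```

Under these assumptions an empty release and a failed xcast both leave the fence and the campaign count where they started, and only the failed xcast produces an event.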

8.3.4.4. Step 2 — Daemon exit (no acknowledgement)

A daemon that decides to exit in response to PRTE_DAEMON_SHRINK_CMD does not send any acknowledgement to the HNP. After firing its PMIX_EVENT_JOB_END notification it simply activates PRTE_JOB_STATE_DAEMONS_TERMINATED and exits.

The HNP tracks campaign completion through the daemon’s actual departure, not through a message announcing its intent to leave. An acknowledgement sent before the daemon dies would be premature: the HNP cares only that the daemon is gone — the reason is irrelevant, and the application processes under it are killed when it terminates regardless. More importantly, the acknowledgement would arrive before the daemon’s routes, num_daemons count, and node state have been torn down, so acting on it could release held jobs into a DVM that still believes the departing daemon is present. The comm-failure event (Step 5) is the only signal that coincides with that cleanup, so it is the sole completion trigger.

8.3.4.5. Step 3 — Second hold point at LAUNCH_APPS

In src/mca/plm/base/plm_base_launch_support.c, prte_plm_base_launch_apps() (line 817), add a check after the job-state guard but before packing any data:

/* if a shrink is in progress, hold this job until all targeted
 * daemons have departed the DVM, to prevent sending launch data to
 * a dying daemon */
if (!pmix_list_is_empty(&prte_shrink_campaigns)) {
    jdata->state = PRTE_JOB_STATE_WAITING_FOR_DAEMONS;
    PMIX_RETAIN(jdata);
    pmix_pointer_array_add(prte_prelaunch_held_jobs, jdata);
    PMIX_RELEASE(caddy);
    return;
}

Using !pmix_list_is_empty(...) rather than a counter means the check automatically handles concurrent campaigns: the list is nonempty as long as any shrink is in progress. Keying on the shrink list specifically (not the shared fence counter) ensures a concurrent grow does not stall a job that has already been mapped onto surviving nodes.
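The difference between the two candidate predicates shows up when only a grow is in flight. The sketch below uses a hypothetical `toy_dvm_t` that reduces the DVM to the shared fence value and the number of active shrink campaigns; it is an illustration of the argument, not the runtime's data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of the DVM state to the two values that matter. */
typedef struct {
    int fence;             /* shared counter: grow and shrink contributions */
    int shrink_campaigns;  /* number of active shrink campaigns */
} toy_dvm_t;

/* Predicate chosen in this plan: hold only while a shrink is active. */
static bool hold_on_shrink_list(const toy_dvm_t *d)
{
    return 0 < d->shrink_campaigns;
}

/* Alternative the plan rejects: hold whenever the shared fence is raised. */
static bool hold_on_fence(const toy_dvm_t *d)
{
    return 0 < d->fence;
}
```

With a grow-only state such as `{ .fence = 2, .shrink_campaigns = 0 }`, the fence-based predicate would park an already-mapped job while the list-based predicate lets it launch.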

8.3.4.6. Step 4 — Remap helpers

The pre-launch branch of the shared prte_plm_base_fence_release() (parent plan, Step 4) calls two shrink-specific helpers, both defined in plm_base_launch_support.c and declared in src/mca/plm/base/plm_private.h.

prte_plm_base_job_needs_remap(jdata) iterates over jdata->procs and returns true if any proc’s assigned node has a daemon rank appearing in any active campaign:

bool prte_plm_base_job_needs_remap(prte_job_t *jdata)
{
    prte_shrink_campaign_t *camp;
    prte_proc_t *proc;
    int p, t;

    PMIX_LIST_FOREACH(camp, &prte_shrink_campaigns, prte_shrink_campaign_t) {
        for (p = 0; p < jdata->procs->size; p++) {
            proc = (prte_proc_t *)
                pmix_pointer_array_get_item(jdata->procs, p);
            if (NULL == proc || NULL == proc->node ||
                NULL == proc->node->daemon) continue;
            for (t = 0; t < camp->ntargets; t++) {
                if (camp->targets[t] == proc->node->daemon->name.rank) {
                    return true;
                }
            }
        }
    }
    return false;
}

prte_plm_base_reset_proc_map(jdata) un-claims all slot assignments made during the previous MAP pass so that the job can be remapped cleanly. It mirrors the mapper’s prte_rmaps_base_claim_slot() accounting, which does node->num_procs++ and ++node->slots_inuse for each non-tool proc:

void prte_plm_base_reset_proc_map(prte_job_t *jdata)
{
    int p, np;
    prte_proc_t *proc;
    prte_node_t *node;
    prte_app_context_t *app;

    for (p = 0; p < jdata->procs->size; p++) {
        proc = (prte_proc_t *) pmix_pointer_array_get_item(jdata->procs, p);
        if (NULL == proc) continue;
        node = proc->node;
        if (NULL != node) {
            /* remove from node's proc list */
            for (np = 0; np < node->procs->size; np++) {
                if (pmix_pointer_array_get_item(node->procs, np) == proc) {
                    pmix_pointer_array_set_item(node->procs, np, NULL);
                    node->num_procs--;
                    /* mirror claim_slot: tool procs do not count
                     * against slots_inuse */
                    app = (prte_app_context_t *)
                        pmix_pointer_array_get_item(jdata->apps,
                                                    proc->app_idx);
                    if (NULL == app ||
                        !PRTE_FLAG_TEST(app, PRTE_APP_FLAG_TOOL)) {
                        node->slots_inuse--;
                    }
                    break;
                }
            }
        }
        pmix_pointer_array_set_item(jdata->procs, p, NULL);
        PMIX_RELEASE(proc);
    }
    jdata->num_procs = 0;
    jdata->num_launched = 0;
}

After the reset, the job re-enters prte_rmaps_base_map_job(), which re-creates proc objects on the surviving nodes using the original app->num_procs counts.
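The reset is correct only if it undoes exactly what the claim did. A minimal sketch of that symmetry follows, modeled on the decrements in the reset helper above: `toy_node_t`, `toy_claim()` and `toy_unclaim()` are hypothetical, and the claim side is an assumption about the mapper's bookkeeping rather than a copy of it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical node with only the two counters the reset touches. */
typedef struct {
    int num_procs;
    int slots_inuse;
} toy_node_t;

/* Assumed claim-side accounting: every proc is counted, but tool procs
 * do not consume a slot. */
static void toy_claim(toy_node_t *n, bool is_tool)
{
    n->num_procs++;
    if (!is_tool) {
        n->slots_inuse++;
    }
}

/* Reset-side accounting: the exact inverse of toy_claim(). */
static void toy_unclaim(toy_node_t *n, bool is_tool)
{
    n->num_procs--;
    if (!is_tool) {
        n->slots_inuse--;
    }
}
```

Claiming two application procs and one tool proc and then unclaiming all three should return both counters to zero; skipping the tool check on either side would leave slots_inuse off by one.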

8.3.4.7. Step 5 — Detect target departure in the errmgr and notify completion

Campaign completion is driven entirely by the daemon-loss path: when a targeted daemon leaves the DVM, the HNP’s comm-failure handler matches its rank against the active campaigns and drives the fence down. This is the same event whether the daemon exited cleanly in response to the shrink command or crashed, so a single code path covers both — there is no separate “acknowledgement” message and no fallback to reconcile.

In src/mca/errmgr/dvm/errmgr_dvm.c, inside the PMIX_CHECK_NSPACE daemon-proc block of proc_errors() (line 252), within the PRTE_PROC_STATE_COMM_FAILED / heartbeat-failed handler, add after the “mark daemon as gone” logic:

/* check if this daemon was a pending shrink target */
{
    prte_shrink_campaign_t *_camp, *_next;
    int _t;
    PMIX_LIST_FOREACH_SAFE(_camp, _next,
                           &prte_shrink_campaigns, prte_shrink_campaign_t) {
        for (_t = 0; _t < _camp->ntargets; _t++) {
            if (_camp->targets[_t] != proc->rank) continue;
            /* stamp this slot so a repeated comm event for the same
             * daemon cannot decrement the campaign twice */
            _camp->targets[_t] = PMIX_RANK_INVALID;
            _camp->pending--;
            prte_dvm_launch_fence--;
            if (0 == _camp->pending) {
                /* this request's shrink is complete — first let the
                 * active RAS modules release the freed resources back
                 * to their resource manager(s), then notify the
                 * requester that the DVM now reflects the new size */
                prte_ras_base_shrink_complete(_camp);
                if (_camp->have_requester) {
                    /* success == true => PMIX_DVM_IS_READY */
                    prte_plm_base_dvm_mod_notify(&_camp->requester,
                                                 _camp->alloc_id,
                                                 _camp->req_id,
                                                 true, PMIX_SUCCESS);
                }
                pmix_list_remove_item(&prte_shrink_campaigns,
                                     &_camp->super);
                PMIX_RELEASE(_camp);
            }
            if (0 == prte_dvm_launch_fence) {
                prte_plm_base_fence_release();
            }
            goto errmgr_shrink_done;
        }
    }
    errmgr_shrink_done: ;
}

Because the state machine runs single-threaded on the progress thread, the counter decrements and the list manipulation cannot interleave with any other state machine callback. Stamping the matched slot PMIX_RANK_INVALID makes the decrement idempotent: should the daemon generate more than one failure event, only the first is counted. A daemon that crashes during a shrink is handled identically to one that exits cleanly: the node was being removed anyway, and jobs mapped to it are detected by prte_plm_base_job_needs_remap() and re-routed to surviving nodes.
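The stamping technique can be isolated in a few lines. In this sketch `TOY_RANK_INVALID` is a hypothetical stand-in for PMIX_RANK_INVALID and `toy_campaign_t` keeps only the fields the decrement touches; the fence and the completion handling are left out.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RANK_INVALID UINT32_MAX  /* stand-in for PMIX_RANK_INVALID */

/* Hypothetical campaign reduced to the fields the decrement uses. */
typedef struct {
    uint32_t targets[4];
    int      ntargets;
    int      pending;
} toy_campaign_t;

/* Count the departure of `rank` at most once.
 * Returns 1 if the event was counted, 0 if `rank` is not (or is no
 * longer) a live target of this campaign. */
static int toy_note_departure(toy_campaign_t *camp, uint32_t rank)
{
    for (int t = 0; t < camp->ntargets; t++) {
        if (camp->targets[t] != rank) {
            continue;
        }
        camp->targets[t] = TOY_RANK_INVALID;  /* stamp: cannot match again */
        camp->pending--;
        return 1;
    }
    return 0;
}
```

A second failure event for the same rank finds no matching slot, so pending is decremented once per target regardless of how many events arrive.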

Before the completion event is emitted, prte_ras_base_shrink_complete() cycles across every active RAS module (prte_ras_base.selected_modules) and invokes the optional shrink_complete entry point on each, passing the completed prte_shrink_campaign_t. This is the component’s opportunity to release the freed resources back to its resource manager; what it does with that opportunity is up to the component. Unlike modify, the cycle is not keyed to a single component: a single PMIX_ALLOC_RELEASE may remove nodes drawn from more than one allocation (see the “Resource release at shrink completion” section of Elastic DVM: Specification), so every module is offered the campaign and each handles only the share that belongs to it. A module with no shrink_complete pointer, or with no stake in the operation, does nothing. The runtime guarantees only the ordering: the release cycle runs ahead of prte_plm_base_dvm_mod_notify(), so by the time the requester sees PMIX_DVM_IS_READY every component has been given its chance to act. It does not guarantee that any particular resource was in fact returned; that is the component’s decision.
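The cycle over modules with an optional entry point might look roughly like the following. This is a sketch, not the definition of prte_ras_base_shrink_complete(): the `toy_` types are hypothetical, a plain array stands in for prte_ras_base.selected_modules, and the return value (the number of modules invoked) exists only to make the behavior observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the campaign and the RAS module table. */
typedef struct {
    int ntargets;
} toy_campaign_t;

typedef void (*toy_shrink_complete_fn_t)(toy_campaign_t *camp);

typedef struct {
    const char              *name;
    toy_shrink_complete_fn_t shrink_complete;  /* optional, may be NULL */
} toy_ras_module_t;

static int toy_calls;  /* counts invocations of the sample entry point */

/* Sample entry point: a real component would hand resources back here. */
static void toy_release_to_rm(toy_campaign_t *camp)
{
    (void) camp;
    toy_calls++;
}

/* Offer the campaign to every module; a module without the entry point
 * is skipped.  Returns how many modules were invoked. */
static int toy_shrink_complete(toy_ras_module_t *mods, size_t nmods,
                               toy_campaign_t *camp)
{
    int invoked = 0;
    for (size_t i = 0; i < nmods; i++) {
        if (NULL == mods[i].shrink_complete) {
            continue;
        }
        mods[i].shrink_complete(camp);
        invoked++;
    }
    return invoked;
}
```

The caller would run this cycle to completion before emitting the completion event, which is the only ordering the runtime promises.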

The PMIX_DVM_IS_READY notification is per campaign: it fires when this request’s last target departs, regardless of whether other (grow or shrink) campaigns keep the shared fence nonzero. The fence-release of held jobs, by contrast, waits for the global fence to reach zero. The failure counterpart (PMIX_ERR_DVM_MOD) is emitted only on the xcast-failure cleanup in Step 1 — once the shrink command is on the wire, every targeted daemon’s departure is a success for the campaign, since clean exit and crash are indistinguishable and both remove the node as requested.
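The split between the per-campaign event and the global release can be illustrated with a fence that also carries a grow contribution. The model below is a sketch under that assumption; `toy_dvm_t` and `toy_target_departed()` are hypothetical and only count events rather than delivering them.

```c
#include <assert.h>

/* Hypothetical DVM state: one shared fence plus event counters. */
typedef struct {
    int fence;     /* shared fence: shrink and grow contributions */
    int notified;  /* completion events emitted */
    int released;  /* times the held jobs were released */
} toy_dvm_t;

/* One targeted daemon of a shrink campaign has departed. */
static void toy_target_departed(toy_dvm_t *dvm, int *camp_pending)
{
    (*camp_pending)--;
    dvm->fence--;
    if (0 == *camp_pending) {
        dvm->notified++;  /* per campaign: this request is complete */
    }
    if (0 == dvm->fence) {
        dvm->released++;  /* global: nothing else holds the fence */
    }
}
```

Starting from a fence of 3 (two shrink targets plus one pending grow daemon), the second departure emits the completion event while the held jobs stay parked, because the grow still keeps the fence at 1.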

8.3.4.8. Summary of Files Changed (Shrink Fence)

File: src/runtime/prte_globals.h
Change: Declare prte_shrink_campaign_t (type + PMIX_CLASS_DECLARATION, including the requester fields) and prte_shrink_campaigns (pmix_list_t).

File: src/runtime/prte_globals.c
Change: Define PMIX_CLASS_INSTANCE for prte_shrink_campaign_t (destructor frees targets, alloc_id, req_id). Define prte_shrink_campaigns.

File: src/runtime/prte_init.c
Change: PMIX_CONSTRUCT(&prte_shrink_campaigns, pmix_list_t) alongside prte_held_jobs initialization.

File: src/runtime/prte_finalize.c
Change: PMIX_LIST_DESTRUCT(&prte_shrink_campaigns).

File: src/mca/ras/base/ras_base_allocate.c
Change: Add #include "src/mca/plm/base/plm_private.h" for prte_plm_base_dvm_mod_notify(). In the PMIX_ALLOC_RELEASE branch of prte_ras_base_complete_request(), guarded on 0 < m: create a prte_shrink_campaign_t, copy the rank array into it, record the requester from req->tproc and PMIX_ALLOC_ID / PMIX_ALLOC_REQ_ID from req->info, append to prte_shrink_campaigns, and increment prte_dvm_launch_fence by m, all before free(ranks). Add xcast-failure cleanup that removes the campaign, decrements the fence, and emits PMIX_ERR_DVM_MOD (carrying prte_pmix_convert_rc(rc)) to the requester.

File: src/prted/prted_comm.c
Change: In PRTE_DAEMON_SHRINK_CMD handler: after the JOB_END notification wait, activate PRTE_JOB_STATE_DAEMONS_TERMINATED and exit. No acknowledgement is sent; the HNP detects departure via the comm-failure path.

File: src/mca/plm/base/plm_base_launch_support.c
Change: Add prte_plm_base_job_needs_remap() and prte_plm_base_reset_proc_map(). Add hold check in prte_plm_base_launch_apps() on !pmix_list_is_empty(&prte_shrink_campaigns).

File: src/mca/plm/base/plm_private.h
Change: Declare the two remap helpers.

File: src/mca/ras/ras.h
Change: Add the shrink_complete module entry point: a prte_ras_base_module_shrink_complete_fn_t taking the completed prte_shrink_campaign_t, and a field for it in prte_ras_base_module_t (after modify). Components that hand resources back to a scheduler implement it; others leave it NULL.

File: src/mca/ras/base/base.h
Change: Declare prte_ras_base_shrink_complete(prte_shrink_campaign_t *).

File: src/mca/ras/base/ras_base_allocate.c
Change: Define prte_ras_base_shrink_complete(): cycle across prte_ras_base.selected_modules and invoke each module’s shrink_complete (when non-NULL), passing the campaign. Not keyed to one component, since a release may span multiple allocations.

File: src/mca/errmgr/dvm/errmgr_dvm.c
Change: Add #include "src/mca/ras/base/base.h". In proc_errors(), daemon-comm-failure block: search prte_shrink_campaigns for the dead daemon’s rank; if found, stamp the matched target slot PMIX_RANK_INVALID, decrement campaign pending and fence; when pending hits zero call prte_ras_base_shrink_complete() to release the resources RM-side, then emit PMIX_DVM_IS_READY to the requester and remove the campaign; call prte_plm_base_fence_release() when the fence hits zero. This is the sole shrink-completion trigger.

The shared infrastructure this path relies on — the fence counter, held-job arrays, VM_READY MAP hold point, prte_plm_base_fence_release(), and the prte_plm_base_dvm_mod_notify() completion-event helper — is listed in the “Shared Fence Infrastructure” table in Elastic DVM Implementation Plan.

8.3.4.9. Design Invariants

  • prte_shrink_campaigns is a pmix_list_t; each entry covers exactly one PMIX_ALLOC_RELEASE request. Multiple concurrent shrink campaigns are supported.

  • The fence is incremented by exactly m at campaign creation and decremented by 1 for each targeted daemon whose departure is detected on the errmgr comm-failure path (clean exit and crash are indistinguishable and handled identically). Each target slot is stamped PMIX_RANK_INVALID once counted, so a repeated comm event cannot decrement twice. A campaign is removed from the list when its pending count reaches zero.

  • The LAUNCH_APPS hold uses !pmix_list_is_empty(&prte_shrink_campaigns), not prte_dvm_launch_fence > 0, so a concurrent grow does not stall already-mapped jobs on surviving nodes.

  • Each campaign’s targets array stays allocated from creation through removal, so prte_plm_base_job_needs_remap() can safely iterate it during release. Slots that have already been counted hold PMIX_RANK_INVALID (Step 5) rather than the original daemon rank.

  • Jobs in prte_prelaunch_held_jobs hold a PMIX_RETAIN reference; prte_plm_base_fence_release() releases it after re-activating the job. These jobs wait only on a shrink and are never aborted by a concurrent grow failure (the grow-failure abort touches only prte_held_jobs); since shrink completion is success-only, they are always re-activated, not failed.

  • The completion event is per campaign and fires from the campaign-removal point (pending == 0), so each accepted release yields exactly one PMIX_DVM_IS_READY (success) or, on an xcast failure at creation, exactly one PMIX_ERR_DVM_MOD — never both, and never for a scheduler-driven release with no requester.

  • The RM-side release cycle runs strictly before the completion event. At pending == 0 the campaign-removal point calls prte_ras_base_shrink_complete(), which offers the campaign to every active RAS module, and only then emits PMIX_DVM_IS_READY. Because the departing nodes may span multiple allocations, all modules are cycled — each handling only its own share. The invariant is the ordering (every component is offered the campaign before the event fires), not that any resource was actually returned: that is the component’s decision, outside the runtime’s control.

8.3.4.10. Follow-up — collective shrink completion

A possible optimization — repairing the routing tree once per shrink campaign (a collective completion scheme) rather than once per departing daemon — has been deferred out of the launch-fence work and tracked separately as openpmix/prrte#2492.