8.3.1. Elastic DVM: Specification

8.3.1.1. Purpose

This document specifies the externally observable behavior of the DVM while it changes size — what an application, tool, or scheduler may rely on when the DVM grows (a new-daemon launch campaign) or shrinks (a node-removal campaign). It covers two distinct audiences and contracts:

  • Job admission — what happens to an application job that is submitted while a grow or shrink is in progress (the bulk of this document).

  • Size-change completion — how the process that requested the size change learns, asynchronously, whether the DVM operation eventually succeeded or failed (see Asynchronous size-change completion).

It defines what the runtime guarantees, not how it achieves it. The companion design plans — Elastic DVM Implementation Plan (the shared fence mechanism and completion-event helper), DVM Grow-Campaign Fence Tracking (the grow path’s per-campaign accounting), and DVM Shrink-Campaign Fence Tracking (the shrink path’s campaign tracking) — describe the internal data structures, code paths, and implementation order. Where this specification and those plans disagree about observable behavior, this specification is authoritative and the plan must be corrected.

The whole of this behavior is gated on the DVM being in elastic mode, selected by the pre-existing prte_elastic_mode MCA parameter (off by default). When it is not set the DVM is fixed-size: none of the job-admission deferral, parking, or completion-event machinery is active, and the runtime behaves exactly as it did before this feature — a daemon loss, for instance, is handled by the ordinary error path rather than absorbed as a campaign event. Everything below describes the behavior when elastic mode is enabled.

Within elastic mode, the job-admission guarantees are stated purely in terms of job lifecycle outcomes and introduce no new command-line options, environment variables, or PMIx attributes: the grow and shrink triggers that already exist (--add-host / --add-hostfile, a scheduler-driven daemon launch, and a PMIX_ALLOC_RELEASE that removes nodes) simply acquire correct concurrency semantics. The completion contract does introduce two new PMIx event (status) codes, PMIX_DVM_IS_READY and PMIX_ERR_DVM_MOD, used to notify the requester when the asynchronous DVM operation finishes. These two codes are the only new caller-visible interface this feature adds, and both are optional (see Backward compatibility and transparency).

8.3.1.2. Scope

8.3.1.2.1. In scope

  • The admission guarantee for an application job that reaches the map-eligible boundary while a grow or shrink is in progress.

  • The placement guarantee for a job launched across a size change — onto which set of nodes (pre- or post-change) it is permitted to map.

  • The disposition of a job that was already mapped onto a node that is about to depart in a shrink.

  • The outcome of a daemon failure that occurs during a grow or a shrink, the distinction between a failure that belongs to the in-progress campaign and an unrelated one, and the rollback of a failed grow to the DVM’s pre-grow membership.

  • The behavior when grow and shrink campaigns, or multiple campaigns of the same kind, overlap in time.

  • The two-phase model by which a dynamic allocation request that drives a size change is answered: a synchronous response when the request is accepted, and a later asynchronous event when the DVM operation completes or fails.

  • The two new PMIx event codes — PMIX_DVM_IS_READY and PMIX_ERR_DVM_MOD — that carry the asynchronous completion result to the requester, and the allocation-identifying payload each delivers.

  • The PRTE_JOB_STATE_WAITING_FOR_DAEMONS job state reported for a parked job in verbose and debugging output.

8.3.1.2.2. Out of scope / non-goals

  • The policy that decides when the DVM grows or shrinks. That decision is driven entirely by the existing triggers (a tool or application expansion request, a scheduler action, or an allocation release); this specification governs only how concurrent job submissions behave around such a change, never whether the change happens.

  • Node selection — which physical nodes are added or removed — is the responsibility of the resource manager and the existing allocation and mapping machinery, and is unchanged.

  • The wire encoding of internal messages, the names and types of internal counters and lists, and the precise timing of internal state transitions are implementation details and are not specified here.

  • This document does not redefine the meaning of any DVM size-change trigger; it specifies only the admission and placement guarantees that now hold while one is in flight.

8.3.1.3. Definitions

Grow campaign

A single in-progress launch of one or more new daemons that extends the DVM onto additional nodes. A grow campaign is complete when every new daemon has reported in and the head node has distributed the wireup (nidmap) information to the DVM.

Shrink campaign

A single in-progress removal of one or more nodes from the DVM, initiated by an allocation release. A shrink campaign targets a fixed set of daemon ranks and is complete when every targeted daemon has actually departed the DVM.

Size change

A grow campaign or a shrink campaign. More than one may be in progress at the same time.

Map-eligible boundary

The point in a job’s lifecycle at which it is ready to be assigned to nodes (the VM_READY → MAP transition). A job that has not yet crossed this boundary has no node assignments.

Launch boundary

The point at which a job’s mapping is complete and the runtime is ready to send launch data to the daemons that will host its processes (the LAUNCH_APPS transition). A job at this boundary already has node assignments.

Parked job

A job that has been held at the map-eligible boundary or the launch boundary because a size change is in progress. A parked job is reported in the PRTE_JOB_STATE_WAITING_FOR_DAEMONS state. Parking is invisible to the submitting tool beyond a delay in launch; the job is neither failed nor restarted.

Surviving node

A node that remains in the DVM after a shrink campaign completes.

Departing node

A node whose daemon is a target of an in-progress shrink campaign.

Size-change requester

The process whose request initiated a grow or shrink — for a dynamic allocation this is the process that issued the PMIX_ALLOC_NEW / PMIX_ALLOC_EXTEND (grow) or PMIX_ALLOC_RELEASE (shrink) request. A size change initiated without a PMIx requester (for example a scheduler pushing daemons directly) has no requester.

Request acceptance

The point at which the runtime has finished processing a size-change request — validated it and initiated the corresponding DVM operation — and returns its synchronous response. Acceptance is phase one; it does not assert that the operation has finished.

Operation completion

The point at which the initiated DVM operation actually finishes — the grow’s new daemons are launched and wired, or the shrink’s targeted daemons have departed and the routing tree is repaired — or definitively fails. Completion is phase two and is reported asynchronously by event (see Asynchronous size-change completion).

8.3.1.4. Admission contract

The central guarantee is one of non-destructive deferral:

A job submitted while a size change is in progress is never failed merely because the DVM is changing size. It is held until the change completes and then admitted onto the post-change set of nodes — except that a job whose admission depends on a grow that fails is aborted rather than launched onto an incomplete DVM.

Two distinct placement hazards are closed by this contract, and a conforming implementation must close both:

  1. A job must never be mapped onto a node whose daemon is not ready. During a grow this means a node whose daemon has started but has not yet received its wireup information; during a shrink it means a node whose daemon is a departure target.

  2. A job must never have launch data sent to a daemon that is departing. A job that completed mapping before a shrink began, and was placed on a node now targeted for removal, must not transmit its launch message to that daemon.

8.3.1.4.1. Behavior during a grow

While a grow campaign is in progress:

  • A job that reaches the map-eligible boundary is parked. It is admitted only once every in-progress grow campaign has completed, that is, only after the new daemons are not merely running but fully wired into the DVM. This closes hazard 1 for the grow case: the job cannot be mapped onto a node whose daemon is up but not yet wired.

  • A job that had already crossed the map-eligible boundary before the grow began is unaffected: it continues to launch on the nodes it was assigned, and a grow never stalls it.

  • When the grow completes successfully, every job parked at the map-eligible boundary is admitted and proceeds to mapping, now able to consider the newly added nodes.

  • If a daemon belonging to the grow campaign fails before the campaign completes, the grow is treated as failed as a whole and is rolled back: every daemon that the same campaign had already started is terminated, and the nodes the campaign was adding leave the DVM, so the DVM is restored to exactly the membership it had before that campaign began. A failed grow never leaves the DVM half-extended with a partial, un-wired set of new daemons. Every job parked on account of the grow is then aborted (it never launches) rather than being admitted onto an incomplete DVM. This matches the first-failure semantics of a non-elastic launch. The rollback is scoped to the failed campaign: an unrelated grow campaign running concurrently keeps its own daemons and completes normally, and pre-existing daemons and nodes are untouched.

  • A daemon failure that does not belong to any in-progress grow campaign (for example, a pre-existing daemon dying for an unrelated reason) does not release parked jobs early and does not abort them; the grow proceeds unaffected.
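The grow rules above can be pictured with a small executable model. This is a sketch in Python, not PRRTE code: the GrowFence class, its method names, and the "mapping" / "aborted" outcome strings are hypothetical and exist only to illustrate the contract. It assumes a single grow campaign and does not model the rollback of already-started daemons.

```python
class GrowFence:
    """Toy model of job admission around a single grow campaign.

    Hypothetical illustration only; none of these names exist in PRRTE.
    """

    def __init__(self, campaign_daemons):
        self.members = set(campaign_daemons)     # daemons this campaign launches
        self.unreported = set(campaign_daemons)  # members that have not reported in
        self.active = True                       # the grow is in progress
        self.parked = []                         # jobs held at the map-eligible boundary
        self.outcome = {}                        # job -> "mapping" or "aborted"

    def job_reaches_map_boundary(self, job):
        if self.active:
            self.parked.append(job)              # held, never failed for arriving now
        else:
            self.outcome[job] = "mapping"        # no grow in progress: proceed

    def daemon_reported(self, daemon):
        self.unreported.discard(daemon)

    def wireup_distributed(self):
        # Complete only when every new daemon has reported in AND wireup went out.
        if self.active and not self.unreported:
            self._finish("mapping")

    def daemon_failed(self, daemon):
        # A member failure fails the whole grow; any other daemon is ignored here.
        if self.active and daemon in self.members:
            self._finish("aborted")

    def _finish(self, result):
        self.active = False
        for job in self.parked:
            self.outcome[job] = result
        self.parked.clear()


fence = GrowFence({"d5", "d6"})
fence.job_reaches_map_boundary("jobA")
fence.daemon_failed("d1")        # unrelated daemon: jobA stays parked
fence.daemon_reported("d5")
fence.daemon_reported("d6")
fence.wireup_distributed()       # grow complete: jobA proceeds to mapping
```

In this model a job is released to mapping only after both conditions of campaign completion hold, and a failure of a campaign member aborts every job parked on that campaign.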

8.3.1.4.2. Behavior during a shrink

While a shrink campaign is in progress:

  • A job that reaches the map-eligible boundary is parked, exactly as for a grow, and is admitted only once every in-progress size change has completed. This closes hazard 1 for the shrink case: a newly arriving job cannot be mapped onto a node that is in the act of leaving.

  • A job that has already completed mapping and reaches the launch boundary is parked there if, and only if, a shrink is in progress; it is held until every targeted daemon has departed. This closes hazard 2. A grow in progress does not park a job at the launch boundary, because a grow removes no node and so cannot invalidate an existing mapping.

  • When the shrink completes, each parked job is admitted according to whether the shrink invalidated its placement:

    • A job parked at the launch boundary whose processes were not assigned to any departing node proceeds directly to launch on its existing mapping.

    • A job parked at the launch boundary that had one or more processes assigned to a departing node is remapped onto the surviving nodes and then launched. Its placement after the change reflects the smaller DVM; the job is not failed.

    • A job parked at the map-eligible boundary is admitted to mapping and naturally considers only the surviving nodes.

  • Completion of a shrink is driven by the actual departure of each targeted daemon, not by any advance announcement of intent to leave. A daemon that exits cleanly in response to the shrink and a daemon that crashes during the shrink are indistinguishable to the contract: in both cases the node is leaving, and any job mapped onto it is remapped onto survivors as described above. No job is admitted until the departing daemons are genuinely gone and the DVM’s view of its membership has been updated to match.
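The three admission outcomes at shrink completion reduce to a small decision function. The following is a minimal sketch: the function name, the boundary labels, and the returned strings are hypothetical and are not PRRTE identifiers.

```python
def disposition_after_shrink(boundary, assigned_nodes, departing_nodes):
    """Return how a parked job is admitted once the shrink has completed."""
    if boundary == "map-eligible":
        # Never mapped: it simply maps onto whatever nodes survive.
        return "map"
    if boundary == "launch":
        if set(assigned_nodes) & set(departing_nodes):
            return "remap-then-launch"   # placement was invalidated by the shrink
        return "launch"                  # existing mapping is still valid
    raise ValueError(f"unknown boundary: {boundary!r}")
```

For example, a job parked at the launch boundary with processes on n1 and n3 is remapped when n3 departs, while a job on n1 and n2 launches on its existing mapping.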

8.3.1.5. Concurrency

  • Any number of grow campaigns, any number of shrink campaigns, or a mixture, may be in progress simultaneously. Parked jobs are admitted only when all in-progress campaigns have completed; no campaign can release another campaign’s held jobs, and no campaign’s completion is consumed by an unrelated daemon event.

  • A grow campaign in progress concurrently with a shrink does not stall a job at the launch boundary that has already been mapped onto surviving nodes; only an in-progress shrink holds jobs there.

  • Each campaign is accounted for independently, so an overlapping or partially-overlapping pair of campaigns cannot leave a job parked indefinitely once the last campaign it is waiting on completes.
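One way to picture independent per-campaign accounting is a table keyed by campaign, consulted at each boundary. This is a hedged sketch only: CampaignTable and its methods are hypothetical names, and the companion plans, not this model, describe the real data structures.

```python
import itertools


class CampaignTable:
    """Toy per-campaign accounting; every name here is hypothetical."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.in_progress = {}                 # campaign id -> "grow" or "shrink"

    def start(self, kind):
        if kind not in ("grow", "shrink"):
            raise ValueError(f"unknown campaign kind: {kind!r}")
        cid = next(self._ids)
        self.in_progress[cid] = kind
        return cid

    def complete(self, cid):
        self.in_progress.pop(cid, None)       # completing twice is harmless

    def must_park(self, boundary):
        kinds = set(self.in_progress.values())
        if boundary == "map-eligible":
            return bool(kinds)                # any size change holds new mappings
        if boundary == "launch":
            return "shrink" in kinds          # only a shrink holds mapped jobs
        raise ValueError(f"unknown boundary: {boundary!r}")


table = CampaignTable()
grow = table.start("grow")
shrink = table.start("shrink")
table.complete(shrink)
# Only the grow remains: the launch boundary is open, the map-eligible one is not.
```

Because each campaign has its own entry, completing one campaign cannot clear another's hold, and a job stops being parked as soon as the last entry it depends on is removed.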

8.3.1.6. Asynchronous size-change completion

A dynamic allocation request that grows or shrinks the DVM is answered in two phases: the runtime separates the point at which the request is accepted from the point at which the DVM actually reflects the requested size.

8.3.1.6.1. Two-phase model

  1. Acceptance (synchronous). When the runtime has finished processing the request — validated it, decided the resulting session/reservation, and initiated the corresponding grow or shrink — it returns the allocation response (status plus PMIX_ALLOC_ID, as specified in the companion allocation contract node-reservation-spec.rst). This response confirms only that the request was accepted and the DVM operation has begun. It does not assert that a grow’s new daemons are up and wired, or that a shrink’s targeted daemons have departed.

  2. Completion (asynchronous, by event). When the DVM operation later finishes — or fails — the runtime delivers a directed PMIx event to the size-change requester. This is the signal that the runtime is ready (or that the change will not happen), as distinct from the acceptance in phase one.

The two phases are decoupled because the DVM operation is inherently asynchronous and unbounded in time (daemon launch and wireup, or daemon termination and tree repair). Blocking the allocation response until the operation finished would stall the requester and entangle the request’s validity with unrelated launch/teardown failures. Decoupling lets the requester learn promptly that its request was accepted and then act only when the runtime actually reflects the new size.
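The decoupling can be illustrated with a toy runtime in which acceptance returns at once and the event arrives later. This is a sketch under stated assumptions: ToyRuntime, its methods, the "accepted" status string, and the "alloc-N" identifier format are all hypothetical; only the two event code names and PMIX_ALLOC_ID come from this specification.

```python
class ToyRuntime:
    """Toy two-phase model; not the PMIx or PRRTE API."""

    def __init__(self):
        self._pending = []      # accepted operations that have not finished
        self._count = 0

    def request_size_change(self, kind, on_event):
        """Phase one: accept and return immediately; nothing has finished yet."""
        self._count += 1
        alloc_id = f"alloc-{self._count}"
        self._pending.append((alloc_id, kind, on_event))
        return {"status": "accepted", "PMIX_ALLOC_ID": alloc_id}

    def finish_next(self, success=True):
        """Phase two: the oldest operation ends and its requester is notified."""
        alloc_id, _kind, on_event = self._pending.pop(0)
        code = "PMIX_DVM_IS_READY" if success else "PMIX_ERR_DVM_MOD"
        on_event(code, {"PMIX_ALLOC_ID": alloc_id})


events = []
runtime = ToyRuntime()
response = runtime.request_size_change(
    "grow", lambda code, info: events.append((code, info)))
# At this point the request is accepted but `events` is still empty.
runtime.finish_next(success=True)
```

The requester holds the allocation id from the synchronous response and uses it to recognize the later event.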

8.3.1.6.2. Resource release at shrink completion

The nodes a single shrink removes need not all come from the same underlying allocation. A PMIX_ALLOC_RELEASE may name resources that the runtime originally obtained from more than one source — for example nodes acquired through different resource managers, or a mixture of scheduler-provided and statically-configured nodes — so the set of departing nodes can span several allocations, each managed by a different resource component.

Accordingly, when a shrink completes — every targeted daemon has departed — the runtime offers the completed operation to each active resource component in turn before it emits the completion event, giving each the opportunity to release the share of the departing resources that belongs to it back to its resource manager. What a component does with that opportunity is up to the component: it may return the nodes to a scheduler, defer, or do nothing if it has no stake in the operation. The runtime guarantees only the ordering — that the release cycle is offered to every component, and runs to completion, before the completion event is delivered. It does not guarantee that any particular resource was in fact handed back, since that is the component’s decision, not the runtime’s.
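The ordering guarantee can be sketched as a loop over components followed by the event. The function and the two example components below are hypothetical; a real resource component is not a Python callable, and this model says nothing about error handling within a component.

```python
def complete_shrink(components, departed_nodes, emit_event):
    """Offer the finished shrink to every component, then emit the event."""
    for offer_release in components:
        offer_release(departed_nodes)   # may release its share, defer, or do nothing
    emit_event("PMIX_DVM_IS_READY")     # only after the whole release cycle has run


log = []
components = [
    # A component that hands its nodes back to a scheduler.
    lambda nodes: log.append(("scheduler-backed", sorted(nodes))),
    # A component with no stake in this operation.
    lambda nodes: None,
]
complete_shrink(components, {"n3", "n4"},
                lambda code: log.append(("event", code)))
```

After the call, the log shows the release entry first and the event entry last, which is the only ordering the runtime promises.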

8.3.1.6.3. Success event — PMIX_DVM_IS_READY

When the DVM operation completes successfully — for a grow, the new daemons are launched and wired into the DVM; for a shrink, every targeted daemon has departed, the routing tree is repaired, and each resource component has been given the opportunity to release the freed resources back to its resource manager (see Resource release at shrink completion) — the runtime delivers a PMIX_DVM_IS_READY event to the requester. After this event the DVM reflects the requested size: a grow’s new nodes are available to spawn onto, a shrink’s removed nodes are gone. The event payload carries:

  • PMIX_ALLOC_ID (char*) — the allocation whose operation completed; always present.

  • PMIX_ALLOC_REQ_ID (char*) — the requester’s own request id, included whenever one was supplied on the original request, so the recipient can match the event by either identifier.

8.3.1.6.4. Failure event — PMIX_ERR_DVM_MOD

If the accepted DVM modification fails — a grow cannot launch or wire its new daemons (and is rolled back per Failure semantics), a shrink cannot be carried out, or any other condition leaves the requested change unrealized — the runtime delivers a PMIX_ERR_DVM_MOD event to the requester. The event states that no (further) DVM modification will be made for this request and that the DVM has been returned to a stable state (for a failed grow, its pre-grow membership). The payload carries:

  • PMIX_ALLOC_ID (char*) — always present.

  • PMIX_ALLOC_REQ_ID (char*) — when one was supplied.

  • The underlying cause — the specific pmix_status_t that prevented the modification (for example a daemon-launch failure versus a resource error), conveyed in the event’s info array so the requester can distinguish what went wrong rather than only that something did.

Both events are directed to the requesting process only — they are not broadcast — mirroring the delivery of PMIX_ALLOC_TIMEOUT_WARNING specified in node-reservation-spec.rst.
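The payload rules for the two events can be summarized in a short model. This is an illustration only: completion_event and matches are hypothetical helpers, and the "cause" key is a placeholder, since this specification says the cause travels in the event's info array but does not name the key.

```python
def completion_event(alloc_id, req_id=None, cause=None):
    """Build (code, payload) for the terminal event of one accepted request."""
    payload = {"PMIX_ALLOC_ID": alloc_id}            # always present
    if req_id is not None:
        payload["PMIX_ALLOC_REQ_ID"] = req_id        # only when the requester supplied one
    if cause is None:
        return "PMIX_DVM_IS_READY", payload
    payload["cause"] = cause                         # hypothetical key for the failing status
    return "PMIX_ERR_DVM_MOD", payload


def matches(payload, alloc_id=None, req_id=None):
    """True if the event refers to the given allocation id or request id."""
    if alloc_id is not None and payload.get("PMIX_ALLOC_ID") == alloc_id:
        return True
    return req_id is not None and payload.get("PMIX_ALLOC_REQ_ID") == req_id
```

A requester that supplied its own request id can match on either identifier; one that did not can match only on PMIX_ALLOC_ID.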

8.3.1.6.5. Delivery guarantees

  • Exactly one terminal event per operation. Each accepted request that initiates a DVM grow or shrink yields exactly one of PMIX_DVM_IS_READY or PMIX_ERR_DVM_MOD to its requester.

  • No event for a phase-one rejection. A request rejected during processing (the error cases in the companion node-reservation-spec.rst, e.g. a malformed or unauthorized request) fails synchronously in the allocation response and produces no completion event; the phase-two events report only the outcome of an accepted request’s DVM operation.

  • No event when nothing changes. A request that is accepted but initiates no actual DVM size change (for example an extend that adds no new daemons) is fully complete at acceptance and emits no asynchronous event.

  • No requester, no directed event. A size change with no PMIx requester (a scheduler-driven push) updates the runtime’s own state but has no specific process to direct a completion event to; none is sent.
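Taken together, the four delivery rules amount to a single decision. The function below is a hypothetical sketch of that decision, with made-up parameter names, and is not part of PRRTE.

```python
def terminal_event(accepted, changes_size, has_requester, succeeded):
    """Return the one event code the requester receives, or None for no event."""
    if not accepted:
        return None          # phase-one rejection: reported synchronously only
    if not changes_size:
        return None          # nothing to wait for: complete at acceptance
    if not has_requester:
        return None          # e.g. a scheduler-driven push: nobody to notify
    return "PMIX_DVM_IS_READY" if succeeded else "PMIX_ERR_DVM_MOD"
```

Every path returns at most one code, which mirrors the guarantee of exactly one terminal event per accepted, size-changing, requester-initiated operation.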

8.3.1.6.6. Proposed PMIx status codes

This contract requires two PMIx event codes that PRRTE cannot define on its own (they belong to the PMIx standard and headers):

Code | Meaning

PMIX_DVM_IS_READY | Non-error event: an accepted DVM size change has completed and the runtime now reflects the new size. Carries PMIX_ALLOC_ID and, when supplied, PMIX_ALLOC_REQ_ID.

PMIX_ERR_DVM_MOD | Error event: an accepted DVM size change failed and will not be made; the DVM has returned to a stable state. Carries PMIX_ALLOC_ID, the underlying failure pmix_status_t, and, when supplied, PMIX_ALLOC_REQ_ID.

Because PMIx status codes are plain preprocessor #defines, their use is guarded at build time by a simple check for the codes’ presence in the installed PMIx headers — no PMIx capability flag is required. A PRRTE built against a PMIx that defines neither code simply omits the completion notification (see Backward compatibility and transparency).

8.3.1.7. Failure semantics

Event | Observable outcome

A grow-target daemon fails before its grow completes | The grow fails as a whole and is rolled back: every daemon the same campaign had already started is terminated and its nodes leave the DVM, restoring the pre-grow membership; all jobs parked on account of the grow are aborted and never launch.

A daemon unrelated to any in-progress grow fails during a grow | The grow is unaffected; parked jobs are neither released early nor aborted.

A shrink-target daemon exits cleanly | Counts as that target’s departure; when the last target departs the shrink completes and parked jobs are admitted (remapped if they were on a departing node).

A shrink-target daemon crashes during the shrink | Handled identically to a clean exit: the node is leaving regardless, and jobs mapped onto it are remapped onto survivors.

The same departing daemon emits more than one failure event | Counted once; a repeated event for an already-departed target has no further effect on admission.

The controlling daemon job is torn down with grow campaigns still pending | The pending grows are drained as failures so that no job is left parked across the teardown.

In every case a parked job has no partial effect: until it is admitted it has launched no processes, and a job aborted because its grow failed leaves nothing running.
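The shrink rows of the table (clean exit, crash, and repeated events) can be captured by tracking the set of targets still present. The following is a minimal sketch with hypothetical names; it is not the tracking structure described in the companion plans.

```python
class ShrinkCampaign:
    """Toy departure tracking for one shrink; all names are hypothetical."""

    def __init__(self, target_ranks):
        self.remaining = set(target_ranks)   # targeted daemons still in the DVM

    @property
    def complete(self):
        return not self.remaining

    def daemon_gone(self, rank, crashed=False):
        """Record a departure; `crashed` is accepted but deliberately ignored."""
        self.remaining.discard(rank)         # repeat or non-target reports are no-ops
        return self.complete
```

Using a set makes the count idempotent: a second report for rank 3, or a report for a daemon that is not a target, leaves the campaign's state unchanged.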

8.3.1.8. Observability

This feature introduces two externally visible artifacts.

The first is a job state. A parked job is reported as PRTE_JOB_STATE_WAITING_FOR_DAEMONS in verbose and debugging output (for example under --prtemca state_base_verbose), so an operator can see precisely why a job has not yet been mapped or launched. The state is a passive marker: it triggers no callback and changes no other behavior. Once the size change the job is waiting on completes, the job advances out of this state on its own.

The second is the pair of completion events described under Asynchronous size-change completion: PMIX_DVM_IS_READY on success and PMIX_ERR_DVM_MOD on failure, each directed to the requester of the size change and carrying the allocation identifiers (and, on failure, the underlying cause). Unlike the job state, these are an active interface a requester registers an event handler for; they are the requester’s only signal that the asynchronous DVM operation has finished.

8.3.1.9. Backward compatibility and transparency

The job-admission contract is transparent to every caller:

  • It is inert unless the DVM is in elastic mode (prte_elastic_mode, off by default). A DVM started without that parameter is fixed-size and runs exactly as it did before this feature — the launch fence is never raised, no job is ever parked, and daemon losses follow the ordinary error path.

  • No new command-line option, environment variable, or PMIx attribute is defined or required for it (prte_elastic_mode already existed). A tool, application, or scheduler issues the same requests it always has.

  • On a DVM that never grows or shrinks, no job is ever parked and behavior is identical to a non-elastic DVM.

  • The only difference a submitting caller can observe when a size change is in progress is a launch delay for an affected job and, in debugging output, the PRTE_JOB_STATE_WAITING_FOR_DAEMONS state — never a spurious failure and never a launch onto a node that is not ready or is leaving.

The completion contract adds the two new PMIx event codes, which are optional and degrade cleanly:

  • PMIX_DVM_IS_READY and PMIX_ERR_DVM_MOD are delivered only to a requester that registers a handler for them; a requester that ignores them is unaffected beyond losing the completion signal.

  • When the underlying PMIx defines neither code, a PRRTE built against it (the call sites are guarded by a preprocessor check for the two #defined status codes) omits the asynchronous completion notification entirely. The allocation response (phase one) is unchanged, so a request is still accepted and the DVM still grows or shrinks; the requester simply receives no event-based signal that the operation finished or failed, exactly as before this feature existed. This is a functional gap, not merely a cosmetic one: without the event a requester cannot reliably know when the runtime is ready and must fall back to whatever coarse means it used previously.

8.3.1.10. Conformance summary

A conforming implementation guarantees that:

  1. A job submitted while a size change is in progress is held rather than failed: it never fails solely on account of the change, and it is then admitted onto the post-change node set. The sole exception is a job whose grow dependency fails, which is aborted.

  2. A job is never mapped onto a node whose daemon is not yet wired into the DVM (grow) or is a departure target (shrink).

  3. A job is never sent launch data destined for a departing daemon; a job already mapped onto a departing node is remapped onto surviving nodes before it launches.

  4. A daemon failure affects only the jobs waiting on the campaign that failure belongs to; an unrelated daemon loss neither releases nor aborts parked jobs.

  5. A grow that fails is rolled back to the DVM’s pre-grow membership: the campaign’s already-started daemons are terminated and its nodes leave the DVM, so a failed grow never leaves the DVM half-extended. The rollback is scoped to the failed campaign and leaves concurrent campaigns and pre-existing daemons untouched.

  6. Shrink completion is driven by the actual departure of every targeted daemon — clean exit and crash being indistinguishable — and is idempotent against a daemon that reports its departure more than once.

  7. Concurrent campaigns of either kind never deadlock a parked job: it is admitted as soon as, and only when, every campaign it is waiting on has completed.

  8. A dynamic allocation that drives a size change is answered in two phases: a synchronous response on acceptance (which does not assert the operation has finished), and exactly one asynchronous terminal event — PMIX_DVM_IS_READY on success or PMIX_ERR_DVM_MOD (carrying the underlying cause) on failure — directed to the requester when the DVM operation completes. A phase-one rejection, a request that changes nothing, and a requester-less scheduler push each produce no such event.

  9. The job-admission contract adds no caller-visible interface; the completion contract adds only the two optional PMIx event codes above, and when the underlying PMIx lacks them the runtime omits the completion notification while leaving every other guarantee intact. The job state PRTE_JOB_STATE_WAITING_FOR_DAEMONS remains observable for a parked job in verbose and debugging output.