.. _elastic-dvm-spec-label: Elastic DVM: Specification ========================== Purpose ------- This document specifies the externally observable behavior of the DVM while it changes size — what an application, tool, or scheduler may rely on when the DVM **grows** (a new-daemon launch campaign) or **shrinks** (a node-removal campaign). It covers two distinct audiences and contracts: * **Job admission** — what happens to an application job that is submitted *while* a grow or shrink is in progress (the bulk of this document). * **Size-change completion** — how the process that *requested* the size change learns, asynchronously, whether the DVM operation eventually succeeded or failed (see `Asynchronous size-change completion`_). It defines *what* the runtime guarantees, not *how* it achieves it. The companion design plans — :ref:`elastic-dvm-plan-label` (the shared fence mechanism and completion-event helper), :ref:`dvm-grow-campaign-label` (the grow path's per-campaign accounting), and :ref:`dvm-shrink-campaign-label` (the shrink path's campaign tracking) — describe the internal data structures, code paths, and implementation order. Where this specification and those plans disagree about observable behavior, **this specification is authoritative** and the plan must be corrected. The whole of this behavior is gated on the DVM being in **elastic mode**, selected by the pre-existing ``prte_elastic_mode`` MCA parameter (off by default). When it is not set the DVM is fixed-size: none of the job-admission deferral, parking, or completion-event machinery is active, and the runtime behaves exactly as it did before this feature — a daemon loss, for instance, is handled by the ordinary error path rather than absorbed as a campaign event. Everything below describes the behavior **when elastic mode is enabled**. Within elastic mode, the job-admission guarantees are stated purely in terms of job lifecycle outcomes and introduce **no** new command-line options, environment variables, or PMIx attributes: the grow and shrink triggers that already exist (``--add-host`` / ``--add-hostfile``, a scheduler-driven daemon launch, and a ``PMIX_ALLOC_RELEASE`` that removes nodes) simply acquire correct concurrency semantics. The completion contract does introduce two new PMIx event (status) codes — ``PMIX_DVM_IS_READY`` and ``PMIX_ERR_DVM_MOD`` — used to notify the requester when the asynchronous DVM operation finishes; these are the only new caller-visible interface this feature adds, and both are optional (see `Backward compatibility and transparency`_). Scope ----- In scope ~~~~~~~~ * The admission guarantee for an application job that reaches the map-eligible boundary while a grow or shrink is in progress. * The placement guarantee for a job launched across a size change — onto which set of nodes (pre- or post-change) it is permitted to map. * The disposition of a job that was already mapped onto a node that is about to depart in a shrink. * The outcome of a daemon failure that occurs during a grow or a shrink, the distinction between a failure that belongs to the in-progress campaign and an unrelated one, and the rollback of a failed grow to the DVM's pre-grow membership. * The behavior when grow and shrink campaigns, or multiple campaigns of the same kind, overlap in time. * The two-phase model by which a dynamic allocation request that drives a size change is answered: a synchronous response when the request is *accepted*, and a later asynchronous event when the DVM operation *completes* or *fails*. * The two new PMIx event codes — ``PMIX_DVM_IS_READY`` and ``PMIX_ERR_DVM_MOD`` — that carry the asynchronous completion result to the requester, and the allocation-identifying payload each delivers. * The ``PRTE_JOB_STATE_WAITING_FOR_DAEMONS`` job state reported for a parked job in verbose and debugging output. Out of scope / non-goals ~~~~~~~~~~~~~~~~~~~~~~~~~~ * The *policy* that decides when the DVM grows or shrinks. That decision is driven entirely by the existing triggers (a tool or application expansion request, a scheduler action, or an allocation release); this specification governs only how concurrent job submissions behave around such a change, never whether the change happens. * Node selection — which physical nodes are added or removed — is the responsibility of the resource manager and the existing allocation and mapping machinery, and is unchanged. * The wire encoding of internal messages, the names and types of internal counters and lists, and the precise timing of internal state transitions are implementation details and are not specified here. * This document does not redefine the meaning of any DVM size-change trigger; it specifies only the admission and placement guarantees that now hold while one is in flight. Definitions ----------- Grow campaign A single in-progress launch of one or more new daemons that extends the DVM onto additional nodes. A grow campaign is *complete* when every new daemon has reported in and the head node has distributed the wireup (nidmap) information to the DVM. Shrink campaign A single in-progress removal of one or more nodes from the DVM, initiated by an allocation release. A shrink campaign targets a fixed set of daemon ranks and is *complete* when every targeted daemon has actually departed the DVM. Size change A grow campaign or a shrink campaign. More than one may be in progress at the same time. Map-eligible boundary The point in a job's lifecycle at which it is ready to be assigned to nodes (the ``VM_READY → MAP`` transition). A job that has not yet crossed this boundary has no node assignments. Launch boundary The point at which a job's mapping is complete and the runtime is ready to send launch data to the daemons that will host its processes (the ``LAUNCH_APPS`` transition). A job at this boundary already has node assignments. Parked job A job that has been held at the map-eligible boundary or the launch boundary because a size change is in progress. A parked job is reported in the ``PRTE_JOB_STATE_WAITING_FOR_DAEMONS`` state. Parking is invisible to the submitting tool beyond a delay in launch; the job is neither failed nor restarted. Surviving node A node that remains in the DVM after a shrink campaign completes. Departing node A node whose daemon is a target of an in-progress shrink campaign. Size-change requester The process whose request initiated a grow or shrink — for a dynamic allocation this is the process that issued the ``PMIX_ALLOC_NEW`` / ``PMIX_ALLOC_EXTEND`` (grow) or ``PMIX_ALLOC_RELEASE`` (shrink) request. A size change initiated without a PMIx requester (for example a scheduler pushing daemons directly) has no requester. Request acceptance The point at which the runtime has finished *processing* a size-change request — validated it and initiated the corresponding DVM operation — and returns its synchronous response. Acceptance is **phase one**; it does not assert that the operation has finished. Operation completion The point at which the initiated DVM operation actually finishes — the grow's new daemons are launched and wired, or the shrink's targeted daemons have departed and the routing tree is repaired — or definitively fails. Completion is **phase two** and is reported asynchronously by event (see `Asynchronous size-change completion`_). Admission contract ------------------- The central guarantee is one of **non-destructive deferral**: A job submitted while a size change is in progress is never failed merely because the DVM is changing size. It is held until the change completes and then admitted onto the post-change set of nodes — except that a job whose admission depends on a grow that *fails* is aborted rather than launched onto an incomplete DVM. Two distinct placement hazards are closed by this contract, and a conforming implementation must close both: #. **A job must never be mapped onto a node whose daemon is not ready.** During a grow this means a node whose daemon has started but has not yet received its wireup information; during a shrink it means a node whose daemon is a departure target. #. **A job must never have launch data sent to a daemon that is departing.** A job that completed mapping *before* a shrink began, and was placed on a node now targeted for removal, must not transmit its launch message to that daemon. Behavior during a grow ~~~~~~~~~~~~~~~~~~~~~~~~ While a grow campaign is in progress: * A job that reaches the map-eligible boundary is parked. It is admitted only once **every** in-progress grow campaign has completed — that is, only after the new daemons are not merely running but fully wired into the DVM. This guarantees hazard 1: the job cannot be mapped onto a node whose daemon is up but not yet wired. * A job that had already crossed the map-eligible boundary before the grow began is **unaffected**: it continues to launch on the nodes it was assigned, and a grow never stalls it. * When the grow completes successfully, every job parked at the map-eligible boundary is admitted and proceeds to mapping, now able to consider the newly added nodes. * If a daemon belonging to the grow campaign **fails** before the campaign completes, the grow is treated as failed as a whole and is **rolled back**: every daemon that the same campaign had already started is terminated, and the nodes the campaign was adding leave the DVM, so the DVM is restored to exactly the membership it had before that campaign began. A failed grow never leaves the DVM half-extended with a partial, un-wired set of new daemons. Every job parked on account of the grow is then aborted (it never launches) rather than being admitted onto an incomplete DVM. This matches the first-failure semantics of a non-elastic launch. The rollback is scoped to the failed campaign: an unrelated grow campaign running concurrently keeps its own daemons and completes normally, and pre-existing daemons and nodes are untouched. * A daemon failure that does **not** belong to any in-progress grow campaign (for example, a pre-existing daemon dying for an unrelated reason) does not release parked jobs early and does not abort them; the grow proceeds unaffected. Behavior during a shrink ~~~~~~~~~~~~~~~~~~~~~~~~~~ While a shrink campaign is in progress: * A job that reaches the map-eligible boundary is parked, exactly as for a grow, and is admitted only once every in-progress size change has completed. This closes hazard 1 for the shrink case: a newly arriving job cannot be mapped onto a node that is in the act of leaving. * A job that had already completed mapping and reaches the launch boundary is parked **if and only if a shrink is in progress**, until every targeted daemon has departed. This closes hazard 2. A grow in progress does *not* park a job at the launch boundary, because a grow removes no node and so cannot invalidate an existing mapping. * When the shrink completes, each parked job is admitted according to whether the shrink invalidated its placement: - A job parked at the launch boundary whose processes were **not** assigned to any departing node proceeds directly to launch on its existing mapping. - A job parked at the launch boundary that had one or more processes assigned to a departing node is **remapped** onto the surviving nodes and then launched. Its placement after the change reflects the smaller DVM; the job is not failed. - A job parked at the map-eligible boundary is admitted to mapping and naturally considers only the surviving nodes. * Completion of a shrink is driven by the **actual departure** of each targeted daemon, not by any advance announcement of intent to leave. A daemon that exits cleanly in response to the shrink and a daemon that crashes during the shrink are indistinguishable to the contract: in both cases the node is leaving, and any job mapped onto it is remapped onto survivors as described above. No job is admitted until the departing daemons are genuinely gone and the DVM's view of its membership has been updated to match. Concurrency ----------- * Any number of grow campaigns, any number of shrink campaigns, or a mixture, may be in progress simultaneously. Parked jobs are admitted only when **all** in-progress campaigns have completed; no campaign can release another campaign's held jobs, and no campaign's completion is consumed by an unrelated daemon event. * A grow campaign in progress concurrently with a shrink does not stall a job at the launch boundary that has already been mapped onto surviving nodes; only an in-progress shrink holds jobs there. * Each campaign is accounted for independently, so an overlapping or partially-overlapping pair of campaigns cannot leave a job parked indefinitely once the last campaign it is waiting on completes. Asynchronous size-change completion ----------------------------------- A dynamic allocation request that grows or shrinks the DVM is answered in **two phases**, and the runtime separates the point at which the *allocation is complete* from the point at which the *runtime is ready*. Two-phase model ~~~~~~~~~~~~~~~ #. **Acceptance (synchronous).** When the runtime has finished *processing* the request — validated it, decided the resulting session/reservation, and initiated the corresponding grow or shrink — it returns the allocation response (status plus ``PMIX_ALLOC_ID``, as specified in the companion allocation contract ``node-reservation-spec.rst``). This response confirms only that the request was **accepted** and the DVM operation has **begun**. It does *not* assert that a grow's new daemons are up and wired, or that a shrink's targeted daemons have departed. #. **Completion (asynchronous, by event).** When the DVM operation later finishes — or fails — the runtime delivers a directed PMIx event to the **size-change requester**. This is the signal that the *runtime is ready* (or that the change will not happen), as distinct from the acceptance in phase one. The two phases are decoupled because the DVM operation is inherently asynchronous and unbounded in time (daemon launch and wireup, or daemon termination and tree repair). Blocking the allocation response until the operation finished would stall the requester and entangle the request's validity with unrelated launch/teardown failures. Decoupling lets the requester learn promptly that its request was accepted and then act only when the runtime actually reflects the new size. Resource release at shrink completion ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The nodes a single shrink removes need not all come from the same underlying allocation. A ``PMIX_ALLOC_RELEASE`` may name resources that the runtime originally obtained from more than one source — for example nodes acquired through different resource managers, or a mixture of scheduler-provided and statically-configured nodes — so the set of departing nodes can span several allocations, each managed by a different resource component. Accordingly, when a shrink completes — every targeted daemon has departed — the runtime offers the completed operation to **each** active resource component in turn *before* it emits the completion event, giving each the opportunity to release the share of the departing resources that belongs to it back to its resource manager. What a component does with that opportunity is up to the component: it may return the nodes to a scheduler, defer, or do nothing if it has no stake in the operation. The runtime guarantees only the *ordering* — that the release cycle is offered to every component, and runs to completion, before the completion event is delivered. It does not guarantee that any particular resource was in fact handed back, since that is the component's decision, not the runtime's. Success event — ``PMIX_DVM_IS_READY`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When the DVM operation completes successfully — for a grow, the new daemons are launched and wired into the DVM; for a shrink, every targeted daemon has departed, the routing tree is repaired, and each resource component has been given the opportunity to release the freed resources back to its resource manager (see `Resource release at shrink completion`_) — the runtime delivers a ``PMIX_DVM_IS_READY`` event to the requester. After this event the DVM reflects the requested size: a grow's new nodes are available to spawn onto, a shrink's removed nodes are gone. The event payload carries: * ``PMIX_ALLOC_ID`` (``char*``) — the allocation whose operation completed; always present. * ``PMIX_ALLOC_REQ_ID`` (``char*``) — the requester's own request id, included whenever one was supplied on the original request, so the recipient can match the event by either identifier. Failure event — ``PMIX_ERR_DVM_MOD`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If the accepted DVM modification fails — a grow cannot launch or wire its new daemons (and is rolled back per `Failure semantics`_), a shrink cannot be carried out, or any other condition leaves the requested change unrealized — the runtime delivers a ``PMIX_ERR_DVM_MOD`` event to the requester. The event states that **no (further) DVM modification will be made** for this request and that the DVM has been returned to a stable state (for a failed grow, its pre-grow membership). The payload carries: * ``PMIX_ALLOC_ID`` (``char*``) — always present. * ``PMIX_ALLOC_REQ_ID`` (``char*``) — when one was supplied. * The **underlying cause** — the specific ``pmix_status_t`` that prevented the modification (for example a daemon-launch failure versus a resource error), conveyed in the event's info array so the requester can distinguish what went wrong rather than only that *something* did. Both events are **directed to the requesting process only** — they are not broadcast — mirroring the delivery of ``PMIX_ALLOC_TIMEOUT_WARNING`` specified in ``node-reservation-spec.rst``. Delivery guarantees ~~~~~~~~~~~~~~~~~~~ * **Exactly one terminal event per operation.** Each accepted request that initiates a DVM grow or shrink yields exactly one of ``PMIX_DVM_IS_READY`` or ``PMIX_ERR_DVM_MOD`` to its requester. * **No event for a phase-one rejection.** A request rejected during *processing* (the error cases in the companion ``node-reservation-spec.rst``, e.g. a malformed or unauthorized request) fails synchronously in the allocation response and produces **no** completion event; the phase-two events report only the outcome of an *accepted* request's DVM operation. * **No event when nothing changes.** A request that is accepted but initiates no actual DVM size change (for example an extend that adds no new daemons) is fully complete at acceptance and emits no asynchronous event. * **No requester, no directed event.** A size change with no PMIx requester (a scheduler-driven push) updates the runtime's own state but has no specific process to direct a completion event to; none is sent. Proposed PMIx status codes ~~~~~~~~~~~~~~~~~~~~~~~~~~~ This contract requires two PMIx event codes that PRRTE cannot define on its own (they belong to the PMIx standard and headers): .. list-table:: :header-rows: 1 :widths: 30 70 * - Code - Meaning * - ``PMIX_DVM_IS_READY`` - Non-error event: an accepted DVM size change has completed and the runtime now reflects the new size. Carries ``PMIX_ALLOC_ID`` and, when supplied, ``PMIX_ALLOC_REQ_ID``. * - ``PMIX_ERR_DVM_MOD`` - Error event: an accepted DVM size change failed and will not be made; the DVM has returned to a stable state. Carries ``PMIX_ALLOC_ID``, the underlying failure ``pmix_status_t``, and — when supplied — ``PMIX_ALLOC_REQ_ID``. Because PMIx status codes are plain preprocessor ``#define``\ s, their use is guarded at build time by a simple check for the codes' presence in the installed PMIx headers — no PMIx *capability* flag is required. A PRRTE built against a PMIx that defines neither code simply omits the completion notification (see `Backward compatibility and transparency`_). Failure semantics ----------------- .. list-table:: :header-rows: 1 :widths: 50 50 * - Event - Observable outcome * - A grow-target daemon fails before its grow completes - The grow fails as a whole and is rolled back: every daemon the same campaign had already started is terminated and its nodes leave the DVM, restoring the pre-grow membership; all jobs parked on account of the grow are aborted and never launch. * - A daemon unrelated to any in-progress grow fails during a grow - The grow is unaffected; parked jobs are neither released early nor aborted. * - A shrink-target daemon exits cleanly - Counts as that target's departure; when the last target departs the shrink completes and parked jobs are admitted (remapped if they were on a departing node). * - A shrink-target daemon crashes during the shrink - Handled identically to a clean exit: the node is leaving regardless, and jobs mapped onto it are remapped onto survivors. * - The same departing daemon emits more than one failure event - Counted once; a repeated event for an already-departed target has no further effect on admission. * - The controlling daemon job is torn down with grow campaigns still pending - The pending grows are drained as failures so that no job is left parked across the teardown. In every case a parked job has no partial effect: until it is admitted it has launched no processes, and a job aborted because its grow failed leaves nothing running. Observability ------------- This feature introduces two externally visible artifacts. The first is a **job state**. A parked job is reported as ``PRTE_JOB_STATE_WAITING_FOR_DAEMONS`` in verbose and debugging output (for example under ``--prtemca state_base_verbose``), so an operator can see precisely why a job has not yet been mapped or launched. The state is a passive marker: it triggers no callback and changes no other behavior. Once the size change the job is waiting on completes, the job advances out of this state on its own. The second is the pair of **completion events** described under `Asynchronous size-change completion`_: ``PMIX_DVM_IS_READY`` on success and ``PMIX_ERR_DVM_MOD`` on failure, each directed to the requester of the size change and carrying the allocation identifiers (and, on failure, the underlying cause). Unlike the job state, these are an active interface a requester registers an event handler for; they are the requester's only signal that the asynchronous DVM operation has finished. Backward compatibility and transparency ---------------------------------------- The **job-admission** contract is transparent to every caller: * It is inert unless the DVM is in elastic mode (``prte_elastic_mode``, off by default). A DVM started without that parameter is fixed-size and runs exactly as it did before this feature — the launch fence is never raised, no job is ever parked, and daemon losses follow the ordinary error path. * No new command-line option, environment variable, or PMIx attribute is defined or required for it (``prte_elastic_mode`` already existed). A tool, application, or scheduler issues the same requests it always has. * On a DVM that never grows or shrinks, no job is ever parked and behavior is identical to a non-elastic DVM. * The only difference a submitting caller can observe when a size change *is* in progress is a launch delay for an affected job and, in debugging output, the ``PRTE_JOB_STATE_WAITING_FOR_DAEMONS`` state — never a spurious failure and never a launch onto a node that is not ready or is leaving. The **completion** contract adds the two new PMIx event codes, which are optional and degrade cleanly: * ``PMIX_DVM_IS_READY`` and ``PMIX_ERR_DVM_MOD`` are delivered only to a requester that registers a handler for them; a requester that ignores them is unaffected beyond losing the completion signal. * When the underlying PMIx defines neither code, a PRRTE built against it (the call sites are guarded by a preprocessor check for the two ``#define``\ d status codes) omits the asynchronous completion notification entirely. The allocation response (phase one) is unchanged, so a request is still accepted and the DVM still grows or shrinks; the requester simply receives no event-based signal that the operation finished or failed, exactly as before this feature existed. This is a functional gap, not merely a cosmetic one: without the event a requester cannot reliably know when the runtime is ready and must fall back to whatever coarse means it used previously. Conformance summary ------------------- A conforming implementation guarantees that: #. A job submitted while a size change is in progress is held, not failed, solely on account of the change — and is then admitted onto the post-change node set, with the sole exception of a job whose grow dependency fails, which is aborted. #. A job is never mapped onto a node whose daemon is not yet wired into the DVM (grow) or is a departure target (shrink). #. A job is never sent launch data destined for a departing daemon; a job already mapped onto a departing node is remapped onto surviving nodes before it launches. #. A daemon failure affects only the jobs waiting on the campaign that failure belongs to; an unrelated daemon loss neither releases nor aborts parked jobs. #. A grow that fails is rolled back to the DVM's pre-grow membership: the campaign's already-started daemons are terminated and its nodes leave the DVM, so a failed grow never leaves the DVM half-extended. The rollback is scoped to the failed campaign and leaves concurrent campaigns and pre-existing daemons untouched. #. Shrink completion is driven by the actual departure of every targeted daemon — clean exit and crash being indistinguishable — and is idempotent against a daemon that reports its departure more than once. #. Concurrent campaigns of either kind never deadlock a parked job: it is admitted once, and only once, every campaign it is waiting on has completed. #. A dynamic allocation that drives a size change is answered in two phases: a synchronous response on acceptance (which does not assert the operation has finished), and exactly one asynchronous terminal event — ``PMIX_DVM_IS_READY`` on success or ``PMIX_ERR_DVM_MOD`` (carrying the underlying cause) on failure — directed to the requester when the DVM operation completes. A phase-one rejection, a request that changes nothing, and a requester-less scheduler push each produce no such event. #. The job-admission contract adds no caller-visible interface; the completion contract adds only the two optional PMIx event codes above, and when the underlying PMIx lacks them the runtime omits the completion notification while leaving every other guarantee intact. The job state ``PRTE_JOB_STATE_WAITING_FOR_DAEMONS`` remains observable for a parked job in verbose and debugging output.