.. _elastic-dvm-plan-label:

Elastic DVM Implementation Plan
===============================

This document describes the implementation of the **launch fence** — the
shared mechanism that serialises application-job dispatch against in-progress
DVM grow and shrink campaigns, closing the race between a DVM size change and
concurrently-running application jobs.  For background on the race itself see
:ref:`state-machine-label`, section *DVM Extension and the Daemon-Launch Race*.

The externally observable contract this implementation delivers — the
job-admission and placement guarantees, and the two-phase completion
notification — is specified in :ref:`elastic-dvm-spec-label`, which is
authoritative for observable behavior.  Where this plan and that specification
disagree about observable behavior, the specification wins and this plan must
be corrected.

This plan is the **parent** of two campaign-specific plans:

* :ref:`dvm-grow-campaign-label` — the grow (daemon-launch) path's per-campaign
  fence accounting, failure rollback, and success/failure completion events.
* :ref:`dvm-shrink-campaign-label` — the shrink (node-removal) path's campaign
  tracking, the second (``LAUNCH_APPS``) hold point, completion detection, the
  RM-side resource release that runs at completion (a release may span multiple
  allocations, so every active RAS module is offered the completed campaign),
  and completion events.

It covers the **shared** infrastructure both paths build on: the fence counter,
the held-job arrays, the ``VM_READY → MAP`` hold point, the fence-release
helper, and the completion-event emission common to both.

The mechanism is a **global launch fence** — a counter
(``prte_dvm_launch_fence``) that tracks the number of in-progress daemon launch
campaigns.  An app job that reaches the ``VM_READY → MAP`` transition checks the
fence; if it is nonzero the job parks itself in a held-job array
(``prte_held_jobs``) and is released when the fence reaches zero.

The state machine is single-threaded on the progress thread, so no locking is
required anywhere in this plan.

**Gated on elastic mode.**  Every piece of this machinery is active only when
the DVM is in elastic mode (the pre-existing ``prte_elastic_mode`` MCA
parameter, off by default).  The gate is applied at the two points that *raise*
the fence — grow-campaign creation in ``setup_virtual_machine()`` and
shrink-campaign creation in the ``PMIX_ALLOC_RELEASE`` handler — so that outside
elastic mode the fence is never raised, the campaign lists stay empty, and every
downstream check (the ``VM_READY`` and ``LAUNCH_APPS`` holds, the errmgr
campaign matching, the drains) is naturally inert.  The consumer sites also
carry an explicit ``prte_elastic_mode`` guard so the non-elastic path is
provably identical to the pre-feature behavior — in particular,
``prte_plm_base_grow_target_failed()`` returns ``false`` immediately, so a
daemon loss on a fixed-size DVM is handled by the ordinary errmgr abort path
exactly as before.

.. note::
   The app-triggered expansion path (``--add-host`` / ``--add-hostfile``)
   already sets ``prte_dvm_ready = false`` in ``add_hosts()`` before posting
   the asynchronous RAS modify request, which causes newly-arriving jobs to be
   stashed in ``prte_cache`` rather than dispatched immediately.  The launch
   fence is still required for the scheduler-push path (e.g., Slurm firing
   ``LAUNCH_DAEMONS`` directly) where ``prte_dvm_ready`` is never cleared, and
   to ensure full correctness when both paths can interleave.

Step 1 — New state constant
---------------------------

In ``src/mca/plm/plm_types.h``, add:

.. code-block:: c

   /* value 17 is currently unused in the running-state band */
   #define PRTE_JOB_STATE_WAITING_FOR_DAEMONS  17

Add a corresponding string to ``src/util/error_strings.c``.

This state is used purely as a marker so that debugging tools and verbose
output show clearly why a job is parked; no callback is registered for it.

Step 2 — New global fence and held-job arrays
---------------------------------------------

In ``src/runtime/prte_globals.c`` and ``src/runtime/prte_globals.h``, add:

.. code-block:: c

   /* counts in-progress daemon launch campaigns */
   int prte_dvm_launch_fence = 0;

   /* jobs parked at the VM_READY → MAP boundary */
   pmix_pointer_array_t *prte_held_jobs;

   /* jobs parked at the LAUNCH_APPS boundary during a shrink */
   pmix_pointer_array_t *prte_prelaunch_held_jobs;

Initialize both arrays in ``src/runtime/prte_init.c`` alongside the existing
``prte_cache`` initialization:

.. code-block:: c

   prte_held_jobs = PMIX_NEW(pmix_pointer_array_t);
   pmix_pointer_array_init(prte_held_jobs, 1, INT_MAX, 1);

   prte_prelaunch_held_jobs = PMIX_NEW(pmix_pointer_array_t);
   pmix_pointer_array_init(prte_prelaunch_held_jobs, 1, INT_MAX, 1);

Destruct both in ``src/runtime/prte_finalize.c``.

The grow and shrink campaign lists (``prte_grow_campaigns`` and
``prte_shrink_campaigns``) that drive the fence are declared, constructed, and
destructed alongside these globals; their types and lifecycles are specified in
the two child plans.

Step 3 — Park jobs at the VM_READY → MAP boundary
--------------------------------------------------

This is the first of the two hold points and is shared by both paths: it stops
*any* in-progress campaign (grow or shrink) from letting a freshly-arriving job
map onto a node whose daemon is not ready.

In ``vm_ready()``, the code at line 360 is reached only by app jobs (the
daemon-job branch returns at line 357).  This is immediately before
``prte_filem.preposition_files()`` which leads to ``files_ready → MAP``.
Add the hold check here:

.. code-block:: c

   /* position any required files */
   if (0 < prte_dvm_launch_fence) {
       /* daemon launch in progress — park this job */
       caddy->jdata->state = PRTE_JOB_STATE_WAITING_FOR_DAEMONS;
       PMIX_RETAIN(caddy->jdata);
       pmix_pointer_array_add(prte_held_jobs, caddy->jdata);
       PMIX_RELEASE(caddy);
       return;
   }
   if (PRTE_SUCCESS !=
           prte_filem.preposition_files(caddy->jdata, files_ready, caddy->jdata)) {
       PRTE_ACTIVATE_JOB_STATE(caddy->jdata, PRTE_JOB_STATE_FILES_POSN_FAILED);
   }
   PMIX_RELEASE(caddy);

The second hold point — at ``LAUNCH_APPS``, guarded by the shrink campaign list
rather than the fence counter — is shrink-specific and is described in
:ref:`dvm-shrink-campaign-label`.

Step 4 — Held-job release helpers
---------------------------------

There are two distinct ways a held job leaves its parked state, and they are
**not** symmetric, so they are handled by two separate helpers rather than a
single ``bool success`` flag:

* **Global success.**  When the global fence reaches zero — every grow and
  shrink campaign has completed *successfully* — both classes of held job are
  admitted.  This is ``prte_plm_base_fence_release()``.

* **Grow failure.**  When a grow campaign fails (see
  :ref:`dvm-grow-campaign-label`, *Failure drain and rollback*), the spec
  requires the whole **pre-map** held-job set to be aborted — the first-failure
  semantics of a non-elastic launch.  But a grow failure must **not** touch the
  **pre-launch** held jobs: those are parked solely on account of an in-progress
  shrink (the ``LAUNCH_APPS`` hold is gated on the shrink list, not the fence),
  so they do not wait on the grow, and the spec's conformance guarantee #4
  states that a daemon failure may affect only the jobs waiting on the campaign
  it belongs to.  This asymmetric abort is ``prte_plm_base_abort_premap_held()``.

Folding both paths into one ``fence_release(bool success)`` was the original
shape of this plan, but it could not honor the spec when a grow failed while a
shrink was still in progress.  A single global ``success`` flag cannot express
"abort the grow's waiters but leave the shrink's waiters parked," and gating the
failure abort on the fence reaching zero would let a *later* shrink-success
release **admit** a pre-map job whose grow dependency had already failed (the
last campaign to drain — the shrink — would call the release with
``success == true``).  Splitting the two release paths closes that gap: the
grow-failure abort fires immediately on the pre-map array regardless of the
fence value, and the success release is reached only when no campaign has
failed.

Both helpers are declared in ``src/mca/plm/base/plm_private.h`` (the header the
errmgr, ras, and state callers already include for the existing
``prte_plm_base_*`` launch helpers) and defined in
``plm_base_launch_support.c``:

.. code-block:: c

   /* SUCCESS release — invoked only when the global fence reaches zero, i.e.
    * every grow and shrink campaign has completed successfully.  Admits both
    * classes of held job and defensively sweeps any residual campaigns. */
   void prte_plm_base_fence_release(void)
   {
       int _hi;
       prte_job_t *_held;

       /* --- pre-map held jobs (parked at VM_READY) --- */
       for (_hi = 0; _hi < prte_held_jobs->size; _hi++) {
           _held = (prte_job_t *)
               pmix_pointer_array_get_item(prte_held_jobs, _hi);
           if (NULL == _held) {
               continue;
           }
           pmix_pointer_array_set_item(prte_held_jobs, _hi, NULL);
           PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_VM_READY);
           PMIX_RELEASE(_held);
       }

       /* --- pre-launch held jobs (parked at LAUNCH_APPS) --- */
       for (_hi = 0; _hi < prte_prelaunch_held_jobs->size; _hi++) {
           _held = (prte_job_t *)
               pmix_pointer_array_get_item(prte_prelaunch_held_jobs, _hi);
           if (NULL == _held) {
               continue;
           }
           pmix_pointer_array_set_item(prte_prelaunch_held_jobs, _hi, NULL);
           if (prte_plm_base_job_needs_remap(_held)) {
               prte_plm_base_reset_proc_map(_held);
               PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_MAP);
           } else {
               PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_LAUNCH_APPS);
           }
           PMIX_RELEASE(_held);
       }

       /* Campaigns are removed individually as their last target drains, so
        * both lists should be empty here.  Sweep each defensively anyway — so
        * a future change that can leave a residual campaign behind cannot wedge
        * the fence — and sweep *both* kinds for symmetry, not just shrink. */
       {
           prte_shrink_campaign_t *_sc, *_sn;
           PMIX_LIST_FOREACH_SAFE(_sc, _sn,
                                  &prte_shrink_campaigns, prte_shrink_campaign_t) {
               pmix_list_remove_item(&prte_shrink_campaigns, &_sc->super);
               PMIX_RELEASE(_sc);
           }
       }
       {
           prte_grow_campaign_t *_gc, *_gn;
           PMIX_LIST_FOREACH_SAFE(_gc, _gn,
                                  &prte_grow_campaigns, prte_grow_campaign_t) {
               pmix_list_remove_item(&prte_grow_campaigns, &_gc->super);
               PMIX_RELEASE(_gc);
           }
       }
   }

   /* GROW-FAILURE abort — fails every job parked at the VM_READY -> MAP
    * boundary to NEVER_LAUNCHED.  Called from the grow failure drain only, and
    * independent of the fence value, so a grow failure aborts its pre-map
    * waiters even while a concurrent shrink keeps the fence nonzero.  It
    * deliberately leaves prte_prelaunch_held_jobs untouched: those jobs wait
    * only on a shrink, never on the grow (conformance #4). */
   void prte_plm_base_abort_premap_held(void)
   {
       int _hi;
       prte_job_t *_held;

       for (_hi = 0; _hi < prte_held_jobs->size; _hi++) {
           _held = (prte_job_t *)
               pmix_pointer_array_get_item(prte_held_jobs, _hi);
           if (NULL == _held) {
               continue;
           }
           pmix_pointer_array_set_item(prte_held_jobs, _hi, NULL);
           PRTE_ACTIVATE_JOB_STATE(_held, PRTE_JOB_STATE_NEVER_LAUNCHED);
           PMIX_RELEASE(_held);
       }
   }

The pre-launch branch of ``fence_release()`` calls two shrink-specific
helpers — ``prte_plm_base_job_needs_remap()`` (does any held proc sit on a
departing daemon?) and ``prte_plm_base_reset_proc_map()`` (un-claim the previous
mapping so the job can be remapped onto survivors) — specified in
:ref:`dvm-shrink-campaign-label`.  Because shrink completion treats a clean exit
and a crash identically (a targeted daemon's departure is always a success for
its campaign), neither held-job array is ever failed on the shrink path; the
only failure disposition of a held job is the grow-failure abort above.

``prte_plm_base_fence_release()`` acts when the **global** fence reaches zero,
which requires *all* grow and shrink campaigns to have completed.  The
per-campaign **completion event** (Step 5) is distinct: it fires for an
individual request's campaign when that campaign drains, independent of whether
other campaigns are still in flight.

Step 5 — Completion-event emission (shared helper)
--------------------------------------------------

The spec's two-phase contract (see :ref:`elastic-dvm-spec-label`,
*Asynchronous size-change completion*) requires that, when an accepted DVM
operation finishes, the runtime deliver a directed event to the process that
requested the size change: ``PMIX_DVM_IS_READY`` on success or
``PMIX_ERR_DVM_MOD`` (carrying the underlying cause) on failure.

Both campaign objects therefore record the requester so the event can be
directed once the campaign drains:

.. code-block:: c

   pmix_proc_t  requester;       /* who requested the size change */
   char        *alloc_id;        /* PMIX_ALLOC_ID of the affected allocation */
   char        *req_id;          /* requester's PMIX_ALLOC_REQ_ID, or NULL */
   bool         have_requester;  /* false for a scheduler push */

These fields are populated where the campaign is created, from the allocation
request that drove the operation:

* **Shrink** — directly in the ``PMIX_ALLOC_RELEASE`` handler, from the request
  object: ``requester`` is the request's ``tproc``, and ``alloc_id`` / ``req_id``
  are read from its ``PMIX_ALLOC_ID`` / ``PMIX_ALLOC_REQ_ID`` info keys.
* **Grow** — in ``setup_virtual_machine()``, *indirectly through the session*.
  The RAS reservation machinery already records the driving request on the
  session and back-points every reserved node at it
  (``add_nodes_to_session()`` sets ``node->session``; the session carries
  ``requestor``, ``alloc_refid``, and ``user_refid``).  The grow campaign reads
  those from the first new daemon's ``node->session``, so the originating
  request need not be threaded explicitly into the launch path.

A size change initiated with no PMIx requester (a scheduler push, or the initial
DVM bring-up, where the session is the default one or its ``requestor`` rank is
``PMIX_RANK_INVALID``) leaves ``have_requester`` false and emits no event.

A single shared helper, declared in ``src/mca/plm/base/plm_private.h`` and
defined in ``plm_base_launch_support.c``, performs the emission.  Its prototype
takes a
``bool success`` rather than an event code, so that **the two new status codes
are named only inside the helper body** — call sites pass a bool and a cause and
never reference ``PMIX_DVM_IS_READY`` / ``PMIX_ERR_DVM_MOD`` themselves:

.. code-block:: c

   /* success => emit PMIX_DVM_IS_READY; otherwise emit PMIX_ERR_DVM_MOD
    * carrying `cause` (the underlying failure pmix_status_t) */
   void prte_plm_base_dvm_mod_notify(const pmix_proc_t *requester,
                                     const char *alloc_id,
                                     const char *req_id,
                                     bool success,
                                     pmix_status_t cause);

It packs ``PMIX_ALLOC_ID`` (always), ``PMIX_ALLOC_REQ_ID`` (when ``req_id`` is
non-NULL), and — on failure — the underlying ``cause`` status (carried under
``PMIX_JOB_TERM_STATUS``, the standard ``pmix_status_t``-typed info key; the PMIx
contract for ``PMIX_ERR_DVM_MOD`` asks only for "any available information
describing the cause"), then delivers the event **only** to ``requester`` as a
directed, custom-range notification — the same ``PMIX_RANGE_CUSTOM`` mechanism
used for ``PMIX_ALLOC_TIMEOUT_WARNING``.

The grow and shrink plans call this helper at their respective drain points: the
grow path inside ``prte_plm_base_grow_drain()`` (reached on success from
``vm_ready`` after the WIREUP xcast, and on failure from
``prte_plm_base_grow_target_failed()`` and the ``check_job_complete`` safety
net); the shrink path when a campaign's last target departs (success, in the
errmgr) and on the xcast-failure cleanup at campaign creation (failure).

``PMIX_DVM_IS_READY`` and ``PMIX_ERR_DVM_MOD`` are plain ``#define``\ d
``pmix_status_t`` values (PMIx status codes are preprocessor macros, not enum
constants), so their availability is decided entirely by whether the installed
PMIx headers define the symbols — **no PMIx capability flag is involved** (the
``PRTE_CHECK_PMIX_CAP`` machinery is for ``PMIX_CAP_*`` behavioral flags and
does not apply here).

To keep the project's ``#if FOO`` discipline — so a mistyped guard is a compile
error rather than a silently-false ``#ifdef`` — a probe in
``config/prte_setup_pmix.m4`` defines ``PRTE_HAVE_DVM_MOD_EVENTS`` to ``0`` or
``1`` from the presence of the two symbols:

.. code-block:: none

   AC_MSG_CHECKING([for PMIx DVM modification event codes])
   AC_PREPROC_IFELSE(
       [AC_LANG_PROGRAM([[#include <pmix.h>
   #if !defined(PMIX_DVM_IS_READY) || !defined(PMIX_ERR_DVM_MOD)
   #error DVM modification event codes not present
   #endif
   ]], [[]])],
       [AC_MSG_RESULT([yes])
        AC_DEFINE([PRTE_HAVE_DVM_MOD_EVENTS], [1],
                  [PMIx defines the DVM modification event codes])],
       [AC_MSG_RESULT([no])
        AC_DEFINE([PRTE_HAVE_DVM_MOD_EVENTS], [0],
                  [PMIx defines the DVM modification event codes])])

The macro is defined to ``0`` or ``1`` on both branches (never ``#undef``\ ed),
so it can be tested with ``#if PRTE_HAVE_DVM_MOD_EVENTS``.  Because the helper's
``bool``-based prototype keeps the two codes out of every call site, **only the
helper body needs the guard**: when ``PRTE_HAVE_DVM_MOD_EVENTS`` is ``0`` the
body compiles to a no-op (the ``bool``/``pmix_status_t`` prototype still
compiles), the call sites are unchanged, and no completion event is delivered —
exactly as the spec's backward-compatibility clause requires.  Because this
touches ``*.m4``, ``./autogen.pl`` must be re-run before configuring.

Summary of Files Changed (Shared Fence Infrastructure)
-------------------------------------------------------

.. list-table::
   :widths: 50 50
   :header-rows: 1

   * - File
     - Change
   * - ``src/mca/plm/plm_types.h``
     - Add ``PRTE_JOB_STATE_WAITING_FOR_DAEMONS = 17``.
   * - ``src/util/error_strings.c``
     - Add string for ``PRTE_JOB_STATE_WAITING_FOR_DAEMONS``.
   * - ``src/runtime/prte_globals.h``
     - Declare ``prte_dvm_launch_fence``, ``prte_held_jobs``, and
       ``prte_prelaunch_held_jobs``.
   * - ``src/runtime/prte_globals.c``
     - Define and initialize ``prte_dvm_launch_fence = 0``.
   * - ``src/runtime/prte_init.c``
     - Allocate and init ``prte_held_jobs`` and ``prte_prelaunch_held_jobs``.
   * - ``src/runtime/prte_finalize.c``
     - Destruct ``prte_held_jobs`` and ``prte_prelaunch_held_jobs``.
   * - ``src/mca/plm/base/plm_base_launch_support.c``
     - Define ``prte_plm_base_fence_release()`` and
       ``prte_plm_base_abort_premap_held()`` (Step 4) and
       ``prte_plm_base_dvm_mod_notify()`` (Step 5).
   * - ``src/mca/plm/base/plm_private.h``
     - Declare ``prte_plm_base_fence_release()``,
       ``prte_plm_base_abort_premap_held()``, and
       ``prte_plm_base_dvm_mod_notify()`` (and, for the grow path,
       ``prte_plm_base_grow_drain()`` / ``prte_plm_base_grow_target_failed()`` —
       see :ref:`dvm-grow-campaign-label`).  This is the header the errmgr, ras,
       and state callers already include.
   * - ``src/mca/state/dvm/state_dvm.c``
     - In ``vm_ready``: add the ``VM_READY → MAP`` hold-check before
       ``preposition_files`` (Step 3).
   * - ``config/prte_setup_pmix.m4``
     - Add the ``AC_PREPROC_IFELSE`` probe that defines
       ``PRTE_HAVE_DVM_MOD_EVENTS`` (``0``/``1``) from the presence of
       ``PMIX_DVM_IS_READY`` / ``PMIX_ERR_DVM_MOD`` (Step 5).  Re-run
       ``autogen.pl`` afterward.

For the grow path's file changes see the "Touched files" table in
:ref:`dvm-grow-campaign-label`; for the shrink path's, the "Touched files"
table in :ref:`dvm-shrink-campaign-label`.

Design Invariants
-----------------

**Shared fence**

* The fence is a single ``int`` accessed only on the progress thread, so all
  increments, decrements, and the zero test are race-free without locking.
* A job is parked iff the fence is nonzero at the ``VM_READY → MAP`` boundary
  (Step 3); the ``LAUNCH_APPS`` hold (shrink plan) is gated on the shrink
  campaign list, not the fence, so a concurrent grow does not stall an
  already-mapped job on surviving nodes.
* ``prte_plm_base_fence_release()`` is the **success-only** release; it is
  called only when the fence reaches zero, which requires *all* grow and shrink
  campaigns to have completed successfully.  The campaign lists are therefore
  empty (or nearly so — ``fence_release`` does a defensive sweep of **both** the
  grow and shrink lists for the degenerate case where some future change leaves
  a partially-setup campaign behind).
* A grow failure is the only path that fails a held job.  It calls
  ``prte_plm_base_abort_premap_held()``, which aborts the pre-map held jobs
  (``prte_held_jobs`` → ``NEVER_LAUNCHED``) immediately and independently of the
  fence, and never touches ``prte_prelaunch_held_jobs``: those jobs wait only on
  a shrink, so per conformance #4 a grow failure must leave them parked until the
  shrink completes.
* The per-campaign completion event (Step 5) is independent of the global fence:
  it fires when an individual request's campaign drains, even if other campaigns
  keep the fence nonzero.

**Grow fence** — see the "Why this is correct" and "Design" sections of
:ref:`dvm-grow-campaign-label`.

**Shrink fence** — see the "Design Invariants" section of
:ref:`dvm-shrink-campaign-label`.