.. _state-machine-label: Job Launch State Machine ======================== PRRTE drives the full lifecycle of a job — from daemon launch through application launch and termination — through an explicit, event-driven state machine. Every transition is represented as an event posted to the PRRTE progress thread; the callback for each state runs single-threaded, performs its work, and posts the next event when done. Nothing blocks the calling thread and there are no race conditions between state handlers. There are two cooperating state machines: one for **jobs** (tracking the lifecycle of an entire job or the DVM itself) and one for **processes** (tracking each individual application process). Architecture ------------ The state machine is implemented in ``src/mca/state/``. The **DVM module** (``src/mca/state/dvm/state_dvm.c``) is used when ``prte`` runs as a persistent Distributed Virtual Machine; it owns the authoritative ordered table of states and callbacks. The **prted module** (``src/mca/state/prted/state_prted.c``) runs inside each daemon and handles only the small set of states relevant to a daemon's local work. The state machine is a linked list (``prte_job_states``) of ``(state, callback)`` pairs. The macro ``PRTE_ACTIVATE_JOB_STATE(jdata, state)`` packages the job object and the target state into a *caddy* and posts it to the event loop. The matching callback is looked up and invoked asynchronously. Job State Definitions --------------------- All job-state constants are defined in ``src/mca/plm/plm_types.h`` (lines 116–194). The states relevant to daemon launch, in numeric order, are: .. list-table:: :widths: 35 10 55 :header-rows: 1 * - Name - Value - Meaning * - ``PRTE_JOB_STATE_INIT`` - 1 - Job record created; ready to receive a job ID. * - ``PRTE_JOB_STATE_INIT_COMPLETE`` - 2 - Job ID assigned; initial setup done. * - ``PRTE_JOB_STATE_ALLOCATE`` - 3 - Ready to request resources from the scheduler/RAS. * - ``PRTE_JOB_STATE_ALLOCATION_COMPLETE`` - 4 - Resource allocation finished. * - ``PRTE_JOB_STATE_LAUNCH_DAEMONS`` - 8 - Ready to spawn ``prted`` processes. *Not* in the DVM default table; registered by each PLM component at startup. * - ``PRTE_JOB_STATE_DAEMONS_LAUNCHED`` - 9 - The PLM has initiated daemon spawning; waiting for daemons to call home. * - ``PRTE_JOB_STATE_DAEMONS_REPORTED`` - 10 - All expected daemons have connected and sent their contact information. * - ``PRTE_JOB_STATE_VM_READY`` - 11 - The DVM is fully operational; node map and wireup info have been broadcast to all daemons. * - ``PRTE_JOB_STATE_MAP`` - 5 - Ready to map processes to nodes. * - ``PRTE_JOB_STATE_MAP_COMPLETE`` - 6 - Process mapping finished. * - ``PRTE_JOB_STATE_SYSTEM_PREP`` - 7 - Final sanity checks and environment setup before launch. * - ``PRTE_JOB_STATE_LAUNCH_APPS`` - 12 - Ready to send launch directives to daemons. * - ``PRTE_JOB_STATE_SEND_LAUNCH_MSG`` - 13 - Launch message being assembled and sent. * - ``PRTE_JOB_STATE_STARTED`` - 20 - At least one application process has been forked. * - ``PRTE_JOB_STATE_LOCAL_LAUNCH_COMPLETE`` - 18 - All local processes on a daemon have attempted to launch. * - ``PRTE_JOB_STATE_READY_FOR_DEBUG`` - 19 - All local processes report ready for a debugger attach. * - ``PRTE_JOB_STATE_RUNNING`` - 14 - All processes across all daemons have been forked. * - ``PRTE_JOB_STATE_REGISTERED`` - 16 - All processes have registered with the PMIx server (called ``PMIx_Init``). Termination states (values ≥ 30) and error states (values ≥ 51) are described at the bottom of this page. The Daemon Launch Sequence -------------------------- The DVM module registers the following ordered table at startup (``src/mca/state/dvm/state_dvm.c``, ``launch_states[]`` / ``launch_callbacks[]``): .. code-block:: text State Callback ───────────────────────────────────────────────────────────────────── PRTE_JOB_STATE_INIT prte_plm_base_setup_job PRTE_JOB_STATE_INIT_COMPLETE init_complete (dvm-local) PRTE_JOB_STATE_ALLOCATE prte_ras_base_allocate PRTE_JOB_STATE_ALLOCATION_COMPLETE prte_plm_base_allocation_complete PRTE_JOB_STATE_DAEMONS_LAUNCHED prte_plm_base_daemons_launched PRTE_JOB_STATE_DAEMONS_REPORTED prte_plm_base_daemons_reported PRTE_JOB_STATE_VM_READY vm_ready (dvm-local) PRTE_JOB_STATE_MAP prte_rmaps_base_map_job PRTE_JOB_STATE_MAP_COMPLETE prte_plm_base_mapping_complete PRTE_JOB_STATE_SYSTEM_PREP prte_plm_base_complete_setup PRTE_JOB_STATE_LAUNCH_APPS prte_plm_base_launch_apps PRTE_JOB_STATE_SEND_LAUNCH_MSG prte_plm_base_send_launch_msg PRTE_JOB_STATE_STARTED job_started (dvm-local) PRTE_JOB_STATE_LOCAL_LAUNCH_COMPLETE prte_state_base_local_launch_complete PRTE_JOB_STATE_READY_FOR_DEBUG ready_for_debug (dvm-local) PRTE_JOB_STATE_RUNNING prte_plm_base_post_launch PRTE_JOB_STATE_REGISTERED prte_plm_base_registered PRTE_JOB_STATE_TERMINATED check_complete (dvm-local) PRTE_JOB_STATE_NOTIFY_COMPLETED dvm_notify (dvm-local) PRTE_JOB_STATE_NOTIFIED cleanup_job (dvm-local) PRTE_JOB_STATE_ALL_JOBS_COMPLETE prte_quit (plus DAEMONS_TERMINATED → prte_quit and FORCED_EXIT → force_quit, registered separately) Note that ``PRTE_JOB_STATE_LAUNCH_DAEMONS`` is **not** in this table. Each Process Launch Manager (PLM) component—ssh, slurm, pals, lsf—inserts its own ``launch_daemons`` callback for that state during its own ``init``. Step-by-step walk-through ~~~~~~~~~~~~~~~~~~~~~~~~~ **1. INIT → prte_plm_base_setup_job** The job record is validated and initial app-context setup is performed. On success the callback posts ``INIT_COMPLETE``. **2. INIT_COMPLETE → init_complete** The DVM-local ``init_complete`` immediately posts ``ALLOCATE`` so that a potential DVM expansion can go through the allocation step. **3. ALLOCATE → prte_ras_base_allocate** The Resource Allocation Subsystem (RAS) queries the scheduler or hostfile for available nodes and records them in the node pool. On completion it posts ``ALLOCATION_COMPLETE``. **4. ALLOCATION_COMPLETE → prte_plm_base_allocation_complete** Decision point (``src/mca/plm/base/plm_base_launch_support.c``:186): * If ``PRTE_JOB_DO_NOT_LAUNCH`` is set (e.g., ``--map-by :display``), skip daemon spawning entirely and jump straight to ``DAEMONS_REPORTED``. * Otherwise, post ``LAUNCH_DAEMONS``. **5. LAUNCH_DAEMONS → ** This state is handled by the active PLM component, not by the DVM module. The ssh PLM's handler (``src/mca/plm/ssh/plm_ssh_module.c``:1077) is representative: a. Calls ``prte_plm_base_setup_virtual_machine()`` to compute which nodes need new daemons (nodes already hosting a daemon from a prior job are reused). b. If no new daemons are needed (``map->num_new_daemons == 0``), fast-paths to ``DAEMONS_REPORTED``. c. Otherwise, builds the ``prted`` command line and spawns one daemon per node via ssh (or pdsh, or the equivalent for slurm/pals/lsf). d. Registers ``prte_plm_base_daemon_callback`` on ``PRTE_RML_TAG_DAEMON_REPORT`` to hear from daemons as they start. e. Posts ``DAEMONS_LAUNCHED`` to indicate spawning has been initiated. **6. DAEMONS_LAUNCHED → prte_plm_base_daemons_launched** This callback is intentionally a no-op (``src/mca/plm/base/plm_base_launch_support.c``:218). The state machine parks here and waits for daemons to call home asynchronously. **7. Daemons call home (asynchronous)** As each ``prted`` process starts up it: a. Initializes via its ESS (Environment-Specific Services) component. b. Connects to the HNP (Head Node Process) via the RML. c. Sends a report containing its process name, RML contact URI, node name, and hwloc topology to the HNP on ``PRTE_RML_TAG_DAEMON_REPORT``. The HNP receives these reports in ``prte_plm_base_daemon_callback`` (``src/mca/plm/base/plm_base_launch_support.c``:1237). For each arriving daemon it: * Records the daemon's contact URI (stored via ``PMIx_Store_internal`` as ``PMIX_PROC_URI``). * Records the node name and hwloc topology. * Marks the node ``PRTE_NODE_STATE_UP``. * Increments ``jdatorted->num_reported``. * Calls ``progress_daemons()`` (line 1173), which fires ``DAEMONS_REPORTED`` once ``num_reported == num_procs``. **8. DAEMONS_REPORTED → prte_plm_base_daemons_reported** (``src/mca/plm/base/plm_base_launch_support.c``:118) * If using an unmanaged allocation (e.g., a hostfile), sets the default slot count on each node according to ``--set-slots`` (cores, sockets, hwthreads, or a literal number). * Totals up ``jdata->total_slots_alloc``. * Posts ``VM_READY``. At this point every daemon is up and the HNP knows how to reach each of them. **9. VM_READY → vm_ready** (``src/mca/state/dvm/state_dvm.c``:261) If new daemons were actually launched (``PRTE_JOB_LAUNCHED_DAEMONS`` is set) and more than one daemon is running: * Serializes the node map via ``prte_util_nidmap_create()`` into a buffer. * Looks up each daemon's ``PMIX_PROC_URI`` and packs it into the same buffer. * Broadcasts the combined nidmap + wireup buffer to all daemons via ``prte_grpcomm.xcast(PRTE_RML_TAG_WIREUP, &buf)``. After the broadcast: * Sets ``prte_dvm_ready = true``. * If running as a persistent DVM (``prte`` without an immediate job), prints ``"DVM ready\n"`` to stdout or writes a ``'K'`` byte on the parent pipe so the caller knows the DVM is accepting work. * Dispatches any jobs that arrived and were cached while the DVM was starting (``prte_cache``). **The DVM is now fully operational.** For a standalone ``prterun`` invocation the state machine continues immediately into the app-launch phase below. Application Launch (after the DVM is ready) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Once the DVM is ready, each new application job that arrives at the HNP (via ``PRTE_PLM_LAUNCH_JOB_CMD``) goes through a fast-path re-entry into the state machine (``plm_base_receive.c``:470). If ``prte_dvm_ready`` is not yet true (initial DVM startup still in progress), the job is stashed in ``prte_cache`` and flushed when ``vm_ready`` fires for the daemon job. Otherwise the job enters the state machine immediately via ``prte_plm.spawn(jdata)``. A DVM can run many application jobs concurrently. Each follows the same state machine independently. **10. MAP → prte_rmaps_base_map_job** (``src/mca/rmaps/base/``) The RMAPS framework assigns each application process to a specific node and slot. The mapping policy (``--map-by slot``, ``--map-by node``, ``--map-by core``, ``--map-by ppr:N:L``, etc.) determines how processes are distributed. Key actions: * Iterates over the node pool for the job's session. * For each app context, calls the selected RMAPS component (e.g., ``rmaps_round_robin``, ``rmaps_ppr``, ``rmaps_rank_file``). * Each component calls ``prte_rmaps_base_claim_slot()`` to assign a process to a node; this creates a ``prte_proc_t`` entry and links it to the node. * Sets ``jdata->num_procs``. * If ``--rank-by`` or ``--bind-to`` were specified, records those policies in the map for use during launch. On completion, fires ``MAP_COMPLETE``. **11. MAP_COMPLETE → prte_plm_base_mapping_complete** (``plm_base_launch_support.c``:276) Posts ``SYSTEM_PREP``. **12. SYSTEM_PREP → prte_plm_base_complete_setup** (``plm_base_launch_support.c``) Performs pre-launch sanity checks and environment preparation: * Validates that there are enough slots for the requested process count. * Constructs the environment for each app context (inheriting the HNP environment, applying ``-x VAR``, ``--env-merge``, and PMIx-standard keys). * Calls ``prte_filem.preposition_files()`` to stage any required input files to the compute nodes. The ``files_ready`` callback fires on completion; on success it activates ``MAP`` — **wait, this is actually activated from** ``vm_ready`` **for the app-job path; see below**. .. note:: ``SYSTEM_PREP``'s callback ``prte_plm_base_complete_setup`` does the environment/slot validation and then fires ``LAUNCH_APPS``. File staging happens earlier, inside ``vm_ready``, before MAP is activated. The call chain is: ``vm_ready`` → ``preposition_files`` → ``files_ready`` → ``MAP`` → ... → ``SYSTEM_PREP`` → ``LAUNCH_APPS``. **13. LAUNCH_APPS → prte_plm_base_launch_apps** (``plm_base_launch_support.c``) Prepares the per-daemon launch data and posts ``SEND_LAUNCH_MSG``. **14. SEND_LAUNCH_MSG → prte_plm_base_send_launch_msg** (``plm_base_launch_support.c``) Builds and sends an ODLS (On-node Daemon Launch Subsystem) launch message to each daemon that has local processes for this job. The message contains: * The job's namespace and process list. * Per-process slot list (cpuset, binding directives). * Application argv and environment. * IOF (I/O Forwarding) channel setup — which file descriptors to forward for each process. * Any PMIx server info that the processes will need at init time. Each daemon receives the message via ``PRTE_RML_TAG_LAUNCH_APPS`` and passes it to its ODLS component. The ODLS ``launch_local_procs()`` entry point iterates over the local process list and ``fork``/``exec``'s each one. After the exec, the child process calls ``PMIx_Init`` which connects it to the daemon's embedded PMIx server. **15. STARTED → job_started** Fires once the first process has been forked on any daemon (triggered by ``PRTE_PLM_LOCAL_LAUNCH_COMP_CMD`` receipt at the HNP—see step 16). Notifies the originating tool via a PMIx ``PMIX_EVENT_JOB_START`` event. **16. LOCAL_LAUNCH_COMPLETE** Each daemon sends ``PRTE_PLM_LOCAL_LAUNCH_COMP_CMD`` back to the HNP when all of its local processes have attempted to start, carrying each process's PID and state. The HNP handler (``plm_base_receive.c``:715) accumulates ``jdata->num_launched``; when the first process is counted it posts ``STARTED``; when all processes are counted it posts ``RUNNING``. **17. READY_FOR_DEBUG → ready_for_debug** Optional. If the job was submitted with ``--stop-on-exec``, ``--stop-in-init``, or ``--stop-in-app``, each daemon waits until all its local processes signal readiness and then sends ``PRTE_PLM_READY_FOR_DEBUG_CMD`` to the HNP. When the HNP has heard from all daemons it fires a ``PMIX_READY_FOR_DEBUG`` PMIx event to the originating tool. **18. RUNNING → prte_plm_base_post_launch** All processes across the entire job are running. Post-launch cleanup: timeout timers, progress callbacks, and similar housekeeping. **19. REGISTERED → prte_plm_base_registered** All application processes have called ``PMIx_Init`` and registered with their local PMIx server. Each daemon accumulates its local count and sends ``PRTE_PLM_REGISTERED_CMD`` to the HNP when all of its local processes have registered. The HNP handler (``plm_base_receive.c``:675) increments ``jdata->num_reported``; when the count reaches ``jdata->num_procs`` it fires this state. Process State Machine --------------------- The process state machine tracks individual application processes. It runs on both the HNP (via the DVM module) and each daemon (via the prted module), with the same set of states and a single callback ``prte_state_base_track_procs`` / ``track_procs``. .. list-table:: :widths: 40 10 50 :header-rows: 1 * - Name - Value - Meaning * - ``PRTE_PROC_STATE_INIT`` - 1 - Process entry created by RMAPS. * - ``PRTE_PROC_STATE_RUNNING`` - 4 - Daemon has forked the process. * - ``PRTE_PROC_STATE_REGISTERED`` - 5 - Process called ``PMIx_Init``. * - ``PRTE_PROC_STATE_IOF_COMPLETE`` - 6 - All I/O forwarding pipes have closed. * - ``PRTE_PROC_STATE_WAITPID_FIRED`` - 7 - ``waitpid`` detected the process has exited. * - ``PRTE_PROC_STATE_READY_FOR_DEBUG`` - 9 - Process is stopped and awaiting a debugger. * - ``PRTE_PROC_STATE_TERMINATED`` - 20 - Process is fully cleaned up. A process is considered still running if its state is less than ``PRTE_PROC_STATE_UNTERMINATED`` (15). States ≥ ``PRTE_PROC_STATE_ERROR`` (50) indicate abnormal exit. On the daemon side (``src/mca/state/prted/state_prted.c``:314, ``track_procs``): * ``RUNNING``: increments ``jdata->num_launched``; when all local procs are running, fires ``PRTE_JOB_STATE_LOCAL_LAUNCH_COMPLETE`` which sends ``PRTE_PLM_LOCAL_LAUNCH_COMP_CMD`` to the HNP. * ``REGISTERED``: increments ``jdata->num_reported``; when all local procs have registered, sends ``PRTE_PLM_REGISTERED_CMD`` to the HNP. * ``IOF_COMPLETE`` / ``WAITPID_FIRED``: when both flags are set for a process, marks it ``TERMINATED`` and triggers job-completion accounting. Termination and Error States ---------------------------- **Boundary markers** (job states): * ``PRTE_JOB_STATE_UNTERMINATED`` (30): any state below this means the job is still running. * ``PRTE_JOB_STATE_ERROR`` (50): any state at or above this is an error. **Normal termination sequence**: ``TERMINATED`` → ``NOTIFY_COMPLETED`` → ``NOTIFIED`` → ``ALL_JOBS_COMPLETE`` → ``prte_quit`` **Selected error states**: .. list-table:: :widths: 50 10 :header-rows: 1 * - Name - Value * - ``PRTE_JOB_STATE_KILLED_BY_CMD`` - 51 * - ``PRTE_JOB_STATE_ABORTED`` - 52 * - ``PRTE_JOB_STATE_FAILED_TO_START`` - 53 * - ``PRTE_JOB_STATE_NEVER_LAUNCHED`` - 60 * - ``PRTE_JOB_STATE_ALLOC_FAILED`` - 68 * - ``PRTE_JOB_STATE_MAP_FAILED`` - 69 * - ``PRTE_JOB_STATE_CANNOT_LAUNCH`` - 70 * - ``PRTE_JOB_STATE_FORCED_EXIT`` - 64 All error states ultimately route to ``force_quit`` or ``prte_quit`` which calls ``prte_plm.terminate_orteds()`` before exiting. Key Source Files ---------------- .. list-table:: :widths: 45 55 :header-rows: 1 * - File - Role * - ``src/mca/plm/plm_types.h`` - All state constant definitions. * - ``src/mca/state/dvm/state_dvm.c`` - DVM job and proc state tables; ``vm_ready``, ``init_complete``, ``check_complete``, ``dvm_notify``, ``cleanup_job``. * - ``src/mca/state/prted/state_prted.c`` - Per-daemon job and proc state tables; ``track_procs``, ``track_jobs``. * - ``src/mca/state/base/state_base_fns.c`` - ``prte_state_base_activate_job_state`` — the core dispatch function. * - ``src/mca/plm/base/plm_base_launch_support.c`` - Most PLM base callbacks: ``prte_plm_base_setup_job``, ``prte_plm_base_allocation_complete``, ``prte_plm_base_daemons_launched``, ``prte_plm_base_daemons_reported``, ``progress_daemons``, ``prte_plm_base_daemon_callback``. * - ``src/mca/plm/base/plm_base_receive.c`` - HNP message handler: processes ``PRTE_PLM_LOCAL_LAUNCH_COMP_CMD`` and ``PRTE_PLM_REGISTERED_CMD`` from daemons. * - ``src/mca/plm/ssh/plm_ssh_module.c`` - SSH PLM ``launch_daemons`` callback (line 1077). * - ``src/mca/plm/slurm/plm_slurm_module.c`` - SLURM PLM ``launch_daemons`` callback. * - ``src/mca/plm/pals/plm_pals_module.c`` - PALS PLM ``launch_daemons`` callback. * - ``src/mca/plm/lsf/plm_lsf_module.c`` - LSF PLM ``launch_daemons`` callback. * - ``src/mca/ras/base/ras_base_allocate.c`` - ``prte_ras_base_add_hosts()`` (thin async wrapper, line 771); ``prte_ras_base_complete_request()`` (grow/shrink completion, line 586); ``prte_ras_base_modify()`` (routes requests to RAS modules, line 529). * - ``src/mca/ras/hosts/ras_hosts.c`` - ``ras/hosts`` module ``modify()`` entry point: parses hostfiles and host lists and inserts nodes into the pool (line 340). * - ``src/mca/ras/slurm/ras_slurm_modify_extend.c`` - Slurm ``modify()`` entry for ``PMIX_ALLOC_EXTEND``; fires ``LAUNCH_DAEMONS`` directly on the daemon job (line 752) instead of routing through ``prte_ras_base_complete_request()`` — see the launch-fence warning under *DVM Extension and the Daemon-Launch Race*. * - ``src/prted/prted_comm.c`` - ``PRTE_DAEMON_SHRINK_CMD`` handler (line 469): checks daemon rank list and exits cleanly if listed. Debugging --------- Verbose output for each subsystem is controlled at runtime: .. code-block:: sh # Job state machine transitions prte --prtemca state_base_verbose 5 ... # PLM (daemon launch, message receive) prte --prtemca plm_base_verbose 5 ... # Process mapping prte --prtemca rmaps_base_verbose 5 ... # Resource allocation prte --prtemca ras_base_verbose 5 ... At verbosity level 5 the state machine also prints its full table at startup via ``prte_state_base_print_job_state_machine()``. DVM Extension and the Daemon-Launch Race ----------------------------------------- Background ~~~~~~~~~~ A persistent DVM can have its node pool expanded at runtime in two ways: 1. **App-triggered** (``src/mca/ras/base/ras_base_allocate.c``:771): A job submitted with ``--add-host`` or ``--add-hostfile`` causes the RAS base ``add_hosts()`` function — now a thin asynchronous wrapper — to collect the directives into a ``prte_pmix_server_req_t`` with ``req->key = "hosts"`` and ``req->allocdir = PMIX_ALLOC_EXTEND``. It sets ``prte_dvm_ready = false`` to block concurrent job dispatch, then posts the request to the event loop for ``prte_ras_base_modify()`` to handle. ``prte_ras_base_modify()`` routes the request to the ``ras/hosts`` module, whose ``modify()`` entry point (``src/mca/ras/hosts/ras_hosts.c``:340) parses the hostfiles and host lists and inserts new nodes into ``prte_node_pool``. On success the common completion function ``prte_ras_base_complete_request()`` (line 586) marks ``PRTE_JOB_EXTEND_DVM`` on the **daemon job** and fires ``PRTE_JOB_STATE_LAUNCH_DAEMONS`` on the daemon job. Any application jobs that arrive while ``prte_dvm_ready`` is false are stashed in ``prte_cache`` and flushed when ``vm_ready()`` fires. 2. **Scheduler push** (``src/mca/ras/slurm/ras_slurm_modify_extend.c``:752): When Slurm grants additional nodes (e.g., in response to a ``PMIx_Allocate`` call from an application), the Slurm RAS component adds the nodes to the pool and fires ``PRTE_JOB_STATE_LAUNCH_DAEMONS`` **directly on the daemon job**, setting ``PRTE_JOB_EXTEND_DVM`` on the daemon job — bypassing ``prte_ras_base_complete_request()`` and leaving ``prte_dvm_ready`` unchanged. In both cases ``setup_virtual_machine()`` is called (from within the PLM's ``launch_daemons`` callback) and detects the extension via the ``PRTE_JOB_EXTEND_DVM`` attribute on the daemon job. If new daemons are needed it sets ``PRTE_JOB_LAUNCHED_DAEMONS`` on the daemon job and returns with ``map->num_new_daemons > 0``. The PLM then spawns ``prted`` processes on the new nodes and the state machine parks at ``DAEMONS_LAUNCHED`` until they call home. .. warning:: A RAS component that handles a modification request (grow or shrink) must route its result through ``prte_ras_base_complete_request()`` rather than activating ``PRTE_JOB_STATE_LAUNCH_DAEMONS`` directly on the daemon job. ``prte_ras_base_complete_request()`` is the single point that performs the bookkeeping the launch fence depends on: it sets ``PRTE_JOB_EXTEND_DVM`` and resets ``prte_nidmap_communicated`` on the grow path, and on the shrink path it records the ``prte_shrink_campaign_t`` and raises ``prte_dvm_launch_fence`` *before* any daemon is asked to leave. A component that fires ``PRTE_JOB_STATE_LAUNCH_DAEMONS`` itself — as the Slurm scheduler-push path historically does — skips this common handling and can leave the fence out of step with the campaign it is supposed to gate, reopening the daemon-launch race described below. New RAS modules, and any reworking of the existing ones, should hand their results to ``prte_ras_base_complete_request()`` and let it activate the state. DVM Shrink ~~~~~~~~~~ A DVM can also be **shrunk** at runtime by releasing nodes back to the scheduler. The path runs through the same ``prte_ras_base_complete_request()`` function, but with ``req->allocdir == PMIX_ALLOC_RELEASE``: 1. The ``PMIX_ALLOC_RELEASE`` branch extracts the node list from ``PMIX_ALLOC_NODE_LIST``, looks up each node's daemon rank in ``prte_node_pool``, and packs the ranks into a ``PRTE_DAEMON_SHRINK_CMD`` message. 2. The message is broadcast to all daemons via ``prte_grpcomm.xcast(PRTE_RML_TAG_DAEMON)``. 3. Each daemon that receives ``PRTE_DAEMON_SHRINK_CMD`` (``src/prted/prted_comm.c``:469) checks whether its own rank appears in the unpacked list. If listed, it: a. Sets ``prte_abnormal_term_ordered = true``. b. Fires a ``PMIX_EVENT_JOB_END`` PMIx event to notify any attached tools. c. Activates ``PRTE_JOB_STATE_DAEMONS_TERMINATED`` and exits cleanly. The HNP needs no acknowledgement from the daemon: it learns that the daemon is gone through the normal daemon-loss (comm-failure) path, which is also the only event that guarantees the daemon's routes and node state have actually been torn down (see below). Unlisted daemons silently discard the command and continue running. In addition, each RAS module may implement a ``release_allocation`` entry point (added in ``src/mca/ras/ras.h``). The base function ``prte_ras_base_release_allocation()`` cycles active modules in priority order (filtering by ``session->alloc_module`` when set) and is called automatically from the ``prte_session_t`` destructor so that allocations are released when their session object is destructed. Shrink Synchronisation Requirement ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``PRTE_DAEMON_SHRINK_CMD`` xcast is fire-and-forget: targeted daemons exit on their own schedule, and the HNP must determine when all of them have actually terminated. This creates two race windows that must be closed. **Race 1 — new job mapping onto a shrinking node** A job that reaches the ``VM_READY → MAP`` boundary while a shrink is in progress may have its processes mapped onto a node whose daemon has already received ``PRTE_DAEMON_SHRINK_CMD``. By the time the launch message is sent the daemon may already have exited. **Race 2 — in-flight job at** ``LAUNCH_APPS`` A job that was fully mapped *before* a shrink started and then reaches ``LAUNCH_APPS`` (where launch data is packed and sent to each daemon) may send to a daemon that dies in the window between MAP and the actual send. Closing both races requires: 1. **Completion on actual daemon death** — the HNP records the targeted daemon ranks in a ``prte_shrink_campaign_t`` and waits for each one to leave the DVM. Departure is detected through the existing daemon-loss (comm-failure) path in the ``errmgr/dvm`` component, which matches the dead daemon's rank against the campaign's target list, drives the fence counter down, and releases the fence once every target is gone. The HNP does not rely on any acknowledgement from the daemon: the reason a targeted daemon dies is irrelevant, and the comm-failure event is the only signal that also guarantees the daemon's routes, ``num_daemons`` count, and node state have been cleaned up. Each target slot is stamped ``PMIX_RANK_INVALID`` once counted so a repeated comm event cannot decrement the campaign twice. 2. **Second hold point at** ``LAUNCH_APPS`` — ``prte_plm_base_launch_apps()`` checks a dedicated ``prte_shrink_ntargets`` counter (nonzero only when a shrink is in progress) and if nonzero parks the job in a second held-job array (``prte_prelaunch_held_jobs``) rather than packing or sending any launch data. This hold uses ``prte_shrink_ntargets`` rather than the general ``prte_dvm_launch_fence`` so that a concurrent DVM grow does not unnecessarily stall jobs that have already been mapped to existing nodes. 3. **Remap on release** — when ``prte_dvm_launch_fence`` returns to zero, jobs in ``prte_prelaunch_held_jobs`` that were mapped to any of the now-dead daemon nodes are reset to ``MAP`` state so they are remapped to the surviving nodes; jobs whose entire mapping lies on surviving nodes are re-activated at ``LAUNCH_APPS`` without remapping. The full implementation plan is in :ref:`dvm-shrink-campaign-label`; the shared fence mechanism it builds on is in :ref:`elastic-dvm-plan-label`. The Race Condition ~~~~~~~~~~~~~~~~~~ The app-triggered path partially mitigates the race by setting ``prte_dvm_ready = false`` in ``add_hosts()`` before the asynchronous request is posted: any job that arrives after that point is stashed in ``prte_cache`` and is not dispatched until ``vm_ready()`` restores ``prte_dvm_ready = true``. The scheduler-push path does **not** clear ``prte_dvm_ready``. Because ``prte_dvm_ready`` otherwise remains ``true`` throughout DVM operation (it is only cleared at shutdown), any application job that arrives while a scheduler-initiated daemon launch is in flight is dispatched immediately: .. code-block:: text Thread of events (time →) Slurm grants new nodes ras_slurm_modify_extend fires LAUNCH_DAEMONS on daemon job PLM starts spawning prted on new nodes ← daemon launch in progress App job B arrives, prte_dvm_ready==true, B is dispatched B: INIT → ALLOCATE → VM_READY B: MAP ← assigns procs to new nodes ← daemons NOT UP YET B: SEND_LAUNCH_MSG → daemons fail to receive it The same race exists when multiple apps are running concurrently inside the DVM and one of them triggers an allocation expansion: the other apps' independent state machine progressions can interleave with the daemon launch events. Required Change: Gate at the VM_READY → MAP Boundary ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To eliminate the race, all application jobs must be held at the ``VM_READY → MAP`` boundary whenever any daemon launch campaign is in progress, regardless of which path (app-triggered or scheduler push) initiated it. Jobs that are already past ``MAP`` (i.e., already launching or running) are unaffected — their daemons are already up. The mechanism is a **global launch fence** — a counter (``prte_dvm_launch_fence``) that tracks the number of in-progress daemon launch campaigns. An app job that reaches the ``VM_READY → MAP`` transition checks the fence; if it is nonzero the job parks itself in a held-job array (``prte_held_jobs``) and is released when the fence reaches zero. The step-by-step implementation plan is in :ref:`elastic-dvm-plan-label`, with the grow- and shrink-specific details in :ref:`dvm-grow-campaign-label` and :ref:`dvm-shrink-campaign-label`.