8.2.1. Node Reservation and Session Targeting: Specification
8.2.1.1. Purpose
This document specifies the externally observable behavior of node
reservation and session targeting in PRRTE: the contract a tool,
scheduler, or application may rely on when it asks the DVM to set aside
nodes for later use and to direct spawned jobs onto those nodes. It
defines what the runtime does, not how it does it. The companion
design plan, node-reservation.rst, describes the internal data
structures, code paths, and implementation order; where this
specification and that plan disagree about observable behavior, this
specification is authoritative and the plan must be corrected.
The contract is expressed entirely in terms of PMIx attributes supplied on allocation and spawn requests and the PMIx status codes returned in response. No new command-line options, environment variables, or tool behaviors are introduced.
8.2.1.2. Scope
8.2.1.2.1. In scope
The meaning of the PMIX_ALLOC_TARGET, PMIX_ALLOC_SHARE, PMIX_ALLOC_INHERITANCE, PMIX_ALLOC_WARN_TIMEOUT, and PMIX_SPAWN_TARGET attributes as interpreted by PRRTE.
The rules that decide which pool of nodes a dynamically allocated set of nodes joins.
The rules that decide which nodes a spawned job may be mapped onto.
The ownership model that governs who may target a reservation.
The events that end a reservation, including the per-reservation inheritance disposition, and the observable effect of each.
The relay of the scheduler’s PMIX_ALLOC_TIMEOUT_WARNING event to the requester ahead of an allocation’s expiration.
The behavior guaranteed when the underlying PMIx does not define one or more of the attributes above.
8.2.1.2.2. Out of scope / non-goals
PRRTE invents no reservation policy of its own. It does not decide on its own to reserve, share, time-out, or reclaim nodes; every such decision is driven by an attribute the requester supplied or by the scheduler returning nodes. The sole defaults are those stated below.
No guarantee is made about the physical nodes chosen to satisfy a dynamic allocation — node selection remains the scheduler’s responsibility. This specification governs only which session the resulting nodes are placed into.
Quotas, fair-share, priority, and accounting across reservations are not addressed.
The wire encoding of attributes and the timing of internal state transitions are implementation details and are not specified here.
8.2.1.3. Definitions
- Allocation
A set of nodes returned by a single resource request, identified by its PMIX_ALLOC_ID string.
- Default session
The general, unreserved pool of nodes. Every node discovered when the DVM starts belongs to the default session, and any job that does not target a reservation may map onto it. The default session is referred to in a request by the invalid namespace sentinel (an empty allocation id / PMIX_NSPACE_INVALID).
- Reservation
A non-default session whose nodes are withheld from the default pool and made available only to the namespaces that own it or to jobs that explicitly target it by allocation id.
- Owner set
The set of namespaces entitled to spawn jobs onto a reservation’s nodes. Reservations have a set of owners, not a single owner (see Ownership model). The scheduler is an implicit owner of every session.
- Namespace / job
Every job is its own namespace and every namespace maps to exactly one job; the two terms are interchangeable in this specification. “A namespace owns a reservation” therefore means “this one job, and any process within it, may spawn onto that reservation.”
8.2.1.4. Attribute contract
- PMIX_ALLOC_TARGET
Supplied on an allocation request to name the namespace a new reservation is created for (or whose existing reservation is extended). Honored only when the requester is a tool; a request from an application that supplies it is rejected (see Error semantics). When absent, the reservation is created for the requester’s own namespace.
- PMIX_ALLOC_SHARE
A boolean that governs the allocation’s initial reservation state, that is, its disposition while the owning namespace lives. When absent or false (the default), the nodes are reserved for the owning namespace. When true, the nodes are unreserved from the outset: they join the default session and are usable by any member of the session. PMIX_ALLOC_SHARE is orthogonal to PMIX_ALLOC_INHERITANCE: the former describes the allocation while its owning namespace lives, the latter what happens when that namespace dies. (This attribute replaces the former PMIX_ALLOC_RESERVED with inverted sense.)
- PMIX_ALLOC_INHERITANCE
Supplied on an allocation request to select what becomes of the allocation when its owning namespace terminates. Its value is a pmix_alloc_inheritance_t taking one of four dispositions (PMIX_ALLOC_INHERIT_NONE, PMIX_ALLOC_INHERIT_CHILD, PMIX_ALLOC_INHERIT_DEFAULT, or PMIX_ALLOC_INHERIT_CHILD_DEFAULT) whose meanings are defined under Lifecycle. It is honored on PMIX_ALLOC_NEW and PMIX_ALLOC_EXTEND requests and ignored on PMIX_ALLOC_RELEASE, PMIX_ALLOC_REAQUIRE, and PMIX_ALLOC_REQ_CANCEL. Because the disposition governs behavior at the owning namespace’s death rather than the reservation’s live state, it is meaningful even for a shared (unreserved) allocation: there it still decides whether the nodes return to the scheduler (NONE/CHILD) or remain in the session. When absent, the host behaves as if PMIX_ALLOC_INHERIT_DEFAULT had been specified (see Lifecycle); the legacy “release to the scheduler on termination” behavior must be requested explicitly with PMIX_ALLOC_INHERIT_NONE. A non-default value that the host does not support is rejected (see Error semantics) rather than silently ignored, so the requester is never misled about its allocation’s lifetime.
- PMIX_ALLOC_WARN_TIMEOUT
A uint32_t supplied on a PMIX_ALLOC_NEW or PMIX_ALLOC_EXTEND request, giving the number of seconds before the allocation’s expiration at which the requester wishes to be warned. It is advisory: a scheduler that does not implement it simply ignores it and no warning is delivered. When honored, the scheduler delivers a PMIX_ALLOC_TIMEOUT_WARNING event to the requester as the deadline nears; see Allocation timeout warning.
- PMIX_SPAWN_TARGET
Supplied on a spawn request to select the allocation(s) whose nodes the job may map onto. Its value is either a single allocation-id string or an array of allocation-id strings. Each entry names a reservation by its PMIX_ALLOC_ID; an entry equal to the invalid namespace denotes the default session. When absent, the job uses the default session (or its parent’s session, exactly as today).
All five attributes are optional. A request that supplies none of them behaves exactly as a pre-reservation PRRTE would (see Backward compatibility).
8.2.1.5. Allocation behavior
When the scheduler returns nodes for a dynamic allocation request, the nodes are always added to the DVM so daemons can be launched on them. Which session they become available to is decided as follows, by requester class and the attributes present.
There are three requester/allocation cases:
Startup allocation. The nodes discovered when the DVM starts form the default session. No request drives this; it is the baseline.
Tool-requested dynamic allocation. A tool (a PMIx client with no job of its own, e.g. a scheduler or a prun-class process) issues the request.
Application-requested dynamic allocation. A process belonging to a running job issues the request.
The routing rule is summarized in the decision table below. “Session
owned by <nspace>” means a reservation whose owner set is seeded with
<nspace>. PMIX_ALLOC_SHARE governs only the destination
(reserved versus shared); PMIX_ALLOC_TARGET independently designates
the owning namespace for inheritance and accounting, which still
applies even when the nodes are shared.
| Requester | PMIX_ALLOC_TARGET | PMIX_ALLOC_SHARE | Destination |
|---|---|---|---|
| Tool | <nspace> | absent / false | reservation owned by <nspace> |
| Tool | absent | absent / false | reservation owned by the tool’s own namespace |
| Tool | absent | true | default session (shared) |
| Application | absent | absent / false | reservation owned by the application’s namespace |
| Application | absent | true | default session (shared) |
| Application | present | any | rejected (see Error semantics) |
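The routing rule in the decision table can be modeled as a small pure function. The sketch below is illustrative only: `route_allocation`, its parameters, and the returned labels are hypothetical names, not PRRTE or PMIx identifiers. It also assumes, following the paragraph above the table, that a tool supplying both a target and `share=True` gets shared nodes with the target recorded as owning namespace, a combination the table does not list explicitly.

```python
from typing import Optional, Tuple


def route_allocation(requester: str, own_nspace: str,
                     target: Optional[str] = None,
                     share: bool = False) -> Tuple[str, Optional[str]]:
    """Return (destination, owning namespace) for a dynamic allocation.

    Hypothetical model of the decision table; not PRRTE code.
    """
    if requester not in ("tool", "application"):
        raise ValueError(f"unknown requester class: {requester!r}")
    # An application may not designate another namespace as the owner.
    if requester == "application" and target is not None:
        return ("rejected", None)
    # The target (tool only) names the owner; otherwise the requester owns it.
    owner = target if target is not None else own_nspace
    # The share flag decides the destination only; the owner is still recorded.
    destination = "default session (shared)" if share else "reservation"
    return (destination, owner)
```

For example, `route_allocation("tool", "tool-ns", target="jobA")` yields a reservation owned by `jobA`, while the same call from an application is rejected.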
8.2.1.5.1. New versus extend
A request carrying the PMIX_ALLOC_NEW directive always mints a fresh
reservation, even for a namespace that already owns one; the namespace
may thus hold several reservations, each distinguished by its allocation
id. A request carrying PMIX_ALLOC_EXTEND adds nodes to the existing
reservation named by the request’s PMIX_ALLOC_ID (the
scheduler-assigned id) or PMIX_ALLOC_REQ_ID (the requester’s own
request id); it must carry at least one, and succeeds only when the
requester’s namespace owns the named reservation.
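As a rough illustration of the extend rule, the following sketch looks up the reservation named by either identifier and checks ownership. Everything here is hypothetical: the dictionary layout, `find_extend_target`, and the `AllocError` exception (which stands in for whatever status the runtime returns). It further assumes that "owns" means membership in the reservation's owner set.

```python
from typing import Dict, List, Optional


class AllocError(Exception):
    """Stand-in for a rejected request; real status codes are not modeled."""


def find_extend_target(reservations: List[Dict], requester_nspace: str,
                       alloc_id: Optional[str] = None,
                       req_id: Optional[str] = None) -> Dict:
    """Return the reservation an extend request names, or raise AllocError."""
    # An extend request must carry at least one of the two identifiers.
    if alloc_id is None and req_id is None:
        raise AllocError("extend must carry an allocation id or a request id")
    for res in reservations:
        named = ((alloc_id is not None and res["alloc_id"] == alloc_id) or
                 (req_id is not None and res.get("req_id") == req_id))
        if named:
            # Assumption: ownership is tested against the owner set.
            if requester_nspace not in res["owners"]:
                raise AllocError("requester does not own the named reservation")
            return res
    raise AllocError("no reservation matches the supplied identifier")
```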
8.2.1.5.2. Reporting the result
A successful allocation response returns the destination’s
PMIX_ALLOC_ID so the requester can later target the reservation when
spawning or extend or release it. When the request supplied a
PMIX_ALLOC_REQ_ID, that id is echoed back in the response.
8.2.1.6. Spawn targeting behavior
A spawn request selects its candidate nodes through PMIX_SPAWN_TARGET:
Absent, or naming only the invalid namespace — the job maps onto the default session. This is the unchanged, pre-reservation behavior.
One or more valid allocation ids — the job may map onto the union of those reservations’ nodes.
Valid ids plus the invalid namespace — the union additionally includes the default pool. This is the supported way to request “my reservation, and the general pool too.”
Every targeted allocation is subject to the ownership check below. A job’s candidate node pool is exactly the union of the sessions it is permitted to target; it can see no other nodes.
A consequence for existing jobs: once any node is reserved, a job that targets the default session no longer sees reserved nodes. This isolation is intended. A job that targets no reservation behaves identically to today on an allocation where nothing has been reserved.
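The union rule above can be sketched as follows. This is a minimal model under stated assumptions: the empty string stands in for the invalid-namespace sentinel, `sessions` is a hypothetical map from allocation id to node names, the parent-session fallback is not modeled, and the ownership check is deliberately left out.

```python
from typing import Dict, Iterable, Set, Union

DEFAULT_SESSION = ""  # stand-in for the invalid-namespace sentinel


def candidate_nodes(spawn_target: Union[None, str, Iterable[str]],
                    sessions: Dict[str, Set[str]]) -> Set[str]:
    """Return the union of the nodes of every targeted session."""
    if spawn_target is None:
        targets = [DEFAULT_SESSION]          # attribute absent
    elif isinstance(spawn_target, str):
        targets = [spawn_target]             # single allocation id
    else:
        targets = list(spawn_target) or [DEFAULT_SESSION]
    pool: Set[str] = set()
    for alloc_id in targets:
        pool |= sessions[alloc_id]           # KeyError for an unknown id
    return pool
```

With a default pool of `n0`, `n1` and a reservation `A1` holding `n2`, targeting `["A1", ""]` gives all three nodes, while an absent target gives only `n0` and `n1`.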
8.2.1.7. Ownership model
A reservation is owned by a set of namespaces:
The set is seeded with the owning namespace: the namespace the reservation was created for (the PMIX_ALLOC_TARGET namespace for a tool request, otherwise the requester’s own namespace). A reservation has exactly one owning namespace for its lifetime; it is the namespace whose termination drives inheritance (see Lifecycle).
Each time a job is successfully spawned into a reservation, that job’s namespace is added to the owner set as a child namespace, and may then spawn further jobs onto the same nodes. This owner set is the authorization layer: it determines who may map onto and target the reservation.
Distinct from authorization, a derived child is, for lifetime purposes, any namespace reachable by following the spawn relationship from the owning namespace to arbitrary depth: a descendant at any level, not just an immediate child. Children launched into the reservation via PMIX_SPAWN_TARGET count as derived children. The set of derived children is what the CHILD-flavored inheritance rules track (see Lifecycle); it is distinct from the single owning namespace that seeded the reservation.
Ownership is not inherited transitively. A child owns only the reservation it was spawned into, never the other reservations its parent happens to own.
The scheduler is an implicit owner of every session and may target or release any of them.
A spawn may target a session only if, for every entry in
PMIX_SPAWN_TARGET, at least one of the following holds: the entry is
the default session; the requester’s namespace is in the session’s owner
set; or the requester is the scheduler. If any entry fails this check
the entire spawn is rejected (see Error semantics) — the request is
never silently narrowed to the permitted subset, so the caller always
learns its request was not honored in full.
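The all-or-nothing authorization check and the non-transitive growth of the owner set can be illustrated with a short sketch. The names `may_spawn`, `record_spawn`, and `owner_sets` are hypothetical, and the empty string again stands in for the default session.

```python
from typing import Dict, Iterable, Set

DEFAULT_SESSION = ""  # stand-in for the invalid-namespace sentinel


def may_spawn(targets: Iterable[str], requester_nspace: str,
              owner_sets: Dict[str, Set[str]],
              is_scheduler: bool = False) -> bool:
    """True only if every target passes; the request is never narrowed."""
    return all(
        t == DEFAULT_SESSION
        or is_scheduler
        or requester_nspace in owner_sets.get(t, set())
        for t in targets
    )


def record_spawn(targets: Iterable[str], child_nspace: str,
                 owner_sets: Dict[str, Set[str]]) -> None:
    """After a successful spawn, add the child to each targeted reservation."""
    for t in targets:
        if t != DEFAULT_SESSION:
            owner_sets.setdefault(t, set()).add(child_nspace)
```

If a parent owns reservations `A1` and `A2` and spawns a child into `A1` only, the child may later target `A1` but a request naming both `A1` and `A2` is refused as a whole.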
8.2.1.8. Error semantics
Condition |
Returned status |
|---|---|
Application request carries |
|
|
|
|
|
|
|
|
|
|
|
|
|
In every rejection case no partial effect is observable: a rejected
allocation reserves nothing, and a rejected spawn launches no processes.
A non-default PMIX_ALLOC_INHERITANCE value is rejected rather than
silently ignored so the requester is never misled about its allocation’s
lifetime.
8.2.1.9. Lifecycle
A reservation’s lifetime is governed by two independent mechanisms:
unconditional triggers that end it regardless of any other state, and
a namespace-termination disposition, selected at creation by
PMIX_ALLOC_INHERITANCE, that decides what the termination of the owning
namespace (and of its derived children) does to the reservation.
8.2.1.9.1. Unconditional triggers
Independently of the inheritance disposition, a reservation ends when any of these occurs:
1. An owner releases it by issuing PMIX_ALLOC_RELEASE for the reservation’s allocation id. Any namespace in the owner set may release it; the scheduler may release any reservation.
2. The DVM tears down. All sessions, default and reserved alike, end with the DVM.
3. The scheduler terminates the allocation (for example on a timeout or an administrative reclaim), which arrives as a scheduler-driven release for the allocation id.
8.2.1.9.2. Namespace-termination disposition
PMIX_ALLOC_INHERITANCE selects, at allocation time, what becomes of
the reservation as its owning namespace and derived children
(see Ownership model) terminate. It answers two independent
questions: when the disposition fires (lifetime: at the owning
namespace’s termination, or only after the last derived child has also
terminated) and what it then does (disposition: return the nodes to
the scheduler, or merely make them unreserved within the session):
- PMIX_ALLOC_INHERIT_NONE
When the owning namespace terminates, the reservation is released and its nodes are returned to the scheduler, even if derived children are still running on them. This is the legacy behavior.
- PMIX_ALLOC_INHERIT_CHILD
The reservation outlives its owning namespace and is released (nodes returned to the scheduler) only when all derived children have terminated.
- PMIX_ALLOC_INHERIT_DEFAULT
When the owning namespace terminates, the nodes are not returned to the scheduler; they become unreserved and remain in the session.
- PMIX_ALLOC_INHERIT_CHILD_DEFAULT
The nodes become unreserved within the session when the last derived child terminates, rather than being returned to the scheduler.
The two axes combine as:
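| | Returned to scheduler | Unreserved in session |
|---|---|---|
| At owning-namespace termination | PMIX_ALLOC_INHERIT_NONE | PMIX_ALLOC_INHERIT_DEFAULT |
| At last-derived-child termination | PMIX_ALLOC_INHERIT_CHILD | PMIX_ALLOC_INHERIT_CHILD_DEFAULT |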
Returned to the scheduler runs the same path as an explicit
PMIX_ALLOC_RELEASE: the nodes leave the reservation and go back to the
scheduler’s general pool (the DVM shrinks for nodes the scheduler owns;
nodes that were carved from the DVM’s own startup pool revert to the
default session). Unreserved in session never returns nodes to the
scheduler on this trigger; the named allocation goes away but every node
stays in the DVM, clears its reservation, and joins the default session
for general use. Such nodes return to the scheduler only when the
session (DVM) itself ends or they are explicitly released — inheritance
alone never hands them back.
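The four dispositions can be viewed as the product of the two axes, which the sketch below encodes directly. It is an illustrative model only: the `Inherit` enum and the `outcome` function are hypothetical names, the string labels are invented for the example, and a missing disposition is treated as the `DEFAULT` one, as the attribute contract states.

```python
from enum import Enum
from typing import Optional


class Inherit(Enum):
    # value = (when the disposition fires, what it does)
    NONE = ("owner", "return_to_scheduler")
    CHILD = ("last_child", "return_to_scheduler")
    DEFAULT = ("owner", "unreserve_in_session")
    CHILD_DEFAULT = ("last_child", "unreserve_in_session")


def outcome(disposition: Optional[Inherit], owner_alive: bool,
            live_children: int) -> Optional[str]:
    """Return the action that fires now, or None if the reservation persists."""
    chosen = disposition if disposition is not None else Inherit.DEFAULT
    fires_at, action = chosen.value
    if owner_alive:
        return None  # nothing fires while the owning namespace lives
    if fires_at == "last_child" and live_children > 0:
        return None  # CHILD flavors wait for the last derived child
    return action
```

For instance, with the owning namespace gone and two derived children still running, `Inherit.NONE` returns the nodes to the scheduler immediately, whereas `Inherit.CHILD` leaves the reservation in place until the child count reaches zero.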
8.2.1.9.3. Default disposition
When PMIX_ALLOC_INHERITANCE is absent, the host behaves as if
PMIX_ALLOC_INHERIT_DEFAULT had been specified: on termination of the
owning namespace the reservation’s nodes become unreserved and remain in
the session rather than being returned to the scheduler. The legacy
“release to the scheduler on termination” behavior is not the
default; a requester that wants it must pass PMIX_ALLOC_INHERIT_NONE
explicitly.
8.2.1.9.4. Ending a reservation while jobs are still mapped
Any trigger — an unconditional one or a namespace-termination
disposition — may fire while jobs are still mapped onto the
reservation’s nodes (for example, PMIX_ALLOC_INHERIT_NONE releasing
the reservation while child jobs run on it). The observable outcome
depends only on whether the nodes leave the DVM:
Nodes stay in the DVM (an owner release whose nodes are not returned to the scheduler, or a *_DEFAULT disposition becoming unreserved). No process is killed. The nodes revert to the default pool, already-running jobs continue undisturbed, and only the mapping of new jobs stops seeing the reservation.
Nodes leave the DVM (a scheduler reclaim or timeout, a NONE or CHILD disposition returning nodes to the scheduler, or any release of scheduler-owned nodes). This is the existing DVM-shrink behavior: jobs and daemons on the departing nodes are terminated before the nodes leave.
Reservation teardown never itself kills a job; process termination occurs only when nodes physically depart, under the pre-existing shrink behavior.
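One way to read the two cases is as a per-node partition: a node departs, and its jobs are terminated, only if the trigger returns nodes to the scheduler and the scheduler owns that node. The sketch below assumes exactly that reading; `teardown_effect`, the boolean ownership map, and the result keys are hypothetical.

```python
from typing import Dict, List


def teardown_effect(nodes: Dict[str, bool],
                    returned_to_scheduler: bool) -> Dict[str, List[str]]:
    """Partition a reservation's nodes by what happens when it ends.

    nodes maps a node name to True if the scheduler owns it, or False if it
    was carved from the DVM's own startup pool.
    """
    # Only scheduler-owned nodes can leave, and only on a returning trigger.
    departing = sorted(name for name, scheduler_owned in nodes.items()
                       if returned_to_scheduler and scheduler_owned)
    staying = sorted(name for name in nodes if name not in departing)
    return {
        "leave_dvm_jobs_terminated": departing,
        "revert_to_default_jobs_continue": staying,
    }
```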
8.2.1.9.5. Allocation timeout warning
Scheduler reclaim (unconditional trigger 3) is frequently driven by a bounded allocation lifetime: when the granted period expires the scheduler reclaims the nodes, terminating any jobs still running on them. To let an owner checkpoint, drain work, or request an extension before that happens, a requester may ask for advance notice when it creates or extends the allocation. PRRTE acts purely as the relay between the requester and the scheduler here — it neither sets the timeout nor generates the warning:
On a PMIX_ALLOC_NEW or PMIX_ALLOC_EXTEND request the requester includes PMIX_ALLOC_WARN_TIMEOUT (uint32_t seconds before expiration). PRRTE forwards it to the scheduler. Because the attribute is advisory, a scheduler that does not implement it ignores it and no warning follows.
When the threshold is reached the scheduler issues a PMIX_ALLOC_TIMEOUT_WARNING event. PRRTE delivers it only to the process that requested the allocation (it is a directed event, not a broadcast), carrying:
- PMIX_ALLOC_ID (char*): the affected allocation; always present.
- PMIX_ALLOC_REQ_ID (char*): the requester’s own request id, included whenever one was supplied on the original request, so the recipient can match the warning by either identifier.
- PMIX_TIME_REMAINING (uint32_t): seconds left before expiration, in the same units as PMIX_ALLOC_WARN_TIMEOUT.
The recipient reacts within its event handler, typically by issuing a PMIX_ALLOC_EXTEND (optionally renewing PMIX_ALLOC_WARN_TIMEOUT) to lengthen the allocation or by beginning an orderly drain or shutdown, and returns the appropriate event status. If the allocation is not extended, it expires and the scheduler reclaims its nodes through the existing shrink path described above, terminating the jobs still mapped onto them.
The warning is strictly advance notice; it changes nothing about the reservation by itself. Only the subsequent expiration (or an explicit extend/release) alters the allocation, and that flows through the unconditional triggers and dispositions already specified.
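A recipient's reaction to the warning might look like the following sketch. The event is modeled as a plain dictionary keyed by the attribute names listed above; `react_to_warning` and the extend-or-finish policy are hypothetical and only illustrate matching by either identifier and using the remaining time.

```python
from typing import Dict, Optional


def react_to_warning(event: Dict, alloc_id: str, req_id: Optional[str],
                     seconds_of_work_left: int) -> str:
    """Decide how to react to a timeout warning (illustrative policy only)."""
    # Match on the allocation id, or on the request id if one was supplied.
    matches = (event.get("PMIX_ALLOC_ID") == alloc_id or
               (req_id is not None and
                event.get("PMIX_ALLOC_REQ_ID") == req_id))
    if not matches:
        return "ignore"
    remaining = event["PMIX_TIME_REMAINING"]
    # Hypothetical policy: ask for more time only if the work will not fit.
    return "extend" if seconds_of_work_left > remaining else "finish"
```

A process that holds several allocations could call such a helper once per allocation and act only on the one that matches.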
8.2.1.10. Backward compatibility
Every attribute in this specification is optional, and PRRTE must build and run against a PMIx that defines none, some, or all of them.
When PMIX_ALLOC_TARGET is unavailable, a dynamic allocation is always reserved to the requester’s own namespace.
When PMIX_ALLOC_SHARE is unavailable, the “reserve” default applies and nodes are never routed to the default pool by a dynamic request.
When PMIX_ALLOC_INHERITANCE is unavailable, every reservation takes the default disposition (PMIX_ALLOC_INHERIT_DEFAULT): on termination of its owning namespace the nodes become unreserved within the session. The non-default dispositions, in particular the legacy PMIX_ALLOC_INHERIT_NONE (release to the scheduler), cannot be requested, and a request carrying a non-default value is rejected (see Error semantics).
When PMIX_ALLOC_WARN_TIMEOUT is unavailable, no advance timeout warning is requested; an allocation still expires and is reclaimed by the scheduler as before, only without the advisory PMIX_ALLOC_TIMEOUT_WARNING event beforehand.
When PMIX_SPAWN_TARGET is unavailable, spawns use the default or parent session exactly as they do today.
A DVM whose PMIx lacks all five attributes still partitions resources in a well-defined way: dynamic allocations reserve to the requesting namespace, reservations become unreserved within the session when their owning namespace exits, allocations expire and are reclaimed without an advance warning, and spawns use the default/parent session. It exhibits no behavior change for any request beyond that prescribed default disposition.
8.2.1.11. Conformance summary
A conforming implementation guarantees that:
A job’s candidate nodes are exactly the union of the sessions it is permitted to target, and never include a reserved node it does not own.
A reservation is created, extended, shared, targeted, or released only in response to a requester-supplied attribute or a scheduler action — never by PRRTE’s own initiative.
Every rejection listed in Error semantics leaves no observable partial effect.
A reservation ends on an unconditional trigger (owner release, DVM teardown, or scheduler termination) or on the namespace-termination disposition recorded at its creation, defaulting to PMIX_ALLOC_INHERIT_DEFAULT (unreserve in session) when none was supplied, and on no other event.
Nodes are returned to the scheduler only under a NONE/CHILD disposition, an explicit release, or session (DVM) teardown; a *_DEFAULT disposition only ever unreserves nodes within the session.
A PMIX_ALLOC_TIMEOUT_WARNING event, when the scheduler honors a PMIX_ALLOC_WARN_TIMEOUT request, is relayed only to the process that requested the allocation, and is advisory: it alters nothing about the reservation by itself.
The defaults and rejections above hold unchanged when the underlying PMIx lacks any of the five attributes.