7.5.2. Reliable Messaging (RELM)
A normal RML send is fire-and-forget. It rides the radix routing tree (see The Runtime Messaging Layer (RML)), and if a daemon on the path dies while a message is in flight, that message is simply lost: the tree repairs itself, but the bytes are gone. For most runtime traffic that is acceptable, because the higher-level protocol will retry or the information is reconstructible. Some traffic is not so forgiving. RELM (“reliable messaging”) is the layer that guarantees a message is delivered exactly once to its destination even if daemons on the path fail and the tree is rebuilt underneath it.
RELM lives in src/rml/relm/; its editing map is src/rml/relm/AGENTS.md
and the concrete implementation’s map is src/rml/relm/base/AGENTS.md.
7.5.2.1. How RELM is invoked
Application code sends reliably with prte_rml_send_buffer_reliable_nb
(wrapped by the PRTE_RML_RELIABLE_SEND macro). That call routes through the
prte_relm module to prte_relm_start_msg. Everything else in the
subsystem runs in reaction to two things: RELM control messages received from
neighbors, and fault notices from the routing layer.
RELM is shaped like a small framework — the prte_relm_module_t interface and
a state machine (prte_relm_state_machine_t) full of function pointers — so a
different reliability strategy could be substituted. Today there is exactly one
implementation, the base module, which relm.c installs at open time by
copying prte_relm_base_module into the global prte_relm and wiring every
state-machine callback to a prte_relm_base_* function.
7.5.2.2. Identifying a message
Every reliable message has a stable, globally-unique identity that all daemons on its path agree on:
- the origin src (the rank that called reliable_send) and a per-source UID together form a globally-unique id;
- adding the destination dst yields the message signature (prte_relm_signature_t);
- the pair <src, uid> is hashed into a 64-bit GUID (prte_relm_guid_t) used as the hash-table key.
The UID (prte_relm_uid_t) is a 32-bit counter that is allowed to wrap; the
design assumes a message is globally complete and unreferenced long before its
UID could be reused. The top of the range is reserved for sentinels
(UNKNOWN / NONE / INVALID), so valid UIDs run up to
PRTE_RELM_UID_MAX.
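The identity scheme above can be sketched in a few lines of C. This is an illustration only: the type and macro names are hypothetical stand-ins for the real ones in types.h, the exact sentinel values are assumed, and the GUID here is formed by simply packing src and uid into 64 bits, which may differ from the hash the real code uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the typedefs in types.h. */
typedef uint32_t sk_rank_t;
typedef uint32_t sk_uid_t;
typedef uint64_t sk_guid_t;

/* Assumed layout: three sentinels at the very top of the 32-bit range. */
#define SK_UID_INVALID UINT32_MAX
#define SK_UID_NONE    (UINT32_MAX - 1u)
#define SK_UID_UNKNOWN (UINT32_MAX - 2u)
#define SK_UID_MAX     (UINT32_MAX - 3u)

/* A signature is the globally-unique <src, uid> pair plus the destination. */
typedef struct {
    sk_rank_t src;
    sk_uid_t  uid;
    sk_rank_t dst;
} sk_signature_t;

/* True for UIDs that name a real message rather than a sentinel. */
bool sk_uid_is_valid(sk_uid_t uid)
{
    return uid <= SK_UID_MAX;
}

/* Advance the per-source counter, wrapping to 0 instead of
 * running into the reserved sentinel values. */
sk_uid_t sk_uid_next(sk_uid_t uid)
{
    return (uid >= SK_UID_MAX) ? 0u : uid + 1u;
}

/* Assumed GUID construction: src in the high 32 bits, uid in the low 32.
 * Any injective mapping of <src, uid> to 64 bits would serve as a key. */
sk_guid_t sk_guid(sk_rank_t src, sk_uid_t uid)
{
    return ((sk_guid_t)src << 32) | (sk_guid_t)uid;
}
```

Note that dst does not take part in the GUID: the pair <src, uid> is already unique, so the destination only selects which per-destination table the message lives in.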
State is kept per destination. prte_relm_state_machine_t holds a hash table
of prte_relm_rank_t objects keyed by destination rank; each of those holds a
hash table of prte_relm_msg_t objects keyed by GUID. A prte_relm_msg_t
records the message’s src / uid / dst, its current state, the
message data (while it may still be needed), and prev_uid / next_uid
links that chain together the messages this origin has sent to this destination.
7.5.2.3. Ordering
Messages from one origin to one destination are delivered in order. The
prev_uid / next_uid chain enforces this: a message may only be posted to
the application once its predecessor has been posted, and an acknowledgement of a
message implicitly acknowledges everything before it in the chain. A message
whose predecessor has not yet arrived is parked in the PENDING state and
released to SENDING when the predecessor is posted.
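The ordering rule can be illustrated with a small standalone sketch. It is not the RELM implementation: the struct, the UID_NONE sentinel, and the linear search are assumptions made to keep the example short, and the real code walks hash tables instead. The point it shows is that a message which has arrived stays parked until its predecessor in the prev_uid chain has been posted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UID_NONE UINT32_MAX  /* assumed sentinel: "no predecessor" */

/* Hypothetical per-message record for one <origin, destination> chain. */
typedef struct {
    uint32_t uid;
    uint32_t prev_uid;  /* UID of the message sent just before this one */
    bool     arrived;   /* data is present at the destination */
    bool     posted;    /* already handed to the application */
} chain_msg_t;

static const chain_msg_t *find_uid(const chain_msg_t *msgs, size_t n,
                                   uint32_t uid)
{
    for (size_t i = 0; i < n; i++) {
        if (msgs[i].uid == uid) {
            return &msgs[i];
        }
    }
    return NULL;
}

/* Post every arrived message whose predecessor has been posted.
 * Posted UIDs are written to order[] (up to cap entries) in posting
 * order; the return value is how many messages were posted. */
size_t post_ready(chain_msg_t *msgs, size_t n, uint32_t *order, size_t cap)
{
    size_t count = 0;
    bool progress = true;

    while (progress) {  /* posting one message may release its successor */
        progress = false;
        for (size_t i = 0; i < n; i++) {
            chain_msg_t *m = &msgs[i];
            if (!m->arrived || m->posted) {
                continue;
            }
            bool pred_done = (m->prev_uid == UID_NONE);
            if (!pred_done) {
                const chain_msg_t *p = find_uid(msgs, n, m->prev_uid);
                pred_done = (p != NULL && p->posted);
            }
            if (pred_done) {
                m->posted = true;
                if (count < cap) {
                    order[count] = m->uid;
                }
                count++;
                progress = true;
            }
        }
    }
    return count;
}
```

If messages 1 and 2 arrive before message 0, a call to post_ready posts nothing; once message 0 arrives, the next call posts 0, 1, 2 in that order.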
7.5.2.4. The message lifecycle
Each prte_relm_msg_t moves through a small set of lasting states
(types.h). Because the same message object exists on every daemon along the
path, “the state” means this daemon’s view of the message:
SENDING: The message (with its data) is queued to go one hop further downstream, toward dst.
SENT: This daemon has handed the message to the next hop. An intermediate daemon caches the data at this point in case a replay is needed.
REQUESTED: A replay has been requested; the data was lost somewhere downstream and must be re-sent.
PENDING: The message is held for ordering, waiting on its predecessor.
ACKED: The destination has posted the message; an acknowledgement is travelling back upstream toward src.
ACKACKED: The origin has seen the ACK and is releasing state; the ACK-of-ACK travels back downstream so every daemon can drop the message.
A handful of ephemeral states (NEW, ACKACKED when used to trigger,
CACHED, EVICTED) drive transitions but are never stored as a
message’s state; the engine enforces this.
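The split between lasting and ephemeral states can be sketched as an enum plus a guard on the store. The enum names and the helper functions below are hypothetical and do not reproduce types.h. ACKACKED is treated as lasting here, since the text lists it in both groups depending on use.

```c
#include <stdbool.h>

/* Hypothetical mirror of the RELM states described above. */
typedef enum {
    SK_ST_NEW,        /* ephemeral: drives construction */
    SK_ST_SENDING,
    SK_ST_SENT,
    SK_ST_REQUESTED,
    SK_ST_PENDING,
    SK_ST_ACKED,
    SK_ST_ACKACKED,   /* lasting when stored, ephemeral when used as a trigger */
    SK_ST_CACHED,     /* ephemeral */
    SK_ST_EVICTED     /* ephemeral */
} sk_state_t;

/* True for states that may be recorded as a message's state. */
bool sk_state_is_lasting(sk_state_t s)
{
    switch (s) {
    case SK_ST_SENDING:
    case SK_ST_SENT:
    case SK_ST_REQUESTED:
    case SK_ST_PENDING:
    case SK_ST_ACKED:
    case SK_ST_ACKACKED:
        return true;
    default:
        return false;
    }
}

/* Minimal message view holding only this daemon's state for it. */
typedef struct {
    sk_state_t state;
} sk_msg_view_t;

/* Record a new state, refusing ephemeral ones so they can only ever
 * act as triggers. Returns false (and changes nothing) on refusal. */
bool sk_store_state(sk_msg_view_t *m, sk_state_t s)
{
    if (!sk_state_is_lasting(s)) {
        return false;
    }
    m->state = s;
    return true;
}
```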
Steady-state flow. When reliable_send starts a message, the origin
constructs it in NEW, records the data, links it after the previous message
to that destination, and moves it to SENDING. The data is forwarded one hop
at a time; each hop transitions SENDING → SENT and (if it is neither
source nor destination) caches the data. When the data reaches dst, the
destination posts it to the application — respecting the prev_uid ordering,
holding it PENDING if its predecessor has not yet posted — and moves it to
ACKED. The ACK walks back upstream hop by hop; at the origin it becomes an
ACK-of-ACK, which walks back downstream so every daemon on the path releases its
copy.
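The steady-state flow can be traced with a toy simulation of one message on a fixed path. Everything here is an assumption made for illustration: the path is a plain array with index 0 as the origin and the last index as the destination, there are no faults, ordering is ignored, and each array slot stands for one daemon's view of the message.

```c
#include <stddef.h>

/* Hypothetical per-daemon view of a single message. */
typedef enum { H_NONE, H_SENDING, H_SENT, H_ACKED, H_ACKACKED } hop_state_t;

/* Walk one message along view[0..n-1] with no faults.
 * Returns the number of hop-to-hop control messages exchanged. */
size_t run_happy_path(hop_state_t *view, size_t n)
{
    size_t wire = 0;

    if (n == 0) {
        return 0;
    }

    /* The origin queues the message. */
    view[0] = H_SENDING;

    /* Data moves downstream one hop at a time: SENDING -> SENT here,
     * and the next hop now holds the data in SENDING. */
    for (size_t i = 0; i + 1 < n; i++) {
        view[i] = H_SENT;
        view[i + 1] = H_SENDING;
        wire++;
    }

    /* The destination posts to the application. */
    view[n - 1] = H_ACKED;

    /* The ACK walks back upstream toward the origin. */
    for (size_t i = n - 1; i > 0; i--) {
        view[i - 1] = H_ACKED;
        wire++;
    }

    /* At the origin the ACK becomes an ACK-of-ACK, which walks
     * downstream so every daemon can release its copy. */
    view[0] = H_ACKACKED;
    for (size_t i = 0; i + 1 < n; i++) {
        view[i + 1] = H_ACKACKED;
        wire++;
    }
    return wire;
}
```

In this model a path of n daemons costs 3 * (n - 1) control messages per reliable message: one pass for the data, one for the ACK, and one for the ACK-of-ACK.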
7.5.2.5. State updates on the wire
State changes propagate as small RELM control messages carrying the message’s
signature, its prev_uid, its state, and (for SENDING) the data. They
are sent with the ordinary RML over a single prte_rml_get_route hop, tagged
PRTE_RML_TAG_RELM_STATE. state_machine.c provides the two emitters —
prte_relm_send_state_downstream (toward dst) and
prte_relm_send_state_upstream (toward src) — and a receiver,
prte_relm_message_handler, that unpacks an update, finds or creates the local
message object, and feeds it to the state machine.
The interesting logic is prte_relm_base_update_state (state_updates.c),
which routes an incoming update by who sent it:
- local_update: a change requested by this daemon itself;
- downstream_update: a change reported by the neighbor toward dst (typically an ACK coming back);
- upstream_update: a change reported by the neighbor toward src (typically data coming forward, or a replay request).
Each of those is a switch over the incoming state that decides what to do given the message’s current state: forward the data, post to the application, cache the data, request or replay a lost message, emit an ACK, or tear the message down. Updates that arrive from a rank that is no longer this message’s upstream or downstream neighbor are simply ignored — a natural consequence of the tree having changed.
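The first step of that routing, deciding which handler an update belongs to, might look like the sketch below. The function and enum names are hypothetical, and ranks are reduced to plain integers; the real prte_relm_base_update_state works on message objects and the current routing tree.

```c
#include <stdint.h>

#define RANK_NONE UINT32_MAX  /* assumed sentinel: no neighbor in that direction */

/* Which of the three handlers an update belongs to, or none. */
typedef enum {
    FROM_LOCAL,       /* this daemon requested the change */
    FROM_DOWNSTREAM,  /* neighbor toward dst, e.g. an ACK coming back */
    FROM_UPSTREAM,    /* neighbor toward src, e.g. data or a replay request */
    FROM_STALE        /* sender is no longer a neighbor for this message */
} update_origin_t;

/* Classify an update by its sender, given this daemon's rank and its
 * current upstream/downstream neighbors for the message. */
update_origin_t classify_update(uint32_t me, uint32_t sender,
                                uint32_t upstream, uint32_t downstream)
{
    if (sender == me) {
        return FROM_LOCAL;
    }
    if (downstream != RANK_NONE && sender == downstream) {
        return FROM_DOWNSTREAM;
    }
    if (upstream != RANK_NONE && sender == upstream) {
        return FROM_UPSTREAM;
    }
    /* The tree changed since this update was sent: ignore it. */
    return FROM_STALE;
}
```

Because the neighbors are looked up at the time the update is processed, an update from a former parent or child falls through to FROM_STALE without any extra bookkeeping.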
7.5.2.6. Caching and replay
An intermediate daemon holds a message’s data after forwarding it (state
SENT) so it can replay the data if a downstream daemon lost it during a
fault. Cached data is bounded two ways, both configurable:
- by time: relm_base_cache_ms (default 500 ms), after which an eviction timer fires and drops the data; and
- by count: relm_base_cache_max_count (default 30), beyond which the oldest cached message is evicted.
When data is needed but no longer cached anywhere reachable, a daemon issues a
REQUESTED update upstream; the first daemon that still holds the data (or the
origin, which always can) replays it as a fresh SENDING.
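The two cache bounds can be sketched with a small ring buffer. This is an assumption-laden illustration, not the base module's cache: the constants simply mirror the documented defaults, entries are assumed to be inserted in non-decreasing time order, and expiry is checked by polling with a caller-supplied clock instead of a per-entry eviction timer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_MS        500u  /* mirrors the relm_base_cache_ms default */
#define CACHE_MAX_COUNT 30u   /* mirrors the relm_base_cache_max_count default */

typedef struct {
    uint64_t guid;
    uint64_t cached_at_ms;
} cache_entry_t;

/* Ring buffer ordered by insertion time; the oldest entry is at head. */
typedef struct {
    cache_entry_t slot[CACHE_MAX_COUNT];
    size_t head;
    size_t count;
} msg_cache_t;

/* Cache a message. If the cache is full, the oldest entry is evicted
 * first; its GUID is written to *evicted and true is returned. */
bool cache_insert(msg_cache_t *c, uint64_t guid, uint64_t now_ms,
                  uint64_t *evicted)
{
    bool did_evict = false;

    if (c->count == CACHE_MAX_COUNT) {
        *evicted = c->slot[c->head].guid;
        c->head = (c->head + 1) % CACHE_MAX_COUNT;
        c->count--;
        did_evict = true;
    }
    size_t tail = (c->head + c->count) % CACHE_MAX_COUNT;
    c->slot[tail].guid = guid;
    c->slot[tail].cached_at_ms = now_ms;
    c->count++;
    return did_evict;
}

/* Drop every entry that has been cached for CACHE_MS or longer.
 * Returns how many entries were dropped. */
size_t cache_expire(msg_cache_t *c, uint64_t now_ms)
{
    size_t dropped = 0;

    while (c->count > 0 &&
           now_ms - c->slot[c->head].cached_at_ms >= CACHE_MS) {
        c->head = (c->head + 1) % CACHE_MAX_COUNT;
        c->count--;
        dropped++;
    }
    return dropped;
}

/* True if this daemon could still replay the message itself. */
bool cache_contains(const msg_cache_t *c, uint64_t guid)
{
    for (size_t i = 0; i < c->count; i++) {
        if (c->slot[(c->head + i) % CACHE_MAX_COUNT].guid == guid) {
            return true;
        }
    }
    return false;
}
```

In terms of this sketch, a daemon for which cache_contains returns false is the one that would have to pass a REQUESTED update further upstream.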
7.5.2.7. Surviving a fault: link updates
The steady-state protocol assumes the path is stable. When a daemon fails, the
routing layer repairs the tree and notifies the registered fault handlers (RML,
grpcomm, filem, then relm — see The Runtime Messaging Layer (RML)). RELM’s handler
(link_updates.c) is where in-flight messages are stitched back onto the new
tree.
The handler acts only on the local-scope recovery notice. (Faults are reported twice: once locally as they are discovered and once globally when the HNP confirms them. By the time the global notice arrives, RML recovery has already run and the tree shows no delta, so a component must save whatever it needs between the two calls.) On the local notice, the handler:
- Purges dead paths: messages to or from a failed rank, and messages whose direct path this daemon is no longer on after the repair.
- Exchanges link updates with each neighbor whose identity changed. A link update (tagged PRTE_RML_TAG_RELM_LINK) carries the full state of every in-flight message that routes through that link, so a new neighbor learns exactly what is outstanding and can resume it.
Two mechanisms keep this correct under cascading or concurrent promotions:
- Depth stamping. Each link update carries the sender’s tree depth. A receiver ignores an update whose depth does not match the expected parent/child depth, so lingering updates from before a promotion are discarded rather than misapplied.
- Update gating. The upstream_links_updated and downstream_links_updated bitmaps track which neighbors this daemon has exchanged state with since its last promotion. A daemon will not forward updates onward until it has gathered enough neighbor state to do so correctly; try_send_pending_link_updates releases pending updates as the missing pieces arrive.
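Both mechanisms reduce to small checks, sketched below under stated assumptions. The names are hypothetical; the depth check assumes a parent sits exactly one level above its child in the tree; and the gate models the neighbor bitmaps as a single 64-bit mask with one bit per neighbor slot, which limits it to 64 neighbors and is not how the real bitmaps are stored.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { LINK_FROM_PARENT, LINK_FROM_CHILD } link_dir_t;

/* Depth stamping: accept a link update only if the depth the sender
 * stamped on it is the depth a current parent or child would have. */
bool link_depth_ok(int my_depth, int sender_depth, link_dir_t dir)
{
    int expected = (dir == LINK_FROM_PARENT) ? my_depth - 1 : my_depth + 1;
    return sender_depth == expected;
}

/* Update gating: which neighbors must report before this daemon may
 * forward link updates, and which have reported so far. */
typedef struct {
    uint64_t required;
    uint64_t updated;
} link_gate_t;

/* Called after a promotion: forget everything heard so far. */
void gate_reset(link_gate_t *g, uint64_t required)
{
    g->required = required;
    g->updated = 0;
}

/* Record that the neighbor in `slot` has exchanged state with us.
 * Returns true once every required neighbor has done so, i.e. when
 * pending link updates could be released. */
bool gate_mark(link_gate_t *g, unsigned slot)
{
    if (slot < 64u) {
        g->updated |= (UINT64_C(1) << slot);
    }
    return (g->updated & g->required) == g->required;
}
```

A daemon at depth 2 that is promoted to depth 1 would, under this check, drop an update stamped by its old parent at depth 1, because its parent is now expected at depth 0.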
Because a promoted daemon may have inherited children that briefly believed they
had a different parent, the recovery code treats all of a promoted daemon’s
children as new — see the commentary on prte_rml_recovery_status_t in
rml_types.h.
7.5.2.8. Where to look
- Entry point and message identity: state_machine.c (start_msg, find / get, GUID math in types.h).
- Steady-state transitions: base/state_updates.c.
- Fault recovery and link updates: base/link_updates.c.
- Pack/unpack and helper macros: util.c / util.h.
- Module wiring and MCA parameters: base/base.c.