7.5.2. Reliable Messaging (RELM)

A normal RML send is fire-and-forget. It rides the radix routing tree (The Runtime Messaging Layer (RML)), and if a daemon on the path dies while a message is in flight, that message is simply lost — the tree repairs itself, but the bytes are gone. For most runtime traffic that is acceptable, because the higher-level protocol will retry or the information is reconstructible. Some traffic is not so forgiving. RELM (“reliable messaging”) is the layer that guarantees a message is delivered exactly once to its destination even if daemons on the path fail and the tree is rebuilt underneath it.

RELM lives in src/rml/relm/; its editing map is src/rml/relm/AGENTS.md and the concrete implementation’s map is src/rml/relm/base/AGENTS.md.

7.5.2.1. How RELM is invoked

Application code sends reliably with prte_rml_send_buffer_reliable_nb (wrapped by the PRTE_RML_RELIABLE_SEND macro). That call routes through the prte_relm module to prte_relm_start_msg. Everything else in the subsystem runs in reaction to two things: RELM control messages received from neighbors, and fault notices from the routing layer.

RELM is shaped like a small framework — the prte_relm_module_t interface and a state machine (prte_relm_state_machine_t) full of function pointers — so a different reliability strategy could be substituted. Today there is exactly one implementation, the base module, which relm.c installs at open time by copying prte_relm_base_module into the global prte_relm and wiring every state-machine callback to a prte_relm_base_* function.

7.5.2.2. Identifying a message

Every reliable message has a stable, globally-unique identity that all daemons on its path agree on:

  • the origin src (the rank that called reliable_send) and a per-source UID together form a globally-unique id;

  • adding the destination dst yields the message signature (prte_relm_signature_t);

  • the pair <src, uid> is hashed into a 64-bit GUID (prte_relm_guid_t) used as the hash-table key.

The UID (prte_relm_uid_t) is a 32-bit counter that is allowed to wrap; the design assumes a message is globally complete and unreferenced long before its UID could be reused. The top of the range is reserved for sentinels (UNKNOWN / NONE / INVALID), so valid UIDs run up to PRTE_RELM_UID_MAX.

State is kept per destination. prte_relm_state_machine_t holds a hash table of prte_relm_rank_t objects keyed by destination rank; each of those holds a hash table of prte_relm_msg_t objects keyed by GUID. A prte_relm_msg_t records the message’s src / uid / dst, its current state, the message data (while it may still be needed), and prev_uid / next_uid links that chain together the messages this origin has sent to this destination.

7.5.2.3. Ordering

Messages from one origin to one destination are delivered in order. The prev_uid / next_uid chain enforces this: a message may only be posted to the application once its predecessor has been posted, and an acknowledgement of a message implicitly acknowledges everything before it in the chain. A message whose predecessor has not yet arrived is parked in the PENDING state and released to SENDING when the predecessor is posted.

7.5.2.4. The message lifecycle

Each prte_relm_msg_t moves through a small set of lasting states (types.h). Because the same message object exists on every daemon along the path, “the state” means this daemon’s view of the message:

SENDING

The message (with its data) is queued to go one hop further downstream, toward dst.

SENT

This daemon has handed the message to the next hop. An intermediate daemon caches the data at this point in case a replay is needed.

REQUESTED

A replay has been requested — the data was lost from somewhere downstream and must be re-sent.

PENDING

The message is held for ordering, waiting on its predecessor.

ACKED

The destination has posted the message; an acknowledgement is travelling back upstream toward src.

ACKACKED

The origin has seen the ACK and is releasing state; the ACK-of-ACK travels back downstream so every daemon can drop the message.

A handful of ephemeral states (NEW, ACKACKED when used to trigger, CACHED, EVICTED) drive transitions but are never stored as a message’s state; the engine enforces this.

Steady-state flow. When reliable_send starts a message, the origin constructs it in NEW, records the data, links it after the previous message to that destination, and moves it to SENDING. The data is forwarded one hop at a time; each hop transitions SENDINGSENT and (if it is neither source nor destination) caches the data. When the data reaches dst, the destination posts it to the application — respecting the prev_uid ordering, holding it PENDING if its predecessor has not yet posted — and moves it to ACKED. The ACK walks back upstream hop by hop; at the origin it becomes an ACK-of-ACK, which walks back downstream so every daemon on the path releases its copy.

7.5.2.5. State updates on the wire

State changes propagate as small RELM control messages carrying the message’s signature, its prev_uid, its state, and (for SENDING) the data. They are sent with the ordinary RML over a single prte_rml_get_route hop, tagged PRTE_RML_TAG_RELM_STATE. state_machine.c provides the two emitters — prte_relm_send_state_downstream (toward dst) and prte_relm_send_state_upstream (toward src) — and a receiver, prte_relm_message_handler, that unpacks an update, finds or creates the local message object, and feeds it to the state machine.

The interesting logic is prte_relm_base_update_state (state_updates.c), which routes an incoming update by who sent it:

  • local_update — a change requested by this daemon itself;

  • downstream_update — a change reported by the neighbor toward dst (typically an ACK coming back);

  • upstream_update — a change reported by the neighbor toward src (typically data coming forward, or a replay request).

Each of those is a switch over the incoming state that decides what to do given the message’s current state: forward the data, post to the application, cache the data, request or replay a lost message, emit an ACK, or tear the message down. Updates that arrive from a rank that is no longer this message’s upstream or downstream neighbor are simply ignored — a natural consequence of the tree having changed.

7.5.2.6. Caching and replay

An intermediate daemon holds a message’s data after forwarding it (state SENT) so it can replay the data if a downstream daemon lost it during a fault. Cached data is bounded two ways, both configurable:

  • by timerelm_base_cache_ms (default 500 ms), after which an eviction timer fires and drops the data; and

  • by countrelm_base_cache_max_count (default 30), beyond which the oldest cached message is evicted.

When data is needed but no longer cached anywhere reachable, a daemon issues a REQUESTED update upstream; the first daemon that still holds the data (or the origin, which always can) replays it as a fresh SENDING.

7.5.2.8. Where to look

  • Entry point and message identity: state_machine.c (start_msg, find/get, GUID math in types.h).

  • Steady-state transitions: base/state_updates.c.

  • Fault recovery and link updates: base/link_updates.c.

  • Pack/unpack and helper macros: util.c / util.h.

  • Module wiring and MCA parameters: base/base.c.