IC message routing layer
Overview
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in independent data centers around the world. For scalability, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P & Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing & Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block, and each block has an associated height in the chain. One can view the execution by the upper layers of the batch of messages corresponding to the agreed-upon block at height [math]\displaystyle{ x }[/math] as taking the replicated state of version [math]\displaystyle{ x-1 }[/math] and "applying" the batch to it to obtain the replicated state of version [math]\displaystyle{ x }[/math].
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:
- Coordinating the deterministic processing of batches: Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.
- Deterministic processing of batches: Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.
- Transferring message streams from one subnet to another: Making outgoing streams available to other subnets and accepting the stream slices received from them.
Remarks and Required Prior Knowledge
- The goal of this document is to provide the next level of detail compared to the material in the "How it works" section of internetcomputer.org. So it is recommended to study the material available there first.
- This page builds upon definitions made in the page describing the state manager. Please refer to this page for missing definitions related to the replicated state etc.
- Also see this and this blog post for some relevant and easier to digest background information.
- The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.
- The notation used in this page is described here.
Replicated vs. Canonical State
While the external API functions defined in this document will always take state in its implementation-specific representation, i.e., as ReplicatedState, we describe the operations the message routing component performs on the state based on its canonical representation, i.e., the CanonicalState. Given the relations between ReplicatedState and CanonicalState as defined in the specification of the state manager, this implicitly defines how an implementation needs to act on the respective parts of the ReplicatedState. We assume an implicit conversion from ReplicatedState to CanonicalState whenever we access some state passed to this component via an API function.
Guarantees Provided by Message Routing
Intuitively, the goal of the message routing layer is to enable transparent communication between canisters across subnets. This means that this layer does not formally add any guarantees to those the system provides, but simply needs to make sure that system invariants are preserved. Those system invariants include
- guaranteed replies (each canister-to-canister request will eventually receive a reply),
- canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and
- authenticity (only messages that come from canisters on the IC are processed).
To ensure that the system invariants hold, message routing needs to provide the following guarantees:
- Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.
- If a message cannot be delivered, a synthetic reject response must be produced.
- If a canister [math]\displaystyle{ A }[/math] sends two messages [math]\displaystyle{ m_1 }[/math] and [math]\displaystyle{ m_2 }[/math] to a canister [math]\displaystyle{ B }[/math], then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister [math]\displaystyle{ B }[/math]'s input queue from [math]\displaystyle{ A }[/math] in that order.
Preliminaries
Description of the Relevant Parts of the Registry
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the state manager wiki page. Throughout this document we assume that the registry contents we work with are authentic.
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping subnet_assignment. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.
Registry { subnets : Set<SubnetId>, subnet_assignment: CanisterId ↦ SubnetId ... }
Description of the Relevant Canonical State
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the CanonicalState is defined as a nested partial map. For easier readability we bundle the entries of the outermost map together in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map; e.g., for some s : CanonicalState we can use s.ingress_queues to access s[ingress_queues].
We start by defining the individual fields of the type CanonicalState which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.
Parts visible to message routing and execution
CanonicalState { ... ingress_queues : IngressQueues, input_queues : InputQueues, output_queues : OutputQueues, ... }
Parts visible to Message Routing only
CanonicalState { ... streams : Streams, expected_xnet_indices : Set<(SubnetId × StreamIndex)> ... }
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.
Abstract Queues
We define a generic queue type Queue&lt;T&gt; which has the following fields:
Queue<T> { next_index : ℕ, // Rolling index; the index of the next message to be inserted elements : ℕ ↦ T // The elements currently in the queue }
We define a new queue as new_queue : Queue&lt;T&gt; with new_queue.elements = ∅ and new_queue.next_index = 1. Furthermore, it has the following associated functions:
push takes a queue and a partial map from naturals to T, and returns a new queue consisting of the old queue with the given values appended. It also updates the next_index field so that it points to the index after the last inserted message.
push : Self × (ℕ ↦ T) → Self push(self, values) := self with ├─ next_index := self.next_index + |values| └─ elements := self.elements ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧ (j ↦ t) ∈ values ∧ k = rank(j, dom(values)) }
delete removes the given elements from the queue while keeping the next_index:
% REQUIRE: values ⊆ self.elements delete : Self × (ℕ ↦ T) → Self delete(self, values) := self with ├─ next_index := self.next_index └─ elements := self.elements \ values
clear removes all elements from the queue while keeping the next_index:
clear : Self → Self clear(self) := self with ├─ next_index := self.next_index └─ elements := ∅
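The abstract queue operations above can be sketched in Python. This is an illustrative model only (function and field names are chosen for this sketch; the actual implementation differs):

```python
# Illustrative model of Queue<T>: a dict with a rolling next_index and a
# partial map of elements. All names here are hypothetical.

def new_queue():
    """A fresh queue: no elements, next_index starts at 1."""
    return {"next_index": 1, "elements": {}}

def push(q, values):
    """Append `values` (a partial map ℕ ↦ T) in key order; each appended
    value receives the next rolling index."""
    elements = dict(q["elements"])
    i = q["next_index"]
    # enumerate(sorted(...), 1) plays the role of rank(j, dom(values))
    for k, j in enumerate(sorted(values), start=1):
        elements[i - 1 + k] = values[j]
    return {"next_index": i + len(values), "elements": elements}

def delete(q, values):
    """Remove the given index ↦ element pairs; next_index is kept."""
    assert all(q["elements"].get(idx) == v for idx, v in values.items())
    return {"next_index": q["next_index"],
            "elements": {idx: v for idx, v in q["elements"].items()
                         if idx not in values}}

def clear(q):
    """Drop all elements but keep next_index, preserving the rolling index."""
    return {"next_index": q["next_index"], "elements": {}}
```

Note how delete and clear never rewind next_index: the rolling index keeps growing, which is what makes indices usable for exactly-once delivery bookkeeping.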
We often work with partial maps of type SomeIdentifier ↦ Queue&lt;T&gt;, in which case we use the following shorthand notation. With q being a partial map of the aforementioned type, and v being a partial map of type (SomeIdentifier × ℕ) ↦ T, we define the following semantics for the functions f ∈ { push, delete } associated to Queue&lt;T&gt;:
f_map : (SomeIdentifier ↦ Queue<T>) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue<T>) f_map(q, v) = { (id ↦ queue') | (id ↦ queue) ∈ q ∧ (id ↦ values) ∈ v ∧ queue' = f(queue, values) } ∪ { (id ↦ queue') | (id ↦ values) ∈ v ∧ ∄ (id ↦ ·) ∈ q ∧ queue' = f(Queue<T>::new_queue, values) } ∪ { (id ↦ queue) | (id ↦ queue) ∈ q ∧ ∄ (id ↦ ·) ∈ v }
For the functions f ∈ { clear } we use
f_map : (SomeIdentifier ↦ Queue<T>) → (SomeIdentifier ↦ Queue<T>) f_map(q) = { (id ↦ queue') | (id ↦ queue) ∈ q ∧ queue' = f(queue) }
We will henceforth omit the map postfix in f_map and simply use f if it is clear from the input type that the map variant of f should be used.
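The three cases of the f_map lifting (existing queue receives values, fresh queue is created for a new identifier, untouched queue is kept) can be illustrated with a toy Python sketch. For simplicity a "queue" here is just a list and f appends values in key order; all names are hypothetical:

```python
# Toy illustration of lifting a per-queue function f to maps of queues.

def f_map(f, empty_queue, qmap, vmap):
    """qmap : id ↦ queue, vmap : id ↦ (partial map of values).
    Mirrors the three-way union in the definition above."""
    result = {}
    for qid, queue in qmap.items():
        if qid in vmap:
            result[qid] = f(queue, vmap[qid])        # existing id, new values
        else:
            result[qid] = queue                      # untouched queue
    for qid, values in vmap.items():
        if qid not in qmap:
            result[qid] = f(empty_queue(), values)   # id not yet present
    return result
```

For example, with f being a list-append version of push, applying values for identifiers "a" (existing) and "b" (new) leaves an unrelated queue "c" unchanged while creating "b" from an empty queue.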
Indices
We define an Index to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.
Index : X × ... × Y × ℕ
In addition we define the following semantics:
- We define the prefix of an Index i := (x, …, y, seq_nr) as prefix(i) := i[1…|i| - 1] = (x, …, y), i.e., it contains all elements of i except the last one.
- We define the postfix of an Index i := (x, …, y, seq_nr) as postfix(i) := i[|i|] = seq_nr, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.
- For an Index i, the operation [math]\displaystyle{ i + 1 }[/math] is defined as concatenate(prefix(i), postfix(i) + 1).
- Two indices, Index i and Index j, are incomparable if prefix(i) ≠ prefix(j).
- For two indices, Index i and Index j, we have that [math]\displaystyle{ i \leq j }[/math] if prefix(i) = prefix(j) and postfix(i) ≤ postfix(j).
Queues
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of Index called IngressIndex, which is a tuple consisting of the destination canister ID and a natural number, i.e.,
IngressIndex : CanisterId × ℕ
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input- and output queues are indexed by a concrete instance of Index called QueueIndex, which is defined as follows:
QueueIndex : CanisterId × CanisterId × ℕ
The type representing all of the ingress queues is defined as follows:
IngressQueues : CanisterId ↦ Queue<Message>,
which means that IngressQueues.elements : IngressIndex ↦ Message.
The type representing all of the input queues is defined as follows:
InputQueues : (CanisterId × CanisterId) ↦ Queue<Message>,
which means that InputQueues.elements : QueueIndex ↦ Message.
The type representing all of the output queues is defined as follows:
OutputQueues : (CanisterId × CanisterId) ↦ Queue<Message>,
which means that OutputQueues.elements : QueueIndex ↦ Message.
Streams
Each individual Stream is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized into multiple substreams identified by a SubstreamId. The concrete definition of SubstreamId is up to the implementation; in the current implementation, SubstreamId is defined to be the unit type (), i.e., we have flat streams. Messages in streams are indexed by a concrete instance of Index called StreamIndex, which is defined as follows:
StreamIndex : SubstreamId × ℕ
A Stream comprises a sequence of Signal messages, signals, and a sequence of canister-to-canister messages, msgs.
Stream { signals : StreamIndex ↦ {ACCEPT, REJECT}, msgs : SubstreamId ↦ Queue<Message> }
which means that Stream.msgs.elements : StreamIndex ↦ Message.
While the subnet a stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure Streams holding all streams, indexed by destination subnet:
Streams : SubnetId ↦ Stream
We may sometimes abuse notation and directly access the fields defined for an individual Stream on the Streams type, in which case we obtain maps of the following type:
Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT}) Streams.msgs : SubnetId ↦ (SubstreamId ↦ Queue<Message>)
(Certified) Stream Slices
StreamSlices and CertifiedStreamSlices, respectively, are used to transport streams from one subnet to another within XNetPayloads that are part of consensus blocks. Essentially, a StreamSlice is a slice of a stream which retains the begin and the end of the original stream. A StreamSlice is wrapped in a CertifiedStreamSlice for transport so that authenticity can be guaranteed. Neither CertifiedStreamSlices nor StreamSlices are ever explicitly created within message routing; instead, one relies on the encoding and decoding routines provided by the state manager: a CertifiedStreamSlice is created by calling the respective encoding routine of the state manager, and can then be decoded into a StreamSlice using the corresponding decoding routine provided by the state manager.
StreamSlice { stream : Stream, begin : Set<StreamIndex>, end : Set<StreamIndex> }
CertifiedStreamSlice { payload : PartialCanonicalState witness : Witness signature : Certification }
For the precise relation of StreamSlice and CertifiedStreamSlice, refer to the specification of the state manager.
Batch
A batch consists of multiple elements, including an ingress_payload constituting a sequence of ingress messages, and an xnet_payload.
Batch { batch_number : Height registry_version : RegistryVersion ingress_payload : ℕ ↦ Message xnet_payload : SubnetId ↦ CertifiedStreamSlice requires_full_state_hash : { TRUE, FALSE } }
Decoded Batch
A decoded batch represents a batch in which all transport-specific encodings have been decoded into a format suitable for processing, and parts that are not required inside the deterministic state machine have been stripped off.
DecodedBatch { ingress_payload : ℕ ↦ Message xnet_payload : SubnetId ↦ StreamSlice }
Currently this only means decoding the CertifiedStreamSlices into StreamSlices, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the own subnet ID and the given batch, decodes the batch into a decoded batch:
decode : SubnetId × Batch → DecodedBatch decode(own_subnet, b) := DecodedBatch { with ├─ ingress_payload := b.ingress_payload └─ xnet_payload := { (src_subnet ↦ slice) | (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧ slice = StateManager.decode_valid_certified_stream(own_subnet, cert_slice ) } }
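A hedged Python sketch of this decode step, with the state manager's decoding routine passed in as a stand-in parameter (its real signature may differ):

```python
# Sketch of Batch → DecodedBatch. `decode_valid_certified_stream` is a
# caller-supplied stand-in for the state manager routine of the same name.

def decode(own_subnet, batch, decode_valid_certified_stream):
    """The ingress payload passes through unchanged; every
    CertifiedStreamSlice in the xnet payload is decoded into a StreamSlice."""
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src_subnet: decode_valid_certified_stream(own_subnet, cert_slice)
            for src_subnet, cert_slice in batch["xnet_payload"].items()
        },
    }
```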
Message Routing
Message routing is triggered by incoming batches from consensus. For each Batch b, message routing will perform the following steps:
- Obtain the ReplicatedState s of the right version w.r.t. Batch b.
- Submit s and decode(own_subnet, b) for processing by the deterministic state machine comprised of the message routing and execution layers. This includes
  - an induction phase (cf. pre_process), where the valid messages in decode(own_subnet, b) are inducted. Among others, a message m in a StreamSlice from subnet X is considered valid if registry.get_registry_at(b.registry_version).subnet_assignment maps m.src to X;
  - an execution phase (cf. execute), which executes messages available in the induction pool;
  - an XNet message routing phase (cf. post_process), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.
- Commit the replicated state, incrementally updated by the previous steps, to the state manager via commit_and_certify.
Deterministic State Machine
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type CanonicalState to describe the message-routing-related operations of this component.
The flow diagram below details the operation of the component. Its operation is logically split into three phases.
- The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or discarding them.
- The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.
- The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.
API
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.
pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState: Triggers the induction phase.
execute(s : CanonicalState) → CanonicalState: Triggers the execution phase.
post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState: Triggers the XNet message routing phase.
Abstractions of Other Parts of the System
Valid Set Rule (VSR)
The VSR is a component that decides whether to ACCEPT or REJECT a message. For message routing, ACCEPT has the semantics that the execution layer takes responsibility for the message, whereas REJECT has the semantics that the message is dropped and may require action from the message routing layer.
The operation of the VSR on ingress messages is defined as follows, where vsr_check_ingress : CanonicalState × Batch → Set&lt;ℕ&gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the ingress payload that are accepted by the VSR; which indices are accepted is determined by the concrete implementation of the VSR. VSR(state, batch).ingress then consists of the index-message tuples corresponding to the accepted messages in the ingress_payload of the batch.
VSR(state, batch).ingress := { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload ∧ i ∈ vsr_check_ingress(state, batch) ∧ j = rank(i, vsr_check_ingress(state, batch)) }
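A small Python sketch of this selection (helper names are hypothetical): accepted messages are keyed by their destination and their rank among all accepted indices:

```python
# Sketch of the VSR ingress set comprehension above.

def rank(i, index_set):
    """1-based position of i within the sorted index set."""
    return sorted(index_set).index(i) + 1

def vsr_ingress(batch_ingress, accepted):
    """batch_ingress : ℕ ↦ message, accepted : set of accepted indices.
    Returns ((destination, rank) ↦ message) for each accepted message."""
    return {
        (m["dst"], rank(i, accepted)): m
        for i, m in batch_ingress.items()
        if i in accepted
    }
```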
Scheduler and Hypervisor. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model their functionality as a deterministic function schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message) which computes the change set introduced by the scheduler and the hypervisor: it takes messages from the input queues, executes them, and puts new messages into the output queues.
We will later use this function when we describe how the state transition function execute(CanonicalState) → CanonicalState transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the schedule_and_execute function.
- First, we have consumed_ingress_messages, a partial map IngressIndex ↦ Message containing all consumed ingress messages.
- Second, we have consumed_xnet_messages, a partial map QueueIndex ↦ Message containing all consumed cross-net messages.
- Third, we have produced_messages, a partial map QueueIndex ↦ Message containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.
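As an illustration of how such a change set could be applied to the state, here is a hedged Python sketch. It treats each queue group as a flat index ↦ message map (a simplification of the Queue&lt;T&gt; structure); the names are hypothetical and the real execute transition is specified elsewhere:

```python
# Sketch: consumed messages leave their queues; produced messages are
# appended to the output queues in queue-index order.

def apply_change_set(state, consumed_ingress, consumed_xnet, produced):
    new_state = dict(state)
    # Remove consumed ingress messages from the ingress queues.
    new_state["ingress_queues"] = {
        idx: m for idx, m in state["ingress_queues"].items()
        if idx not in consumed_ingress
    }
    # Remove consumed cross-net messages from the input queues.
    new_state["input_queues"] = {
        idx: m for idx, m in state["input_queues"].items()
        if idx not in consumed_xnet
    }
    # Append produced messages; sorting by queue index fixes insertion order.
    out = dict(state["output_queues"])
    for idx in sorted(produced):
        out[idx] = produced[idx]
    new_state["output_queues"] = out
    return new_state
```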