IC message routing layer

Overview

The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computing Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.

Roughly speaking, replication is achieved by having the two lower ICP layers (P2P & Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing & Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height [math]\displaystyle{ x }[/math] by the upper layers as taking the replicated state of version [math]\displaystyle{ x-1 }[/math], and "applying" the batch to it to obtain replicated state of version [math]\displaystyle{ x }[/math].

This document describes the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:

Coordinating the deterministic processing of batches: Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.

Deterministic processing of batches: Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.

Transferring message streams from one subnet to another: Moving streams from one subnet to another.

Remarks and Required Prior Knowledge

The goal of this document is to provide the next level of detail compared to the material in the "How it works" section of internetcomputer.org. So it is recommended to study the material available there first.
This page builds upon definitions made in the page describing the state manager. Please refer to this page for missing definitions related to the replicated state etc.
Also see this and this blog post for some relevant and easier to digest background information.
The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.
The notation used in this page is described here.

Replicated vs. Canonical State

While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as ReplicatedState, the operation the message routing component performs on the state based on its canonical representation, i.e., the CanonicalState is described. Given the relations between ReplicatedState and CanonicalState as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the ReplicatedState. An implicit conversion from ReplicatedState to CanonicalState is assumed whenever some state passed to this component via an API function is accessed.

Guarantees Provided by Message Routing

Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees the system provides, but simply needs to make sure that system invariants are preserved. Those system invariants include

guaranteed replies (each canister-to-canister request will eventually receive a reply),

canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and

authenticity (only messages that come from canisters on the IC are processed).

To ensure that the system invariants hold, message routing needs to provide the following guarantees:

Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.

If a message can not be delivered, a synthetic reject response must be produced.

If a canister [math]\displaystyle{ A }[/math] sends two messages [math]\displaystyle{ m_1 }[/math] and [math]\displaystyle{ m_2 }[/math] to a canister [math]\displaystyle{ B }[/math], then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister [math]\displaystyle{ B }[/math]'s input queue from [math]\displaystyle{ A }[/math] in that order.

Preliminaries

Description of the Relevant Parts of the Registry

The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the state manager wiki page. Throughout this document it is assume that the registry contents are authentic.

The registry entries required by this component are set of all existing subnet ids, as well as a canister-to-subnet mapping subnet_assignment. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.

Registry {
    subnets : Set<SubnetId>,
	subnet_assignment: CanisterId ↦ SubnetId
    ...
}

Description of the Relevant Canonical State

This section defined the parts of the canonical state which are relevant for the description of this component together with some constraints imposed on the replicated state. Abstractly the CanonicalState is defined as a nested partial map. For easier readability the entries of the outermost map are bundled together in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some s : CanonicalState one can use s.ingress_queue to access s[ingress_queues]

This section defines the individual fields of the type CanonicalState which are relevant in the context of this document. After that there are more details about the datatypes of the individual fields. There is a distinction made between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.

Parts visible to message routing and execution

CanonicalState {
    ...
    ingress_queues  : IngressQueues,
    input_queues    : InputQueues,
    output_queues   : OutputQueues,
    ...
}

Parts visible to Message Routing only

CanonicalState {
    ...
    streams               : Streams,
    expected_xnet_indices : Set<(SubnetId × StreamIndex)>
    ...
}

Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues, message routing will only ever push messages to them, whereas for output queues, message routing will only ever pull messages from them. The opposite holds for the execution environment.

Abstract Queues

Define a generic queue type Queue<T> which has the following fields:

Queue<T> {
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted
    elements   : ℕ ↦ T  // The elements currently in the queue
}

Define a new queue as new_queue : Queue<T> with new_queue.elements = ∅ and new_queue.next_index = 1. Furthermore, it has the following associated functions:

push takes a queue and a partial map of integers mapping to T, and returns a new queue consisting of the old queue with the given values appended. It also updates the next_index field so that it points to the index after the last inserted message.

push : Self × (ℕ ↦ T) → Self
push(self, values) :=
    self with
            ├─ next_index := self.next_index + |values|
            └─ elements := self.elements
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧
                                                 (j ↦ t) ∈ values ∧
                                                 k = rank(j, dom(values)) }

delete removes the given elements from the queues keeping the next_index

% REQUIRE: values ⊆ self.elements
delete : Self × (ℕ ↦ T) → Self
delete(self, values) :=
    self with
            ├─ next_index := self.next_index
            └─ elements := self.elements
                           \ values

clear removes all elements from the queues keeping the next_index

clear : Self → Self
clear(self) :=
    self with
            ├─ next_index := self.next_index
            └─ elements := ∅

Partial maps of type SomeIdentifier ↦ Queue<T> are often used, in which case the following shorthand notation is used. With q being a queue of the aforementioned type, and v being a partial map of type (SomeIdentifier × ℕ) ↦ T, define the following semantic for the functions f ∈ { push, delete } associated to Queue<T>:

f_map : (SomeIdentifier ↦ Queue<T>) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue<T>)
f_map(q, v) = { (id ↦ queue') | (id ↦ queue) ∈ q ∧
                                (id ↦ values) ∈ v ∧
                                queue' = f(queue, values)
              } ∪
              { (id ↦ queue') | (id ↦ values) ∈ v ∧
                                ∄ (id ↦ ·) ∈ q ∧
                                queue' = f(Queue<T>::new_queue, values)
              } ∪
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧
                               ∄ (id ↦ ·) ∈ v
              }

For the functions f ∈ { clear } use

f_map : (SomeIdentifier ↦ Queue<T>) → (SomeIdentifier ↦ Queue<T>)
f_map(q) = { (id ↦ queue') | (id ↦ queue) ∈ q ∧
                             queue' = f(queue)
           }

Henceforth the map postfix will be omitted in f_map and simply use f if it is clear from the input type that the map variant of f should be used.

Indices

Define an Index to be an arbitrary length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.

Index : X × ... × Y × ℕ

In addition the following semantic is defined:

Define the prefix of an index Index i := (x, …, y, seq_nr) as prefix(i) := i[1…|i| - 1] = (x, …, y), i.e., it contains all elements of i except the last one.

Define the postfix of an Index i := (x, …, y, seq_nr) as postfix(i) := i[|i|] = seq_nr, i.e., the last element of the index sequence. As already mentioned, the postfix of an index is required to be a natural number.

For an Index i, the operation [math]\displaystyle{ i + 1 }[/math] is defined as concatenate(prefix(i), postfix(i) + 1).

Two indices, Index i and Index j, are incomparable if prefix(i) ≠ prefix(j).

For two indices, Index i and Index j, it is the case that [math]\displaystyle{ i \leq j }[/math] if prefix(i) = prefix(j) and postfix(i) ≤ postfix(j).

Queues

Three different types of queues exist in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.

Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called IngressIndex, which is a tuple consisting of the destination canister ID and a natural number, i.e.,

IngressIndex : CanisterId × ℕ

Input queues and output queues are organized on a per-source-and-destination basis. Messages in input- and output queues are indexed by a concrete instance of Index called QueueIndex, which is defined as follows:

QueueIndex : CanisterId × CanisterId × ℕ

The type representing all of the ingress queues is defined as follows:

IngressQueues : CanisterId ↦ Queue<Message>,

which means that IngressQueues.elements : IngressIndex ↦ Message.

The type representing all of the input queues is defined as follows:

InputQueues : (CanisterId × CanisterId) ↦ Queue<Message>,

which means that InputQueues.elements : QueueIndex ↦ Message.

The type representing all of the output queues is defined as follows:

OutputQueues : (CanisterId × CanisterId) ↦ Queue<Message>,

which means that OutputQueues.elements : QueueIndex ↦ Message.

Streams

Each individual Stream is scoped to a pair of subnets—the subnet a stream originates from and subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a SubstreamId. The concrete definition of SubstreamId is up to the implementation. In the current implementation SubstreamId is defined to be the unit type (), i.e., flat streams. Messages in streams are indexed by a concrete instance of Index called StreamIndex which is defined as follows:

StreamIndex : SubstreamId × ℕ

A Stream is comprised of a sequence of Signal messages signals and a sequence of canister-to-canister messages msgs.

Stream {
    signals : StreamIndex ↦ {ACCEPT, REJECT},
    msgs    : SubstreamId ↦ Queue<Message>
}

which means that Stream.msgs.elements : StreamIndex ↦ Message.

While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, a data structure Streams is defined holding all streams indexed by destination subnetwork:

Streams : SubnetId ↦ Stream

Notation may be sometimes abused and directly access the fields defined for an individual Stream on the Streams type, in which case maps of the following type are obtained:

Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})

Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue<Message>)

(Certified) Stream Slices

StreamSlices and CertifiedStreamSlices, respectively, are used to transport streams from one to an other subnet within XNetPayloads that are part of consensus blocks. Essentially, a StreamSlice is a slice of a stream which retains the begin and the end of the original stream. A StreamSlice is wrapped in a CertifiedStreamSlice for transport so that authenticity can be guaranteed. Neither CertifiedStreamSlices nor StreamSlices are ever explicitly created within message routing, but instead one relies on the encoding and decoding routines provided by the state manager: A CertifiedStreamSlice is created by calling the respective encoding routine of the state manager. Such a CertifiedStreamSlice can then be decoded into a StreamSlice using the corresponding decoding routine provided by the state manager.

StreamSlice {
    stream    : Stream,
    begin     : Set<StreamIndex>,
    end       : Set<StreamIndex>
}

CertifiedStreamSlice {
    payload   : PartialCanonicalState
    witness   : Witness
    signature : Certification
}

For the precise relation of StreamSlice and CertifiedStreamSlice, refer to the specification of the state manager.

Batch

A batch consists of multiple elements including an ingress_payload constituting a sequence of ingress messages, and an xnet_payload.

Batch {
    batch_number             : Height
    registry_version         : RegistryVersion
    ingress_payload          : ℕ ↦ Message
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice
    requires_full_state_hash : { TRUE, FALSE }
}

Decoded Batch

A decoded batch represents a batch where all transport-specific things are decoded into the format suitable for processing and some things which are not required inside the deterministic state machine are stripped off.

DecodedBatch {
    ingress_payload : ℕ ↦ Message
    xnet_payload : SubnetId ↦ StreamSlice
}

Currently this only means decoding the CertifiedStreamSlices into StreamSlices because it is assumed that the ingress payload is suitable to be processed right away. Formally there is a function, which, based on the own subnet id and the given batch decodes the batch into a decoded batch:

decode : SubnetId × Batch → DecodedBatch
decode(own_subnet, b) :=
    DecodedBatch {
        with
           ├─ ingress_payload := b.ingress_payload
           └─ xnet_payload :=
                  { (src_subnet ↦ slice) |
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧
                      slice = StateManager.decode_valid_certified_stream(own_subnet,
                                                                         cert_slice
                                                                        )
                  }
    }

Message Routing

Message routing is triggered by incoming batches from consensus. For each Batch b, message routing will perform the following steps:

Components interacting with message routing during a deterministic processing round

Interactions of message routing with other components during a deterministic processing round

Obtain the ReplicatedState s of the right version w.r.t. Batch b.

Submit s, decode(own_subnet, b) for processing by the deterministic state machine comprised of the message routing and execution layer. This includes

- An induction phase (cf. pre_process), where the valid messages in decode(own_subnet, b) are inducted. Among others, a message m in a StreamSlice from subnet X is considered valid if registry.get_registry_at(b.registry_version).subnet_assignment maps m.src to X.

- An execution phase (cf. execute), which executes messages available in the induction pool.

- An XNet message routing phase (cf. post_process), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.

Commit the replicated state, incrementally updated by the previous steps, to the state manager via commit_and_certify.

Deterministic State Machine

As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, state of type CanonicalState is used to generally describe the operations of the message-routing-related operations of this component.

Data flow during batch processing

The flow diagram below details the operation of the component. Its operation is logically split into three phases.

The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of VSR, added to the induction pool or not.

The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.

The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.

All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.

API

The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.

pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState: Triggers the induction phase.

execute(s : CanonicalState) → CanonicalState: Triggers the execution phase.

post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState: Triggers the XNet message routing phase.

Abstractions of Other Parts of the System

Valid Set Rule (VSR) The VSR is a component that makes the decision of whether to ACCEPT a message or to REJECT a message. For message routing, ACCEPT has the semantic that the execution layer takes responsibility for the message, whereas REJECT has the semantic that the message is dropped and may require action from the message routing layer.

The operation of the VSR on ingress messages is defined as follows, where vsr_check_ingress : CanonicalState × Batch → Set<ℕ> is a deterministic function returning the indices of the messages in the ingress payload accepted by the VSR, which returns a possibly empty set of index-message tuples corresponding to the accepted messages in the ingress_payload of the batch. The set is determined by the concrete implementation of the VSR.

VSR(state, batch).ingress :=
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload
                           ∧ i ∈ vsr_check_ingress(state, batch)
                           ∧ j = Rank(i, vsr_check_ingress(state, batch))
  }

The VSR for cross-net messages is defined as follows, where vsr_check_xnet : CanonicalState × Batch → Set<(SubnetId × StreamIndex)> is a deterministic function that determines the indices of the messages in the individual substreams contained in xnet_payload to be inducted.

It is required that the implementation of the VSR (or the layer above) makes sure that all reply messages are accepted by the VSR. Formally this means that for any valid State-Batch combination (s, b) it holds that for all (subnet, index) so that b.xnet_payload[subnet].msgs[index] is a reply message that (subnet, index) ∈ vsr_check_xnet(s, b).

Based on this rule one can straight-forwardly define the interface behavior of the VSR.

VSR(state, batch).xnet :=
  { (index ↦ msg) |
      (index ↦ msg) ∈ batch.xnet_payload.msgs ∧
      index ∈ vsr_check_xnet(state, batch)
  }

VSR(state, batch).signals :=
    { (concatenate(subnet, index) ↦ ACCEPT) |
      (subnet ↦ stream) ∈ batch.xnet_payload ∧
      (index ↦ msg) ∈ stream.msgs ∧
      (subnet, index) ∈ vsr_check_xnet(state, batch)
    }
  ∪ { (concatenate(subnet, index) ↦ REJECT) |
      (subnet ↦ stream) ∈ batch.xnet_payload ∧
      (index ↦ msg) ∈ stream.msgs ∧
      (subnet, index) ∉ vsr_check_xnet(state, batch)
    }

Scheduler and Hypervisor. From the point of view of message routing, one can look at the the scheduler and the hypervisor together as one component. The functionality of scheduler and hypervisor are modeled as a deterministic function schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message) which computes the change set introduced by the Scheduler and the Hypervisor. It takes messages from the input queues, executes them and puts new messages to the output queues.

This function will be used lated when it's described how the state transition function execute(CanonicalState) → CanonicalState transforms the state. For the sake of compact notation, the following fields are used to access the individual return values of the schedule_and_execute function.

consumed_ingress_messages, which contains a partial map IngressIndex ↦ Message containing all consumed ingress messages.

consumed_xnet_messages, which contains a partial map QueueIndex ↦ Message containing all consumed cross-net messages.

produced_messages, which contains a partial map QueueIndex ↦ Message containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.

Description of the State Transitions

Induction Phase. In the induction phase, one starts off with a CanonicalState S, some subnet_assignment and a DecodedBatch b and applies b to S relative to subnet_assignment to obtain S', i.e., one computes S' = pre_process(S, subnet_assignment, b).

This section describes things w.r.t. to a version of the VSR which will accept all messages, while in reality the VSR may reject some messages in case canisters migrate across subnets or subnets are split. So while the possibility that messages can be REJECTed by the VSR would require specific action of the message routing layer those actions are omitted here for simplicity as they are not crucial to understand the basic functionality of message routing.

Before defining the actual state transition, a couple of helper functions are defined. First, it is needed to define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.

% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :
%           └─ m1 = m2 ∧ s1 ≠ s2
%
% ENSURES: ∀ S satisfying the precondition above,
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :
%             ├─ ∃ s1, s2 :
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==> q1 ≤ q2
%
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message))
queue_index(S) := {
  % There is no concrete implementation of this function provided as there are
  % multiple possible implementations and the choice for one also depends on
  % how priorities/fairness etc. are handled.
  %
  % A trivial implementation is to iterate over the given stream slices S per
  % subnet and for each individual slice iterate over all the messages in the
  % order they appear in the slice and push each message m on the right queue,
  % i.e., the one belonging to the destination canister. This is also the way
  % things are currently implemented.
}

Based on this a function can now be defined that maps over the indexes of the valid XNet messages.

map_valid_xnet_messages : (SubnetId ↦ Slice) ×
                          (CanisterId ↦ SubnetId) →
                          ((CanisterId × ℕ) ↦ Message)
map_valid_xnet_messages(slices, subnet_assignment) :=
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧
                                          (index ↦ m) ∈ slice.msgs ∧
                                          subnet_assignment[m.src] = subnet ∧

               })

Finally, the state S' resulting from computing pre_process(S, b) can be defined:

S with
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)

  % Append the canister to canister messages accepted by the VSR to the appropriate
  % input queue.
  input_queues              := push(S.input_queues,
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)
                                   )

  % Garbage collect the messages which have accepted by the target subnet.
  % (As soon as the VSR does no longer ACCEPT all messages, one would have
  %  to make sure that rejected messages are appropriately re-enqueued in
  %  the streams)
  streams.msgs              := delete(S.streams.msgs,
                                 { (concatenate(subnet, index) ↦ msg) |
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧
                                       (i ↦ ·) ∈ slice.signals ∧
                                       index = concatenate(subnet, i)
                                 }
                               )

  % Add the signals reflecting the decisions made by the VSR in the current round and
  % garbage collect the signals which have already been processed on the other subnet
  % (one knows that a signal has been processed when the message is no longer included
  % in a given slice).
  streams.signals           := S.streams.signals
                               ∪ VSR(S, b).signals
                               \ { (index ↦ signal) |
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧
                                       index = concatenate(subnet, i) ∧
                                       j ∈ slice.begin ∧
                                       i < j
                                 }

  % Update the expected XNet indexes so that the block maker can compute which messages
  % to include in a block referencing this state.
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :
                                             └─ prefix(index) = prefix(i)
                               } ∪
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }

Execution Phase. In the execution phase, one starts off with a CanonicalState S, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes S' = execute(S) where S is the state after the induction phase. From the perspective of message routing, the state S' resulting from computing execute(S) looks as follows:

S with
  % Delete the consumed ingress messages from the respective ingress queues
  ingress_queues    := delete(S.ingress_queue, schedule_and_execute(S).consumed_ingress_messages)

  % Delete the consumed canister to canister messages from the respective input queues
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)

  % Append the produced messages to the respective output queues
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)

  % Execution specific state is transformed by the execution environment; the precise transition
  % function is out of scope here.

XNet Message Routing Phase. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the subnet_assignment, puts them into a subnet-to-subnet stream, i.e., it computes S' = post_process(S, registry), where S is the state after the execution phase and registry represents a view of the registry.

Before defining the state transition, a helper function is defined to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.

% Remove all messages from output queues targeted at non-existent canisters according
% to the subnet assignment.
filter : ((CanisterId × CanisterId) ↦ Queue<Message>) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue<Message>)
filter(queues, subnet_assignment) :=
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧
                                       q_index = (·, dst, ·) ∧
                                       dst ∉ dom(subnet_assignment)
                   }
          )

Produce NON_EXISTENT_CANISTER replies telling the sending canister that the destination canister does not exist.

% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues 
% of the senders of messages where the destination does not exist
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue<Message>) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)
non_existent_canister_replies(queues, subnet_assignment) :=
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧
                                              q_index = (src, dst, i) ∧
                                              dst ∉ dom(subnet_assignment)
  })

Non flat streams. As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams—this can be decided by the implementation. To enable this, a StreamIndex is defined to be a tuple of SubstreamId and a natural number. If there is a flat stream, StreamIndex is defined to be the unit type () which effectively means that the implementation can use natural numbers as stream index as one does not need to make the SubstreamId explicit in this case. In contrast, if there is a per-destination (or per-source) substreams, StreamIndex is defined to be a CanisterId.

Formally, this means that the implementation must fix a mapping function that—based on a given prefix of a QueueIndex, i.e., a src-dst tuple—decides on the prefix of the StreamIndex, i.e., the SubstreamId.

substream_id: (CanisterId × CanisterId) → SubstreamId

% Definition of substream_id for flat streams
substream_id((src, dst)) := ()

% Definition of substream_id for per-destination canister substreams
substream_id((src, dst)) := dst

Description of the actual state transition. The state S' resulting from computing post_process(S, subnet_assignment) is defined as follows:

S with
  % Clear the output queues
  output_queues := clear(S.output_queues)

  % Route the messages produced in the previous execution phase to the appropriate streams
  % taking into account ordering and capacity management constraints enforced by stream_index.
  streams.msgs  := {
    let msgs = S.streams.msgs

    % Iterate over filtered messages preserving order of messages in queues.
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })

    return msgs
  }


  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters
  input_queues := push(S.input_queues,
                       non_existent_canister_replies(S.output_queues, subnet_assignment))

Ordering of Messages in the Stream & Fairness. As long as the invariant that the canister-to-canister ordering of messages is preserved when iterating over the filtered messages in the state transition described above, the implementation can take the freedom to apply alternative orderings.

Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.

XNet Transfer

After calling commit_and_certify at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The XNetTransfer subcomponent is responsible to enable this transfer. It consists of

XNet transfer component diagram

An XNetEndpoint which is responsible for serving certified stream slices and making them available to XNetPayloadBuilders on other subnetworks.

An XNetPayloadBuilder, which allows the block makers to obtain an XNetPayload containing the currently available certified streams originating from other subnetworks. The XNetPayloadBuilder obtains those streams by interacting with XNetEndpoints exposed by other subnets. The XNetPayloadBuilder also provides functionality for notaries to verify XNetPayloads contained in block proposals.

There are no specifications about the protocol run between the XNetEndpoint and the XNetPayloadBuilder to transfer the streams between two subnetworks. The only requirement is that certified streams made available by an XNetEndpoint of an honest replica on some source subnetwork, they can be obtained by an XNetPayloadBuilder of an honest replica on the destination subnetwork and that the information regarding which endpoints to contact is available in the Registry.

Properties and Functionality

Assume an XNet transfer component on a replica part of subnet own_subnet. The interface behavior of the XNet transfer component will guarantee that for any payload payload produced via

get_xnet_payload(registry_version, reference_height, past_payloads, size_limit)

For any (remote_subnet ↦ css) ∈ payload:

StateManager.decode_certified_stream(registry_version, own_subnet, remote_subnet, css) succeeds, i.e., returns a valid slice slice that is guaranteed to come from remote_subnet.

Furthermore, for each slice it will hold that a soon as the state corresponding to height h = reference_height + |past_payloads| is available that concatenate(remote_subnet, min(dom(slice.msgs.elements))) ∈ StateManager.get_state_at(h).expected_indexes. This means that the streams will start with the expected indexes stored in the previous state, i.e., they gap freely extend the previously seen streams.

Payloads verified using validate_xnet_payload are accepted if they adhere to those requirements, and are rejected otherwise.

XNet Endpoint

The XNetEndpoint serves the streams available on some subnet to other subnets. For an implementation this will typically mean that there is some client which will handle querying the API of the XNetEndpoint on the remote subnet in question. The following abstraction is used to avoid explicitly talking about this client: Assume that there is a function get : SubnetId → XNetEndpoint which will return an appropriate instance of XNetEndpoint which can directly be queried using the API described below.

XNet transfer sequence diagram

get_stream(subnet_id : SubnetId, begin : StreamIndex, msg_limit : ℕ, size_limit : ℕ) → CertifiedStreamSlice: Returns the requested certified stream slice in its transport format.

It is required that an honest XNetPayloadBuilder-XNetEndpoint pair is able to successfully obtain slices over this API.

Looking at the bigger picture, the intuition for why this will yield a secure system is that in each round a new pair of block maker and endpoint will try to pull over a stream, which, in turn, means that eventually an honest pair will be able to obtain the stream and include it into a block.

XNet Payload Builder

The XNetPayloadBuilder builds and verifies payloads whenever requested to do so by the block maker. The rules for whether a payload is considered valid or not must be so that every notary is guaranteed to make the same decision on the same input and that a payload built by an honest payload builder will be accepted by honest validators. Essentially the rules resemble what is described in the section on properties and functionality. However, given that the execution may be behind we can not directly look up the expected indexes in the appropriate state but need to compute it based on the referenced state and the payloads since then. Below, we provide a figure illustrating the high-level functionality: generally speaking blocks are considered valid if they adhere to the rules described in the figure and are considered invalid otherwise.

Rules for payload building

Below we formally define the operation of the component. We first define the following helper functions. We assume that XNetPayloadBuilder has an associated field own_subnet which is passed whenever constructing an XNetPayloadBuilder:

new : SubnetId → Self
new(own_subnet) :=
  XNetPayloadBuilder {
      with
         └─ own_subnet := own_subnet
  }

The API defines the past_payloads as a vector where the past payloads are ordered with respect to the corresponding height in the chain. While this ordering allows for a more efficient implementation of the functions below it does not matter on a conceptual level. Hence we resort to looking at it as a set for the sake of simplicity.

The function slice_indexes returns the set of expected indices for the block to be proposed, solely based on a set of Slices.

% Take the maximum index for each individual (sub-)stream in the given set of slices and add
% 1 to obtain the next indexes one would expect when solely looking at the past payloads but
% ignoring the state.
slice_indexes : (SubnetId ↦ StreamSlice) → Set<(SubnetId × StreamIndex)>
slice_indexes(slices) := { i + 1 | i ∈ max(dom(slices.msgs.elements)) }

The function state_and_payload_indexes returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given slices from the past payloads.

% Take the expected indexes from the state, remove whatever index appears in the given
% slices and add the expected indexes according to the streams in the slices.
%
% FAIL IF: ∃ i, j ∈ state_and_payload_indexes(state, slices) :
%              prefix(i) = prefix(j) ∧ postfix(i) ≠ postfix(j)
%
state_and_payload_indexes : ReplicatedState ×
                            (SubnetId ↦ StreamSlice) →
                            Set<(SubnetId × StreamIndex)>
state_and_payload_indexes(state, slices) := state.expected_xnet_indices
                                            \ dom(slices.msgs.elements)
                                            ∪ slice_indexes(slices)

The function expected_indexes returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given past payloads.

% Decode the slices in the given payload and compute the expected indexes using the
% expected_indexes function above
expected_indexes : SubnetId ×
                   ReplicatedState ×
                   (SubnetId ↦ StreamSlice) →
                   Set<(SubnetId × StreamIndex)>
expected_indexes(own_subnet, state, slices) :=
    state_and_payload_indexes(
        state,
        { (src ↦ slice) | payload ∈ slices ∧
                          (src ↦ cert_slice) ∈ payload ∧
                          slice = StateManager.decode_valid_certified_stream(own_subnet,
                                                                             cert_slice)
        }
    )

Creation of XNet Payloads

Based on the functions above, we are now ready to define the function get_xnet_payload : Height × Height × Set<XNetPayload> → XNetPayload. Note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.

% Build an xnet payload containing the currently available streams. The begin is either given
% by the expected index, and, if there is no expected index for a given prefix, the index
% ONE is expected.
%
% ENSURES: size_of(get_xnet_payload(self, ·, ·, ·, size_limit)) ≤ size_limit ∧
%          each payload output by get_xnet_payload will be accepted by validate_xnet_payload
%
get_xnet_payload : Self × RegistryVersion × Height × Vec<XNetPayload> × ℕ → XNetPayload
get_xnet_payload(self, registry_version, reference_height, past_payloads, size_limit) :=
    { (remote_subnet ↦ slice) |
            S = StateManager.get_state_at(reference_height)
          ∧ subnets = Registry::get_registry_at(registry_version).subnets \ { self.own_subnet }
          ∧ (remote_subnet, begin_index) ∈
                  expected_indexes(self.own_subnet, S, past_payloads)
                ∪ { (subnet_id, StreamIndex::ONE) |
                        subnet_id ∈ subnets
                                    \ { s | (s, ·) ∈ expected_indexes(self.own_subnet,
                                                                      S,
                                                                      past_payloads)
                                      }
                  }
            % msg_limit and size limit need to be set by the implementation as appropriate
            % to satisfy the post condition
          ∧ slice = XNetEndpoint::get(subnet).get_stream(remote_subnet, begin_index, ·, ·)
          ∧ ERR ≠ StateManager.decode_certified_stream(registry_version,
                                                       self.own_subnet,
                                                       remote_subnet,
                                                       slice)
    }

Validation of XNet Payloads

Validation of XNetPayloads works analogously to the creation. The function validate_xnet_payload is defined as follows, where we assume that it evaluates to false in case an error occurs. Again, note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.

% Check whether a given xnet payload was built according to the rules given above.
%
% FAIL IF: size_of(payload) > size_limit
%
validate_xnet_payload : Self × RegistryVersion × Height × Vec<XNetPayload> × XNetPayload × ℕ → Bool
validate_xnet_payload(self, registry_version, reference_height, past_payloads, payload, size_limit) :=
    S = StateManager.get_state_at(reference_height) ∧
    ∀ (remote_subnet ↦ css) ∈ payload :
    {
      slice = StateManager.decode_certified_stream(registry_version,
                                                   self.own_subnet,
                                                   remote_subnet,
                                                   css) ∧
      ∀ index ∈ min(dom(slice.msgs.elements)) :
      {
        (remote_subnet, index) ∈ expected_indexes(S, past_payloads) ∨
        index = (remote_subnet, StreamIndex::ONE)
      }
    }