<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.internetcomputer.org/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=David</id>
	<title>Internet Computer Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.internetcomputer.org/w/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=David"/>
	<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/wiki/Special:Contributions/David"/>
	<updated>2026-05-01T12:22:33Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.39.17</generator>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Dealing_with_cycles_limit_exceeded_errors&amp;diff=8695</id>
		<title>Dealing with cycles limit exceeded errors</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Dealing_with_cycles_limit_exceeded_errors&amp;diff=8695"/>
		<updated>2026-02-09T15:28:39Z</updated>

		<summary type="html">&lt;p&gt;David: Deprecate page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page has been deprecated.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Parallelism&amp;diff=8694</id>
		<title>Parallelism</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Parallelism&amp;diff=8694"/>
		<updated>2026-02-09T15:28:18Z</updated>

		<summary type="html">&lt;p&gt;David: Deprecate page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page has been deprecated.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Comparing_Canister_Cycles_vs_Performance_Counter&amp;diff=8693</id>
		<title>Comparing Canister Cycles vs Performance Counter</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Comparing_Canister_Cycles_vs_Performance_Counter&amp;diff=8693"/>
		<updated>2026-02-09T15:27:53Z</updated>

		<summary type="html">&lt;p&gt;David: Deprecate page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page has been deprecated.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Current_limitations_of_the_Internet_Computer&amp;diff=8692</id>
		<title>Current limitations of the Internet Computer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Current_limitations_of_the_Internet_Computer&amp;diff=8692"/>
		<updated>2026-02-09T15:27:29Z</updated>

		<summary type="html">&lt;p&gt;David: Deprecate page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page has been deprecated.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting_forum_announcement_template&amp;diff=6525</id>
		<title>Subnet splitting forum announcement template</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting_forum_announcement_template&amp;diff=6525"/>
		<updated>2023-10-30T12:12:35Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page provides a template process that can be used to announce an upcoming series of proposals to [[Subnet splitting|split a subnet]].&lt;br /&gt;
&lt;br /&gt;
===Step 1: Heads up template, before starting the process===&lt;br /&gt;
The text below gives an example of what an announcement for an upcoming series of proposals to split a subnet could look like. &lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Hello everybody,&lt;br /&gt;
&lt;br /&gt;
We are planning to submit the proposals to split subnet &amp;lt;code&amp;gt;&amp;lt;fill in subnet ID&amp;gt;&amp;lt;/code&amp;gt; because &amp;lt;code&amp;gt;&amp;lt;fill in reason&amp;gt;&amp;lt;/code&amp;gt;. Currently this subnet is assigned the canister ID ranges &amp;lt;code&amp;gt;&amp;lt;fill in ranges; in both representations&amp;gt;&amp;lt;/code&amp;gt;. Ranges are inclusive, i.e. they include the beginning and the end of the range. We are proposing to split off the canisters corresponding to ID ranges &amp;lt;code&amp;gt;&amp;lt;fill in ranges; in both representations&amp;gt;&amp;lt;/code&amp;gt; to a new subnet. The ranges were chosen so that &amp;lt;code&amp;gt;&amp;lt;fill in how ranges were chosen&amp;gt;&amp;lt;/code&amp;gt;. For more details on subnet splitting please refer to [[Subnet splitting|this]] wiki page.&lt;br /&gt;
&lt;br /&gt;
Generally the subnet splitting process is designed in a way that – apart from downtime (&amp;lt;code&amp;gt;&amp;lt;fill in expected downtime if available&amp;gt;&amp;lt;/code&amp;gt;) – it is transparent to canisters provided that they properly handle transient errors when sending canister-to-canister messages. Note that these transient errors routinely occur during normal subnet functioning – so a correct canister implementation should properly handle these errors. In case you are unsure whether your canister properly handles these transient errors you might want to consider stopping it and restarting it after the split is complete. &lt;br /&gt;
&lt;br /&gt;
It will be possible to verify whether the proposed split is authentic by verifying the splitting related artifacts as described [[Verification of Artifacts in Subnet Splitting|here]]. The required artifacts will be obtained after the proposal to halt the source subnet at a CUP boundary was accepted and the subnet successfully halted. They will be published in a follow up post in this thread.&lt;br /&gt;
&lt;br /&gt;
Subsequent updates will also be shared in this thread. &lt;br /&gt;
&lt;br /&gt;
Hint: to make sense of the ranges and whether your canister is inside or outside some range, the state tool enables the conversion of canister IDs to their hex representation, which allows comparing them as numbers. For example, to convert canister ID qoctq-giaaa-aaaaa-aaaea-cai to its hex representation one can:&lt;br /&gt;
&lt;br /&gt;
*Clone the IC repo from https://github.com/dfinity/ic&lt;br /&gt;
*From within the repo, enter the dev container via &amp;lt;code&amp;gt;./gitlab-ci/container/container-run.sh&amp;lt;/code&amp;gt;&lt;br /&gt;
*Run &amp;lt;code&amp;gt;bazel run --config local //rs/state_tool:state-tool -- canister_id_to_hex --canister_id qoctq-giaaa-aaaaa-aaaea-cai&amp;lt;/code&amp;gt;&lt;br /&gt;
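For illustration, the same conversion can be sketched in a few lines of Python. This is a hypothetical helper, not the state tool itself; it relies only on the documented textual encoding of principals (Base32 of a 4-byte big-endian CRC-32 checksum followed by the raw principal bytes):

```python
import base64
import binascii

def canister_id_to_hex(canister_id: str) -> str:
    """Convert a textual canister ID to the hex form of its raw principal bytes."""
    # Strip the dash separators and restore standard Base32 padding.
    b32 = canister_id.replace("-", "").upper()
    raw = base64.b32decode(b32 + "=" * (-len(b32) % 8))
    # The first 4 bytes are a big-endian CRC-32 checksum of the remaining bytes.
    checksum, principal = raw[:4], raw[4:]
    if binascii.crc32(principal).to_bytes(4, "big") != checksum:
        raise ValueError("invalid canister ID checksum")
    return principal.hex()
```

Since canister IDs of the same length yield fixed-length hex strings, the results can be compared lexicographically to decide whether an ID lies within an inclusive range.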
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Once the source subnet is halted and the artifacts were obtained ===&lt;br /&gt;
After the proposal to halt the source subnet has been accepted and the subnet is halted, one can obtain the artifacts needed for [[Verification of Artifacts in Subnet Splitting|end-to-end verification]] and share them with the community. To do so, one can follow the steps below: &lt;br /&gt;
&lt;br /&gt;
# Run through the [[Verification of Artifacts in Subnet Splitting|verification process]] manually to verify that everything works. Only proceed with the subsequent steps if the verification succeeds.&lt;br /&gt;
# Post the artifacts in the thread and link to the verification process again so that the community can verify them.&lt;br /&gt;
# Continue with subsequent splitting proposals.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting_forum_announcement_template&amp;diff=6524</id>
		<title>Subnet splitting forum announcement template</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting_forum_announcement_template&amp;diff=6524"/>
		<updated>2023-10-30T12:10:37Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page provides a template process that can be used to announce an upcoming series of proposals to [[Subnet splitting|split a subnet]].&lt;br /&gt;
&lt;br /&gt;
===Step 1: Heads up template, before starting the process===&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Hello everybody,&lt;br /&gt;
&lt;br /&gt;
We are planning to submit the proposals to split subnet &amp;lt;code&amp;gt;&amp;lt;fill in subnet ID&amp;gt;&amp;lt;/code&amp;gt; because &amp;lt;code&amp;gt;&amp;lt;fill in reason&amp;gt;&amp;lt;/code&amp;gt;. Currently this subnet is assigned the canister ID ranges &amp;lt;code&amp;gt;&amp;lt;fill in ranges; in both representations&amp;gt;&amp;lt;/code&amp;gt;. Ranges are inclusive, i.e. they include the beginning and the end of the range. We are proposing to split off the canisters corresponding to ID ranges &amp;lt;code&amp;gt;&amp;lt;fill in ranges; in both representations&amp;gt;&amp;lt;/code&amp;gt; to a new subnet. The ranges were chosen so that &amp;lt;code&amp;gt;&amp;lt;fill in how ranges were chosen&amp;gt;&amp;lt;/code&amp;gt;. For more details on subnet splitting please refer to [[Subnet splitting|this]] wiki page.&lt;br /&gt;
&lt;br /&gt;
Generally the subnet splitting process is designed in a way that – apart from downtime (&amp;lt;code&amp;gt;&amp;lt;fill in expected downtime if available&amp;gt;&amp;lt;/code&amp;gt;) – it is transparent to canisters provided that they properly handle transient errors when sending canister-to-canister messages. Note that these transient errors routinely occur during normal subnet functioning – so a correct canister implementation should properly handle these errors. In case you are unsure whether your canister properly handles these transient errors you might want to consider stopping it and restarting it after the split is complete. &lt;br /&gt;
&lt;br /&gt;
It will be possible to verify whether the proposed split is authentic by verifying the splitting related artifacts as described [[Verification of Artifacts in Subnet Splitting|here]]. The required artifacts will be obtained after the proposal to halt the source subnet at a CUP boundary was accepted and the subnet successfully halted. They will be published in a follow up post in this thread.&lt;br /&gt;
&lt;br /&gt;
Subsequent updates will also be shared in this thread. &lt;br /&gt;
&lt;br /&gt;
Hint: to make sense of the ranges and whether your canister is inside or outside some range, the state tool enables the conversion of canister IDs to their hex representation, which allows comparing them as numbers. For example, to convert canister ID qoctq-giaaa-aaaaa-aaaea-cai to its hex representation one can:&lt;br /&gt;
&lt;br /&gt;
*Clone the IC repo from https://github.com/dfinity/ic&lt;br /&gt;
*From within the repo, enter the dev container via &amp;lt;code&amp;gt;./gitlab-ci/container/container-run.sh&amp;lt;/code&amp;gt;&lt;br /&gt;
*Run &amp;lt;code&amp;gt;bazel run --config local //rs/state_tool:state-tool -- canister_id_to_hex --canister_id qoctq-giaaa-aaaaa-aaaea-cai&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Once the source subnet is halted and the artifacts were obtained ===&lt;br /&gt;
After the proposal to halt the source subnet has been accepted and the subnet is halted, one can obtain the artifacts needed for [[Verification of Artifacts in Subnet Splitting|end-to-end verification]] and share them with the community. To do so, one can follow the steps below: &lt;br /&gt;
&lt;br /&gt;
# Run through the [[Verification of Artifacts in Subnet Splitting|verification process]] manually to verify that everything works. Only proceed with the subsequent steps if the verification succeeds.&lt;br /&gt;
# Post the artifacts in the thread and link to the verification process again so that the community can verify them.&lt;br /&gt;
# Continue with subsequent splitting proposals.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting_forum_announcement_template&amp;diff=6523</id>
		<title>Subnet splitting forum announcement template</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting_forum_announcement_template&amp;diff=6523"/>
		<updated>2023-10-30T12:10:14Z</updated>

		<summary type="html">&lt;p&gt;David: Create template page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page provides a template process that can be used to announce an upcoming series of proposals to split a subnet.&lt;br /&gt;
&lt;br /&gt;
===Step 1: Heads up template, before starting the process===&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
Hello everybody,&lt;br /&gt;
&lt;br /&gt;
We are planning to submit the proposals to split subnet &amp;lt;code&amp;gt;&amp;lt;fill in subnet ID&amp;gt;&amp;lt;/code&amp;gt; because &amp;lt;code&amp;gt;&amp;lt;fill in reason&amp;gt;&amp;lt;/code&amp;gt;. Currently this subnet is assigned the canister ID ranges &amp;lt;code&amp;gt;&amp;lt;fill in ranges; in both representations&amp;gt;&amp;lt;/code&amp;gt;. Ranges are inclusive, i.e. they include the beginning and the end of the range. We are proposing to split off the canisters corresponding to ID ranges &amp;lt;code&amp;gt;&amp;lt;fill in ranges; in both representations&amp;gt;&amp;lt;/code&amp;gt; to a new subnet. The ranges were chosen so that &amp;lt;code&amp;gt;&amp;lt;fill in how ranges were chosen&amp;gt;&amp;lt;/code&amp;gt;. For more details on subnet splitting please refer to [[Subnet splitting|this]] wiki page.&lt;br /&gt;
&lt;br /&gt;
Generally the subnet splitting process is designed in a way that – apart from downtime (&amp;lt;code&amp;gt;&amp;lt;fill in expected downtime if available&amp;gt;&amp;lt;/code&amp;gt;) – it is transparent to canisters provided that they properly handle transient errors when sending canister-to-canister messages. Note that these transient errors routinely occur during normal subnet functioning – so a correct canister implementation should properly handle these errors. In case you are unsure whether your canister properly handles these transient errors you might want to consider stopping it and restarting it after the split is complete. &lt;br /&gt;
&lt;br /&gt;
It will be possible to verify whether the proposed split is authentic by verifying the splitting related artifacts as described [[Verification of Artifacts in Subnet Splitting|here]]. The required artifacts will be obtained after the proposal to halt the source subnet at a CUP boundary was accepted and the subnet successfully halted. They will be published in a follow up post in this thread.&lt;br /&gt;
&lt;br /&gt;
Subsequent updates will also be shared in this thread. &lt;br /&gt;
&lt;br /&gt;
Hint: to make sense of the ranges and whether your canister is inside or outside some range, the state tool enables the conversion of canister IDs to their hex representation, which allows comparing them as numbers. For example, to convert canister ID qoctq-giaaa-aaaaa-aaaea-cai to its hex representation one can:&lt;br /&gt;
&lt;br /&gt;
*Clone the IC repo from https://github.com/dfinity/ic&lt;br /&gt;
*From within the repo, enter the dev container via &amp;lt;code&amp;gt;./gitlab-ci/container/container-run.sh&amp;lt;/code&amp;gt;&lt;br /&gt;
*Run &amp;lt;code&amp;gt;bazel run --config local //rs/state_tool:state-tool -- canister_id_to_hex --canister_id qoctq-giaaa-aaaaa-aaaea-cai&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
=== Step 2: Once the source subnet is halted and the artifacts were obtained ===&lt;br /&gt;
After the proposal to halt the source subnet has been accepted and the subnet is halted, one can obtain the artifacts needed for [[Verification of Artifacts in Subnet Splitting|end-to-end verification]] and share them with the community. To do so, one can follow the steps below: &lt;br /&gt;
&lt;br /&gt;
# Run through the [[Verification of Artifacts in Subnet Splitting|verification process]] manually to verify that everything works. Only proceed with the subsequent steps if the verification succeeds.&lt;br /&gt;
# Post the artifacts in the thread and link to the verification process again so that the community can verify them.&lt;br /&gt;
# Continue with subsequent splitting proposals.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting&amp;diff=6522</id>
		<title>Subnet splitting</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting&amp;diff=6522"/>
		<updated>2023-10-30T11:58:12Z</updated>

		<summary type="html">&lt;p&gt;David: Add link to template page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Subnet splitting steps.png|alt=Subnet splitting steps|thumb|High level overview of subnet splitting steps]]&lt;br /&gt;
The Internet Computer Protocol (ICP) now supports a minimum viable product (MVP) version of subnet splitting. Subnet splitting is a process in which a subset of the canisters on a parent subnet A is split off onto a newly created child subnet B for load-balancing purposes, while the remaining canisters stay on the trimmed-down subnet A’. The MVP process is orchestrated by a series of NNS proposals and, very roughly speaking, consists of the steps visualized on the right. &lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
To understand all the details of subnet splitting we need to take a closer look at some parts of the ICP. &lt;br /&gt;
&lt;br /&gt;
==== Checkpoints and Manifests ====&lt;br /&gt;
Up-to-date nodes persist their state to disk every few hundred rounds (usually every 500). A state that is persisted to disk by the protocol is called a checkpoint. Each checkpoint is a directory with the following structure (note that not all of the files shown below are necessarily present in every checkpoint, e.g., writing empty files may be skipped).&amp;lt;syntaxhighlight lang=&amp;quot;shell&amp;quot;&amp;gt;&lt;br /&gt;
├── canister_states&lt;br /&gt;
│   ├── &amp;lt;hex(canister_id)&amp;gt;&lt;br /&gt;
│   │   ├── canister.pbuf&lt;br /&gt;
│   │   ├── queues.pbuf&lt;br /&gt;
│   │   ├── software.wasm&lt;br /&gt;
│   │   ├── stable_memory.bin&lt;br /&gt;
│   │   └── vmemory_0.bin&lt;br /&gt;
│   ...&lt;br /&gt;
├── ingress_history.pbuf&lt;br /&gt;
├── split_from.pbuf&lt;br /&gt;
├── subnet_queues.pbuf&lt;br /&gt;
└── system_metadata.pbuf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;As shown above, the state of each canister is stored in a separate directory. In contrast, there’s a single file containing the current ingress history (request statuses) of all canisters. &lt;br /&gt;
&lt;br /&gt;
For each checkpoint the protocol also computes a so-called manifest. A manifest consists of the individual hashes of every file and file chunk (file contents are chunked so that chunks can be transmitted individually) and a root hash computed from them. The root hash has the property that it is intractable to come up with two different states that hash to the same root hash; it can hence be used to verify the integrity of a given state relative to a root hash. A similar property holds for the file and chunk hashes in the manifest. They are redundant for guaranteeing the integrity of the entire state, as they are also covered by the root hash; they are nevertheless included in the manifest to allow state diffs to be computed efficiently by comparing file/chunk hashes.&lt;br /&gt;
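The division of labor between per-chunk hashes and the root hash can be illustrated with a simplified sketch. The hash construction, chunk size, and manifest layout below are invented for illustration and do not match the replica's actual manifest format:

```python
import hashlib

CHUNK_SIZE = 1 << 20  # illustrative chunk size, not the replica's actual value

def file_chunk_hashes(content: bytes) -> list[bytes]:
    """Hash each fixed-size chunk of a file individually."""
    return [
        hashlib.sha256(content[i : i + CHUNK_SIZE]).digest()
        for i in range(0, max(len(content), 1), CHUNK_SIZE)
    ]

def root_hash(manifest: dict[str, list[bytes]]) -> bytes:
    """Derive a single commitment from all file paths and their chunk hashes."""
    h = hashlib.sha256()
    for path in sorted(manifest):  # fixed order makes the root deterministic
        h.update(path.encode())
        for chunk_hash in manifest[path]:
            h.update(chunk_hash)
    return h.digest()

manifest = {
    "canister_states/00000001/software.wasm": file_chunk_hashes(b"\x00asm..."),
    "ingress_history.pbuf": file_chunk_hashes(b"..."),
}
root = root_hash(manifest)
```

Any change to a single chunk changes its chunk hash and therefore the root, so chunk hashes can be compared individually while the root still vouches for the whole state; files unchanged between two checkpoints keep their hashes, enabling cheap diffs.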
&lt;br /&gt;
==== CUPs ====&lt;br /&gt;
Every checkpoint is also signed on behalf of the subnet by the protocol. This is done by creating and threshold-signing a so-called catch-up package (CUP). A CUP includes the root hash of the corresponding manifest as well as all the information that a newly joining and/or lagging node requires to resume execution from the respective checkpoint. The signature on the CUP, together with the properties of the hash described above, guarantees the authenticity of the state.&lt;br /&gt;
&lt;br /&gt;
==== Remarks ====&lt;br /&gt;
Note that when we point to source code within this article we use links to a specific version of the code to make sure that these links don’t break if the code changes in the future. It is advisable to locate the code that is pointed to in the most recent version and/or the version that is used to verify the artifacts and look at it in this version to avoid missing changes that happened after the publication of this article.&lt;br /&gt;
&lt;br /&gt;
=== Splitting Process ===&lt;br /&gt;
In more detail, a subnet split proceeds as follows.&lt;br /&gt;
&lt;br /&gt;
# Create a new halted subnet via NNS proposal. This is the subnet to become subnet B. It should have the same size in terms of number of nodes as the subnet to be split (subnet A) to maintain the same trust assumptions. It is important that it is halted to ensure that it will remain in its genesis state. This is because we will propose to overwrite this subnet’s state later with the split off half of subnet A. &lt;br /&gt;
# Add the canisters to be split off to the migration list (NNS proposal). This makes other subnets aware that these canisters are going to be split off and helps them react appropriately to messages that are, e.g., expected to come from subnet A according to the routing table but actually come from B, or the other way round (note that not all subnets will observe a change of the routing table at the same time). &lt;br /&gt;
# Halt subnet A at the next CUP/checkpoint (NNS proposal) and add keys that grant the entity executing the split read-only access to the state (NNS proposal). Generally this proposal is the same as in the conventional subnet recovery case. The only difference is that it won’t instruct the subnet to halt immediately but at the next CUP. This ensures that we obtain a state that is signed on behalf of the subnet, so that the split can be verified by the community.&lt;br /&gt;
# Update the routing table (NNS proposal). As soon as a subnet observes this change it will (among others) start routing messages to split off canisters to B. The reason for why this happens after halting subnet A is to ensure that A doesn’t route any new messages to B until it is unhalted again.&lt;br /&gt;
# Obtain the state of subnet A together with the CUP at the height where it was halted (recall that a read-only SSH key was added in step 3). Also obtain a certification of the public key of subnet A issued by the NNS (via a read_state call). Once obtained, split the state into two parts. The first part contains the states of canisters that are not split off, as well as the entire subnet-level state of A, such as streams, subnet queues, etc.; this part will be used as the new genesis state of A (we’ll refer to the subnet A with this genesis state as A’). The second part consists of an empty subnet state plus the canisters split off from A, and will become the new genesis state of subnet B. Both the genesis states of A’ and B will contain the full ingress history, to be pruned in the new subnets’ first execution round. Pruning it beforehand would change the checkpoint file contents, and the hashes in the manifest needed for verification of the split would no longer match.&lt;br /&gt;
# Verify whether each file ended up on the expected subnet and whether its hash still matches. The artifacts needed to perform the verification will be published during a split. A manifest only contains hashes but no file contents, i.e., no subnet or canister data. Comparing the hashes is sufficient because of the properties of the hash described above and the fact that the split is done in a way that either (1) file contents are not modified during splitting, or (2) the content of the files is fully determined by the split’s input parameters. If this step is successful, one knows which root hashes to expect in the recovery proposals for subnets A’ and B in the next steps.&lt;br /&gt;
# Perform a subnet recovery for subnets A’ (resp. A) and B with the states obtained in the previous step (series of NNS proposals). The NNS voters should check that the root hashes of the states match the ones obtained in the verification step (details below) to be sure that the states were only split but not otherwise tampered with. &lt;br /&gt;
# Once it is clear that there are no more messages in any stream to or from subnet A/A’ that may potentially be misrouted, remove the migration list entry (NNS proposal).&lt;br /&gt;
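The partitioning in step 5 can be sketched abstractly (toy types; the real splitting logic lives in the replica): each canister lands on exactly one side depending on whether its ID falls into one of the inclusive split-off ranges, while subnet-wide state such as streams and subnet queues stays with A’.

```python
def in_ranges(canister_id: int, ranges: list[tuple[int, int]]) -> bool:
    # Ranges are inclusive on both ends, matching the routing table convention.
    return any(lo <= canister_id <= hi for lo, hi in ranges)

def split_state(canister_states: dict[int, bytes], split_off: list[tuple[int, int]]):
    """Partition canister states: split-off ranges go to subnet B, the rest stay on A'."""
    a_prime = {c: s for c, s in canister_states.items() if not in_ranges(c, split_off)}
    b = {c: s for c, s in canister_states.items() if in_ranges(c, split_off)}
    return a_prime, b

# Canisters 1 and 9 stay on A', canister 5 moves to B.
a_prime, b = split_state({1: b"s1", 5: b"s5", 9: b"s9"}, split_off=[(4, 8)])
```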
&lt;br /&gt;
To gain confidence that the splitting process doesn’t violate the messaging guarantees that the IC provides to canisters, the above high-level design is accompanied by a [https://github.com/dfinity/tla-models/tree/master/subnet-splitting formal model written in TLA+]. An analysis of the model using the TLC tool found that the design preserves the messaging guarantees.&lt;br /&gt;
&lt;br /&gt;
=== Verification of Artifacts in Subnet Splitting ===&lt;br /&gt;
As mentioned before, the process is orchestrated via NNS proposals and it is end-to-end verifiable in the sense that the community can verify whether the states recovered onto the new subnets are indeed the two halves of the initial state before voting on the respective proposals. A description of how to perform this verification is provided on the [[Verification of Artifacts in Subnet Splitting]] page.&lt;br /&gt;
&lt;br /&gt;
=== Additional Useful Material ===&lt;br /&gt;
&lt;br /&gt;
* [[Subnet splitting forum announcement template|Template]] for announcing upcoming subnet splitting proposals&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Verification_of_Artifacts_in_Subnet_Splitting&amp;diff=6392</id>
		<title>Verification of Artifacts in Subnet Splitting</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Verification_of_Artifacts_in_Subnet_Splitting&amp;diff=6392"/>
		<updated>2023-09-01T14:42:09Z</updated>

		<summary type="html">&lt;p&gt;David: Add git commit that was used for running the example.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how to verify the artifacts during the [[subnet splitting]] process. Note that when we point to source code within this article we use links to a specific version of the code to make sure that these links don&#039;t break if the code changes in the future. It is advisable to locate the code that is pointed to in the most recent version and/or the version that is used to verify the artifacts and look at it in this version to avoid missing changes that happened after the publication of this article.&lt;br /&gt;
&lt;br /&gt;
The first step is to obtain a vanilla clone of the [https://github.com/dfinity/ic/tree/master/rs IC repository] and check out the commit corresponding to the version currently rolled out on mainnet. Once this is done, one can enter the dev container via &amp;lt;code&amp;gt;./gitlab-ci/container/container-run.sh&amp;lt;/code&amp;gt;. From within the container, one should be able to run the commands provided below.&lt;br /&gt;
&lt;br /&gt;
===Splitting the State===&lt;br /&gt;
The subnet state is split in such a way that the content of most of the individual checkpoint files does not change, so the file and chunk hashes in the manifest can be directly compared to the corresponding ones in the manifest of the initial parent subnet. For example, a canister is retained without modification on exactly one of the resulting subnets. There are a handful of subnet (as opposed to canister) state files where this is not possible, e.g., subnet queues. These files, however, remain on the source subnet in their entirety, and the split-off subnet gets new files, initialized based on the parameters passed to the splitting logic. So the expected hashes of these new files are also fully determined.&lt;br /&gt;
&lt;br /&gt;
The ingress history is a special case: because ingress messages are essentially tied to their respective target canisters, both child subnets need parts of the ingress history. To avoid breaking end-to-end verifiability here, the full ingress history of the parent subnet is preserved on both child subnets, keeping the hash of the ingress history file the same on the parent and both child subnets. The unnecessary entries are then pruned at the beginning of the first execution round of each of the child subnets.&lt;br /&gt;
&lt;br /&gt;
===Subnet Splitting Artifacts===&lt;br /&gt;
The artifacts needed to verify a subnet split are listed below. Artifacts 2-5 will be published (presumably on the forum) during a subnet split so that the community can factor them into their decision on whether to accept the related proposals.&lt;br /&gt;
&lt;br /&gt;
#The NNS public key. &#039;&#039;&#039;This key serves as the root of trust of the validation and will not be distributed as part of the artifacts.&#039;&#039;&#039; We recommend that everyone who wants to verify the artifacts obtains this key out of band. For example, the agents and the Rosetta node have the mainnet NNS public key baked in, and one can obtain it from the respective repositories on GitHub.&lt;br /&gt;
#A certificate from the NNS that includes the public key of the subnet to be split. This makes it possible to verify the authenticity of the subnet public key relative to the NNS public key.&lt;br /&gt;
#The subnet ID of the subnet to be split. If the certificate from the previous step is deemed valid relative to the NNS public key, this ID can be used to extract the subnet’s public key from the certificate.&lt;br /&gt;
#The CUP of the subnet to be split at the height where it was halted. The CUP signature can be verified using the subnet public key from the previous step and, if verification succeeds, the authentic root hash of the state can be extracted.&lt;br /&gt;
#A textual representation of the manifest of the subnet to be split at the height where it was halted. One can recompute the root hash from the file and chunk hashes in the textual representation of the manifest using the state tool. If the recomputed root hash matches the one extracted from the CUP in the previous step, one knows that the file and chunk hashes in the manifest must be authentic and can hence be used as a basis of comparison, without knowing the actual file contents, i.e., just based on the hashes.&lt;br /&gt;
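The chain of checks in items 1-5 can be sketched end to end with toy stand-ins. Here HMAC plays the role of the real threshold signatures, and a plain SHA-256 of the manifest text stands in for the real root hash recomputation; none of the data formats below match the actual artifacts:

```python
import hashlib
import hmac

# Toy stand-in for a signature scheme: HMAC replaces BLS threshold signatures.
def sign(key: bytes, msg: bytes) -> bytes:
    return hmac.new(key, msg, hashlib.sha256).digest()

def verify_split_artifacts(nns_key, certificate, subnet_id, cup, manifest_text):
    # Items 2-3: the certificate binds the subnet ID to the subnet's public key;
    # it is checked against the NNS key obtained out of band (item 1).
    cert_body = subnet_id.encode() + certificate["subnet_key"]
    if not hmac.compare_digest(sign(nns_key, cert_body), certificate["sig"]):
        raise ValueError("certificate not signed by the NNS")
    # Item 4: the CUP commits to the state root hash and is signed by the subnet.
    subnet_key = certificate["subnet_key"]
    if not hmac.compare_digest(sign(subnet_key, cup["state_hash"]), cup["sig"]):
        raise ValueError("CUP signature invalid")
    # Item 5: recomputing the root hash from the textual manifest must reproduce
    # the hash extracted from the CUP.
    if hashlib.sha256(manifest_text.encode()).digest() != cup["state_hash"]:
        raise ValueError("manifest does not match the signed state hash")
    return cup["state_hash"]

# Demo data standing in for the published artifacts.
nns_key, subnet_key = b"demo-nns-key", b"demo-subnet-key"
manifest_text = "FILE canister_states/... CHUNKS 1 HASH ab12..."
state_hash = hashlib.sha256(manifest_text.encode()).digest()
cert = {"subnet_key": subnet_key, "sig": sign(nns_key, b"sub-1" + subnet_key)}
cup = {"state_hash": state_hash, "sig": sign(subnet_key, state_hash)}
```

Each step only extends the trust established by the previous one, which is why the NNS public key is the single out-of-band input.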
&lt;br /&gt;
The verification procedure of the artifacts can be triggered via the validate subcommand of the subnet splitting tool. We have prepared demo artifacts we obtained from a split that was carried out on a testnet. They can be downloaded [https://drive.google.com/file/d/1xFyStpVce0B6WRgxvfB68EG_KtDPm0N4/view?usp=drive_link here]. In the commands below we assume that you have the IC repository at git commit &amp;lt;code&amp;gt;02b79bcf3cefbb86b5252a2576ccb4dd7e0a9090&amp;lt;/code&amp;gt; checked out and unzipped the artifacts archive into the &amp;lt;code&amp;gt;/tmp/splitting-artifacts&amp;lt;/code&amp;gt; directory.&amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
bazel run //rs/recovery/subnet_splitting:subnet-splitting-tool \&lt;br /&gt;
  --config local \&lt;br /&gt;
  -- validate \&lt;br /&gt;
  --nns-public-key-path /tmp/splitting-artifacts/demo_nns_public_key.pem \&lt;br /&gt;
  --state-tree-path /tmp/splitting-artifacts/demo_state_tree.cbor \&lt;br /&gt;
  --source-subnet-id gmmtp-v5kpc-wdkrx-pni4q-6qwld-ntwrv-kqxva-c3rnm-o6lx3-2eg74-oae \&lt;br /&gt;
  --cup-path /tmp/splitting-artifacts/demo_CUP.pbuf \&lt;br /&gt;
  --state-manifest-path /tmp/splitting-artifacts/demo_source_manifest.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;This is the expected output upon a successful validation.&amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
Aug 31 08:15:05.954 INFO Validating State Tree signed by the NNS&lt;br /&gt;
Aug 31 08:15:05.954 INFO Reading the NNS public key from /tmp/splitting-artifacts/demo_nns_public_key.pem&lt;br /&gt;
Aug 31 08:15:05.957 INFO Validation succeeded: extracted authentic subnet key from the NNS state tree.&lt;br /&gt;
Aug 31 08:15:05.957 INFO &lt;br /&gt;
Aug 31 08:15:05.957 INFO Validating Source Subnet&#039;s original CUP&lt;br /&gt;
Aug 31 08:15:05.960 INFO Dealer subnet from the CUP: gmmtp-v5kpc-wdkrx-pni4q-6qwld-ntwrv-kqxva-c3rnm-o6lx3-2eg74-oae&lt;br /&gt;
Aug 31 08:15:05.960 INFO CUP height: 82600&lt;br /&gt;
Aug 31 08:15:05.960 INFO Block time from the CUP: 2023-08-22 09:23:29.206708237 UTC (nanos since unix epoch: 1692696209206708237)&lt;br /&gt;
Aug 31 08:15:05.960 INFO State hash from the CUP: 46fbb99db26d751c5b2b9f35778d5800bae74566f968388fbb0b36226a488416&lt;br /&gt;
Aug 31 08:15:05.967 INFO Validation succeeded: source subnet CUP signature is valid.&lt;br /&gt;
Aug 31 08:15:05.967 INFO &lt;br /&gt;
Aug 31 08:15:05.967 INFO Validating Source Subnet&#039;s original state manifest&lt;br /&gt;
Aug 31 08:15:05.967 INFO state hash from the CUP: 46fbb99db26d751c5b2b9f35778d5800bae74566f968388fbb0b36226a488416&lt;br /&gt;
Aug 31 08:15:05.967 INFO state hash from the State Manifest: 46fbb99db26d751c5b2b9f35778d5800bae74566f968388fbb0b36226a488416&lt;br /&gt;
Aug 31 08:15:05.967 INFO Validation succeeded: recomputed manifest root hash matches the one in the CUP.&lt;br /&gt;
Aug 31 08:15:05.967 INFO&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Computing the expected manifests and root hashes===&lt;br /&gt;
If the verification above succeeds, one can split the authentic manifest of the parent subnet to compute the expected manifests for the child subnets. Before we start with the actual procedure, let’s look at the example manifest of the source subnet in our testnet setting. &amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
MANIFEST VERSION: V3&lt;br /&gt;
FILE TABLE&lt;br /&gt;
    idx     |    size    |                               hash                               |                         path                         &lt;br /&gt;
------------+------------+------------------------------------------------------------------+------------------------------------------------------&lt;br /&gt;
          0 |        216 | 8feafcda70181ac8920dfeba5db951f931f87758ecd5dcd7c920a8468a5bacf9 | canister_states/00000000003000000101/canister.pbuf&lt;br /&gt;
          1 |        532 | cfa0f9a9f833b072aa0c8d1f6eb8bb9334a36168379eaa0925d9013b2d646bbb | canister_states/00000000003000010101/canister.pbuf&lt;br /&gt;
          2 |     459789 | 4840e8e6e379f2108cec9de7ef386ea5f927f687b121b90ae765ad144bc0222a | canister_states/00000000003000010101/software.wasm&lt;br /&gt;
          3 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000010101/stable_memory.bin&lt;br /&gt;
          4 |    1093632 | 1ce90b6ed4c455e1c51c4c2ff466d3c6d49c76a585a014d61eebae6c09b0980b | canister_states/00000000003000010101/vmemory_0.bin&lt;br /&gt;
          5 |       4721 | 346f8b36205998ef3c3c70d5ef64f27096714c83925d98f9e7daf93bf71bd66b | canister_states/00000000003000020101/canister.pbuf&lt;br /&gt;
          6 |     324196 | 73b2c88c9db040794b23535ecd57fa14528d31ada4d9a82fdb6e9cd21cc0c1e6 | canister_states/00000000003000020101/software.wasm&lt;br /&gt;
          7 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000020101/stable_memory.bin&lt;br /&gt;
          8 |    1245184 | 0779cd1b7b5afcac3a2e19070abab88bd637261965c36dd1334928980e7a3744 | canister_states/00000000003000020101/vmemory_0.bin&lt;br /&gt;
          9 |        749 | 493d2f01070bccb63bf75764bcdbc42b1a6946e191029afd74353f339939f51e | canister_states/00000000003000030101/canister.pbuf&lt;br /&gt;
         10 |     968242 | 912f20ef0b5769a55316b7463cc1d55f079a2e91d1d358a5524a28c0782e82b1 | canister_states/00000000003000030101/software.wasm&lt;br /&gt;
         11 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000030101/stable_memory.bin&lt;br /&gt;
         12 |    1458176 | 26f9c82ae827458e84befaa73c42f0cbcf39a2c9dae0c777aa26e7fd6a75db62 | canister_states/00000000003000030101/vmemory_0.bin&lt;br /&gt;
         13 |       2418 | b18a59a56442ddd68d9fcc6fb0103f12c0c0bfd0d01c473711d53b2aff18d596 | system_metadata.pbuf&lt;br /&gt;
CHUNK TABLE&lt;br /&gt;
    idx     |  file_idx  |   offset   |    size    |                               hash                               &lt;br /&gt;
------------+------------+------------+------------+------------------------------------------------------------------&lt;br /&gt;
          0 |          0 |          0 |        216 | 0c6f866e1b60969cc48b4015bb382d45c282b46a82dd24769a15e9628e4d2f62&lt;br /&gt;
          1 |          1 |          0 |        532 | 3b53a0d9cb9e0131a4677d86c237ed06a1aadaf46c480e448a8d78afe9c43635&lt;br /&gt;
          2 |          2 |          0 |     459789 | c470eb0871e137fccfb419b16e2826a63c2a1dba913f5c260ab22ee2958afd28&lt;br /&gt;
          3 |          4 |          0 |    1048576 | de5005242d69024d356dd4d03d8a6879a2b8aadfff6e0f5f80900d45d363305e&lt;br /&gt;
          4 |          4 |    1048576 |      45056 | 1ab510085a2a3090dd1d49c578fceabd56a6433bd75e7e2ecc3107c8f8f208eb&lt;br /&gt;
          5 |          5 |          0 |       4721 | 81af05e10350340c340db100e2b4c1dddc3a7f800ea573ddcef28d562fa70a10&lt;br /&gt;
          6 |          6 |          0 |     324196 | accdd1e66c56db5b5ff0ac24182a10f65559cfd17cc52d86ac3aa6cae5d567b4&lt;br /&gt;
          7 |          8 |          0 |    1048576 | a5d9f9ce12da611d7b170fc3156f2dc1255ee11fb6525ddc4dbc3cb7c5f488e5&lt;br /&gt;
          8 |          8 |    1048576 |     196608 | 9f0bd49a6523364cbc57561e45d67597ef66e29c3fb36273465efd4d210872e8&lt;br /&gt;
          9 |          9 |          0 |        749 | c0e5fae6857b04d035f8fd752505da9f3905514c4d8dcec7650a18e43af61fa5&lt;br /&gt;
         10 |         10 |          0 |     968242 | f431111e6b9a9a074daf350cba6404abc47a993f7ddca098de1819c62d99a515&lt;br /&gt;
         11 |         12 |          0 |    1048576 | de5005242d69024d356dd4d03d8a6879a2b8aadfff6e0f5f80900d45d363305e&lt;br /&gt;
         12 |         12 |    1048576 |     409600 | 93207dec95a267dab702bb1991af9dd9c5df72ba9d87188d37772282d3948632&lt;br /&gt;
         13 |         13 |          0 |       2418 | 679b9963d906d7ecbf818b3e73d8654bd4944d7bd35d6c7c4fde7223bd8d664b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
ROOT HASH: 46fbb99db26d751c5b2b9f35778d5800bae74566f968388fbb0b36226a488416&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;The manifest consists of a file table listing all files and their hashes, and a chunk table enumerating the hashes of the chunks that make up the individual files. Concretely, our state contains four canisters with the IDs&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;code&amp;gt;fs35c-jyaaa-aaaab-qaaaa-cai&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;00000000003000000101&amp;lt;/code&amp;gt;),&lt;br /&gt;
*&amp;lt;code&amp;gt;fv23w-eaaaa-aaaab-qaaaq-cai&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;00000000003000010101&amp;lt;/code&amp;gt;),&lt;br /&gt;
*&amp;lt;code&amp;gt;f4zqk-siaaa-aaaab-qaaba-cai&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;00000000003000020101&amp;lt;/code&amp;gt;) and&lt;br /&gt;
*&amp;lt;code&amp;gt;f3yw6-7qaaa-aaaab-qaabq-cai&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;00000000003000030101&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
As we can see in the manifest, the first canister doesn’t have a WASM installed and therefore only has a &amp;lt;code&amp;gt;canister.pbuf&amp;lt;/code&amp;gt; file, while the others also have all the files that are specific to canisters with installed WASMs.&lt;br /&gt;
&lt;br /&gt;
Finally, there is a root hash that covers all the files and chunks (via their hashes). For a correct set of artifacts, the root hash should match the root hash in the CUP printed by the verification tool above; as we can see, the root hashes match for our sample artifacts. More details about how the state is hashed to obtain the manifest can be found [https://github.com/dfinity/ic/blob/a4191e757dbc581cbc8cef9160aea1e852218a1a/rs/types/types/src/state_sync.rs#L3 here].&lt;br /&gt;
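As an aside, the chunking structure is visible in the chunk table above: files are split into chunks of at most 1 MiB (the chunk offsets are multiples of 1048576). The following Python sketch illustrates only this chunk-then-hash structure; the authoritative hashing (including domain separation and the exact derivation of the file and root hashes) is defined in the linked state_sync.rs, so the digests produced by this sketch will not match the ones in the manifest.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB, matching the chunk offsets in the table above

def chunk_hashes(data: bytes) -> list[str]:
    # Split a file's bytes into 1 MiB chunks and hash each chunk.
    # An empty file still yields one (empty) chunk, mirroring the
    # zero-size stable_memory.bin entries above, which do have a hash.
    return [
        hashlib.sha256(data[off : off + CHUNK_SIZE]).hexdigest()
        for off in range(0, max(len(data), 1), CHUNK_SIZE)
    ]

def file_hash(data: bytes) -> str:
    # Illustrative only: derive a file-level digest from the chunk hashes.
    h = hashlib.sha256()
    for ch in chunk_hashes(data):
        h.update(bytes.fromhex(ch))
    return h.hexdigest()

# A file slightly larger than 1 MiB produces two chunks, like the
# 1093632-byte vmemory_0.bin (file idx 4) above.
demo = b"\x00" * (CHUNK_SIZE + 45056)
print(len(chunk_hashes(demo)))  # 2
```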
&lt;br /&gt;
Now, recall that the hashes in the parent manifest and the subnet splitting parameters fully determine the manifests of the child subnets. This means that one can use the state tool to compute the expected manifests and root hashes for the child subnets. &lt;br /&gt;
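To make "fully determine" concrete, here is a hypothetical Python sketch of the routing rule (the function and subnet labels are ours, not the tool's): every file under canister_states/ goes to exactly one child depending on whether its canister ID is in the migrated ranges, while subnet-wide files stay on the source subnet.

```python
# Hypothetical illustration of the partition rule; the authoritative logic
# lives in the state tool's split_manifest command.
def route(path: str, migrated: set[str]) -> str:
    """Decide which child subnet keeps a checkpoint file."""
    if path.startswith("canister_states/"):
        canister_id = path.split("/")[1]
        return "child_B" if canister_id in migrated else "child_A"
    # Subnet-wide files (e.g. system_metadata.pbuf) remain on the source
    # subnet; the split-off subnet gets freshly initialized replacements.
    return "child_A"

# The two canisters split off in our example.
migrated = {"00000000003000020101", "00000000003000030101"}
print(route("canister_states/00000000003000010101/software.wasm", migrated))  # child_A
print(route("canister_states/00000000003000020101/canister.pbuf", migrated))  # child_B
print(route("system_metadata.pbuf", migrated))  # child_A
```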
&lt;br /&gt;
To continue our example, let us assume we want to split off the last two canisters onto a subnet with the following ID:&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;code&amp;gt;ykq2b-vnsfx-dwzwl-tvwli-ubd4d-lr6br-hpnqf-mettk-jrgtc-n5ioc-mqe&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We can compute the expected manifests with the &amp;lt;code&amp;gt;split_manifest&amp;lt;/code&amp;gt; command of the state tool. Besides the canister ID ranges we want to split off, we also need to provide the subnet IDs of the subnets involved in the split, the subnet type, and the batch time in nanoseconds. The latter should be set to the batch time from the CUP, as printed in the verification step above.&amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
bazel run //rs/state_tool:state-tool \&lt;br /&gt;
  --config local \&lt;br /&gt;
  -- split_manifest \&lt;br /&gt;
  --path /tmp/splitting-artifacts/demo_source_manifest.txt \&lt;br /&gt;
  --from-subnet gmmtp-v5kpc-wdkrx-pni4q-6qwld-ntwrv-kqxva-c3rnm-o6lx3-2eg74-oae \&lt;br /&gt;
  --to-subnet ykq2b-vnsfx-dwzwl-tvwli-ubd4d-lr6br-hpnqf-mettk-jrgtc-n5ioc-mqe \&lt;br /&gt;
  --subnet-type application \&lt;br /&gt;
  --batch-time-nanos 1692696209206708237 \&lt;br /&gt;
  --migrated-ranges f4zqk-siaaa-aaaab-qaaba-cai:f3yw6-7qaaa-aaaab-qaabq-cai&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;This will output the two manifests. For better readability, we present and discuss them individually. The manifest for the first subnet (A’) looks as follows:&amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
Subnet gmmtp-v5kpc-wdkrx-pni4q-6qwld-ntwrv-kqxva-c3rnm-o6lx3-2eg74-oae&lt;br /&gt;
--------&lt;br /&gt;
MANIFEST VERSION: V3&lt;br /&gt;
FILE TABLE&lt;br /&gt;
    idx     |    size    |                               hash                               |                         path                         &lt;br /&gt;
------------+------------+------------------------------------------------------------------+------------------------------------------------------&lt;br /&gt;
          0 |        216 | 8feafcda70181ac8920dfeba5db951f931f87758ecd5dcd7c920a8468a5bacf9 | canister_states/00000000003000000101/canister.pbuf&lt;br /&gt;
          1 |        532 | cfa0f9a9f833b072aa0c8d1f6eb8bb9334a36168379eaa0925d9013b2d646bbb | canister_states/00000000003000010101/canister.pbuf&lt;br /&gt;
          2 |     459789 | 4840e8e6e379f2108cec9de7ef386ea5f927f687b121b90ae765ad144bc0222a | canister_states/00000000003000010101/software.wasm&lt;br /&gt;
          3 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000010101/stable_memory.bin&lt;br /&gt;
          4 |    1093632 | 1ce90b6ed4c455e1c51c4c2ff466d3c6d49c76a585a014d61eebae6c09b0980b | canister_states/00000000003000010101/vmemory_0.bin&lt;br /&gt;
          5 |         35 | f590329cc74af3afc36daf577bf29136bc81279401a37175f8a9404277cce4b0 | split_from.pbuf&lt;br /&gt;
          6 |       2418 | b18a59a56442ddd68d9fcc6fb0103f12c0c0bfd0d01c473711d53b2aff18d596 | system_metadata.pbuf&lt;br /&gt;
CHUNK TABLE&lt;br /&gt;
    idx     |  file_idx  |   offset   |    size    |                               hash                               &lt;br /&gt;
------------+------------+------------+------------+------------------------------------------------------------------&lt;br /&gt;
          0 |          0 |          0 |        216 | 0c6f866e1b60969cc48b4015bb382d45c282b46a82dd24769a15e9628e4d2f62&lt;br /&gt;
          1 |          1 |          0 |        532 | 3b53a0d9cb9e0131a4677d86c237ed06a1aadaf46c480e448a8d78afe9c43635&lt;br /&gt;
          2 |          2 |          0 |     459789 | c470eb0871e137fccfb419b16e2826a63c2a1dba913f5c260ab22ee2958afd28&lt;br /&gt;
          3 |          4 |          0 |    1048576 | de5005242d69024d356dd4d03d8a6879a2b8aadfff6e0f5f80900d45d363305e&lt;br /&gt;
          4 |          4 |    1048576 |      45056 | 1ab510085a2a3090dd1d49c578fceabd56a6433bd75e7e2ecc3107c8f8f208eb&lt;br /&gt;
          5 |          5 |          0 |         35 | 51d26533676036bda303d066119cd7499bd39acc5d2d7b420f8c6d95ff7dd79b&lt;br /&gt;
          6 |          6 |          0 |       2418 | 679b9963d906d7ecbf818b3e73d8654bd4944d7bd35d6c7c4fde7223bd8d664b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
ROOT HASH: 8cb1880663d42b8e6c32cc894e7b7bc9304fc0f72bc89db251a81fe15256c1e8&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;We can see that all the files and chunks that are also present in the parent manifest have matching hashes. In addition, we have a new &amp;lt;code&amp;gt;split_from.pbuf&amp;lt;/code&amp;gt; file that carries metadata about the split. It is introduced by the splitting logic and computed from the input parameters (see [https://github.com/dfinity/ic/blob/a4191e757dbc581cbc8cef9160aea1e852218a1a/rs/state_tool/src/commands/split_manifest.rs#L23 here]). Note that this demo state has an empty ingress history and so doesn’t contain a file corresponding to the ingress history. If it did, the ingress history’s hash would have to match the one in the original manifest (the ingress history is only pruned once the split subnets start making progress again). The manifest for the second subnet (B) looks as follows:&amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
Subnet ykq2b-vnsfx-dwzwl-tvwli-ubd4d-lr6br-hpnqf-mettk-jrgtc-n5ioc-mqe&lt;br /&gt;
--------&lt;br /&gt;
MANIFEST VERSION: V3&lt;br /&gt;
FILE TABLE&lt;br /&gt;
    idx     |    size    |                               hash                               |                         path                         &lt;br /&gt;
------------+------------+------------------------------------------------------------------+------------------------------------------------------&lt;br /&gt;
          0 |       4721 | 346f8b36205998ef3c3c70d5ef64f27096714c83925d98f9e7daf93bf71bd66b | canister_states/00000000003000020101/canister.pbuf&lt;br /&gt;
          1 |     324196 | 73b2c88c9db040794b23535ecd57fa14528d31ada4d9a82fdb6e9cd21cc0c1e6 | canister_states/00000000003000020101/software.wasm&lt;br /&gt;
          2 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000020101/stable_memory.bin&lt;br /&gt;
          3 |    1245184 | 0779cd1b7b5afcac3a2e19070abab88bd637261965c36dd1334928980e7a3744 | canister_states/00000000003000020101/vmemory_0.bin&lt;br /&gt;
          4 |        749 | 493d2f01070bccb63bf75764bcdbc42b1a6946e191029afd74353f339939f51e | canister_states/00000000003000030101/canister.pbuf&lt;br /&gt;
          5 |     968242 | 912f20ef0b5769a55316b7463cc1d55f079a2e91d1d358a5524a28c0782e82b1 | canister_states/00000000003000030101/software.wasm&lt;br /&gt;
          6 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000030101/stable_memory.bin&lt;br /&gt;
          7 |    1458176 | 26f9c82ae827458e84befaa73c42f0cbcf39a2c9dae0c777aa26e7fd6a75db62 | canister_states/00000000003000030101/vmemory_0.bin&lt;br /&gt;
          8 |         35 | f590329cc74af3afc36daf577bf29136bc81279401a37175f8a9404277cce4b0 | split_from.pbuf&lt;br /&gt;
          9 |         77 | c29f819c86ba8fe211a272961e15dbe1c87e801619c9391dda36f83fbe4192f2 | system_metadata.pbuf&lt;br /&gt;
CHUNK TABLE&lt;br /&gt;
    idx     |  file_idx  |   offset   |    size    |                               hash                               &lt;br /&gt;
------------+------------+------------+------------+------------------------------------------------------------------&lt;br /&gt;
          0 |          0 |          0 |       4721 | 81af05e10350340c340db100e2b4c1dddc3a7f800ea573ddcef28d562fa70a10&lt;br /&gt;
          1 |          1 |          0 |     324196 | accdd1e66c56db5b5ff0ac24182a10f65559cfd17cc52d86ac3aa6cae5d567b4&lt;br /&gt;
          2 |          3 |          0 |    1048576 | a5d9f9ce12da611d7b170fc3156f2dc1255ee11fb6525ddc4dbc3cb7c5f488e5&lt;br /&gt;
          3 |          3 |    1048576 |     196608 | 9f0bd49a6523364cbc57561e45d67597ef66e29c3fb36273465efd4d210872e8&lt;br /&gt;
          4 |          4 |          0 |        749 | c0e5fae6857b04d035f8fd752505da9f3905514c4d8dcec7650a18e43af61fa5&lt;br /&gt;
          5 |          5 |          0 |     968242 | f431111e6b9a9a074daf350cba6404abc47a993f7ddca098de1819c62d99a515&lt;br /&gt;
          6 |          7 |          0 |    1048576 | de5005242d69024d356dd4d03d8a6879a2b8aadfff6e0f5f80900d45d363305e&lt;br /&gt;
          7 |          7 |    1048576 |     409600 | 93207dec95a267dab702bb1991af9dd9c5df72ba9d87188d37772282d3948632&lt;br /&gt;
          8 |          8 |          0 |         35 | 51d26533676036bda303d066119cd7499bd39acc5d2d7b420f8c6d95ff7dd79b&lt;br /&gt;
          9 |          9 |          0 |         77 | 4c7c38987b8ef245c68643cdc0ae519b8640f4bd260245e2c5b2f3a070abe789&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
ROOT HASH: 00bcb78b825ce4146ce42e117d5d7256d2f1e351d974753f8e08b588ed11bcd1&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;Again, the canister-related file and chunk hashes match those in the parent manifest. It also contains a &amp;lt;code&amp;gt;split_from.pbuf&amp;lt;/code&amp;gt; file with the same hash as the one in the sibling subnet’s manifest. Finally, it contains a newly created system metadata file, initialized based on the parameters passed to the splitting tool but otherwise empty.&lt;br /&gt;
&lt;br /&gt;
If all the above properties hold, and the root hashes of these manifests appear as the root hashes of the states in the respective subnet recovery proposals for the child subnets, then one can conclude that the split is authentic and that, from an authenticity point of view, it is safe to vote in favor of the respective proposals.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Verification_of_Artifacts_in_Subnet_Splitting&amp;diff=6390</id>
		<title>Verification of Artifacts in Subnet Splitting</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Verification_of_Artifacts_in_Subnet_Splitting&amp;diff=6390"/>
		<updated>2023-09-01T07:36:33Z</updated>

		<summary type="html">&lt;p&gt;David: Fix link&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page describes how to verify the artifacts during the [[subnet splitting]] process. Note that when we point to source code within this article, we link to a specific version of the code so that the links don&#039;t break if the code changes in the future. It is advisable to also locate the referenced code in the most recent version and/or the version used to verify the artifacts, to avoid missing changes made after the publication of this article.&lt;br /&gt;
&lt;br /&gt;
The first step is to obtain a vanilla clone of the [https://github.com/dfinity/ic/tree/master/rs IC repository] and check out the commit corresponding to the version currently rolled out on mainnet. Once this is done, one can enter the dev container via &amp;lt;code&amp;gt;./gitlab-ci/container/container-run.sh&amp;lt;/code&amp;gt;. From within the container, one should be able to run the commands provided below.&lt;br /&gt;
&lt;br /&gt;
===Splitting the State===&lt;br /&gt;
The subnet state is split in such a way that the content of most of the individual checkpoint files does not change, so the file and chunk hashes in the manifest can be directly compared to the corresponding ones in the manifest of the original parent subnet. For example, a canister is retained without modification on exactly one of the resulting subnets. There are a handful of subnet (as opposed to canister) state files where this is not possible, e.g., subnet queues. These files, however, remain on the source subnet in their entirety, while the split-off subnet gets new files initialized based on the parameters passed to the splitting logic. So the expected hashes of these new files are also fully determined.&lt;br /&gt;
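This invariant can be phrased as a simple check, sketched below in Python with hypothetical, truncated hashes (the real comparison is done over the full manifests and their chunk tables): every canister file appearing in a child manifest must carry exactly the hash it had in the parent manifest.

```python
def hashes_match_parent(parent: dict[str, str], child: dict[str, str]) -> bool:
    # Every canister file in a child manifest must appear unchanged
    # (same path, same hash) in the parent manifest. Subnet-wide files
    # such as split_from.pbuf are newly created and are not compared here.
    for path, digest in child.items():
        if path.startswith("canister_states/") and parent.get(path) != digest:
            return False
    return True

# Hypothetical file tables with truncated hashes, for illustration only.
parent = {"canister_states/a/canister.pbuf": "8fea", "system_metadata.pbuf": "b18a"}
child = {"canister_states/a/canister.pbuf": "8fea", "split_from.pbuf": "f590"}
print(hashes_match_parent(parent, child))  # True
```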
&lt;br /&gt;
The ingress history is a special case: because ingress messages are essentially tied to their respective target canisters, both child subnets need parts of the ingress history. To avoid breaking end-to-end verifiability here, the full ingress history of the parent subnet is preserved on both child subnets, keeping the hash of the ingress history file the same on the parent and both children. The unnecessary entries are then pruned at the beginning of the first execution round of each child subnet.&lt;br /&gt;
&lt;br /&gt;
===Subnet Splitting Artifacts===&lt;br /&gt;
The artifacts needed to verify a subnet split are listed below. Artifacts 2-5 will be published (presumably on the forum) during a subnet split so that the community can factor them into their decision on whether to accept the related proposals.&lt;br /&gt;
&lt;br /&gt;
#The NNS public key. &#039;&#039;&#039;This key serves as the root of trust of the validation and will not be distributed as part of the artifacts.&#039;&#039;&#039; We recommend that everyone who wants to verify the artifacts obtains this key out of band. For example, the agents and the Rosetta node have the mainnet NNS public key baked in, and one can obtain it from the respective repositories on GitHub.&lt;br /&gt;
#A certificate from the NNS that includes the public key of the subnet to be split. This allows one to verify the authenticity of the subnet public key relative to the NNS public key.&lt;br /&gt;
#The subnet ID of the subnet to be split. If the certificate from the previous step is deemed valid relative to the NNS public key, this ID can be used to extract the subnet’s public key from the certificate.&lt;br /&gt;
#The CUP of the subnet to be split at the height where it was halted. The CUP signature can be verified using the subnet public key from the previous step and, if verification succeeds, the authentic root hash of the state can be extracted.&lt;br /&gt;
#A textual representation of the manifest of the subnet to be split at the height where it was halted. One can recompute the root hash from the file and chunk hashes in the textual representation of the manifest using the state tool. If the recomputed root hash matches the one extracted from the CUP in the previous step, one knows that the file and chunk hashes in the manifest must be authentic and can hence be used as a basis of comparison, without knowing the actual file contents, i.e., just based on the hashes.&lt;br /&gt;
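The comparison in step 5 ultimately boils down to checking that two hex digests are equal, sketched below in Python (the variable names are ours; the demo value is the state hash printed by the validation tool on this page):

```python
import hmac

def root_hash_matches(recomputed_hex: str, cup_hex: str) -> bool:
    # Compare the root hash recomputed from the textual manifest with the
    # state hash extracted from the validated CUP. Both values are public,
    # so a constant-time comparison is not strictly required, but it is a
    # reasonable default for comparing digests.
    return hmac.compare_digest(recomputed_hex.lower(), cup_hex.lower())

cup_hash = "46fbb99db26d751c5b2b9f35778d5800bae74566f968388fbb0b36226a488416"
print(root_hash_matches(cup_hash, cup_hash))  # True
```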
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The verification of the artifacts can be triggered via the &amp;lt;code&amp;gt;validate&amp;lt;/code&amp;gt; subcommand of the subnet splitting tool. We have prepared demo artifacts obtained from a split carried out on a testnet. They can be downloaded here. In the commands below we assume that you download the artifacts into the &amp;lt;code&amp;gt;/tmp/splitting-artifacts&amp;lt;/code&amp;gt; directory.&amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
bazel run //rs/recovery/subnet_splitting:subnet-splitting-tool \&lt;br /&gt;
  --config local \&lt;br /&gt;
  -- validate \&lt;br /&gt;
  --nns-public-key-path /tmp/splitting-artifacts/demo_nns_public_key.pem \&lt;br /&gt;
  --state-tree-path /tmp/splitting-artifacts/demo_state_tree.cbor \&lt;br /&gt;
  --source-subnet-id gmmtp-v5kpc-wdkrx-pni4q-6qwld-ntwrv-kqxva-c3rnm-o6lx3-2eg74-oae \&lt;br /&gt;
  --cup-path /tmp/splitting-artifacts/demo_CUP.pbuf \&lt;br /&gt;
  --state-manifest-path /tmp/splitting-artifacts/demo_source_manifest.txt&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;This is the expected output upon a successful validation.&amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
Aug 31 08:15:05.954 INFO Validating State Tree signed by the NNS&lt;br /&gt;
Aug 31 08:15:05.954 INFO Reading the NNS public key from /tmp/splitting-artifacts/demo_nns_public_key.pem&lt;br /&gt;
Aug 31 08:15:05.957 INFO Validation succeeded: extracted authentic subnet key from the NNS state tree.&lt;br /&gt;
Aug 31 08:15:05.957 INFO &lt;br /&gt;
Aug 31 08:15:05.957 INFO Validating Source Subnet&#039;s original CUP&lt;br /&gt;
Aug 31 08:15:05.960 INFO Dealer subnet from the CUP: gmmtp-v5kpc-wdkrx-pni4q-6qwld-ntwrv-kqxva-c3rnm-o6lx3-2eg74-oae&lt;br /&gt;
Aug 31 08:15:05.960 INFO CUP height: 82600&lt;br /&gt;
Aug 31 08:15:05.960 INFO Block time from the CUP: 2023-08-22 09:23:29.206708237 UTC (nanos since unix epoch: 1692696209206708237)&lt;br /&gt;
Aug 31 08:15:05.960 INFO State hash from the CUP: 46fbb99db26d751c5b2b9f35778d5800bae74566f968388fbb0b36226a488416&lt;br /&gt;
Aug 31 08:15:05.967 INFO Validation succeeded: source subnet CUP signature is valid.&lt;br /&gt;
Aug 31 08:15:05.967 INFO &lt;br /&gt;
Aug 31 08:15:05.967 INFO Validating Source Subnet&#039;s original state manifest&lt;br /&gt;
Aug 31 08:15:05.967 INFO state hash from the CUP: 46fbb99db26d751c5b2b9f35778d5800bae74566f968388fbb0b36226a488416&lt;br /&gt;
Aug 31 08:15:05.967 INFO state hash from the State Manifest: 46fbb99db26d751c5b2b9f35778d5800bae74566f968388fbb0b36226a488416&lt;br /&gt;
Aug 31 08:15:05.967 INFO Validation succeeded: recomputed manifest root hash matches the one in the CUP.&lt;br /&gt;
Aug 31 08:15:05.967 INFO&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Computing the expected manifests and root hashes===&lt;br /&gt;
If the verification above succeeds, one can split the authentic manifest of the parent subnet to compute the expected manifests for the child subnets. Before we start with the actual procedure, let’s look at the example manifest of the source subnet in our testnet setting. &amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
MANIFEST VERSION: V3&lt;br /&gt;
FILE TABLE&lt;br /&gt;
    idx     |    size    |                               hash                               |                         path                         &lt;br /&gt;
------------+------------+------------------------------------------------------------------+------------------------------------------------------&lt;br /&gt;
          0 |        216 | 8feafcda70181ac8920dfeba5db951f931f87758ecd5dcd7c920a8468a5bacf9 | canister_states/00000000003000000101/canister.pbuf&lt;br /&gt;
          1 |        532 | cfa0f9a9f833b072aa0c8d1f6eb8bb9334a36168379eaa0925d9013b2d646bbb | canister_states/00000000003000010101/canister.pbuf&lt;br /&gt;
          2 |     459789 | 4840e8e6e379f2108cec9de7ef386ea5f927f687b121b90ae765ad144bc0222a | canister_states/00000000003000010101/software.wasm&lt;br /&gt;
          3 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000010101/stable_memory.bin&lt;br /&gt;
          4 |    1093632 | 1ce90b6ed4c455e1c51c4c2ff466d3c6d49c76a585a014d61eebae6c09b0980b | canister_states/00000000003000010101/vmemory_0.bin&lt;br /&gt;
          5 |       4721 | 346f8b36205998ef3c3c70d5ef64f27096714c83925d98f9e7daf93bf71bd66b | canister_states/00000000003000020101/canister.pbuf&lt;br /&gt;
          6 |     324196 | 73b2c88c9db040794b23535ecd57fa14528d31ada4d9a82fdb6e9cd21cc0c1e6 | canister_states/00000000003000020101/software.wasm&lt;br /&gt;
          7 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000020101/stable_memory.bin&lt;br /&gt;
          8 |    1245184 | 0779cd1b7b5afcac3a2e19070abab88bd637261965c36dd1334928980e7a3744 | canister_states/00000000003000020101/vmemory_0.bin&lt;br /&gt;
          9 |        749 | 493d2f01070bccb63bf75764bcdbc42b1a6946e191029afd74353f339939f51e | canister_states/00000000003000030101/canister.pbuf&lt;br /&gt;
         10 |     968242 | 912f20ef0b5769a55316b7463cc1d55f079a2e91d1d358a5524a28c0782e82b1 | canister_states/00000000003000030101/software.wasm&lt;br /&gt;
         11 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000030101/stable_memory.bin&lt;br /&gt;
         12 |    1458176 | 26f9c82ae827458e84befaa73c42f0cbcf39a2c9dae0c777aa26e7fd6a75db62 | canister_states/00000000003000030101/vmemory_0.bin&lt;br /&gt;
         13 |       2418 | b18a59a56442ddd68d9fcc6fb0103f12c0c0bfd0d01c473711d53b2aff18d596 | system_metadata.pbuf&lt;br /&gt;
CHUNK TABLE&lt;br /&gt;
    idx     |  file_idx  |   offset   |    size    |                               hash                               &lt;br /&gt;
------------+------------+------------+------------+------------------------------------------------------------------&lt;br /&gt;
          0 |          0 |          0 |        216 | 0c6f866e1b60969cc48b4015bb382d45c282b46a82dd24769a15e9628e4d2f62&lt;br /&gt;
          1 |          1 |          0 |        532 | 3b53a0d9cb9e0131a4677d86c237ed06a1aadaf46c480e448a8d78afe9c43635&lt;br /&gt;
          2 |          2 |          0 |     459789 | c470eb0871e137fccfb419b16e2826a63c2a1dba913f5c260ab22ee2958afd28&lt;br /&gt;
          3 |          4 |          0 |    1048576 | de5005242d69024d356dd4d03d8a6879a2b8aadfff6e0f5f80900d45d363305e&lt;br /&gt;
          4 |          4 |    1048576 |      45056 | 1ab510085a2a3090dd1d49c578fceabd56a6433bd75e7e2ecc3107c8f8f208eb&lt;br /&gt;
          5 |          5 |          0 |       4721 | 81af05e10350340c340db100e2b4c1dddc3a7f800ea573ddcef28d562fa70a10&lt;br /&gt;
          6 |          6 |          0 |     324196 | accdd1e66c56db5b5ff0ac24182a10f65559cfd17cc52d86ac3aa6cae5d567b4&lt;br /&gt;
          7 |          8 |          0 |    1048576 | a5d9f9ce12da611d7b170fc3156f2dc1255ee11fb6525ddc4dbc3cb7c5f488e5&lt;br /&gt;
          8 |          8 |    1048576 |     196608 | 9f0bd49a6523364cbc57561e45d67597ef66e29c3fb36273465efd4d210872e8&lt;br /&gt;
          9 |          9 |          0 |        749 | c0e5fae6857b04d035f8fd752505da9f3905514c4d8dcec7650a18e43af61fa5&lt;br /&gt;
         10 |         10 |          0 |     968242 | f431111e6b9a9a074daf350cba6404abc47a993f7ddca098de1819c62d99a515&lt;br /&gt;
         11 |         12 |          0 |    1048576 | de5005242d69024d356dd4d03d8a6879a2b8aadfff6e0f5f80900d45d363305e&lt;br /&gt;
         12 |         12 |    1048576 |     409600 | 93207dec95a267dab702bb1991af9dd9c5df72ba9d87188d37772282d3948632&lt;br /&gt;
         13 |         13 |          0 |       2418 | 679b9963d906d7ecbf818b3e73d8654bd4944d7bd35d6c7c4fde7223bd8d664b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
ROOT HASH: 46fbb99db26d751c5b2b9f35778d5800bae74566f968388fbb0b36226a488416&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;The manifest consists of a file table listing all files and their hashes, and a chunk table enumerating the hashes of the chunks that make up the individual files. Concretely, our state contains four canisters with the IDs&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;code&amp;gt;fs35c-jyaaa-aaaab-qaaaa-cai&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;00000000003000000101&amp;lt;/code&amp;gt;),&lt;br /&gt;
*&amp;lt;code&amp;gt;fv23w-eaaaa-aaaab-qaaaq-cai&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;00000000003000010101&amp;lt;/code&amp;gt;),&lt;br /&gt;
*&amp;lt;code&amp;gt;f4zqk-siaaa-aaaab-qaaba-cai&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;00000000003000020101&amp;lt;/code&amp;gt;) and&lt;br /&gt;
*&amp;lt;code&amp;gt;f3yw6-7qaaa-aaaab-qaabq-cai&amp;lt;/code&amp;gt; (&amp;lt;code&amp;gt;00000000003000030101&amp;lt;/code&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
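The correspondence between the textual IDs and the hexadecimal directory names can be checked with a short script (a sketch assuming the standard principal text encoding: lowercase base32 of a 4-byte big-endian CRC32 checksum followed by the raw ID bytes, written in dash-separated groups of five characters):&lt;br /&gt;

```python
import base64
import binascii

def principal_to_hex(text: str) -> str:
    """Decode a textual canister ID to the hex form used in checkpoint paths.

    Sketch based on the documented principal encoding:
    base32(crc32(id_bytes) . id_bytes), lowercase, in groups of five characters.
    """
    s = text.replace("-", "").upper()
    s += "=" * (-len(s) % 8)  # restore the base32 padding stripped from the text form
    raw = base64.b32decode(s)
    checksum, body = raw[:4], raw[4:]
    if checksum != binascii.crc32(body).to_bytes(4, "big"):
        raise ValueError("checksum mismatch")
    return body.hex()
```

For example, decoding f4zqk-siaaa-aaaab-qaaba-cai yields the hex ID 00000000003000020101 listed above.&lt;br /&gt;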
As we can see in the manifest, the first canister doesn’t have a WASM module installed and therefore only has a canister.pbuf file, while the others do, and hence also have all the other files that are specific to canisters with installed WASM modules. &lt;br /&gt;
&lt;br /&gt;
Finally, there is also a root hash that covers all the files and chunks (via their hashes). For a correct set of artifacts, the root hash should match the root hash in the CUP printed by the verification tool above. As we can see, we have a root hash match for our sample artifacts. More details about how the state is hashed to obtain the manifest can be found [https://github.com/dfinity/ic/blob/a4191e757dbc581cbc8cef9160aea1e852218a1a/rs/types/types/src/state_sync.rs#L3 here].&lt;br /&gt;
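To build intuition for this structure, here is a toy sketch (not the real algorithm; the production scheme in the linked state_sync.rs code uses a different, domain-separated encoding) of how chunk hashes roll up into file hashes and a root hash:&lt;br /&gt;

```python
import hashlib

CHUNK_SIZE = 1048576  # 1 MiB, the chunk size visible in the tables above

def toy_manifest(files):
    """files: dict mapping path to bytes; returns (file_table, root_hash).

    Illustrative only: chunk hashes feed into per-file hashes, which feed
    into a single root hash, so changing any chunk changes the root.
    """
    file_table = []
    for path in sorted(files):
        data = files[path]
        chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]
        chunk_hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
        # the file hash commits to all of the file's chunk hashes
        file_hash = hashlib.sha256("".join(chunk_hashes).encode()).hexdigest()
        file_table.append((path, len(data), file_hash))
    # the root hash commits to the entire file table
    root = hashlib.sha256(repr(file_table).encode()).hexdigest()
    return file_table, root
```

Because every chunk hash feeds (via the file table) into the root hash, modifying any single chunk changes the root hash, which is what makes the root hash sufficient for integrity verification.&lt;br /&gt;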
&lt;br /&gt;
Now, recall that the hashes in the parent manifest and the subnet splitting parameters fully determine the manifests of the child subnets. This means that one can use the state tool to compute the expected manifests and root hashes for the child subnets. &lt;br /&gt;
&lt;br /&gt;
To continue our example, let us assume we want to split off the last two canisters onto a subnet with the following ID:&lt;br /&gt;
&lt;br /&gt;
*&amp;lt;code&amp;gt;ykq2b-vnsfx-dwzwl-tvwli-ubd4d-lr6br-hpnqf-mettk-jrgtc-n5ioc-mqe&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We can compute the expected manifests with the &amp;lt;code&amp;gt;split_manifest&amp;lt;/code&amp;gt; command of the state tool. Besides the ranges we want to split off, we also need to provide the subnet IDs of the subnets involved in the split, the subnet type, and the batch time in nanoseconds. The latter should be set to the batch time in the CUP as printed in the verification step above. &amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
bazel run //rs/state_tool:state-tool \&lt;br /&gt;
  --config local \&lt;br /&gt;
  -- split_manifest \&lt;br /&gt;
  --path /tmp/splitting-artifacts/demo_source_manifest.txt \&lt;br /&gt;
  --from-subnet gmmtp-v5kpc-wdkrx-pni4q-6qwld-ntwrv-kqxva-c3rnm-o6lx3-2eg74-oae \&lt;br /&gt;
  --to-subnet ykq2b-vnsfx-dwzwl-tvwli-ubd4d-lr6br-hpnqf-mettk-jrgtc-n5ioc-mqe \&lt;br /&gt;
  --subnet-type application \&lt;br /&gt;
  --batch-time-nanos 1692696209206708237 \&lt;br /&gt;
  --migrated-ranges f4zqk-siaaa-aaaab-qaaba-cai:f3yw6-7qaaa-aaaab-qaabq-cai&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;This will output the two manifests. For better readability we present and discuss them individually. The manifest for the first subnet A’ looks as follows:&amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
Subnet gmmtp-v5kpc-wdkrx-pni4q-6qwld-ntwrv-kqxva-c3rnm-o6lx3-2eg74-oae&lt;br /&gt;
--------&lt;br /&gt;
MANIFEST VERSION: V3&lt;br /&gt;
FILE TABLE&lt;br /&gt;
    idx     |    size    |                               hash                               |                         path                         &lt;br /&gt;
------------+------------+------------------------------------------------------------------+------------------------------------------------------&lt;br /&gt;
          0 |        216 | 8feafcda70181ac8920dfeba5db951f931f87758ecd5dcd7c920a8468a5bacf9 | canister_states/00000000003000000101/canister.pbuf&lt;br /&gt;
          1 |        532 | cfa0f9a9f833b072aa0c8d1f6eb8bb9334a36168379eaa0925d9013b2d646bbb | canister_states/00000000003000010101/canister.pbuf&lt;br /&gt;
          2 |     459789 | 4840e8e6e379f2108cec9de7ef386ea5f927f687b121b90ae765ad144bc0222a | canister_states/00000000003000010101/software.wasm&lt;br /&gt;
          3 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000010101/stable_memory.bin&lt;br /&gt;
          4 |    1093632 | 1ce90b6ed4c455e1c51c4c2ff466d3c6d49c76a585a014d61eebae6c09b0980b | canister_states/00000000003000010101/vmemory_0.bin&lt;br /&gt;
          5 |         35 | f590329cc74af3afc36daf577bf29136bc81279401a37175f8a9404277cce4b0 | split_from.pbuf&lt;br /&gt;
          6 |       2418 | b18a59a56442ddd68d9fcc6fb0103f12c0c0bfd0d01c473711d53b2aff18d596 | system_metadata.pbuf&lt;br /&gt;
CHUNK TABLE&lt;br /&gt;
    idx     |  file_idx  |   offset   |    size    |                               hash                               &lt;br /&gt;
------------+------------+------------+------------+------------------------------------------------------------------&lt;br /&gt;
          0 |          0 |          0 |        216 | 0c6f866e1b60969cc48b4015bb382d45c282b46a82dd24769a15e9628e4d2f62&lt;br /&gt;
          1 |          1 |          0 |        532 | 3b53a0d9cb9e0131a4677d86c237ed06a1aadaf46c480e448a8d78afe9c43635&lt;br /&gt;
          2 |          2 |          0 |     459789 | c470eb0871e137fccfb419b16e2826a63c2a1dba913f5c260ab22ee2958afd28&lt;br /&gt;
          3 |          4 |          0 |    1048576 | de5005242d69024d356dd4d03d8a6879a2b8aadfff6e0f5f80900d45d363305e&lt;br /&gt;
          4 |          4 |    1048576 |      45056 | 1ab510085a2a3090dd1d49c578fceabd56a6433bd75e7e2ecc3107c8f8f208eb&lt;br /&gt;
          5 |          5 |          0 |         35 | 51d26533676036bda303d066119cd7499bd39acc5d2d7b420f8c6d95ff7dd79b&lt;br /&gt;
          6 |          6 |          0 |       2418 | 679b9963d906d7ecbf818b3e73d8654bd4944d7bd35d6c7c4fde7223bd8d664b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
ROOT HASH: 8cb1880663d42b8e6c32cc894e7b7bc9304fc0f72bc89db251a81fe15256c1e8&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;We can see that all the files and chunks that are also present in the parent manifest have matching hashes. In addition, we have a new &amp;lt;code&amp;gt;split_from.pbuf&amp;lt;/code&amp;gt; file that carries meta information about the split. It is introduced by the splitting logic and computed from the input parameters (see [https://github.com/dfinity/ic/blob/a4191e757dbc581cbc8cef9160aea1e852218a1a/rs/state_tool/src/commands/split_manifest.rs#L23 here]). Note that this demo state has an empty ingress history and therefore doesn’t contain a file corresponding to the ingress history. But if it did, the ingress history’s hash should match the one in the original manifest (the ingress history is only pruned once the split subnets start making progress again). The manifest for the second subnet (B) looks as follows:&amp;lt;syntaxhighlight lang=&amp;quot;shell-session&amp;quot;&amp;gt;&lt;br /&gt;
Subnet ykq2b-vnsfx-dwzwl-tvwli-ubd4d-lr6br-hpnqf-mettk-jrgtc-n5ioc-mqe&lt;br /&gt;
--------&lt;br /&gt;
MANIFEST VERSION: V3&lt;br /&gt;
FILE TABLE&lt;br /&gt;
    idx     |    size    |                               hash                               |                         path                         &lt;br /&gt;
------------+------------+------------------------------------------------------------------+------------------------------------------------------&lt;br /&gt;
          0 |       4721 | 346f8b36205998ef3c3c70d5ef64f27096714c83925d98f9e7daf93bf71bd66b | canister_states/00000000003000020101/canister.pbuf&lt;br /&gt;
          1 |     324196 | 73b2c88c9db040794b23535ecd57fa14528d31ada4d9a82fdb6e9cd21cc0c1e6 | canister_states/00000000003000020101/software.wasm&lt;br /&gt;
          2 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000020101/stable_memory.bin&lt;br /&gt;
          3 |    1245184 | 0779cd1b7b5afcac3a2e19070abab88bd637261965c36dd1334928980e7a3744 | canister_states/00000000003000020101/vmemory_0.bin&lt;br /&gt;
          4 |        749 | 493d2f01070bccb63bf75764bcdbc42b1a6946e191029afd74353f339939f51e | canister_states/00000000003000030101/canister.pbuf&lt;br /&gt;
          5 |     968242 | 912f20ef0b5769a55316b7463cc1d55f079a2e91d1d358a5524a28c0782e82b1 | canister_states/00000000003000030101/software.wasm&lt;br /&gt;
          6 |          0 | 981305d7c0b2ace0f53fe822f4075278fa28511e8c34e70f37fd8425af659b36 | canister_states/00000000003000030101/stable_memory.bin&lt;br /&gt;
          7 |    1458176 | 26f9c82ae827458e84befaa73c42f0cbcf39a2c9dae0c777aa26e7fd6a75db62 | canister_states/00000000003000030101/vmemory_0.bin&lt;br /&gt;
          8 |         35 | f590329cc74af3afc36daf577bf29136bc81279401a37175f8a9404277cce4b0 | split_from.pbuf&lt;br /&gt;
          9 |         77 | c29f819c86ba8fe211a272961e15dbe1c87e801619c9391dda36f83fbe4192f2 | system_metadata.pbuf&lt;br /&gt;
CHUNK TABLE&lt;br /&gt;
    idx     |  file_idx  |   offset   |    size    |                               hash                               &lt;br /&gt;
------------+------------+------------+------------+------------------------------------------------------------------&lt;br /&gt;
          0 |          0 |          0 |       4721 | 81af05e10350340c340db100e2b4c1dddc3a7f800ea573ddcef28d562fa70a10&lt;br /&gt;
          1 |          1 |          0 |     324196 | accdd1e66c56db5b5ff0ac24182a10f65559cfd17cc52d86ac3aa6cae5d567b4&lt;br /&gt;
          2 |          3 |          0 |    1048576 | a5d9f9ce12da611d7b170fc3156f2dc1255ee11fb6525ddc4dbc3cb7c5f488e5&lt;br /&gt;
          3 |          3 |    1048576 |     196608 | 9f0bd49a6523364cbc57561e45d67597ef66e29c3fb36273465efd4d210872e8&lt;br /&gt;
          4 |          4 |          0 |        749 | c0e5fae6857b04d035f8fd752505da9f3905514c4d8dcec7650a18e43af61fa5&lt;br /&gt;
          5 |          5 |          0 |     968242 | f431111e6b9a9a074daf350cba6404abc47a993f7ddca098de1819c62d99a515&lt;br /&gt;
          6 |          7 |          0 |    1048576 | de5005242d69024d356dd4d03d8a6879a2b8aadfff6e0f5f80900d45d363305e&lt;br /&gt;
          7 |          7 |    1048576 |     409600 | 93207dec95a267dab702bb1991af9dd9c5df72ba9d87188d37772282d3948632&lt;br /&gt;
          8 |          8 |          0 |         35 | 51d26533676036bda303d066119cd7499bd39acc5d2d7b420f8c6d95ff7dd79b&lt;br /&gt;
          9 |          9 |          0 |         77 | 4c7c38987b8ef245c68643cdc0ae519b8640f4bd260245e2c5b2f3a070abe789&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
ROOT HASH: 00bcb78b825ce4146ce42e117d5d7256d2f1e351d974753f8e08b588ed11bcd1&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;Again, the canister-related file and chunk hashes match those in the parent manifest. This manifest also contains a &amp;lt;code&amp;gt;split_from.pbuf&amp;lt;/code&amp;gt; file with the same hash as the one in the manifest of the sibling subnet. Finally, it contains a newly created system metadata file, initialized based on the parameters passed to the splitting tool but otherwise empty (see here for the code).&lt;br /&gt;
&lt;br /&gt;
If all the above properties hold, and the root hashes of these manifests appear as the root hashes of the states in the respective subnet recovery proposals for the child subnets, then one can conclude that the split is authentic, and it is safe, from an authenticity point of view, to vote in favor of the respective proposals.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting&amp;diff=6387</id>
		<title>Subnet splitting</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting&amp;diff=6387"/>
		<updated>2023-08-31T13:21:26Z</updated>

		<summary type="html">&lt;p&gt;David: Subnet splitting page add text&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Subnet splitting steps.png|alt=Subnet splitting steps|thumb|High level overview of subnet splitting steps]]&lt;br /&gt;
The Internet Computer Protocol (ICP) now supports a minimum viable product (MVP) version of subnet splitting. Subnet splitting is a process where a subset of the canisters from the parent subnet A is split off onto a newly created child subnet B for load-balancing purposes, while the remaining canisters stay on the trimmed-down subnet A’. The MVP process is orchestrated by a series of NNS proposals and, roughly speaking, consists of the steps visualized on the right.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
To understand all the details of subnet splitting we need to take a closer look at some parts of the ICP. &lt;br /&gt;
&lt;br /&gt;
==== Checkpoints and Manifests ====&lt;br /&gt;
Up-to-date nodes persist their state to disk every few hundred rounds (usually every 500 rounds). A state that is persisted to disk by the protocol is called a checkpoint. Each checkpoint is a directory with the following structure (note that not all of the files shown below are necessarily present in every checkpoint; e.g., writing empty files may be skipped).&amp;lt;syntaxhighlight lang=&amp;quot;shell&amp;quot;&amp;gt;&lt;br /&gt;
├── canister_states&lt;br /&gt;
│   ├── &amp;lt;hex(canister_id)&amp;gt;&lt;br /&gt;
│   │  	├── canister.pbuf&lt;br /&gt;
│   │	├── queues.pbuf&lt;br /&gt;
│   │   ├── software.wasm&lt;br /&gt;
│   │   ├── stable_memory.bin&lt;br /&gt;
│   │	└── vmemory_0.bin&lt;br /&gt;
│   ...&lt;br /&gt;
├── ingress_history.pbuf&lt;br /&gt;
├── split_from.pbuf&lt;br /&gt;
├── subnet_queues.pbuf&lt;br /&gt;
└── system_metadata.pbuf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;As the layout shows, the state of each canister is stored in a separate directory. In contrast, there’s a single file containing the current ingress history (request statuses) of all canisters. &lt;br /&gt;
&lt;br /&gt;
For each checkpoint the protocol also computes a so-called manifest. A manifest consists of the individual hashes of every file and file chunk (the actual contents of files are chunked so that the chunks can be transmitted individually) and a root hash computed from them. The root hash has the property that it is intractable to come up with two different states that hash to the same root hash; it can hence be used to verify the integrity of a state relative to a known root hash. A similar property holds for the file and chunk hashes in the manifest. They are redundant for guaranteeing the integrity of the entire state, as they are already covered by the root hash; the reason they are included in the manifest nevertheless is to allow for efficiently computing state diffs by comparing file/chunk hashes.&lt;br /&gt;
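The state-diff use of file/chunk hashes mentioned above can be sketched as follows (a minimal illustration with hypothetical data structures; the actual state sync protocol is more involved):&lt;br /&gt;

```python
# Sketch: a manifest is modeled here as a dict mapping (file_path, chunk_index)
# to the chunk's hash. These names are illustrative, not the real structures.
def chunks_to_fetch(local_manifest, remote_manifest):
    """Chunks whose hash differs from (or is absent in) the local manifest;
    everything else can be reused from the local checkpoint."""
    return {key for key, h in remote_manifest.items() if local_manifest.get(key) != h}

local = {("a.bin", 0): "h1", ("a.bin", 1): "h2", ("b.bin", 0): "h3"}
remote = {("a.bin", 0): "h1", ("a.bin", 1): "h9", ("c.bin", 0): "h4"}
# only the modified chunk of a.bin and the new file c.bin need transferring
assert chunks_to_fetch(local, remote) == {("a.bin", 1), ("c.bin", 0)}
```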
&lt;br /&gt;
==== CUPs ====&lt;br /&gt;
Every checkpoint is also signed on behalf of the subnet by the protocol. This is done by creating and threshold-signing a so-called catch-up package (CUP). A CUP includes the root hash of the corresponding manifest as well as all the information that is required by a newly joining and/or lagging node to resume execution from the respective checkpoint. The signature on the CUP, together with the properties of the hash described above, guarantees the authenticity of the state.&lt;br /&gt;
&lt;br /&gt;
==== Remarks ====&lt;br /&gt;
Note that when we point to source code within this article, we link to a specific version of the code so that the links don’t break if the code changes in the future. It is advisable to locate the referenced code in the most recent version and/or the version that is used to verify the artifacts, and to review it there, so as not to miss changes made after the publication of this article.&lt;br /&gt;
&lt;br /&gt;
=== Splitting Process ===&lt;br /&gt;
In more detail, a subnet split proceeds as follows.&lt;br /&gt;
&lt;br /&gt;
# Create a new halted subnet via NNS proposal. This is the subnet that will become subnet B. It should have the same number of nodes as the subnet to be split (subnet A) to maintain the same trust assumptions. It is important that it is halted, to ensure that it remains in its genesis state, because we will later propose to overwrite this subnet’s state with the split-off half of subnet A. &lt;br /&gt;
# Add the canisters to be split off to the migration list (NNS proposal). This makes other subnets aware that canisters are going to be split off, helping them react appropriately to messages that are, e.g., expected to come from subnet A according to the routing table but actually come from B, or the other way round (note that not all subnets will observe a change of the routing table at the same time). &lt;br /&gt;
# Halt subnet A at the next CUP/checkpoint (NNS proposal) and add keys that grant the entity executing the split read-only access to the state (NNS proposal). Generally this proposal is the same as in the conventional subnet recovery case. The only difference is that it won’t instruct the subnet to halt immediately but at the next CUP. This ensures that we obtain a state that is signed on behalf of the subnet, so that the split can be verified by the community.&lt;br /&gt;
# Update the routing table (NNS proposal). As soon as a subnet observes this change it will (among other things) start routing messages addressed to split-off canisters to B. The reason this happens after halting subnet A is to ensure that A doesn’t route any new messages to B until it is unhalted again.&lt;br /&gt;
# Obtain the state of subnet A together with the CUP at the height where it was halted (recall that a read-only SSH key was added in step 3). Also obtain a certification of the public key of subnet A issued by the NNS (via a read_state call). Once obtained, split the state into two parts. The first part contains the states of the canisters that are not split off, as well as the entire subnet-level state of A, such as streams, subnet queues, etc.; this part will be used as the new genesis state of A (we’ll refer to subnet A with this genesis state as A’). The second part consists of an empty subnet state plus the canisters split off from A, and will become the new genesis state of subnet B. Both the genesis states of A’ and B will contain the full ingress history, to be pruned in the new subnets’ first execution round. Pruning it beforehand would change the checkpoint file contents, and the hashes in the manifest needed for verification of the split would no longer match.&lt;br /&gt;
# Verify whether each file ended up on the expected subnet and whether their hashes still match. The artifacts needed to perform this verification will be published during a split. A manifest only contains hashes but no file contents, i.e., no subnet or canister data. Comparing the hashes is sufficient because of the properties of the hash described above and the fact that the split is done in a way that either (1) file contents are not modified during splitting, or (2) the content of the files is fully determined by the split’s input parameters. If this step is successful, one knows which root hashes to expect in the recovery proposals for subnets A’ and B in the next steps.&lt;br /&gt;
# Perform a subnet recovery for subnets A’ (resp. A) and B with the states obtained in the previous step (series of NNS proposals). The NNS voters should check that the root hashes of the states match the ones obtained in the verification step (details below) to be sure that the states were only split but not otherwise tampered with. &lt;br /&gt;
# Once it is clear that there are no more messages in any stream to or from subnet A/A’ that may potentially be misrouted, remove the migration list entry (NNS proposal).&lt;br /&gt;
&lt;br /&gt;
To gain confidence that the splitting process doesn’t violate the messaging guarantees that the IC provides to canisters, the above high-level design is accompanied by a [https://github.com/dfinity/tla-models/tree/master/subnet-splitting formal model written in TLA+]. An analysis of the model using the TLC tool found that the design preserves the messaging guarantees.&lt;br /&gt;
&lt;br /&gt;
=== Verification of Artifacts in Subnet Splitting ===&lt;br /&gt;
As mentioned before, the process is orchestrated via NNS proposals and it will be end-to-end verifiable, in the sense that the community will be able to verify whether the states recovered onto the new subnets are indeed the two halves of the initial state before voting on the respective proposals. A description of how to perform this verification is provided on the [[Verification of Artifacts in Subnet Splitting]] page.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting&amp;diff=6386</id>
		<title>Subnet splitting</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting&amp;diff=6386"/>
		<updated>2023-08-31T13:19:03Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Subnet splitting steps.png|alt=Subnet splitting steps|thumb|High level overview of subnet splitting steps]]&lt;br /&gt;
The Internet Computer Protocol (ICP) now supports a minimum viable product (MVP) version of subnet splitting. Subnet splitting is a process where a subset of the canisters from the parent subnet A is split off onto a newly created child subnet B for load-balancing purposes, while the remaining canisters stay on the trimmed-down subnet A’. The MVP process is orchestrated by a series of NNS proposals and, roughly speaking, consists of the steps visualized on the right.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
To understand all the details of subnet splitting we need to take a closer look at some parts of the ICP. &lt;br /&gt;
&lt;br /&gt;
==== Checkpoints and Manifests ====&lt;br /&gt;
Up-to-date nodes persist their state to disk every few hundred rounds (usually every 500 rounds). A state that is persisted to disk by the protocol is called a checkpoint. Each checkpoint is a directory with the following structure (note that not all of the files shown below are necessarily present in every checkpoint; e.g., writing empty files may be skipped).&amp;lt;syntaxhighlight lang=&amp;quot;shell&amp;quot;&amp;gt;&lt;br /&gt;
├── canister_states&lt;br /&gt;
│   ├── &amp;lt;hex(canister_id)&amp;gt;&lt;br /&gt;
│   │  	├── canister.pbuf&lt;br /&gt;
│   │	├── queues.pbuf&lt;br /&gt;
│   │   ├── software.wasm&lt;br /&gt;
│   │   ├── stable_memory.bin&lt;br /&gt;
│   │	└── vmemory_0.bin&lt;br /&gt;
│   ...&lt;br /&gt;
├── ingress_history.pbuf&lt;br /&gt;
├── split_from.pbuf&lt;br /&gt;
├── subnet_queues.pbuf&lt;br /&gt;
└── system_metadata.pbuf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;As the layout shows, the state of each canister is stored in a separate directory. In contrast, there’s a single file containing the current ingress history (request statuses) of all canisters. &lt;br /&gt;
&lt;br /&gt;
For each checkpoint the protocol also computes a so-called manifest. A manifest consists of the individual hashes of every file and file chunk (the actual contents of files are chunked so that the chunks can be transmitted individually) and a root hash computed from them. The root hash has the property that it is intractable to come up with two different states that hash to the same root hash; it can hence be used to verify the integrity of a state relative to a known root hash. A similar property holds for the file and chunk hashes in the manifest. They are redundant for guaranteeing the integrity of the entire state, as they are already covered by the root hash; the reason they are included in the manifest nevertheless is to allow for efficiently computing state diffs by comparing file/chunk hashes.&lt;br /&gt;
&lt;br /&gt;
==== CUPs ====&lt;br /&gt;
Every checkpoint is also signed on behalf of the subnet by the protocol. This is done by creating and threshold-signing a so-called catch-up package (CUP). A CUP includes the root hash of the corresponding manifest as well as all the information that is required by a newly joining and/or lagging node to resume execution from the respective checkpoint. The signature on the CUP, together with the properties of the hash described above, guarantees the authenticity of the state.&lt;br /&gt;
&lt;br /&gt;
==== Remarks ====&lt;br /&gt;
Note that when we point to source code within this article, we link to a specific version of the code so that the links don’t break if the code changes in the future. It is advisable to locate the referenced code in the most recent version and/or the version that is used to verify the artifacts, and to review it there, so as not to miss changes made after the publication of this article.&lt;br /&gt;
&lt;br /&gt;
=== Verification of Artifacts in Subnet Splitting ===&lt;br /&gt;
As mentioned before, the process is orchestrated via NNS proposals and it will be end-to-end verifiable, in the sense that the community will be able to verify whether the states recovered onto the new subnets are indeed the two halves of the initial state before voting on the respective proposals. A description of how to perform this verification is provided on the [[Verification of Artifacts in Subnet Splitting]] page.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting&amp;diff=6385</id>
		<title>Subnet splitting</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=Subnet_splitting&amp;diff=6385"/>
		<updated>2023-08-31T13:17:17Z</updated>

		<summary type="html">&lt;p&gt;David: First version of page + intro&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:Subnet splitting steps.png|alt=Subnet splitting steps|thumb|High level overview of subnet splitting steps]]&lt;br /&gt;
The Internet Computer Protocol (ICP) now supports a minimum viable product (MVP) version of subnet splitting. Subnet splitting is a process where a subset of the canisters from the parent subnet A is split off onto a newly created child subnet B for load-balancing purposes, while the remaining canisters stay on the trimmed-down subnet A’. The MVP process is orchestrated by a series of NNS proposals and, roughly speaking, consists of the steps visualized on the right.&lt;br /&gt;
&lt;br /&gt;
=== Background ===&lt;br /&gt;
To understand all the details of subnet splitting we need to take a closer look at some parts of the ICP. &lt;br /&gt;
&lt;br /&gt;
==== Checkpoints and Manifests ====&lt;br /&gt;
Up-to-date nodes persist their state to disk every few hundred rounds (usually every 500 rounds). A state that is persisted to disk by the protocol is called a checkpoint. Each checkpoint is a directory with the following structure (note that not all of the files shown below are necessarily present in every checkpoint; e.g., writing empty files may be skipped).&amp;lt;syntaxhighlight lang=&amp;quot;shell&amp;quot;&amp;gt;&lt;br /&gt;
├── canister_states&lt;br /&gt;
│   ├── &amp;lt;hex(canister_id)&amp;gt;&lt;br /&gt;
│   │  	├── canister.pbuf&lt;br /&gt;
│   │	├── queues.pbuf&lt;br /&gt;
│   │   ├── software.wasm&lt;br /&gt;
│   │   ├── stable_memory.bin&lt;br /&gt;
│   │	└── vmemory_0.bin&lt;br /&gt;
│   ...&lt;br /&gt;
├── ingress_history.pbuf&lt;br /&gt;
├── split_from.pbuf&lt;br /&gt;
├── subnet_queues.pbuf&lt;br /&gt;
└── system_metadata.pbuf&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;As the layout shows, the state of each canister is stored in a separate directory. In contrast, there’s a single file containing the current ingress history (request statuses) of all canisters. &lt;br /&gt;
&lt;br /&gt;
For each checkpoint the protocol also computes a so-called manifest. A manifest consists of the individual hashes of every file and file chunk (the actual contents of files are chunked so that the chunks can be transmitted individually) and a root hash computed from them. The root hash has the property that it is intractable to come up with two different states that hash to the same root hash; it can hence be used to verify the integrity of a state relative to a known root hash. A similar property holds for the file and chunk hashes in the manifest. They are redundant for guaranteeing the integrity of the entire state, as they are already covered by the root hash; the reason they are included in the manifest nevertheless is to allow for efficiently computing state diffs by comparing file/chunk hashes.&lt;br /&gt;
&lt;br /&gt;
==== CUPs ====&lt;br /&gt;
Every checkpoint is also signed on behalf of the subnet by the protocol. This is done by creating and threshold-signing a so-called catch-up package (CUP). A CUP includes the root hash of the corresponding manifest as well as all the information that is required by a newly joining and/or lagging node to resume execution from the respective checkpoint. The signature on the CUP, together with the properties of the hash described above, guarantees the authenticity of the state.&lt;br /&gt;
&lt;br /&gt;
==== Remarks ====&lt;br /&gt;
Note that when we point to source code within this article, we link to a specific version of the code so that the links don’t break if the code changes in the future. It is advisable to locate the referenced code in the most recent version and/or the version that is used to verify the artifacts, and to review it there, so as not to miss changes made after the publication of this article.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=File:Subnet_splitting_steps.png&amp;diff=6384</id>
		<title>File:Subnet splitting steps.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=File:Subnet_splitting_steps.png&amp;diff=6384"/>
		<updated>2023-08-31T13:11:56Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;High level overview of subnet splitting steps&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_architecture_overview&amp;diff=6383</id>
		<title>IC architecture overview</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_architecture_overview&amp;diff=6383"/>
		<updated>2023-08-31T13:09:26Z</updated>

		<summary type="html">&lt;p&gt;David: Add link to subnet splitting page&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
[[File:IC-protocol-stack.png|500px]]&lt;br /&gt;
&lt;br /&gt;
As illustrated in the above diagram, the Internet Computer Protocol consists of four layers:&lt;br /&gt;
* [[IC execution layer]]&lt;br /&gt;
* [[IC message routing layer]]&lt;br /&gt;
* [[IC consensus layer]] &lt;br /&gt;
* [[IC P2P (peer to peer) layer]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Description of specific features:&lt;br /&gt;
*[[IC Smart Contract Memory]]&lt;br /&gt;
*[[Bitcoin integration]]&lt;br /&gt;
*[[HTTPS outcalls]]&lt;br /&gt;
* [[Subnet splitting]]&lt;br /&gt;
&lt;br /&gt;
Canisters serving the web:&lt;br /&gt;
* [[HTTP asset certification]]&lt;br /&gt;
* [[Boundary Nodes]]&lt;br /&gt;
&lt;br /&gt;
==See Also==&lt;br /&gt;
* &#039;&#039;&#039;The Internet Computer project website (hosted on the IC): [https://internetcomputer.org/ internetcomputer.org]&#039;&#039;&#039;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3417</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3417"/>
		<updated>2022-11-04T06:49:36Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the streams produced by this subnet available as certified stream slices that other subnets can transfer and induct.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for relevant and easier-to-digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees beyond those the system already provides, but simply ensures that system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
	subnet_assignment: CanisterId ↦ SubnetId&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
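The registry view above can be sketched in Python as a plain lookup structure. This is a minimal illustration, not the production data model; the class and method names (e.g. `route`) are hypothetical.

```python
# Minimal sketch (not the production data model) of the registry view used by
# message routing: the set of subnet IDs plus the canister-to-subnet mapping.
class Registry:
    def __init__(self, subnets, subnet_assignment):
        self.subnets = set(subnets)                       # set of SubnetId
        self.subnet_assignment = dict(subnet_assignment)  # CanisterId to SubnetId

    def route(self, canister_id):
        # Return the subnet hosting the canister, or None if unassigned.
        return self.subnet_assignment.get(canister_id)

registry = Registry(["subnet_a", "subnet_b"],
                    {"canister_1": "subnet_a", "canister_2": "subnet_b"})
```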
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map of integers mapping to T, and returns a new queue consisting of the old queue with the given values appended. It also updates the next_index field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the next_index unchanged&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
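The abstract queue and its three associated functions can be sketched as a small Python class. This is an illustrative stand-in for the formal definitions above, assuming 1-based indices and a 1-based rank within the sorted key set of the pushed partial map.

```python
# Sketch of the abstract Queue of T: a fresh queue has next_index = 1, push
# assigns consecutive indices starting at next_index (index = next_index - 1
# + rank of the key, with 1-based rank), delete and clear keep next_index.
class Queue:
    def __init__(self):
        self.next_index = 1
        self.elements = {}   # natural-number index to element

    def push(self, values):
        # Append the values of the partial map in ascending key order.
        for k, key in enumerate(sorted(values)):
            self.elements[self.next_index + k] = values[key]
        self.next_index += len(values)

    def delete(self, indices):
        # Remove the given indices; the rolling next_index is kept.
        for i in indices:
            del self.elements[i]

    def clear(self):
        # Drop all elements but keep next_index.
        self.elements = {}
```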
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a queue of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
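The map-lifted semantics above can be sketched as follows, here for `push` only. The dict-of-dicts representation of a queue is a stand-in for the `Queue` type; the key point illustrated is that identifiers missing from the queue map get a fresh queue, and queues without new values are kept unchanged.

```python
# Sketch of the map-lifted push: apply push per identifier, creating a fresh
# queue for identifiers not yet present. deepcopy keeps the input unchanged,
# mirroring the functional definition in the text.
import copy

def new_queue():
    return {"next_index": 1, "elements": {}}

def push(queue, values):
    for k, key in enumerate(sorted(values)):
        queue["elements"][queue["next_index"] + k] = values[key]
    queue["next_index"] += len(values)
    return queue

def push_map(queues, values_by_id):
    result = copy.deepcopy(queues)
    for ident, values in values_by_id.items():
        target = result[ident] if ident in result else new_queue()
        result[ident] = push(target, values)
    return result
```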
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
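The index operations above translate directly into a few tuple helpers. This is a sketch under the representation that an index is a Python tuple whose last component is a natural number; the helper names (`inc`, `comparable`, `leq`) are ours.

```python
# Sketch of the Index operations: prefix/postfix, increment, and the partial
# order, for indices represented as tuples ending in a natural number.
def prefix(i):
    return i[:-1]

def postfix(i):
    return i[-1]

def inc(i):
    # i + 1: same prefix, last component incremented.
    return prefix(i) + (postfix(i) + 1,)

def comparable(i, j):
    # Two indices are comparable only if their prefixes agree.
    return prefix(i) == prefix(j)

def leq(i, j):
    # i is at most j: same prefix and postfix(i) not above postfix(j).
    return comparable(i, j) and postfix(j) >= postfix(i)
```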
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input- and output queues are indexed by a concrete instance of Index called QueueIndex, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
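The stream structures above can be sketched with flat substreams, i.e. with `SubstreamId` as the unit type, here modelled as the empty tuple. The helper names (`new_stream`, `stream_push`) are illustrative, not part of any real API.

```python
# Sketch of Streams (SubnetId to Stream) with flat substreams: each stream
# holds a signals map (StreamIndex to ACCEPT/REJECT) and a message queue
# whose elements are keyed by StreamIndex = (SubstreamId, n) with
# SubstreamId = () and a rolling next_index.
def new_stream():
    return {
        "signals": {},                               # ((), n) -> verdict
        "msgs": {"next_index": 1, "elements": {}},   # the single flat substream
    }

def stream_push(streams, dst_subnet, message):
    # Append a message to the stream targeted at dst_subnet,
    # creating the stream on first use.
    stream = streams.setdefault(dst_subnet, new_stream())
    queue = stream["msgs"]
    queue["elements"][((), queue["next_index"])] = message
    queue["next_index"] += 1

streams = {}
stream_push(streams, "subnet_b", "msg_1")
stream_push(streams, "subnet_b", "msg_2")
```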
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: A &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings have been decoded into a format suitable for processing, and fields that are not required inside the deterministic state machine have been stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function that, given the subnet&#039;s own ID and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layers. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message m in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch that are accepted by the VSR. The set is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
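The ingress rule above can be sketched as a filter that keeps only the accepted indices and re-keys each accepted message by its destination and its 1-based rank among the accepted indices. The function names and the `destination_of` callback are illustrative; the accepted set stands in for `vsr_check_ingress`.

```python
# Sketch of the VSR's ingress rule: keep only the accepted indices and re-key
# each accepted message m_i by (m_i.dst, rank of i among accepted indices).
def rank(i, index_set):
    # 1-based position of i within the sorted index set.
    return sorted(index_set).index(i) + 1

def vsr_ingress(accepted, ingress_payload, destination_of):
    return {
        (destination_of(m), rank(i, accepted)): m
        for i, m in ingress_payload.items()
        if i in accepted
    }

payload = {1: "m1", 2: "m2", 3: "m3"}
result = vsr_ingress({1, 3}, payload, lambda m: "can_" + m)
```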
&lt;br /&gt;
The VSR for cross-net messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_xnet : CanonicalState × Batch → Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function that determines the indices of the messages in the individual substreams contained in &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt; to be inducted.&lt;br /&gt;
&lt;br /&gt;
We require that the implementation of the VSR (or the layer above) makes sure that all reply messages are accepted by the VSR. Formally this means that for any valid state-batch combination &amp;lt;code&amp;gt;(s, b)&amp;lt;/code&amp;gt; it holds that, for all &amp;lt;code&amp;gt;(subnet, index)&amp;lt;/code&amp;gt; such that &amp;lt;code&amp;gt;b.xnet_payload[subnet].msgs[index]&amp;lt;/code&amp;gt; is a reply message, &amp;lt;code&amp;gt;(subnet, index) ∈ vsr_check_xnet(s, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Based on this rule one can straightforwardly define the interface behavior of the VSR.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).xnet :=&lt;br /&gt;
  { (index ↦ msg) |&lt;br /&gt;
      (index ↦ msg) ∈ batch.xnet_payload.msgs ∧&lt;br /&gt;
      index ∈ vsr_check_xnet(state, batch)&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
VSR(state, batch).signals :=&lt;br /&gt;
    { (concatenate(subnet, index) ↦ ACCEPT) |&lt;br /&gt;
      (subnet ↦ stream) ∈ batch.xnet_payload ∧&lt;br /&gt;
      (index ↦ msg) ∈ stream.msgs ∧&lt;br /&gt;
      (subnet, index) ∈ vsr_check_xnet(state, batch)&lt;br /&gt;
    }&lt;br /&gt;
  ∪ { (concatenate(subnet, index) ↦ REJECT) |&lt;br /&gt;
      (subnet ↦ stream) ∈ batch.xnet_payload ∧&lt;br /&gt;
      (index ↦ msg) ∈ stream.msgs ∧&lt;br /&gt;
      (subnet, index) ∉ vsr_check_xnet(state, batch)&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
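&lt;br /&gt;
To illustrate, the following sketch derives the ACCEPT/REJECT signals from a given vsr_check_xnet result, modeling concatenate(subnet, index) as a plain tuple. All data shapes here are assumptions chosen for illustration, not the actual implementation.&lt;br /&gt;

```python
def vsr_signals(xnet_payload, accepted):
    """xnet_payload: {subnet: {"msgs": {index: msg}}};
    accepted: set of (subnet, index) pairs as returned by vsr_check_xnet.
    Returns a signal for every message in the payload."""
    signals = {}
    for subnet, stream in xnet_payload.items():
        for index in stream["msgs"]:
            verdict = "ACCEPT" if (subnet, index) in accepted else "REJECT"
            signals[(subnet, index)] = verdict
    return signals

payload = {"subnet_1": {"msgs": {1: "a", 2: "b"}}}
signals = vsr_signals(payload, accepted={("subnet_1", 1)})
```

Every message in the payload receives exactly one signal, which matches the union of the ACCEPT and REJECT sets in the rule above.&lt;br /&gt;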
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here with respect to a version of the VSR that accepts all messages, while in reality the VSR may reject some messages when canisters migrate across subnets or subnets are split. While the possibility that messages are REJECTed by the VSR would require specific action by the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition we define a couple of helper functions. First we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦ Message) → ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
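&lt;br /&gt;
The trivial implementation sketched in the comment above can be written out as follows, with slices and queues represented as plain dictionaries. This is an illustrative sketch, not the actual implementation.&lt;br /&gt;

```python
def queue_index(slices):
    """slices: {subnet: {stream_index: msg}} where each msg has a "dst" key.
    Iterate over the slices per subnet and, within each slice, over the
    messages in stream-index order, pushing each message onto the queue of
    its destination canister."""
    queues = {}  # maps destination canister id to its list of messages
    for subnet in sorted(slices):
        for stream_index in sorted(slices[subnet]):
            msg = slices[subnet][stream_index]
            queues.setdefault(msg["dst"], []).append(msg)
    # Flatten into the (CanisterId, position) keyed map of the specification.
    return {
        (dst, pos): msg
        for dst, msgs in queues.items()
        for pos, msg in enumerate(msgs)
    }
```

Because each slice is traversed in stream-index order, two messages for the same destination keep their relative order, which is exactly the ENSURES clause above.&lt;br /&gt;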
&lt;br /&gt;
Based on this we can now define a function that maps over the indexes of the valid XNet messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (index ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                        i &amp;lt; slice.begin&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
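&lt;br /&gt;
As an illustration, the three change sets returned by schedule_and_execute can be applied to dictionary-based stand-ins for the queues as follows. The names and data shapes are assumptions for illustration, and the consumed change sets are given here simply as sets of queue keys.&lt;br /&gt;

```python
def apply_execution_delta(state, consumed_ingress, consumed_xnet, produced):
    """Delete consumed messages from the ingress and input queues and
    append produced messages to the output queues."""
    ingress_queues = {k: v for k, v in state["ingress_queues"].items()
                      if k not in consumed_ingress}
    input_queues = {k: v for k, v in state["input_queues"].items()
                    if k not in consumed_xnet}
    output_queues = dict(state["output_queues"])
    output_queues.update(produced)  # append in queue-index order
    return {"ingress_queues": ingress_queues,
            "input_queues": input_queues,
            "output_queues": output_queues}

state = {"ingress_queues": {("c1", 0): "i0"},
         "input_queues": {("a", "c1", 0): "x0"},
         "output_queues": {}}
after = apply_execution_delta(state, {("c1", 0)}, set(),
                              {("c1", "b", 0): "out0"})
```
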
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;, puts them into a subnet-to-subnet stream, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, subnet_assignment)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; is derived from a view of the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A second helper produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
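&lt;br /&gt;
The two helpers can be sketched together as follows, with queue indexes modeled as (src, dst, i) tuples. This is illustrative only; the real types and names are given by the specification above.&lt;br /&gt;

```python
def filter_queues(queues, subnet_assignment):
    """Drop messages whose destination canister has no assigned subnet."""
    return {
        (src, dst, i): msg
        for (src, dst, i), msg in queues.items()
        if dst in subnet_assignment
    }

def non_existent_canister_replies(queues, subnet_assignment):
    """For every dropped message, produce a reject reply addressed back to
    the sender (src and dst swapped, same queue position)."""
    return {
        (dst, src, i): "NON_EXISTENT_CANISTER"
        for (src, dst, i), msg in queues.items()
        if dst not in subnet_assignment
    }

queues = {("a", "b", 0): "m1", ("a", "x", 0): "m2"}
assignment = {"a": "subnet_1", "b": "subnet_1"}
kept = filter_queues(queues, assignment)
replies = non_existent_canister_replies(queues, assignment)
```

Together the two helpers partition the output messages: those with a routable destination are kept, and every other message is turned into a reply to its sender.&lt;br /&gt;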
&lt;br /&gt;
&#039;&#039;Non flat streams.&#039;&#039; As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams; this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indexes, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that, based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt; (i.e., a src-dst tuple), decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; (i.e., the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;).&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
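&lt;br /&gt;
The effect of the two variants on stream indexes can be illustrated with a small routing sketch, where the natural-number part of each StreamIndex is a per-substream counter. The route helper is hypothetical and serves only to show how the same queue contents land in flat versus per-destination substreams.&lt;br /&gt;

```python
# The two substream_id variants from above, as plain functions.
def substream_id_flat(src, dst):
    return ()   # unit type: one flat stream per destination subnet

def substream_id_per_dest(src, dst):
    return dst  # one substream per destination canister

def route(queue_items, substream_id):
    """Assign consecutive per-substream indices (the natural-number part
    of the StreamIndex) to messages taken in queue-index order."""
    counters = {}
    stream = {}
    for (src, dst, _pos), msg in sorted(queue_items.items()):
        sid = substream_id(src, dst)
        n = counters.get(sid, 0)
        stream[(sid, n)] = msg
        counters[sid] = n + 1
    return stream

items = {("a", "c1", 0): "m1", ("a", "c2", 0): "m2", ("b", "c1", 1): "m3"}
flat = route(items, substream_id_flat)
per_dest = route(items, substream_id_per_dest)
```

With the flat variant all messages share one counter, while the per-destination variant maintains an independent counter per destination canister.&lt;br /&gt;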
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the invariant is maintained that the canister-to-canister ordering of messages is preserved when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.&lt;br /&gt;
&lt;br /&gt;
== XNet Transfer ==&lt;br /&gt;
After calling &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The &amp;lt;code&amp;gt;XNetTransfer&amp;lt;/code&amp;gt; subcomponent is responsible for enabling this transfer. It consists of&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet.png|thumb|XNet transfer component diagram]]&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which is responsible for serving certified stream slices and making them available to &amp;lt;code&amp;gt;XNetPayloadBuilders&amp;lt;/code&amp;gt; on other subnetworks.&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;, which allows the block makers to obtain an &amp;lt;code&amp;gt;XNetPayload&amp;lt;/code&amp;gt; containing the currently available certified streams originating from other subnetworks. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; obtains those streams by interacting with &amp;lt;code&amp;gt;XNetEndpoints&amp;lt;/code&amp;gt; exposed by other subnets. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; also provides functionality for notaries to verify &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; contained in block proposals.&lt;br /&gt;
&lt;br /&gt;
We do not specify anything about the protocol run between the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; to transfer the streams between two subnetworks. The only requirements are that certified streams made available by an &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; of an honest replica on some source subnetwork can be obtained by an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; of an honest replica on the destination subnetwork, and that the information regarding which endpoints to contact is available in the Registry.&lt;br /&gt;
&lt;br /&gt;
=== Properties and Functionality ===&lt;br /&gt;
Assume an XNet transfer component on a replica that is part of subnet &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt;. The interface behavior of the XNet transfer component will guarantee that for any payload produced via&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;get_xnet_payload(registry_version, reference_height, past_payloads, size_limit)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
we have that for any &amp;lt;code&amp;gt;(remote_subnet ↦ css) ∈ payload&amp;lt;/code&amp;gt;: &lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;StateManager.decode_certified_stream(registry_version, own_subnet, remote_subnet, css)&amp;lt;/code&amp;gt; succeeds, i.e., returns a valid slice that is guaranteed to come from &amp;lt;code&amp;gt;remote_subnet&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Furthermore, for each slice it will hold that, as soon as the state corresponding to height &amp;lt;code&amp;gt;h = reference_height + |past_payloads|&amp;lt;/code&amp;gt; is available, &amp;lt;code&amp;gt;concatenate(remote_subnet, min(dom(slice.msgs.elements))) ∈ StateManager.get_state_at(h).expected_xnet_indices&amp;lt;/code&amp;gt;. This means that the streams will start at the expected indexes stored in the previous state, i.e., they extend the previously seen streams without gaps.&lt;br /&gt;
&lt;br /&gt;
Payloads verified using &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; are accepted if they adhere to those requirements, and are rejected otherwise.&lt;br /&gt;
&lt;br /&gt;
=== XNet Endpoint ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; serves the streams available on some subnet to other subnets. For an implementation this will typically mean that there is some client which will handle querying the API of the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; on the remote subnet in question. We use the following abstraction to avoid explicitly talking about this client: We assume that there is a function &amp;lt;code&amp;gt;get : SubnetId → XNetEndpoint&amp;lt;/code&amp;gt; which will return an appropriate instance of &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which we can directly query using the API described below.&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet-sequence.png|thumb|XNet transfer sequence diagram]]&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;get_stream(subnet_id : SubnetId, begin : StreamIndex, msg_limit : ℕ, size_limit : ℕ) → CertifiedStreamSlice&amp;lt;/code&amp;gt;: Returns the requested certified stream slice in its transport format.&lt;br /&gt;
&lt;br /&gt;
We require that an honest &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;-&amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; pair is able to successfully obtain slices over this API.&lt;br /&gt;
&lt;br /&gt;
Looking at the bigger picture, the intuition for why this will yield a secure system is that in each round a new pair of block maker and endpoint will try to pull over a stream, which, in turn, means that eventually an honest pair will be able to obtain the stream and include it into a block.&lt;br /&gt;
&lt;br /&gt;
=== XNet Payload Builder ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; builds and verifies payloads whenever requested to do so by the block maker. The rules for whether a payload is considered valid must be such that every notary is guaranteed to make the same decision on the same input and that a payload built by an honest payload builder will be accepted by honest validators. Essentially, the rules resemble what is described in the section on properties and functionality. However, given that the execution may lag behind, we cannot directly look up the expected indexes in the appropriate state but need to compute them based on the referenced state and the payloads since then. Below, we provide a figure illustrating the high-level functionality: generally speaking, payloads are considered valid if they adhere to the rules described in the figure and are considered invalid otherwise.&lt;br /&gt;
&lt;br /&gt;
[[File:Payload-building.png|thumb|Rules for payload building]]&lt;br /&gt;
&lt;br /&gt;
Below we formally define the operation of the component. We first define the following helper functions. We assume that &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; has an associated field &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt; which is passed whenever constructing an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;new : SubnetId → Self&lt;br /&gt;
new(own_subnet) :=&lt;br /&gt;
  XNetPayloadBuilder {&lt;br /&gt;
      with&lt;br /&gt;
         └─ own_subnet := own_subnet&lt;br /&gt;
  }&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The API defines &amp;lt;code&amp;gt;past_payloads&amp;lt;/code&amp;gt; as a vector in which the past payloads are ordered with respect to the corresponding height in the chain. While this ordering allows for a more efficient implementation of the functions below, it does not matter on a conceptual level. Hence, we treat it as a set for the sake of simplicity.&lt;br /&gt;
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;slice_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, solely based on a set of Slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Take the maximum index for each individual (sub-)stream in the given set of slices and add&lt;br /&gt;
% 1 to obtain the next indexes one would expect when solely looking at the past payloads but&lt;br /&gt;
% ignoring the state.&lt;br /&gt;
slice_indexes : (SubnetId ↦ StreamSlice) → Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
slice_indexes(slices) := { i + 1 | i ∈ max(dom(slices.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;state_and_payload_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given slices from the past payloads.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Take the expected indexes from the state, remove whatever index appears in the given&lt;br /&gt;
% slices and add the expected indexes according to the streams in the slices.&lt;br /&gt;
%&lt;br /&gt;
% FAIL IF: ∃ i, j ∈ state_and_payload_indexes(state, slices) :&lt;br /&gt;
%              prefix(i) = prefix(j) ∧ postfix(i) ≠ postfix(j)&lt;br /&gt;
%&lt;br /&gt;
state_and_payload_indexes : ReplicatedState ×&lt;br /&gt;
                            (SubnetId ↦ StreamSlice) →&lt;br /&gt;
                            Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
state_and_payload_indexes(state, slices) := state.expected_xnet_indices&lt;br /&gt;
                                            \ dom(slices.msgs.elements)&lt;br /&gt;
                                            ∪ slice_indexes(slices)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
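&lt;br /&gt;
A sketch of the two functions above, with slices modeled as a map keyed by (subnet, stream_index) pairs and maxima taken per subnet. Taking one maximum per subnet is an illustrative simplification of taking maxima per prefix.&lt;br /&gt;

```python
def slice_indexes(slices):
    """slices: {(subnet, stream_index): msg}. For each subnet seen in the
    slices, take the maximum index and add one to obtain the next expected
    (subnet, index) pair."""
    subnets = {subnet for (subnet, _i) in slices}
    return {
        (subnet, max(i for (s, i) in slices if s == subnet) + 1)
        for subnet in subnets
    }

def state_and_payload_indexes(expected, slices):
    """Start from the state's expected indexes, drop every prefix that also
    appears in the slices, then add the next indexes implied by the slices."""
    seen = {subnet for (subnet, _i) in slices}
    kept = {(subnet, i) for (subnet, i) in expected if subnet not in seen}
    return kept | slice_indexes(slices)

expected = {("s1", 5), ("s2", 3)}
slices = {("s1", 5): "a", ("s1", 6): "b"}
combined = state_and_payload_indexes(expected, slices)
```

The expected index for s1 is advanced past the messages seen in the slices, while the expectation for s2, which does not appear in any slice, is carried over unchanged.&lt;br /&gt;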
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;expected_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given past payloads.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Decode the slices in the given past payloads and compute the expected indexes using the&lt;br /&gt;
% state_and_payload_indexes function above&lt;br /&gt;
expected_indexes : SubnetId ×&lt;br /&gt;
                   ReplicatedState ×&lt;br /&gt;
                   Set&amp;lt;XNetPayload&amp;gt; →&lt;br /&gt;
                   Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
expected_indexes(own_subnet, state, past_payloads) :=&lt;br /&gt;
    state_and_payload_indexes(&lt;br /&gt;
        state,&lt;br /&gt;
        { (src ↦ slice) | payload ∈ past_payloads ∧&lt;br /&gt;
                          (src ↦ cert_slice) ∈ payload ∧&lt;br /&gt;
                          slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                            cert_slice)&lt;br /&gt;
        }&lt;br /&gt;
    )&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Creation of XNet Payloads ====&lt;br /&gt;
Based on the functions above, we are now ready to define the function &amp;lt;code&amp;gt;get_xnet_payload : Height × Height × Set&amp;lt;XNetPayload&amp;gt; → XNetPayload&amp;lt;/code&amp;gt;. Note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Build an xnet payload containing the currently available streams. The begin index is given&lt;br /&gt;
% by the expected index; if there is no expected index for a given prefix, the index&lt;br /&gt;
% ONE is used.&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: size_of(get_xnet_payload(self, ·, ·, ·, size_limit)) ≤ size_limit ∧&lt;br /&gt;
%          each payload output by get_xnet_payload will be accepted by validate_xnet_payload&lt;br /&gt;
%&lt;br /&gt;
get_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × ℕ → XNetPayload&lt;br /&gt;
get_xnet_payload(self, registry_version, reference_height, past_payloads, size_limit) :=&lt;br /&gt;
    { (remote_subnet ↦ slice) |&lt;br /&gt;
            S = StateManager.get_state_at(reference_height)&lt;br /&gt;
          ∧ subnets = Registry::get_registry_at(registry_version).subnets \ { self.own_subnet }&lt;br /&gt;
          ∧ (remote_subnet, begin_index) ∈&lt;br /&gt;
                  expected_indexes(self.own_subnet, S, past_payloads)&lt;br /&gt;
                ∪ { (subnet_id, StreamIndex::ONE) |&lt;br /&gt;
                        subnet_id ∈ subnets&lt;br /&gt;
                                    \ { s | (s, ·) ∈ expected_indexes(self.own_subnet,&lt;br /&gt;
                                                                      S,&lt;br /&gt;
                                                                      past_payloads)&lt;br /&gt;
                                      }&lt;br /&gt;
                  }&lt;br /&gt;
            % msg_limit and size limit need to be set by the implementation as appropriate&lt;br /&gt;
            % to satisfy the post condition&lt;br /&gt;
          ∧ slice = XNetEndpoint::get(remote_subnet).get_stream(self.own_subnet, begin_index, ·, ·)&lt;br /&gt;
          ∧ ERR ≠ StateManager.decode_certified_stream(registry_version,&lt;br /&gt;
                                                       self.own_subnet,&lt;br /&gt;
                                                       remote_subnet,&lt;br /&gt;
                                                       slice)&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
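&lt;br /&gt;
The choice of the begin index per remote subnet, i.e., the expected index if one is known and StreamIndex::ONE otherwise, can be sketched as follows. The names are illustrative and the stream index is modeled as a plain natural number.&lt;br /&gt;

```python
def begin_indexes(own_subnet, all_subnets, expected):
    """Pick the begin index for every remote subnet: the expected index
    if one is known, otherwise index 1 (standing in for StreamIndex::ONE)."""
    known = {subnet: i for (subnet, i) in expected}
    return {
        subnet: known.get(subnet, 1)
        for subnet in all_subnets
        if subnet != own_subnet
    }

begins = begin_indexes("s0", {"s0", "s1", "s2"}, {("s1", 7)})
```

The own subnet is excluded, matching the set difference over the registry's subnets in the definition above.&lt;br /&gt;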
&lt;br /&gt;
==== Validation of XNet Payloads ====&lt;br /&gt;
Validation of XNetPayloads works analogously to the creation. The function &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; is defined as follows, where we assume that it evaluates to false in case an error occurs. Again, note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Check whether a given xnet payload was built according to the rules given above.&lt;br /&gt;
%&lt;br /&gt;
% FAIL IF: size_of(payload) &amp;gt; size_limit&lt;br /&gt;
%&lt;br /&gt;
validate_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × XNetPayload × ℕ → Bool&lt;br /&gt;
validate_xnet_payload(self, registry_version, reference_height, past_payloads, payload, size_limit) :=&lt;br /&gt;
    S = StateManager.get_state_at(reference_height) ∧&lt;br /&gt;
    ∀ (remote_subnet ↦ css) ∈ payload :&lt;br /&gt;
    {&lt;br /&gt;
      slice = StateManager.decode_certified_stream(registry_version,&lt;br /&gt;
                                                   self.own_subnet,&lt;br /&gt;
                                                   remote_subnet,&lt;br /&gt;
                                                   css) ∧&lt;br /&gt;
      ∀ index ∈ min(dom(slice.msgs.elements)) :&lt;br /&gt;
      {&lt;br /&gt;
        (remote_subnet, index) ∈ expected_indexes(self.own_subnet, S, past_payloads) ∨&lt;br /&gt;
        index = StreamIndex::ONE&lt;br /&gt;
      }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3416</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3416"/>
		<updated>2022-11-04T06:43:11Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the message streams produced on this subnet available to other subnets as certified stream slices.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for relevant and easier-to-digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees to those the system already provides, but simply needs to make sure that the system invariants are preserved. Those invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended in key order. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
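&lt;br /&gt;
The lifting of &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; to maps of queues can be sketched as follows (hypothetical Python; the minimal queue model here only supports &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt;, and queues already present in the input map are updated in place rather than copied):&lt;br /&gt;

```python
# Hypothetical sketch of the f_map construction for f = push:
# identifiers present in both q and v get their queue updated,
# identifiers only in v start from a fresh queue, and identifiers
# only in q are kept unchanged.

class Queue:
    def __init__(self):
        self.next_index, self.elements = 1, {}

    def push(self, values):
        for k, j in enumerate(sorted(values)):
            self.elements[self.next_index + k] = values[j]
        self.next_index += len(values)

def push_map(q, v):
    # q : identifier ↦ Queue;  v : (identifier, ℕ) ↦ T
    grouped = {}
    for (ident, j), t in v.items():
        grouped.setdefault(ident, {})[j] = t
    out = dict(q)  # identifiers untouched by v are kept as-is
    for ident, values in grouped.items():
        queue = out.get(ident) or Queue()  # fresh queue if absent
        queue.push(values)
        out[ident] = queue
    return out
```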
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an &amp;lt;code&amp;gt;Index i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an &amp;lt;code&amp;gt;Index i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
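&lt;br /&gt;
These semantics can be sketched in Python, modeling an index as a plain tuple whose last component is a natural number (a hypothetical illustration, not the actual implementation):&lt;br /&gt;

```python
# Hypothetical model of the Index semantics.

def prefix(i):
    return i[:-1]   # all elements except the last

def postfix(i):
    return i[-1]    # the trailing natural number

def successor(i):
    # i + 1: increment only the trailing sequence number
    return prefix(i) + (postfix(i) + 1,)

def comparable(i, j):
    # two indices are incomparable iff their prefixes differ
    return prefix(i) == prefix(j)

def leq(i, j):
    # i ≤ j iff the prefixes agree and the sequence numbers are ordered
    return comparable(i, j) and postfix(i) <= postfix(j)
```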
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet the stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt;s, &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt;, and a sequence of canister-to-canister messages, &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams, indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
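&lt;br /&gt;
These definitions can be sketched in Python with flat streams, i.e., with &amp;lt;code&amp;gt;SubstreamId = ()&amp;lt;/code&amp;gt; so that a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is a pair &amp;lt;code&amp;gt;((), n)&amp;lt;/code&amp;gt; (a hypothetical illustration, not the actual implementation):&lt;br /&gt;

```python
# Hypothetical model of Stream / Streams with flat substreams.

class Queue:
    def __init__(self):
        self.next_index, self.elements = 1, {}

class Stream:
    def __init__(self):
        self.signals = {}          # StreamIndex ↦ "ACCEPT" | "REJECT"
        self.msgs = {(): Queue()}  # SubstreamId ↦ Queue<Message>

# Streams: all outgoing streams, keyed by destination subnet.
streams = {"remote_subnet": Stream()}

# Put one message into the (single, flat) substream.
s = streams["remote_subnet"]
s.msgs[()].elements[1] = "hello"

# Stream.msgs.elements viewed as StreamIndex ↦ Message:
flattened = {(sub, i): m
             for sub, q in s.msgs.items()
             for i, m in q.elements.items()}
```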
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and the end of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing, but instead one relies on the encoding and decoding routines provided by the state manager: A &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings are decoded into a format suitable for processing, and in which fields that are not required inside the deterministic state machine are stripped.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the subnet&#039;s own ID and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
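&lt;br /&gt;
The shape of &amp;lt;code&amp;gt;decode&amp;lt;/code&amp;gt; can be sketched in Python. Note that &amp;lt;code&amp;gt;decode_valid_certified_stream&amp;lt;/code&amp;gt; is stubbed out here to merely unwrap the payload; the real state manager routine verifies the witness and the certification first.&lt;br /&gt;

```python
# Hypothetical sketch of batch decoding; batches, slices and payloads
# are modeled as plain dicts for illustration.

def decode_valid_certified_stream(own_subnet, cert_slice):
    # Stand-in for the state manager's decoding routine: a real
    # implementation checks witness and signature before returning
    # the contained StreamSlice.
    return cert_slice["payload"]

def decode(own_subnet, batch):
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src_subnet: decode_valid_certified_stream(own_subnet, cs)
            for src_subnet, cs in batch["xnet_payload"].items()
        },
    }
```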
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message m in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
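&lt;br /&gt;
The structure of one processing round can be sketched as follows. This is a hypothetical Python skeleton, not the real API: all three phases and the routing logic are stand-ins that only illustrate how state flows through the round.&lt;br /&gt;

```python
# Hypothetical skeleton of one deterministic round: induction,
# execution, and XNet routing, with state modeled as a dict.

def pre_process(state, subnet_assignment, decoded_batch):
    # Induction phase: induct the (assumed valid) batch messages.
    state["inducted"] = list(decoded_batch["ingress_payload"].values())
    return state

def execute(state):
    # Execution phase: consume inducted messages, produce outputs.
    state["outputs"] = [m.upper() for m in state.pop("inducted")]
    return state

def post_process(state, subnet_assignment):
    # XNet routing phase: move outputs into per-destination streams.
    for msg in state.pop("outputs"):
        state.setdefault("streams", {}) \
             .setdefault("dst_subnet", []).append(msg)
    return state

def process_batch(state, subnet_assignment, decoded_batch):
    state = pre_process(state, subnet_assignment, decoded_batch)
    state = execute(state)
    return post_process(state, subnet_assignment)
```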
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the indices of the messages in the ingress payload that are accepted by the VSR; which indices are accepted is determined by the concrete implementation of the VSR. The result of &amp;lt;code&amp;gt;VSR(state, batch).ingress&amp;lt;/code&amp;gt; is a possibly empty set of index-message tuples corresponding to the accepted messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
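&lt;br /&gt;
This induction rule can be sketched in Python, modeling messages as dicts that carry a &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; field and the accepted indices as a plain set (a hypothetical illustration, not the actual implementation):&lt;br /&gt;

```python
# Hypothetical sketch of the ingress half of the VSR: given the batch's
# ingress payload and the set of indices accepted by the
# (implementation-specific) vsr_check_ingress, build the
# (dst, j) ↦ message map, where j is the rank of the message's index
# among the accepted indices.

def vsr_ingress(ingress_payload, accepted):
    # ingress_payload : ℕ ↦ message;  accepted : Set<ℕ>
    ranked = sorted(accepted)
    return {
        (msg["dst"], ranked.index(i) + 1): msg
        for i, msg in ingress_payload.items()
        if i in accepted
    }
```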
&lt;br /&gt;
The VSR for cross-net messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_xnet : CanonicalState × Batch → Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function that determines the indices of the messages in the individual substreams contained in &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt; to be inducted.&lt;br /&gt;
&lt;br /&gt;
We require that the implementation of the VSR (or the layer above) makes sure that all reply messages are accepted by the VSR. Formally this means that for any valid state-batch combination &amp;lt;code&amp;gt;(s, b)&amp;lt;/code&amp;gt;, and for all &amp;lt;code&amp;gt;(subnet, index)&amp;lt;/code&amp;gt; such that &amp;lt;code&amp;gt;b.xnet_payload[subnet].msgs[index]&amp;lt;/code&amp;gt; is a reply message, it holds that &amp;lt;code&amp;gt;(subnet, index) ∈ vsr_check_xnet(s, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Based on this rule one can straightforwardly define the interface behavior of the VSR.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).xnet :=&lt;br /&gt;
  { (index ↦ msg) |&lt;br /&gt;
      (index ↦ msg) ∈ batch.xnet_payload.msgs ∧&lt;br /&gt;
      index ∈ vsr_check_xnet(state, batch)&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
VSR(state, batch).signals :=&lt;br /&gt;
    { (concatenate(subnet, index) ↦ ACCEPT) |&lt;br /&gt;
      (subnet ↦ stream) ∈ batch.xnet_payload ∧&lt;br /&gt;
      (index ↦ msg) ∈ stream.msgs ∧&lt;br /&gt;
      (subnet, index) ∈ vsr_check_xnet(state, batch)&lt;br /&gt;
    }&lt;br /&gt;
  ∪ { (concatenate(subnet, index) ↦ REJECT) |&lt;br /&gt;
      (subnet ↦ stream) ∈ batch.xnet_payload ∧&lt;br /&gt;
      (index ↦ msg) ∈ stream.msgs ∧&lt;br /&gt;
      (subnet, index) ∉ vsr_check_xnet(state, batch)&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
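&lt;br /&gt;
The signal rule above partitions all (subnet, index) pairs of the xnet payload into ACCEPT and REJECT signals. The following Python sketch illustrates this, with &amp;lt;code&amp;gt;vsr_check_xnet&amp;lt;/code&amp;gt; stubbed by a precomputed set of accepted pairs; the dict-based modeling is an illustrative assumption.&lt;br /&gt;
```python
# Sketch of the signal generation rule: every (subnet, index) pair in the
# xnet payload receives exactly one signal, ACCEPT if the pair was returned
# by vsr_check_xnet and REJECT otherwise.

ACCEPT, REJECT = "ACCEPT", "REJECT"

def vsr_signals(xnet_payload, accepted_pairs):
    signals = {}
    for subnet, stream in xnet_payload.items():
        for index in stream["msgs"]:
            verdict = ACCEPT if (subnet, index) in accepted_pairs else REJECT
            signals[(subnet, index)] = verdict
    return signals
```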
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the Scheduler and the Hypervisor. It takes messages from the input queues, executes them and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the schedule_and_execute function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order implied by the queue index determines the order in which the messages need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here with respect to a version of the VSR that accepts all messages, while in reality the VSR may reject some messages, e.g., when canisters migrate across subnets or subnets are split. While the possibility that messages can be REJECTed by the VSR would require specific actions by the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition, we define a couple of helper functions. First, we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message))&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
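&lt;br /&gt;
The trivial implementation mentioned in the comment above can be sketched as follows in Python; the dict-based modeling of slices and queues is an assumption for illustration only.&lt;br /&gt;
```python
# Trivial queue_index sketch: iterate per subnet over the slices in stream
# order and append each message to the queue of its destination canister,
# preserving per-stream order. Input keys are (subnet, stream_index);
# output keys are (canister, position).

def queue_index(slices):
    next_pos = {}  # per-destination position counter
    out = {}
    # Sort by (subnet, stream index) so per-stream order is preserved.
    for (subnet, s_idx) in sorted(slices):
        msg = slices[(subnet, s_idx)]
        dst = msg["dst"]
        pos = next_pos.get(dst, 0)
        out[(dst, pos)] = msg
        next_pos[dst] = pos + 1
    return out
```
This satisfies the ENSURES clause above: two messages for the same destination keep their relative stream order in the queue.&lt;br /&gt;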
&lt;br /&gt;
Based on this, we can now define a function that maps the valid XNet messages to queue indexes.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (index ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       (index ↦ msg) ∈ S.streams.msgs&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j ∈ slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
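&lt;br /&gt;
The &amp;lt;code&amp;gt;expected_xnet_indices&amp;lt;/code&amp;gt; update above can be sketched as follows in Python, modeling each index as a (subnet, n) pair so that &amp;lt;code&amp;gt;prefix&amp;lt;/code&amp;gt; corresponds to the subnet component; this flat modeling is an illustrative assumption.&lt;br /&gt;
```python
# Sketch of the expected_xnet_indices update: for streams that appear in
# the batch, the expected index becomes (max index in the batch) + 1; for
# streams absent from the batch, the old expected index is kept.

def update_expected_indices(expected, batch_msgs):
    # batch_msgs: set of (subnet, n) indices present in the batch payload.
    seen_subnets = {subnet for (subnet, _) in batch_msgs}
    kept = {(s, n) for (s, n) in expected if s not in seen_subnets}
    maxima = {}
    for (s, n) in batch_msgs:
        maxima[s] = max(n, maxima.get(s, 0))
    return kept | {(s, n + 1) for (s, n) in maxima.items()}
```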
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
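&lt;br /&gt;
The queue bookkeeping of the execution phase can be sketched as follows in Python; &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; itself is out of scope and is stubbed by its three return values, and the dict-based state modeling is an illustrative assumption.&lt;br /&gt;
```python
# Sketch of the execution-phase transition from the message routing
# perspective: consumed messages are deleted from their queues and
# produced messages are appended to the output queues.

def apply_execution(state, consumed_ingress, consumed_xnet, produced):
    new = dict(state)
    new["ingress_queues"] = {
        k: v for k, v in state["ingress_queues"].items() if k not in consumed_ingress
    }
    new["input_queues"] = {
        k: v for k, v in state["input_queues"].items() if k not in consumed_xnet
    }
    new["output_queues"] = {**state["output_queues"], **produced}
    return new
```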
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the subnet_assignment, puts them into a subnet-to-subnet stream, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, subnet_assignment)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and the subnet_assignment is obtained from a view of the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also define a helper function that produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
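&lt;br /&gt;
Both helpers can be sketched together in Python; the tuple-keyed dict modeling of queues is an illustrative assumption.&lt;br /&gt;
```python
# Sketch of the two helpers: filter drops messages whose destination
# canister has no subnet assignment, and non_existent_canister_replies
# produces a NON_EXISTENT_CANISTER reply on the reverse (dst, src) queue
# for each such dropped message.

NON_EXISTENT_CANISTER = "NON_EXISTENT_CANISTER"

def filter_queues(queues, subnet_assignment):
    return {
        (src, dst, i): msg
        for (src, dst, i), msg in queues.items()
        if dst in subnet_assignment
    }

def non_existent_canister_replies(queues, subnet_assignment):
    return {
        (dst, src, i): NON_EXISTENT_CANISTER
        for (src, dst, i), msg in queues.items()
        if dst not in subnet_assignment
    }
```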
&lt;br /&gt;
&#039;&#039;Non flat streams.&#039;&#039; As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams; this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indexes, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that, based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt; (i.e., a src-dst tuple), decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; (i.e., the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;).&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
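&lt;br /&gt;
The routing step of this transition can be sketched as follows in Python. The sketch assumes per-destination substreams (i.e., &amp;lt;code&amp;gt;substream_id = dst&amp;lt;/code&amp;gt;) and initially empty streams; a real implementation would continue numbering from the current stream head.&lt;br /&gt;
```python
# Sketch of the routing step: messages from the (already filtered) output
# queues are appended to the outgoing streams, keyed by the substream id
# derived from the (src, dst) prefix of the queue index.

def substream_id(src, dst):
    return dst  # per-destination substreams; use () for one flat stream

def route_to_streams(streams, output_queues, subnet_assignment):
    new_streams = dict(streams)
    next_idx = {}
    # Iterate in queue order so canister-to-canister ordering is preserved.
    for (src, dst, i) in sorted(output_queues, key=lambda k: k[2]):
        msg = output_queues[(src, dst, i)]
        target_subnet = subnet_assignment[dst]
        sub = substream_id(src, dst)
        n = next_idx.get((target_subnet, sub), 1)  # assumes empty streams
        new_streams[(target_subnet, sub, n)] = msg
        next_idx[(target_subnet, sub)] = n + 1
    return new_streams
```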
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the invariant that the canister-to-canister ordering of messages is preserved holds when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design; one could hold back messages as long as doing so does not violate the ordering requirement.&lt;br /&gt;
&lt;br /&gt;
== XNet Transfer ==&lt;br /&gt;
After calling &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The &amp;lt;code&amp;gt;XNetTransfer&amp;lt;/code&amp;gt; subcomponent is responsible for enabling this transfer. It consists of&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet.png|thumb|XNet transfer component diagram]]&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which is responsible for serving certified stream slices and making them available to &amp;lt;code&amp;gt;XNetPayloadBuilders&amp;lt;/code&amp;gt; on other subnetworks.&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;, which allows the block makers to obtain an &amp;lt;code&amp;gt;XNetPayload&amp;lt;/code&amp;gt; containing the currently available certified streams originating from other subnetworks. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; obtains those streams by interacting with &amp;lt;code&amp;gt;XNetEndpoints&amp;lt;/code&amp;gt; exposed by other subnets. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; also provides functionality for notaries to verify &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; contained in block proposals.&lt;br /&gt;
&lt;br /&gt;
We do not specify anything about the protocol run between the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; to transfer the streams between two subnetworks. The only requirements we have are that certified streams made available by an &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; of an honest replica on some source subnetwork can be obtained by an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; of an honest replica on the destination subnetwork, and that the information regarding which endpoints to contact is available in the Registry.&lt;br /&gt;
&lt;br /&gt;
=== Properties and Functionality ===&lt;br /&gt;
Assume an XNet transfer component on a replica that is part of subnet &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt;. The interface behavior of the XNet transfer component will guarantee that for any payload produced via&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;get_xnet_payload(registry_version, reference_height, past_payloads, size_limit)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
we have that for any &amp;lt;code&amp;gt;(remote_subnet ↦ css) ∈ payload&amp;lt;/code&amp;gt;: &lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;StateManager.decode_certified_stream(registry_version, own_subnet, remote_subnet, css)&amp;lt;/code&amp;gt; succeeds, i.e., returns a valid slice that is guaranteed to come from remote_subnet.&lt;br /&gt;
&lt;br /&gt;
* Furthermore, for each slice it will hold, as soon as the state corresponding to height &amp;lt;code&amp;gt;h = reference_height + |past_payloads|&amp;lt;/code&amp;gt; is available, that &amp;lt;code&amp;gt;concatenate(remote_subnet, min(dom(slice.msgs.elements))) ∈ StateManager.get_state_at(h).expected_indexes&amp;lt;/code&amp;gt;. This means that the streams will start at the expected indexes stored in the previous state, i.e., they extend the previously seen streams without gaps.&lt;br /&gt;
&lt;br /&gt;
Payloads verified using &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; are accepted if they adhere to those requirements, and are rejected otherwise.&lt;br /&gt;
&lt;br /&gt;
=== XNet Endpoint ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; serves the streams available on some subnet to other subnets. For an implementation this will typically mean that there is some client which will handle querying the API of the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; on the remote subnet in question. We use the following abstraction to avoid explicitly talking about this client: We assume that there is a function &amp;lt;code&amp;gt;get : SubnetId → XNetEndpoint&amp;lt;/code&amp;gt; which will return an appropriate instance of &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which we can directly query using the API described below.&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet-sequence.png|thumb|XNet transfer sequence diagram]]&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;get_stream(subnet_id : SubnetId, begin : StreamIndex, msg_limit : ℕ, size_limit : ℕ) → CertifiedStreamSlice&amp;lt;/code&amp;gt;: Returns the requested certified stream slice in its transport format.&lt;br /&gt;
&lt;br /&gt;
We require that an honest &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;-&amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; pair is able to successfully obtain slices over this API.&lt;br /&gt;
&lt;br /&gt;
Looking at the bigger picture, the intuition for why this will yield a secure system is that in each round a new pair of block maker and endpoint will try to pull over a stream, which, in turn, means that eventually an honest pair will be able to obtain the stream and include it into a block.&lt;br /&gt;
&lt;br /&gt;
=== XNet Payload Builder ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; builds and verifies payloads whenever requested to do so by the block maker. The rules for whether a payload is considered valid must be such that every notary is guaranteed to make the same decision on the same input, and that a payload built by an honest payload builder will be accepted by honest validators. Essentially, the rules resemble what is described in the section on properties and functionality. However, given that the execution may be behind, we cannot directly look up the expected indexes in the appropriate state but need to compute them based on the referenced state and the payloads since then. Below, we provide a figure illustrating the high-level functionality: generally speaking, blocks are considered valid if they adhere to the rules described in the figure and are considered invalid otherwise.&lt;br /&gt;
&lt;br /&gt;
[[File:Payload-building.png|thumb|Rules for payload building]]&lt;br /&gt;
&lt;br /&gt;
Below we formally define the operation of the component. We first define the following helper functions. We assume that &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; has an associated field &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt; which is passed whenever constructing an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;new : SubnetId → Self&lt;br /&gt;
new(own_subnet) :=&lt;br /&gt;
  XNetPayloadBuilder {&lt;br /&gt;
      with&lt;br /&gt;
         └─ own_subnet := own_subnet&lt;br /&gt;
  }&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The API defines the past_payloads as a vector in which the past payloads are ordered with respect to their corresponding height in the chain. While this ordering allows for a more efficient implementation of the functions below, it does not matter on a conceptual level. Hence, we look at it as a set for the sake of simplicity.&lt;br /&gt;
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;slice_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, solely based on a set of Slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Take the maximum index for each individual (sub-)stream in the given set of slices and add&lt;br /&gt;
% 1 to obtain the next indexes one would expect when solely looking at the past payloads but&lt;br /&gt;
% ignoring the state.&lt;br /&gt;
slice_indexes : (SubnetId ↦ StreamSlice) → Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
slice_indexes(slices) := { i + 1 | i ∈ max(dom(slices.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;state_and_payload_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given slices from the past payloads.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Take the expected indexes from the state, remove whatever index appears in the given&lt;br /&gt;
% slices and add the expected indexes according to the streams in the slices.&lt;br /&gt;
%&lt;br /&gt;
% FAIL IF: ∃ i, j ∈ state_and_payload_indexes(state, slices) :&lt;br /&gt;
%              prefix(i) = prefix(j) ∧ postfix(i) ≠ postfix(j)&lt;br /&gt;
%&lt;br /&gt;
state_and_payload_indexes : ReplicatedState ×&lt;br /&gt;
                            (SubnetId ↦ StreamSlice) →&lt;br /&gt;
                            Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
state_and_payload_indexes(state, slices) := state.expected_xnet_indices&lt;br /&gt;
                                            \ dom(slices.msgs.elements)&lt;br /&gt;
                                            ∪ slice_indexes(slices)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
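&lt;br /&gt;
The two functions above can be sketched together in Python, modeling each expected index as a (subnet, n) pair and each slice as the set of message indices seen from that subnet; this flat modeling is an illustrative assumption.&lt;br /&gt;
```python
# Sketch of slice_indexes and state_and_payload_indexes: indexes covered
# by the past payloads override the expected indexes stored in the state.

def slice_indexes(slices):
    # Next expected index per subnet: max index seen in the slices + 1.
    return {(subnet, max(idxs) + 1) for subnet, idxs in slices.items() if idxs}

def state_and_payload_indexes(state_expected, slices):
    seen = {(subnet, n) for subnet, idxs in slices.items() for n in idxs}
    return (state_expected - seen) | slice_indexes(slices)
```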
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;expected_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given past payloads.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Decode the slices in the given payloads and compute the expected indexes using the&lt;br /&gt;
% state_and_payload_indexes function above&lt;br /&gt;
expected_indexes : SubnetId ×&lt;br /&gt;
                   ReplicatedState ×&lt;br /&gt;
                   Set&amp;lt;XNetPayload&amp;gt; →&lt;br /&gt;
                   Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
expected_indexes(own_subnet, state, payloads) :=&lt;br /&gt;
    state_and_payload_indexes(&lt;br /&gt;
        state,&lt;br /&gt;
        { (src ↦ slice) | payload ∈ payloads ∧&lt;br /&gt;
                          (src ↦ cert_slice) ∈ payload ∧&lt;br /&gt;
                          slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                             cert_slice)&lt;br /&gt;
        }&lt;br /&gt;
    )&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Creation of XNet Payloads ====&lt;br /&gt;
Based on the functions above, we are now ready to define the function &amp;lt;code&amp;gt;get_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × ℕ → XNetPayload&amp;lt;/code&amp;gt;. Note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Build an xnet payload containing the currently available streams. The begin index is&lt;br /&gt;
% given by the expected index; if there is no expected index for a given prefix, the index&lt;br /&gt;
% ONE is used.&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: size_of(get_xnet_payload(self, ·, ·, ·, size_limit)) ≤ size_limit ∧&lt;br /&gt;
%          each payload output by get_xnet_payload will be accepted by validate_xnet_payload&lt;br /&gt;
%&lt;br /&gt;
get_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × ℕ → XNetPayload&lt;br /&gt;
get_xnet_payload(self, registry_version, reference_height, past_payloads, size_limit) :=&lt;br /&gt;
    { (remote_subnet ↦ slice) |&lt;br /&gt;
            S = StateManager.get_state_at(reference_height)&lt;br /&gt;
          ∧ subnets = Registry::get_registry_at(registry_version).subnets \ { self.own_subnet }&lt;br /&gt;
          ∧ (remote_subnet, begin_index) ∈&lt;br /&gt;
                  expected_indexes(self.own_subnet, S, past_payloads)&lt;br /&gt;
                ∪ { (subnet_id, StreamIndex::ONE) |&lt;br /&gt;
                        subnet_id ∈ subnets&lt;br /&gt;
                                    \ { s | (s, ·) ∈ expected_indexes(self.own_subnet,&lt;br /&gt;
                                                                      S,&lt;br /&gt;
                                                                      past_payloads)&lt;br /&gt;
                                      }&lt;br /&gt;
                  }&lt;br /&gt;
            % msg_limit and size limit need to be set by the implementation as appropriate&lt;br /&gt;
            % to satisfy the post condition&lt;br /&gt;
          ∧ slice = XNetEndpoint::get(remote_subnet).get_stream(self.own_subnet, begin_index, ·, ·)&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
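&lt;br /&gt;
The begin-index selection in the definition above can be sketched as follows in Python; modeling a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; as a plain integer (with ONE = 1) is an illustrative assumption.&lt;br /&gt;
```python
# Sketch of the begin-index selection in get_xnet_payload: for subnets with
# an expected index, pulling starts there; every other known remote subnet
# defaults to index ONE (the first index of a fresh stream).

STREAM_INDEX_ONE = 1

def begin_indices(expected, subnets, own_subnet):
    remote = set(subnets) - {own_subnet}
    covered = {s for (s, _) in expected}
    begins = {s: i for (s, i) in expected if s in remote}
    for s in remote - covered:
        begins[s] = STREAM_INDEX_ONE
    return begins
```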
&lt;br /&gt;
==== Validation of XNet Payloads ====&lt;br /&gt;
Validation of XNetPayloads works analogously to the creation. The function &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; is defined as follows, where we assume that it evaluates to false in case an error occurs. Again, note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Check whether a given xnet payload was built according to the rules given above.&lt;br /&gt;
%&lt;br /&gt;
% FAIL IF: size_of(payload) &amp;gt; size_limit&lt;br /&gt;
%&lt;br /&gt;
validate_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × XNetPayload × ℕ → Bool&lt;br /&gt;
validate_xnet_payload(self, registry_version, reference_height, past_payloads, payload, size_limit) :=&lt;br /&gt;
    S = StateManager.get_state_at(reference_height) ∧&lt;br /&gt;
    ∀ (remote_subnet ↦ css) ∈ payload :&lt;br /&gt;
    {&lt;br /&gt;
      slice = StateManager.decode_certified_stream(registry_version,&lt;br /&gt;
                                                   self.own_subnet,&lt;br /&gt;
                                                   remote_subnet,&lt;br /&gt;
                                                   css) ∧&lt;br /&gt;
      ∀ index ∈ min(dom(slice.msgs.elements)) :&lt;br /&gt;
      {&lt;br /&gt;
        (remote_subnet, index) ∈ expected_indexes(self.own_subnet, S, past_payloads) ∨&lt;br /&gt;
        index = StreamIndex::ONE&lt;br /&gt;
      }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3415</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3415"/>
		<updated>2022-11-04T06:40:49Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
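&lt;br /&gt;
The batch/state version relation described above can be sketched as follows in Python, where &amp;lt;code&amp;gt;apply_batch&amp;lt;/code&amp;gt; is a hypothetical stand-in for the whole deterministic processing pipeline.&lt;br /&gt;
```python
# Sketch of deterministic batch processing: the state of version x is
# obtained by applying the batch agreed upon at height x to the state of
# version x-1. apply_batch is a stand-in for the real pipeline.

def apply_batch(state, batch):
    # Hypothetical stand-in: bump the version and record the batch height.
    return {"version": state["version"] + 1,
            "processed": state["processed"] + [batch["height"]]}

def replay(genesis, batches):
    state = genesis
    # Batches must be applied in height order to reach the same state on
    # every replica.
    for batch in sorted(batches, key=lambda b: b["height"]):
        state = apply_batch(state, batch)
    return state
```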
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the streams produced on this subnet available to other subnets, and obtaining certified slices of the streams that other subnets produce for this subnet.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document always take state in its implementation-specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operations the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this implicitly defines how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees on top of those the system already provides, but simply needs to make sure that system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets           : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from natural numbers to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
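As an illustration (with hypothetical values), pushing two values into a queue whose &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; is 4 appends them at indices 4 and 5 according to their rank in the input map, irrespective of the keys they carried there:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;q.next_index = 4, values = { 5 ↦ a, 9 ↦ b }&lt;br /&gt;
push(q, values).elements   = q.elements ∪ { (4 ↦ a), (5 ↦ b) }&lt;br /&gt;
push(q, values).next_index = 6&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;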
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
&lt;br /&gt;
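For example (using hypothetical identifiers &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt;), applying the map variant of &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; to a map &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; containing only a queue for &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; pushes onto the existing queue for &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; and creates a fresh queue for &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;q = { (a ↦ q_a) }, v = { ((a, 7) ↦ m_1), ((b, 2) ↦ m_2) }&lt;br /&gt;
push(q, v) = { (a ↦ push(q_a, { 7 ↦ m_1 })),&lt;br /&gt;
               (b ↦ push(Queue&amp;lt;T&amp;gt;::new_queue, { 2 ↦ m_2 })) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;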
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an &amp;lt;code&amp;gt;Index i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an &amp;lt;code&amp;gt;Index i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
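As a concrete illustration, consider a hypothetical &amp;lt;code&amp;gt;Index i := (x, y, 4)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;prefix(i)  = (x, y)&lt;br /&gt;
postfix(i) = 4&lt;br /&gt;
i + 1      = (x, y, 5)&lt;br /&gt;
(x, y, 4) ≤ (x, y, 5)     // comparable: same prefix, 4 ≤ 5&lt;br /&gt;
(x, y, 4) vs. (x, z, 1)   // incomparable: different prefixes&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;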
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet a stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and the end of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: A &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings are decoded into a format suitable for processing, and in which fields that are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the own subnet ID and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layers. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other checks, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component, which is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions representing the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
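Putting the pieces together (and ignoring the coordination and commit steps, which happen outside the deterministic state machine), one round of deterministic batch processing can be sketched as the composition of these three functions, where &amp;lt;code&amp;gt;process_batch&amp;lt;/code&amp;gt; is a hypothetical name used only for this sketch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;process_batch(S, b) :=&lt;br /&gt;
    post_process(&lt;br /&gt;
        execute(&lt;br /&gt;
            pre_process(S, subnet_assignment, decode(own_subnet, b))),&lt;br /&gt;
        subnet_assignment)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;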
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch that are accepted by the VSR. The set is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
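For illustration (with hypothetical messages), suppose the batch carries three ingress messages of which the VSR accepts the first and the third; the accepted messages are then re-indexed by their rank within the accepted set:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;batch.ingress_payload = { 1 ↦ m_1, 2 ↦ m_2, 3 ↦ m_3 }&lt;br /&gt;
vsr_check_ingress(state, batch) = { 1, 3 }&lt;br /&gt;
VSR(state, batch).ingress = { ((m_1.dst, 1) ↦ m_1),&lt;br /&gt;
                              ((m_3.dst, 2) ↦ m_3) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;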
The VSR for cross-net messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_xnet : CanonicalState × Batch → Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function that determines the indices of the messages in the individual substreams contained in &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt; to be inducted.&lt;br /&gt;
&lt;br /&gt;
We require that the implementation of the VSR (or the layer above) makes sure that all reply messages are accepted by the VSR. Formally this means that for any valid State-Batch combination &amp;lt;code&amp;gt;(s, b)&amp;lt;/code&amp;gt; it holds that for all &amp;lt;code&amp;gt;(subnet, index)&amp;lt;/code&amp;gt; so that &amp;lt;code&amp;gt;b.xnet_payload[subnet].msgs[index]&amp;lt;/code&amp;gt; is a reply message that &amp;lt;code&amp;gt;(subnet, index) ∈ vsr_check_xnet(s, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Based on this rule, one can straightforwardly define the interface behavior of the VSR.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).xnet :=&lt;br /&gt;
  { (index ↦ msg) |&lt;br /&gt;
      (index ↦ msg) ∈ batch.xnet_payload.msgs ∧&lt;br /&gt;
      index ∈ vsr_check_xnet(state, batch)&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
VSR(state, batch).signals :=&lt;br /&gt;
    { (concatenate(subnet, index) ↦ ACCEPT) |&lt;br /&gt;
      (subnet ↦ stream) ∈ batch.xnet_payload ∧&lt;br /&gt;
      (index ↦ msg) ∈ stream.msgs ∧&lt;br /&gt;
      (subnet, index) ∈ vsr_check_xnet(state, batch)&lt;br /&gt;
    }&lt;br /&gt;
  ∪ { (concatenate(subnet, index) ↦ REJECT) |&lt;br /&gt;
      (subnet ↦ stream) ∈ batch.xnet_payload ∧&lt;br /&gt;
      (index ↦ msg) ∈ stream.msgs ∧&lt;br /&gt;
      (subnet, index) ∉ vsr_check_xnet(state, batch)&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR that accepts all messages, while in reality the VSR may reject some messages when canisters migrate across subnets or subnets are split. So while the possibility that messages can be REJECTed by the VSR would require specific action by the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition, we define a couple of helper functions. First, we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message))&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
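The trivial implementation mentioned in the comment above can be written down in Python as follows (a sketch under simplified types: slices as a map from subnet to a list of messages in slice order, each message carrying a &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; field; the per-queue position plays the role of the natural-number component of the &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;).&lt;br /&gt;

```python
def queue_index(slices):
    # Iterate over the slices per subnet and, per slice, over the messages in
    # the order they appear; push each message onto the queue of its
    # destination canister. Returns {(dst_canister, position): message}.
    positions = {}  # next free queue position per destination canister
    result = {}
    for subnet in sorted(slices):
        for msg in slices[subnet]:
            dst = msg["dst"]
            pos = positions.get(dst, 0)
            result[(dst, pos)] = msg
            positions[dst] = pos + 1
    return result
```

Per the ENSURES clause above, two messages for the same destination keep their relative slice order in the resulting queue positions.&lt;br /&gt;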
&lt;br /&gt;
Based on this we can now define a function that maps over the indexes of the valid XNet messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
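Under the same simplified modelling, &amp;lt;code&amp;gt;map_valid_xnet_messages&amp;lt;/code&amp;gt; can be sketched in Python as dropping messages whose sender is not assigned to the subnet the slice came from, then assigning queue positions in slice order (illustrative only; each message is assumed to carry &amp;lt;code&amp;gt;src&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; fields).&lt;br /&gt;

```python
def map_valid_xnet_messages(slices, subnet_assignment):
    # slices: {subnet: [messages in slice order]};
    # subnet_assignment: {canister_id: subnet_id}.
    # Returns {(dst_canister, position): message}.
    positions, result = {}, {}
    for subnet in sorted(slices):
        for msg in slices[subnet]:
            if subnet_assignment.get(msg["src"]) != subnet:
                continue  # sender not assigned to this subnet: not inducted
            dst = msg["dst"]
            pos = positions.get(dst, 0)
            result[(dst, pos)] = msg
            positions[dst] = pos + 1
    return result
```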
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (index ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                        j = slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
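The update of &amp;lt;code&amp;gt;expected_xnet_indices&amp;lt;/code&amp;gt; can be illustrated in isolation: for every stream that contributed messages to the batch, the next expected index becomes the highest index seen plus one, while the expected indexes of streams untouched by the batch are kept. The following Python sketch assumes stream indexes are modelled as plain &amp;lt;code&amp;gt;(subnet, n)&amp;lt;/code&amp;gt; pairs.&lt;br /&gt;

```python
def update_expected_indices(expected, batch_msgs):
    # expected: set of (subnet, index) pairs carried in the state.
    # batch_msgs: {(subnet, index): msg} for all xnet messages in the batch.
    touched = {subnet for (subnet, _) in batch_msgs}
    # Keep expected indexes of streams the batch did not touch.
    kept = {(s, i) for (s, i) in expected if s not in touched}
    # For each touched stream, expect the successor of the highest index seen.
    nxt = {(subnet, max(i for (s, i) in batch_msgs if s == subnet) + 1)
           for subnet in touched}
    return kept | nxt
```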
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
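From the perspective of the queues, the execution phase boils down to three bulk updates, which the following Python sketch illustrates (queues as plain dicts, and the three results of &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; passed in as precomputed arguments; a simplification of the interface, not the actual one).&lt;br /&gt;

```python
def execute_round(ingress_queues, input_queues, output_queues,
                  consumed_ingress, consumed_xnet, produced):
    # Remove consumed ingress messages from the ingress queues.
    ingress_queues = {k: v for k, v in ingress_queues.items()
                      if k not in consumed_ingress}
    # Remove consumed canister-to-canister messages from the input queues.
    input_queues = {k: v for k, v in input_queues.items()
                    if k not in consumed_xnet}
    # Append the produced messages to the output queues.
    output_queues = {**output_queues, **produced}
    return ingress_queues, input_queues, output_queues
```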
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;, puts them into the appropriate subnet-to-subnet streams, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, subnet_assignment)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; is the canister-to-subnet assignment derived from a view of the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A second helper produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
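Both helpers can be sketched together in Python, with queue indexes modelled as &amp;lt;code&amp;gt;(src, dst, i)&amp;lt;/code&amp;gt; triples (an illustrative simplification of the types used above).&lt;br /&gt;

```python
def filter_queues(queues, subnet_assignment):
    # Drop messages addressed to canisters with no subnet assignment.
    return {(src, dst, i): msg
            for (src, dst, i), msg in queues.items()
            if dst in subnet_assignment}

def non_existent_canister_replies(queues, subnet_assignment):
    # For every dropped message, produce a NON_EXISTENT_CANISTER reply
    # addressed back to the sender (note the swapped src/dst in the index).
    return {(dst, src, i): "NON_EXISTENT_CANISTER"
            for (src, dst, i), msg in queues.items()
            if dst not in subnet_assignment}
```

The two functions partition the output-queue entries: every message is either routed onward or answered with a reply, never both.&lt;br /&gt;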
&lt;br /&gt;
&#039;&#039;Non flat streams.&#039;&#039; As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams—​this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indexes as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that—​based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, i.e., a src-dst tuple—​decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, i.e., the SubstreamId.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
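An implementation would fix exactly one of the two variants; as a Python sketch (hypothetical function names, one per variant):&lt;br /&gt;

```python
# Flat streams: a single substream per destination subnet, so the
# SubstreamId carries no information (unit value, modelled as an empty tuple).
def substream_id_flat(src, dst):
    return ()

# Per-destination substreams: the SubstreamId is the destination canister.
def substream_id_per_dest(src, dst):
    return dst
```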
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
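The routing of output-queue messages into streams can be sketched in Python as follows, assuming flat streams modelled as per-subnet lists and queue indexes modelled as &amp;lt;code&amp;gt;(src, dst, i)&amp;lt;/code&amp;gt; triples; only the per-queue message order is guaranteed here, matching the canister-to-canister ordering requirement.&lt;br /&gt;

```python
def route_to_streams(streams, output_queues, subnet_assignment):
    # streams: {subnet: [messages]}; output_queues: {(src, dst, i): msg}.
    # Append each message to the stream of the subnet hosting its destination
    # canister, preserving the per-queue order implied by the index i.
    streams = {s: list(msgs) for s, msgs in streams.items()}
    for (src, dst, i) in sorted(output_queues, key=lambda k: k[2]):
        target = subnet_assignment.get(dst)
        if target is None:
            continue  # handled separately via NON_EXISTENT_CANISTER replies
        streams.setdefault(target, []).append(output_queues[(src, dst, i)])
    return streams
```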
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the invariant is maintained that the canister-to-canister ordering of messages is preserved when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.&lt;br /&gt;
&lt;br /&gt;
== XNet Transfer ==&lt;br /&gt;
After calling &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The &amp;lt;code&amp;gt;XNetTransfer&amp;lt;/code&amp;gt; subcomponent is responsible for enabling this transfer. It consists of&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet.png|thumb|XNet transfer component diagram]]&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which is responsible for serving certified stream slices and making them available to &amp;lt;code&amp;gt;XNetPayloadBuilders&amp;lt;/code&amp;gt; on other subnetworks.&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;, which allows the block makers to obtain an &amp;lt;code&amp;gt;XNetPayload&amp;lt;/code&amp;gt; containing the currently available certified streams originating from other subnetworks. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; obtains those streams by interacting with &amp;lt;code&amp;gt;XNetEndpoints&amp;lt;/code&amp;gt; exposed by other subnets. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; also provides functionality for notaries to verify &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; contained in block proposals.&lt;br /&gt;
&lt;br /&gt;
We do not specify anything about the protocol run between the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; to transfer the streams between two subnetworks. The only requirements we have are that certified streams made available by an &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; of an honest replica on some source subnetwork can be obtained by an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; of an honest replica on the destination subnetwork, and that the information regarding which endpoints to contact is available in the Registry.&lt;br /&gt;
&lt;br /&gt;
=== Properties and Functionality ===&lt;br /&gt;
Assume an XNet transfer component on a replica that is part of subnet &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt;. The interface behavior of the XNet transfer component guarantees that for any payload produced via&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;get_xnet_payload(registry_version, reference_height, past_payloads, size_limit)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
we have that for any &amp;lt;code&amp;gt;(remote_subnet ↦ css) ∈ payload&amp;lt;/code&amp;gt;: &lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;StateManager.decode_certified_stream(registry_version, own_subnet, remote_subnet, css)&amp;lt;/code&amp;gt; succeeds, i.e., returns a valid slice that is guaranteed to come from &amp;lt;code&amp;gt;remote_subnet&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Furthermore, for each slice it will hold that, as soon as the state corresponding to height &amp;lt;code&amp;gt;h = reference_height + |past_payloads|&amp;lt;/code&amp;gt; is available, &amp;lt;code&amp;gt;concatenate(remote_subnet, min(dom(slice.msgs.elements))) ∈ StateManager.get_state_at(h).expected_indexes&amp;lt;/code&amp;gt;. This means that the streams will start with the expected indexes stored in the previous state, i.e., they extend the previously seen streams without gaps.&lt;br /&gt;
&lt;br /&gt;
Payloads verified using &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; are accepted if they adhere to those requirements, and are rejected otherwise.&lt;br /&gt;
&lt;br /&gt;
=== XNet Endpoint ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; serves the streams available on some subnet to other subnets. For an implementation this will typically mean that there is some client which will handle querying the API of the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; on the remote subnet in question. We use the following abstraction to avoid explicitly talking about this client: We assume that there is a function &amp;lt;code&amp;gt;get : SubnetId → XNetEndpoint&amp;lt;/code&amp;gt; which will return an appropriate instance of &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which we can directly query using the API described below.&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet-sequence.png|thumb|XNet transfer sequence diagram]]&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;get_stream(subnet_id : SubnetId, begin : StreamIndex, msg_limit : ℕ, size_limit : ℕ) → CertifiedStreamSlice&amp;lt;/code&amp;gt;: Returns the requested certified stream slice in its transport format.&lt;br /&gt;
&lt;br /&gt;
We require that an honest &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;-&amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; pair is able to successfully obtain slices over this API.&lt;br /&gt;
&lt;br /&gt;
Looking at the bigger picture, the intuition for why this will yield a secure system is that in each round a new pair of block maker and endpoint will try to pull over a stream, which, in turn, means that eventually an honest pair will be able to obtain the stream and include it into a block.&lt;br /&gt;
&lt;br /&gt;
=== XNet Payload Builder ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; builds and verifies payloads whenever requested to do so by the block maker. The rules for whether a payload is considered valid must be such that every notary is guaranteed to make the same decision on the same input and that a payload built by an honest payload builder will be accepted by honest validators. Essentially, the rules resemble what is described in the section on properties and functionality. However, given that execution may lag behind, we cannot directly look up the expected indexes in the appropriate state but need to compute them based on the referenced state and the payloads since then. Below, we provide a figure illustrating the high-level functionality: generally speaking, blocks are considered valid if they adhere to the rules described in the figure and are considered invalid otherwise.&lt;br /&gt;
&lt;br /&gt;
[[File:Payload-building.png|thumb|Rules for payload building]]&lt;br /&gt;
&lt;br /&gt;
Below we formally define the operation of the component. We first define the following helper functions. We assume that &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; has an associated field &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt; which is passed whenever constructing an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;new : SubnetId → Self&lt;br /&gt;
new(own_subnet) :=&lt;br /&gt;
  XNetPayloadBuilder {&lt;br /&gt;
      with&lt;br /&gt;
         └─ own_subnet := own_subnet&lt;br /&gt;
  }&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The API defines &amp;lt;code&amp;gt;past_payloads&amp;lt;/code&amp;gt; as a vector in which the past payloads are ordered with respect to the corresponding height in the chain. While this ordering allows for a more efficient implementation of the functions below, it does not matter on a conceptual level. Hence, for the sake of simplicity, we treat it as a set.&lt;br /&gt;
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;slice_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, solely based on a set of Slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Take the maximum index for each individual (sub-)stream in the given set of slices and add&lt;br /&gt;
% 1 to obtain the next indexes one would expect when solely looking at the past payloads but&lt;br /&gt;
% ignoring the state.&lt;br /&gt;
slice_indexes : (SubnetId ↦ StreamSlice) → Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
slice_indexes(slices) := { i + 1 | i ∈ max(dom(slices.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
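A Python sketch of &amp;lt;code&amp;gt;slice_indexes&amp;lt;/code&amp;gt;, with slices modelled as a map from subnet to messages indexed by a natural-number stream index (illustrative only):&lt;br /&gt;

```python
def slice_indexes(slices):
    # slices: {subnet: {stream_index: msg}}. For each non-empty slice take
    # the highest message index and expect its successor.
    return {(subnet, max(msgs) + 1)
            for subnet, msgs in slices.items() if msgs}
```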
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;state_and_payload_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given slices from the past payloads.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Take the expected indexes from the state, remove whatever index appears in the given&lt;br /&gt;
% slices and add the expected indexes according to the streams in the slices.&lt;br /&gt;
%&lt;br /&gt;
% FAIL IF: ∃ i, j ∈ state_and_payload_indexes(state, slices) :&lt;br /&gt;
%              prefix(i) = prefix(j) ∧ postfix(i) ≠ postfix(j)&lt;br /&gt;
%&lt;br /&gt;
state_and_payload_indexes : ReplicatedState ×&lt;br /&gt;
                            (SubnetId ↦ StreamSlice) →&lt;br /&gt;
                            Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
state_and_payload_indexes(state, slices) := state.expected_xnet_indices&lt;br /&gt;
                                            \ dom(slices.msgs.elements)&lt;br /&gt;
                                            ∪ slice_indexes(slices)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
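A matching Python sketch of &amp;lt;code&amp;gt;state_and_payload_indexes&amp;lt;/code&amp;gt; under the same simplified types (expected indexes as a set of &amp;lt;code&amp;gt;(subnet, index)&amp;lt;/code&amp;gt; pairs, slices as per-subnet maps of indexed messages):&lt;br /&gt;

```python
def state_and_payload_indexes(expected, slices):
    # expected: set of (subnet, index) taken from the state. Remove every
    # index that already occurs as a message index in the slices, then add
    # the successor indexes derived from the slices themselves.
    seen = {(subnet, i) for subnet, msgs in slices.items() for i in msgs}
    nxt = {(subnet, max(msgs) + 1)
           for subnet, msgs in slices.items() if msgs}
    return (expected - seen) | nxt
```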
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;expected_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given past payloads.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Decode the slices in the given past payloads and compute the expected indexes using&lt;br /&gt;
% the state_and_payload_indexes function above&lt;br /&gt;
expected_indexes : SubnetId ×&lt;br /&gt;
                   ReplicatedState ×&lt;br /&gt;
                   Set&amp;lt;XNetPayload&amp;gt; →&lt;br /&gt;
                   Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
expected_indexes(own_subnet, state, payloads) :=&lt;br /&gt;
    state_and_payload_indexes(&lt;br /&gt;
        state,&lt;br /&gt;
        { (src ↦ slice) | payload ∈ payloads ∧&lt;br /&gt;
                          (src ↦ cert_slice) ∈ payload ∧&lt;br /&gt;
                          slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                             cert_slice)&lt;br /&gt;
        }&lt;br /&gt;
    )&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Creation of XNet Payloads ====&lt;br /&gt;
Based on the functions above, we are now ready to define the function &amp;lt;code&amp;gt;get_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × ℕ → XNetPayload&amp;lt;/code&amp;gt;. Note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Build an xnet payload containing the currently available streams. The begin index is&lt;br /&gt;
% given by the expected index; if there is no expected index for a given prefix, the index&lt;br /&gt;
% ONE is used.&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: size_of(get_xnet_payload(self, ·, ·, ·, size_limit)) ≤ size_limit&lt;br /&gt;
%&lt;br /&gt;
get_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × ℕ → XNetPayload&lt;br /&gt;
get_xnet_payload(self, registry_version, reference_height, past_payloads, size_limit) :=&lt;br /&gt;
    { (remote_subnet ↦ slice) |&lt;br /&gt;
            S = StateManager.get_state_at(reference_height)&lt;br /&gt;
          ∧ subnets = Registry::get_registry_at(registry_version).subnets \ { self.own_subnet }&lt;br /&gt;
          ∧ (remote_subnet, begin_index) ∈&lt;br /&gt;
                  expected_indexes(self.own_subnet, S, past_payloads)&lt;br /&gt;
                ∪ { (subnet_id, StreamIndex::ONE) |&lt;br /&gt;
                        subnet_id ∈ subnets&lt;br /&gt;
                                    \ { s | (s, ·) ∈ expected_indexes(self.own_subnet,&lt;br /&gt;
                                                                      S,&lt;br /&gt;
                                                                      past_payloads)&lt;br /&gt;
                                      }&lt;br /&gt;
                  }&lt;br /&gt;
            % msg_limit and size limit need to be set by the implementation as appropriate&lt;br /&gt;
            % to satisfy the post condition&lt;br /&gt;
          ∧ slice = XNetEndpoint::get(remote_subnet).get_stream(self.own_subnet, begin_index, ·, ·)&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Validation of XNet Payloads ====&lt;br /&gt;
Validation of XNetPayloads works analogously to the creation. The function &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; is defined as follows, where we assume that it evaluates to false in case an error occurs. Again, note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Check whether a given xnet payload was built according to the rules given above.&lt;br /&gt;
%&lt;br /&gt;
% FAIL IF: size_of(payload) &amp;gt; size_limit&lt;br /&gt;
%&lt;br /&gt;
validate_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × XNetPayload × ℕ → Bool&lt;br /&gt;
validate_xnet_payload(self, registry_version, reference_height, past_payloads, payload, size_limit) :=&lt;br /&gt;
    S = StateManager.get_state_at(reference_height) ∧&lt;br /&gt;
    ∀ (remote_subnet ↦ css) ∈ payload :&lt;br /&gt;
    {&lt;br /&gt;
      slice = StateManager.decode_certified_stream(registry_version,&lt;br /&gt;
                                                   self.own_subnet,&lt;br /&gt;
                                                   remote_subnet,&lt;br /&gt;
                                                   css) ∧&lt;br /&gt;
      ∀ index ∈ min(dom(slice.msgs.elements)) :&lt;br /&gt;
      {&lt;br /&gt;
        (remote_subnet, index) ∈ expected_indexes(self.own_subnet, S, past_payloads) ∨&lt;br /&gt;
        index = StreamIndex::ONE&lt;br /&gt;
      }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3414</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3414"/>
		<updated>2022-11-04T06:40:27Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the certified streams of messages produced on one subnet available to the subnets they are addressed to.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any new guarantees on top of those the system provides, but simply needs to make sure that system invariants are preserved. Those system invariants include:&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended in the order implied by their keys. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
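The queue operations and the map-lifted &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; construction above can be sketched as follows. This is an illustrative model of the notation only, not the actual implementation; the class and function names are chosen here for illustration.&lt;br /&gt;

```python
class Queue:
    """Model of the abstract Queue<T>: a rolling next_index plus a
    partial map (dict) from indices to elements."""

    def __init__(self):
        self.next_index = 1  # index of the next message to be inserted
        self.elements = {}

    def push(self, values):
        """Append the entries of `values` in the order of their keys;
        only the rank of a key matters, not the key itself."""
        for rank, key in enumerate(sorted(values)):
            self.elements[self.next_index + rank] = values[key]
        self.next_index += len(values)

    def delete(self, values):
        """Remove the given index -> element entries; next_index is kept."""
        for index in values:
            assert self.elements[index] == values[index]  # values ⊆ elements
            del self.elements[index]

    def clear(self):
        """Drop all elements; next_index is kept."""
        self.elements = {}


def push_map(queues, values):
    """Lift push over a map of queues (the f_map construction): `values`
    maps (id, n) -> element; ids without an existing queue get a new one."""
    per_id = {}
    for (qid, n), element in sorted(values.items()):
        per_id.setdefault(qid, {})[n] = element
    for qid, vals in per_id.items():
        queues.setdefault(qid, Queue()).push(vals)
    return queues
```

Note how pushing a map keyed &amp;lt;code&amp;gt;{7, 9}&amp;lt;/code&amp;gt; still lands the elements at indices &amp;lt;code&amp;gt;1, 2&amp;lt;/code&amp;gt;: as in the definition of &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt;, only the rank of a key within the domain matters.&lt;br /&gt;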
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be a sequence of arbitrary length, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an &amp;lt;code&amp;gt;Index i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an &amp;lt;code&amp;gt;Index i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
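As a small illustration of these definitions, an index can be modeled as a tuple whose last component is a natural number (a sketch only; the function names are ours, not taken from the implementation):&lt;br /&gt;

```python
# An Index is modeled as a tuple whose last component is a natural number.

def prefix(i):
    """All elements of the index except the last one."""
    return i[:-1]

def postfix(i):
    """The last element of the index; required to be a natural number."""
    return i[-1]

def increment(i):
    """i + 1: keep the prefix, increment the trailing sequence number."""
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    """i <= j holds only for comparable indices (equal prefixes)."""
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)
```

For incomparable indices (different prefixes), neither &amp;lt;code&amp;gt;leq(i, j)&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;leq(j, i)&amp;lt;/code&amp;gt; holds, so indices form a partial order, not a total one.&lt;br /&gt;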
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState,&lt;br /&gt;
    witness   : Witness,&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings have been decoded into a format suitable for processing, and in which parts that are not required inside the deterministic state machine have been stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the subnet&#039;s own ID, decodes the given batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine composed of the message routing and execution layers. This includes:&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message m in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
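The coordination steps above can be sketched as a pure function over the three phases. The phase implementations are passed in as parameters here, since this is only a sketch of the control flow, not of the actual API:&lt;br /&gt;

```python
def run_round(state, decoded_batch, subnet_assignment,
              pre_process, execute, post_process):
    """One deterministic processing round: induction, execution and
    XNet routing, in that order. The caller then commits the result
    to the state manager (cf. commit_and_certify)."""
    state = pre_process(state, subnet_assignment, decoded_batch)
    state = execute(state)
    state = post_process(state, subnet_assignment)
    return state
```

With trivial phases that merely record their invocation, the fixed ordering of the three phases can be observed directly.&lt;br /&gt;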
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
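A minimal sketch of the routing step of the XNet phase, assuming output queues have already been drained into per-queue message lists. The data layout here is simplified and hypothetical; it only illustrates that messages are grouped into streams by the destination canister&#039;s subnet while per-queue order is preserved.&lt;br /&gt;

```python
def route_to_streams(output_queues, subnet_assignment):
    """Move messages from canister-to-canister output queues into
    subnet-to-subnet streams, preserving per-queue order.

    output_queues maps (src_canister, dst_canister) -> [messages];
    the destination subnet is looked up in subnet_assignment."""
    streams = {}
    for (src, dst), messages in sorted(output_queues.items()):
        dst_subnet = subnet_assignment[dst]
        streams.setdefault(dst_subnet, []).extend(messages)
    return streams
```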
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the indices of the messages in the ingress payload accepted by the VSR. The result is a possibly empty partial map of index-message tuples corresponding to the accepted messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch; which messages are accepted is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
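The construction above can be mirrored in a short sketch: given the batch&#039;s ingress payload and the index set accepted by &amp;lt;code&amp;gt;vsr_check_ingress&amp;lt;/code&amp;gt;, the accepted messages are re-keyed by destination and rank. The names and data layout here are illustrative only.&lt;br /&gt;

```python
def vsr_ingress(ingress_payload, accepted_indices):
    """Re-key the accepted ingress messages as (destination, rank),
    where rank is the position of the original index within the
    sorted set of accepted indices (1-based, as in the definition)."""
    accepted = sorted(i for i in ingress_payload if i in accepted_indices)
    return {
        (ingress_payload[i]["dst"], rank): ingress_payload[i]
        for rank, i in enumerate(accepted, start=1)
    }
```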
&lt;br /&gt;
The VSR for cross-net messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_xnet : CanonicalState × Batch → Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function that determines the indices of the messages in the individual substreams contained in &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt; to be inducted.&lt;br /&gt;
&lt;br /&gt;
We require that the implementation of the VSR (or the layer above) makes sure that all reply messages are accepted by the VSR. Formally, this means that for any valid state-batch combination &amp;lt;code&amp;gt;(s, b)&amp;lt;/code&amp;gt;, and for all &amp;lt;code&amp;gt;(subnet, index)&amp;lt;/code&amp;gt; such that &amp;lt;code&amp;gt;b.xnet_payload[subnet].msgs[index]&amp;lt;/code&amp;gt; is a reply message, it holds that &amp;lt;code&amp;gt;(subnet, index) ∈ vsr_check_xnet(s, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Based on this rule one can straightforwardly define the interface behavior of the VSR.&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).xnet :=&lt;br /&gt;
  { (index ↦ msg) |&lt;br /&gt;
      (index ↦ msg) ∈ batch.xnet_payload.msgs ∧&lt;br /&gt;
      index ∈ vsr_check_xnet(state, batch)&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
VSR(state, batch).signals :=&lt;br /&gt;
    { (concatenate(subnet, index) ↦ ACCEPT) |&lt;br /&gt;
      (subnet ↦ stream) ∈ batch.xnet_payload ∧&lt;br /&gt;
      (index ↦ msg) ∈ stream.msgs ∧&lt;br /&gt;
      (subnet, index) ∈ vsr_check_xnet(state, batch)&lt;br /&gt;
    }&lt;br /&gt;
  ∪ { (concatenate(subnet, index) ↦ REJECT) |&lt;br /&gt;
      (subnet ↦ stream) ∈ batch.xnet_payload ∧&lt;br /&gt;
      (index ↦ msg) ∈ stream.msgs ∧&lt;br /&gt;
      (subnet, index) ∉ vsr_check_xnet(state, batch)&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
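&lt;br /&gt;
As an illustration, the three return values could be modeled as a simple record; the field names follow the text above, while the dictionary-based types are modeling assumptions.&lt;br /&gt;

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionResult:
    """Change set produced by scheduler and hypervisor (illustrative model)."""
    # IngressIndex ↦ Message: all consumed ingress messages
    consumed_ingress_messages: dict = field(default_factory=dict)
    # QueueIndex ↦ Message: all consumed cross-net messages
    consumed_xnet_messages: dict = field(default_factory=dict)
    # QueueIndex ↦ Message: all produced messages; the order implied by the
    # queue index determines the order of appending to the queues
    produced_messages: dict = field(default_factory=dict)
```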
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here with respect to a version of the VSR that accepts all messages; in reality, the VSR may reject some messages when canisters migrate across subnets or subnets are split. While messages being REJECTed by the VSR would require specific actions by the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition, we define a couple of helper functions. First, we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦ Message) → ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
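&lt;br /&gt;
The trivial implementation sketched in the comment might look as follows in Python; the message layout (a &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; field naming the destination canister) is a modeling assumption.&lt;br /&gt;

```python
def queue_index(slices_msgs):
    """Assign queue positions to incoming stream messages (illustrative).

    slices_msgs: dict mapping (subnet_id, stream_index) to a message, where
    each message has a "dst" key naming the destination canister.
    Iterates in (subnet, stream index) order and appends each message to the
    queue of its destination canister, so per-destination order is kept.
    """
    next_pos = {}   # destination canister ↦ next free queue position
    result = {}     # (canister_id, n) ↦ message
    for key in sorted(slices_msgs):
        msg = slices_msgs[key]
        dst = msg["dst"]
        pos = next_pos.get(dst, 0)
        result[(dst, pos)] = msg
        next_pos[dst] = pos + 1
    return result
```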
&lt;br /&gt;
Based on this, we can now define a function that maps the valid XNet messages to queue indexes.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : ((SubnetId × StreamIndex) ↦ Message) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(msgs, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | ((subnet, index) ↦ m) ∈ msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
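&lt;br /&gt;
A Python sketch of this filtering step, with a minimal inline queue assignment standing in for &amp;lt;code&amp;gt;queue_index&amp;lt;/code&amp;gt;; the dictionary layout is a modeling assumption.&lt;br /&gt;

```python
def map_valid_xnet_messages(msgs, subnet_assignment):
    """Filter accepted XNet messages and assign queue indexes (sketch).

    msgs: dict (subnet_id, stream_index) ↦ message; each message carries
    "src" (sending canister) and "dst" (receiving canister).
    subnet_assignment: dict canister_id ↦ subnet_id.
    Keeps only messages whose sender is actually assigned to the subnet the
    message arrived from, then assigns per-destination queue positions in
    stream order (a minimal stand-in for queue_index).
    """
    valid = {k: m for k, m in msgs.items()
             if subnet_assignment.get(m["src"]) == k[0]}
    next_pos, queues = {}, {}
    for key in sorted(valid):
        m = valid[key]
        pos = next_pos.get(m["dst"], 0)
        queues[(m["dst"], pos)] = m
        next_pos[m["dst"]] = pos + 1
    return queues
```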
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (index ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       (i ↦ msg) ∈ S.streams[subnet].msgs ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := (S.streams.signals&lt;br /&gt;
                                ∪ VSR(S, b).signals)&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j ∈ slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
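&lt;br /&gt;
A much-simplified Python sketch of this transition, covering only the two queue updates and the garbage collection of acknowledged stream messages; signal bookkeeping and expected-index updates are elided, and all field names are modeling assumptions.&lt;br /&gt;

```python
import copy

def pre_process(state, accepted_ingress, accepted_xnet_queued, acked_indices):
    """Simplified induction-phase transition (illustrative sketch).

    accepted_ingress: list of ingress messages accepted by the VSR.
    accepted_xnet_queued: dict queue_index ↦ message, already run through
    map_valid_xnet_messages.
    acked_indices: set of (subnet, stream_index) keys for which the remote
    subnet sent a signal, i.e., stream messages that can be garbage collected.
    """
    s = copy.deepcopy(state)
    # Append accepted ingress messages to the ingress queues
    s["ingress_queues"].extend(accepted_ingress)
    # Append accepted canister-to-canister messages to the input queues
    s["input_queues"].update(accepted_xnet_queued)
    # Garbage collect stream messages acknowledged by the target subnet
    s["streams_msgs"] = {k: v for k, v in s["streams_msgs"].items()
                         if k not in acked_indices}
    return s
```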
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
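&lt;br /&gt;
From the perspective of message routing, the execution phase merely moves messages between queues, which the following Python sketch illustrates under the same modeling assumptions (dictionary-based queues, hypothetical field names).&lt;br /&gt;

```python
import copy

def execute_phase(state, result):
    """Apply the scheduler/hypervisor change set to the queues (sketch).

    result: dict with keys "consumed_ingress_messages",
    "consumed_xnet_messages" and "produced_messages", mirroring the return
    values of schedule_and_execute.
    """
    s = copy.deepcopy(state)
    # Delete consumed ingress messages from the ingress queues
    for idx in result["consumed_ingress_messages"]:
        s["ingress_queues"].pop(idx, None)
    # Delete consumed canister-to-canister messages from the input queues
    for idx in result["consumed_xnet_messages"]:
        s["input_queues"].pop(idx, None)
    # Append produced messages to the output queues in queue-index order
    for idx in sorted(result["produced_messages"]):
        s["output_queues"][idx] = result["produced_messages"][idx]
    return s
```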
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the subnet assignment, puts them into the subnet-to-subnet streams, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, subnet_assignment)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; is the canister-to-subnet assignment given by a view of the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following helper produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
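&lt;br /&gt;
The two helpers can be illustrated together with a small Python sketch; note how a reply is keyed &amp;lt;code&amp;gt;(dst, src, i)&amp;lt;/code&amp;gt;, i.e., with source and destination swapped, so that it lands in the input queue of the original sender. The dictionary layout is a modeling assumption.&lt;br /&gt;

```python
def split_by_existence(queue_elements, subnet_assignment):
    """Split output-queue elements into routable messages and
    NON_EXISTENT_CANISTER replies (illustrative sketch).

    queue_elements: dict (src, dst, i) ↦ message.
    subnet_assignment: dict canister_id ↦ subnet_id.
    Returns (kept, replies) where replies are keyed (dst, src, i) so they
    land in the input queue of the original sender.
    """
    kept, replies = {}, {}
    for (src, dst, i), msg in queue_elements.items():
        if dst in subnet_assignment:
            kept[(src, dst, i)] = msg
        else:
            # destination does not exist: reply to the sender instead
            replies[(dst, src, i)] = "NON_EXISTENT_CANISTER"
    return kept, replies
```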
&lt;br /&gt;
&#039;&#039;Non flat streams.&#039;&#039; As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams; this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indexes, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that, based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt; (i.e., a src-dst tuple), decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, i.e., the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by substream_id.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
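&lt;br /&gt;
The routing step itself can be sketched as follows, assuming flat streams (i.e., &amp;lt;code&amp;gt;substream_id&amp;lt;/code&amp;gt; returns the unit value) and output queues already filtered to existing destinations; the data layout is a modeling assumption.&lt;br /&gt;

```python
def route_to_streams(output_queues, streams_msgs, subnet_assignment):
    """Route filtered output-queue messages into per-subnet streams (sketch).

    output_queues: dict (src, dst, i) ↦ message, already filtered so that
    every dst exists in subnet_assignment.
    streams_msgs: dict subnet_id ↦ list of messages already in the stream.
    Flat streams are assumed: messages are appended per destination subnet
    in queue-index order, preserving canister-to-canister ordering.
    Returns (cleared_queues, new_streams).
    """
    streams = {k: list(v) for k, v in streams_msgs.items()}
    for key in sorted(output_queues):
        dst = key[1]
        streams.setdefault(subnet_assignment[dst], []).append(output_queues[key])
    # the output queues are cleared after routing
    return {}, streams
```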
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the canister-to-canister ordering of messages is preserved when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.&lt;br /&gt;
&lt;br /&gt;
== XNet Transfer ==&lt;br /&gt;
After calling &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The &amp;lt;code&amp;gt;XNetTransfer&amp;lt;/code&amp;gt; subcomponent is responsible for enabling this transfer. It consists of&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet.png|thumb|XNet transfer component diagram]]&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which is responsible for serving certified stream slices and making them available to &amp;lt;code&amp;gt;XNetPayloadBuilders&amp;lt;/code&amp;gt; on other subnetworks.&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;, which allows the block makers to obtain an &amp;lt;code&amp;gt;XNetPayload&amp;lt;/code&amp;gt; containing the currently available certified streams originating from other subnetworks. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; obtains those streams by interacting with &amp;lt;code&amp;gt;XNetEndpoints&amp;lt;/code&amp;gt; exposed by other subnets. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; also provides functionality for notaries to verify &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; contained in block proposals.&lt;br /&gt;
&lt;br /&gt;
We do not specify anything about the protocol run between the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; to transfer the streams between two subnetworks. The only requirement we have is that certified streams made available by an &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; of an honest replica on some source subnetwork can be obtained by an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; of an honest replica on the destination subnetwork, and that the information regarding which endpoints to contact is available in the Registry.&lt;br /&gt;
&lt;br /&gt;
=== Properties and Functionality ===&lt;br /&gt;
Assume an XNet transfer component on a replica that is part of subnet &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt;. The interface behavior of the XNet transfer component will guarantee that for any payload produced via&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;get_xnet_payload(registry_version, reference_height, past_payloads, size_limit)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
we have that for any &amp;lt;code&amp;gt;(remote_subnet ↦ css) ∈ payload&amp;lt;/code&amp;gt;: &lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;StateManager.decode_certified_stream(registry_version, own_subnet, remote_subnet, css)&amp;lt;/code&amp;gt; succeeds, i.e., returns a valid slice that is guaranteed to come from &amp;lt;code&amp;gt;remote_subnet&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Furthermore, for each slice it will hold that, as soon as the state corresponding to height &amp;lt;code&amp;gt;h = reference_height + |past_payloads|&amp;lt;/code&amp;gt; is available, &amp;lt;code&amp;gt;concatenate(remote_subnet, min(dom(slice.msgs.elements))) ∈ StateManager.get_state_at(h).expected_xnet_indices&amp;lt;/code&amp;gt;. This means that the streams will start with the expected indexes stored in the previous state, i.e., they extend the previously seen streams without gaps.&lt;br /&gt;
&lt;br /&gt;
Payloads verified using &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; are accepted if they adhere to those requirements, and are rejected otherwise.&lt;br /&gt;
&lt;br /&gt;
=== XNet Endpoint ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; serves the streams available on some subnet to other subnets. For an implementation this will typically mean that there is some client which will handle querying the API of the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; on the remote subnet in question. We use the following abstraction to avoid explicitly talking about this client: We assume that there is a function &amp;lt;code&amp;gt;get : SubnetId → XNetEndpoint&amp;lt;/code&amp;gt; which will return an appropriate instance of &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which we can directly query using the API described below.&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet-sequence.png|thumb|XNet transfer sequence diagram]]&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;get_stream(subnet_id : SubnetId, begin : StreamIndex, msg_limit : ℕ, size_limit : ℕ) → CertifiedStreamSlice&amp;lt;/code&amp;gt;: Returns the requested certified stream slice in its transport format.&lt;br /&gt;
&lt;br /&gt;
We require that an honest &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;-&amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; pair is able to successfully obtain slices over this API.&lt;br /&gt;
&lt;br /&gt;
Looking at the bigger picture, the intuition for why this will yield a secure system is that in each round a new pair of block maker and endpoint will try to pull over a stream, which, in turn, means that eventually an honest pair will be able to obtain the stream and include it into a block.&lt;br /&gt;
&lt;br /&gt;
=== XNet Payload Builder ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; builds and verifies payloads whenever requested to do so by the block maker. The rules for whether a payload is considered valid must be such that every notary is guaranteed to make the same decision on the same input, and that a payload built by an honest payload builder will be accepted by honest validators. Essentially, the rules resemble what is described in the section on properties and functionality. However, given that execution may lag behind, we cannot directly look up the expected indexes in the appropriate state but need to compute them based on the referenced state and the payloads since then. Below, we provide a figure illustrating the high-level functionality: generally speaking, blocks are considered valid if they adhere to the rules described in the figure and are considered invalid otherwise.&lt;br /&gt;
&lt;br /&gt;
[[File:Payload-building.png|thumb|Rules for payload building]]&lt;br /&gt;
&lt;br /&gt;
Below we formally define the operation of the component. We first define the following helper functions. We assume that &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; has an associated field &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt; which is passed whenever constructing an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;new : SubnetId → Self&lt;br /&gt;
new(own_subnet) :=&lt;br /&gt;
  XNetPayloadBuilder {&lt;br /&gt;
      with&lt;br /&gt;
         └─ own_subnet := own_subnet&lt;br /&gt;
  }&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The API defines &amp;lt;code&amp;gt;past_payloads&amp;lt;/code&amp;gt; as a vector in which the past payloads are ordered by their corresponding height in the chain. While this ordering allows for a more efficient implementation of the functions below, it does not matter on a conceptual level. Hence, we treat it as a set for the sake of simplicity.&lt;br /&gt;
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;slice_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, solely based on a set of Slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Take the maximum index for each individual (sub-)stream in the given set of slices and add&lt;br /&gt;
% 1 to obtain the next indexes one would expect when solely looking at the past payloads but&lt;br /&gt;
% ignoring the state.&lt;br /&gt;
slice_indexes : (SubnetId ↦ StreamSlice) → Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
slice_indexes(slices) := { i + 1 | i ∈ max(dom(slices.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;state_and_payload_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given slices from the past payloads.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Take the expected indexes from the state, remove whatever index appears in the given&lt;br /&gt;
% slices and add the expected indexes according to the streams in the slices.&lt;br /&gt;
%&lt;br /&gt;
% FAIL IF: ∃ i, j ∈ state_and_payload_indexes(state, slices) :&lt;br /&gt;
%              prefix(i) = prefix(j) ∧ postfix(i) ≠ postfix(j)&lt;br /&gt;
%&lt;br /&gt;
state_and_payload_indexes : ReplicatedState ×&lt;br /&gt;
                            (SubnetId ↦ StreamSlice) →&lt;br /&gt;
                            Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
state_and_payload_indexes(state, slices) := state.expected_xnet_indices&lt;br /&gt;
                                            \ dom(slices.msgs.elements)&lt;br /&gt;
                                            ∪ slice_indexes(slices)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
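&lt;br /&gt;
The two index computations can be sketched in Python as follows; stream keys are modeled as &amp;lt;code&amp;gt;(subnet, index)&amp;lt;/code&amp;gt; pairs, the prefix is simply the subnet id, and the reading that state expectations are dropped exactly for streams with fresher slice data is an interpretation consistent with the uniqueness condition above.&lt;br /&gt;

```python
def slice_indexes(slice_msg_keys):
    """Next expected index per stream, from slice contents alone (sketch).

    slice_msg_keys: set of (subnet_id, stream_index) keys present in slices.
    Returns {(subnet_id, max_index + 1)} for each subnet seen.
    """
    highest = {}
    for subnet, index in slice_msg_keys:
        highest[subnet] = max(highest.get(subnet, 0), index)
    return {(subnet, idx + 1) for subnet, idx in highest.items()}

def state_and_payload_indexes(expected_xnet_indices, slice_msg_keys):
    """Combine the state's expected indexes with fresher slice data (sketch)."""
    from_slices = slice_indexes(slice_msg_keys)
    covered = {subnet for subnet, _ in slice_msg_keys}
    # keep state expectations only for streams with no newer slice data
    kept = {(s, i) for (s, i) in expected_xnet_indices if s not in covered}
    return kept | from_slices
```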
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;expected_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given past payloads.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Decode the slices in the given payloads and compute the expected indexes using the&lt;br /&gt;
% state_and_payload_indexes function above&lt;br /&gt;
expected_indexes : SubnetId ×&lt;br /&gt;
                   ReplicatedState ×&lt;br /&gt;
                   Set&amp;lt;XNetPayload&amp;gt; →&lt;br /&gt;
                   Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
expected_indexes(own_subnet, state, payloads) :=&lt;br /&gt;
    state_and_payload_indexes(&lt;br /&gt;
        state,&lt;br /&gt;
        { (src ↦ slice) | payload ∈ payloads ∧&lt;br /&gt;
                          (src ↦ cert_slice) ∈ payload ∧&lt;br /&gt;
                          slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                             cert_slice)&lt;br /&gt;
        }&lt;br /&gt;
    )&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Creation of XNet Payloads ====&lt;br /&gt;
Based on the functions above, we are now ready to define the function &amp;lt;code&amp;gt;get_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × ℕ → XNetPayload&amp;lt;/code&amp;gt;. Note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Build an xnet payload containing the currently available streams. The begin index is&lt;br /&gt;
% given by the expected index; if there is no expected index for a given prefix, the&lt;br /&gt;
% index ONE is used.&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: size_of(get_xnet_payload(self, ·, ·, ·, size_limit)) ≤ size_limit&lt;br /&gt;
%&lt;br /&gt;
get_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × ℕ → XNetPayload&lt;br /&gt;
get_xnet_payload(self, registry_version, reference_height, past_payloads, size_limit) :=&lt;br /&gt;
    { (remote_subnet ↦ slice) |&lt;br /&gt;
            S = StateManager.get_state_at(reference_height)&lt;br /&gt;
          ∧ subnets = Registry::get_registry_at(registry_version).subnets \ { self.own_subnet }&lt;br /&gt;
          ∧ (remote_subnet, begin_index) ∈&lt;br /&gt;
                  expected_indexes(self.own_subnet, S, past_payloads)&lt;br /&gt;
                ∪ { (subnet_id, StreamIndex::ONE) |&lt;br /&gt;
                        subnet_id ∈ subnets&lt;br /&gt;
                                    \ { s | (s, ·) ∈ expected_indexes(self.own_subnet,&lt;br /&gt;
                                                                      S,&lt;br /&gt;
                                                                      past_payloads)&lt;br /&gt;
                                      }&lt;br /&gt;
                  }&lt;br /&gt;
            % msg_limit and size limit need to be set by the implementation as appropriate&lt;br /&gt;
            % to satisfy the post condition&lt;br /&gt;
          ∧ slice = XNetEndpoint::get(remote_subnet).get_stream(self.own_subnet, begin_index, ·, ·)&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
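&lt;br /&gt;
The payload-building rule can be sketched in Python; the endpoint interaction is modeled as a callback and size accounting is reduced to whether a slice is available at all, both of which are simplifying assumptions.&lt;br /&gt;

```python
def get_xnet_payload(own_subnet, subnets, expected, fetch_slice):
    """Build an XNet payload (illustrative sketch).

    subnets: all subnet ids known to the registry (including own_subnet).
    expected: dict remote_subnet ↦ expected begin index, as computed from
    the referenced state and the past payloads.
    fetch_slice: callback (remote_subnet, begin_index) returning a slice or
    None, standing in for XNetEndpoint get_stream.
    """
    payload = {}
    for remote in subnets:
        if remote == own_subnet:
            continue
        # default to index ONE when no expectation exists for this subnet yet
        begin = expected.get(remote, 1)
        slc = fetch_slice(remote, begin)
        if slc is not None:
            payload[remote] = slc
    return payload
```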
&lt;br /&gt;
==== Validation of XNet Payloads ====&lt;br /&gt;
Validation of XNetPayloads works analogously to the creation. The function &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; is defined as follows, where we assume that it evaluates to false in case an error occurs. Again, note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Check whether a given xnet payload was built according to the rules given above.&lt;br /&gt;
%&lt;br /&gt;
% FAIL IF: size_of(payload) &amp;gt; size_limit&lt;br /&gt;
%&lt;br /&gt;
validate_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × XNetPayload × ℕ → Bool&lt;br /&gt;
validate_xnet_payload(self, registry_version, reference_height, past_payloads, payload, size_limit) :=&lt;br /&gt;
    S = StateManager.get_state_at(reference_height) ∧&lt;br /&gt;
    ∀ (remote_subnet ↦ css) ∈ payload :&lt;br /&gt;
    {&lt;br /&gt;
      slice = StateManager.decode_certified_stream(registry_version,&lt;br /&gt;
                                                   self.own_subnet,&lt;br /&gt;
                                                   remote_subnet,&lt;br /&gt;
                                                   css) ∧&lt;br /&gt;
      ∀ index ∈ min(dom(slice.msgs.elements)) :&lt;br /&gt;
      {&lt;br /&gt;
        (remote_subnet, index) ∈ expected_indexes(self.own_subnet, S, past_payloads) ∨&lt;br /&gt;
        index = StreamIndex::ONE&lt;br /&gt;
      }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3412</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3412"/>
		<updated>2022-11-03T13:14:34Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the streams produced on one subnet available to the block makers of the destination subnets.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees beyond the ones the system already provides, but simply needs to make sure that system invariants are preserved. Those system invariants include:&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer on the subnet the destination canister lives on, exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets           : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended in the order implied by their keys. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
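&lt;br /&gt;
For illustration, the abstract queue and its associated functions can be sketched in Python (a hypothetical model using dicts as partial maps, not the actual implementation):&lt;br /&gt;

```python
# Model of the abstract Queue<T>: a rolling next_index plus a partial map
# of index -> element. This is an illustrative sketch, not the IC code.

class Queue:
    def __init__(self):
        self.next_index = 1   # index of the next message to be inserted
        self.elements = {}    # partial map: index -> element

    def push(self, values):
        # Append the given values in the order implied by their keys,
        # assigning them fresh consecutive indices starting at next_index.
        for rank, key in enumerate(sorted(values)):
            self.elements[self.next_index + rank] = values[key]
        self.next_index += len(values)

    def delete(self, indices):
        # Remove the given elements; next_index is kept unchanged.
        for i in indices:
            del self.elements[i]

    def clear(self):
        # Remove all elements; next_index is kept unchanged.
        self.elements = {}
```

Note that &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; keep the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;, so indices are never reused.&lt;br /&gt;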
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a partial map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
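&lt;br /&gt;
The lifting of a per-queue function to a map of queues can be sketched as follows (hypothetical Python model; queues are modelled as &amp;lt;code&amp;gt;(next_index, elements)&amp;lt;/code&amp;gt; pairs, and only &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; is shown):&lt;br /&gt;

```python
# Sketch of the f_map construction for f = push: entries present in both
# maps are combined with f, values without an existing queue are pushed
# onto a fresh queue, and untouched queues are kept as-is.

NEW_QUEUE = (1, {})  # next_index = 1, no elements

def push(queue, values):
    # Append values in key order, assigning fresh consecutive indices.
    next_index, elements = queue
    new_elements = dict(elements)
    for rank, key in enumerate(sorted(values)):
        new_elements[next_index + rank] = values[key]
    return (next_index + len(values), new_elements)

def push_map(queues, values_by_id):
    result = {}
    for qid, queue in queues.items():
        if qid in values_by_id:
            result[qid] = push(queue, values_by_id[qid])  # combine with f
        else:
            result[qid] = queue                           # keep untouched
    for qid, values in values_by_id.items():
        if qid not in queues:
            result[qid] = push(NEW_QUEUE, values)         # fresh queue
    return result
```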
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
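&lt;br /&gt;
These definitions can be sketched in Python (illustrative model; an index is a tuple whose last component is a natural number):&lt;br /&gt;

```python
# Sketch of the Index semantics: prefix, postfix, increment, and the
# partial order in which indices with different prefixes are incomparable.

def prefix(i):
    return i[:-1]          # all elements except the last

def postfix(i):
    return i[-1]           # the trailing natural number

def succ(i):
    # i + 1: same prefix, last component incremented
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # i <= j only makes sense if the prefixes agree
    if prefix(i) != prefix(j):
        return None        # incomparable
    return postfix(i) <= postfix(j)
```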
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages and a sequence &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt; of canister-to-canister messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following types:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
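&lt;br /&gt;
As an illustration, the stream data model can be sketched in Python (hypothetical model with flat streams, i.e., the unit &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;; queues are modelled as &amp;lt;code&amp;gt;(next_index, elements)&amp;lt;/code&amp;gt; pairs):&lt;br /&gt;

```python
# Data-model sketch of streams: each stream carries signals (indexed by
# stream index) and messages (one queue per substream), and Streams maps
# each destination subnet to its stream.

UNIT = ()   # SubstreamId is the unit type, i.e., flat streams

def new_stream():
    return {
        # StreamIndex -> "ACCEPT" | "REJECT"
        "signals": {},
        # SubstreamId -> queue, modelled as (next_index, elements)
        "msgs": {UNIT: (1, {})},
    }

def append_message(stream, msg):
    # Append a canister-to-canister message to the (single) substream.
    next_index, elements = stream["msgs"][UNIT]
    elements = dict(elements)
    elements[next_index] = msg
    stream["msgs"][UNIT] = (next_index + 1, elements)

# Streams : SubnetId -> Stream, indexed by destination subnet.
streams = {"subnet_b": new_stream()}
append_message(streams["subnet_b"], {"src": "canister_1", "dst": "canister_9"})
```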
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing, but instead one relies on the encoding and decoding routines provided by the state manager: A &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState,&lt;br /&gt;
    witness   : Witness,&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height,&lt;br /&gt;
    registry_version         : RegistryVersion,&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message,&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice,&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific parts are decoded into a format suitable for processing, and parts which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message,&lt;br /&gt;
    xnet_payload    : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the subnet&#039;s own ID and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        ingress_payload := b.ingress_payload,&lt;br /&gt;
        xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine made up of the message routing and execution layers. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message m in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component, which is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions representing the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantics that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantics that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the indices of the messages in the ingress payload that are accepted by the VSR; which indices are accepted is determined by the concrete implementation of the VSR. The result of &amp;lt;code&amp;gt;VSR(state, batch).ingress&amp;lt;/code&amp;gt; is a possibly empty set of index-message tuples corresponding to the accepted messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
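&lt;br /&gt;
This mapping can be sketched in Python (illustrative only; &amp;lt;code&amp;gt;vsr_check_ingress&amp;lt;/code&amp;gt; is a stand-in for the concrete VSR decision function):&lt;br /&gt;

```python
# Sketch of how accepted ingress messages are mapped to ingress-queue
# positions: accepted messages keep their relative order, and position j
# is the rank of the original index i among all accepted indices.

def rank(i, accepted):
    # 1-based position of i within the sorted set of accepted indices
    return sorted(accepted).index(i) + 1

def vsr_ingress(state, batch, vsr_check_ingress):
    accepted = vsr_check_ingress(state, batch)
    return {
        (m["dst"], rank(i, accepted)): m
        for i, m in batch["ingress_payload"].items()
        if i in accepted
    }
```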
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model their functionality as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the ingress and input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
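&lt;br /&gt;
Based on these three return values, the effect of the execution phase on the queues can be sketched as follows (hypothetical model; queues are flattened to plain index-to-message dicts for brevity, so the rolling &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; of the abstract queues is not shown):&lt;br /&gt;

```python
# Sketch: the execution phase removes the consumed messages from the
# ingress and input queues and appends the produced messages to the
# output queues. In the abstract queue model, deleting an entry leaves
# the rolling next_index untouched.

def apply_change_set(state, consumed_ingress, consumed_xnet, produced):
    for index in consumed_ingress:
        del state["ingress_queues"][index]
    for index in consumed_xnet:
        del state["input_queues"][index]
    # Produced messages are appended in the order implied by their index.
    for index in sorted(produced):
        state["output_queues"][index] = produced[index]
    return state
```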
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR which accepts all messages, while in reality the VSR may reject some messages, e.g., in case canisters migrate across subnets or subnets are split. While messages being REJECTed by the VSR would require specific action by the message routing layer, we omit those actions here for simplicity as they are not crucial for understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before defining the actual state transition, we introduce a couple of helper functions. First, a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message))&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Based on this we can now define a function that maps over the indexes of the valid XNet messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (concatenate(subnet, index) ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j ∈ slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the subnet_assignment, puts them into the appropriate subnet-to-subnet streams, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, subnet_assignment)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; is the canister-to-subnet mapping obtained from a view of the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The second helper produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Non-flat streams.&#039;&#039; As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams; this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indexes, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that—​based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, i.e., a src-dst tuple—​decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, i.e., the SubstreamId.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the invariant holds that the canister-to-canister ordering of messages is preserved when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.&lt;br /&gt;
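&lt;br /&gt;
For instance, an implementation could drain only a bounded prefix of each output queue per round. A possible sketch, where the bound &amp;lt;code&amp;gt;limit&amp;lt;/code&amp;gt; is an illustrative parameter not fixed by this specification:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Illustrative sketch only: route at most `limit` messages per output queue&lt;br /&gt;
% and retain the rest. Since each queue is drained in queue order, the&lt;br /&gt;
% canister-to-canister ordering requirement is preserved.&lt;br /&gt;
for each ((src, dst) ↦ queue) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
    route the first min(limit, |queue|) messages of queue in queue order&lt;br /&gt;
    retain the remaining messages in the output queue&amp;lt;/nowiki&amp;gt;&lt;br /&gt;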
&lt;br /&gt;
== XNet Transfer ==&lt;br /&gt;
After calling &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The &amp;lt;code&amp;gt;XNetTransfer&amp;lt;/code&amp;gt; subcomponent is responsible for enabling this transfer. It consists of&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet.png|thumb|XNet transfer component diagram]]&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which is responsible for serving certified stream slices and making them available to &amp;lt;code&amp;gt;XNetPayloadBuilders&amp;lt;/code&amp;gt; on other subnetworks.&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;, which allows the block makers to obtain an &amp;lt;code&amp;gt;XNetPayload&amp;lt;/code&amp;gt; containing the currently available certified streams originating from other subnetworks. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; obtains those streams by interacting with &amp;lt;code&amp;gt;XNetEndpoints&amp;lt;/code&amp;gt; exposed by other subnets. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; also provides functionality for notaries to verify &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; contained in block proposals.&lt;br /&gt;
&lt;br /&gt;
We do not specify anything about the protocol run between the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; to transfer the streams between two subnetworks. The only requirements we have are that certified streams made available by an &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; of an honest replica on some source subnetwork can be obtained by an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; of an honest replica on the destination subnetwork, and that the information regarding which endpoints to contact is available in the Registry.&lt;br /&gt;
&lt;br /&gt;
=== Properties and Functionality ===&lt;br /&gt;
Assume an XNet transfer component on a replica that is part of subnet &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt;. The interface behavior of the XNet transfer component will guarantee that for any payload produced via&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;get_xnet_payload(registry_version, reference_height, past_payloads, size_limit)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
we have that for any &amp;lt;code&amp;gt;(remote_subnet ↦ css) ∈ payload&amp;lt;/code&amp;gt;: &lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;StateManager.decode_certified_stream(registry_version, own_subnet, remote_subnet, css)&amp;lt;/code&amp;gt; succeeds, i.e., returns a valid slice that is guaranteed to come from &amp;lt;code&amp;gt;remote_subnet&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Furthermore, for each slice it will hold that, as soon as the state corresponding to height &amp;lt;code&amp;gt;h = reference_height + |past_payloads|&amp;lt;/code&amp;gt; is available, &amp;lt;code&amp;gt;concatenate(remote_subnet, min(dom(slice.msgs.elements))) ∈ StateManager.get_state_at(h).expected_xnet_indices&amp;lt;/code&amp;gt;. This means that the streams will start with the expected indexes stored in the previous state, i.e., they extend the previously seen streams without gaps.&lt;br /&gt;
&lt;br /&gt;
Payloads verified using &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; are accepted if they adhere to those requirements, and are rejected otherwise.&lt;br /&gt;
&lt;br /&gt;
=== XNet Endpoint ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; serves the streams available on some subnet to other subnets. For an implementation this will typically mean that there is some client which will handle querying the API of the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; on the remote subnet in question. We use the following abstraction to avoid explicitly talking about this client: We assume that there is a function &amp;lt;code&amp;gt;get : SubnetId → XNetEndpoint&amp;lt;/code&amp;gt; which will return an appropriate instance of &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which we can directly query using the API described below.&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet-sequence.png|thumb|XNet transfer sequence diagram]]&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;get_stream(subnet_id : SubnetId, begin : StreamIndex, msg_limit : ℕ, size_limit : ℕ) → CertifiedStreamSlice&amp;lt;/code&amp;gt;: Returns the requested certified stream slice in its transport format.&lt;br /&gt;
&lt;br /&gt;
We require that an honest &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;-&amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; pair is able to successfully obtain slices over this API.&lt;br /&gt;
&lt;br /&gt;
Looking at the bigger picture, the intuition for why this will yield a secure system is that in each round a new pair of block maker and endpoint will try to pull over a stream, which, in turn, means that eventually an honest pair will be able to obtain the stream and include it into a block.&lt;br /&gt;
&lt;br /&gt;
=== XNet Payload Builder ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; builds and verifies payloads whenever requested to do so by the block maker. The rules for whether a payload is considered valid or not must be such that every notary is guaranteed to make the same decision on the same input, and that a payload built by an honest payload builder will be accepted by honest validators. Essentially, the rules resemble what is described in the section on properties and functionality. However, given that execution may be behind, we cannot directly look up the expected indexes in the appropriate state but need to compute them based on the referenced state and the payloads since then. Below, we provide a figure illustrating the high-level functionality: generally speaking, blocks are considered valid if they adhere to the rules described in the figure and are considered invalid otherwise.&lt;br /&gt;
&lt;br /&gt;
[[File:Payload-building.png|thumb|Rules for payload building]]&lt;br /&gt;
&lt;br /&gt;
Below we formally define the operation of the component. We first define the following helper functions. We assume that &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; has an associated field &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt; which is passed whenever constructing an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;new : SubnetId → Self&lt;br /&gt;
new(own_subnet) :=&lt;br /&gt;
  XNetPayloadBuilder {&lt;br /&gt;
      with&lt;br /&gt;
         └─ own_subnet := own_subnet&lt;br /&gt;
  }&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The API defines the past_payloads as a vector where the past payloads are ordered with respect to the corresponding height in the chain. While this ordering allows for a more efficient implementation of the functions below, it does not matter on a conceptual level. Hence, for the sake of simplicity, we treat it as a set.&lt;br /&gt;
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;slice_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, solely based on a set of Slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Take the maximum index for each individual (sub-)stream in the given set of slices and add&lt;br /&gt;
% 1 to obtain the next indexes one would expect when solely looking at the past payloads but&lt;br /&gt;
% ignoring the state.&lt;br /&gt;
slice_indexes : (SubnetId ↦ StreamSlice) → Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
slice_indexes(slices) := { i + 1 | i ∈ max(dom(slices.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;state_and_payload_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given slices from the past payloads.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Take the expected indexes from the state, remove whatever index appears in the given&lt;br /&gt;
% slices and add the expected indexes according to the streams in the slices.&lt;br /&gt;
%&lt;br /&gt;
% FAIL IF: ∃ i, j ∈ state_and_payload_indexes(state, slices) :&lt;br /&gt;
%              prefix(i) = prefix(j) ∧ postfix(i) ≠ postfix(j)&lt;br /&gt;
%&lt;br /&gt;
state_and_payload_indexes : ReplicatedState ×&lt;br /&gt;
                            (SubnetId ↦ StreamSlice) →&lt;br /&gt;
                            Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
state_and_payload_indexes(state, slices) := state.expected_xnet_indices&lt;br /&gt;
                                            \ dom(slices.msgs.elements)&lt;br /&gt;
                                            ∪ slice_indexes(slices)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* The function &amp;lt;code&amp;gt;expected_indexes&amp;lt;/code&amp;gt; returns the set of expected indices for the block to be proposed, taking into account both the expected indices in the given replicated state and the more recent messages in the given past payloads.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Decode the slices in the given payloads and compute the expected indexes using the&lt;br /&gt;
% state_and_payload_indexes function above&lt;br /&gt;
expected_indexes : SubnetId ×&lt;br /&gt;
                   ReplicatedState ×&lt;br /&gt;
                   Set&amp;lt;XNetPayload&amp;gt; →&lt;br /&gt;
                   Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
expected_indexes(own_subnet, state, slices) :=&lt;br /&gt;
    state_and_payload_indexes(&lt;br /&gt;
        state,&lt;br /&gt;
        { (src ↦ slice) | payload ∈ slices ∧&lt;br /&gt;
                          (src ↦ cert_slice) ∈ payload ∧&lt;br /&gt;
                          slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                             cert_slice)&lt;br /&gt;
        }&lt;br /&gt;
    )&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Creation of XNet Payloads ====&lt;br /&gt;
Based on the functions above, we are now ready to define the function &amp;lt;code&amp;gt;get_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × ℕ → XNetPayload&amp;lt;/code&amp;gt;. Note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Build an xnet payload containing the currently available streams. The begin index is&lt;br /&gt;
% given by the expected index; if there is no expected index for a given prefix, the index&lt;br /&gt;
% ONE is used.&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: size_of(get_xnet_payload(self, ·, ·, ·, size_limit)) ≤ size_limit&lt;br /&gt;
%&lt;br /&gt;
get_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × ℕ → XNetPayload&lt;br /&gt;
get_xnet_payload(self, registry_version, reference_height, past_payloads, size_limit) :=&lt;br /&gt;
    { (remote_subnet ↦ slice) |&lt;br /&gt;
            S = StateManager.get_state_at(reference_height)&lt;br /&gt;
          ∧ subnets = Registry::get_registry_at(registry_version).subnets \ { self.own_subnet }&lt;br /&gt;
          ∧ (remote_subnet, begin_index) ∈&lt;br /&gt;
                  expected_indexes(self.own_subnet, S, past_payloads)&lt;br /&gt;
                ∪ { (subnet_id, StreamIndex::ONE) |&lt;br /&gt;
                        subnet_id ∈ subnets&lt;br /&gt;
                                    \ { s | (s, ·) ∈ expected_indexes(self.own_subnet,&lt;br /&gt;
                                                                      S,&lt;br /&gt;
                                                                      past_payloads)&lt;br /&gt;
                                      }&lt;br /&gt;
                  }&lt;br /&gt;
            % msg_limit and size limit need to be set by the implementation as appropriate&lt;br /&gt;
            % to satisfy the post condition&lt;br /&gt;
          ∧ slice = XNetEndpoint::get(remote_subnet).get_stream(self.own_subnet, begin_index, ·, ·)&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Validation of XNet Payloads ====&lt;br /&gt;
Validation of XNetPayloads works analogously to the creation. The function &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; is defined as follows, where we assume that it evaluates to false in case an error occurs. Again, note that the gap-freeness of streams is an invariant of the datatype, which is why we do not explicitly include the rule for gap-freeness here.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Check whether a given xnet payload was built according to the rules given above.&lt;br /&gt;
%&lt;br /&gt;
% FAIL IF: size_of(payload) &amp;gt; size_limit&lt;br /&gt;
%&lt;br /&gt;
validate_xnet_payload : Self × RegistryVersion × Height × Vec&amp;lt;XNetPayload&amp;gt; × XNetPayload × ℕ → Bool&lt;br /&gt;
validate_xnet_payload(self, registry_version, reference_height, past_payloads, payload, size_limit) :=&lt;br /&gt;
    S = StateManager.get_state_at(reference_height) ∧&lt;br /&gt;
    ∀ (remote_subnet ↦ css) ∈ payload :&lt;br /&gt;
    {&lt;br /&gt;
      slice = StateManager.decode_certified_stream(registry_version,&lt;br /&gt;
                                                   self.own_subnet,&lt;br /&gt;
                                                   remote_subnet,&lt;br /&gt;
                                                   css) ∧&lt;br /&gt;
      ∀ index ∈ min(dom(slice.msgs.elements)) :&lt;br /&gt;
      {&lt;br /&gt;
        (remote_subnet, index) ∈ expected_indexes(self.own_subnet, S, past_payloads) ∨&lt;br /&gt;
        index = StreamIndex::ONE&lt;br /&gt;
      }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3411</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3411"/>
		<updated>2022-11-03T13:08:24Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) builds the IC out of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
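&lt;br /&gt;
Schematically, this repeated application of batches can be sketched as follows, where the function name &amp;lt;code&amp;gt;apply_batch&amp;lt;/code&amp;gt; is illustrative and not fixed by the protocol:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Illustrative sketch: the replicated state of version x is obtained by&lt;br /&gt;
% applying the batch agreed upon at height x to the state of version x-1.&lt;br /&gt;
state[x] = apply_batch(state[x-1], batch[x])&lt;br /&gt;
&lt;br /&gt;
% Consequently, the state of version n is fully determined by the genesis&lt;br /&gt;
% state and the sequence of agreed-upon batches.&lt;br /&gt;
state[n] = apply_batch(... apply_batch(state[0], batch[1]) ..., batch[n])&amp;lt;/nowiki&amp;gt;&lt;br /&gt;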
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the certified message streams produced by one subnet available to the block makers of other subnets.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
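&lt;br /&gt;
One can thus think of every state transition described in this document as implicitly being wrapped as follows, where &amp;lt;code&amp;gt;to_canonical&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;from_canonical&amp;lt;/code&amp;gt; are illustrative names for the conversions between the two representations:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Illustrative sketch only; the conversion functions are not part of the API&lt;br /&gt;
% but follow from the relation between ReplicatedState and CanonicalState&lt;br /&gt;
% defined in the state manager specification.&lt;br /&gt;
transition(S : ReplicatedState, ...) :=&lt;br /&gt;
    from_canonical(canonical_transition(to_canonical(S), ...))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;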
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees of its own, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
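&lt;br /&gt;
For illustration, routing a canister-to-canister message is a lookup in this mapping (the ids below are hypothetical):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Hypothetical example: canisters C and D on subnets X and Y&lt;br /&gt;
registry.subnets           = { X, Y }&lt;br /&gt;
registry.subnet_assignment = { C ↦ X, D ↦ Y }&lt;br /&gt;
&lt;br /&gt;
% A message destined for canister D is routed into the stream targeted&lt;br /&gt;
% at subnet registry.subnet_assignment[D] = Y.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;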
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
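&lt;br /&gt;
To illustrate these functions, consider the following worked example (with symbolic elements a, b, c):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;q0 = new_queue                    % next_index = 1, elements = ∅&lt;br /&gt;
q1 = push(q0, { 1 ↦ a, 2 ↦ b })   % next_index = 3, elements = { 1 ↦ a, 2 ↦ b }&lt;br /&gt;
q2 = delete(q1, { 1 ↦ a })        % next_index = 3, elements = { 2 ↦ b }&lt;br /&gt;
q3 = push(q2, { 7 ↦ c })          % next_index = 4, elements = { 2 ↦ b, 3 ↦ c }&lt;br /&gt;
q4 = clear(q3)                    % next_index = 4, elements = ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Note that for &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; only the rank of a key in &amp;lt;code&amp;gt;values&amp;lt;/code&amp;gt; matters, which is why the element with key 7 ends up at index 3, and that &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; never decreases.&lt;br /&gt;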
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a partial map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
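&lt;br /&gt;
For example, pushing into a map of queues extends existing queues and creates a fresh queue for an identifier that is not yet present (A, B, m1, m2 are hypothetical identifiers and messages):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;q  = { A ↦ qA }                          % with qA.next_index = 5&lt;br /&gt;
v  = { (A, 1) ↦ m1, (B, 1) ↦ m2 }&lt;br /&gt;
push(q, v)&lt;br /&gt;
   = { A ↦ push(qA, { 1 ↦ m1 }),         % m1 lands at index (A, 5)&lt;br /&gt;
       B ↦ push(new_queue, { 1 ↦ m2 })   % m2 lands at index (B, 1)&lt;br /&gt;
     }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;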
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be a sequence of arbitrary length, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an &amp;lt;code&amp;gt;Index i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an &amp;lt;code&amp;gt;Index i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
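&lt;br /&gt;
For example, for indices of shape &amp;lt;code&amp;gt;(X × ℕ)&amp;lt;/code&amp;gt; with hypothetical prefixes A and B, these definitions yield:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;i = (A, 4),  j = (A, 7),  k = (B, 2)&lt;br /&gt;
&lt;br /&gt;
prefix(i)  = (A)&lt;br /&gt;
postfix(i) = 4&lt;br /&gt;
i + 1      = (A, 5)&lt;br /&gt;
i ≤ j                 % prefix(i) = prefix(j) and 4 ≤ 7&lt;br /&gt;
i, k incomparable     % prefix(i) ≠ prefix(k)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;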
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
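&lt;br /&gt;
To illustrate the relation between the nested and the flattened view, consider hypothetical canister ids A and B:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;input_queues[(A, B)]             : Queue&amp;lt;Message&amp;gt;   % the queue from A to B&lt;br /&gt;
input_queues[(A, B)].elements[3] : Message          % message at queue-local index 3&lt;br /&gt;
&lt;br /&gt;
% In the flattened view this is the entry indexed by QueueIndex (A, B, 3):&lt;br /&gt;
input_queues.elements[(A, B, 3)] : Message&amp;lt;/nowiki&amp;gt;&lt;br /&gt;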
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of signals &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet a stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams, indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
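&lt;br /&gt;
For example, with the flat &amp;lt;code&amp;gt;SubstreamId = ()&amp;lt;/code&amp;gt; of the current implementation, the streams of a subnet with a single outgoing stream targeted at a hypothetical subnet Y might look as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;streams = { Y ↦ Stream {&lt;br /&gt;
    signals : { ((), 3) ↦ ACCEPT, ((), 4) ↦ ACCEPT },&lt;br /&gt;
    msgs    : { () ↦ q }&lt;br /&gt;
} }&lt;br /&gt;
&lt;br /&gt;
% streams[Y].msgs.elements is then indexed by StreamIndexes of the form ((), n)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;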
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings have been decoded into the format suitable for processing and in which parts that are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the own subnet id and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layers. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other conditions, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
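&lt;br /&gt;
Putting the steps above together, a single deterministic processing round can be sketched as follows (&amp;lt;code&amp;gt;state_at&amp;lt;/code&amp;gt; is an illustrative name for obtaining the appropriate state from the state manager; the remaining functions are introduced elsewhere in this page):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Sketch of processing a single Batch b&lt;br /&gt;
process_batch(b) :=&lt;br /&gt;
    s  ← state_manager.state_at(b.batch_number - 1)   % illustrative accessor&lt;br /&gt;
    a  ← registry.get_registry_at(b.registry_version).subnet_assignment&lt;br /&gt;
    s1 ← pre_process(s, a, decode(own_subnet, b))     % induction phase&lt;br /&gt;
    s2 ← execute(s1)                                  % execution phase&lt;br /&gt;
    s3 ← post_process(s2, a)                          % XNet message routing phase&lt;br /&gt;
    state_manager.commit_and_certify(s3, b.batch_number)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;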
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions, which represent the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the indices of the messages in the ingress payload accepted by the VSR. The result is a possibly empty partial map of index-message tuples corresponding to the accepted messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch; the set of accepted indices is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can view the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt; is a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt; is a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; is a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR that accepts all messages, while in reality the VSR may reject some messages, e.g., in case canisters migrate across subnets or subnets are split. While the possibility that messages can be REJECTed by the VSR would require specific action from the message routing layer, we omit those actions here for simplicity as they are not crucial for understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition we define a couple of helper functions. First we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index : ((SubnetId × StreamIndex) ↦ Message) → ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Based on this we can now define a function that maps the valid XNet messages to queue indices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ StreamSlice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR no longer ACCEPTs all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams.)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (index ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       (index ↦ msg) ∈ S.streams.msgs&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j ∈ slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
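As an illustration of the garbage collection step above, the following hedged Python sketch (the data shapes are assumptions of the sketch) drops every outgoing stream message for which the incoming payload carries a signal:&lt;br /&gt;

```python
def gc_stream_messages(stream_msgs, xnet_payload):
    """stream_msgs: {(subnet, i): msg}, our outgoing streams.
    xnet_payload: {subnet: {'signals': {i: signal}}}. A signal at index
    i in a slice from `subnet` means our message at stream index i has
    been accepted there, so it can be garbage collected."""
    acked = {(subnet, i)
             for subnet, slice_ in xnet_payload.items()
             for i in slice_['signals']}
    return {k: m for k, m in stream_msgs.items() if k not in acked}
```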
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the subnet_assignment, puts them into a subnet-to-subnet stream, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, registry)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and registry represents a view of the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following function produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
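A hedged Python sketch of these two helpers (the data shapes are assumptions of the sketch; the definitions above remain authoritative):&lt;br /&gt;

```python
NON_EXISTENT_CANISTER = 'NON_EXISTENT_CANISTER'

def filter_queues(queues, subnet_assignment):
    """queues: {(src, dst, i): msg}. Keep only messages whose
    destination canister has a subnet assignment."""
    return {k: m for k, m in queues.items() if k[1] in subnet_assignment}

def non_existent_canister_replies(queues, subnet_assignment):
    """Produce a reject reply for each dropped message; the reply's
    queue index swaps src and dst so it lands in the sender's queue."""
    return {(dst, src, i): NON_EXISTENT_CANISTER
            for (src, dst, i), m in queues.items()
            if dst not in subnet_assignment}
```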
&lt;br /&gt;
&#039;&#039;Non flat streams.&#039;&#039; As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams—​this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indexes, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that—​based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, i.e., a src-dst tuple—​decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, i.e., the SubstreamId.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
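To illustrate, the two choices of &amp;lt;code&amp;gt;substream_id&amp;lt;/code&amp;gt; and the resulting stream indexes can be sketched as follows (our own illustration; the function names are assumptions):&lt;br /&gt;

```python
def substream_id_flat(src, dst):
    # One flat stream per destination subnet: the unit type ().
    return ()

def substream_id_per_dst(src, dst):
    # One substream per destination canister.
    return dst

def stream_index(substream_id_fn, q_index):
    """Form a StreamIndex from a QueueIndex (src, dst, seq), i.e.,
    concatenate(substream_id(prefix(q_index)), postfix(q_index))."""
    src, dst, seq = q_index
    sid = substream_id_fn(src, dst)
    # With the unit SubstreamId the index degenerates to a natural number.
    return (seq,) if sid == () else (sid, seq)
```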
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
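The routing loop above can be sketched as follows (a hedged illustration assuming flat streams; the data shapes are assumptions of the sketch):&lt;br /&gt;

```python
def post_process_route(streams, output_queues, subnet_assignment):
    """streams: {subnet: [msg]}; output_queues: {(src, dst, i): msg}.
    Appends every routable message to the stream of the destination
    subnet, preserving per-queue order; returns the updated streams.
    The caller is expected to clear the output queues afterwards."""
    for (src, dst, i), msg in sorted(output_queues.items()):
        subnet = subnet_assignment.get(dst)
        if subnet is None:
            # Non-existent destination: handled separately by the
            # NON_EXISTENT_CANISTER replies, not routed to a stream.
            continue
        streams.setdefault(subnet, []).append(msg)
    return streams
```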
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the invariant that the canister-to-canister ordering of messages is preserved holds when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.&lt;br /&gt;
&lt;br /&gt;
== XNet Transfer ==&lt;br /&gt;
After calling &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The &amp;lt;code&amp;gt;XNetTransfer&amp;lt;/code&amp;gt; subcomponent is responsible for enabling this transfer. It consists of&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet.png|thumb|XNet transfer component diagram]]&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which is responsible for serving certified stream slices and making them available to &amp;lt;code&amp;gt;XNetPayloadBuilders&amp;lt;/code&amp;gt; on other subnetworks.&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;, which allows the block makers to obtain an &amp;lt;code&amp;gt;XNetPayload&amp;lt;/code&amp;gt; containing the currently available certified streams originating from other subnetworks. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; obtains those streams by interacting with &amp;lt;code&amp;gt;XNetEndpoints&amp;lt;/code&amp;gt; exposed by other subnets. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; also provides functionality for notaries to verify &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; contained in block proposals.&lt;br /&gt;
&lt;br /&gt;
We do not specify anything about the protocol run between the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; to transfer the streams between two subnetworks. The only requirements we have are that certified streams made available by an &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; of an honest replica on some source subnetwork can be obtained by an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; of an honest replica on the destination subnetwork, and that the information regarding which endpoints to contact is available in the Registry.&lt;br /&gt;
&lt;br /&gt;
=== Properties and Functionality ===&lt;br /&gt;
Assume an XNet transfer component on a replica that is part of subnet &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt;. The interface behavior of the XNet transfer component will guarantee that for any payload produced via&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;get_xnet_payload(registry_version, reference_height, past_payloads, size_limit)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
we have that for any &amp;lt;code&amp;gt;(remote_subnet ↦ css) ∈ payload&amp;lt;/code&amp;gt;: &lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;StateManager.decode_certified_stream(registry_version, own_subnet, remote_subnet, css)&amp;lt;/code&amp;gt; succeeds, i.e., returns a valid slice that is guaranteed to come from &amp;lt;code&amp;gt;remote_subnet&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Furthermore, for each slice it will hold that, as soon as the state corresponding to height &amp;lt;code&amp;gt;h = reference_height + |past_payloads|&amp;lt;/code&amp;gt; is available, &amp;lt;code&amp;gt;concatenate(remote_subnet, min(dom(slice.msgs.elements))) ∈ StateManager.get_state_at(h).expected_indexes&amp;lt;/code&amp;gt;. This means that the streams will start with the expected indexes stored in the previous state, i.e., they extend the previously seen streams without gaps.&lt;br /&gt;
&lt;br /&gt;
Payloads verified using &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; are accepted if they adhere to those requirements, and are rejected otherwise.&lt;br /&gt;
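The gap-free extension requirement can be sketched as a simple validity check (our own illustration; the data shapes are assumptions of the sketch):&lt;br /&gt;

```python
def slice_extends_gap_free(remote_subnet, slice_msgs, expected_indices):
    """slice_msgs: {stream_index: msg}; expected_indices: set of
    (subnet, index) pairs from the state. A non-empty slice must start
    exactly at the expected index; an empty slice is trivially fine."""
    if not slice_msgs:
        return True
    return (remote_subnet, min(slice_msgs)) in expected_indices
```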
&lt;br /&gt;
=== XNet Endpoint ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; serves the streams available on some subnet to other subnets. For an implementation this will typically mean that there is some client which will handle querying the API of the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; on the remote subnet in question. We use the following abstraction to avoid explicitly talking about this client: We assume that there is a function &amp;lt;code&amp;gt;get : SubnetId → XNetEndpoint&amp;lt;/code&amp;gt; that returns an appropriate instance of &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt;, which we can directly query using the API described below.&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet-sequence.png|thumb|XNet transfer sequence diagram]]&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;get_stream(subnet_id : SubnetId, begin : StreamIndex, msg_limit : ℕ, size_limit : ℕ) → CertifiedStreamSlice&amp;lt;/code&amp;gt;: Returns the requested certified stream slice in its transport format.&lt;br /&gt;
&lt;br /&gt;
We require that an honest &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;-&amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; pair is able to successfully obtain slices over this API.&lt;br /&gt;
&lt;br /&gt;
Looking at the bigger picture, the intuition for why this will yield a secure system is that in each round a new pair of block maker and endpoint will try to pull over a stream, which, in turn, means that eventually an honest pair will be able to obtain the stream and include it into a block.&lt;br /&gt;
&lt;br /&gt;
=== XNet Payload Builder ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; builds and verifies payloads whenever requested to do so by the block maker. The rules for whether a payload is considered valid must be such that every notary is guaranteed to make the same decision on the same input, and that a payload built by an honest payload builder will be accepted by honest validators. Essentially, the rules resemble what is described in the section on properties and functionality. However, given that execution may be behind, we cannot directly look up the expected indexes in the appropriate state but need to compute them based on the referenced state and the payloads since then. Below, we provide a figure illustrating the high-level functionality: generally speaking, blocks are considered valid if they adhere to the rules described in the figure and are considered invalid otherwise.&lt;br /&gt;
&lt;br /&gt;
[[File:Payload-building.png|thumb|Rules for payload building]]&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=File:Payload-building.png&amp;diff=3410</id>
		<title>File:Payload-building.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=File:Payload-building.png&amp;diff=3410"/>
		<updated>2022-11-03T13:08:14Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Rules for payload building&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3409</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3409"/>
		<updated>2022-11-03T13:05:40Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the message streams produced on one subnet available for inclusion in blocks on the destination subnets.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees beyond those the system already provides, but simply needs to make sure that system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer on the subnet the destination canister lives on, exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from natural numbers to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
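The abstract queue and its three associated functions can be illustrated with the following Python sketch (indexing starts at 1 to match &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;; this is an illustration of the definitions above, not the actual implementation):&lt;br /&gt;

```python
class Queue:
    """Rolling-index queue: elements is a partial map ℕ ↦ T and
    next_index is the index of the next element to be inserted."""

    def __init__(self):
        self.next_index = 1
        self.elements = {}

    def push(self, values):
        # Append the values in the order of their source keys,
        # assigning consecutive indices starting at next_index.
        for k, (_, t) in enumerate(sorted(values.items())):
            self.elements[self.next_index + k] = t
        self.next_index += len(values)

    def delete(self, values):
        # Remove the given (index ↦ element) pairs; next_index is kept.
        for i, t in values.items():
            assert self.elements[i] == t  # REQUIRE: values ⊆ elements
            del self.elements[i]

    def clear(self):
        # Drop all elements but keep the rolling next_index.
        self.elements = {}
```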
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a queue of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantic for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
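The lifting of &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; to maps of queues can be sketched as follows (queues are modelled as plain lists here; an illustration only, with assumed data shapes):&lt;br /&gt;

```python
def push_map(queues, values):
    """queues: {id: [T]} (each queue modelled as a list);
    values: {(id, n): T}. Appends each id's values in index order,
    creating a fresh queue for ids not present in `queues`."""
    result = {qid: list(q) for qid, q in queues.items()}
    per_id = {}
    for (qid, n), t in sorted(values.items()):
        per_id.setdefault(qid, []).append(t)
    for qid, ts in per_id.items():
        result.setdefault(qid, []).extend(ts)
    return result
```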
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
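The index semantics above can be sketched compactly; a minimal model, assuming indices are Python tuples ending in a natural number:&lt;br /&gt;

```python
# Minimal model (our assumption): an Index is a tuple whose last element
# is a natural number.

def prefix(i):
    """All elements of the index except the last."""
    return i[:-1]

def postfix(i):
    """The trailing sequence number."""
    return i[-1]

def succ(i):
    """i + 1: increment the trailing sequence number, keep the prefix."""
    return prefix(i) + (postfix(i) + 1,)

def comparable(i, j):
    """Indices are comparable only when their prefixes agree."""
    return prefix(i) == prefix(j)

def leq(i, j):
    """i <= j iff the prefixes agree and the sequence numbers compare."""
    return comparable(i, j) and postfix(i) <= postfix(j)
```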
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is comprised of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure Streams holding all streams indexed by destination subnetwork:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within the &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings have been decoded into a format suitable for processing, and in which data not required inside the deterministic state machine has been stripped.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the replica&#039;s own subnet ID and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
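A minimal sketch of &amp;lt;code&amp;gt;decode&amp;lt;/code&amp;gt;, with the state manager&#039;s decoding routine passed in as a parameter so the example stays self-contained (batches are modeled as plain dictionaries; the parameter name is a stand-in, not the actual state manager API):&lt;br /&gt;

```python
def decode(own_subnet, batch, decode_valid_certified_stream):
    """Decode a Batch into a DecodedBatch (both modeled as dicts).

    decode_valid_certified_stream stands in for the state manager's
    decoding routine; its real signature may differ.
    """
    return {
        # The ingress payload is assumed to be processable as-is.
        "ingress_payload": batch["ingress_payload"],
        # Every CertifiedStreamSlice is decoded into a StreamSlice.
        "xnet_payload": {
            src_subnet: decode_valid_certified_stream(own_subnet, cert_slice)
            for src_subnet, cert_slice in batch["xnet_payload"].items()
        },
    }
```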
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other conditions, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
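The round structure above can be sketched as a simple driver. The three phase functions are injected as parameters, since their definitions follow later; this is an illustration, not the actual component API:&lt;br /&gt;

```python
def process_batch(s, subnet_assignment, decoded_batch,
                  pre_process, execute, post_process):
    """One deterministic processing round: induction, execution, XNet routing.

    In the real system the resulting state would afterwards be handed to
    commit_and_certify; that step is omitted here.
    """
    s = pre_process(s, subnet_assignment, decoded_batch)   # induction phase
    s = execute(s)                                         # execution phase
    s = post_process(s, subnet_assignment)                 # XNet routing phase
    return s
```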
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or dropping them.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions, corresponding to the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that decides whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantics that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantics that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the ingress payload that are accepted by the VSR; which indices are accepted is determined by the concrete implementation of the VSR. The result is a set of index-message tuples corresponding to the accepted messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = Rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
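A sketch of this construction, assuming &amp;lt;code&amp;gt;Rank(i, A)&amp;lt;/code&amp;gt; denotes the position of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; within the accepted index set &amp;lt;code&amp;gt;A&amp;lt;/code&amp;gt; in ascending order (an assumption on our part; messages are modeled as dictionaries with a &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; field):&lt;br /&gt;

```python
def vsr_ingress(ingress_payload, accepted):
    """Map accepted ingress messages to (dst, rank) keys.

    `accepted` plays the role of vsr_check_ingress(state, batch); Rank(i, A)
    is taken to be the position of i within A in ascending order.
    """
    ordered = sorted(accepted)
    return {
        (m["dst"], ordered.index(i)): m
        for i, m in ingress_payload.items()
        if i in accepted
    }
```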
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of the scheduler and the hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the schedule_and_execute function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR that accepts all messages, while in reality the VSR may reject some messages, e.g., when canisters migrate across subnets or subnets are split. While messages being REJECTed by the VSR would require specific action by the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition we define a couple of helper functions. First we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message))&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
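The trivial implementation mentioned in the comment can be sketched as follows, with the input modeled as a flat dictionary keyed by &amp;lt;code&amp;gt;(SubnetId, StreamIndex)&amp;lt;/code&amp;gt; tuples (the modeling is ours, not part of the specification):&lt;br /&gt;

```python
def queue_index(S):
    """Trivial queue_index: walk the slices in a deterministic order,
    preserving per-subnet stream order, and push each message onto the
    queue of its destination canister.

    S maps (subnet_id, stream_index) tuples to messages; messages are
    modeled as dicts with a "dst" field.
    """
    counters = {}  # dst canister -> next position in its queue
    out = {}
    # Sorting the keys iterates subnets deterministically and, within a
    # subnet, in ascending stream-index order, so the ENSURES clause holds.
    for key in sorted(S):
        m = S[key]
        n = counters.get(m["dst"], 0)
        out[(m["dst"], n)] = m
        counters[m["dst"]] = n + 1
    return out
```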
&lt;br /&gt;
Based on this we can now define a function that maps over the indexes of the valid XNet messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (index ↦ ·) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j ∈ slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queue, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
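The message-routing view of this transition can be sketched as follows, with queues modeled as flat index-to-message dictionaries and &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; passed in as a parameter (an illustrative model, not the actual implementation):&lt;br /&gt;

```python
def execute(state, schedule_and_execute):
    """Message-routing view of the execution phase.

    Queues are modeled as flat index -> message dicts; execution-specific
    state changes are out of scope, as in the specification.
    """
    consumed_ingress, consumed_xnet, produced = schedule_and_execute(state)
    return {
        **state,
        # Consumed messages leave the ingress and input queues ...
        "ingress_queues": {k: v for k, v in state["ingress_queues"].items()
                           if k not in consumed_ingress},
        "input_queues": {k: v for k, v in state["input_queues"].items()
                         if k not in consumed_xnet},
        # ... and produced messages are appended to the output queues.
        "output_queues": {**state["output_queues"], **produced},
    }
```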
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;, puts them into a subnet-to-subnet stream, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, subnet_assignment)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; represents a view of the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A second helper produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
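A sketch of this helper, with the queue elements modeled as a dictionary keyed by &amp;lt;code&amp;gt;(src, dst, i)&amp;lt;/code&amp;gt; tuples and the reject payload as a plain string stand-in:&lt;br /&gt;

```python
NON_EXISTENT_CANISTER = "NON_EXISTENT_CANISTER"  # stand-in reject payload

def non_existent_canister_replies(elements, subnet_assignment):
    """For every queued message whose destination has no subnet assignment,
    produce a reject reply addressed back at the sender (src/dst swapped)."""
    return {
        (dst, src, i): NON_EXISTENT_CANISTER
        for (src, dst, i) in elements
        if dst not in subnet_assignment
    }
```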
&lt;br /&gt;
&#039;&#039;Non-flat streams.&#039;&#039; As already mentioned, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams; this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indices, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that—​based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, i.e., a src-dst tuple—​decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, i.e., the SubstreamId.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
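The routing transition above can be sketched in a simplified form. This is a hypothetical illustration assuming flat streams, with queues and streams modelled as plain dictionaries and lists; the &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; reply handling is omitted:&lt;br /&gt;

```python
# Hypothetical sketch of the post_process routing step, assuming flat streams.
# Names mirror the pseudocode above, but the data types are simplified.

def post_process(output_queues, streams, subnet_assignment):
    # output_queues: (src, dst) -) list of messages, in queue order
    # streams: subnet_id -) list of messages (a flat stream's msgs)
    # subnet_assignment: canister_id -) subnet_id
    streams = {subnet: list(msgs) for subnet, msgs in streams.items()}
    for (src, dst), msgs in output_queues.items():
        subnet = subnet_assignment[dst]              # route by destination canister
        streams.setdefault(subnet, []).extend(msgs)  # append, preserving queue order
    return {}, streams                               # output queues are cleared

queues = {("a", "b"): ["m1", "m2"], ("a", "c"): ["m3"]}
assignment = {"b": "subnet1", "c": "subnet2"}
new_queues, new_streams = post_process(queues, {"subnet1": ["m0"]}, assignment)
assert new_queues == {}
assert new_streams == {"subnet1": ["m0", "m1", "m2"], "subnet2": ["m3"]}
```

Note that, as in the specification, the canister-to-canister order within each queue is preserved when messages are appended to the stream.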
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the invariant that the canister-to-canister ordering of messages is preserved holds when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.&lt;br /&gt;
&lt;br /&gt;
== XNet Transfer ==&lt;br /&gt;
After calling &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The &amp;lt;code&amp;gt;XNetTransfer&amp;lt;/code&amp;gt; subcomponent is responsible for enabling this transfer. It consists of&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet.png|thumb|XNet transfer component diagram]]&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which is responsible for serving certified stream slices and making them available to &amp;lt;code&amp;gt;XNetPayloadBuilders&amp;lt;/code&amp;gt; on other subnetworks.&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;, which allows the block makers to obtain an &amp;lt;code&amp;gt;XNetPayload&amp;lt;/code&amp;gt; containing the currently available certified streams originating from other subnetworks. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; obtains those streams by interacting with &amp;lt;code&amp;gt;XNetEndpoints&amp;lt;/code&amp;gt; exposed by other subnets. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; also provides functionality for notaries to verify &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; contained in block proposals.&lt;br /&gt;
&lt;br /&gt;
We do not specify anything about the protocol run between the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; to transfer the streams between two subnetworks. The only requirement we have is that certified streams made available by an &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; of an honest replica on some source subnetwork can be obtained by an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; of an honest replica on the destination subnetwork, and that the information regarding which endpoints to contact is available in the Registry.&lt;br /&gt;
&lt;br /&gt;
=== Properties and Functionality ===&lt;br /&gt;
Assume an XNet transfer component on a replica that is part of subnet &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt;. The interface behavior of the XNet transfer component will guarantee that for any payload produced via&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;get_xnet_payload(registry_version, reference_height, past_payloads, size_limit)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
we have that for any &amp;lt;code&amp;gt;(remote_subnet ↦ css) ∈ payload&amp;lt;/code&amp;gt;: &lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;StateManager.decode_certified_stream(registry_version, own_subnet, remote_subnet, css)&amp;lt;/code&amp;gt; succeeds, i.e., returns a valid slice that is guaranteed to come from &amp;lt;code&amp;gt;remote_subnet&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Furthermore, for each slice it will hold that, as soon as the state corresponding to height &amp;lt;code&amp;gt;h = reference_height + |past_payloads|&amp;lt;/code&amp;gt; is available, &amp;lt;code&amp;gt;concatenate(remote_subnet, min(dom(slice.msgs.elements))) ∈ StateManager.get_state_at(h).expected_indexes&amp;lt;/code&amp;gt;. This means that the streams will start with the expected indexes stored in the previous state, i.e., they extend the previously seen streams without gaps.&lt;br /&gt;
&lt;br /&gt;
Payloads verified using &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; are accepted if they adhere to those requirements, and are rejected otherwise.&lt;br /&gt;
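The second property above amounts to a gap-freeness check. It can be sketched as follows; this is a hypothetical illustration for flat streams with plain integer indexes, and &amp;lt;code&amp;gt;slice_extends_gap_free&amp;lt;/code&amp;gt; is an invented helper name:&lt;br /&gt;

```python
# Hypothetical sketch of the gap-freeness check: each incoming slice must
# begin exactly at the stream index the receiving subnet expects
# (simplified to plain integers for flat streams).

def slice_extends_gap_free(expected_indexes, remote_subnet, slice_msgs):
    # expected_indexes: remote_subnet -) next stream index we expect
    # slice_msgs: stream_index -) message, for the incoming slice
    if not slice_msgs:
        return True                      # an empty slice is trivially fine
    return min(slice_msgs) == expected_indexes[remote_subnet]

expected = {"subnet_x": 7}
assert slice_extends_gap_free(expected, "subnet_x", {7: "m7", 8: "m8"})
assert not slice_extends_gap_free(expected, "subnet_x", {9: "m9"})
```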
&lt;br /&gt;
=== XNet Endpoint ===&lt;br /&gt;
The &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; serves the streams available on some subnet to other subnets. For an implementation this will typically mean that there is some client which will handle querying the API of the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; on the remote subnet in question. We use the following abstraction to avoid explicitly talking about this client: We assume that there is a function &amp;lt;code&amp;gt;get : SubnetId → XNetEndpoint&amp;lt;/code&amp;gt; which will return an appropriate instance of &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which we can directly query using the API described below.&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet-sequence.png|thumb|XNet transfer sequence diagram]]&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;get_stream(subnet_id : SubnetId, begin : StreamIndex, msg_limit : ℕ, size_limit : ℕ) → CertifiedStreamSlice&amp;lt;/code&amp;gt;: Returns the requested certified stream slice in its transport format.&lt;br /&gt;
&lt;br /&gt;
We require that an honest &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;-&amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; pair is able to successfully obtain slices over this API.&lt;br /&gt;
&lt;br /&gt;
Looking at the bigger picture, the intuition for why this yields a secure system is that in each round a new pair of block maker and endpoint will try to pull over a stream, which, in turn, means that eventually an honest pair will be able to obtain the stream and include it in a block.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3408</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3408"/>
		<updated>2022-11-03T13:00:19Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain, and one can view the execution of the batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain the replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the certified message streams produced on one subnet available to the subnets they are addressed to.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are not important for the conceptual overview here and are therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees beyond those the system already provides, but simply needs to make sure that system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended in the order of their keys. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
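A minimal sketch of the abstract queue with the three associated functions, assuming Python dictionaries for the &amp;lt;code&amp;gt;elements&amp;lt;/code&amp;gt; map and taking &amp;lt;code&amp;gt;rank&amp;lt;/code&amp;gt; to be the position among the sorted keys of &amp;lt;code&amp;gt;values&amp;lt;/code&amp;gt; (a hypothetical illustration, not the actual implementation):&lt;br /&gt;

```python
# Hypothetical sketch of the abstract Queue type, with elements stored in a
# dict keyed by a rolling natural-number index (1-based, as in new_queue).

class Queue:
    def __init__(self):
        self.next_index = 1
        self.elements = {}

    def push(self, values):
        # Append values in the order of their keys; advance next_index.
        for k in sorted(values):
            self.elements[self.next_index] = values[k]
            self.next_index += 1

    def delete(self, indexes):
        # Remove the given indexes, keeping next_index unchanged.
        for i in indexes:
            del self.elements[i]

    def clear(self):
        # Drop all elements, keeping next_index unchanged.
        self.elements = {}

q = Queue()
q.push({10: "a", 20: "b"})           # incoming keys only determine the order
assert q.elements == {1: "a", 2: "b"} and q.next_index == 3
q.delete([1])
assert q.elements == {2: "b"} and q.next_index == 3
q.clear()
assert q.elements == {} and q.next_index == 3
```

Keeping &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; across &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; is what makes the index rolling, so later pushes never reuse an index.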
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated with &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
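The lifted semantics can be sketched as follows for &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt;, modelling each queue as a plain list for brevity (a hypothetical illustration):&lt;br /&gt;

```python
# Hypothetical sketch of the lifted f_map semantics over a map of queues.
# The three cases mirror the three set comprehensions above.

def push_map(queues, values):
    # queues: identifier -) list; values: identifier -) list of new items.
    # Queues without new values are kept unchanged (third case); queues that
    # do not exist yet are created fresh (second case).
    result = {ident: list(q) for ident, q in queues.items()}
    for ident, items in values.items():
        result.setdefault(ident, []).extend(items)
    return result

queues = {"q1": ["a"]}
assert push_map(queues, {"q1": ["b"], "q2": ["c"]}) == {"q1": ["a", "b"], "q2": ["c"]}
assert push_map(queues, {}) == {"q1": ["a"]}
```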
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be a sequence of arbitrary length, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an index Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of i except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
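These index operations can be sketched as follows, modelling an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; as a Python tuple whose last element is a natural number (a hypothetical illustration):&lt;br /&gt;

```python
# Hypothetical sketch of the Index operations defined above.

def prefix(i):
    # All elements of the index except the last one.
    return i[:-1]

def postfix(i):
    # The last element; required to be a natural number.
    return i[-1]

def increment(i):
    # i + 1 := concatenate(prefix(i), postfix(i) + 1)
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # Indices with different prefixes are incomparable (here: not leq).
    return prefix(i) == prefix(j) and postfix(j) >= postfix(i)

idx = ("src", "dst", 4)
assert prefix(idx) == ("src", "dst")
assert postfix(idx) == 4
assert increment(idx) == ("src", "dst", 5)
assert leq(idx, ("src", "dst", 9))
assert not leq(idx, ("other", "dst", 9))
```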
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets—​the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is composed of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams, indexed by destination subnetwork:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and the end of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: A &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific things are decoded into the format suitable for processing and some things which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; because we assume that the ingress payload is suitable to be processed right away. Formally there is a function which, based on the subnet&#039;s own ID and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
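The decoding function can be sketched as follows. This is a hypothetical illustration in which &amp;lt;code&amp;gt;decode_valid_certified_stream&amp;lt;/code&amp;gt; is stubbed to unwrap a pre-verified payload; in the real system it is provided by the state manager:&lt;br /&gt;

```python
# Hypothetical sketch of batch decoding. decode_valid_certified_stream stands
# in for the state manager routine of the same name and is stubbed here.

def decode_valid_certified_stream(own_subnet, cert_slice):
    # Stub: in the real system this verifies the certification and returns
    # the contained StreamSlice addressed to own_subnet.
    return cert_slice["payload"]

def decode(own_subnet, batch):
    # Ingress payload is passed through; xnet payload is decoded per subnet.
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src: decode_valid_certified_stream(own_subnet, cs)
            for src, cs in batch["xnet_payload"].items()
        },
    }

batch = {
    "ingress_payload": {1: "ing1"},
    "xnet_payload": {"subnet_x": {"payload": "slice_x"}},
}
decoded = decode("own", batch)
assert decoded == {"ingress_payload": {1: "ing1"},
                   "xnet_payload": {"subnet_x": "slice_x"}}
```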
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layers. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message m in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
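&lt;br /&gt;
Putting the steps above together, one deterministic processing round can be sketched as follows in the notation used on this page (illustrative only: the exact state manager and registry calls, and the use of a &amp;lt;code&amp;gt;b.height&amp;lt;/code&amp;gt; field, are assumptions, not part of the specification):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Sketch of one processing round for Batch b (illustrative)&lt;br /&gt;
process_batch(own_subnet, b) :=&lt;br /&gt;
    s  := StateManager.get_state_at(b.height - 1)&lt;br /&gt;
    sa := registry.get_registry_at(b.registry_version).subnet_assignment&lt;br /&gt;
    s&#039; := post_process(execute(pre_process(s, sa, decode(own_subnet, b))), sa)&lt;br /&gt;
    StateManager.commit_and_certify(s&#039;, b.height)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;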
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages are added to the respective destination queue/stream preserving the order in which they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that decides whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantics that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantics that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the batch&#039;s ingress payload that are accepted by the VSR. The set is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = Rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
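&lt;br /&gt;
As a concrete worked example (illustrative; it assumes that &amp;lt;code&amp;gt;Rank(i, I)&amp;lt;/code&amp;gt; denotes the position of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; in the ascending ordering of &amp;lt;code&amp;gt;I&amp;lt;/code&amp;gt;), suppose the batch carries three ingress messages and the VSR accepts the ones at indices 0 and 2:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Example (illustrative):&lt;br /&gt;
% batch.ingress_payload = { 0 ↦ m0, 1 ↦ m1, 2 ↦ m2 }&lt;br /&gt;
% vsr_check_ingress(state, batch) = { 0, 2 }&lt;br /&gt;
VSR(state, batch).ingress = { ((m0.dst, 0) ↦ m0), ((m2.dst, 1) ↦ m2) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;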
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt;, a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
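&lt;br /&gt;
For illustration, consider a round in which the hypervisor consumes a single ingress message addressed to canister A, and A in turn produces one request to canister B. The resulting change set could then look as follows (all indices and the concrete shape of &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt; are chosen arbitrarily for this example):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Example change set (illustrative)&lt;br /&gt;
schedule_and_execute(S) =&lt;br /&gt;
    with&lt;br /&gt;
       ├─ consumed_ingress_messages := { ((A, 4) ↦ m_ingress) }&lt;br /&gt;
       ├─ consumed_xnet_messages    := { }&lt;br /&gt;
       └─ produced_messages         := { ((A, B, 0) ↦ m_request) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;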
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here with respect to a version of the VSR that accepts all messages; in reality, the VSR may reject some messages, e.g., when canisters migrate across subnets or when subnets are split. While REJECTed messages would require specific action from the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition, we define a couple of helper functions. First, we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index : ((SubnetId × StreamIndex) ↦ Message) → ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Based on this, we can now define a function that maps the valid XNet messages from their stream indices to the corresponding queue indices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR no longer ACCEPTs all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (index ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j = slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the subnet_assignment, puts them into a subnet-to-subnet stream, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, subnet_assignment)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and subnet_assignment is the canister-to-subnet mapping obtained from the registry view.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We also define a helper function producing &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies that tell the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
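&lt;br /&gt;
As a worked example (illustrative), a single undeliverable message from canister A to a canister C that is absent from the subnet assignment yields exactly one synthetic reply, enqueued towards the sender with source and destination swapped:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Example (illustrative):&lt;br /&gt;
% queues.elements = { ((A, C, 0) ↦ m) },  C ∉ dom(subnet_assignment)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment)&lt;br /&gt;
    = { ((C, A, 0) ↦ NON_EXISTENT_CANISTER) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;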
&lt;br /&gt;
&#039;&#039;Non-flat streams.&#039;&#039; As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams; this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indices, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that, based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, i.e., a src-dst tuple, decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, i.e., the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the canister-to-canister ordering of messages is preserved when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.&lt;br /&gt;
&lt;br /&gt;
== XNet Transfer ==&lt;br /&gt;
After calling &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The &amp;lt;code&amp;gt;XNetTransfer&amp;lt;/code&amp;gt; subcomponent is responsible for enabling this transfer. It consists of&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet.png|thumb|XNet transfer component diagram]]&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which is responsible for serving certified stream slices and making them available to &amp;lt;code&amp;gt;XNetPayloadBuilders&amp;lt;/code&amp;gt; on other subnetworks.&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;, which allows the block makers to obtain an &amp;lt;code&amp;gt;XNetPayload&amp;lt;/code&amp;gt; containing the currently available certified streams originating from other subnetworks. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; obtains those streams by interacting with &amp;lt;code&amp;gt;XNetEndpoints&amp;lt;/code&amp;gt; exposed by other subnets. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; also provides functionality for notaries to verify &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; contained in block proposals.&lt;br /&gt;
&lt;br /&gt;
We do not specify anything about the protocol run between the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; to transfer the streams between two subnetworks. The only requirements are that certified streams made available by an &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; of an honest replica on some source subnetwork can be obtained by an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; of an honest replica on the destination subnetwork, and that the information regarding which endpoints to contact is available in the Registry.&lt;br /&gt;
&lt;br /&gt;
=== Properties and Functionality ===&lt;br /&gt;
Assume an XNet transfer component on a replica that is part of subnet &amp;lt;code&amp;gt;own_subnet&amp;lt;/code&amp;gt;. The interface behavior of the XNet transfer component guarantees that for any payload produced via&lt;br /&gt;
&lt;br /&gt;
&amp;lt;nowiki&amp;gt;get_xnet_payload(registry_version, reference_height, past_payloads, size_limit)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
we have that for any &amp;lt;code&amp;gt;(remote_subnet ↦ css) ∈ payload&amp;lt;/code&amp;gt;: &lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;StateManager.decode_certified_stream(registry_version, own_subnet, remote_subnet, css)&amp;lt;/code&amp;gt; succeeds, i.e., returns a valid slice that is guaranteed to come from remote_subnet.&lt;br /&gt;
&lt;br /&gt;
* Furthermore, for each slice it holds that, as soon as the state corresponding to height &amp;lt;code&amp;gt;h = reference_height + |past_payloads|&amp;lt;/code&amp;gt; is available, &amp;lt;code&amp;gt;concatenate(remote_subnet, min(dom(slice.msgs.elements))) ∈ StateManager.get_state_at(h).expected_xnet_indices&amp;lt;/code&amp;gt;. This means that the streams start at the expected indices stored in the previous state, i.e., they extend the previously seen streams without gaps.&lt;br /&gt;
&lt;br /&gt;
Payloads verified using &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; are accepted if they adhere to those requirements, and are rejected otherwise.&lt;br /&gt;
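&lt;br /&gt;
The two requirements can be restated as a sketch of the check performed by &amp;lt;code&amp;gt;validate_xnet_payload&amp;lt;/code&amp;gt; (illustrative only; the concrete signature and parameters may differ in the implementation):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Sketch of payload validation (illustrative)&lt;br /&gt;
validate_xnet_payload(payload, registry_version, reference_height, past_payloads) :=&lt;br /&gt;
    ∀ (remote_subnet ↦ css) ∈ payload :&lt;br /&gt;
        ├─ StateManager.decode_certified_stream(registry_version, own_subnet,&lt;br /&gt;
        │                                       remote_subnet, css) succeeds ∧&lt;br /&gt;
        └─ the decoded slice starts at the expected index recorded in the state&lt;br /&gt;
           at height reference_height + |past_payloads|&amp;lt;/nowiki&amp;gt;&lt;br /&gt;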
&lt;br /&gt;
[[File:Xnet-sequence.png|thumb|XNet transfer sequence diagram]]&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3407</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3407"/>
		<updated>2022-11-03T12:57:09Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain, and one can view the execution of the batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain the replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Processing batches relative to some replicated state and some registry view in a deterministic manner, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the streams produced by deterministic processing available to other subnets, so that their block makers can include the contained messages in blocks.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that, formally, this layer does not add any new guarantees on top of the ones the system already provides, but simply needs to make sure that system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer on the subnet that the destination canister lives on, exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map &amp;lt;code&amp;gt;ℕ ↦ T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
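&lt;br /&gt;
The abstract queue above can be rendered as a small Python sketch; the class is only illustrative and assumes a 1-based &amp;lt;code&amp;gt;rank&amp;lt;/code&amp;gt; function, as in the definitions above.&lt;br /&gt;

```python
# Illustrative rendering of the abstract Queue<T> type: a rolling
# next_index plus a partial map from indices to elements. Field and
# method names follow the pseudocode; the class itself is only a sketch.

class Queue:
    def __init__(self):
        self.next_index = 1   # rolling index of the next message to insert
        self.elements = {}    # index -> element

    def push(self, values):
        # Append the values (a partial map) in key order; the k-th value by
        # rank lands at next_index + k - 1, then next_index advances by
        # the number of inserted elements.
        for rank, key in enumerate(sorted(values)):
            self.elements[self.next_index + rank] = values[key]
        self.next_index += len(values)

    def delete(self, values):
        # Remove the given (index -> element) pairs; next_index is kept.
        assert all(self.elements.get(i) == v for i, v in values.items())
        for i in values:
            del self.elements[i]

    def clear(self):
        # Drop all elements but keep next_index.
        self.elements = {}

q = Queue()
q.push({7: "a", 9: "b"})       # ranks of keys 7 and 9 give indices 1 and 2
assert q.elements == {1: "a", 2: "b"} and q.next_index == 3
q.delete({1: "a"})
assert q.elements == {2: "b"} and q.next_index == 3
```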
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
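&lt;br /&gt;
The lifted &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; variant can be sketched as follows, using a minimal queue class; identifiers without an existing queue get a fresh one, matching the second set in the definition above. All names are assumptions made for the sketch.&lt;br /&gt;

```python
# Minimal queue: rolling next_index plus index -> element map.
class Queue:
    def __init__(self):
        self.next_index, self.elements = 1, {}

    def push(self, values):
        for rank, key in enumerate(sorted(values)):
            self.elements[self.next_index + rank] = values[key]
        self.next_index += len(values)

def push_map(queues, values):
    # Group the (identifier, n) -> value entries by identifier...
    grouped = {}
    for (ident, n), v in values.items():
        grouped.setdefault(ident, {})[n] = v
    # ...then push each group onto the identifier's existing queue, or
    # onto a freshly created one if the identifier is not yet present.
    for ident, group in grouped.items():
        queues.setdefault(ident, Queue()).push(group)
    return queues

qs = push_map({}, {("B", 1): "m1", ("B", 2): "m2"})
assert qs["B"].elements == {1: "m1", 2: "m2"}
```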
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
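&lt;br /&gt;
These operations can be sketched as follows, representing an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; as a tuple whose last component is a natural number; the helper names are assumptions made for the sketch.&lt;br /&gt;

```python
# Illustrative helpers for the Index operations defined above.

def prefix(i):  return i[:-1]   # all components except the last
def postfix(i): return i[-1]    # the trailing natural number

def increment(i):
    # i + 1: same prefix, last component incremented.
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # i <= j holds only for comparable indices, i.e., equal prefixes.
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)

i = ("subnet_a", 3)
assert increment(i) == ("subnet_a", 4)
assert leq(i, ("subnet_a", 5))
assert not leq(i, ("subnet_b", 5))   # incomparable: different prefixes
```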
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following types:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific data is decoded into a format suitable for processing and everything that is not required inside the deterministic state machine is stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the subnet&#039;s own id and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
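&lt;br /&gt;
A minimal Python sketch of &amp;lt;code&amp;gt;decode&amp;lt;/code&amp;gt;; the state manager&#039;s decoding routine is stubbed out here, since in the real system it verifies the certification before exposing the slice. Batches and slices are modeled as plain dictionaries for illustration only.&lt;br /&gt;

```python
# Sketch of decode: keep the ingress payload as-is and decode each
# CertifiedStreamSlice via the state manager's decoding routine.

def decode_valid_certified_stream(own_subnet, cert_slice):
    # Stub: a real implementation checks the witness and signature
    # against own_subnet's view before exposing the payload.
    return cert_slice["payload"]

def decode(own_subnet, batch):
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src: decode_valid_certified_stream(own_subnet, cert)
            for src, cert in batch["xnet_payload"].items()
        },
    }

batch = {"ingress_payload": {1: "ing"},
         "xnet_payload": {"subnet_x": {"payload": "slice_x"}}}
assert decode("subnet_y", batch)["xnet_payload"] == {"subnet_x": "slice_x"}
```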
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message m in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
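&lt;br /&gt;
The round sketched above can be summarized as a pipeline over the three phase functions; everything below is a toy stand-in used only to exercise the control flow, not the actual implementation.&lt;br /&gt;

```python
# Per-batch round: induction, execution, XNet routing, then commit.
# The phase functions and commit callback are injected so the control
# flow can be shown without modeling the real components.

def process_batch(state, subnet_assignment, decoded_batch,
                  pre_process, execute, post_process, commit):
    state = pre_process(state, subnet_assignment, decoded_batch)
    state = execute(state)
    state = post_process(state, subnet_assignment)
    commit(state)   # hand the updated state to the state manager
    return state

trace = []
result = process_batch(
    "s0", {}, "batch",
    pre_process=lambda s, a, b: s + ">pre",
    execute=lambda s: s + ">exec",
    post_process=lambda s, a: s + ">post",
    commit=trace.append,
)
assert result == "s0>pre>exec>post" and trace == [result]
```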
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that decides whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantics that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantics that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the batch&#039;s &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; that are accepted by the VSR. The set is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
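&lt;br /&gt;
The comprehension above can be sketched in Python as follows; &amp;lt;code&amp;gt;vsr_check_ingress&amp;lt;/code&amp;gt; is replaced by an explicit set of accepted indices, and messages are modeled as dictionaries carrying their destination. All names are assumptions made for the sketch.&lt;br /&gt;

```python
# Re-key the accepted ingress messages by (destination, rank within the
# accepted set), mirroring the set comprehension above.

def vsr_ingress(ingress_payload, accepted_indices):
    accepted = sorted(accepted_indices)   # rank is 1-based over this set
    return {
        (msg["dst"], rank): msg
        for rank, i in enumerate(accepted, start=1)
        for msg in [ingress_payload[i]]
    }

payload = {3: {"dst": "canA"}, 7: {"dst": "canB"}, 9: {"dst": "canA"}}
out = vsr_ingress(payload, {3, 9})   # the VSR accepts indices 3 and 9 only
assert out == {("canA", 1): {"dst": "canA"}, ("canA", 2): {"dst": "canA"}}
```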
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the schedule_and_execute function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR which accepts all messages, while in reality the VSR may reject some messages in case canisters migrate across subnets or subnets are split. While the possibility that messages can be REJECTed by the VSR would require specific action of the message routing layer, we omit those actions here for simplicity as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition, we define a couple of helper functions. First, we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message))&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
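&lt;br /&gt;
The trivial implementation described in the comment can be sketched as follows; for brevity the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; component is omitted, which matches the current flat-stream setup, and messages are modeled as dictionaries carrying their destination.&lt;br /&gt;

```python
# Trivial queue_index: iterate over the slices per subnet, and within
# each slice over messages in stream order, appending each message to
# its destination canister's queue.

def queue_index(stream_messages):
    # stream_messages: (subnet, stream_index) -> message, where each
    # message carries its destination canister id in msg["dst"].
    next_per_dst, out = {}, {}
    for subnet, idx in sorted(stream_messages):
        msg = stream_messages[(subnet, idx)]
        n = next_per_dst.get(msg["dst"], 1)
        out[(msg["dst"], n)] = msg
        next_per_dst[msg["dst"]] = n + 1
    return out

msgs = {("sn1", 1): {"dst": "A", "body": "x"},
        ("sn1", 2): {"dst": "A", "body": "y"}}
indexed = queue_index(msgs)
# Stream order is preserved per destination, as the postcondition requires.
assert indexed[("A", 1)]["body"] == "x" and indexed[("A", 2)]["body"] == "y"
```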
&lt;br /&gt;
Based on this we can now define a function that maps over the indexes of the valid XNet messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
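&lt;br /&gt;
A sketch of &amp;lt;code&amp;gt;map_valid_xnet_messages&amp;lt;/code&amp;gt;, with &amp;lt;code&amp;gt;queue_index&amp;lt;/code&amp;gt; collapsed to a simple per-destination enumeration so the example stays self-contained; messages are modeled as dictionaries with &amp;lt;code&amp;gt;src&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; fields, and all names are assumptions made for the sketch.&lt;br /&gt;

```python
# Keep only messages whose source canister is actually assigned to the
# subnet the slice came from, then assign per-destination queue indices.

def map_valid_xnet_messages(slices, subnet_assignment):
    valid = {
        (subnet, index): m
        for subnet, slice_msgs in slices.items()
        for index, m in slice_msgs.items()
        if subnet_assignment.get(m["src"]) == subnet
    }
    # Stand-in for queue_index: number the valid messages per destination
    # in stream order.
    out, counters = {}, {}
    for key in sorted(valid):
        m = valid[key]
        n = counters.get(m["dst"], 1)
        out[(m["dst"], n)] = m
        counters[m["dst"]] = n + 1
    return out

slices = {"sn1": {1: {"src": "A", "dst": "B"},
                  2: {"src": "Z", "dst": "B"}}}   # "Z" is not on sn1
out = map_valid_xnet_messages(slices, {"A": "sn1", "Z": "sn2"})
assert list(out) == [("B", 1)] and out[("B", 1)]["src"] == "A"
```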
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (index ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       (i ↦ msg) ∈ S.streams[subnet].msgs.elements ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j ∈ slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
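The signal-driven garbage collection in the transition above can be sketched in Python. This is an illustrative model only, not the actual implementation: outgoing stream messages are flattened into a dict keyed by `(subnet, index)`, and each incoming slice is assumed to expose the set of indices it signals.

```python
def gc_stream_messages(stream_msgs, xnet_payload):
    """Remove from the outgoing streams every message for which the
    destination subnet included a signal in its slice, i.e. every
    message that subnet has already inducted. (Since the VSR currently
    ACCEPTs all messages, no re-enqueueing of rejects is modeled.)"""
    # Collect all (subnet, index) pairs acknowledged by incoming slices.
    acked = {
        (subnet, i)
        for subnet, slice_ in xnet_payload.items()
        for i in slice_["signals"]
    }
    # Keep only the messages that have not been acknowledged yet.
    return {key: msg for key, msg in stream_msgs.items() if key not in acked}
```

For example, if the stream to subnet `s2` holds messages at indices 1 and 2 and the incoming `s2` slice signals index 1, only the message at index 2 remains in the stream.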
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
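Under the assumption that queues are modeled as plain dicts of index to message, and that the result of `schedule_and_execute` is given as the three collections named above, the transition can be sketched as:

```python
def execute(state, exec_result):
    """Apply the message-routing-visible effects of one execution round:
    consumed messages leave the ingress/input queues, produced messages
    are appended to the output queues. Execution-specific state changes
    are out of scope, as in the specification."""
    state = dict(state)
    # Consumed ingress messages disappear from the ingress queues.
    state["ingress_queues"] = {
        k: v for k, v in state["ingress_queues"].items()
        if k not in exec_result["consumed_ingress_messages"]
    }
    # Consumed canister-to-canister messages disappear from the input queues.
    state["input_queues"] = {
        k: v for k, v in state["input_queues"].items()
        if k not in exec_result["consumed_xnet_messages"]
    }
    # Produced messages are appended to the output queues.
    state["output_queues"] = {
        **state["output_queues"],
        **exec_result["produced_messages"],
    }
    return state
```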
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;, puts them into a subnet-to-subnet stream, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, registry)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and &amp;lt;code&amp;gt;registry&amp;lt;/code&amp;gt; represents a view of the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
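A minimal Python model of this helper, assuming output queues are represented as a dict keyed by `(src, dst)` tuples. Since the destination is part of the key, dropping whole queues is equivalent to deleting every message in them; the name `filter_queues` avoids shadowing Python's built-in `filter`.

```python
def filter_queues(queues, subnet_assignment):
    """Drop every output queue whose destination canister is not
    assigned to any subnet, i.e. does not exist according to the
    given subnet assignment."""
    return {
        (src, dst): queue
        for (src, dst), queue in queues.items()
        if dst in subnet_assignment
    }
```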
&lt;br /&gt;
A second helper produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
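A corresponding sketch of the reply helper, under the same dict-based modeling (each queue is a dict of index to message). Note how the `(src, dst, i)` key is flipped to `(dst, src, i)` so that the reply lands in the sender's input queue, appearing to come from the non-existent destination:

```python
def non_existent_canister_replies(queues, subnet_assignment):
    """For every message whose destination canister does not exist,
    produce a NON_EXISTENT_CANISTER reply addressed back to the
    sender, keyed so it can be pushed to the sender's input queue."""
    return {
        (dst, src, i): "NON_EXISTENT_CANISTER"
        for (src, dst), queue in queues.items()
        for i in queue  # queue is a dict: index -> message
        if dst not in subnet_assignment
    }
```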
&lt;br /&gt;
&#039;&#039;Non flat streams.&#039;&#039; As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams—​this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indices, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that—​based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, i.e., a src-dst tuple—​decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, i.e., the SubstreamId.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
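Putting the pieces together, the routing step of `post_process` can be sketched for the flat-stream case (`SubstreamId = ()`), with streams modeled as a dict from subnet ID to a message list. This is a simplified model, not the actual implementation; messages to unknown canisters are simply skipped here, since they are answered by the reply helper above:

```python
def route_to_streams(streams, output_queues, subnet_assignment):
    """Append every outgoing message to the stream of the subnet its
    destination canister lives on, preserving per-queue message order
    by iterating over each queue in index order."""
    streams = {s: list(msgs) for s, msgs in streams.items()}
    for (src, dst), queue in output_queues.items():
        if dst not in subnet_assignment:
            continue  # handled by non_existent_canister_replies
        target = subnet_assignment[dst]
        stream = streams.setdefault(target, [])
        for index in sorted(queue):  # queue order = index order
            stream.append(queue[index])
    return streams
```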
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the invariant that the canister-to-canister ordering of messages is preserved holds when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.&lt;br /&gt;
&lt;br /&gt;
== XNet Transfer ==&lt;br /&gt;
After calling &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The &amp;lt;code&amp;gt;XNetTransfer&amp;lt;/code&amp;gt; subcomponent is responsible for enabling this transfer. It consists of&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet.png|thumb|XNet transfer component diagram]]&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which is responsible for serving certified stream slices and making them available to &amp;lt;code&amp;gt;XNetPayloadBuilders&amp;lt;/code&amp;gt; on other subnetworks.&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;, which allows the block makers to obtain an &amp;lt;code&amp;gt;XNetPayload&amp;lt;/code&amp;gt; containing the currently available certified streams originating from other subnetworks. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; obtains those streams by interacting with &amp;lt;code&amp;gt;XNetEndpoints&amp;lt;/code&amp;gt; exposed by other subnets. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; also provides functionality for notaries to verify &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; contained in block proposals.&lt;br /&gt;
&lt;br /&gt;
We do not specify anything about the protocol run between the &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; and the &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; to transfer the streams between two subnetworks. The only requirement is that certified streams made available by an &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; of an honest replica on some source subnetwork can be obtained by an &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; of an honest replica on the destination subnetwork, and that the information regarding which endpoints to contact is available in the Registry.&lt;br /&gt;
&lt;br /&gt;
[[File:Xnet-sequence.png|thumb|XNet transfer sequence diagram]]&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=File:Xnet-sequence.png&amp;diff=3406</id>
		<title>File:Xnet-sequence.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=File:Xnet-sequence.png&amp;diff=3406"/>
		<updated>2022-11-03T12:56:47Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;XNet transfer sequence diagram&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=File:Xnet.png&amp;diff=3405</id>
		<title>File:Xnet.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=File:Xnet.png&amp;diff=3405"/>
		<updated>2022-11-03T12:55:54Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;XNet transfer component diagram&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3404</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3404"/>
		<updated>2022-11-03T12:53:07Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the streams produced on one subnet available to the subnets they are targeted at.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to that page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier-to-digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees to those the system already provides, but simply needs to make sure that system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from natural numbers to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queues keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queues keeping the next_index&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
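The three operations can be mirrored in a small Python class. This is a model of the abstract queue only: indices are 1-based as in the specification, and the indices of the map passed to `push` determine insertion order only, since the queue always assigns fresh consecutive indices starting at `next_index`.

```python
class Queue:
    """Abstract queue with a rolling next_index."""

    def __init__(self):
        self.next_index = 1
        self.elements = {}  # index -> element

    def push(self, values):
        """Append the given index -> element map in index order,
        assigning fresh consecutive indices and advancing next_index."""
        for k, j in enumerate(sorted(values)):
            self.elements[self.next_index + k] = values[j]
        self.next_index += len(values)

    def delete(self, indices):
        """Remove the given indices, keeping next_index untouched."""
        for i in indices:
            del self.elements[i]

    def clear(self):
        """Remove all elements, keeping next_index untouched."""
        self.elements = {}
```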
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a partial map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
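The lifted `push` can be sketched as follows, modeling each queue as a `(next_index, elements)` pair. As required by the definition, a fresh empty queue is created for identifiers that have no queue yet, while queues without matching values are kept as-is:

```python
def push_map(queues, values):
    """Lift push to a map of queues: group the incoming
    (id, n) -> element entries by id and append each group, in index
    order, to the queue for that id."""
    grouped = {}
    for (qid, n), elem in values.items():
        grouped.setdefault(qid, {})[n] = elem
    out = dict(queues)
    for qid, vals in grouped.items():
        # Identifiers without an existing queue get a fresh one.
        next_index, elements = out.get(qid, (1, {}))
        elements = dict(elements)
        for k, j in enumerate(sorted(vals)):
            elements[next_index + k] = vals[j]
        out[qid] = (next_index + len(vals), elements)
    return out
```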
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an index Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of i except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
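These definitions translate directly into a few helpers over Python tuples (illustrative only; an index is modeled as a tuple whose last component is the sequence number):

```python
def prefix(i):
    """All elements of the index except the trailing sequence number."""
    return i[:-1]

def postfix(i):
    """The trailing sequence number of the index."""
    return i[-1]

def increment(i):
    """i + 1: same prefix, sequence number bumped by one."""
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    """i <= j holds only for comparable indices (equal prefixes)."""
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)
```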
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input- and output queues are indexed by a concrete instance of Index called QueueIndex, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets—​the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure Streams holding all streams indexed by destination subnetwork:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead, one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings have been decoded into a format suitable for processing, and in which fields that are not required inside the deterministic state machine have been stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the own subnet id and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
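The decode function above can be sketched in Python as follows. This is an illustrative model under an assumed dict-based batch representation; the state manager decoding routine is passed in as an argument to keep the sketch self-contained.&lt;br /&gt;

```python
# Hypothetical sketch of `decode`: keep the ingress payload as-is and
# decode every certified stream slice via the state manager routine
# (here an injected function, since this sketch has no real state manager).

def decode(own_subnet, batch, decode_valid_certified_stream):
    """Turn a Batch (dict) into a DecodedBatch (dict)."""
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src_subnet: decode_valid_certified_stream(own_subnet, cert_slice)
            for src_subnet, cert_slice in batch["xnet_payload"].items()
        },
    }
```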
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprising the message routing and execution layers. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message m in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions, corresponding to the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that decides whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantics that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantics that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the indices of the messages in the ingress payload that are accepted by the VSR; the set of accepted indices is determined by the concrete implementation of the VSR. The result &amp;lt;code&amp;gt;VSR(state, batch).ingress&amp;lt;/code&amp;gt; is a possibly empty set of index-message tuples corresponding to the accepted messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = Rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
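A Python sketch of this definition could look as follows; it assumes (an assumption of the sketch, not stated above) that &amp;lt;code&amp;gt;Rank(i, A)&amp;lt;/code&amp;gt; denotes the position of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; within the sorted set &amp;lt;code&amp;gt;A&amp;lt;/code&amp;gt;, and models messages as dicts with a &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; field.&lt;br /&gt;

```python
# Hypothetical sketch of VSR(state, batch).ingress. The `accepted`
# index set stands in for vsr_check_ingress(state, batch).

def rank(i, accepted):
    """Position of index i within the sorted accepted set (assumed Rank)."""
    return sorted(accepted).index(i)

def vsr_ingress(ingress_payload, accepted):
    """Map each accepted ingress message to the IngressIndex (dst, rank)."""
    return {
        (m["dst"], rank(i, accepted)): m
        for i, m in ingress_payload.items()
        if i in accepted
    }
```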
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the schedule_and_execute function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR which accepts all messages, while in reality the VSR may reject some messages in case canisters migrate across subnets or subnets are split. So while the possibility that messages can be REJECTed by the VSR would require specific action of the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition we define a couple of helper functions. First we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦ Message) → ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
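The trivial implementation described in the comment above can be sketched in Python; this is a hypothetical dict-based model (messages as dicts with a &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; field), not the actual code. It satisfies the postcondition that, per destination, messages keep the relative order in which they appear in the incoming slices.&lt;br /&gt;

```python
# Hypothetical sketch of the trivial queue_index implementation:
# iterate over the slices per subnet in a fixed order and, within each
# slice, over the messages in stream order, pushing each message onto
# the queue of its destination canister.

def queue_index(S):
    """S : (subnet, stream_index) -> message. Returns (dst, pos) -> message."""
    out = {}
    next_pos = {}  # per-destination queue length so far
    for subnet, stream_idx in sorted(S):
        msg = S[(subnet, stream_idx)]
        pos = next_pos.get(msg["dst"], 0)
        next_pos[msg["dst"]] = pos + 1
        out[(msg["dst"], pos)] = msg
    return out
```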
&lt;br /&gt;
Based on this we can now define a function that maps over the indexes of the valid XNet messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
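A Python sketch of &amp;lt;code&amp;gt;map_valid_xnet_messages&amp;lt;/code&amp;gt; under the same hypothetical dict-based model (with &amp;lt;code&amp;gt;queue_index&amp;lt;/code&amp;gt; passed in as an argument so that the sketch stands on its own):&lt;br /&gt;

```python
# Hypothetical sketch: keep only messages whose sender is actually
# assigned to the subnet the slice came from, then hand them to
# queue_index to assign per-destination queue positions.

def map_valid_xnet_messages(slices, subnet_assignment, queue_index):
    valid = {
        (subnet, index): m
        for subnet, slice_ in slices.items()
        for index, m in slice_["msgs"].items()
        if subnet_assignment.get(m["src"]) == subnet
    }
    return queue_index(valid)
```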
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (concatenate(subnet, index) ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j ∈ slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;, puts them into a subnet-to-subnet stream, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, subnet_assignment)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; represents a view of the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We further define a helper that produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
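Both helpers can be sketched in Python over the flattened &amp;lt;code&amp;gt;queues.elements&amp;lt;/code&amp;gt; view. This is a hypothetical model: the name &amp;lt;code&amp;gt;filter_elements&amp;lt;/code&amp;gt; avoids shadowing the Python builtin, and the sketch operates on elements rather than on whole queues.&lt;br /&gt;

```python
# Hypothetical sketch of the two helpers: drop messages to unknown
# destinations, and produce a NON_EXISTENT_CANISTER reply back to each
# sender. Queue elements are modelled as {(src, dst, i) -> message}.

def filter_elements(elements, subnet_assignment):
    """Keep only messages whose destination is assigned to some subnet."""
    return {k: m for k, m in elements.items() if k[1] in subnet_assignment}

def non_existent_canister_replies(elements, subnet_assignment):
    """Reply (dst, src, i) -> NON_EXISTENT_CANISTER per dropped message."""
    return {
        (dst, src, i): "NON_EXISTENT_CANISTER"
        for (src, dst, i) in elements
        if dst not in subnet_assignment
    }
```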
&lt;br /&gt;
&#039;&#039;Non flat streams.&#039;&#039; As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams—​this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indices, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that—​based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, i.e., a src-dst tuple—​decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, i.e., the SubstreamId.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
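The routing loop in the state transition above can be sketched in Python as follows, assuming flat streams (i.e., &amp;lt;code&amp;gt;substream_id&amp;lt;/code&amp;gt; returns &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt; and each stream is a simple list); the names and the dict-based model are assumptions of this sketch.&lt;br /&gt;

```python
# Hypothetical sketch of the routing step: append each output-queue
# message (src, dst, i) to the stream of the subnet its destination is
# assigned to, preserving per-queue order via the sorted iteration.

def route_to_streams(output_elements, subnet_assignment, streams):
    for (src, dst, i) in sorted(output_elements):
        if dst not in subnet_assignment:
            continue  # handled via NON_EXISTENT_CANISTER replies instead
        streams.setdefault(subnet_assignment[dst], []).append(
            output_elements[(src, dst, i)]
        )
    return streams
```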
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the invariant is maintained that the canister-to-canister ordering of messages is preserved when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.&lt;br /&gt;
&lt;br /&gt;
== XNet Transfer ==&lt;br /&gt;
After calling &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; at the end of a deterministic processing cycle, the state manager will take care of getting the committed state certified. Once certification is complete, the certified stream slices can be made available to block makers on other subnets. The &amp;lt;code&amp;gt;XNetTransfer&amp;lt;/code&amp;gt; subcomponent is responsible for enabling this transfer. It consists of&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetEndpoint&amp;lt;/code&amp;gt; which is responsible for serving certified stream slices and making them available to &amp;lt;code&amp;gt;XNetPayloadBuilders&amp;lt;/code&amp;gt; on other subnetworks.&lt;br /&gt;
&lt;br /&gt;
* An &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt;, which allows the block makers to obtain an &amp;lt;code&amp;gt;XNetPayload&amp;lt;/code&amp;gt; containing the currently available certified streams originating from other subnetworks. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; obtains those streams by interacting with &amp;lt;code&amp;gt;XNetEndpoints&amp;lt;/code&amp;gt; exposed by other subnets. The &amp;lt;code&amp;gt;XNetPayloadBuilder&amp;lt;/code&amp;gt; also provides functionality for notaries to verify &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; contained in block proposals.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3403</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3403"/>
		<updated>2022-11-03T12:50:25Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain, and one can view the execution by the upper layers of the batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain the replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the message streams produced on one subnet available to the subnets they are targeted at.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document always take state in its implementation-specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operations the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this implicitly defines how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication between canisters across subnets. Formally, this layer does not add any new guarantees to the system, but simply needs to make sure that system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets           : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map; e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
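The abstract queue above can be modeled with a short Python sketch (a hypothetical model following this page&#039;s notation, not the actual implementation; the rank of an incoming value is realized by sorting the keys of the incoming map):&lt;br /&gt;

```python
# Hypothetical Python model of the abstract Queue<T>; field and function
# names (next_index, elements, push, delete, clear) follow the wiki text.

class Queue:
    def __init__(self):
        self.next_index = 1   # rolling index of the next message to insert
        self.elements = {}    # partial map: natural number -> element

    def push(self, values):
        # Append values in the order of their source keys (their "rank"),
        # assigning consecutive indices starting at next_index.
        for k, key in enumerate(sorted(values)):
            self.elements[self.next_index + k] = values[key]
        self.next_index += len(values)

    def delete(self, values):
        # Remove the given index -> element entries; next_index is kept.
        for idx in values:
            del self.elements[idx]

    def clear(self):
        # Drop all elements but keep the rolling next_index.
        self.elements = {}
```

For example, pushing two values onto a fresh queue assigns them indices 1 and 2 and advances &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; to 3; a subsequent &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; empties the queue but keeps the index.&lt;br /&gt;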
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
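The lifting above can be illustrated with a small Python sketch (hypothetical; &amp;lt;code&amp;gt;Queue&amp;lt;/code&amp;gt; is a minimal stand-in for the abstract queue type, and only &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; is lifted):&lt;br /&gt;

```python
# Hypothetical sketch of the map-lifted push: `queues` maps identifiers to
# queues, `values` maps (identifier, n) pairs to elements. Identifiers
# without an existing queue get a fresh one, and queues without incoming
# values are left untouched, as in the definition above.

class Queue:
    def __init__(self):
        self.next_index, self.elements = 1, {}

    def push(self, values):
        # Append values in key order, assigning consecutive indices.
        for k, key in enumerate(sorted(values)):
            self.elements[self.next_index + k] = values[key]
        self.next_index += len(values)

def push_map(queues, values):
    # Group incoming (identifier, n) -> element entries by identifier.
    by_id = {}
    for (ident, n), elem in values.items():
        by_id.setdefault(ident, {})[n] = elem
    # Push each group, creating a fresh queue for unseen identifiers.
    for ident, group in by_id.items():
        queues.setdefault(ident, Queue()).push(group)
    return queues
```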
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be a sequence of arbitrary length, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of i except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
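These operations translate directly into a short Python sketch (hypothetical; an index is modeled as a tuple whose last component is a natural number):&lt;br /&gt;

```python
# Hypothetical model of the Index operations defined above.

def prefix(i):
    # All elements of the index except the last one.
    return i[:-1]

def postfix(i):
    # The trailing sequence number.
    return i[-1]

def succ(i):
    # The operation i + 1: increment only the trailing sequence number.
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # i <= j holds only for comparable indices, i.e., equal prefixes.
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)
```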
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input- and output queues are indexed by a concrete instance of Index called QueueIndex, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet the stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams, indexed by the destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the slice within the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead, one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific encodings are decoded into a format suitable for processing, and parts which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the subnet&#039;s own id and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other criteria, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
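The per-batch steps above can be condensed into a small driver sketch (hypothetical Python; the phase bodies are placeholders that only record which phase ran, since the real transitions are specified below, and all names mirror this page&#039;s notation rather than the actual API):&lt;br /&gt;

```python
# Hypothetical sketch of one deterministic round driven by a batch.

def pre_process(state, subnet_assignment, decoded_batch):
    # Induction phase: valid batch messages enter the induction pool.
    return dict(state, phases=state["phases"] + ["induct"])

def execute(state):
    # Execution phase: scheduler/hypervisor consume input queue messages.
    return dict(state, phases=state["phases"] + ["execute"])

def post_process(state, subnet_assignment):
    # XNet routing phase: output queue messages move into streams.
    return dict(state, phases=state["phases"] + ["route"])

def process_batch(state_at, commit, own_subnet, batch, subnet_assignment):
    # The batch at height h is applied to the state of version h - 1
    # (decoding of the certified stream slices is elided here).
    state = state_at(batch["batch_number"] - 1)
    state = pre_process(state, subnet_assignment, batch)
    state = execute(state)
    state = post_process(state, subnet_assignment)
    # Commit the resulting state as version batch_number.
    commit(state, batch["batch_number"])
```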
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages are added to the respective destination queue/stream preserving the order in which they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions, corresponding to the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the ingress payload accepted by the VSR. Which indices are accepted is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = Rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
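The mapping above can be sketched in Python (hypothetical; &amp;lt;code&amp;gt;accepted&amp;lt;/code&amp;gt; stands for the result of &amp;lt;code&amp;gt;vsr_check_ingress&amp;lt;/code&amp;gt;, and the rank of an index is its 1-based position within the sorted accepted set):&lt;br /&gt;

```python
# Hypothetical sketch of the VSR mapping for ingress messages.

def rank(i, indices):
    # 1-based position of i within the (sorted) index set.
    return sorted(indices).index(i) + 1

def vsr_ingress(ingress_payload, accepted):
    # Map each accepted message m_i to the IngressIndex (m_i.dst, rank(i)).
    return {
        (m["dst"], rank(i, accepted)): m
        for i, m in ingress_payload.items()
        if i in accepted
    }
```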
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the schedule_and_execute function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR that accepts all messages, while in reality the VSR may reject some messages, e.g., when canisters migrate across subnets or subnets are split. Messages REJECTed by the VSR would require specific action from the message routing layer; we omit those actions here for simplicity, as they are not crucial for understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition, we define a couple of helper functions. First, we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message))&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Based on this, we can now define a function that computes the queue indices for the valid XNet messages contained in the given stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (concatenate(subnet, index) ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j ∈ slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;, puts them into a subnet-to-subnet stream, i.e., it computes &amp;lt;code&amp;gt;S&#039; = post_process(S, subnet_assignment)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; is the canister-to-subnet mapping taken from a view of the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following helper produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Non-flat streams.&#039;&#039; As already mentioned before, the specification leaves it open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams; this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indices, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that, based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt; (i.e., a src-dst tuple), decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, i.e., the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
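&lt;br /&gt;
For illustration (with hypothetical canister IDs &amp;lt;code&amp;gt;A&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;B&amp;lt;/code&amp;gt;), consider a message sitting at queue index &amp;lt;code&amp;gt;(A, B, 7)&amp;lt;/code&amp;gt;. During routing, its stream index is obtained by concatenating the substream id of the queue index prefix with the queue index postfix:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;q_index = (A, B, 7),  prefix(q_index) = (A, B),  postfix(q_index) = 7&lt;br /&gt;
&lt;br /&gt;
% Flat streams: substream_id((A, B)) = ()&lt;br /&gt;
concatenate((), 7) = ((), 7)   % effectively just the natural number 7&lt;br /&gt;
&lt;br /&gt;
% Per-destination substreams: substream_id((A, B)) = B&lt;br /&gt;
concatenate(B, 7)  = (B, 7)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;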
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;Ordering of Messages in the Stream &amp;amp; Fairness&#039;&#039;. As long as the invariant holds that the canister-to-canister ordering of messages is preserved when iterating over the filtered messages in the state transition described above, the implementation is free to apply alternative orderings.&lt;br /&gt;
&lt;br /&gt;
Also note that, while the state transition defined above empties the output queues completely, this is not crucial to the design and one could hold back messages as long as this does not violate the ordering requirement.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3402</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3402"/>
		<updated>2022-11-03T12:49:53Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC out of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Processing batches deterministically relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Moving the message streams produced on one subnet to the subnets they are destined for.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees on top of those the system provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer on the subnet the destination canister lives on, exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
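&lt;br /&gt;
For illustration, a registry for a hypothetical two-subnet deployment with canisters &amp;lt;code&amp;gt;A&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;B&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;C&amp;lt;/code&amp;gt; could contain the following entries:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : { sn_1, sn_2 },&lt;br /&gt;
    subnet_assignment : { A ↦ sn_1, B ↦ sn_1, C ↦ sn_2 }&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;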
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability, we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map of integers mapping to T, and returns a new queue consisting of the old queue with the given values appended. It also updates the next_index field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
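&lt;br /&gt;
As a worked example (assuming &amp;lt;code&amp;gt;rank&amp;lt;/code&amp;gt; is the 1-based position of a key in the ascending order of &amp;lt;code&amp;gt;dom(values)&amp;lt;/code&amp;gt;), pushing two values onto a queue with &amp;lt;code&amp;gt;next_index = 4&amp;lt;/code&amp;gt; proceeds as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;q.next_index = 4&lt;br /&gt;
values       = { (7 ↦ x), (9 ↦ y) }   % arbitrary source-side keys&lt;br /&gt;
&lt;br /&gt;
% k = rank(7) = 1 ⇒ index 4 - 1 + 1 = 4&lt;br /&gt;
% k = rank(9) = 2 ⇒ index 4 - 1 + 2 = 5&lt;br /&gt;
push(q, values).elements   ⊇ { (4 ↦ x), (5 ↦ y) }&lt;br /&gt;
push(q, values).next_index = 4 + |values| = 6&amp;lt;/nowiki&amp;gt;&lt;br /&gt;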
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
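&lt;br /&gt;
As a small hypothetical example, note that both &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; leave &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; untouched, so indices are never reused:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;q.elements   = { (4 ↦ x), (5 ↦ y) }&lt;br /&gt;
q.next_index = 6&lt;br /&gt;
&lt;br /&gt;
delete(q, { (4 ↦ x) }).elements   = { (5 ↦ y) }&lt;br /&gt;
delete(q, { (4 ↦ x) }).next_index = 6&lt;br /&gt;
&lt;br /&gt;
clear(q).elements   = ∅&lt;br /&gt;
clear(q).next_index = 6&amp;lt;/nowiki&amp;gt;&lt;br /&gt;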
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a partial map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
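&lt;br /&gt;
For instance (with hypothetical identifiers), pushing values for an identifier that does not yet have a queue creates a fresh one, while queues without new values remain unchanged:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;q = { (id_1 ↦ q_1), (id_3 ↦ q_3) }&lt;br /&gt;
v = { ((id_1, 1) ↦ x), ((id_2, 1) ↦ y) }&lt;br /&gt;
&lt;br /&gt;
push(q, v) = { (id_1 ↦ push(q_1, { (1 ↦ x) })),&lt;br /&gt;
               (id_2 ↦ push(new_queue, { (1 ↦ y) })),&lt;br /&gt;
               (id_3 ↦ q_3) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;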
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an &amp;lt;code&amp;gt;Index i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an &amp;lt;code&amp;gt;Index i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
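&lt;br /&gt;
To make these definitions concrete, consider the following indices with hypothetical canister IDs &amp;lt;code&amp;gt;A&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;B&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;C&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;i = (A, B, 5),  j = (A, B, 9),  k = (A, C, 2)&lt;br /&gt;
&lt;br /&gt;
prefix(i)  = (A, B)&lt;br /&gt;
postfix(i) = 5&lt;br /&gt;
i + 1      = (A, B, 6)&lt;br /&gt;
&lt;br /&gt;
i ≤ j              % prefix(i) = prefix(j) and 5 ≤ 9&lt;br /&gt;
i, k incomparable  % prefix(i) = (A, B) ≠ (A, C) = prefix(k)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;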
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following types:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
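&lt;br /&gt;
The &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; types can be sketched in Python as follows. The flat-stream case with &amp;lt;code&amp;gt;SubstreamId = ()&amp;lt;/code&amp;gt; is assumed, and all names are illustrative rather than taken from the implementation.&lt;br /&gt;

```python
ACCEPT, REJECT = "ACCEPT", "REJECT"
FLAT = ()  # SubstreamId of the single substream in a flat stream

class Stream:
    """Sketch of a flat Stream: signals and msgs are maps keyed by
    StreamIndex = (SubstreamId, n). Names are illustrative."""

    def __init__(self):
        self.signals = {}  # StreamIndex maps to ACCEPT or REJECT
        self.msgs = {}     # StreamIndex maps to a message
        self._next = 0     # next natural number within the substream

    def push_msg(self, msg):
        self.msgs[(FLAT, self._next)] = msg
        self._next += 1

# The Streams type: destination SubnetId maps to a Stream
streams = {"subnet_b": Stream()}
streams["subnet_b"].push_msg("request from a to b")
```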
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream that retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead, one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings have been decoded into a format suitable for processing, and in which fields not required inside the deterministic state machine have been stripped.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently, this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the subnet&#039;s own id and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
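&lt;br /&gt;
The definition above can be sketched in Python. The stand-in for &amp;lt;code&amp;gt;StateManager.decode_valid_certified_stream&amp;lt;/code&amp;gt; is hypothetical and skips the certification check that the real routine performs; batch fields are modelled as plain dicts.&lt;br /&gt;

```python
# Hypothetical stand-in for StateManager.decode_valid_certified_stream;
# the real routine verifies the certification before decoding, which
# is skipped in this sketch.
def decode_valid_certified_stream(own_subnet, cert_slice):
    return cert_slice["payload"]

def decode(own_subnet, batch):
    """Sketch of decode: the ingress payload is passed through and each
    CertifiedStreamSlice in the xnet payload is decoded into a slice."""
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src_subnet: decode_valid_certified_stream(own_subnet, cert_slice)
            for src_subnet, cert_slice in batch["xnet_payload"].items()
        },
    }

batch = {
    "ingress_payload": {0: "ingress msg"},
    "xnet_payload": {"subnet_a": {"payload": "stream slice from a"}},
}
decoded = decode("subnet_b", batch)
```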
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprising the message routing and execution layers. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other conditions, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions, which correspond to the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that decides whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantics that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantics that the message is dropped, which may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the ingress payload that are accepted by the VSR; the concrete set is determined by the implementation of the VSR. The result below is a set of index-message tuples corresponding to the accepted messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = Rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
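&lt;br /&gt;
The set-builder definition above translates into the following Python sketch. Since &amp;lt;code&amp;gt;vsr_check_ingress&amp;lt;/code&amp;gt; is implementation-defined, its result is simply passed in as a parameter; the message shape (a dict with a &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; field) is an assumption made for illustration.&lt;br /&gt;

```python
def rank(i, accepted):
    """Position of index i within the sorted set of accepted indices."""
    return sorted(accepted).index(i)

def vsr_ingress(ingress_payload, accepted):
    """Sketch of VSR(state, batch).ingress for a given accepted set;
    vsr_check_ingress itself is implementation-defined, so its result
    is supplied as the `accepted` parameter."""
    return {
        (m["dst"], rank(i, accepted)): m
        for i, m in ingress_payload.items()
        if i in accepted
    }

payload = {
    0: {"dst": "c1", "body": "a"},
    1: {"dst": "c2", "body": "b"},
    2: {"dst": "c1", "body": "c"},
}
out = vsr_ingress(payload, accepted={0, 2})
```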
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR that accepts all messages, while in reality the VSR may reject some messages, e.g., in case canisters migrate across subnets or subnets are split. While the possibility that messages can be REJECTed by the VSR would require specific action by the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition we define a couple of helper functions. First we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message))&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
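&lt;br /&gt;
The trivial implementation outlined in the comment above can be sketched in Python as follows. The message shape (a dict with a &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; field) and all names are assumptions made for illustration.&lt;br /&gt;

```python
def queue_index(slices_msgs):
    """Sketch of the trivial queue_index implementation: iterate per
    subnet over messages in stream-index order and append each message
    to the queue of its destination canister, assigning consecutive
    per-destination positions. This preserves the per-destination
    ordering required by the postcondition. slices_msgs maps
    (subnet, stream_index) to a message."""
    result = {}
    counters = {}  # destination canister id maps to next queue position
    for subnet in sorted({s for (s, _) in slices_msgs}):
        # messages of this subnet, ordered by their stream index
        per_subnet = sorted(
            ((idx, m) for (s, idx), m in slices_msgs.items() if s == subnet),
            key=lambda pair: pair[0],
        )
        for _, m in per_subnet:
            n = counters.get(m["dst"], 0)
            result[(m["dst"], n)] = m
            counters[m["dst"]] = n + 1
    return result

msgs = {
    ("subnet_a", 0): {"dst": "c1", "body": "first"},
    ("subnet_a", 1): {"dst": "c1", "body": "second"},
}
qs = queue_index(msgs)
```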
&lt;br /&gt;
Based on this, we can now define a function that maps the valid XNet messages from stream indexes to queue indexes.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (concatenate(subnet, index) ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j ∈ slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;, puts them into a subnet-to-subnet stream, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, subnet_assignment)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; is the canister-to-subnet mapping obtained from a view of the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A second helper produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
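&lt;br /&gt;
A Python sketch of &amp;lt;code&amp;gt;non_existent_canister_replies&amp;lt;/code&amp;gt;; for brevity, queues are flattened to a map from &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt; tuples to messages, and the canister ids are hypothetical.&lt;br /&gt;

```python
NON_EXISTENT_CANISTER = "NON_EXISTENT_CANISTER"

def non_existent_canister_replies(queues, subnet_assignment):
    """Sketch: for each queued message whose destination has no entry
    in the subnet assignment, synthesize a reject reply addressed back
    to the sender, reusing the per-pair position i."""
    return {
        (dst, src, i): NON_EXISTENT_CANISTER
        for (src, dst, i) in queues
        if dst not in subnet_assignment
    }

queues = {
    ("c1", "ghost", 0): "request to a missing canister",
    ("c1", "c2", 0): "request to an existing canister",
}
assignment = {"c1": "subnet_a", "c2": "subnet_a"}
replies = non_existent_canister_replies(queues, assignment)
```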
&lt;br /&gt;
&#039;&#039;Non-flat streams.&#039;&#039; As already mentioned, the specification leaves open whether one flat stream is produced per destination subnet, or whether each of the streams has multiple substreams; this can be decided by the implementation. To enable this, a &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; is defined to be a tuple of a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; and a natural number. If we have a flat stream, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, which effectively means that the implementation can use natural numbers as stream indexes, as one does not need to make the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; explicit in this case. In contrast, if we have per-destination (or per-source) substreams, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be a &amp;lt;code&amp;gt;CanisterId&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Formally, this means that the implementation must fix a mapping function that, based on a given prefix of a &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt; (i.e., a src-dst tuple), decides on the prefix of the &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, i.e., the &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;substream_id: (CanisterId × CanisterId) → SubstreamId&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for flat streams&lt;br /&gt;
substream_id((src, dst)) := ()&lt;br /&gt;
&lt;br /&gt;
% Definition of substream_id for per-destination canister substreams&lt;br /&gt;
substream_id((src, dst)) := dst&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
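&lt;br /&gt;
Both variants translate directly into Python; the unit type is modelled as the empty tuple, and the two function names are our own, used only to distinguish the variants.&lt;br /&gt;

```python
def substream_id_flat(src_dst):
    # Flat streams: SubstreamId is the unit type, modelled here as
    # the empty tuple
    return ()

def substream_id_per_dst(src_dst):
    # Per-destination-canister substreams: SubstreamId is the
    # destination CanisterId
    src, dst = src_dst
    return dst
```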
&lt;br /&gt;
&#039;&#039;Description of the actual state transition&#039;&#039;. The state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;post_process(S, subnet_assignment)&amp;lt;/code&amp;gt; is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Clear the output queues&lt;br /&gt;
  output_queues := clear(S.output_queues)&lt;br /&gt;
&lt;br /&gt;
  % Route the messages produced in the previous execution phase to the appropriate streams&lt;br /&gt;
  % taking into account ordering and capacity management constraints enforced by stream_index.&lt;br /&gt;
  streams.msgs  := {&lt;br /&gt;
    let msgs = S.streams.msgs&lt;br /&gt;
&lt;br /&gt;
    % Iterate over filtered messages preserving order of messages in queues.&lt;br /&gt;
    for each (q_index ↦ msg) ∈ filter(S.output_queues, subnet_assignment)&lt;br /&gt;
      msgs = push(msgs, { (concatenate(substream_id(prefix(q_index)), postfix(q_index)) ↦ msg) })&lt;br /&gt;
&lt;br /&gt;
    return msgs&lt;br /&gt;
  }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
  % Push NON_EXISTENT_CANISTER replies to input queues of the respective canisters&lt;br /&gt;
  input_queues := push(S.input_queues,&lt;br /&gt;
                       non_existent_canister_replies(S.output_queues, subnet_assignment))&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3401</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3401"/>
		<updated>2022-11-03T12:45:09Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain, and one can look at the execution of the batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain the replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams between subnets:&#039;&#039;&#039; Moving the streams induced on one subnet to their destination subnets.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication between canisters across subnets. This means that the layer formally does not add any guarantees beyond those the system already provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map of integers mapping to T, and returns a new queue consisting of the old queue with the given values appended. It also updates the next_index field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
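As an illustration, the abstract queue and its associated functions can be sketched in Python (a hypothetical model for exposition only; the names mirror the pseudocode above, not the actual implementation, and delete is simplified to take indices rather than a partial map of values):

```python
# Hypothetical Python model of the abstract Queue<T> above; for exposition
# only. delete is simplified to take a collection of indices rather than a
# partial map of values.
class Queue:
    def __init__(self):
        self.next_index = 1  # rolling index of the next message to insert
        self.elements = {}   # partial map: index -> element

    def push(self, values):
        # Append values in the order of their source keys, assigning
        # consecutive indices starting at next_index.
        for rank, key in enumerate(sorted(values), start=1):
            self.elements[self.next_index - 1 + rank] = values[key]
        self.next_index += len(values)

    def delete(self, indices):
        # Remove the given entries; next_index is kept, so indices are
        # never reused.
        for i in indices:
            del self.elements[i]

    def clear(self):
        # Drop all elements while keeping next_index.
        self.elements = {}
```

Note how, after a clear or delete, the rolling next_index still advances monotonically, which is what makes the indices usable for ordering guarantees.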
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a partial map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated with &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
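A minimal, self-contained Python sketch of the lifted push variant may make this lifting concrete (hypothetical model for exposition only; queues are modelled here as plain dicts with a rolling next_index):

```python
# Hypothetical, self-contained sketch of the f_map lifting for push: values
# keyed by (id, n) are grouped by id and appended to the matching queue,
# creating a fresh queue for ids that are not yet present. Queues with no
# new values are left untouched.
def new_queue():
    return {"next_index": 1, "elements": {}}

def push(queue, values):
    # Append values in source-key order at consecutive indices.
    for rank, key in enumerate(sorted(values), start=1):
        queue["elements"][queue["next_index"] - 1 + rank] = values[key]
    queue["next_index"] += len(values)
    return queue

def push_map(queues, values):
    # Group the incoming values by their identifier component.
    grouped = {}
    for (qid, n), v in values.items():
        grouped.setdefault(qid, {})[n] = v
    # Push into the existing queue, or into a fresh one if none exists.
    for qid, vals in grouped.items():
        queues[qid] = push(queues.get(qid, new_queue()), vals)
    return queues
```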
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of i except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
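These definitions can be sketched in Python, modelling an index as a tuple whose last component is a natural number (illustrative only; the function names are not taken from the implementation):

```python
# Hypothetical sketch of the Index semantics above. An index is a tuple
# whose last component is a natural number.
def prefix(i):
    # All elements except the last one.
    return i[:-1]

def postfix(i):
    # The last element; required to be a natural number.
    return i[-1]

def successor(i):
    # i + 1: same prefix, last component incremented.
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # i <= j holds only for comparable indices, i.e. those sharing a prefix.
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)
```

Note that leq is a partial order: indices with different prefixes are simply incomparable, so neither leq(i, j) nor leq(j, i) holds for them.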
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and the end of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: A &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings have been decoded into a format suitable for processing, and in which fields not required inside the deterministic state machine have been stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the subnet's own ID and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layers. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other criteria, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
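The round structure above can be sketched as the following Python orchestration. All component functions (state lookup, registry access, decode and the three phases) are passed in as stand-ins; the names are hypothetical, and only the ordering of the calls reflects the steps described on this page:

```python
# Hypothetical sketch of one deterministic processing round. The entries of
# deps are stand-ins for the state manager, registry and phase functions
# described on this page; only the call order is meaningful here.
def process_batch(own_subnet, batch, deps):
    # Obtain the replicated state of the right version w.r.t. the batch.
    state = deps["state_at"](batch["batch_number"] - 1)
    assignment = deps["subnet_assignment"](batch["registry_version"])
    decoded = deps["decode"](own_subnet, batch)
    # Induction, execution and XNet routing phases, in this order.
    state = deps["pre_process"](state, assignment, decoded)
    state = deps["execute"](state)
    state = deps["post_process"](state, assignment)
    # Commit the incrementally updated state to the state manager.
    deps["commit_and_certify"](state, batch["batch_number"])
    return state
```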
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the ingress payload that are accepted by the VSR; the set is determined by the concrete implementation of the VSR. The result &amp;lt;code&amp;gt;VSR(state, batch).ingress&amp;lt;/code&amp;gt; is then the set of index-message tuples corresponding to the accepted messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
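In Python, the re-indexing performed by this definition could look as follows (hypothetical sketch; the accepted set plays the role of vsr_check_ingress(state, batch), and messages are modelled as dicts with a dst field):

```python
# Hypothetical sketch: re-key the accepted ingress messages by destination
# and rank, mirroring the VSR(state, batch).ingress definition above.
# accepted stands in for the result of vsr_check_ingress(state, batch).
def vsr_ingress(ingress_payload, accepted):
    # rank(i, accepted): the position of i within the sorted accepted set.
    ranked = {i: rank for rank, i in enumerate(sorted(accepted), start=1)}
    # Keep only accepted messages, keyed by (destination, rank).
    return {
        (m["dst"], ranked[i]): m
        for i, m in ingress_payload.items()
        if i in ranked
    }
```

Because the rank preserves the order of the original batch indices, two accepted messages for the same destination keep their relative order after re-indexing.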
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the schedule_and_execute function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR which accepts all messages, while in reality the VSR may reject some messages in case canisters migrate across subnets or subnets are split. So while the possibility that messages can be REJECTed by the VSR would require specific action of the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition we define a couple of helper functions. First we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message))&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
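The trivial implementation described in the comment can be sketched in Python as follows (hypothetical; slices are modelled as a map from subnet to a map from stream index to message, and the iteration order stands in for the order messages appear in each slice):

```python
# Hypothetical sketch of the trivial queue_index implementation: iterate over
# the slices per subnet and, within each slice, over messages in stream
# order, appending each message to the queue of its destination canister.
# This satisfies the ENSURES property: for a fixed destination, messages
# from the same slice keep their relative stream order.
def queue_index(slices):
    next_per_dst = {}  # per-destination rolling counter
    out = {}
    for subnet in sorted(slices):
        for stream_idx in sorted(slices[subnet]):
            m = slices[subnet][stream_idx]
            n = next_per_dst.get(m["dst"], 1)
            out[(m["dst"], n)] = m
            next_per_dst[m["dst"]] = n + 1
    return out
```

Other implementations are possible, e.g. interleaving subnets round-robin for fairness; any of them is valid as long as the per-destination ordering property above is preserved.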
&lt;br /&gt;
Based on this we can now define a function that maps over the indexes of the valid XNet messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR no longer ACCEPTs all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (concatenate(subnet, index) ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i)&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       j ∈ slice.begin ∧&lt;br /&gt;
                                       i &amp;lt; j&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
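&lt;br /&gt;
As a hypothetical illustration of the signal garbage collection for a single subnet &amp;lt;code&amp;gt;sn1&amp;lt;/code&amp;gt; (flat substream id, ACCEPT-only signals):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S.streams[sn1].signals    = { (((), 2) ↦ ACCEPT), (((), 3) ↦ ACCEPT),&lt;br /&gt;
                              (((), 4) ↦ ACCEPT) }&lt;br /&gt;
b.xnet_payload[sn1].begin = { ((), 4) }&lt;br /&gt;
VSR(S, b).signals[sn1]    = { (((), 5) ↦ ACCEPT) }&lt;br /&gt;
&lt;br /&gt;
% The signals at indices ((), 2) and ((), 3) lie before the slice begin, i.e.,&lt;br /&gt;
% sn1 has already observed them and garbage collected the corresponding&lt;br /&gt;
% messages, so they are dropped; the signal produced in this round is added:&lt;br /&gt;
S&#039;.streams[sn1].signals   = { (((), 4) ↦ ACCEPT), (((), 5) ↦ ACCEPT) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;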
&lt;br /&gt;
&#039;&#039;&#039;Execution Phase&#039;&#039;&#039;. In the execution phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, schedules messages for execution by the hypervisor, and triggers the hypervisor to execute them, i.e., one computes &amp;lt;code&amp;gt;S&#039; = execute(S)&amp;lt;/code&amp;gt; where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the induction phase. From the perspective of message routing, the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;execute(S)&amp;lt;/code&amp;gt; looks as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Delete the consumed ingress messages from the respective ingress queues&lt;br /&gt;
  ingress_queues    := delete(S.ingress_queues, schedule_and_execute(S).consumed_ingress_messages)&lt;br /&gt;
&lt;br /&gt;
  % Delete the consumed canister to canister messages from the respective input queues&lt;br /&gt;
  input_queues      := delete(S.input_queues, schedule_and_execute(S).consumed_xnet_messages)&lt;br /&gt;
&lt;br /&gt;
  % Append the produced messages to the respective output queues&lt;br /&gt;
  output_queues     := push(S.output_queues, schedule_and_execute(S).produced_messages)&lt;br /&gt;
&lt;br /&gt;
  % Execution specific state is transformed by the execution environment; the precise transition&lt;br /&gt;
  % function is out of scope here.&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;XNet Message Routing Phase&#039;&#039;&#039;. In the XNet message routing phase, one takes all the messages from the canister-to-canister output queues and, according to the subnet_assignment, puts them into a subnet-to-subnet stream, i.e., one computes &amp;lt;code&amp;gt;S&#039; = post_process(S, registry)&amp;lt;/code&amp;gt;, where &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; is the state after the execution phase and &amp;lt;code&amp;gt;registry&amp;lt;/code&amp;gt; represents a view of the registry.&lt;br /&gt;
&lt;br /&gt;
Before we define the state transition, we define a helper function to appropriately handle messages targeted at canisters that do not exist according to the given subnet assignment.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Remove all messages from output queues targeted at non-existent canisters according&lt;br /&gt;
% to the subnet assignment.&lt;br /&gt;
filter : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;)&lt;br /&gt;
filter(queues, subnet_assignment) :=&lt;br /&gt;
    delete(queues, { (q_index ↦ msg) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                       q_index = (·, dst, ·) ∧&lt;br /&gt;
                                       dst ∉ dom(subnet_assignment)&lt;br /&gt;
                   }&lt;br /&gt;
          )&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
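&lt;br /&gt;
As a small hypothetical example with canisters a, b, c, where only a and b are assigned to a subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;queues.elements   = { ((a, b, 1) ↦ m1), ((a, c, 1) ↦ m2) }&lt;br /&gt;
subnet_assignment = { (a ↦ sn1), (b ↦ sn2) }&lt;br /&gt;
&lt;br /&gt;
% m2 targets the non-existent canister c and is therefore removed:&lt;br /&gt;
filter(queues, subnet_assignment).elements = { ((a, b, 1) ↦ m1) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;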
&lt;br /&gt;
A second helper produces &amp;lt;code&amp;gt;NON_EXISTENT_CANISTER&amp;lt;/code&amp;gt; replies telling the sending canister that the destination canister does not exist.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% Produce NON_EXISTENT_CANISTER messages to be pushed to input queues &lt;br /&gt;
% of the senders of messages where the destination does not exist&lt;br /&gt;
non_existent_canister_replies : ((CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;) × (CanisterId ↦ SubnetId) → (QueueIndex ↦ Message)&lt;br /&gt;
non_existent_canister_replies(queues, subnet_assignment) :=&lt;br /&gt;
  { ((dst, src, i) ↦ NON_EXISTENT_CANISTER) | (q_index ↦ msg) ∈ queues.elements ∧&lt;br /&gt;
                                              q_index = (src, dst, i) ∧&lt;br /&gt;
                                              dst ∉ dom(subnet_assignment)&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3400</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3400"/>
		<updated>2022-11-03T12:40:35Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the message streams produced on a subnet available to other subnets, which consume them as certified stream slices.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees on top of those the system already provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping subnet_assignment. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets           : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
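&lt;br /&gt;
For illustration, a registry view for a deployment with two hypothetical subnets and three canisters could look as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets           : { sn1, sn2 },&lt;br /&gt;
    subnet_assignment : { (c1 ↦ sn1), (c2 ↦ sn1), (c3 ↦ sn2) }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;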
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to T, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queues keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queues keeping the next_index&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
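&lt;br /&gt;
The following hypothetical trace illustrates how the three functions interact; in particular, note that &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; only ever grows, even when elements are deleted:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;q0 = new_queue                       % elements = ∅,                    next_index = 1&lt;br /&gt;
q1 = push(q0, { (7 ↦ a), (9 ↦ b) })  % elements = { (1 ↦ a), (2 ↦ b) },  next_index = 3&lt;br /&gt;
q2 = delete(q1, { (1 ↦ a) })         % elements = { (2 ↦ b) },           next_index = 3&lt;br /&gt;
q3 = clear(q2)                       % elements = ∅,                    next_index = 3&amp;lt;/nowiki&amp;gt;&lt;br /&gt;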
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated with &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
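&lt;br /&gt;
As a hypothetical example of the map variant of &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt;: the existing queue for id &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; is extended, while a fresh queue is created for the previously unknown id &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;q = { (a ↦ Queue { next_index: 2, elements: { (1 ↦ x) } }) }&lt;br /&gt;
v = { ((a, 5) ↦ y), ((b, 1) ↦ z) }&lt;br /&gt;
&lt;br /&gt;
push(q, v) = { (a ↦ Queue { next_index: 3, elements: { (1 ↦ x), (2 ↦ y) } }),&lt;br /&gt;
               (b ↦ Queue { next_index: 2, elements: { (1 ↦ z) } }) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;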
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an &amp;lt;code&amp;gt;Index i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1 … |i|-1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an &amp;lt;code&amp;gt;Index i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
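&lt;br /&gt;
For example, with hypothetical subnet IDs &amp;lt;code&amp;gt;sn1&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;sn2&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;i = (sn1, 4)&lt;br /&gt;
&lt;br /&gt;
prefix(i)  = (sn1)&lt;br /&gt;
postfix(i) = 4&lt;br /&gt;
i + 1      = (sn1, 5)&lt;br /&gt;
&lt;br /&gt;
(sn1, 3) ≤ (sn1, 4)    % same prefix and 3 ≤ 4&lt;br /&gt;
(sn1, 3), (sn2, 7)     % incomparable: different prefixes&amp;lt;/nowiki&amp;gt;&lt;br /&gt;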
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet it is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of signals &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and the end of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing, but instead one relies on the encoding and decoding routines provided by the state manager: A &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState,&lt;br /&gt;
    witness   : Witness,&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific things are decoded into the format suitable for processing and some things which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the subnet&#039;s own ID and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
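&lt;br /&gt;
The decoding step above can be sketched in Python. This is purely illustrative: the dictionary shapes and the stand-in for &amp;lt;code&amp;gt;StateManager.decode_valid_certified_stream&amp;lt;/code&amp;gt; are assumptions mirroring the notation of this section, not the replica implementation.&lt;br /&gt;

```python
# Illustrative sketch only; data shapes mirror the notation above.

def decode_valid_certified_stream(own_subnet, cert_slice):
    # Stand-in for StateManager.decode_valid_certified_stream: in the real
    # system this validates the certification; here we simply drop it.
    return {"msgs": cert_slice["msgs"], "signals": cert_slice["signals"]}

def decode(own_subnet, batch):
    """Turn a Batch into a DecodedBatch: keep the ingress payload as-is
    and decode every CertifiedStreamSlice into a StreamSlice."""
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src: decode_valid_certified_stream(own_subnet, cert_slice)
            for src, cert_slice in batch["xnet_payload"].items()
        },
    }
```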
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
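&lt;br /&gt;
The coordination steps above can be sketched as a small Python driver. The state manager methods, registry accessor, and phase functions follow the notation of this page only; they are assumptions rather than the concrete replica API, so they are passed in as parameters here.&lt;br /&gt;

```python
# Illustrative sketch of the per-batch round driven by message routing.
# All names mirror the notation of this page, not the actual replica API.

def process_batch(state_manager, registry, own_subnet, batch,
                  decode, pre_process, execute, post_process):
    # 1. Obtain the replicated state of the right version w.r.t. the batch.
    state = state_manager.get_state_at(batch.batch_number - 1)
    assignment = registry.get_registry_at(batch.registry_version).subnet_assignment
    # 2. Run the three phases of the deterministic state machine on the
    #    decoded batch.
    decoded = decode(own_subnet, batch)
    state = pre_process(state, assignment, decoded)
    state = execute(state)
    state = post_process(state, assignment)
    # 3. Commit the updated state via the state manager.
    state_manager.commit_and_certify(state, batch.batch_number)
    return state
```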
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or dropping them.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions, corresponding to the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch that are accepted by the VSR. The set is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = Rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
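&lt;br /&gt;
A Python rendering of this re-indexing may help: only accepted indices survive, and each surviving message is re-keyed by its destination together with its rank among all accepted indices. The dictionary shapes and the pluggable &amp;lt;code&amp;gt;vsr_check_ingress&amp;lt;/code&amp;gt; are assumptions for illustration.&lt;br /&gt;

```python
# Sketch of VSR(state, batch).ingress as defined above; illustrative only.

def rank(i, index_set):
    # 1-based position of i within the sorted set of accepted indices.
    return sorted(index_set).index(i) + 1

def vsr_ingress(state, batch, vsr_check_ingress):
    accepted = vsr_check_ingress(state, batch)
    return {
        (m["dst"], rank(i, accepted)): m
        for i, m in batch["ingress_payload"].items()
        if i in accepted
    }
```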
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR that accepts all messages, while in reality the VSR may reject some messages, e.g., when canisters migrate across subnets or subnets are split. Messages REJECTed by the VSR would require specific action by the message routing layer; we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition, we define a couple of helper functions. First, we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message))&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
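&lt;br /&gt;
The trivial implementation mentioned in the comment can be sketched in Python as follows. Message shapes are assumptions; the only property that matters is the per-stream order preservation stated in the ENSURES clause above.&lt;br /&gt;

```python
# Sketch of the "trivial implementation" of queue_index: iterate over the
# slices per subnet and, within each slice, over the messages in stream
# order, appending each message to the queue of its destination canister.

def queue_index(slices):
    # slices : (subnet_id, stream_index) -> message
    out = {}          # (dst_canister, queue_position) -> message
    next_pos = {}     # dst_canister -> next free queue position
    # Sorting by (subnet, stream index) preserves per-stream message order,
    # which is all the ENSURES clause above demands.
    for (subnet, stream_idx) in sorted(slices):
        msg = slices[(subnet, stream_idx)]
        pos = next_pos.get(msg["dst"], 1)
        out[(msg["dst"], pos)] = msg
        next_pos[msg["dst"]] = pos + 1
    return out
```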
&lt;br /&gt;
Based on this we can now define a function that maps over the indexes of the valid XNet messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ Slice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet&lt;br /&gt;
                })&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
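&lt;br /&gt;
In Python, the validity filter can be sketched like this, with a trivial &amp;lt;code&amp;gt;queue_index&amp;lt;/code&amp;gt; stand-in inlined so the example is self-contained; all data shapes are assumptions.&lt;br /&gt;

```python
# Sketch of map_valid_xnet_messages: drop messages whose claimed source
# canister is not assigned to the subnet the slice arrived from, then
# assign queue indices to the survivors. Illustrative only.

def map_valid_xnet_messages(slices, subnet_assignment):
    valid = {
        (subnet, index): m
        for subnet, slice_ in slices.items()
        for index, m in slice_["msgs"].items()
        if subnet_assignment.get(m["src"]) == subnet
    }
    # Trivial queue_index stand-in: enqueue in (subnet, stream index) order.
    out, next_pos = {}, {}
    for key in sorted(valid):
        m = valid[key]
        pos = next_pos.get(m["dst"], 1)
        out[(m["dst"], pos)] = m
        next_pos[m["dst"]] = pos + 1
    return out
```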
&lt;br /&gt;
&lt;br /&gt;
Finally, we can define the state &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt; resulting from computing &amp;lt;code&amp;gt;pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;S with&lt;br /&gt;
  % Append the ingress messages accepted by the VSR to the appropriate ingress_queue&lt;br /&gt;
  ingress_queues            := push(S.ingress_queues, VSR(S, b).ingress)&lt;br /&gt;
&lt;br /&gt;
  % Append the canister to canister messages accepted by the VSR to the appropriate&lt;br /&gt;
  % input queue.&lt;br /&gt;
  input_queues              := push(S.input_queues,&lt;br /&gt;
                                    map_valid_xnet_messages(VSR(S, b).xnet, subnet_assignment)&lt;br /&gt;
                                   )&lt;br /&gt;
&lt;br /&gt;
  % Garbage collect the messages which have been accepted by the target subnet.&lt;br /&gt;
  % (As soon as the VSR does no longer ACCEPT all messages, one would have&lt;br /&gt;
  %  to make sure that rejected messages are appropriately re-enqueued in&lt;br /&gt;
  %  the streams)&lt;br /&gt;
  streams.msgs              := delete(S.streams.msgs,&lt;br /&gt;
                                 { (index ↦ msg) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ ·) ∈ slice.signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                       (index ↦ msg) ∈ S.streams.msgs&lt;br /&gt;
                                 }&lt;br /&gt;
                               )&lt;br /&gt;
&lt;br /&gt;
  % Add the signals reflecting the decisions made by the VSR in the current round and&lt;br /&gt;
  % garbage collect the signals which have already been processed on the other subnet&lt;br /&gt;
  % (one knows that a signal has been processed when the message is no longer included&lt;br /&gt;
  % in a given slice).&lt;br /&gt;
  streams.signals           := S.streams.signals&lt;br /&gt;
                               ∪ VSR(S, b).signals&lt;br /&gt;
                               \ { (index ↦ signal) |&lt;br /&gt;
                                       (subnet ↦ slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                                       (i ↦ signal) ∈ S.streams[subnet].signals ∧&lt;br /&gt;
                                       index = concatenate(subnet, i) ∧&lt;br /&gt;
                                        i &amp;lt; slice.begin&lt;br /&gt;
                                 }&lt;br /&gt;
&lt;br /&gt;
  % Update the expected XNet indexes so that the block maker can compute which messages&lt;br /&gt;
  % to include in a block referencing this state.&lt;br /&gt;
  expected_xnet_indices     := { index     | index ∈ S.expected_xnet_indices ∧&lt;br /&gt;
                                             ∄ (i ↦ ·) ∈ b.xnet_payload.msgs.elements :&lt;br /&gt;
                                             └─ prefix(index) = prefix(i)&lt;br /&gt;
                               } ∪&lt;br /&gt;
                               { index + 1 | index ∈ max(dom(b.xnet_payload.msgs.elements)) }&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3399</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3399"/>
		<updated>2022-11-03T12:37:42Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC out of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Moving the messages placed into the subnet-to-subnet streams to the subnets where their destination canisters live.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees beyond the ones the system already provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets           : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended in the order of their keys. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
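&lt;br /&gt;
The three queue operations can be sketched in Python as a mutable rendering of the functional definitions above; illustrative only.&lt;br /&gt;

```python
# Sketch of the abstract Queue<T>: push appends values in the order of
# their source keys and advances next_index; delete and clear leave
# next_index untouched.

class Queue:
    def __init__(self):
        self.next_index = 1
        self.elements = {}   # index -> element

    def push(self, values):
        # values: arbitrary-keyed map; rank(j) is j's position among the
        # sorted keys, so insertion order follows the key order.
        for k, j in enumerate(sorted(values), start=1):
            self.elements[self.next_index - 1 + k] = values[j]
        self.next_index += len(values)

    def delete(self, values):
        # REQUIRES: values ⊆ self.elements (as index -> element pairs).
        for i in values:
            del self.elements[i]

    def clear(self):
        self.elements = {}
```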
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a queue of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantic for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
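&lt;br /&gt;
For &amp;lt;code&amp;gt;f = push&amp;lt;/code&amp;gt;, the lifted &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; can be sketched in Python, with a queue represented functionally as a &amp;lt;code&amp;gt;(next_index, elements)&amp;lt;/code&amp;gt; pair; this representation is an assumption made for the sake of a self-contained example.&lt;br /&gt;

```python
# Sketch of the map-lifted push: values keyed by (identifier, n) are
# grouped per identifier; existing queues get the values appended, new
# identifiers get a fresh queue, and untouched queues are kept as-is.

NEW_QUEUE = (1, {})  # (next_index, elements)

def push(queue, values):
    next_index, elements = queue
    new_elements = dict(elements)
    for k, j in enumerate(sorted(values), start=1):
        new_elements[next_index - 1 + k] = values[j]
    return (next_index + len(values), new_elements)

def push_map(queues, values):
    grouped = {}
    for (ident, n), t in values.items():
        grouped.setdefault(ident, {})[n] = t
    out = dict(queues)  # queues receiving no values are kept unchanged
    for ident, vals in grouped.items():
        out[ident] = push(out.get(ident, NEW_QUEUE), vals)
    return out
```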
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantic:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an &amp;lt;code&amp;gt;Index i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an &amp;lt;code&amp;gt;Index i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
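&lt;br /&gt;
These index operations translate directly to Python tuples whose last component is the natural-number sequence number (illustrative sketch):&lt;br /&gt;

```python
# Sketch of the Index operations defined above.

def prefix(i):
    return i[:-1]

def postfix(i):
    return i[-1]

def succ(i):
    # i + 1: same prefix, sequence number incremented.
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # Indices with different prefixes are incomparable, so leq is False.
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)
```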
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
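To illustrate this notation, the following Python sketch models flat streams (with &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; being the unit type, modeled as the empty tuple) and the flattened field access on the whole &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; map; all names and the message representation are hypothetical.&lt;br /&gt;

```python
# Toy model of Stream / Streams, assuming flat streams (SubstreamId = ()).
# A StreamIndex is modeled as ((), seq_nr); names are illustrative only.

def new_stream():
    # signals : StreamIndex -> {"ACCEPT", "REJECT"}
    # msgs    : SubstreamId -> (queue modeled as seq_nr -> message)
    return {"signals": {}, "msgs": {(): {}}}

# Streams : SubnetId -> Stream
streams = {"subnet_b": new_stream()}
streams["subnet_b"]["msgs"][()][0] = {"src": "canister_a", "dst": "canister_b"}
streams["subnet_b"]["signals"][((), 0)] = "ACCEPT"

# "Abuse of notation": accessing a field over the whole Streams map yields
# Streams.signals : SubnetId -> (StreamIndex -> {ACCEPT, REJECT}).
signals = {subnet: s["signals"] for subnet, s in streams.items()}
```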
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the slice within the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that its authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific encodings are decoded into a format suitable for processing, and fields that are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently, this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on its own subnet ID and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
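A minimal executable sketch of this decoding step, with the state manager's decoding routine passed in as a function parameter (here a stub; the real routine verifies the certification and may fail):&lt;br /&gt;

```python
# Sketch of the batch decoding step. `decode_valid_certified_stream` stands
# in for the state manager routine of the same name; error handling for
# invalid slices is omitted in this model.

def decode(own_subnet, batch, decode_valid_certified_stream):
    """Turn a Batch into a DecodedBatch: the ingress payload is passed
    through unchanged, each CertifiedStreamSlice is decoded into a
    StreamSlice, and all other batch fields are stripped."""
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src_subnet: decode_valid_certified_stream(own_subnet, cert_slice)
            for src_subnet, cert_slice in batch["xnet_payload"].items()
        },
    }
```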
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layers. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other conditions, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or discarding them.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions corresponding to the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that decides whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; means that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; means that the message is dropped, which may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the batch&#039;s ingress payload that are accepted by the VSR. The set is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = Rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
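The re-indexing above can be illustrated with a small Python sketch, where &amp;lt;code&amp;gt;accepted&amp;lt;/code&amp;gt; stands in for the implementation-defined result of &amp;lt;code&amp;gt;vsr_check_ingress&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;rank&amp;lt;/code&amp;gt; is the 1-based position within the accepted index set:&lt;br /&gt;

```python
# Illustrative model of the VSR re-indexing of accepted ingress messages.
# `accepted` stands in for vsr_check_ingress(state, batch); the message
# representation (dicts with a "dst" field) is hypothetical.

def rank(i, index_set):
    """1-based position of i within the sorted index set."""
    return sorted(index_set).index(i) + 1

def vsr_ingress(ingress_payload, accepted):
    """Re-key each accepted message by (destination, rank-within-accepted),
    yielding an IngressIndex -> Message map as in the definition above."""
    return {
        (m["dst"], rank(i, accepted)): m
        for i, m in ingress_payload.items()
        if i in accepted
    }
```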
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can view the scheduler and the hypervisor together as one component. We model the functionality of the scheduler and the hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR which accepts all messages, while in reality the VSR may reject some messages, e.g., in case canisters migrate across subnets or subnets are split. While the possibility that messages can be REJECTed by the VSR would require specific action by the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition, we define a couple of helper functions. First, we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index : ((SubnetId × StreamIndex) ↦ Message) → ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
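The trivial implementation described in the comments above can be sketched in Python as follows; the representation of messages and indices is hypothetical, and no priority or fairness handling is attempted:&lt;br /&gt;

```python
# Sketch of the "trivial implementation" of queue_index: walk the messages
# per subnet in stream order and append each to the queue of its destination
# canister, which preserves per-destination stream order (the ENSURES
# clause above). Message representation (dicts with "dst") is hypothetical.
from collections import defaultdict

def queue_index(slice_messages):
    """slice_messages : (subnet, stream_index) -> message.
    Returns (dst_canister, n) -> message."""
    per_dst_next = defaultdict(int)  # next queue position per destination
    out = {}
    # Sorting by key groups messages per subnet and orders them by stream
    # index, i.e., the order in which they appear in the incoming slices.
    for (subnet, stream_idx), m in sorted(slice_messages.items(),
                                          key=lambda kv: kv[0]):
        dst = m["dst"]
        out[(dst, per_dst_next[dst])] = m
        per_dst_next[dst] += 1
    return out
```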
&lt;br /&gt;
Based on this, we can now define a function that maps over the indices of the valid XNet messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;map_valid_xnet_messages : (SubnetId ↦ StreamSlice) ×&lt;br /&gt;
                          (CanisterId ↦ SubnetId) →&lt;br /&gt;
                          ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
map_valid_xnet_messages(slices, subnet_assignment) :=&lt;br /&gt;
    queue_index({ ((subnet, index) ↦ m) | (subnet ↦ slice) ∈ slices ∧&lt;br /&gt;
                                          (index ↦ m) ∈ slice.msgs ∧&lt;br /&gt;
                                          subnet_assignment[m.src] = subnet ∧&lt;br /&gt;
&lt;br /&gt;
               })&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3398</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3398"/>
		<updated>2022-11-03T12:37:20Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain, and one can view the execution by the upper layers of the batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the streams produced on a subnet available to their respective destination subnets, where they are consumed.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier-to-digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees on top of those the system already provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that, we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from natural numbers to values of type T, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
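The three associated functions above can be rendered as follows; this is a hypothetical Python sketch of the abstract semantics, not the concrete implementation.&lt;br /&gt;

```python
# Hypothetical sketch of the abstract queue type; mirrors the spec above,
# not any concrete implementation.
class Queue:
    def __init__(self):
        self.next_index = 1  # rolling index of the next message to be inserted
        self.elements = {}   # partial map: index -> element

    def push(self, values):
        # Append values in ascending key order (the rank of each key),
        # then advance next_index past the last inserted message.
        for k, key in enumerate(sorted(values), start=1):
            self.elements[self.next_index - 1 + k] = values[key]
        self.next_index += len(values)

    def delete(self, values):
        # REQUIRE: values is a subset of self.elements; next_index is kept.
        for i in values:
            del self.elements[i]

    def clear(self):
        # Drop all elements but keep the rolling next_index.
        self.elements = {}
```

For example, pushing two values into a fresh queue places them at indices 1 and 2 and advances next_index to 3.&lt;br /&gt;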
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a partial map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
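As an illustration, the lifted variant of push could be sketched as follows, assuming queues are represented as plain dictionaries; the names are illustrative only.&lt;br /&gt;

```python
# Hypothetical sketch of the lifted (map) variant of push: queues without
# new values stay untouched, and fresh queues are created for identifiers
# that do not yet occur in q.
def new_queue():
    return {"next_index": 1, "elements": {}}

def push(q, values):
    for k, j in enumerate(sorted(values), start=1):
        q["elements"][q["next_index"] - 1 + k] = values[j]
    q["next_index"] += len(values)
    return q

def push_map(queues, values):
    # values: (identifier, j) -> element; group per identifier first.
    per_id = {}
    for (ident, j), t in values.items():
        per_id.setdefault(ident, {})[j] = t
    for ident, vals in per_id.items():
        push(queues.setdefault(ident, new_queue()), vals)
    return queues
```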
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element of the sequence up to the last one may have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
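The index operations above can be sketched as follows, representing an Index as a Python tuple whose last component is a natural number.&lt;br /&gt;

```python
# Hypothetical sketch of the Index operations; tuples stand in for the
# abstract sequences of the spec.
def prefix(i):
    return i[:-1]        # all elements except the last

def postfix(i):
    return i[-1]         # the trailing natural number

def increment(i):        # the spec's "i + 1"
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # Indices with different prefixes are incomparable.
    return prefix(i) == prefix(j) and postfix(j) >= postfix(i)
```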
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized into multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation; in the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet a stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific encodings are decoded into a format suitable for processing, and fields which are not required inside the deterministic state machine are stripped.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the subnet&#039;s own id and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
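The decoding function could be sketched as follows; the state manager&#039;s decoding routine is abstracted as a callable parameter here.&lt;br /&gt;

```python
# Hypothetical sketch of decode: only the certified stream slices need
# decoding; the ingress payload is passed through unchanged.
def decode(own_subnet, batch, decode_valid_certified_stream):
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src_subnet: decode_valid_certified_stream(own_subnet, cert_slice)
            for src_subnet, cert_slice in batch["xnet_payload"].items()
        },
    }
```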
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message m in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
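Put together, the per-batch steps above amount to the following driver loop; the state manager accessor name &amp;lt;code&amp;gt;get_state_at&amp;lt;/code&amp;gt; is an assumption for illustration, while the phase functions mirror the names used in this document.&lt;br /&gt;

```python
# Hypothetical sketch of one deterministic processing round. The components
# (state manager, registry, state machine) are abstracted as objects passed
# in; get_state_at is an assumed accessor name.
def process_batch(state_manager, registry, machine, own_subnet, batch):
    # State of version batch_number - 1 is the input to this round.
    state = state_manager.get_state_at(batch["batch_number"] - 1)
    assignment = registry.get_registry_at(
        batch["registry_version"])["subnet_assignment"]
    decoded = machine.decode(own_subnet, batch)
    state = machine.pre_process(state, assignment, decoded)  # induction
    state = machine.execute(state)                           # execution
    state = machine.post_process(state, assignment)          # XNet routing
    state_manager.commit_and_certify(batch["batch_number"], state)
```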
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the batch&#039;s &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; that are accepted by the VSR; the set is determined by the concrete implementation of the VSR. The result is a partial map of index-message tuples corresponding to the accepted messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
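A possible rendering of this selection in Python, with vsr_check_ingress abstracted as a callable:&lt;br /&gt;

```python
# Hypothetical sketch of the VSR operating on the ingress payload: accepted
# messages are re-keyed by their destination canister and their rank among
# the accepted indices.
def vsr_ingress(state, batch, vsr_check_ingress):
    accepted = sorted(vsr_check_ingress(state, batch))
    rank = {i: k for k, i in enumerate(accepted, start=1)}
    return {
        (m["dst"], rank[i]): m
        for i, m in batch["ingress_payload"].items()
        if i in rank
    }
```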
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor: it takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;br /&gt;
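Given these three fields, the bookkeeping performed by the execution phase can be sketched as follows, with queues flattened to maps keyed by their full index; this representation is an illustrative assumption, not the concrete state layout.&lt;br /&gt;

```python
# Hypothetical sketch: apply the change set computed by schedule_and_execute
# to flattened queue maps (full index -> message).
def execute(state, schedule_and_execute):
    ingress_consumed, xnet_consumed, produced = schedule_and_execute(state)
    for idx in ingress_consumed:       # drop consumed ingress messages
        del state["ingress_queues"][idx]
    for idx in xnet_consumed:          # drop consumed input-queue messages
        del state["input_queues"][idx]
    for idx in sorted(produced):       # append produced messages in order
        state["output_queues"][idx] = produced[idx]
    return state
```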
&lt;br /&gt;
==== Description of the State Transitions ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Induction Phase&#039;&#039;&#039;. In the induction phase, one starts off with a &amp;lt;code&amp;gt;CanonicalState S&amp;lt;/code&amp;gt;, some &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; and a &amp;lt;code&amp;gt;DecodedBatch b&amp;lt;/code&amp;gt; and applies &amp;lt;code&amp;gt;b&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; relative to &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; to obtain &amp;lt;code&amp;gt;S&#039;&amp;lt;/code&amp;gt;, i.e., one computes &amp;lt;code&amp;gt;S&#039; = pre_process(S, subnet_assignment, b)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We describe things here w.r.t. a version of the VSR which accepts all messages, while in reality the VSR may reject some messages in case canisters migrate across subnets or subnets are split. While messages being REJECTed by the VSR would require specific action by the message routing layer, we omit those actions here for simplicity, as they are not crucial to understanding the basic functionality of message routing.&lt;br /&gt;
&lt;br /&gt;
Before we define the actual state transition we define a couple of helper functions. First we define a function that determines the order of the messages in the queues based on the order of the messages in the incoming stream slices.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRES: ∄ (s1 ↦ m1), (s2 ↦ m2) ∈ S :&lt;br /&gt;
%           └─ m1 = m2 ∧ s1 ≠ s2&lt;br /&gt;
%&lt;br /&gt;
% ENSURES: ∀ S satisfying the precondition above,&lt;br /&gt;
%          └─ ∀ (q1 ↦ m1), (q2 ↦ m2) ∈ queue_index(S) :&lt;br /&gt;
%             ├─ ∃ s1, s2 :&lt;br /&gt;
%             │  └─ (s1 ↦ m1) ∈ S ∧ (s2 ↦ m2) ∈ S ∧&lt;br /&gt;
%             └─ (m1.dst = m2.dst ∧ s1 ≤ s2) ==&amp;gt; q1 ≤ q2&lt;br /&gt;
%&lt;br /&gt;
queue_index: ((SubnetId × StreamIndex) ↦  Message) → ((CanisterId × ℕ) ↦ Message)&lt;br /&gt;
queue_index(S) := {&lt;br /&gt;
  % We do not provide a concrete implementation of this function as there are&lt;br /&gt;
  % multiple possible implementations and the choice for one also depends on&lt;br /&gt;
  % how priorities/fairness etc. are handled.&lt;br /&gt;
  %&lt;br /&gt;
  % A trivial implementation is to iterate over the given stream slices S per&lt;br /&gt;
  % subnet and for each individual slice iterate over all the messages in the&lt;br /&gt;
  % order they appear in the slice and push each message m on the right queue,&lt;br /&gt;
  % i.e., the one belonging to the destination canister. This is also the way&lt;br /&gt;
  % things are currently implemented.&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3397</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3397"/>
		<updated>2022-11-03T12:29:58Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the streams produced on this subnet available to their destination subnets.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication between canisters across subnets. This means that this layer formally does not add any guarantees to those the system already provides, but simply needs to make sure that system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map; e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that, we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
&lt;br /&gt;
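The abstract queue and the map-lifted &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; above can be sketched as follows (an illustrative Python model of the definitions above, not the actual implementation):&lt;br /&gt;

```python
# Illustrative model of Queue<T> and the map-lifted push (f_map) from the text.
# Incoming keys are used only for their rank; elements are stored under fresh
# consecutive indices starting at next_index, which never rolls back.

class Queue:
    def __init__(self):
        self.next_index = 1  # rolling index of the next message to be inserted
        self.elements = {}   # index -> element, the messages currently queued

    def push(self, values):
        # append values (index -> element) in key order under fresh indices
        for k, key in enumerate(sorted(values)):
            self.elements[self.next_index + k] = values[key]
        self.next_index += len(values)

    def delete(self, values):
        # remove the given index -> element entries, keeping next_index
        for key in values:
            del self.elements[key]

    def clear(self):
        # drop all elements but keep the rolling next_index
        self.elements = {}

def push_map(queues, values):
    # Lift push to maps of queues: queues is id -> Queue, values is
    # (id, n) -> element. Ids without a queue get a fresh one; queues
    # without new values are left untouched.
    grouped = {}
    for (qid, n), t in values.items():
        grouped.setdefault(qid, {})[n] = t
    for qid, vals in grouped.items():
        queues.setdefault(qid, Queue()).push(vals)
    return queues
```

For example, pushing two elements keyed 7 and 9 into a fresh queue stores them under indices 1 and 2 and advances &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; to 3; clearing afterwards keeps &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; at 3.&lt;br /&gt;
&lt;br /&gt;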
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be a sequence of arbitrary length, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
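A minimal sketch of these index operations (illustrative Python; the &amp;lt;code&amp;gt;successor&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;leq&amp;lt;/code&amp;gt; names are ours):&lt;br /&gt;

```python
# Illustrative model of the Index operations defined above: an index is a
# tuple whose last component is a natural number; indices with different
# prefixes are incomparable.

def prefix(i):
    return i[:-1]

def postfix(i):
    return i[-1]

def successor(i):
    # i + 1: increment only the trailing sequence number
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # i <= j requires equal prefixes; returns None for incomparable indices
    if prefix(i) != prefix(j):
        return None
    return postfix(i) <= postfix(j)
```

&lt;br /&gt;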
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
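The stream types above can be sketched as follows (illustrative Python; &amp;lt;code&amp;gt;Queue&amp;lt;Message&amp;gt;&amp;lt;/code&amp;gt; is stood in by a plain map from stream index to message):&lt;br /&gt;

```python
# Illustrative model of Stream / Streams with flat streams (SubstreamId is the
# unit type, modeled here as the empty tuple ()).
from dataclasses import dataclass, field

ACCEPT, REJECT = "ACCEPT", "REJECT"

@dataclass
class Stream:
    signals: dict = field(default_factory=dict)  # StreamIndex -> ACCEPT/REJECT
    msgs: dict = field(default_factory=dict)     # SubstreamId -> (n -> Message)

# Streams: destination SubnetId -> Stream
streams = {"subnet_b": Stream()}
streams["subnet_b"].msgs[()] = {1: "request from A to B"}
streams["subnet_b"].signals[((), 1)] = ACCEPT
```

&lt;br /&gt;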
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead, one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific encodings are decoded into a format suitable for processing, and fields which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently, this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the own subnet id and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
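The decoding step above can be sketched as follows (illustrative Python; the state manager routine is stubbed out and assumed to have already verified the certification):&lt;br /&gt;

```python
# Illustrative model of the decode function above. The state manager's
# decode_valid_certified_stream is stubbed; in the real system it checks the
# certification in the CertifiedStreamSlice before yielding a StreamSlice.

def decode_valid_certified_stream(own_subnet, cert_slice):
    # stub: assume the certification is valid and expose the payload
    return cert_slice["payload"]

def decode(own_subnet, batch):
    # keep the ingress payload as-is; decode every certified slice
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src: decode_valid_certified_stream(own_subnet, cert_slice)
            for src, cert_slice in batch["xnet_payload"].items()
        },
    }
```

&lt;br /&gt;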
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other conditions, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
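The three phases above can be chained as follows (illustrative Python; the phase implementations are passed in as stand-ins, since the deterministic state machine exposes no external API):&lt;br /&gt;

```python
# Illustrative chaining of the three state transformations named above; each
# phase is modeled as a pure function on the (canonical) state.

def process_round(state, subnet_assignment, decoded_batch,
                  pre_process, execute, post_process):
    state = pre_process(state, subnet_assignment, decoded_batch)  # induction
    state = execute(state)                                        # execution
    return post_process(state, subnet_assignment)                 # XNet routing
```

&lt;br /&gt;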
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the ingress payload that are accepted by the VSR. The set is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = Rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
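The VSR rule above can be sketched as follows (illustrative Python; &amp;lt;code&amp;gt;vsr_check_ingress&amp;lt;/code&amp;gt; is abstracted into the &amp;lt;code&amp;gt;accepted_indices&amp;lt;/code&amp;gt; argument):&lt;br /&gt;

```python
# Illustrative model of the VSR ingress rule above: keep only accepted
# messages and key each by (destination, rank), where rank is the 1-based
# position of its batch index among all accepted indices.

def vsr_ingress(ingress_payload, accepted_indices, dst_of):
    ordered = sorted(accepted_indices)
    rank = {i: r + 1 for r, i in enumerate(ordered)}  # Rank(i, accepted set)
    return {(dst_of(ingress_payload[i]), rank[i]): ingress_payload[i]
            for i in ordered}
```

&lt;br /&gt;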
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the Scheduler and the Hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;br /&gt;
&lt;br /&gt;
* Third, we have &amp;lt;code&amp;gt;produced_messages&amp;lt;/code&amp;gt; which contains a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all produced messages, where the order of the messages implied by the queue index determines the order in which they need to be added to the queues.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3396</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3396"/>
		<updated>2022-11-03T12:28:54Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain, and one can look at the execution of a batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
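This batch-application view can be sketched as a fold (illustrative Python; &amp;lt;code&amp;gt;apply&amp;lt;/code&amp;gt; stands in for the whole deterministic processing of one batch):&lt;br /&gt;

```python
# Illustrative fold over batches: replicated state of version x is obtained
# by applying the batch at height x to the state of version x - 1.

def apply_batches(initial_state, batches, apply):
    states = {0: initial_state}
    for height in sorted(batches):
        states[height] = apply(states[height - 1], batches[height])
    return states
```

&lt;br /&gt;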
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Moving the subnet-to-subnet streams produced during batch processing to their destination subnets.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier-to-digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees on top of those the system already provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets           : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended in order of their keys. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
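To make the abstract definitions above concrete, here is a minimal Python sketch of the generic queue with its &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; operations (a hypothetical illustration only; the dictionary representation is our own choice and not the actual implementation):&lt;br /&gt;

```python
# Hypothetical sketch of the abstract queue defined above.
# Indices are rolling: a new queue starts at next_index = 1, and
# delete/clear never rewind next_index.

def new_queue():
    return {"next_index": 1, "elements": {}}

def push(q, values):
    """Append `values` (a partial map from naturals to T) in key order."""
    start = q["next_index"]
    elements = dict(q["elements"])
    # rank(j, dom(values)) is the 1-based position of j among the sorted keys
    for k, j in enumerate(sorted(values), start=1):
        elements[start - 1 + k] = values[j]
    return {"next_index": start + len(values), "elements": elements}

def delete(q, values):
    """Remove the given index-to-element entries; next_index is kept."""
    assert all(q["elements"].get(i) == t for i, t in values.items())
    remaining = {i: t for i, t in q["elements"].items() if i not in values}
    return {"next_index": q["next_index"], "elements": remaining}

def clear(q):
    """Drop all elements; next_index is kept."""
    return {"next_index": q["next_index"], "elements": {}}
```

Note how &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; only ever grows: deleting or clearing elements never rewinds it, so per-queue indices remain unique over the lifetime of the queue.&lt;br /&gt;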
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
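As a hypothetical illustration of the lifted semantics, the map variant of &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; can be sketched in Python as follows (names and the dictionary representation are our own, chosen to mirror the definitions above):&lt;br /&gt;

```python
# Hypothetical sketch of the map-lifted push: queues are a dict id -> queue,
# and the values to insert are grouped per id.

def new_queue():
    return {"next_index": 1, "elements": {}}

def push(q, values):
    start = q["next_index"]
    elements = dict(q["elements"])
    for k, j in enumerate(sorted(values), start=1):
        elements[start - 1 + k] = values[j]
    return {"next_index": start + len(values), "elements": elements}

def push_map(queues, values_by_id):
    """Apply push per id. Ids absent from `queues` get a fresh queue;
    ids absent from `values_by_id` are passed through unchanged."""
    result = dict(queues)  # third clause: untouched queues are kept as-is
    for qid, values in values_by_id.items():
        result[qid] = push(queues.get(qid, new_queue()), values)
    return result
```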
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be a sequence of arbitrary length, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
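A small Python sketch of these index semantics (hypothetical; indices are modelled as plain tuples whose last component is the sequence number):&lt;br /&gt;

```python
# Hypothetical sketch of the Index semantics above. An index is a tuple whose
# last component is a natural number; only indices sharing a prefix compare.

def prefix(i):
    return i[:-1]

def postfix(i):
    return i[-1]

def successor(i):
    """i + 1: same prefix, last component incremented."""
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    """i is at most j iff the prefixes agree and the sequence numbers are
    ordered; indices with different prefixes are incomparable (None)."""
    if prefix(i) != prefix(j):
        return None
    return postfix(i) <= postfix(j)
```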
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead, one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState,&lt;br /&gt;
    witness   : Witness,&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height,&lt;br /&gt;
    registry_version         : RegistryVersion,&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message,&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice,&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings are decoded into a format suitable for processing, and in which data not required inside the deterministic state machine is stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message,&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the own subnet id and the batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
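As a hypothetical sketch, the &amp;lt;code&amp;gt;decode&amp;lt;/code&amp;gt; function can be written in Python as below. The state manager routine &amp;lt;code&amp;gt;decode_valid_certified_stream&amp;lt;/code&amp;gt; is stubbed out here, since its actual behaviour (verifying the certification and reconstructing the slice from payload and witness) is specified in the state manager documentation:&lt;br /&gt;

```python
# Hypothetical sketch of decode: the ingress payload is passed through, while
# every CertifiedStreamSlice in the xnet payload is decoded (and implicitly
# validated) by the state manager.

def decode_valid_certified_stream(own_subnet, cert_slice):
    # Stand-in for the state manager routine: a real implementation verifies
    # the certification for own_subnet and reconstructs the StreamSlice from
    # the payload and witness.
    return cert_slice["payload"]

def decode(own_subnet, batch):
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src_subnet: decode_valid_certified_stream(own_subnet, cert_slice)
            for src_subnet, cert_slice in batch["xnet_payload"].items()
        },
    }
```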
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other criteria, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that decides whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the indices of those messages in the ingress payload that are accepted by the VSR; which indices are accepted is determined by the concrete implementation of the VSR. The result is a possibly empty partial map of index-message pairs corresponding to the accepted messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
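The re-indexing performed by the VSR mapping above can be sketched in Python as follows (hypothetical; &amp;lt;code&amp;gt;accepted&amp;lt;/code&amp;gt; stands for the result of &amp;lt;code&amp;gt;vsr_check_ingress&amp;lt;/code&amp;gt;, and messages are modelled as dictionaries with a &amp;lt;code&amp;gt;dst&amp;lt;/code&amp;gt; field):&lt;br /&gt;

```python
# Hypothetical sketch of the VSR mapping on ingress messages: accepted
# messages are re-indexed densely via their rank among the accepted indices,
# and keyed by destination canister id.

def rank(i, indices):
    """1-based position of i among the sorted accepted indices."""
    return sorted(indices).index(i) + 1

def vsr_ingress(accepted, ingress_payload):
    return {
        (m["dst"], rank(i, accepted)): m
        for i, m in ingress_payload.items()
        if i in accepted
    }
```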
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of scheduler and hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which is a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;br /&gt;
&lt;br /&gt;
* Second, we have &amp;lt;code&amp;gt;consumed_xnet_messages&amp;lt;/code&amp;gt;, which is a partial map &amp;lt;code&amp;gt;QueueIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed cross-net messages.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3395</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3395"/>
		<updated>2022-11-03T12:28:24Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Processing each batch deterministically relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the streams produced on this subnet available to other subnets, and obtaining stream slices from other subnets for processing on this subnet.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier-to-digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees to those the system already provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed, exactly once, to the execution layer on the subnet that the destination canister lives on.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map of integers mapping to T, and returns a new queue consisting of the old queue with the given values appended. It also updates the next_index field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
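&lt;br /&gt;
The abstract queue and its associated functions can be sketched as follows. This is an illustrative Python model of the definitions above, not the actual implementation; all names are chosen for exposition only:&lt;br /&gt;

```python
# Illustrative model of the abstract Queue<T>: elements is a partial map from
# index to element, and next_index is a rolling counter that never decreases.
class Queue:
    def __init__(self):
        self.next_index = 1   # index of the next message to be inserted
        self.elements = {}    # partial map: index -> element

    def push(self, values):
        # Append values in the order of their keys: the k-th value
        # (1-based rank among the keys) lands at next_index - 1 + k.
        for k, key in enumerate(sorted(values), start=1):
            self.elements[self.next_index - 1 + k] = values[key]
        self.next_index += len(values)

    def delete(self, values):
        # REQUIRE: values is a sub-map of elements; next_index is kept.
        assert all(self.elements.get(i) == v for i, v in values.items())
        for i in values:
            del self.elements[i]

    def clear(self):
        # Drop all elements but keep next_index.
        self.elements = {}

q = Queue()
q.push({7: "a", 9: "b"})   # ranks 1 and 2 -> queue indices 1 and 2
q.push({3: "c"})           # -> queue index 3
q.delete({2: "b"})
print(q.next_index, q.elements)  # 4 {1: 'a', 3: 'c'}
```

Note how the original keys of the pushed map only determine the insertion order; the queue assigns fresh consecutive indices.&lt;br /&gt;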
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a partial map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
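&lt;br /&gt;
To illustrate the lifted (map) variants, the following sketch applies &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; over a map of queues: identifiers present only in the value map get a fresh queue, and identifiers without new values pass through unchanged. This is a hypothetical Python model mirroring the abstract definitions above:&lt;br /&gt;

```python
# Minimal stand-in for the abstract Queue<T> defined earlier.
class Queue:
    def __init__(self):
        self.next_index, self.elements = 1, {}

    def push(self, values):
        for k, key in enumerate(sorted(values), start=1):
            self.elements[self.next_index - 1 + k] = values[key]
        self.next_index += len(values)

def push_map(queues, values):
    # values: (identifier, n) -> element; group by identifier first, then
    # push each group into the existing queue, or into a fresh one.
    grouped = {}
    for (ident, j), v in values.items():
        grouped.setdefault(ident, {})[j] = v
    for ident, vals in grouped.items():
        queues.setdefault(ident, Queue()).push(vals)
    return queues  # untouched identifiers remain as they were

qs = push_map({}, {("A", 1): "x", ("A", 2): "y", ("B", 1): "z"})
print(sorted(qs), qs["A"].elements)  # ['A', 'B'] {1: 'x', 2: 'y'}
```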
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be a sequence of arbitrary length, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
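&lt;br /&gt;
The index operations above can be modeled directly on tuples. In this illustrative sketch, &amp;lt;code&amp;gt;succ&amp;lt;/code&amp;gt; stands for the &amp;lt;code&amp;gt;i + 1&amp;lt;/code&amp;gt; operation, and &amp;lt;code&amp;gt;leq&amp;lt;/code&amp;gt; for the partial order, which is only defined for indices with equal prefixes:&lt;br /&gt;

```python
# An index is a tuple whose last component is a natural number.
def prefix(i):  return i[:-1]
def postfix(i): return i[-1]

def succ(i):
    # i + 1: increment only the trailing sequence number.
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # Partial order: comparable only if the prefixes agree.
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)

i = ("canister_a", "canister_b", 4)   # e.g. a QueueIndex
print(succ(i))        # ('canister_a', 'canister_b', 5)
print(leq(i, succ(i)))  # True
```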
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams, indexed by the destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We sometimes abuse notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following types:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and the end of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch in which all transport-specific encodings have been decoded into a format suitable for processing, and in which parts that are not required inside the deterministic state machine have been stripped.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the subnet&#039;s own id and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
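&lt;br /&gt;
The &amp;lt;code&amp;gt;decode&amp;lt;/code&amp;gt; function can be sketched as follows. The state manager&#039;s &amp;lt;code&amp;gt;decode_valid_certified_stream&amp;lt;/code&amp;gt; routine is stubbed out here, since verification and decoding of certified slices are its responsibility; this is an illustrative Python model, not the real API:&lt;br /&gt;

```python
# Stub for the state manager's decoding routine: in reality this verifies the
# certification and reconstructs the StreamSlice from the certified payload.
def decode_valid_certified_stream(own_subnet, cert_slice):
    return cert_slice["payload"]

def decode(own_subnet, batch):
    # The ingress payload is passed through as-is; only the certified
    # stream slices of the xnet payload are decoded.
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src: decode_valid_certified_stream(own_subnet, cert)
            for src, cert in batch["xnet_payload"].items()
        },
    }

batch = {"ingress_payload": {1: "msg"},
         "xnet_payload": {"subnet_x": {"payload": "slice_x"}}}
print(decode("own", batch)["xnet_payload"])  # {'subnet_x': 'slice_x'}
```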
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprising the message routing and execution layers. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other conditions, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
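&lt;br /&gt;
The per-batch steps above can be sketched as a single round driver. This is an illustrative Python model with stubbed phases that only record their invocation order; the real phase functions transform the canonical state as described in the following sections:&lt;br /&gt;

```python
# Stubs for the three phases of deterministic batch processing; each takes and
# returns the (here trivially modeled) state.
def pre_process(state, subnet_assignment, decoded_batch):
    state["log"].append("induction"); return state

def execute(state):
    state["log"].append("execution"); return state

def post_process(state, subnet_assignment):
    state["log"].append("xnet_routing"); return state

def process_batch(state, subnet_assignment, decoded_batch):
    # The three phases run in a fixed order, each building on the previous
    # state; the caller then commits the result via commit_and_certify.
    state = pre_process(state, subnet_assignment, decoded_batch)
    state = execute(state)
    state = post_process(state, subnet_assignment)
    return state

s = process_batch({"log": []}, {}, {})
print(s["log"])  # ['induction', 'execution', 'xnet_routing']
```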
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions, which correspond to the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that makes the decision of whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; a message or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantic that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantic that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; of the batch that are accepted by the VSR. The set is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
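&lt;br /&gt;
The VSR&#039;s ingress mapping can be modeled as follows. The acceptance predicate &amp;lt;code&amp;gt;vsr_check_ingress&amp;lt;/code&amp;gt; is stubbed to accept every index, since its decision logic is implementation-defined; this is an illustrative Python sketch only:&lt;br /&gt;

```python
# Permissive stub: a real VSR would check, e.g., ingress expiry and signatures.
def vsr_check_ingress(state, batch):
    return set(batch["ingress_payload"])  # indices of accepted messages

def rank(i, index_set):
    # 1-based position of i within the sorted index set.
    return sorted(index_set).index(i) + 1

def vsr_ingress(state, batch):
    # Accepted ingress messages are re-keyed by (destination canister,
    # rank among the accepted messages).
    accepted = vsr_check_ingress(state, batch)
    return {
        (m["dst"], rank(i, accepted)): m
        for i, m in batch["ingress_payload"].items()
        if i in accepted
    }

batch = {"ingress_payload": {3: {"dst": "canister_a"},
                             5: {"dst": "canister_b"}}}
print(sorted(vsr_ingress({}, batch)))  # [('canister_a', 1), ('canister_b', 2)]
```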
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model the functionality of the scheduler and the hypervisor as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;br /&gt;
&lt;br /&gt;
* First, we have &amp;lt;code&amp;gt;consumed_ingress_messages&amp;lt;/code&amp;gt;, which contains a partial map &amp;lt;code&amp;gt;IngressIndex ↦ Message&amp;lt;/code&amp;gt; containing all consumed ingress messages.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3394</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3394"/>
		<updated>2022-11-03T12:27:42Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
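&lt;br /&gt;
The &amp;quot;applying the batch at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; to the state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;&amp;quot; view can be illustrated as a deterministic fold over the agreed-upon batches. This is a toy Python sketch; the real transition function is the deterministic batch processing performed by the upper layers:&lt;br /&gt;

```python
# Toy deterministic transition: the state simply records the applied batches.
def apply_batch(state, batch):
    return state + [batch]

def state_at_height(batches, x):
    # Replicated state of version x = fold of batches 1..x over version 0.
    state = []                 # state of version 0
    for h in range(1, x + 1):  # apply batches in block-height order
        state = apply_batch(state, batches[h])
    return state

batches = {1: "b1", 2: "b2", 3: "b3"}
print(state_at_height(batches, 2))  # ['b1', 'b2']
```

Because the transition is deterministic, every honest replica that applies the same batches in the same order ends up with the same state.&lt;br /&gt;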
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the message streams produced on one subnet available to the respective destination subnets.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation-specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that, formally, this layer does not add any guarantees beyond those the system already provides, but simply needs to make sure that system invariants are preserved. Those system invariants include:&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
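&lt;br /&gt;
For illustration, a registry view for an IC with two subnets could look as follows, where the subnet ids &amp;lt;code&amp;gt;S1&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;S2&amp;lt;/code&amp;gt; and canister ids &amp;lt;code&amp;gt;C1&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;C2&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;C3&amp;lt;/code&amp;gt; are hypothetical:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets           = { S1, S2 },&lt;br /&gt;
    subnet_assignment = { C1 ↦ S1, C2 ↦ S1, C3 ↦ S2 }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Here, a message addressed to &amp;lt;code&amp;gt;C3&amp;lt;/code&amp;gt; would be routed into the stream targeted at subnet &amp;lt;code&amp;gt;S2&amp;lt;/code&amp;gt;.&lt;br /&gt;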
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability, we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that, we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended in the order of their keys. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
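&lt;br /&gt;
As a worked example of &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt;, consider a queue &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;q.next_index = 3&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;q.elements = { (1 ↦ a), (2 ↦ b) }&amp;lt;/code&amp;gt;. Pushing the values &amp;lt;code&amp;gt;{ (5 ↦ c), (7 ↦ d) }&amp;lt;/code&amp;gt; appends them in the order of their keys, re-indexed relative to &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;, since &amp;lt;code&amp;gt;rank(5, {5, 7}) = 1&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;rank(7, {5, 7}) = 2&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push(q, { (5 ↦ c), (7 ↦ d) }) =&lt;br /&gt;
    q with&lt;br /&gt;
      ├─ next_index := 5&lt;br /&gt;
      └─ elements   := { (1 ↦ a), (2 ↦ b), (3 ↦ c), (4 ↦ d) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;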
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
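&lt;br /&gt;
For illustration, let &amp;lt;code&amp;gt;q = { (A ↦ q_A) }&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;q_A.next_index = 2&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;q_A.elements = { (1 ↦ x) }&amp;lt;/code&amp;gt;, and let &amp;lt;code&amp;gt;v = { ((A, 4) ↦ y), ((B, 1) ↦ z) }&amp;lt;/code&amp;gt;, where the identifiers &amp;lt;code&amp;gt;A&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;B&amp;lt;/code&amp;gt; are hypothetical. Then the first two cases of the definition apply: the existing queue for &amp;lt;code&amp;gt;A&amp;lt;/code&amp;gt; is extended, and a fresh queue is created for &amp;lt;code&amp;gt;B&amp;lt;/code&amp;gt;; any queue without matching values in &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; would be kept unchanged (third case):&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push_map(q, v) = { (A ↦ Queue { next_index = 3, elements = { (1 ↦ x), (2 ↦ y) } }),&lt;br /&gt;
                   (B ↦ Queue { next_index = 2, elements = { (1 ↦ z) } }) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;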
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be a sequence of arbitrary length, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
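&lt;br /&gt;
For example, for indices of type &amp;lt;code&amp;gt;CanisterId × CanisterId × ℕ&amp;lt;/code&amp;gt; with hypothetical canister ids &amp;lt;code&amp;gt;A&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;B&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;C&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;i = (A, B, 3)&lt;br /&gt;
prefix(i)  = (A, B)&lt;br /&gt;
postfix(i) = 3&lt;br /&gt;
i + 1      = (A, B, 4)&lt;br /&gt;
&lt;br /&gt;
(A, B, 3) ≤ (A, B, 5)                      % same prefix, 3 ≤ 5&lt;br /&gt;
(A, B, 3) and (A, C, 1) are incomparable   % different prefixes&amp;lt;/nowiki&amp;gt;&lt;br /&gt;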
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of signals &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
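&lt;br /&gt;
For illustration, a flat stream (i.e., with &amp;lt;code&amp;gt;SubstreamId = ()&amp;lt;/code&amp;gt;) carrying two messages and two signals could look as follows, where &amp;lt;code&amp;gt;m_1&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;m_2&amp;lt;/code&amp;gt; are hypothetical messages:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals = { (((), 1) ↦ ACCEPT), (((), 2) ↦ REJECT) },&lt;br /&gt;
    msgs    = { (() ↦ Queue { next_index = 3, elements = { (1 ↦ m_1), (2 ↦ m_2) } }) }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
Prepending the substream id to the queue indices of &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt; yields &amp;lt;code&amp;gt;Stream.msgs.elements = { (((), 1) ↦ m_1), (((), 2) ↦ m_2) }&amp;lt;/code&amp;gt;.&lt;br /&gt;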
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead, one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific encodings are decoded into a format suitable for processing, and fields which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the subnet&#039;s own id and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
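&lt;br /&gt;
The round described above can be summarized in pseudocode as follows, where &amp;lt;code&amp;gt;get_state_at&amp;lt;/code&amp;gt; is an illustrative name for obtaining the appropriate version of the replicated state from the state manager, and the exact signatures may deviate from the implementation:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;process_batch(b : Batch) :=&lt;br /&gt;
    s  := state_manager.get_state_at(b.batch_number - 1)&lt;br /&gt;
    sa := registry.get_registry_at(b.registry_version).subnet_assignment&lt;br /&gt;
    db := decode(own_subnet, b)&lt;br /&gt;
    s  := pre_process(s, sa, db)    % induction phase&lt;br /&gt;
    s  := execute(s)                % execution phase&lt;br /&gt;
    s  := post_process(s, sa)       % XNet message routing phase&lt;br /&gt;
    state_manager.commit_and_certify(s, b.batch_number)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;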
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component, which is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that decides whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; or to &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; means that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; means that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the ingress payload that are accepted by the VSR. The set is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
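&lt;br /&gt;
For example, if &amp;lt;code&amp;gt;batch.ingress_payload = { (1 ↦ m_1), (2 ↦ m_2), (3 ↦ m_3) }&amp;lt;/code&amp;gt; and the VSR accepts the messages at indices &amp;lt;code&amp;gt;{1, 3}&amp;lt;/code&amp;gt;, then, since &amp;lt;code&amp;gt;rank(1, {1, 3}) = 1&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;rank(3, {1, 3}) = 2&amp;lt;/code&amp;gt;,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress = { ((m_1.dst, 1) ↦ m_1), ((m_3.dst, 2) ↦ m_3) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
and the rejected message &amp;lt;code&amp;gt;m_2&amp;lt;/code&amp;gt; is dropped.&lt;br /&gt;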
&lt;br /&gt;
&#039;&#039;&#039;Scheduler and Hypervisor&#039;&#039;&#039;. From the point of view of message routing, one can look at the scheduler and the hypervisor together as one component. We model their functionality as a deterministic function &amp;lt;code&amp;gt;schedule_and_execute : CanonicalState → (IngressIndex ↦ Message) × (QueueIndex ↦ Message) × (QueueIndex ↦ Message)&amp;lt;/code&amp;gt; which computes the change set introduced by the scheduler and the hypervisor. It takes messages from the input queues, executes them, and puts new messages into the output queues.&lt;br /&gt;
&lt;br /&gt;
We will later use this function when we describe how the state transition function &amp;lt;code&amp;gt;execute(CanonicalState) → CanonicalState&amp;lt;/code&amp;gt; transforms the state. For the sake of compact notation, we use the following fields to access the individual return values of the &amp;lt;code&amp;gt;schedule_and_execute&amp;lt;/code&amp;gt; function.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3393</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3393"/>
		<updated>2022-11-03T12:25:04Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain, and one can view the execution by the upper layers of the batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain the replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Moving streams from one subnet to another.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. It is therefore recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to that page for missing definitions related to the replicated state, etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for relevant and easier-to-digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that, formally, this layer does not add any guarantees on top of those the system already provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
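To make the shape of this registry view concrete, here is a minimal Python sketch (illustrative only; the type and field names mirror the pseudocode above, not the replica&#039;s actual Rust types):&lt;br /&gt;

```python
from dataclasses import dataclass


# Hypothetical, simplified model of the registry view used by message routing.
@dataclass
class Registry:
    subnets: set            # set of all existing subnet ids
    subnet_assignment: dict # canister id -> subnet id

    def route(self, canister_id):
        """Return the subnet hosting `canister_id`, or None if unassigned."""
        return self.subnet_assignment.get(canister_id)


registry = Registry(
    subnets={"subnet_A", "subnet_B"},
    subnet_assignment={"canister_1": "subnet_A", "canister_2": "subnet_B"},
)
print(registry.route("canister_1"))  # subnet_A
```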
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map; e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map of integers mapping to T, and returns a new queue consisting of the old queue with the given values appended. It also updates the next_index field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
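The abstract queue semantics above can be sketched in Python as follows (illustrative only; &amp;lt;code&amp;gt;rank&amp;lt;/code&amp;gt; is assumed to be the 1-based position of a key within the sorted key set, and &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; is keyed by index here):&lt;br /&gt;

```python
def rank(j, keys):
    """1-based position of key j within the sorted set of keys."""
    return sorted(keys).index(j) + 1


class Queue:
    """Sketch of the abstract Queue<T>: a rolling next_index plus a partial
    map from indices to elements (indices are never reused, even after
    deletion)."""

    def __init__(self):
        self.next_index = 1
        self.elements = {}  # index -> element

    def push(self, values):
        """Append `values` (a partial map int -> T) in key order; the
        first appended value lands at the old next_index."""
        i = self.next_index
        for j, t in values.items():
            self.elements[i - 1 + rank(j, values.keys())] = t
        self.next_index += len(values)

    def delete(self, indices):
        """Remove the given indices; next_index is unchanged."""
        for idx in indices:
            del self.elements[idx]

    def clear(self):
        """Drop all elements; next_index is unchanged."""
        self.elements = {}


q = Queue()
q.push({7: "a", 9: "b"})  # lands at queue indices 1 and 2
q.push({1: "c"})          # lands at queue index 3
q.delete([1])
print(q.next_index, q.elements)  # 4 {2: 'b', 3: 'c'}
```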
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
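For illustration, the map-lifted &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; can be sketched in Python (the minimal &amp;lt;code&amp;gt;Queue&amp;lt;/code&amp;gt; stand-in below compresses the definition given earlier): pushing under an identifier that has no queue yet implicitly starts from a fresh queue, and identifiers without new values pass through untouched.&lt;br /&gt;

```python
class Queue:
    """Minimal stand-in for the abstract Queue<T> defined above."""

    def __init__(self):
        self.next_index = 1
        self.elements = {}

    def push(self, values):
        # Append values in key order; queue indices keep growing.
        for j in sorted(values):
            self.elements[self.next_index] = values[j]
            self.next_index += 1


def push_map(queues, values_by_id):
    """Map-lifted push over (identifier -> Queue): pushes into existing
    queues, creates fresh queues for new identifiers, and leaves
    identifiers without new values untouched. Mutates existing queues
    in place for brevity (the abstract definition is purely functional)."""
    out = dict(queues)
    for qid, values in values_by_id.items():
        out.setdefault(qid, Queue()).push(values)
    return out


queues = push_map({}, {("A", "B"): {1: "m1", 2: "m2"}})
queues = push_map(queues, {("A", "B"): {1: "m3"}, ("A", "C"): {1: "m4"}})
# ("A","B") now holds m1,m2,m3 at indices 1..3; ("A","C") holds m4 at index 1
```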
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of i except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
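Modelling an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; as a tuple, these operations can be sketched as follows (illustrative only):&lt;br /&gt;

```python
def prefix(i):
    """All components of the index except the last."""
    return i[:-1]


def postfix(i):
    """The last component; required to be a natural number."""
    return i[-1]


def inc(i):
    """i + 1: concatenate the prefix with the incremented sequence number."""
    return prefix(i) + (postfix(i) + 1,)


def leq(i, j):
    """i <= j iff the prefixes match and the sequence numbers are ordered;
    indices with different prefixes are incomparable (returns None)."""
    if prefix(i) != prefix(j):
        return None
    return postfix(i) <= postfix(j)


idx = ("canister_A", "canister_B", 5)  # a QueueIndex-style tuple
assert inc(idx) == ("canister_A", "canister_B", 6)
assert leq(idx, inc(idx)) is True
# Different prefixes => incomparable:
assert leq(idx, ("canister_A", "canister_C", 9)) is None
```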
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt;s and a sequence &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt; of canister-to-canister messages.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams, indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
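For illustration, a flat stream (with &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; being the unit type) and the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; map can be sketched as follows (hypothetical names; a sketch, not the replica&#039;s actual types):&lt;br /&gt;

```python
class Stream:
    """Illustrative sketch of a flat Stream (SubstreamId = (), the unit type)."""

    def __init__(self):
        self.signals = {}    # StreamIndex ((), n) -> "ACCEPT" | "REJECT"
        self.msgs = {}       # StreamIndex ((), n) -> message
        self.next_index = 1  # rolling index of the msgs queue

    def append(self, msg):
        """Append an outgoing canister-to-canister message to the stream."""
        self.msgs[((), self.next_index)] = msg
        self.next_index += 1


# Streams: destination subnet id -> Stream
streams = {}
streams.setdefault("subnet_B", Stream()).append({"src": "c1", "dst": "c9"})
# Record a signal for index ((), 1) of the stream received *from* subnet_B:
streams["subnet_B"].signals[((), 1)] = "ACCEPT"
```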
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing, but instead one relies on the encoding and decoding routines provided by the state manager: A &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific encodings are decoded into a format suitable for processing, and parts which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, based on the subnet&#039;s own id and the given batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
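The decode function can be sketched in Python as follows; &amp;lt;code&amp;gt;decode_valid_certified_stream&amp;lt;/code&amp;gt; is a stub standing in for the state manager&#039;s decoding routine (the real routine verifies the witness and certification rather than a boolean flag):&lt;br /&gt;

```python
def decode_valid_certified_stream(own_subnet, cert_slice):
    """Stub for the state manager's decoding routine: here it just checks a
    fake 'certified' flag and unwraps the payload."""
    assert cert_slice["certified"], "invalid certification"
    return cert_slice["payload"]


def decode(own_subnet, batch):
    """Turn a Batch into a DecodedBatch: the ingress payload is kept as-is,
    and each CertifiedStreamSlice is decoded into a StreamSlice."""
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src: decode_valid_certified_stream(own_subnet, cs)
            for src, cs in batch["xnet_payload"].items()
        },
    }


batch = {
    "ingress_payload": {1: "ingress_msg"},
    "xnet_payload": {"subnet_A": {"certified": True, "payload": "slice_from_A"}},
}
decoded = decode("subnet_B", batch)
```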
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message m in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
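The control flow of one processing round can be sketched as follows (a toy model: the phase bodies are placeholders, and the registry view argument is omitted for brevity; none of these names are the replica&#039;s actual API):&lt;br /&gt;

```python
class StateManager:
    """Toy state manager: keeps one replicated state per height."""

    def __init__(self):
        self.states = {0: {"log": []}}

    def take_state(self, height):
        s = self.states[height]
        return {"log": list(s["log"])}  # copy so each round is isolated

    def commit_and_certify(self, height, state):
        self.states[height] = state


def pre_process(state, batch):  # induction phase
    state["log"].append(("inducted", len(batch["ingress_payload"])))
    return state


def execute(state):             # execution phase (hypervisor cycle)
    state["log"].append(("executed",))
    return state


def post_process(state):        # XNet message routing phase
    state["log"].append(("routed",))
    return state


def process_batch(sm, batch):
    # Apply the batch at height x to the state of version x - 1.
    state = sm.take_state(batch["batch_number"] - 1)
    state = post_process(execute(pre_process(state, batch)))
    # Commit the resulting state as version x.
    sm.commit_and_certify(batch["batch_number"], state)


sm = StateManager()
process_batch(sm, {"batch_number": 1, "ingress_payload": {1: "m"}})
```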
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to describe the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;br /&gt;
&lt;br /&gt;
==== API ====&lt;br /&gt;
The deterministic state machine does not provide any external API functions. It only provides the following functions resembling the state transformations implemented by the individual steps of the deterministic state machine depicted above. Refer to the previous section for context regarding when the individual functions are called.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;pre_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId), b : DecodedBatch) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the induction phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;execute(s : CanonicalState) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the execution phase.&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;post_process(s : CanonicalState, subnet_assignment : (CanisterId ↦ SubnetId)) → CanonicalState&amp;lt;/code&amp;gt;: Triggers the XNet message routing phase.&lt;br /&gt;
&lt;br /&gt;
==== Abstractions of Other Parts of the System ====&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Valid Set Rule (VSR)&#039;&#039;&#039;&lt;br /&gt;
The VSR is a component that decides whether to &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; a message. For message routing, &amp;lt;code&amp;gt;ACCEPT&amp;lt;/code&amp;gt; has the semantics that the execution layer takes responsibility for the message, whereas &amp;lt;code&amp;gt;REJECT&amp;lt;/code&amp;gt; has the semantics that the message is dropped and may require action from the message routing layer.&lt;br /&gt;
&lt;br /&gt;
The operation of the VSR on ingress messages is defined as follows, where &amp;lt;code&amp;gt;vsr_check_ingress : CanonicalState × Batch → Set&amp;lt;ℕ&amp;gt;&amp;lt;/code&amp;gt; is a deterministic function returning the (possibly empty) set of indices of the messages in the batch&#039;s ingress payload that are accepted by the VSR. The set is determined by the concrete implementation of the VSR.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;VSR(state, batch).ingress :=&lt;br /&gt;
  { ((m_i.dst, j) ↦ m_i) | (i ↦ m_i) ∈ batch.ingress_payload&lt;br /&gt;
                           ∧ i ∈ vsr_check_ingress(state, batch)&lt;br /&gt;
                           ∧ j = rank(i, vsr_check_ingress(state, batch))&lt;br /&gt;
  }&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3392</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3392"/>
		<updated>2022-11-03T12:16:44Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the message streams produced on this subnet available to other subnets, so that certified slices of them can be included in blocks and inducted there.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication between canisters across subnets. This means that this layer formally does not add any guarantees beyond those the system already provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
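As a toy illustration, the registry view can be modeled as a plain dictionary; all subnet and canister ids below are hypothetical placeholders, not real identifiers:&lt;br /&gt;

```python
# Hypothetical registry view (all ids are illustrative placeholders).
registry = {
    "subnets": {"subnet_x", "subnet_y"},
    "subnet_assignment": {"canister_a": "subnet_x",
                          "canister_b": "subnet_y"},
}

def route_target(canister_id):
    # Messages are routed into the stream of the subnet hosting the
    # destination canister, as given by subnet_assignment.
    return registry["subnet_assignment"][canister_id]

assert route_target("canister_b") == "subnet_y"
```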
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability, we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map; e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that, we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map of integers mapping to T, and returns a new queue consisting of the old queue with the given values appended. It also updates the next_index field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
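The three queue operations above can be sketched as a small Python model; this is an illustrative sketch only (the rank of a key in &amp;lt;code&amp;gt;values&amp;lt;/code&amp;gt; is modeled by sorting), not the actual implementation:&lt;br /&gt;

```python
class Queue:
    """Model of the abstract Queue: a rolling next_index plus a
    partial map from natural numbers to elements."""
    def __init__(self):
        self.next_index = 1   # new_queue.next_index = 1
        self.elements = {}    # ℕ ↦ T, initially empty

    def push(self, values):
        # Append the values in ascending key order starting at
        # next_index, then advance next_index past them.
        for k, key in enumerate(sorted(values)):
            self.elements[self.next_index + k] = values[key]
        self.next_index += len(values)

    def delete(self, values):
        # REQUIRE: values ⊆ self.elements; next_index is kept.
        for key in values:
            assert self.elements[key] == values[key]
            del self.elements[key]

    def clear(self):
        # Drop all elements but keep next_index.
        self.elements = {}

q = Queue()
q.push({7: "a", 9: "b"})   # key 7 has rank 1, key 9 has rank 2
q.push({1: "c"})
q.delete({2: "b"})
# elements is now {1: "a", 3: "c"} and next_index is still 4
```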
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
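The lifting of a per-queue function to maps of queues can be sketched as follows; this is an illustrative Python model that uses plain lists as stand-in queues, not the actual implementation:&lt;br /&gt;

```python
# Illustrative lifting of a per-queue function f to maps of queues:
# ids present in both q and v get f applied, ids only in v get a
# fresh queue, ids only in q keep their queue unchanged.
def f_map(f, q, v, new_queue):
    out = {}
    for qid, queue in q.items():
        out[qid] = f(queue, v[qid]) if qid in v else queue
    for qid, values in v.items():
        if qid not in q:
            out[qid] = f(new_queue(), values)
    return out

# Plain lists as stand-in queues; f appends values in key order.
def push(queue, values):
    return queue + [values[k] for k in sorted(values)]

q = {"a": ["m1"]}
v = {"a": {1: "m2"}, "b": {1: "m3"}}
out = f_map(push, q, v, list)
assert out == {"a": ["m1", "m2"], "b": ["m3"]}
```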
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
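The index operations above can be modeled directly on tuples; the &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;-style values in this sketch are illustrative placeholders:&lt;br /&gt;

```python
import operator

# An Index is modeled as a tuple whose last component is a natural
# number; the values used below are illustrative QueueIndex-style tuples.
def prefix(i):
    return i[:-1]            # every element except the last

def postfix(i):
    return i[-1]             # the trailing sequence number

def increment(i):
    # i + 1 keeps the prefix and bumps the sequence number.
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # i ≤ j iff the prefixes match and the sequence numbers compare;
    # indices with different prefixes are incomparable.
    return prefix(i) == prefix(j) and operator.le(postfix(i), postfix(j))

i = ("canister_a", "canister_b", 5)
assert increment(i) == ("canister_a", "canister_b", 6)
assert leq(i, increment(i))
assert not leq(("canister_a", 5), ("canister_c", 9))  # incomparable
```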
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
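Concretely, the nesting works out as in the following illustrative snippet (all ids and messages are placeholders):&lt;br /&gt;

```python
# Input queues keyed by (source, destination); element keys extend the
# pair to full QueueIndexes (all ids and messages are placeholders).
input_queues = {
    ("canister_a", "canister_b"): {
        "next_index": 3,
        "elements": {
            ("canister_a", "canister_b", 1): "m1",
            ("canister_a", "canister_b", 2): "m2",
        },
    },
}

# InputQueues.elements : QueueIndex ↦ Message
idx = ("canister_a", "canister_b", 2)
assert input_queues[("canister_a", "canister_b")]["elements"][idx] == "m2"
```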
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet the stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams, indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
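As an illustration, a flat stream with two messages and two signals might look as follows (all ids and messages are placeholders):&lt;br /&gt;

```python
# A flat Stream (SubstreamId is the unit type, modeled here as ()),
# with placeholder signals and messages.
stream = {
    "signals": {((), 1): "ACCEPT", ((), 2): "REJECT"},
    "msgs": {
        (): {"next_index": 3,
             "elements": {((), 1): "m1", ((), 2): "m2"}},
    },
}

# Streams maps each destination subnet to its outgoing stream.
streams = {"subnet_y": stream}
assert streams["subnet_y"]["signals"][((), 2)] == "REJECT"
assert streams["subnet_y"]["msgs"][()]["elements"][((), 1)] == "m1"
```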
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead, one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific encodings are decoded into a format suitable for processing, and parts which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the subnet&#039;s own id and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among others, a message m in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
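The round coordination described by the steps above can be sketched as follows; every function name here is an illustrative stand-in for the conceptual step, not the actual API:&lt;br /&gt;

```python
# Illustrative stand-ins for the three phases of the deterministic
# state machine; none of these names is the actual API.
def pre_process(state, decoded, registry_view):
    # Induction phase: valid messages from the slices are inducted.
    state["inducted"] = sorted(decoded["xnet_payload"].values())
    return state

def execute(state):
    # Execution phase: the hypervisor consumes input queues and
    # produces messages in output queues (modeled here as a flag).
    state["executed"] = True
    return state

def post_process(state, registry_view):
    # XNet routing phase: output queues are routed into streams.
    state["routed"] = True
    return state

def process_batch(batch, own_subnet, registry_view):
    state = {}                                  # prior replicated state
    decoded = {"xnet_payload": batch["xnet_payload"]}
    state = pre_process(state, decoded, registry_view)
    state = execute(state)
    state = post_process(state, registry_view)
    return state                                # then commit_and_certify

final = process_batch({"xnet_payload": {"subnet_x": "slice"}}, "own", {})
assert final == {"inducted": ["slice"], "executed": True, "routed": True}
```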
&lt;br /&gt;
&lt;br /&gt;
=== Deterministic State Machine ===&lt;br /&gt;
As shown in the sequence diagram above, the deterministic state machine implemented by message routing and execution applies batches provided by consensus to the appropriate state, additionally using some meta information provided by the registry. As discussed above, we will use state of type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; to generally describe the operations of the message-routing-related operations of this component.&lt;br /&gt;
&lt;br /&gt;
[[File:Message-routing-data-flow.png|thumb|Data flow during batch processing]]&lt;br /&gt;
&lt;br /&gt;
The flow diagram below details the operation of the component. Its operation is logically split into three phases.&lt;br /&gt;
&lt;br /&gt;
* The induction phase, where the messages contained in the batch are preprocessed. This includes extracting them from the batch and, subject to their validity and the decision of the VSR, adding them to the induction pool or not.&lt;br /&gt;
&lt;br /&gt;
* The execution phase, where the hypervisor is triggered to perform an execution cycle. The important thing from a message routing perspective is that it will take messages from the input queues and process them, which causes messages to be added to the output queues.&lt;br /&gt;
&lt;br /&gt;
* The XNet message routing phase, where the messages produced in the execution cycle are post-processed. This means that they are taken from the canister-to-canister output queues and routed into the appropriate subnet-to-subnet streams.&lt;br /&gt;
&lt;br /&gt;
All messages will be added to the respective destination queue/stream preserving the order they appear in the respective source stream/queue.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=File:Message-routing-data-flow.png&amp;diff=3391</id>
		<title>File:Message-routing-data-flow.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=File:Message-routing-data-flow.png&amp;diff=3391"/>
		<updated>2022-11-03T12:16:01Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Data flow during deterministic processing&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3390</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3390"/>
		<updated>2022-11-03T12:12:33Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the message streams produced on one subnet available to the subnets they are addressed to.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication between canisters across subnets. This means that this layer formally does not add any guarantees beyond those the system already provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from natural numbers to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
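&lt;br /&gt;
To make these definitions concrete, the three queue operations can be sketched in executable form. This is a minimal Python sketch, under the assumption that &amp;lt;code&amp;gt;rank&amp;lt;/code&amp;gt; denotes the 1-based rank of a key within the key set of the input map; all names are illustrative and not taken from the actual implementation.&lt;br /&gt;

```python
# Minimal sketch of the abstract Queue<T>: a rolling next_index plus a
# partial map of elements. Illustrative only; not the IC implementation.
class Queue:
    def __init__(self):
        self.next_index = 1  # index of the next message to be inserted
        self.elements = {}   # partial map N -> T

    def push(self, values):
        # Append values in key order: the key with 1-based rank k in
        # `values` lands at index next_index - 1 + k.
        for offset, key in enumerate(sorted(values)):
            self.elements[self.next_index + offset] = values[key]
        self.next_index += len(values)

    def delete(self, values):
        # REQUIRE: values is a sub-map of elements; next_index is kept.
        for index, value in values.items():
            assert self.elements.get(index) == value
            del self.elements[index]

    def clear(self):
        # Drop all elements but keep next_index.
        self.elements = {}
```

Note that &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; discards the keys of the input map; only their relative order matters, since the queue assigns fresh consecutive indices starting at &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;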
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
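&lt;br /&gt;
The lifted map variant can be sketched as follows for &amp;lt;code&amp;gt;f = push&amp;lt;/code&amp;gt;. This is an illustrative Python sketch assuming the &amp;lt;code&amp;gt;Queue&amp;lt;/code&amp;gt; semantics defined above; it mutates queues in place rather than building new ones, but covers the three cases of the definition: ids present in both maps, ids only in the value map, and ids only in the queue map.&lt;br /&gt;

```python
# Sketch of f_map for f = push: lift push from a single queue to a map
# of queues keyed by an identifier. Illustrative names only.
class Queue:
    def __init__(self):
        self.next_index = 1
        self.elements = {}

    def push(self, values):
        for offset, key in enumerate(sorted(values)):
            self.elements[self.next_index + offset] = values[key]
        self.next_index += len(values)

def push_map(queues, values):
    """queues: id -> Queue, values: (id, n) -> T."""
    # Group the flat (id, n) -> T map by identifier.
    per_id = {}
    for (ident, key), value in values.items():
        per_id.setdefault(ident, {})[key] = value
    result = dict(queues)  # ids without new values keep their queue as-is
    for ident, vals in per_id.items():
        queue = result.get(ident, Queue())  # fresh queue for unknown ids
        queue.push(vals)
        result[ident] = queue
    return result
```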
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
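&lt;br /&gt;
These operations can be sketched directly on tuples. The following Python sketch (illustrative only) represents an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; as a tuple whose last component is a natural number.&lt;br /&gt;

```python
# Sketch of the Index operations: prefix, postfix, increment, and the
# partial order, which only relates indices with equal prefixes.
def prefix(i):
    return i[:-1]  # all components except the last

def postfix(i):
    return i[-1]   # the trailing natural number

def inc(i):
    # i + 1 := concatenate(prefix(i), postfix(i) + 1)
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # Indices with different prefixes are incomparable.
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)
```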
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet the stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt; which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead, one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where everything transport-specific is decoded into a format suitable for processing, and parts which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the own subnet id and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
[[File:MR Interactions.png|thumb|Interactions of message routing with other components during a deterministic processing round]]&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other conditions, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3389</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3389"/>
		<updated>2022-11-03T12:10:07Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making this subnet&#039;s outgoing streams available so that other subnets can include slices of them in their blocks.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees beyond those the system already provides, but simply needs to make sure that system invariants are preserved. Those system invariants include:&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet ids, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
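As an illustration (not part of the specification), the abstract queue and its associated functions can be modeled with Python dicts standing in for partial maps; the class and method names mirror the pseudocode above, while the concrete representation is an assumption.

```python
# Illustrative model of the abstract Queue<T>; dicts stand in for partial maps.

class Queue:
    def __init__(self):
        self.next_index = 1   # rolling index of the next message to insert
        self.elements = {}    # partial map ℕ ↦ T

    def push(self, values):
        # Append `values` (a partial map ℕ ↦ T) in key order; the keys are
        # re-based onto the rolling index, matching the rank-based comprehension.
        for k, key in enumerate(sorted(values)):   # k = rank(key, dom(values)) - 1
            self.elements[self.next_index + k] = values[key]
        self.next_index += len(values)

    def delete(self, values):
        # Remove the given (index ↦ element) pairs; next_index is kept.
        assert all(self.elements.get(i) == t for i, t in values.items())
        for i in values:
            del self.elements[i]

    def clear(self):
        # Drop all elements but keep the rolling next_index.
        self.elements = {}

q = Queue()
q.push({7: "a", 9: "b"})   # re-based to indices 1 and 2
q.push({1: "c"})           # re-based to index 3
q.delete({2: "b"})
```

Note that &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; re-bases the keys of the inserted map onto the queue's rolling index, which is exactly what the rank-based set comprehension above expresses.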
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated with &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
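Under the same illustrative dict-based model, the map-lifted variant can be sketched as follows (only the push case is shown; a minimal queue model is repeated so the snippet is self-contained, and all names are assumptions):

```python
# Illustrative sketch of f_map for f = push over a map of queues.

class Queue:
    def __init__(self):
        self.next_index = 1   # rolling index
        self.elements = {}    # partial map ℕ ↦ T

    def push(self, values):
        # Append values in key order, re-based onto the rolling index.
        for k, key in enumerate(sorted(values)):
            self.elements[self.next_index + k] = values[key]
        self.next_index += len(values)

def push_map(queues, values):
    # Lift Queue.push: identifiers not yet in `queues` start from a fresh
    # queue, identifiers not mentioned in `values` are kept unchanged.
    out = dict(queues)
    per_id = {}
    for (qid, j), t in values.items():   # group values by identifier
        per_id.setdefault(qid, {})[j] = t
    for qid, vals in per_id.items():
        q = out.get(qid, Queue())
        q.push(vals)
        out[qid] = q
    return out

qs = push_map({}, {("A", 5): "x", ("A", 8): "y", ("B", 1): "z"})
```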
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
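As an illustration of these definitions, an index can be modeled as a Python tuple whose last component is a natural number (the function names and tuple encoding are assumptions):

```python
# Illustrative model of Index operations on tuples.

def prefix(i):
    # All elements of the index except the last one.
    return i[:-1]

def postfix(i):
    # The last element; required to be a natural number.
    return i[-1]

def incr(i):
    # i + 1: same prefix, last component incremented.
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # Defined only for comparable indices, i.e., indices with equal prefixes.
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)

qi = ("src_canister", "dst_canister", 4)   # a QueueIndex-shaped tuple
```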
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet the stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet a stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams, indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; constituting a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific parts are decoded into a format suitable for processing, and parts which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the own subnet ID and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
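The decoding step can be sketched as follows, with batches and slices modeled as plain dicts and a stub standing in for the state manager's &amp;lt;code&amp;gt;decode_valid_certified_stream&amp;lt;/code&amp;gt; routine (the real state manager API may differ):

```python
# Illustrative sketch of `decode`; the dict-based Batch/DecodedBatch modeling
# and StubStateManager are assumptions, not the actual replica API.

class StubStateManager:
    def decode_valid_certified_stream(self, own_subnet, cert_slice):
        # Stand-in for the state manager's decoding routine: pretend the
        # certification checks succeeded and unwrap the payload.
        return cert_slice["payload"]

def decode(own_subnet, batch, state_manager):
    # The ingress payload is passed through unchanged; every
    # CertifiedStreamSlice in the xnet payload is decoded into a StreamSlice.
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src: state_manager.decode_valid_certified_stream(own_subnet, cert)
            for src, cert in batch["xnet_payload"].items()
        },
    }

batch = {
    "ingress_payload": {1: "ingress_msg"},
    "xnet_payload": {"subnet_x": {"payload": "stream_slice_x"}},
}
decoded = decode("own_subnet", batch, StubStateManager())
```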
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
[[File:Message Routing Components.png|thumb|Components interacting with message routing during a deterministic processing round]]&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprised of the message routing and execution layer. This includes&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other conditions, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;br /&gt;
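The steps above can be sketched as a simple composition of the three phases; the phase names follow the pseudocode references in the text (&amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), while the driver function and stub bodies are purely illustrative:

```python
# Illustrative sketch of one deterministic processing round; the stubs and
# driver are assumptions, not the actual replica API.

def process_batch(state, decoded_batch, pre_process, execute, post_process):
    state = pre_process(state, decoded_batch)  # induct valid messages
    state = execute(state)                     # execute inducted messages
    state = post_process(state)                # route outputs into streams
    return state                               # caller commits the result

trace = []  # record the order in which the phases run

final_state = process_batch(
    {}, {},
    pre_process=lambda st, b: (trace.append("pre_process"), st)[1],
    execute=lambda st: (trace.append("execute"), st)[1],
    post_process=lambda st: (trace.append("post_process"), st)[1],
)
```

After the three phases, the caller commits the resulting state via the state manager (cf. &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt; above).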
&lt;br /&gt;
The sequence diagram below provides further insights into a deterministic processing cycle.&lt;br /&gt;
[[File:MR Interactions.png|1024px|frameless|center|Interactions of message routing with other components during a deterministic processing round]]&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=File:MR_Interactions.png&amp;diff=3388</id>
		<title>File:MR Interactions.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=File:MR_Interactions.png&amp;diff=3388"/>
		<updated>2022-11-03T12:07:17Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Interactions of message routing with other components during a deterministic processing round&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=File:Message_Routing_Components.png&amp;diff=3387</id>
		<title>File:Message Routing Components.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=File:Message_Routing_Components.png&amp;diff=3387"/>
		<updated>2022-11-03T12:06:00Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Components message routing interacts with in a deterministic processing cycle&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3386</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3386"/>
		<updated>2022-11-03T11:31:48Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain, and one can view the execution by the upper layers of the batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain the replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the streams produced on a subnet available so that slices of them can be transported to their destination subnets.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any new guarantees on top of those the system provides, but simply needs to make sure that system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
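For illustration only, the &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; lookup that message routing relies on when routing a canister-to-canister message can be modeled as a plain dictionary lookup (the dict-based registry and the helper name are assumptions):

```python
# Illustrative model of the registry fields used by message routing.

registry = {
    "subnets": {"subnet_1", "subnet_2"},
    "subnet_assignment": {"canister_a": "subnet_1", "canister_b": "subnet_2"},
}

def destination_subnet(registry, canister_id):
    # A message is routed to the subnet its destination canister is
    # assigned to in the registry.
    return registry["subnet_assignment"][canister_id]
```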
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map; e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that, we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, message routing will only ever push messages to input queues and only ever pull messages from output queues; the opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated with &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
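As a rough illustration (not the production implementation), the map-lifted variant of &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; can be modeled in Python, with a queue modeled as a dict and &amp;lt;code&amp;gt;rank&amp;lt;/code&amp;gt; taken to be the position in sorted key order; the &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; case is analogous:

```python
def new_queue():
    # next_index starts at 1, elements is an empty partial map (dict).
    return {"next_index": 1, "elements": {}}

def push(queue, values):
    # Append `values` (a partial map n -> T) in key order, renumbering
    # contiguously from queue["next_index"] (the spec's rank-based rule).
    q = {"next_index": queue["next_index"] + len(values),
         "elements": dict(queue["elements"])}
    for k, key in enumerate(sorted(values), start=1):
        q["elements"][queue["next_index"] - 1 + k] = values[key]
    return q

def push_map(queues, values_by_id):
    # Lift `push` to maps of queues: existing queues are extended, missing
    # queues are created fresh, untouched queues are carried over unchanged.
    result = dict(queues)
    for ident, values in values_by_id.items():
        base = queues.get(ident, new_queue())
        result[ident] = push(base, values)
    return result
```

For example, pushing into an empty map creates the target queue on the fly, while other queues in the map are left untouched.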
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
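The index semantics above can be sketched in Python, with an index modeled as a tuple whose last component is a natural number:

```python
def prefix(i):
    # All components except the last one.
    return i[:-1]

def postfix(i):
    # The trailing sequence number (a natural number).
    return i[-1]

def incr(i):
    # i + 1 := concatenate(prefix(i), postfix(i) + 1)
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # i <= j holds only for comparable indices, i.e., equal prefixes.
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)
```

Note that &amp;lt;code&amp;gt;leq&amp;lt;/code&amp;gt; returns false for incomparable indices, so it is a partial order, not a total one.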
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation, &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by the destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the begin and end indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing; instead, one relies on the encoding and decoding routines provided by the state manager: a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager, and can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState,&lt;br /&gt;
    witness   : Witness,&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple fields, including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; containing a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height,&lt;br /&gt;
    registry_version         : RegistryVersion,&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message,&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice,&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific encodings are decoded into a format suitable for processing, and fields which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message,&lt;br /&gt;
    xnet_payload    : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently, this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt;, because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the subnet&#039;s own ID and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
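The &amp;lt;code&amp;gt;decode&amp;lt;/code&amp;gt; function can be sketched in Python as follows; the &amp;lt;code&amp;gt;decode_valid_certified_stream&amp;lt;/code&amp;gt; stub below merely stands in for the state manager routine of the same name, whose actual behavior (certification checking, slice reconstruction) is specified in the state manager document:

```python
def decode_valid_certified_stream(own_subnet, cert_slice):
    # Stand-in stub: the real routine verifies the certification and
    # reconstructs the StreamSlice from the partial canonical state.
    return cert_slice["payload"]

def decode(own_subnet, batch):
    # The ingress payload is passed through as-is; each CertifiedStreamSlice
    # in the xnet payload is decoded into a StreamSlice.
    return {
        "ingress_payload": batch["ingress_payload"],
        "xnet_payload": {
            src: decode_valid_certified_stream(own_subnet, cert)
            for src, cert in batch["xnet_payload"].items()
        },
    }
```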
&lt;br /&gt;
== Message Routing ==&lt;br /&gt;
Message routing is triggered by incoming batches from consensus. For each &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;, message routing will perform the following steps:&lt;br /&gt;
&lt;br /&gt;
* Obtain the &amp;lt;code&amp;gt;ReplicatedState s&amp;lt;/code&amp;gt; of the right version w.r.t. &amp;lt;code&amp;gt;Batch b&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Submit &amp;lt;code&amp;gt;s&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; for processing by the deterministic state machine comprising the message routing and execution layers. This includes:&lt;br /&gt;
&lt;br /&gt;
** An induction phase (cf. &amp;lt;code&amp;gt;pre_process&amp;lt;/code&amp;gt;), where the valid messages in &amp;lt;code&amp;gt;decode(own_subnet, b)&amp;lt;/code&amp;gt; are inducted. Among other conditions, a message &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; in a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is considered valid if &amp;lt;code&amp;gt;registry.get_registry_at(b.registry_version).subnet_assignment&amp;lt;/code&amp;gt; maps &amp;lt;code&amp;gt;m.src&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
** An execution phase (cf. &amp;lt;code&amp;gt;execute&amp;lt;/code&amp;gt;), which executes messages available in the induction pool.&lt;br /&gt;
&lt;br /&gt;
** An XNet message routing phase (cf. &amp;lt;code&amp;gt;post_process&amp;lt;/code&amp;gt;), which moves the messages produced in the execution phase from the per-session output queues to the subnet-to-subnet streams according to the mapping defined by the subnet assignment in the registry.&lt;br /&gt;
&lt;br /&gt;
* Commit the replicated state, incrementally updated by the previous steps, to the state manager via &amp;lt;code&amp;gt;commit_and_certify&amp;lt;/code&amp;gt;.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3385</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3385"/>
		<updated>2022-11-03T11:16:02Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous one. Each block has an associated height in the chain, and one can view the execution by the upper layers of the batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain the replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
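This height-indexed view can be sketched as a fold over the batch sequence; &amp;lt;code&amp;gt;apply_batch&amp;lt;/code&amp;gt; is a hypothetical placeholder for the deterministic processing performed by the upper layers, illustrated here with a toy state transition:

```python
def apply_batch(state, batch):
    # Toy state transition: record which batch heights were processed.
    # In the real system, this is the deterministic batch processing
    # performed by message routing and execution.
    return state + [batch["batch_number"]]

def replay(genesis_state, batches):
    # State of version x results from folding batches 1..x over genesis,
    # in height order.
    state = genesis_state
    for b in sorted(batches, key=lambda b: b["batch_number"]):
        state = apply_batch(state, b)
    return state
```

Because processing is deterministic, every replica that folds the same batches over the same genesis state arrives at the same state.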
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the streams produced on one subnet available to their destination subnets so that they can be inducted there.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier-to-digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component works and how it interacts with other components. The implementation also contains several optimizations which are not important for the conceptual overview here and are therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation-specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operations the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this implicitly defines how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that the layer formally does not add any new guarantees on top of those the system provides, but simply needs to make sure that system invariants are preserved. Those system invariants include:&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer at the subnet the destination canister lives on exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets           : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
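As a hedged sketch, the induction-time validity check described later (a message from subnet &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt; is valid only if &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt; maps its source canister to &amp;lt;code&amp;gt;X&amp;lt;/code&amp;gt;) can be modeled in Python; the flat dict representation of the registry view is an assumption of this sketch:

```python
def is_valid_xnet_message(registry, msg, src_subnet):
    # A message claiming to come from `src_subnet` is authentic only if the
    # registry's canister-to-subnet mapping assigns its source canister to
    # that subnet; unknown canisters fail the check.
    return registry["subnet_assignment"].get(msg["src"]) == src_subnet
```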
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component, together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability, we bundle together the entries of the outermost map in a data structure with multiple fields, where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that, we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
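To illustrate the single-queue semantics (a sketch, not the production code): &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; renumbers incoming values contiguously from &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;, while &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes elements without rewinding &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;, so indices are never reused. &amp;lt;code&amp;gt;rank&amp;lt;/code&amp;gt; is taken as the position in sorted key order:

```python
def push(queue, values):
    # Insert values in key order at next_index, next_index + 1, ...
    elements = dict(queue["elements"])
    for k, j in enumerate(sorted(values)):   # k = rank(j, dom(values)) - 1
        elements[queue["next_index"] + k] = values[j]
    return {"next_index": queue["next_index"] + len(values),
            "elements": elements}

def delete(queue, values):
    # REQUIRE: values is a subset of queue["elements"].
    # next_index is deliberately left unchanged.
    elements = {i: t for i, t in queue["elements"].items()
                if i not in values}
    return {"next_index": queue["next_index"], "elements": elements}
```

Keeping &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; monotone across deletions is what makes queue indices usable as rolling sequence numbers.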
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated with &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
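The flat-stream case above can be sketched in Python, modelling the partial maps as dictionaries. This is a minimal sketch of the notation only; the names and values are illustrative, not the actual replica API.&lt;br /&gt;

```python
# Sketch of a flat Stream: SubstreamId is the unit type, modelled as ().
# StreamIndex = SubstreamId x N is modelled as a (substream_id, n) tuple.
UNIT = ()

stream = {
    # signals : StreamIndex -> {ACCEPT, REJECT}
    "signals": {(UNIT, 1): "ACCEPT", (UNIT, 2): "REJECT"},
    # msgs : SubstreamId -> Queue<Message>
    "msgs": {UNIT: {"next_index": 3, "elements": {1: "m1", 2: "m2"}}},
}

def stream_msgs_elements(s):
    """Flatten msgs so that it is indexed by StreamIndex, i.e.
    Stream.msgs.elements : StreamIndex -> Message."""
    return {
        (sid, i): msg
        for sid, queue in s["msgs"].items()
        for i, msg in queue["elements"].items()
    }
```

With a single unit-typed substream, the flattening simply pairs each queue index with &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, illustrating why &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;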
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams, indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the &amp;lt;code&amp;gt;begin&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;end&amp;lt;/code&amp;gt; indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing, but instead one relies on the encoding and decoding routines provided by the state manager: A &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;br /&gt;
&lt;br /&gt;
==== Batch ====&lt;br /&gt;
A batch consists of multiple elements, including an &amp;lt;code&amp;gt;ingress_payload&amp;lt;/code&amp;gt; containing a sequence of ingress messages, and an &amp;lt;code&amp;gt;xnet_payload&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Batch {&lt;br /&gt;
    batch_number             : Height&lt;br /&gt;
    registry_version         : RegistryVersion&lt;br /&gt;
    ingress_payload          : ℕ ↦ Message&lt;br /&gt;
    xnet_payload             : SubnetId ↦ CertifiedStreamSlice&lt;br /&gt;
    requires_full_state_hash : { TRUE, FALSE }&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== Decoded Batch ====&lt;br /&gt;
A decoded batch represents a batch where all transport-specific encodings are decoded into a format suitable for processing, and fields which are not required inside the deterministic state machine are stripped off.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;DecodedBatch {&lt;br /&gt;
    ingress_payload : ℕ ↦ Message&lt;br /&gt;
    xnet_payload : SubnetId ↦ StreamSlice&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Currently, this only means decoding the &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; because we assume that the ingress payload is suitable to be processed right away. Formally, there is a function which, given the subnet&#039;s own ID and a batch, decodes the batch into a decoded batch:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;decode : SubnetId × Batch → DecodedBatch&lt;br /&gt;
decode(own_subnet, b) :=&lt;br /&gt;
    DecodedBatch {&lt;br /&gt;
        with&lt;br /&gt;
           ├─ ingress_payload := b.ingress_payload&lt;br /&gt;
           └─ xnet_payload :=&lt;br /&gt;
                  { (src_subnet ↦ slice) |&lt;br /&gt;
                      (src_subnet ↦ cert_slice) ∈ b.xnet_payload ∧&lt;br /&gt;
                      slice = StateManager.decode_valid_certified_stream(own_subnet,&lt;br /&gt;
                                                                         cert_slice&lt;br /&gt;
                                                                        )&lt;br /&gt;
                  }&lt;br /&gt;
    }&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3384</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3384"/>
		<updated>2022-11-03T11:13:35Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain, and one can view the execution by the upper layers of the batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain the replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the subnet&#039;s outgoing streams available to other subnets and processing the stream slices received from them.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier-to-digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees on top of those the system already provides, but simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed exactly once to the execution layer on the subnet that the destination canister lives on.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as the canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
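As a minimal sketch of how message routing uses these two fields, the canister-to-subnet lookup could look as follows in Python. The IDs are purely illustrative, not real registry entries, and the dictionaries only model the two fields named above.&lt;br /&gt;

```python
# Illustrative model of the two registry fields used by message routing.
registry = {
    "subnets": {"subnet_A", "subnet_B"},
    "subnet_assignment": {"canister_1": "subnet_A", "canister_2": "subnet_B"},
}

def destination_subnet(registry, canister_id):
    """Look up the subnet a canister is assigned to; None if unassigned."""
    subnet = registry["subnet_assignment"].get(canister_id)
    if subnet is not None:
        # An assignment must reference an existing subnet.
        assert subnet in registry["subnets"]
    return subnet
```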
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly, the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from naturals to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue, keeping the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
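The three associated functions can be sketched in Python, with dictionaries modelling the partial maps and &amp;lt;code&amp;gt;rank(j, dom(values))&amp;lt;/code&amp;gt; taken as the 1-based rank of &amp;lt;code&amp;gt;j&amp;lt;/code&amp;gt; among the keys of &amp;lt;code&amp;gt;values&amp;lt;/code&amp;gt;. This is a sketch of the semantics above, not the replica&#039;s implementation.&lt;br /&gt;

```python
def new_queue():
    # new_queue: elements is the empty partial map, next_index starts at 1.
    return {"next_index": 1, "elements": {}}

def push(queue, values):
    """Append `values` (a partial map N -> T) in key order, renumbering the
    appended elements to start at queue['next_index']."""
    elements = dict(queue["elements"])
    for k, j in enumerate(sorted(values)):   # k + 1 = rank(j, dom(values))
        elements[queue["next_index"] + k] = values[j]
    return {"next_index": queue["next_index"] + len(values),
            "elements": elements}

def delete(queue, values):
    """Remove the given index-to-element pairs, keeping next_index."""
    # REQUIRE: values is a subset of queue.elements
    assert all(queue["elements"].get(i) == t for i, t in values.items())
    kept = {i: t for i, t in queue["elements"].items() if i not in values}
    return {"next_index": queue["next_index"], "elements": kept}

def clear(queue):
    """Drop all elements, keeping next_index."""
    return {"next_index": queue["next_index"], "elements": {}}
```

Note that &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; deliberately leave &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; untouched, so indices are never reused.&lt;br /&gt;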
We often work with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
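The lifting above can be sketched as follows, assuming &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; is given already grouped by identifier; &amp;lt;code&amp;gt;new_queue&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; are minimal stand-ins for the definitions in the previous subsection.&lt;br /&gt;

```python
def new_queue():
    return {"next_index": 1, "elements": {}}

def push(queue, values):
    # Minimal stand-in for the push defined above: append in key order.
    elements = dict(queue["elements"])
    for k, j in enumerate(sorted(values)):
        elements[queue["next_index"] + k] = values[j]
    return {"next_index": queue["next_index"] + len(values),
            "elements": elements}

def f_map(f, q, v):
    """Lift a per-queue function f to maps of queues: queues with matching
    values are updated, missing queues are created fresh, and queues with
    no matching values pass through unchanged."""
    out = {}
    for ident in set(q) | set(v):
        queue = q.get(ident, new_queue())
        out[ident] = f(queue, v[ident]) if ident in v else queue
    return out
```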
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
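These definitions can be illustrated with plain Python tuples whose last component is the sequence number. This is a sketch of the semantics above, not replica code.&lt;br /&gt;

```python
def prefix(i):
    # All elements of the index except the last one.
    return i[:-1]

def postfix(i):
    # The trailing natural-number sequence number.
    return i[-1]

def succ(i):
    """i + 1 := concatenate(prefix(i), postfix(i) + 1)."""
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    """The partial order: i and j are comparable only if their
    prefixes agree; otherwise leq is False in both directions."""
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)
```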
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; consists of a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams, indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the Streams type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==== (Certified) Stream Slices ====&lt;br /&gt;
&amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt;, respectively, are used to transport streams from one subnet to another within &amp;lt;code&amp;gt;XNetPayloads&amp;lt;/code&amp;gt; that are part of consensus blocks. Essentially, a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is a slice of a stream which retains the &amp;lt;code&amp;gt;begin&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;end&amp;lt;/code&amp;gt; indices of the original stream. A &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; is wrapped in a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; for transport so that authenticity can be guaranteed. Neither &amp;lt;code&amp;gt;CertifiedStreamSlices&amp;lt;/code&amp;gt; nor &amp;lt;code&amp;gt;StreamSlices&amp;lt;/code&amp;gt; are ever explicitly created within message routing, but instead one relies on the encoding and decoding routines provided by the state manager: A &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; is created by calling the respective encoding routine of the state manager. Such a &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt; can then be decoded into a &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; using the corresponding decoding routine provided by the state manager.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamSlice {&lt;br /&gt;
    stream    : Stream,&lt;br /&gt;
    begin     : Set&amp;lt;StreamIndex&amp;gt;,&lt;br /&gt;
    end       : Set&amp;lt;StreamIndex&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CertifiedStreamSlice {&lt;br /&gt;
    payload   : PartialCanonicalState&lt;br /&gt;
    witness   : Witness&lt;br /&gt;
    signature : Certification&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the precise relation of &amp;lt;code&amp;gt;StreamSlice&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CertifiedStreamSlice&amp;lt;/code&amp;gt;, refer to the specification of the state manager.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3383</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3383"/>
		<updated>2022-11-03T11:07:02Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain, and one can view the execution by the upper layers of the batch of messages corresponding to the agreed-upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt; and &amp;quot;applying&amp;quot; the batch to it to obtain the replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the message streams produced on this subnet available to other subnets, and inducting the streams received from other subnets.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees beyond the ones the system already provides; it simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer on the subnet that the destination canister lives on, exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
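As an illustration, the two registry fields could be modelled as follows; &amp;lt;code&amp;gt;route&amp;lt;/code&amp;gt; is a hypothetical helper, and string-valued IDs are an assumption made purely for readability:&lt;br /&gt;

```python
# Toy model of the registry fields used by message routing:
# the set of subnets and the canister-to-subnet assignment.
from dataclasses import dataclass, field

@dataclass
class Registry:
    subnets: set = field(default_factory=set)
    subnet_assignment: dict = field(default_factory=dict)  # CanisterId to SubnetId

    def route(self, canister_id):
        # Returns the subnet hosting the canister, or None if unassigned.
        return self.subnet_assignment.get(canister_id)

reg = Registry(subnets={"subnet_a", "subnet_b"},
               subnet_assignment={"canister_1": "subnet_a"})
```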
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from natural numbers to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue while keeping &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue while keeping &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
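The three queue functions above can be modelled directly; the dictionary encoding below is a sketch, assuming &amp;lt;code&amp;gt;rank(j, dom(values))&amp;lt;/code&amp;gt; denotes the 1-based position of &amp;lt;code&amp;gt;j&amp;lt;/code&amp;gt; among the sorted keys of &amp;lt;code&amp;gt;values&amp;lt;/code&amp;gt;:&lt;br /&gt;

```python
# Runnable model of Queue(T): a rolling next_index plus a partial map
# of elements, with indices being natural numbers starting at 1.

def new_queue():
    return {"next_index": 1, "elements": {}}

def rank(j, keys):
    # 1-based position of key j among the sorted keys.
    return sorted(keys).index(j) + 1

def push(q, values):
    i = q["next_index"]
    # Element i - 1 + rank(j) receives the value stored under key j.
    appended = {i - 1 + rank(j, values.keys()): t for j, t in values.items()}
    return {"next_index": i + len(values),
            "elements": {**q["elements"], **appended}}

def delete(q, values):
    # REQUIRE: values is a subset of the elements; next_index is kept.
    kept = {k: t for k, t in q["elements"].items()
            if (k, t) not in values.items()}
    return {"next_index": q["next_index"], "elements": kept}

def clear(q):
    # Drop all elements but keep the rolling next_index.
    return {"next_index": q["next_index"], "elements": {}}
```

Note that after a &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt;, subsequent pushes continue from the preserved &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt;, so indices are never reused.&lt;br /&gt;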
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a partial map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
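The lifting to maps of queues can be sketched as follows; for readability, &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; is grouped per identifier rather than keyed by composite pairs, and &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; is passed explicitly (both are presentation choices, not part of the definition above):&lt;br /&gt;

```python
# Sketch of the map-lifted f_map for f in {push, delete}: identifiers
# present in both maps get f applied, identifiers only in v start from
# a fresh queue, and identifiers only in q are kept untouched.

def new_queue():
    return {"next_index": 1, "elements": {}}

def push(q, values):
    # Simplified push: appends values in sorted-key order.
    i = q["next_index"]
    appended = {i + k: t for k, (j, t) in enumerate(sorted(values.items()))}
    return {"next_index": i + len(values),
            "elements": {**q["elements"], **appended}}

def f_map(f, q, v):
    result = {}
    for ident in set(q) | set(v):
        if ident in q and ident in v:
            result[ident] = f(q[ident], v[ident])     # both present
        elif ident in v:
            result[ident] = f(new_queue(), v[ident])  # fresh queue
        else:
            result[ident] = q[ident]                  # untouched
    return result
```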
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary-length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…​|i| - 1] = (x, …​, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …​, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
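Modelling indices as plain tuples whose last component is the sequence number, the operations above read as follows (an illustrative sketch):&lt;br /&gt;

```python
# Indices as tuples whose last component is a natural number.

def prefix(i):
    return i[:-1]

def postfix(i):
    return i[-1]

def increment(i):
    # i + 1: same prefix, sequence number bumped by one.
    return prefix(i) + (postfix(i) + 1,)

def comparable(i, j):
    # Two indices are comparable only if their prefixes agree.
    return prefix(i) == prefix(j)

def leq(i, j):
    # i is at most j iff prefixes agree and j's sequence number
    # is at least i's.
    return comparable(i, j) and postfix(j) >= postfix(i)
```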
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per destination basis. Messages in ingress queues are indexed by a concrete instance of Index called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
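To illustrate how the per-pair organization yields &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;-keyed elements, the following sketch flattens a map of output queues (queues are encoded as plain dictionaries here, purely for illustration):&lt;br /&gt;

```python
# The per-pair queue map flattened to a QueueIndex-keyed view, where
# QueueIndex = (source CanisterId, destination CanisterId, nat).

def flatten_elements(queues):
    # queues: (src, dst) to {"next_index": n, "elements": {nat: msg}}
    view = {}
    for (src, dst), q in queues.items():
        for n, msg in q["elements"].items():
            view[(src, dst, n)] = msg
    return view

output_queues = {
    ("canA", "canB"): {"next_index": 3, "elements": {1: "m1", 2: "m2"}},
    ("canA", "canC"): {"next_index": 2, "elements": {1: "m3"}},
}
```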
&lt;br /&gt;
==== Streams ====&lt;br /&gt;
Each individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; is scoped to a pair of subnets: the subnet a stream originates from and the subnet the stream is targeted at. An individual stream is organized in multiple substreams identified by a &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt;. The concrete definition of &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is up to the implementation. In the current implementation &amp;lt;code&amp;gt;SubstreamId&amp;lt;/code&amp;gt; is defined to be the unit type &amp;lt;code&amp;gt;()&amp;lt;/code&amp;gt;, i.e., we have flat streams. Messages in streams are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;StreamIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;StreamIndex : SubstreamId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
A &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; comprises a sequence of &amp;lt;code&amp;gt;Signal&amp;lt;/code&amp;gt; messages &amp;lt;code&amp;gt;signals&amp;lt;/code&amp;gt; and a sequence of canister-to-canister messages &amp;lt;code&amp;gt;msgs&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Stream {&lt;br /&gt;
    signals : StreamIndex ↦ {ACCEPT, REJECT},&lt;br /&gt;
    msgs    : SubstreamId ↦ Queue&amp;lt;Message&amp;gt;&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;Stream.msgs.elements : StreamIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
While the subnet the stream originates from is implicitly determined, the target subnet needs to be made explicit. Hence, we define a data structure &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; holding all streams indexed by destination subnet:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams : SubnetId ↦ Stream&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We may sometimes abuse the notation and directly access the fields defined for an individual &amp;lt;code&amp;gt;Stream&amp;lt;/code&amp;gt; on the &amp;lt;code&amp;gt;Streams&amp;lt;/code&amp;gt; type, in which case we obtain maps of the following type:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Streams.signals : SubnetId ↦ (StreamIndex ↦ {ACCEPT, REJECT})&lt;br /&gt;
&lt;br /&gt;
Streams.msgs    : SubnetId ↦ (SubstreamId ↦ Queue&amp;lt;Message&amp;gt;)&amp;lt;/nowiki&amp;gt;&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
	<entry>
		<id>https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3382</id>
		<title>IC message routing layer</title>
		<link rel="alternate" type="text/html" href="https://wiki.internetcomputer.org/w/index.php?title=IC_message_routing_layer&amp;diff=3382"/>
		<updated>2022-11-03T11:03:42Z</updated>

		<summary type="html">&lt;p&gt;David: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
The Internet Computer (IC) achieves its security and fault tolerance by replicating computation across node machines located in various independent data centers across the world. For scalability reasons, the Internet Computer Protocol (ICP) composes the IC of multiple independent subnets. Each subnet can be viewed as an independent replicated state machine that replicates its state over a subset of all the available nodes.&lt;br /&gt;
&lt;br /&gt;
Roughly speaking, replication is achieved by having the two lower ICP layers (P2P &amp;amp; Consensus) agree on blocks containing batches of messages to be executed, and then having the two upper ICP layers (Message Routing &amp;amp; Execution) execute them. Blocks are organized as a chain, where each block builds on the previous block. Each block has an associated height in the chain and one can look at execution of a batch of messages corresponding to the agreed upon block at height &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt; by the upper layers as taking the replicated state of version &amp;lt;math&amp;gt;x-1&amp;lt;/math&amp;gt;, and &amp;quot;applying&amp;quot; the batch to it to obtain replicated state of version &amp;lt;math&amp;gt;x&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In this document we describe the role of the Message Routing layer in deterministic batch processing. Its responsibilities are:&lt;br /&gt;
* &#039;&#039;&#039;Coordinating the deterministic processing of batches:&#039;&#039;&#039; Fetching the right versions of the replicated state and the registry view to process the batch, triggering the deterministic processing, and committing the resulting replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Deterministic processing of batches:&#039;&#039;&#039; Deterministic processing of batches relative to some replicated state and some registry view, resulting in an updated replicated state.&lt;br /&gt;
&lt;br /&gt;
* &#039;&#039;&#039;Transferring message streams from one subnet to another:&#039;&#039;&#039; Making the message streams produced on this subnet available to other subnets, and inducting the streams received from other subnets.&lt;br /&gt;
&lt;br /&gt;
=== Remarks and Required Prior Knowledge ===&lt;br /&gt;
&lt;br /&gt;
* The goal of this document is to provide the next level of detail compared to the material in the [https://internetcomputer.org/how-it-works &amp;quot;How it works&amp;quot; section of internetcomputer.org]. So it is recommended to study the material available there first.&lt;br /&gt;
* This page builds upon definitions made in the page describing the [[IC state manager|state manager]]. Please refer to this page for missing definitions related to the replicated state etc.&lt;br /&gt;
* Also see [https://mmapped.blog/posts/08-ic-xnet.html this] and [https://mmapped.blog/posts/02-ic-state-machine-replication.html this] blog post for some relevant and easier to digest background information.&lt;br /&gt;
* The documentation provided in this page may slightly deviate from the current implementation in terms of API as well as naming of functions, variables, etc. However, it still conveys the high-level ideas required to understand how the component itself works and how it interacts with other components. The implementation also contains several optimizations which are, however, not important for the conceptual overview here and therefore skipped.&lt;br /&gt;
* The notation used in this page is described [[Notation|here]].&lt;br /&gt;
&lt;br /&gt;
=== Replicated vs. Canonical State ===&lt;br /&gt;
While the external API functions defined in this document will always take state in its implementation specific representation, i.e., as &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;, we describe the operation the message routing component performs on the state based on its canonical representation, i.e., the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt;. Given the relations between &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; as defined in the specification of the state manager, this will implicitly define how an implementation needs to act on the respective parts of the &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt;. We assume an implicit conversion from &amp;lt;code&amp;gt;ReplicatedState&amp;lt;/code&amp;gt; to &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; whenever we access some state passed to this component via an API function.&lt;br /&gt;
&lt;br /&gt;
== Guarantees Provided by Message Routing ==&lt;br /&gt;
Intuitively, the goal of the message routing layer is to enable transparent communication of canisters across subnets. This means that this layer formally does not add any guarantees beyond the ones the system already provides; it simply needs to make sure that the system invariants are preserved. Those system invariants include&lt;br /&gt;
&lt;br /&gt;
* guaranteed replies (each canister-to-canister request will eventually receive a reply),&lt;br /&gt;
&lt;br /&gt;
* canister-to-canister ordering (the order of canister-to-canister requests sent from one canister to another canister is preserved), and&lt;br /&gt;
&lt;br /&gt;
* authenticity (only messages that come from canisters on the IC are processed).&lt;br /&gt;
&lt;br /&gt;
To ensure that the system invariants hold, message routing needs to provide the following guarantees:&lt;br /&gt;
&lt;br /&gt;
* Canister-to-canister messages will eventually be passed to the execution layer on the subnet that the destination canister lives on, exactly once.&lt;br /&gt;
&lt;br /&gt;
* If a message cannot be delivered, a synthetic reject response must be produced.&lt;br /&gt;
&lt;br /&gt;
* If a canister &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; sends two messages &amp;lt;math&amp;gt;m_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;m_2&amp;lt;/math&amp;gt; to a canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;, then, if none of them gets synthetically rejected, it must be guaranteed that they are put in canister &amp;lt;math&amp;gt;B&amp;lt;/math&amp;gt;&#039;s input queue from &amp;lt;math&amp;gt;A&amp;lt;/math&amp;gt; in that order.&lt;br /&gt;
&lt;br /&gt;
== Preliminaries ==&lt;br /&gt;
=== Description of the Relevant Parts of the Registry ===&lt;br /&gt;
The registry can be viewed as a central store of configuration information of the IC that is maintained by the NNS DAO. The content of the registry is held by a canister on the NNS subnet, and, roughly speaking, its authenticity is guaranteed by obtaining a certification on the content on behalf of the NNS using the certification mechanism as described in the [[IC state manager|state manager]] wiki page. Throughout this document we assume that the registry contents we work with are authentic.&lt;br /&gt;
&lt;br /&gt;
The registry entries required by this component are the set of all existing subnet IDs, as well as a canister-to-subnet mapping &amp;lt;code&amp;gt;subnet_assignment&amp;lt;/code&amp;gt;. Note that the actual implementation may choose to represent the required fields differently as long as they are conceptually equivalent.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Registry {&lt;br /&gt;
    subnets : Set&amp;lt;SubnetId&amp;gt;,&lt;br /&gt;
    subnet_assignment : CanisterId ↦ SubnetId,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Description of the Relevant Canonical State ===&lt;br /&gt;
Below, we define the parts of the canonical state which are relevant for the description of this component together with some constraints we impose on the replicated state. Abstractly the &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; is defined as a nested partial map. For easier readability we bundle together the entries of the outermost map in a data structure with multiple fields where the names of the fields represent the keys in the respective partial map, e.g., for some &amp;lt;code&amp;gt;s : CanonicalState&amp;lt;/code&amp;gt; we can use &amp;lt;code&amp;gt;s.ingress_queues&amp;lt;/code&amp;gt; to access &amp;lt;code&amp;gt;s[ingress_queues]&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
We start by defining the individual fields of the type &amp;lt;code&amp;gt;CanonicalState&amp;lt;/code&amp;gt; which are relevant in the context of this document. After that we give more details about the datatypes of the individual fields. We distinguish between the parts which are exclusively visible to message routing, and the parts which are also visible to the execution layer.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to message routing and execution&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    ingress_queues  : IngressQueues,&lt;br /&gt;
    input_queues    : InputQueues,&lt;br /&gt;
    output_queues   : OutputQueues,&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Parts visible to Message Routing only&#039;&#039;&#039;&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;CanonicalState {&lt;br /&gt;
    ...&lt;br /&gt;
    streams               : Streams,&lt;br /&gt;
    expected_xnet_indices : Set&amp;lt;(SubnetId × StreamIndex)&amp;gt;&lt;br /&gt;
    ...&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Even though there are parts of the state that are accessed by both message routing and execution, one can enforce a conceptual boundary between them. In particular, for input queues we have that message routing will only ever push messages to them, whereas for output queues we have that message routing will only ever pull messages from them. The opposite holds for the execution environment.&lt;br /&gt;
&lt;br /&gt;
==== Abstract Queues ====&lt;br /&gt;
We define a generic queue type &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; which has the following fields:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Queue&amp;lt;T&amp;gt; {&lt;br /&gt;
    next_index : ℕ,     // Rolling index; the index of the next message to be inserted&lt;br /&gt;
    elements   : ℕ ↦ T  // The elements currently in the queue&lt;br /&gt;
}&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We define a new queue as &amp;lt;code&amp;gt;new_queue : Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt; with &amp;lt;code&amp;gt;new_queue.elements = ∅&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;new_queue.next_index = 1&amp;lt;/code&amp;gt;. Furthermore, it has the following associated functions:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;push&amp;lt;/code&amp;gt; takes a queue and a partial map from natural numbers to &amp;lt;code&amp;gt;T&amp;lt;/code&amp;gt;, and returns a new queue consisting of the old queue with the given values appended. It also updates the &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; field so that it points to the index after the last inserted message.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;push : Self × (ℕ ↦ T) → Self&lt;br /&gt;
push(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index + |values|&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           ∪ { (i - 1 + k ↦ t) | i = self.next_index ∧&lt;br /&gt;
                                                 (j ↦ t) ∈ values ∧&lt;br /&gt;
                                                 k = rank(j, dom(values)) }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;delete&amp;lt;/code&amp;gt; removes the given elements from the queue while keeping &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;% REQUIRE: values ⊆ self.elements&lt;br /&gt;
delete : Self × (ℕ ↦ T) → Self&lt;br /&gt;
delete(self, values) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := self.elements&lt;br /&gt;
                           \ values&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;clear&amp;lt;/code&amp;gt; removes all elements from the queue while keeping &amp;lt;code&amp;gt;next_index&amp;lt;/code&amp;gt; unchanged.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;clear : Self → Self&lt;br /&gt;
clear(self) :=&lt;br /&gt;
    self with&lt;br /&gt;
            ├─ next_index := self.next_index&lt;br /&gt;
            └─ elements := ∅&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We are often working with partial maps of type &amp;lt;code&amp;gt;SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;, in which case we will use the following shorthand notation. With &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; being a partial map of the aforementioned type, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; being a partial map of type &amp;lt;code&amp;gt;(SomeIdentifier × ℕ) ↦ T&amp;lt;/code&amp;gt;, we define the following semantics for the functions &amp;lt;code&amp;gt;f ∈ { push, delete }&amp;lt;/code&amp;gt; associated to &amp;lt;code&amp;gt;Queue&amp;lt;T&amp;gt;&amp;lt;/code&amp;gt;:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) × ((SomeIdentifier × ℕ) ↦ T) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q, v) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                                (id ↦ values) ∈ v ∧&lt;br /&gt;
                                queue&#039; = f(queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue&#039;) | (id ↦ values) ∈ v ∧&lt;br /&gt;
                                ∄ (id ↦ ·) ∈ q ∧&lt;br /&gt;
                                queue&#039; = f(Queue&amp;lt;T&amp;gt;::new_queue, values)&lt;br /&gt;
              } ∪&lt;br /&gt;
              { (id ↦ queue) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                               ∄ (id ↦ ·) ∈ v&lt;br /&gt;
              }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
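The three sets of the union above can be sketched as follows. This is an illustrative model only: &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; is assumed functional (&amp;lt;code&amp;gt;Queue × (ℕ ↦ T) → Queue&amp;lt;/code&amp;gt;), queues are modeled as &amp;lt;code&amp;gt;(next_index, elements)&amp;lt;/code&amp;gt; tuples, and &amp;lt;code&amp;gt;v&amp;lt;/code&amp;gt; is assumed to be regrouped per identifier.&lt;br /&gt;

```python
# Sketch of the map-lifted f_map for f ∈ {push, delete}.
# q : SomeIdentifier ↦ Queue, v : SomeIdentifier ↦ (ℕ ↦ T).
def f_map(f, q, v, new_queue):
    result = {}
    # First and third sets of the union: identifiers already present in q
    # are updated if v mentions them, and kept unchanged otherwise.
    for ident, queue in q.items():
        result[ident] = f(queue, v[ident]) if ident in v else queue
    # Second set: identifiers only in v start from a fresh queue.
    for ident, values in v.items():
        if ident not in q:
            result[ident] = f(new_queue(), values)
    return result

# A functional push matching the earlier definition, for demonstration.
def push(queue, values):
    next_index, elements = queue
    new_elements = dict(elements)
    for k, key in enumerate(sorted(values), start=1):
        new_elements[next_index - 1 + k] = values[key]
    return (next_index + len(values), new_elements)
```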
&lt;br /&gt;
For the functions &amp;lt;code&amp;gt;f ∈ { clear }&amp;lt;/code&amp;gt; we use&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;f_map : (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;) → (SomeIdentifier ↦ Queue&amp;lt;T&amp;gt;)&lt;br /&gt;
f_map(q) = { (id ↦ queue&#039;) | (id ↦ queue) ∈ q ∧&lt;br /&gt;
                             queue&#039; = f(queue)&lt;br /&gt;
           }&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will henceforth omit the &amp;lt;code&amp;gt;map&amp;lt;/code&amp;gt; postfix in &amp;lt;code&amp;gt;f_map&amp;lt;/code&amp;gt; and simply use &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; if it is clear from the input type that the map variant of &amp;lt;code&amp;gt;f&amp;lt;/code&amp;gt; should be used.&lt;br /&gt;
&lt;br /&gt;
==== Indices ====&lt;br /&gt;
We define an &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; to be an arbitrary length sequence, where every element in the sequence up to the last one can have an arbitrary type, and the last one is a natural number.&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;Index : X × ... × Y × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In addition, we define the following semantics:&lt;br /&gt;
&lt;br /&gt;
* We define the prefix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;prefix(i) := i[1…|i| - 1] = (x, …, y)&amp;lt;/code&amp;gt;, i.e., it contains all elements of &amp;lt;code&amp;gt;i&amp;lt;/code&amp;gt; except the last one.&lt;br /&gt;
&lt;br /&gt;
* We define the postfix of an Index &amp;lt;code&amp;gt;i := (x, …, y, seq_nr)&amp;lt;/code&amp;gt; as &amp;lt;code&amp;gt;postfix(i) := i[|i|] = seq_nr&amp;lt;/code&amp;gt;, i.e., the last element of the index sequence. As already mentioned, we require the postfix of an index to be a natural number.&lt;br /&gt;
&lt;br /&gt;
* For an &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt;, the operation &amp;lt;math&amp;gt;i + 1&amp;lt;/math&amp;gt; is defined as &amp;lt;code&amp;gt;concatenate(prefix(i), postfix(i) + 1)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* Two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, are incomparable if &amp;lt;code&amp;gt;prefix(i) ≠ prefix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
* For two indices, &amp;lt;code&amp;gt;Index i&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;Index j&amp;lt;/code&amp;gt;, we have that &amp;lt;math&amp;gt;i \leq j&amp;lt;/math&amp;gt; if &amp;lt;code&amp;gt;prefix(i) = prefix(j)&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;postfix(i) ≤ postfix(j)&amp;lt;/code&amp;gt;.&lt;br /&gt;
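The index operations above can be sketched directly. This is an illustrative model: an index is represented as a Python tuple whose last component is a natural number, and the function names are chosen for demonstration.&lt;br /&gt;

```python
# Sketch of the Index operations: prefix, postfix, i + 1, and ≤.
def prefix(i):
    # All elements of i except the last one.
    return i[:-1]

def postfix(i):
    # The last element; required to be a natural number.
    return i[-1]

def succ(i):
    # i + 1: increment only the trailing sequence number.
    return prefix(i) + (postfix(i) + 1,)

def leq(i, j):
    # i ≤ j requires equal prefixes; indices with different prefixes
    # are incomparable (leq returns False in both directions).
    return prefix(i) == prefix(j) and postfix(i) <= postfix(j)
```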
&lt;br /&gt;
==== Queues ====&lt;br /&gt;
&lt;br /&gt;
We distinguish three different types of queues in the replicated state: ingress queues, input queues, and output queues. Ingress queues contain the incoming messages from users (i.e., ingress messages). Input queues contain the incoming canister-to-canister messages. Output queues contain the outgoing canister-to-canister messages.&lt;br /&gt;
&lt;br /&gt;
Ingress queues are organized on a per-destination basis. Messages in ingress queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;, which is a tuple consisting of the destination canister ID and a natural number, i.e.,&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressIndex : CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Input queues and output queues are organized on a per-source-and-destination basis. Messages in input and output queues are indexed by a concrete instance of &amp;lt;code&amp;gt;Index&amp;lt;/code&amp;gt; called &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;, which is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;QueueIndex : CanisterId × CanisterId × ℕ&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The type representing all of the ingress queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;IngressQueues : CanisterId ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;IngressQueues.elements : IngressIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
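This flattened view of the elements can be sketched as follows, with queues again modeled as &amp;lt;code&amp;gt;(next_index, elements)&amp;lt;/code&amp;gt; tuples for illustration: each per-queue index &amp;lt;code&amp;gt;n&amp;lt;/code&amp;gt; is paired with its destination canister ID to form an &amp;lt;code&amp;gt;IngressIndex&amp;lt;/code&amp;gt;.&lt;br /&gt;

```python
# Sketch: flatten a per-destination map of queues into a single partial
# map keyed by IngressIndex = (CanisterId, ℕ).
def flattened_elements(queues):
    return {
        (canister_id, n): message
        for canister_id, (_, elements) in queues.items()
        for n, message in elements.items()
    }
```

The same construction applies to input and output queues, where the key is the &amp;lt;code&amp;gt;(CanisterId × CanisterId)&amp;lt;/code&amp;gt; pair and the flattened map is keyed by &amp;lt;code&amp;gt;QueueIndex&amp;lt;/code&amp;gt;.&lt;br /&gt;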
&lt;br /&gt;
The type representing all of the input queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;InputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;InputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The type representing all of the output queues is defined as follows:&lt;br /&gt;
 &amp;lt;nowiki&amp;gt;OutputQueues : (CanisterId × CanisterId) ↦ Queue&amp;lt;Message&amp;gt;,&amp;lt;/nowiki&amp;gt;&lt;br /&gt;
which means that &amp;lt;code&amp;gt;OutputQueues.elements : QueueIndex ↦ Message&amp;lt;/code&amp;gt;.&lt;/div&gt;</summary>
		<author><name>David</name></author>
	</entry>
</feed>