Stateless NAT: Optimise NAT tables (get rid of prefix lists in tables) #667

qmonnet · 2025-07-04T23:27:57Z

For stateless NAT, replace the multiple lists of prefixes that are stored for computing the mapping to translate addresses. Instead, associate to each IP range from the original prefixes a range of the same size among the target prefixes. We end up with more entries (potentially), but each entry contains only three IP addresses instead of the prefix list. This also saves us from looping on whole prefix lists when computing the address mappings. This also allows us to remove the iplist submodule, unused after the update.

Please refer to commit descriptions for more details.

Fredi-raspall · 2025-07-05T20:18:01Z

Marking this as don't merge until other PRs get in

qmonnet · 2025-07-07T09:03:38Z

Thanks Fredi, the PR was in draft mode already and it doesn't even target the main branch 😅

qmonnet · 2025-07-10T19:16:30Z

Talking with Daniel, the current sentiment is that this PR (and my other one on PrefixSize arithmetic) should likely go in before his own work.

I'll remove the dont-merge label (after triple-checking that auto-merge was disabled 🙈).

mvachhar

I'm going to approve this assuming we have reasonable test coverage for the translations. However, I think there is probably an even more efficient, O(log n), single trie lookup, way to compute the destination addresses. I'll think about it more over the next week.

This commit changes the way we store information for stateless NAT mapping, and the way we process to the lookup and mapping computation. The former model used IpList and IpListSubset (an optimised version of the former) objects. These objects represent the whole sets of allowed original and target IP addresses for the VpcExpose, and we'd retrieve the offset of the original IP within the first set, then look for the corresponding IP in the second set. This method works, but it's not great in terms of efficiency: it means that we need to loop over all the prefixes each time we want an offset (although the IpListSubset optimisation helped, here) and then again when we want to map this offset to the target address. Not to mention the memory cost of saving both lists of prefixes as a TrieValue, for each possible prefix, both for source and destination NAT. Let's rework the way we build the NAT tables. Instead of storing all these prefix lists in TrieValue objects and to associate them to each possible original prefix, we split the list of original prefixes into chunks, each of which gets the exact chunk (from the target prefixes) into which the addresses map, at the same offset. Okay, this sounds hard to visualise without a picture. Here's the legacy method, YAML-ish style: dst_nat table: - key: 1.0.0.0/16 value: TrieValue: original prefixes (IpList 1): - 1.0.0.0/16 - 2.0.0.0/16 destination prefixes (IpList 2): - 10.1.0.0/17 - 10.2.0.0/17 - 10.3.0.0/16 - key: 2.0.0.0/16 value: TrieValue: original prefixes (IpList 1): - 1.0.0.0/16 - 2.0.0.0/16 destination prefixes (IpList 2): - 10.1.0.0/17 - 10.2.0.0/17 - 10.3.0.0/16 From there, if a packet has destination 2.0.2.5, we compute the offset of the address within the first IpList (2^16 + 2*2^8 + 5), then look for the address at the same offset in the second set, and obtain 10.3.2.5. The IpListSubset optimisation shrinks one of the two IpList objects. here's the existing representation of the table: dst_nat table: - key: 1.0.0.0/16 value: TrieValue: original prefixes (IpListSubset): prefix: 1.0.0.0/16 offset: 0 destination prefixes (IpList): - 10.1.0.0/17 - 10.2.0.0/17 - 10.3.0.0/16 - key: 2.0.0.0/16 value: TrieValue: original prefixes (IpListSubset): prefix: 2.0.0.0/16 offset: 2^16 destination prefixes (IpList): - 10.1.0.0/17 - 10.2.0.0/17 - 10.3.0.0/16 This time, we only have one tuple (prefix, prefix offset) in the IpListSubset, and not the full list of original prefixes. When we receive 2.0.2.5, we compute the offset within the prefix (2*2^8 + 5) and add the prefix offset (2^16), to obtain the same offset and then the same mapping as in the first case. Now, the new model in this commit is the following: dst_nat table: - key: 1.0.0.0 value: original prefix end: 1.0.127.255 target range start: 10.1.0.0 - key: 1.0.128.0 original prefix end: 1.0.255.255 target range start: 10.2.0.0 - key: 2.0.0.0 original prefix end: 2.0.255.255 target range start: 10.3.0.0 Note that we have more keys (the 1.0.0.0/16 prefix was split into two chunks, each being associated to a /17 range from the target prefixes; target prefixes can be split too, if necessary, so that we keep a 1:1 address mapping between the associated ranges). We have more entries in the table, but they are much smaller, each containing 3 IP addresses only (one for the key, two making the value). These entries are stored in a BTreeMap. When our packet with destination 2.0.2.5 arrives, we look in the map for the closest lower (or equal) key in the map corresponding to the entry. In our case, this is 2.0.0.0. The lookup returns the full entry, so we have the original prefix start and end addresses, and the target range start address. From here, we compute the offset of the address within the range (2*2^8 + 5), and retrieve the address at the same offset within the target range (10.3.2.5). We don't need the original prefix end address to compute the offset. However, we do need this end address to make sure that the entry we find in the table is the right one: we don't want to find the closest lower range start address if the address we're looking up is not part of this range. This happens for source NAT, for example, when we search the different relevant expose sub-tables for the range corresponding to the packet source IP. The most complex part of this commit is the change of code for building the NAT rule entries, because we need to split the existing prefixes into these ranges to preserve the 1:1 mapping between each pair of ranges. We also have some sensitive updates in the structure of the NAT rules, and the insert() and lookup() function for these tables. In the core logic for the NAT code, it looks like the code for the mapping gets bigger, but this is because the new functions replace all of the IpList processing, that we don't use anymore. The IpList submodule will be entirely removed in a follow-up commit. Signed-off-by: Quentin Monnet <[email protected]>

The iplist module has been rewritten, it disappeared at some point, then it came back... but this time, it looks like we no longer need it, and we remove it again. For good? Time will tell... This is a follow-up to a rework of the stateless NAT tables, and the module is now no longer in use. Signed-off-by: Quentin Monnet <[email protected]>

To help debug stateless NAT in case of issue, log the ranges and the offset that we computed for the IP address we attempt to translate. Signed-off-by: Quentin Monnet <[email protected]>

The code for building the stateless NAT tables has grown and contains several logical sections. It would be more readable to have each of them in a dedicated submodule, along with their relevant unit tests, so this commit does just that. The three portions of code are: - The code to collapse the list of allowed and excluded prefixes into an equivalent list of allowed fragments only, moved to a new file collapse.rs - The code to build the TrieValue entries for the NAT tables, where we split the prefix lists again into pairs of ranges with a 1:1 mapping, moved to a new file range_builder.rs - The higher-level API function, add_peering(), and its direct dependency helper, that we move to the mod.rs No functional change. Signed-off-by: Quentin Monnet <[email protected]>

The NatTableValue is no longer associated with a trie (we use a BTreeHash and retrieve the lower bound for an IP to find the IP range start address). Now, the name is misleading: let's rename the struct (and some associated functions: generate_trie_values(), generate_public_trie_values(), generate_private_trie_values()). Signed-off-by: Quentin Monnet <[email protected]>

Add a simple unit test to validate the prefix list split into several IP range segments. Signed-off-by: Quentin Monnet <[email protected]>

When building stateless NAT tables, we turn our list of prefixes into different ranges, and insert each of these ranges into the NAT tables. We cannot easily merge and process contiguous CIDR prefixes; however, IP ranges, being defined with just a start and an end address, are easy to concatenate. Make sure to "merge" (concatenate) ranges from contiguous prefixes when we process them to build the tables. For an illustration, see the example from the unit test. We have: let prefixes_to_update = BTreeSet::from([ "1.0.0.0/24".into(), "1.0.1.0/24".into(), "1.0.2.0/24".into(), "1.0.3.0/24".into(), "2.0.0.0/16".into(), "2.1.0.0/16".into(), "2.2.0.0/16".into(), "2.3.0.0/16".into(), ]); let prefixes_to_point_to = BTreeSet::from([ "10.0.0.0/24".into(), "10.0.1.0/24".into(), "10.0.2.0/24".into(), "11.0.0.0/16".into(), "11.1.0.0/16".into(), "11.2.0.0/16".into(), "11.3.0.0/16".into(), "11.4.0.0/24".into(), ]); Note how the first set has a last /24 prefix when the second switches to a series of /16, breaking the "alignment" and causing more ranges to be created. Converted into basic mapping ranges, this would produce: 1.0.0.0-1.0.0.255 -> 10.0.0.0-10.0.0.255 1.0.1.0-1.0.2.255 -> 10.0.1.0-10.0.1.255 1.0.2.0-1.0.2.255 -> 10.0.2.0-10.0.2.255 1.0.3.0-1.0.3.255 -> 11.0.0.0-11.0.0.255 2.0.0.0-2.0.254.255 -> 11.0.1.0-11.0.255.255 2.0.255.0-2.0.255.255 -> 11.1.0.0-11.1.0.255 2.1.0.0-2.1.254.255 -> 11.1.1.0-11.1.255.255 2.1.255.0-2.1.255.255 -> 11.2.0.0-11.2.0.255 2.2.0.0-2.2.254.255 -> 11.2.1.0-11.2.255.255 2.2.255.0-2.3.255.255 -> 11.3.0.0-11.3.0.255 2.3.0.0-2.3.254.255 -> 11.3.1.0-11.3.255.255 2.3.255.0-2.3.255.255 -> 11.4.0.0-11.4.0.255 But most of these ranges are contiguous, and can be merged to produce fewer entries. The following mappings are equivalent: 1.0.0.0-1.0.2.255 -> 10.0.0.0-10.0.2.255 1.0.3.0-1.0.3.255 -> 11.0.0.0-11.0.0.255 2.0.0.0-2.3.255.255 -> 11.0.1.0-11.4.0.255 This is what we implement in this commit. Signed-off-by: Quentin Monnet <[email protected]>

qmonnet · 2025-07-11T13:52:52Z

Rebased to address the conflict.

I'm going to approve this assuming we have reasonable test coverage for the translations.

Thanks for the review!

However, I think there is probably an even more efficient, O(log n), single trie lookup, way to compute the destination addresses. I'll think about it more over the next week.

At the moment we have a tree lookup to find the relevant IP range, which is O(log n) already? (At least for destination NAT, and in the future for source NAT when we get a separate stage for destination VNI lookup). But if you have ideas to improve it further, yes I'm interested!

Add fuzzing tests in which we generate two lists of prefixes, pick addresses within the first list, and make sure that we can find IP ranges within the generated NAT ranges from the first prefix list. Make sure these NAT ranges have a total size that matches the size (number of IP addresses covered) of the prefix list. Generating the required prefix lists is non-trivial, given that each list must only contain non-overlapping prefixes, and both lists must have the same total size (total number of IP addresses covered by the prefixes), witout necessarily containing prefixes of the same lengths. Much of the code is dedicated to the generation of these lists. This test led to the discovery of a bug on the implementation of the Add trait for PrefixSize, which has been addressed in a previous commit. Signed-off-by: Quentin Monnet <[email protected]>

qmonnet added this to the GW R1 milestone Jul 4, 2025

qmonnet requested a review from mvachhar July 4, 2025 23:27

qmonnet self-assigned this Jul 4, 2025

qmonnet added the area/nat Related to Network Address Translation (NAT) label Jul 4, 2025

Fredi-raspall added the dont-merge Do not merge this Pull Request label Jul 5, 2025

qmonnet mentioned this pull request Jul 7, 2025

Pr/daniel noland/replace iptrie #672

Closed

qmonnet force-pushed the pr/qmonnet/simplify-trievalue branch from 316df2b to ae4844c Compare July 7, 2025 16:57

Base automatically changed from pr/qmonnet/simplify-trievalue to main July 7, 2025 17:10

qmonnet force-pushed the pr/qmonnet/trie-value-target-split branch 3 times, most recently from 97af62e to 42ef2bc Compare July 7, 2025 21:49

qmonnet marked this pull request as ready for review July 8, 2025 02:45

qmonnet requested a review from a team as a code owner July 8, 2025 02:45

qmonnet force-pushed the pr/qmonnet/trie-value-target-split branch 2 times, most recently from be0d5c5 to d832a28 Compare July 9, 2025 10:55

qmonnet requested a review from daniel-noland July 10, 2025 19:15

qmonnet removed the dont-merge Do not merge this Pull Request label Jul 10, 2025

mvachhar approved these changes Jul 11, 2025

View reviewed changes

qmonnet added 7 commits July 11, 2025 14:41

feat(nat): Log stateless NAT IP mapping offset for debug

afc913a

To help debug stateless NAT in case of issue, log the ranges and the offset that we computed for the IP address we attempt to translate. Signed-off-by: Quentin Monnet <[email protected]>

test(mgmt): Add unit test for stateless NAT's range_builder

44803b9

Add a simple unit test to validate the prefix list split into several IP range segments. Signed-off-by: Quentin Monnet <[email protected]>

qmonnet force-pushed the pr/qmonnet/trie-value-target-split branch from d832a28 to ff721b7 Compare July 11, 2025 13:50

qmonnet enabled auto-merge July 11, 2025 13:50

qmonnet force-pushed the pr/qmonnet/trie-value-target-split branch from ff721b7 to 805b608 Compare July 11, 2025 13:56

qmonnet added this pull request to the merge queue Jul 11, 2025

Merged via the queue into main with commit c461d77 Jul 11, 2025
16 checks passed

qmonnet deleted the pr/qmonnet/trie-value-target-split branch July 11, 2025 14:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Stateless NAT: Optimise NAT tables (get rid of prefix lists in tables) #667

Stateless NAT: Optimise NAT tables (get rid of prefix lists in tables) #667

Uh oh!

qmonnet commented Jul 4, 2025 •

edited

Loading

Uh oh!

Fredi-raspall commented Jul 5, 2025

Uh oh!

qmonnet commented Jul 7, 2025

Uh oh!

qmonnet commented Jul 10, 2025

Uh oh!

mvachhar left a comment

Uh oh!

qmonnet commented Jul 11, 2025

Uh oh!

Uh oh!

Uh oh!

Stateless NAT: Optimise NAT tables (get rid of prefix lists in tables) #667

Stateless NAT: Optimise NAT tables (get rid of prefix lists in tables) #667

Uh oh!

Conversation

qmonnet commented Jul 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Fredi-raspall commented Jul 5, 2025

Uh oh!

qmonnet commented Jul 7, 2025

Uh oh!

qmonnet commented Jul 10, 2025

Uh oh!

mvachhar left a comment

Choose a reason for hiding this comment

Uh oh!

qmonnet commented Jul 11, 2025

Uh oh!

Uh oh!

Uh oh!

qmonnet commented Jul 4, 2025 •

edited

Loading