Skip to content

Stateless NAT: Optimise NAT tables (get rid of prefix lists in tables) #667

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 11, 2025

Conversation

qmonnet
Copy link
Member

@qmonnet qmonnet commented Jul 4, 2025

For stateless NAT, replace the multiple lists of prefixes that are stored for computing the mapping to translate addresses. Instead, associate to each IP range from the original prefixes a range of the same size among the target prefixes. We end up with more entries (potentially), but each entry contains only three IP addresses instead of the prefix list. This also saves us from looping on whole prefix lists when computing the address mappings. This also allows us to remove the iplist submodule, unused after the update.

Please refer to commit descriptions for more details.

@qmonnet qmonnet added this to the GW R1 milestone Jul 4, 2025
@qmonnet qmonnet requested a review from mvachhar July 4, 2025 23:27
@qmonnet qmonnet self-assigned this Jul 4, 2025
@qmonnet qmonnet added the area/nat Related to Network Address Translation (NAT) label Jul 4, 2025
@Fredi-raspall Fredi-raspall added the dont-merge Do not merge this Pull Request label Jul 5, 2025
@Fredi-raspall
Copy link
Contributor

Marking this as don't merge until other PRs get in

@qmonnet
Copy link
Member Author

qmonnet commented Jul 7, 2025

Thanks Fredi, the PR was in draft mode already and it doesn't even target the main branch 😅

@qmonnet qmonnet force-pushed the pr/qmonnet/simplify-trievalue branch from 316df2b to ae4844c Compare July 7, 2025 16:57
Base automatically changed from pr/qmonnet/simplify-trievalue to main July 7, 2025 17:10
@qmonnet qmonnet force-pushed the pr/qmonnet/trie-value-target-split branch 3 times, most recently from 97af62e to 42ef2bc Compare July 7, 2025 21:49
@qmonnet qmonnet marked this pull request as ready for review July 8, 2025 02:45
@qmonnet qmonnet requested a review from a team as a code owner July 8, 2025 02:45
@qmonnet qmonnet force-pushed the pr/qmonnet/trie-value-target-split branch 2 times, most recently from be0d5c5 to d832a28 Compare July 9, 2025 10:55
@qmonnet qmonnet requested a review from daniel-noland July 10, 2025 19:15
@qmonnet
Copy link
Member Author

qmonnet commented Jul 10, 2025

Talking with Daniel, the current sentiment is that this PR (and my other one on PrefixSize arithmetic) should likely go in before his own work.

I'll remove the dont-merge label (after triple-checking that auto-merge was disabled 🙈).

@qmonnet qmonnet removed the dont-merge Do not merge this Pull Request label Jul 10, 2025
Copy link
Contributor

@mvachhar mvachhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to approve this assuming we have reasonable test coverage for the translations. However, I think there is probably an even more efficient, O(log n), single trie lookup, way to compute the destination addresses. I'll think about it more over the next week.

qmonnet added 7 commits July 11, 2025 14:41
This commit changes the way we store information for stateless NAT
mapping, and the way we process to the lookup and mapping computation.

The former model used IpList and IpListSubset (an optimised version of
the former) objects. These objects represent the whole sets of allowed
original and target IP addresses for the VpcExpose, and we'd retrieve
the offset of the original IP within the first set, then look for the
corresponding IP in the second set.

This method works, but it's not great in terms of efficiency: it means
that we need to loop over all the prefixes each time we want an offset
(although the IpListSubset optimisation helped, here) and then again
when we want to map this offset to the target address. Not to mention
the memory cost of saving both lists of prefixes as a TrieValue, for
each possible prefix, both for source and destination NAT.

Let's rework the way we build the NAT tables. Instead of storing all
these prefix lists in TrieValue objects and to associate them to each
possible original prefix, we split the list of original prefixes into
chunks, each of which gets the exact chunk (from the target prefixes)
into which the addresses map, at the same offset.

Okay, this sounds hard to visualise without a picture. Here's the
legacy method, YAML-ish style:

  dst_nat table:
     - key: 1.0.0.0/16
       value:
         TrieValue:
          original prefixes (IpList 1):
            - 1.0.0.0/16
            - 2.0.0.0/16
          destination prefixes (IpList 2):
            - 10.1.0.0/17
            - 10.2.0.0/17
            - 10.3.0.0/16
     - key: 2.0.0.0/16
       value:
         TrieValue:
          original prefixes (IpList 1):
            - 1.0.0.0/16
            - 2.0.0.0/16
          destination prefixes (IpList 2):
            - 10.1.0.0/17
            - 10.2.0.0/17
            - 10.3.0.0/16

From there, if a packet has destination 2.0.2.5, we compute the offset
of the address within the first IpList (2^16 + 2*2^8 + 5), then look for
the address at the same offset in the second set, and obtain 10.3.2.5.

The IpListSubset optimisation shrinks one of the two IpList objects.
here's the existing representation of the table:

  dst_nat table:
     - key: 1.0.0.0/16
       value:
         TrieValue:
          original prefixes (IpListSubset):
            prefix: 1.0.0.0/16
            offset: 0
          destination prefixes (IpList):
            - 10.1.0.0/17
            - 10.2.0.0/17
            - 10.3.0.0/16
     - key: 2.0.0.0/16
       value:
         TrieValue:
          original prefixes (IpListSubset):
            prefix: 2.0.0.0/16
            offset: 2^16
          destination prefixes (IpList):
            - 10.1.0.0/17
            - 10.2.0.0/17
            - 10.3.0.0/16

This time, we only have one tuple (prefix, prefix offset) in the
IpListSubset, and not the full list of original prefixes. When we
receive 2.0.2.5, we compute the offset within the prefix (2*2^8 + 5) and
add the prefix offset (2^16), to obtain the same offset and then the
same mapping as in the first case.

Now, the new model in this commit is the following:

  dst_nat table:
     - key: 1.0.0.0
       value:
         original prefix end: 1.0.127.255
         target range start: 10.1.0.0
     - key: 1.0.128.0
         original prefix end: 1.0.255.255
         target range start: 10.2.0.0
     - key: 2.0.0.0
         original prefix end: 2.0.255.255
         target range start: 10.3.0.0

Note that we have more keys (the 1.0.0.0/16 prefix was split into two
chunks, each being associated to a /17 range from the target prefixes;
target prefixes can be split too, if necessary, so that we keep a 1:1
address mapping between the associated ranges). We have more entries in
the table, but they are much smaller, each containing 3 IP addresses
only (one for the key, two making the value).

These entries are stored in a BTreeMap. When our packet with destination
2.0.2.5 arrives, we look in the map for the closest lower (or equal) key
in the map corresponding to the entry. In our case, this is 2.0.0.0. The
lookup returns the full entry, so we have the original prefix start and
end addresses, and the target range start address. From here, we compute
the offset of the address within the range (2*2^8 + 5), and retrieve the
address at the same offset within the target range (10.3.2.5).

We don't need the original prefix end address to compute the offset.
However, we do need this end address to make sure that the entry we find
in the table is the right one: we don't want to find the closest lower
range start address if the address we're looking up is not part of this
range. This happens for source NAT, for example, when we search the
different relevant expose sub-tables for the range corresponding to the
packet source IP.

The most complex part of this commit is the change of code for building
the NAT rule entries, because we need to split the existing prefixes
into these ranges to preserve the 1:1 mapping between each pair of
ranges. We also have some sensitive updates in the structure of the NAT
rules, and the insert() and lookup() function for these tables. In the
core logic for the NAT code, it looks like the code for the mapping gets
bigger, but this is because the new functions replace all of the IpList
processing, that we don't use anymore. The IpList submodule will be
entirely removed in a follow-up commit.

Signed-off-by: Quentin Monnet <[email protected]>
The iplist module has been rewritten, it disappeared at some point, then
it came back... but this time, it looks like we no longer need it, and
we remove it again. For good? Time will tell...

This is a follow-up to a rework of the stateless NAT tables, and the
module is now no longer in use.

Signed-off-by: Quentin Monnet <[email protected]>
To help debug stateless NAT in case of issue, log the ranges and the
offset that we computed for the IP address we attempt to translate.

Signed-off-by: Quentin Monnet <[email protected]>
The code for building the stateless NAT tables has grown and contains
several logical sections. It would be more readable to have each of them
in a dedicated submodule, along with their relevant unit tests, so this
commit does just that.

The three portions of code are:

- The code to collapse the list of allowed and excluded prefixes into an
  equivalent list of allowed fragments only, moved to a new file
  collapse.rs
- The code to build the TrieValue entries for the NAT tables, where we
  split the prefix lists again into pairs of ranges with a 1:1 mapping,
  moved to a new file range_builder.rs
- The higher-level API function, add_peering(), and its direct
  dependency helper, that we move to the mod.rs

No functional change.

Signed-off-by: Quentin Monnet <[email protected]>
The NatTableValue is no longer associated with a trie (we use a
BTreeHash and retrieve the lower bound for an IP to find the IP range
start address). Now, the name is misleading: let's rename the struct
(and some associated functions: generate_trie_values(),
generate_public_trie_values(), generate_private_trie_values()).

Signed-off-by: Quentin Monnet <[email protected]>
Add a simple unit test to validate the prefix list split into several IP
range segments.

Signed-off-by: Quentin Monnet <[email protected]>
When building stateless NAT tables, we turn our list of prefixes into
different ranges, and insert each of these ranges into the NAT tables.
We cannot easily merge and process contiguous CIDR prefixes; however,
IP ranges, being defined with just a start and an end address, are easy
to concatenate. Make sure to "merge" (concatenate) ranges from
contiguous prefixes when we process them to build the tables.

For an illustration, see the example from the unit test. We have:

    let prefixes_to_update = BTreeSet::from([
        "1.0.0.0/24".into(),
        "1.0.1.0/24".into(),
        "1.0.2.0/24".into(),
        "1.0.3.0/24".into(),
        "2.0.0.0/16".into(),
        "2.1.0.0/16".into(),
        "2.2.0.0/16".into(),
        "2.3.0.0/16".into(),
    ]);
    let prefixes_to_point_to = BTreeSet::from([
        "10.0.0.0/24".into(),
        "10.0.1.0/24".into(),
        "10.0.2.0/24".into(),
        "11.0.0.0/16".into(),
        "11.1.0.0/16".into(),
        "11.2.0.0/16".into(),
        "11.3.0.0/16".into(),
        "11.4.0.0/24".into(),
    ]);

Note how the first set has a last /24 prefix when the second switches to
a series of /16, breaking the "alignment" and causing more ranges to be
created. Converted into basic mapping ranges, this would produce:

    1.0.0.0-1.0.0.255     -> 10.0.0.0-10.0.0.255
    1.0.1.0-1.0.2.255     -> 10.0.1.0-10.0.1.255
    1.0.2.0-1.0.2.255     -> 10.0.2.0-10.0.2.255
    1.0.3.0-1.0.3.255     -> 11.0.0.0-11.0.0.255
    2.0.0.0-2.0.254.255   -> 11.0.1.0-11.0.255.255
    2.0.255.0-2.0.255.255 -> 11.1.0.0-11.1.0.255
    2.1.0.0-2.1.254.255   -> 11.1.1.0-11.1.255.255
    2.1.255.0-2.1.255.255 -> 11.2.0.0-11.2.0.255
    2.2.0.0-2.2.254.255   -> 11.2.1.0-11.2.255.255
    2.2.255.0-2.3.255.255 -> 11.3.0.0-11.3.0.255
    2.3.0.0-2.3.254.255   -> 11.3.1.0-11.3.255.255
    2.3.255.0-2.3.255.255 -> 11.4.0.0-11.4.0.255

But most of these ranges are contiguous, and can be merged to produce
fewer entries. The following mappings are equivalent:

    1.0.0.0-1.0.2.255    -> 10.0.0.0-10.0.2.255
    1.0.3.0-1.0.3.255    -> 11.0.0.0-11.0.0.255
    2.0.0.0-2.3.255.255  -> 11.0.1.0-11.4.0.255

This is what we implement in this commit.

Signed-off-by: Quentin Monnet <[email protected]>
@qmonnet qmonnet force-pushed the pr/qmonnet/trie-value-target-split branch from d832a28 to ff721b7 Compare July 11, 2025 13:50
@qmonnet qmonnet enabled auto-merge July 11, 2025 13:50
@qmonnet
Copy link
Member Author

qmonnet commented Jul 11, 2025

Rebased to address the conflict.

I'm going to approve this assuming we have reasonable test coverage for the translations.

Thanks for the review!

However, I think there is probably an even more efficient, O(log n), single trie lookup, way to compute the destination addresses. I'll think about it more over the next week.

At the moment we have a tree lookup to find the relevant IP range, which is O(log n) already? (At least for destination NAT, and in the future for source NAT when we get a separate stage for destination VNI lookup). But if you have ideas to improve it further, yes I'm interested!

Add fuzzing tests in which we generate two lists of prefixes, pick
addresses within the first list, and make sure that we can find IP
ranges within the generated NAT ranges from the first prefix list. Make
sure these NAT ranges have a total size that matches the size (number of
IP addresses covered) of the prefix list.

Generating the required prefix lists is non-trivial, given that each
list must only contain non-overlapping prefixes, and both lists must
have the same total size (total number of IP addresses covered by the
prefixes), witout necessarily containing prefixes of the same lengths.
Much of the code is dedicated to the generation of these lists.

This test led to the discovery of a bug on the implementation of the Add
trait for PrefixSize, which has been addressed in a previous commit.

Signed-off-by: Quentin Monnet <[email protected]>
@qmonnet qmonnet force-pushed the pr/qmonnet/trie-value-target-split branch from ff721b7 to 805b608 Compare July 11, 2025 13:56
@qmonnet qmonnet added this pull request to the merge queue Jul 11, 2025
Merged via the queue into main with commit c461d77 Jul 11, 2025
16 checks passed
@qmonnet qmonnet deleted the pr/qmonnet/trie-value-target-split branch July 11, 2025 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/nat Related to Network Address Translation (NAT)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants