-
Notifications
You must be signed in to change notification settings - Fork 5
Stateless NAT: Optimise NAT tables (get rid of prefix lists in tables) #667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Marking this as don't merge until other PRs get in |
Thanks Fredi, the PR was in draft mode already and it doesn't even target the |
316df2b
to
ae4844c
Compare
97af62e
to
42ef2bc
Compare
be0d5c5
to
d832a28
Compare
Talking with Daniel, the current sentiment is that this PR (and my other one on I'll remove the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to approve this assuming we have reasonable test coverage for the translations. However, I think there is probably an even more efficient, O(log n), single trie lookup, way to compute the destination addresses. I'll think about it more over the next week.
This commit changes the way we store information for stateless NAT mapping, and the way we process to the lookup and mapping computation. The former model used IpList and IpListSubset (an optimised version of the former) objects. These objects represent the whole sets of allowed original and target IP addresses for the VpcExpose, and we'd retrieve the offset of the original IP within the first set, then look for the corresponding IP in the second set. This method works, but it's not great in terms of efficiency: it means that we need to loop over all the prefixes each time we want an offset (although the IpListSubset optimisation helped, here) and then again when we want to map this offset to the target address. Not to mention the memory cost of saving both lists of prefixes as a TrieValue, for each possible prefix, both for source and destination NAT. Let's rework the way we build the NAT tables. Instead of storing all these prefix lists in TrieValue objects and to associate them to each possible original prefix, we split the list of original prefixes into chunks, each of which gets the exact chunk (from the target prefixes) into which the addresses map, at the same offset. Okay, this sounds hard to visualise without a picture. Here's the legacy method, YAML-ish style: dst_nat table: - key: 1.0.0.0/16 value: TrieValue: original prefixes (IpList 1): - 1.0.0.0/16 - 2.0.0.0/16 destination prefixes (IpList 2): - 10.1.0.0/17 - 10.2.0.0/17 - 10.3.0.0/16 - key: 2.0.0.0/16 value: TrieValue: original prefixes (IpList 1): - 1.0.0.0/16 - 2.0.0.0/16 destination prefixes (IpList 2): - 10.1.0.0/17 - 10.2.0.0/17 - 10.3.0.0/16 From there, if a packet has destination 2.0.2.5, we compute the offset of the address within the first IpList (2^16 + 2*2^8 + 5), then look for the address at the same offset in the second set, and obtain 10.3.2.5. The IpListSubset optimisation shrinks one of the two IpList objects. here's the existing representation of the table: dst_nat table: - key: 1.0.0.0/16 value: TrieValue: original prefixes (IpListSubset): prefix: 1.0.0.0/16 offset: 0 destination prefixes (IpList): - 10.1.0.0/17 - 10.2.0.0/17 - 10.3.0.0/16 - key: 2.0.0.0/16 value: TrieValue: original prefixes (IpListSubset): prefix: 2.0.0.0/16 offset: 2^16 destination prefixes (IpList): - 10.1.0.0/17 - 10.2.0.0/17 - 10.3.0.0/16 This time, we only have one tuple (prefix, prefix offset) in the IpListSubset, and not the full list of original prefixes. When we receive 2.0.2.5, we compute the offset within the prefix (2*2^8 + 5) and add the prefix offset (2^16), to obtain the same offset and then the same mapping as in the first case. Now, the new model in this commit is the following: dst_nat table: - key: 1.0.0.0 value: original prefix end: 1.0.127.255 target range start: 10.1.0.0 - key: 1.0.128.0 original prefix end: 1.0.255.255 target range start: 10.2.0.0 - key: 2.0.0.0 original prefix end: 2.0.255.255 target range start: 10.3.0.0 Note that we have more keys (the 1.0.0.0/16 prefix was split into two chunks, each being associated to a /17 range from the target prefixes; target prefixes can be split too, if necessary, so that we keep a 1:1 address mapping between the associated ranges). We have more entries in the table, but they are much smaller, each containing 3 IP addresses only (one for the key, two making the value). These entries are stored in a BTreeMap. When our packet with destination 2.0.2.5 arrives, we look in the map for the closest lower (or equal) key in the map corresponding to the entry. In our case, this is 2.0.0.0. The lookup returns the full entry, so we have the original prefix start and end addresses, and the target range start address. From here, we compute the offset of the address within the range (2*2^8 + 5), and retrieve the address at the same offset within the target range (10.3.2.5). We don't need the original prefix end address to compute the offset. However, we do need this end address to make sure that the entry we find in the table is the right one: we don't want to find the closest lower range start address if the address we're looking up is not part of this range. This happens for source NAT, for example, when we search the different relevant expose sub-tables for the range corresponding to the packet source IP. The most complex part of this commit is the change of code for building the NAT rule entries, because we need to split the existing prefixes into these ranges to preserve the 1:1 mapping between each pair of ranges. We also have some sensitive updates in the structure of the NAT rules, and the insert() and lookup() function for these tables. In the core logic for the NAT code, it looks like the code for the mapping gets bigger, but this is because the new functions replace all of the IpList processing, that we don't use anymore. The IpList submodule will be entirely removed in a follow-up commit. Signed-off-by: Quentin Monnet <[email protected]>
The iplist module has been rewritten, it disappeared at some point, then it came back... but this time, it looks like we no longer need it, and we remove it again. For good? Time will tell... This is a follow-up to a rework of the stateless NAT tables, and the module is now no longer in use. Signed-off-by: Quentin Monnet <[email protected]>
To help debug stateless NAT in case of issue, log the ranges and the offset that we computed for the IP address we attempt to translate. Signed-off-by: Quentin Monnet <[email protected]>
The code for building the stateless NAT tables has grown and contains several logical sections. It would be more readable to have each of them in a dedicated submodule, along with their relevant unit tests, so this commit does just that. The three portions of code are: - The code to collapse the list of allowed and excluded prefixes into an equivalent list of allowed fragments only, moved to a new file collapse.rs - The code to build the TrieValue entries for the NAT tables, where we split the prefix lists again into pairs of ranges with a 1:1 mapping, moved to a new file range_builder.rs - The higher-level API function, add_peering(), and its direct dependency helper, that we move to the mod.rs No functional change. Signed-off-by: Quentin Monnet <[email protected]>
The NatTableValue is no longer associated with a trie (we use a BTreeHash and retrieve the lower bound for an IP to find the IP range start address). Now, the name is misleading: let's rename the struct (and some associated functions: generate_trie_values(), generate_public_trie_values(), generate_private_trie_values()). Signed-off-by: Quentin Monnet <[email protected]>
Add a simple unit test to validate the prefix list split into several IP range segments. Signed-off-by: Quentin Monnet <[email protected]>
When building stateless NAT tables, we turn our list of prefixes into different ranges, and insert each of these ranges into the NAT tables. We cannot easily merge and process contiguous CIDR prefixes; however, IP ranges, being defined with just a start and an end address, are easy to concatenate. Make sure to "merge" (concatenate) ranges from contiguous prefixes when we process them to build the tables. For an illustration, see the example from the unit test. We have: let prefixes_to_update = BTreeSet::from([ "1.0.0.0/24".into(), "1.0.1.0/24".into(), "1.0.2.0/24".into(), "1.0.3.0/24".into(), "2.0.0.0/16".into(), "2.1.0.0/16".into(), "2.2.0.0/16".into(), "2.3.0.0/16".into(), ]); let prefixes_to_point_to = BTreeSet::from([ "10.0.0.0/24".into(), "10.0.1.0/24".into(), "10.0.2.0/24".into(), "11.0.0.0/16".into(), "11.1.0.0/16".into(), "11.2.0.0/16".into(), "11.3.0.0/16".into(), "11.4.0.0/24".into(), ]); Note how the first set has a last /24 prefix when the second switches to a series of /16, breaking the "alignment" and causing more ranges to be created. Converted into basic mapping ranges, this would produce: 1.0.0.0-1.0.0.255 -> 10.0.0.0-10.0.0.255 1.0.1.0-1.0.2.255 -> 10.0.1.0-10.0.1.255 1.0.2.0-1.0.2.255 -> 10.0.2.0-10.0.2.255 1.0.3.0-1.0.3.255 -> 11.0.0.0-11.0.0.255 2.0.0.0-2.0.254.255 -> 11.0.1.0-11.0.255.255 2.0.255.0-2.0.255.255 -> 11.1.0.0-11.1.0.255 2.1.0.0-2.1.254.255 -> 11.1.1.0-11.1.255.255 2.1.255.0-2.1.255.255 -> 11.2.0.0-11.2.0.255 2.2.0.0-2.2.254.255 -> 11.2.1.0-11.2.255.255 2.2.255.0-2.3.255.255 -> 11.3.0.0-11.3.0.255 2.3.0.0-2.3.254.255 -> 11.3.1.0-11.3.255.255 2.3.255.0-2.3.255.255 -> 11.4.0.0-11.4.0.255 But most of these ranges are contiguous, and can be merged to produce fewer entries. The following mappings are equivalent: 1.0.0.0-1.0.2.255 -> 10.0.0.0-10.0.2.255 1.0.3.0-1.0.3.255 -> 11.0.0.0-11.0.0.255 2.0.0.0-2.3.255.255 -> 11.0.1.0-11.4.0.255 This is what we implement in this commit. Signed-off-by: Quentin Monnet <[email protected]>
d832a28
to
ff721b7
Compare
Rebased to address the conflict.
Thanks for the review!
At the moment we have a tree lookup to find the relevant IP range, which is O(log n) already? (At least for destination NAT, and in the future for source NAT when we get a separate stage for destination VNI lookup). But if you have ideas to improve it further, yes I'm interested! |
Add fuzzing tests in which we generate two lists of prefixes, pick addresses within the first list, and make sure that we can find IP ranges within the generated NAT ranges from the first prefix list. Make sure these NAT ranges have a total size that matches the size (number of IP addresses covered) of the prefix list. Generating the required prefix lists is non-trivial, given that each list must only contain non-overlapping prefixes, and both lists must have the same total size (total number of IP addresses covered by the prefixes), witout necessarily containing prefixes of the same lengths. Much of the code is dedicated to the generation of these lists. This test led to the discovery of a bug on the implementation of the Add trait for PrefixSize, which has been addressed in a previous commit. Signed-off-by: Quentin Monnet <[email protected]>
ff721b7
to
805b608
Compare
For stateless NAT, replace the multiple lists of prefixes that are stored for computing the mapping to translate addresses. Instead, associate to each IP range from the original prefixes a range of the same size among the target prefixes. We end up with more entries (potentially), but each entry contains only three IP addresses instead of the prefix list. This also saves us from looping on whole prefix lists when computing the address mappings. This also allows us to remove the iplist submodule, unused after the update.
Please refer to commit descriptions for more details.