Maniphest T368544

IPIP encapsulation considerations for low-traffic services
Open, Medium, Public

Description

IPIP encapsulation has a 20-byte overhead that needs to be accounted for somehow. In high-traffic[12] services we chose MSS clamping to avoid fragmentation between the load balancer and the realservers. For low-traffic services we also have the option of increasing the MTU, given that we are discussing internal-only services.
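
For reference, the arithmetic behind the clamp: with a 1500-byte path MTU and 20 bytes of IPIP overhead, the encapsulated inner IPv4 packet can be at most 1480 bytes, so the TCP MSS must not exceed 1480 - 20 (IP) - 20 (TCP) = 1440 bytes. A rough netfilter sketch of such a clamp is below; it is only an illustration (the VIP address is a placeholder), not necessarily the mechanism deployed here:

  # clamp the MSS advertised in SYN/SYN-ACK packets sourced from the VIP,
  # leaving room for the 20-byte IPIP header (1500 - 20 - 20 - 20 = 1440)
  iptables -t mangle -A OUTPUT -s 198.51.100.10 -p tcp \
      --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1440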

Event Timeline

Restricted Application added a subscriber: Aklapper.
Vgutierrez triaged this task as Medium priority. Jun 26 2024, 2:46 PM

IPIP encapsulation has a 20-byte overhead that needs to be accounted for somehow. In high-traffic[12] services we chose MSS clamping to avoid fragmentation between the load balancer and the realservers. For low-traffic services we also have the option of increasing the MTU, given that we are discussing internal-only services.

Tbh we have that option for public-facing services also, as the additional 20 bytes of tunnel overhead is only added across our own network, which can support jumbo frames.

It definitely gets tricky, however. You need to:

  • Increase the MTU on the LB and realservers so they can send/receive packets with an IP MTU of at least 1520 bytes (a rough sketch of both steps follows after this list)
  • Ensure those servers do not send packets of over 1500 bytes to anything apart from hosts we know can support it
    • In theory, internally, PMTUD might work, but at best it's inefficient
    • On the internet PMTUD can't be relied on
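
One rough sketch of what those two steps could look like on a host, assuming iproute2 and treating the interface name, gateway and MTU values as placeholders (not a statement of how anything is actually configured here):

  # raise the physical interface MTU so encapsulated (1500 + 20 byte) packets fit
  ip link set dev eno1 mtu 9000

  # keep the default route capped at 1500 so internet-bound traffic is never
  # oversized; "lock" also tells the kernel to ignore PMTUD for this route
  ip route change default via 10.0.0.1 dev eno1 mtu lock 1500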

There are several ways you can try to achieve this. But my gut feeling is that if we only have TCP-based services, and we already have a functioning solution using TCP MSS clamping, the simplest thing might be to also use it for the low_traffic services. For the most part requests that go via the LBs are small anyway, so the lower MSS isn't having a material effect.

Interested to hear what others think. It's definitely an interesting problem, and I'd be happy to work on the jumbo-frame option if people thought it was the way to go.

Could be convinced otherwise, but I'm generally in favor of the MSS clamping option -- we know it works and the tradeoffs are relatively easy to reason about.

There are a few services that are heavy on LB ingress, but they're the exceptions rather than the rule.

I'd go ahead and take a step back: why do we need to switch to IPIP encapsulation for backend services?

Is there a compelling reason why it's better than our current solution?

T352956 is related (possibly a duplicate) and I've been mulling over it for a few months now. I think we need to have a larger in-person discussion regarding this. There are some things I wanna understand on the kubernetes side before we move forward. I'll send invites.

I'd go ahead and take a step back: why do we need to switch to IPIP encapsulation for backend services?

Is there a compelling reason why it's better than our current solution?

I believe we wanna move away from all the VXLAN setup and stop requiring L2 connectivity between load balancers and realservers. This would allow us to have healthchecks following the same network path as production traffic; IMHO that's worth the effort :)

I'd go ahead and take a step back: why do we need to switch to IPIP encapsulation for backend services?

Is there a compelling reason why it's better than our current solution?

I believe we wanna move away from all the VXLAN setup and stop requiring L2 connectivity between load balancers and realservers.

While I see why that's attractive, I think there are factors other than network design to consider here. For services that sit inside kubernetes, an LVS/katran load balancer is just a mostly transparent abstraction we introduce between ATS and its backends (the k8s worker nodes) and the component doing the real load balancing (kube-proxy). I think it's on serviceops to decide which direction they want to go with - be it switching kube-proxy to ipvs mode, using the service mesh to do all load balancing, sticking with centralized LVS servers, or something else.

Note there was some phab/brain lag here: I wrote this before I saw joe's last response above, so they overlap a bunch.

For more context: eventually our Katran-based Liberica balancer will replace pybal/LVS. The Katran one has to use IPIP; that's its nature. We've been transitioning high-traffic[12] public LVS services to IPIP now, while still using pybal/LVS, so that we can mitigate some of the transition risks that are coming down the line. Eventually, for the "low-traffic" cases, all of those services would need to go down one of a few roads:

  1. Move to liberica and use IPIP (in which case, again, it's better to separate the risks and transitions here, get over the IPIP hurdle now as a separate transition from the Liberica switch)
  2. Stay on pybal/LVS (forever?) as the last remaining use cases for this stack, but Traffic will have moved on to the above for the public-facing stuff.
  3. Move away from both, perhaps by having some solution involving direct BGP from k8s ingress or whatever (which I think is ideal, but I don't know how close we are to getting there).

Theoretically speaking we could keep low-traffic on liberica/IPVS (instead of liberica/Katran) to be able to get rid of pybal entirely. Besides k8s-based services we have some other services on low-traffic that need load balancing AFAIK.

But even if we stay on Liberica/IPVS for low-traffic, having the ability to run healthchecks properly, using the same network path as incoming requests, would be a nice benefit for low-traffic services.

I'd go ahead and take a step back: why do we need to switch to IPIP encapsulation for backend services?

Is there a compelling reason why it's better than our current solution?

Yeah. Running long lines across our datacentre so every load-balancer can have an interface directly on every vlan is a lot of hassle. It results in complications like in T358260. And there are failure cases where a back-end vlan becomes unreachable, but PyBal keeps announcing the range because the health check starts using the primary interface/default route (which works for the healthcheck, but not for the L2 IPVS MAC rewrite). Lastly, it means we need to support spanning layer-2 vlans across multiple switches (using EVPN/VXLAN at core sites, which works well but is extra complexity, and currently comes with an additional license fee from Juniper).

So IPIP or some solution where we can load-balance to back-end servers without direct L2 adjacency is definitely the preferred solution for netops.

Theoretically speaking we could keep low-traffic on liberica/IPVS (instead of liberica/Katran) to be able to get rid of pybal entirely. Besides k8s-based services we have some other services on low-traffic that need load balancing AFAIK.

But even if we stay on Liberica/IPVS for low-traffic, having the ability to run healthchecks properly, using the same network path as incoming requests, would be a nice benefit for low-traffic services.

oh I agree 100% with this. My doubts were specifically about switching to katran; I was taking switching to the new control plane as a given :)

oh I agree 100% with this. My doubts were specifically about switching to katran; I was taking switching to the new control plane as a given :)

so, it would be great if we could move to IPIP encapsulation even if we stay with IPVS, for healthchecking purposes and to get rid of the VXLAN requirements.

IPIP encapsulation is a necessary step in the right direction, whatever solution we decide on for load balancing, for the reasons mentioned by Cathal and Valentin. As a data point, the VXLAN license is an extra 100k for a 10-rack setup (plus yearly support).

About MSS clamping vs. higher MTU: I suggested that it might be easier to increase the MTU on low-traffic services than to deploy the MSS clamping tool on some services. But I don't have strong preferences; maybe one is better for some use cases than the other.

I agree that MSS clamping is the now battle-tested solution: it works well and doesn't impact performance. It also seems reasonably easy to deploy, but I think there were some concerns about deploying it to k8s or similar services?

Increasing the MTU has more moving parts, so more things that could go wrong. It could help increase performance on bandwidth-heavy systems (by increasing the MTU all the way to 9000).
Working with internal-only systems makes it easier, as we control both endpoints as well as the full path.

While writing this comment I also realized that a realserver can also be a client, so some clamping will always be needed, either for the MTU or with the MSS; so we could decouple a possible MTU increase from the IPIP clamping.

For example (just some quick thoughts):
  • client: MTU 1500 or 9000 (a realserver can potentially be a client)
  • LVS: MTU 9000 or 9020 (it makes sense to me to consider the LVS as a router anyway and thus have a higher MTU)
  • realserver: MTU 9000
An ip route 0/0 rule would be needed to "clamp" the outbound MTU or MSS (using mtu lock, or advmss for example) on any client or realserver with a jumbo MTU.
Otherwise, even if the LVS has an MTU that takes the IPIP overhead into account, if both sides (client/realserver) have the same MTU, the same problem we're seeing with a regular MTU would happen with a higher MTU, as the MTU is also the "maximum receive unit".

The current MSS clamping solution implemented by Valentin is smart and only clamps traffic from the VIPs. If tcp-mss-clamper or the equivalent Ferm-based mechanism can't be used on some systems (e.g. k8s), maybe applying the ip route 0/0 rule system-wide would be a good workaround. If there is a concern about a possible performance hit (but I doubt it), then we could compensate for it with an MTU increase.
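
For illustration, such a 0/0 rule could look roughly like the following (the gateway, interface and values are placeholders, and the exact numbers would depend on the MTUs we settle on):

  # advertise a lower MSS for every TCP connection routed via the default route,
  # leaving room for the 20-byte IPIP header (1500 - 20 - 20 - 20 = 1440)
  ip route change default via 10.0.0.1 dev eno1 advmss 1440

  # alternatively, cap (and lock) the route MTU so non-TCP traffic is covered too
  ip route change default via 10.0.0.1 dev eno1 mtu lock 1480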

An ip route 0/0 rule would be needed to "clamp" the outbound MTU or MSS (using mtu lock, or advmss for example) on any client or realserver with a jumbo MTU.

The ability of Linux to have advmss or mtu set on a route is probably the most promising technical mechanism for supporting jumbo frames. I've seen it used fleet-wide to support jumbo frames for internal traffic flows while keeping internet-bound traffic working fine at 1500.

But it's a lot trickier if we want to support a mix of MTUs internally. The main issue with using advmss or mtu on a route is that it won't affect traffic to other hosts on the same subnet/vlan. So it gets you 99% of the way there but doesn't fully solve the problem.
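
One way to see that caveat in practice (the addresses below are made up for illustration) is ip route get, which shows which route, and therefore which mtu/advmss attributes, a given destination would use:

  # destination in another subnet: resolved via the default route,
  # so it inherits that route's advmss/mtu clamp
  ip route get 10.2.0.15

  # destination on the local subnet: resolved via the connected route,
  # which carries no clamp, so full-MTU packets can still be sent to it
  ip route get 10.0.0.42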

Otherwise, even if the LVS has an MTU that takes the IPIP overhead into account, if both sides (client/realserver) have the same MTU, the same problem we're seeing with a regular MTU would happen with a higher MTU, as the MTU is also the "maximum receive unit".

Yeah. What would be ideal in the tunnelling case is if hosts could have a different MRU - max receive unit - than MTU, so that they could receive 1520-byte or larger packets from the load-balancers, but would not generate oversize packets themselves. That doesn't seem possible in Linux though?