About how Cilium thinks about iptables and how it uses it.
Audience
For people who want to understand why and how Cilium still uses iptables.
Cilium doesn’t like iptables
I've been playing with Cilium for a while and was impressed by how it leverages eBPF to build a high-performance, scalable solution for container networking, plus observability and L7-aware policy capabilities. I also admire the team, which has been a leader and pioneer in using eBPF for networking for years. The recent service mesh beta program looks very interesting. Check it out if you haven't heard about Cilium before!
One picture is worth a thousand words. This is how Cilium achieves what it does, because the current Kubernetes networking implementation with the iptables-based kube-proxy is not scalable.
The eBPF-based solution, on the other hand, has proven to be better (higher throughput, lower latency, less resource consumption). Of course, everything related to eBPF is fast, and the company has some of the best eBPF engineers around AFAIK, so there is no doubt about its performance.
When I saw that Cilium offers a kube-proxy-free mode based on its host-reachable services feature, I thought: oh wow, that's cool, but it's a bit sad to say goodbye to our old friend iptables. It turned out that's not the case: with my current kernel version (4.19), Cilium still needs iptables for its proxy redirection.
The Cilium documentation about its iptables usage is quite sparse, hence this article.
Why does Cilium still use iptables?
There are two main reasons, IMHO:
- To ensure the Cilium datapath can coexist with existing iptables settings. (Whatever Cilium's opinion of iptables is, it is still there as part of Linux networking and packets still go through it, just with much less impact.)
- To implement critical features on old kernels (4.x), like L7 network policy (via TPROXY), that cannot be implemented with an eBPF-only approach because the old kernel doesn't support it. (Cilium has several cool features that only work with modern 5.x kernels.)
One very important point here: since Cilium successfully replaces kube-proxy with its eBPF-based solution, it solves the scalability problem of the iptables-based kube-proxy. There is no longer a huge number of iptables rules on every node, which is a big step forward.
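If you want to verify this on your own cluster, the agent reports whether the kube-proxy replacement is active, and you can count the Cilium-managed iptables rules yourself. A quick check, assuming the usual cilium DaemonSet in kube-system:

# is Cilium handling Kubernetes services instead of kube-proxy?
kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement
# how many iptables rules does Cilium itself actually keep on a node?
sudo iptables-save | grep -c '^-A CILIUM'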
How does Cilium use iptables?
The rules look like this:
# 10.197.64.0/18 is my ipv4-native-routing-cidr
-A CILIUM_OUTPUT_raw -d 10.197.64.0/18 -m comment --comment "cilium: NOTRACK for pod traffic" -j CT --notrack
-A CILIUM_OUTPUT_raw -s 10.197.64.0/18 -m comment --comment "cilium: NOTRACK for pod traffic" -j CT --notrack
# 0xa00/0xfffffeff: MagicMarkIsProxy/MagicMarkProxyNoIDMask
-A CILIUM_OUTPUT_raw -o lxc+ -m mark --mark 0xa00/0xfffffeff -m comment --comment "cilium: NOTRACK for proxy return traffic" -j CT --notrack
-A CILIUM_OUTPUT_raw -o cilium_host -m mark --mark 0xa00/0xfffffeff -m comment --comment "cilium: NOTRACK for proxy return traffic" -j CT --notrack
-A CILIUM_PRE_raw -d 10.197.64.0/18 -m comment --comment "cilium: NOTRACK for pod traffic" -j CT --notrack
-A CILIUM_PRE_raw -s 10.197.64.0/18 -m comment --comment "cilium: NOTRACK for pod traffic" -j CT --notrack
# 0x200/0xf00: MagicMarkIsToProxy/MagicMarkHostMask
-A CILIUM_PRE_raw -m mark --mark 0x200/0xf00 -m comment --comment "cilium: NOTRACK for proxy traffic" -j CT --notrack
-A CILIUM_PRE_mangle -m socket --transparent -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
# dns
-A CILIUM_PRE_mangle -p tcp -m mark --mark 0x779b0200 -m comment --comment "cilium: TPROXY to host cilium-dns-egress proxy" -j TPROXY --on-port 39799 --on-ip 0.0.0.0 --tproxy-mark 0x200/0xffffffff
-A CILIUM_PRE_mangle -p udp -m mark --mark 0x779b0200 -m comment --comment "cilium: TPROXY to host cilium-dns-egress proxy" -j TPROXY --on-port 39799 --on-ip 0.0.0.0 --tproxy-mark 0x200/0xffffffff
# http
-A CILIUM_PRE_mangle -p tcp -m mark --mark 0x8f3c0200 -m comment --comment "cilium: TPROXY to host cilium-http-ingress proxy" -j TPROXY --on-port 15503 --on-ip 0.0.0.0 --tproxy-mark 0x200/0xffffffff
-A CILIUM_PRE_mangle -p udp -m mark --mark 0x8f3c0200 -m comment --comment "cilium: TPROXY to host cilium-http-ingress proxy" -j TPROXY --on-port 15503 --on-ip 0.0.0.0 --tproxy-mark 0x200/0xffffffff
# the TPROXY mark match is calculated like this:
# port := uint32(byteorder.HostToNetwork16(proxyPort)) << 16
# ingressMarkMatch := fmt.Sprintf("%#x", linux_defaults.MagicMarkIsToProxy|port)
# just accept traffic to/from cilium's veth interfaces
-A CILIUM_FORWARD -o cilium_host -m comment --comment "cilium: any->cluster on cilium_host forward accept" -j ACCEPT
-A CILIUM_FORWARD -i cilium_host -m comment --comment "cilium: cluster->any on cilium_host forward accept (nodeport)" -j ACCEPT
-A CILIUM_FORWARD -i lxc+ -m comment --comment "cilium: cluster->any on lxc+ forward accept" -j ACCEPT
-A CILIUM_FORWARD -i cilium_net -m comment --comment "cilium: cluster->any on cilium_net forward accept (nodeport)" -j ACCEPT
-A CILIUM_INPUT -m mark --mark 0x200/0xf00 -m comment --comment "cilium: ACCEPT for proxy traffic" -j ACCEPT
-A CILIUM_OUTPUT -m mark --mark 0xa00/0xfffffeff -m comment --comment "cilium: ACCEPT for proxy return traffic" -j ACCEPT
-A CILIUM_OUTPUT -m mark ! --mark 0xe00/0xf00 -m mark ! --mark 0xd00/0xf00 -m mark ! --mark 0xa00/0xe00 -m comment --comment "cilium: host->any mark as from host" -j MARK --set-xmark 0xc00/0xf00
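If you want to pull the same dump from one of your own nodes, the rules are easy to find by the CILIUM_ chain prefix (the exact set of chains can vary a bit between Cilium versions):

# dump every Cilium-managed rule across all tables
sudo iptables-save | grep -E '^-A CILIUM'
# or look at a single chain, e.g. the TPROXY rules in the mangle table
sudo iptables -t mangle -S CILIUM_PRE_mangle

As a sanity check of the mark formula in the comment above: the DNS proxy here listens on port 39799, which is 0x9b77; converting it to network byte order gives 0x779b, shifting left by 16 gives 0x779b0000, and OR-ing in MagicMarkIsToProxy (0x200) gives exactly the 0x779b0200 that the TPROXY rules match on.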
I planned to walk through these rules one by one in this article, but then I realized it would be quite dry and boring. Plus, a topic like TPROXY deserves a separate write-up for readers who haven't seen it before. So here is a summary for the lazy folks. (Ping me if you want an extra article about TPROXY; it was quite a difficult topic for me to understand thoroughly.)
- The rules are installed by cilium-agent when it starts up or when a new L7 policy is created (a couple of extra rules with an fwmark based on a randomly chosen proxy port are generated).
- To my surprise, Cilium doesn't periodically resynchronize those rules the way kube-proxy does. If you somehow remove a rule from one of its custom chains, you have to add it back manually or restart cilium-agent. Is this a bug or a feature?
- As explained above, the rules serve two main purposes:
  - To make sure traffic passes through the default iptables tables and chains without being dropped by a default policy (for example, ACCEPT traffic to/from Cilium's veth interfaces like lxc_* or cilium_* in the FORWARD chain of the filter table). These rules are identical on all nodes.
  - To redirect traffic to its Envoy proxy (mark the packet and use the TPROXY target in the mangle table, and match the magic mark in the raw table to avoid conntrack). These rules differ slightly from node to node.
- More about the L7 policy related rules: they need some external configuration to work (see the sketch after this list):
  - TPROXY requires policy routing to forward the marked traffic to the local proxy.
  - The mark is set by BPF hooks. (I'll need to dig into the details of this step.)
  - Of course, an application proxy is required, with the IP_TRANSPARENT socket option set.
- By default, cilium-agent manages conntrack with iptables (install-no-conntrack-iptables-rules=false), so there are some extra rules in the nat table and one ipset called cilium_node_set_v4 set up on each node to handle various NAT operations (for example, allowing your node to connect to the internet); a quick way to inspect these is shown after this list. On my 4.19 kernel, the enable-bpf-masquerade option for cilium-agent can achieve the same effect.
- Last but not least: the number of rules doesn't grow linearly with the number of endpoints or services (as it does with kube-proxy); it stays essentially fixed. The only thing that may be added along the way is a pair of rules (ingress/egress) for each type of L7 proxy (DNS, HTTP, Kafka so far).
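For reference, the policy routing piece mentioned above follows the standard TPROXY recipe sketched below. Cilium installs the equivalent ip rules and routes itself, so this is only an illustration of the idea; the routing table number 100 is made up for the example, not what cilium-agent actually uses.

# send packets carrying the to-proxy mark to a dedicated routing table...
ip rule add fwmark 0x200/0xf00 lookup 100
# ...where everything is delivered locally, so packets whose destination
# is not a local address still reach the proxy listening on the host
ip route add local 0.0.0.0/0 dev lo table 100

The TPROXY target in the mangle table only steers delivery; it is this local route together with the IP_TRANSPARENT listener that lets the proxy accept a packet addressed to someone else.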
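And to see the NAT side on your own node, something like this should show it; the CILIUM_POST_nat chain name is what I observe on my setup and may differ between versions:

# the ipset of node addresses that cilium-agent maintains for its NAT rules
sudo ipset list cilium_node_set_v4
# the masquerading rules cilium-agent keeps in the nat table
sudo iptables -t nat -S CILIUM_POST_nat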
Epilogue
Before getting my hands dirty with Cilium, I even thought we wouldn't need iptables at all, but that's not true, at least with the kernel version (4.19) I am using. Cilium offers many cool things (for me it's the WireGuard transparent encryption and the Maglev load balancer), but many of them are still in beta or require a much newer kernel, which is a pity. There are still times we have to stick with the old stuff, and understanding it a bit better does no harm, especially when it's our old, legendary iptables.
Hope this helps.