✉️ In Today’s Edition

In this edition, we will understand conntrack through real Kubernetes networking scenarios and see why it plays a critical role behind Kubernetes Services, kube-proxy, NAT, and DNS traffic.

You will learn:

  • What conntrack is and why it exists

  • Why Kubernetes Services depend on it

  • How to inspect the conntrack table

  • What happens when the table gets full

  • How to troubleshoot and fix conntrack exhaustion in production

Let's get started.

What is Conntrack?

Conntrack is short for connection tracking. It is a feature of the Linux kernel's netfilter (network packet filtering) framework.

Think of it as a memory table inside the Linux kernel that remembers every active network connection.

For each connection, here is what conntrack remembers.

  • The source IP and port

  • The destination IP and port

  • Where it actually got redirected to (after NAT)

  • The current state of the connection (NEW, ESTABLISHED, etc.)

  • A timeout for when to forget about it

The following image illustrates it better.

It makes more sense to understand conntrack through a real use case, and Kubernetes is one of the best examples. Let’s see how conntrack works in Kubernetes in detail.

Conntrack in Kubernetes

In Kubernetes, conntrack is heavily used by kube-proxy to track connections when traffic is routed through Services.

For example, when you create a Service, Kubernetes gives it a virtual IP. This IP doesn't exist on any network interface. No pod has it. No node has it. It's just a label Kubernetes uses to represent a group of pods.

So how does traffic to this fake IP actually reach your pods?

Let's break this down. Here is what happens internally.

  • A client pod sends a request to the Service

  • The packet hits the node's kernel, where iptables rules (set up by kube-proxy) intercept it

  • The kernel rewrites the destination from the Service to an actual Pod. This is called DNAT (Destination Network Address Translation)

  • The packet goes to the real pod

So far so good. But now the pod needs to send a response back. And here's where it gets interesting.

The pod has no idea the original request was ever meant for a Service. From its point of view, it just got a packet from some client and needs to answer it. So it sends the response back with itself as the sender.

But the client never talked to the pod. The client talked to the Service. If a reply shows up from a pod IP it has never contacted, the client's kernel finds no matching connection and discards it.

This is where conntrack plays its role.

During DNAT, conntrack stored a mapping between the original request and the pod it was redirected to. So when the reply packet comes back, conntrack matches it against that mapping and rewrites the source IP back to the Service IP.
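To see this mapping on a node, here is a minimal sketch that reads conntrack entries (one per line, as printed by sudo conntrack -L) and prints the ones where DNAT happened. It assumes an entry is DNATed when the original destination differs from the source of the expected reply. The function name and the sample IPs are made up for illustration.

```shell
# Print "ORIGINAL_DST:PORT -> REPLY_SRC:PORT" for each DNATed entry on stdin.
# Assumption: the first dst=/dport= pair is the original direction and the
# second src=/sport= pair is the reply direction.
show_dnat_mappings() {
    awk '
    {
        od = ""; op = ""; rs = ""; rp = ""
        ndst = 0; ndport = 0; nsrc = 0; nsport = 0
        for (i = 1; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "dst")   { ndst++;   if (ndst == 1)   od = kv[2] }
            if (kv[1] == "dport") { ndport++; if (ndport == 1) op = kv[2] }
            if (kv[1] == "src")   { nsrc++;   if (nsrc == 2)   rs = kv[2] }
            if (kv[1] == "sport") { nsport++; if (nsport == 2) rp = kv[2] }
        }
        # The entry was rewritten if the reply does not come from the
        # address and port the client originally targeted.
        if (od != "" && rs != "" && (od != rs || op != rp))
            print od ":" op " -> " rs ":" rp
    }'
}

# Usage on a node (needs root):
#   sudo conntrack -L 2>/dev/null | show_dnat_mappings
```

For a client that called a Service at 10.96.0.10:80 and was redirected to a pod at 10.244.2.7:8080, the output line would pair those two addresses.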

The client gets a clean reply that looks like it came straight from the Service. It has no idea any translation ever happened.

If you are setting up kubeadm clusters, you can tune the conntrack values through the conntrack section of KubeProxyConfiguration.
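As a rough sketch, the snippet below writes a KubeProxyConfiguration document that can be added as an extra YAML document to a kubeadm config file. The values shown are the kube-proxy defaults as far as I know, not tuning recommendations, and the file name and function names are hypothetical. The second function illustrates how kube-proxy is documented to derive the limit: the larger of maxPerCore multiplied by the CPU core count, and min.

```shell
# Write a KubeProxyConfiguration document with a conntrack section.
# The output path is an assumption; pass your own as the first argument.
write_kubeproxy_conntrack_config() {
    out="${1:-kube-proxy-conntrack.yaml}"
    cat > "$out" <<'EOF'
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
conntrack:
  maxPerCore: 32768
  min: 131072
  tcpEstablishedTimeout: 24h0m0s
  tcpCloseWaitTimeout: 1h0m0s
EOF
}

# Estimate the nf_conntrack_max that kube-proxy would apply on a node:
# max(maxPerCore * cores, min).
effective_conntrack_max() {
    cores="$1"
    per_core="${2:-32768}"
    min="${3:-131072}"
    total=$((cores * per_core))
    if [ "$total" -lt "$min" ]; then
        total="$min"
    fi
    echo "$total"
}
```

With these defaults, a 2-core node ends up at the 131072 minimum, while an 8-core node gets 8 x 32768 = 262144 entries.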

Viewing the Node's Conntrack Table

You can inspect the conntrack table on any Linux machine using the conntrack CLI tool.

The conntrack CLI tool is usually not installed by default, so you have to install it manually.

For Ubuntu, use the following command to install it.

sudo apt install conntrack

Then run the following command to list the tracked connections.

sudo conntrack -L

You will get multiple lines of tracked connections.

tcp  6 431984 ESTABLISHED
  src=192.167.0.94 dst=172.30.1.2 sport=59860 dport=6443
  src=172.30.1.2 dst=192.167.0.94 sport=6443 dport=59860 
  [ASSURED] mark=0 use=1

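This is one tracked connection. In this entry:

  • The first line shows the protocol (tcp, protocol number 6), the seconds left before the entry expires (431984), and the TCP state.

  • The second line is the original direction: the packet as the client sent it.

  • The third line is the reply direction that conntrack expects, with source and destination reversed. When a Service is involved, the source here is the real pod IP instead of the Service IP. In this entry both lines show 172.30.1.2, so no NAT was applied.

  • ASSURED means conntrack has seen traffic in both directions, so the entry is not dropped early when the table is under pressure.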
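On a busy node the raw list is too long to read, so a summary helps. The sketch below counts entries per protocol and TCP state. It assumes the field layout shown above (protocol in field 1, TCP state in field 4); the function name is made up.

```shell
# Count conntrack entries per protocol and TCP state, busiest first.
# Input: one conntrack entry per line on stdin.
summarize_conntrack_states() {
    awk '
    {
        # Field 1 is the protocol name; for TCP, field 4 is the state.
        # Other protocols such as UDP have no state column.
        state = ($1 == "tcp") ? $4 : "-"
        count[$1 " " state]++
    }
    END {
        for (k in count) print count[k], k
    }' | sort -k1,1nr -k2
}

# Usage on a node (needs root):
#   sudo conntrack -L 2>/dev/null | summarize_conntrack_states
```

A large TIME_WAIT or ESTABLISHED count here is a hint about which timeout is worth looking at first.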

Conntrack Connection States

Every connection tracked by conntrack is in one of the following states.

  1. NEW - First packet of a connection that conntrack has never seen before.

  2. ESTABLISHED - Both sides have exchanged packets and a two-way conversation is active.

  3. RELATED - A new connection linked to an existing one, like FTP data transfers.

  4. INVALID - Packet does not match any known connection and is usually dropped.

  5. TIME_WAIT - A TCP-level state rather than a netfilter match state like the four above. The connection is closing and conntrack is waiting before removing the entry.

Conntrack Table Exhaustion Problem

This is an actual production level issue. The conntrack table has a maximum size.

Kubernetes generates huge amounts of tracked traffic, much of it NATed, because of ClusterIP Services, kube-proxy iptables mode, readiness/liveness probes, service mesh traffic, etc.

So, in a busy cluster with hundreds of pods making thousands of connections, that table fills up fast.

Many Linux systems default to values around 131072 entries, though the actual value depends on kernel and system memory.

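If the conntrack table is full, the kernel drops packets that would create a new entry and logs "nf_conntrack: table full, dropping packet". On the cluster, this shows up as: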

  • Random connection timeouts

  • Intermittent DNS failures

  • API calls that fail with no clear error

  • Works sometimes, fails sometimes behavior

  • Services that appear healthy but connections randomly drop

To check if the conntrack table is full or close to full, use the following commands.

cat /proc/sys/net/netfilter/nf_conntrack_count

cat /proc/sys/net/netfilter/nf_conntrack_max

The first shows current usage. The second shows the limit. If count is getting close to max, the node is about to run out of conntrack entries.
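Comparing two raw numbers by eye gets tedious, so here is a small sketch that turns them into a percentage. The function name is made up, and the two paths default to the files shown above; they can be overridden for testing.

```shell
# Print conntrack table usage as a whole-number percentage.
# Arguments (optional): path to the count file, path to the max file.
conntrack_usage_percent() {
    count_file="${1:-/proc/sys/net/netfilter/nf_conntrack_count}"
    max_file="${2:-/proc/sys/net/netfilter/nf_conntrack_max}"
    count=$(cat "$count_file") || return 1
    max=$(cat "$max_file") || return 1
    echo $((count * 100 / max))
}

# Usage on a node:
#   conntrack_usage_percent
# To confirm drops already happened, the kernel log can be searched:
#   sudo dmesg | grep 'nf_conntrack: table full'
```

For example, a count of 98304 against a max of 131072 prints 75. What counts as "too close" is a judgment call; treating anything above roughly 80 percent as a warning is one possible convention, not a rule from this article.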

How to Fix Conntrack Table Exhaustion

Once you identify conntrack exhaustion, there are several ways to mitigate and prevent the issue.

1. Increase the conntrack table size

This is the fastest and most common mitigation.

You can increase the limit with sysctl:

sudo sysctl -w net.netfilter.nf_conntrack_max=524288

To make it persistent, add it to /etc/sysctl.conf:

net.netfilter.nf_conntrack_max=524288

sudo sysctl -p
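If you script this across many nodes, appending the line blindly leaves duplicate entries after repeated runs. The following is one possible sketch of an idempotent helper: it replaces any existing assignment of the key and then writes the new one. The function name is hypothetical, and it has to run as root to edit /etc/sysctl.conf.

```shell
# Set key=value in a sysctl config file without creating duplicates.
# Arguments: key, value, optional file path (defaults to /etc/sysctl.conf).
set_sysctl_conf() {
    key="$1"
    value="$2"
    file="${3:-/etc/sysctl.conf}"
    touch "$file" || return 1
    tmp=$(mktemp) || return 1
    # Keep every line that does not assign this key.
    awk -v k="$key" '
        { line = $0; sub(/^[ \t]+/, "", line); split(line, kv, /[ \t]*=/) }
        kv[1] != k { print }
    ' "$file" > "$tmp"
    echo "$key=$value" >> "$tmp"
    # Overwrite in place so the original file permissions are preserved.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage (as root):
#   set_sysctl_conf net.netfilter.nf_conntrack_max 524288
#   sysctl -p
```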

For Ubuntu worker nodes, values like 262144 or higher are commonly used in production clusters. Large clusters may require much higher values depending on traffic volume.

Warning:

Increasing the limit alone is not a permanent solution. Larger conntrack tables consume more memory, increase lookup overhead and can hide underlying traffic problems. So this should be treated as an immediate mitigation, not the only fix.

2. Reduce TCP connection timeouts

By default, conntrack keeps idle ESTABLISHED connections for 5 days and TIME_WAIT connections for 120 seconds.

In busy Kubernetes clusters, thousands of stale connections can accumulate and waste conntrack table space.

Reducing timeout values helps clean up old entries faster.

sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600

sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
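Before lowering the timeout, it helps to know how many entries are actually sitting idle. Conntrack resets an entry's timer whenever it sees a packet, so the idle time can be estimated as the configured established timeout minus the remaining seconds in field 3. The sketch below lists ESTABLISHED entries idle for longer than a threshold. The function name is made up, and it assumes the node still uses the 432000-second default unless you pass another value.

```shell
# List ESTABLISHED TCP entries idle for at least MIN_IDLE seconds.
# Arguments: MIN_IDLE (default 3600), established timeout (default 432000).
# Input: one conntrack entry per line on stdin.
list_idle_established() {
    min_idle="${1:-3600}"
    established_timeout="${2:-432000}"
    awk -v min_idle="$min_idle" -v max="$established_timeout" '
        $1 == "tcp" && $4 == "ESTABLISHED" {
            # Field 3 is the number of seconds left before expiry.
            idle = max - $3
            if (idle >= min_idle) print idle, $5, $6, $8
        }'
}

# Usage on a node (needs root):
#   sudo conntrack -L 2>/dev/null | list_idle_established 3600
```

The entry shown earlier has 431984 seconds left, so it has been idle for only 16 seconds and would not be listed. If many entries show up here and none belong to connection pools you care about, a shorter timeout is likely safe.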

Note: Don't set timeout_established too low if your application uses long-lived idle connections like database connection pools. Conntrack will drop the idle entry, and the next packet on that connection will be marked INVALID.

3. Scale nodes horizontally

Because nf_conntrack_max is set per node, adding worker nodes spreads pods and their traffic across the cluster, so each node's conntrack table tracks fewer connections.

For example,

  • 1000 pods on 5 nodes is 200 pods per node, which means very high conntrack pressure.

  • 1000 pods on 20 nodes is 50 pods per node, which means lower conntrack pressure per node.
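To put rough numbers on this, the sketch below estimates conntrack entries per node. It assumes pods are spread evenly and that every pod holds the same number of tracked connections, which is a simplification; the 500 connections per pod used in the usage lines is a made-up figure.

```shell
# Estimate conntrack entries per node.
# Arguments: total pods, node count, tracked connections per pod.
estimate_entries_per_node() {
    pods="$1"
    nodes="$2"
    conns_per_pod="$3"
    # Round pods per node up, then multiply by connections per pod.
    echo $(( (pods + nodes - 1) / nodes * conns_per_pod ))
}

# Usage:
#   estimate_entries_per_node 1000 5 500
#   estimate_entries_per_node 1000 20 500
```

Under these assumptions, 5 nodes carry about 100000 entries each, which is already near a 131072 limit, while 20 nodes carry about 25000 each.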

4. NodeLocal DNSCache

NodeLocal DNS Cache is not enabled by default in Kubernetes. It must be deployed manually as a DaemonSet (except on managed platforms like GKE Autopilot where it's now default).

Without a local cache, every query consumes a conntrack entry.

Pod → kube-dns ClusterIP → (DNAT via kube-proxy rules) → CoreDNS Pod

With NodeLocal DNS cache, DNS queries are answered locally.

Pod → Local DNS Cache (same node) → [cache miss] → kube-dns over TCP

NodeLocal DNSCache also uses NOTRACK iptables rules for local DNS traffic, allowing many DNS requests to bypass conntrack entirely.
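To judge whether DNS is a big contributor on a given node, you can count how many conntrack entries target port 53. This is a minimal sketch with a made-up function name; it checks only the first dport field of each entry, which is the original direction.

```shell
# Count conntrack entries whose original destination port is 53 (DNS).
# Input: one conntrack entry per line on stdin.
count_dns_entries() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^dport=/) {
                # Only the first dport belongs to the original direction.
                if ($i == "dport=53") n++
                break
            }
        }
    }
    END { print n + 0 }'
}

# Usage on a node (needs root):
#   sudo conntrack -L 2>/dev/null | count_dns_entries
```

Comparing this number with the total entry count before and after deploying NodeLocal DNSCache gives a rough idea of how much table space DNS was using.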

Production Issue:

You can read this incident postmortem to understand the real issues caused by conntrack table exhaustion.

Conclusion

Conntrack is one of the most critical and overlooked parts of Kubernetes networking.

Even modern datapaths like Cilium that use eBPF and can replace kube-proxy do not completely eliminate conntrack usage.

Why?

Because external SNAT, kernel NAT, and several stateful networking operations still rely on connection tracking internally.

This means conntrack exhaustion is not limited to iptables-based clusters.
It can also happen in:

  • nftables environments

  • eBPF-based clusters

  • service mesh deployments

  • high DNS traffic workloads

That is why monitoring conntrack usage is extremely important in production clusters. Prometheus with node-exporter exposes metrics such as:

  • node_nf_conntrack_entries

  • node_nf_conntrack_entries_limit

It helps you identify overloaded nodes.
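For a quick check without a full Prometheus setup, the two metrics can be read straight from a node-exporter endpoint. The sketch below parses the text output and prints the usage percentage. The function name is made up, and the URL in the usage line assumes node-exporter listens on its usual port 9100 on the local node.

```shell
# Print conntrack usage (percent, one decimal) from node-exporter metrics.
# Input: Prometheus text exposition format on stdin.
# Roughly the same ratio in PromQL would be:
#   node_nf_conntrack_entries / node_nf_conntrack_entries_limit
conntrack_usage_from_metrics() {
    awk '
        $1 == "node_nf_conntrack_entries"       { entries = $2 }
        $1 == "node_nf_conntrack_entries_limit" { limit = $2 }
        END {
            if (limit > 0) printf "%.1f\n", entries * 100 / limit
            else exit 1
        }'
}

# Usage:
#   curl -s http://localhost:9100/metrics | conntrack_usage_from_metrics
```

The function returns a non-zero exit code when the limit metric is missing, which makes it easy to spot nodes where the conntrack collector is not reporting.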
