✉️ In Today’s Edition
In this edition, we will understand conntrack through real Kubernetes networking scenarios and see why it plays a critical role behind Kubernetes Services, kube-proxy, NAT, and DNS traffic.
You will learn:
What conntrack is and why it exists
Why Kubernetes Services depend on it
How to inspect the conntrack table
What happens when the table gets full
How to troubleshoot and fix conntrack exhaustion in production
Let's get started.
What is Conntrack?
Conntrack is short for connection tracking. It is a feature of the Linux kernel's netfilter (network packet filtering) framework.
Think of it as a memory table inside the Linux kernel that remembers every active network connection.
For each connection, here is what conntrack remembers.
The source IP and port
The destination IP and port
Where it actually got redirected to (after NAT)
The current state of the connection (NEW, ESTABLISHED, etc.)
A timeout for when to forget about it
The following image illustrates it better.

It makes more sense to understand conntrack through a real use case, and Kubernetes is one of the best examples. Let’s see how conntrack works in Kubernetes in detail.
Conntrack in Kubernetes
In Kubernetes, conntrack is heavily used by kube-proxy to track connections when traffic is routed through Services.
For example, when you create a Service, Kubernetes gives it a virtual IP (the ClusterIP). In kube-proxy's iptables mode, this IP doesn't exist on any network interface. No pod has it. No node has it. It's just a label Kubernetes uses to represent a group of pods.
So how does traffic to this fake IP actually reach your pods?
Let's break this down. Here is what happens internally.
A client pod sends a request to the Service
The packet hits the node's kernel, where iptables rules (set up by kube-proxy) intercept it
The kernel rewrites the destination from the Service IP to an actual Pod IP. This is called DNAT (Destination Network Address Translation)
The packet goes to the real pod
So far so good. But now the pod needs to send a response back. And here's where it gets interesting.
The pod has no idea the original request was ever meant for a Service. From its point of view, it just got a packet from some client and needs to answer it. So it sends the response back with itself as the sender.
But the client never talked to the pod. The client talked to the Service. If a reply shows up from a pod IP it has never heard of, it matches no open connection on the client, so the client's kernel discards it.
This is where conntrack plays its role.
During DNAT, conntrack stored a mapping of the original request and where it got redirected to. So when the reply packet comes back, conntrack matches it against that mapping and rewrites the source IP back to the Service IP.
The client gets a clean reply that looks like it came straight from the Service. It has no idea any translation ever happened.
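The round trip above can be sketched as a toy model. This is only an illustration of the bookkeeping, not how the kernel implements it: the `ToyConntrack` class, the `Packet` type, and all IP addresses are hypothetical, and ports are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


class ToyConntrack:
    """Toy model of the NAT bookkeeping described above (not kernel code)."""

    def __init__(self):
        # Maps the reply direction (pod IP, client IP) back to the Service IP.
        self._reverse = {}

    def dnat(self, packet, pod_ip):
        """Rewrite Service IP -> pod IP and remember how to undo it."""
        self._reverse[(pod_ip, packet.src)] = packet.dst
        return Packet(src=packet.src, dst=pod_ip)

    def reverse_nat(self, reply):
        """Rewrite the reply's source back to the Service IP, if tracked."""
        service_ip = self._reverse.get((reply.src, reply.dst))
        if service_ip is None:
            return reply  # untracked traffic passes through unchanged
        return Packet(src=service_ip, dst=reply.dst)


table = ToyConntrack()
request = Packet(src="10.244.1.5", dst="10.96.0.10")  # client pod -> Service IP
forwarded = table.dnat(request, pod_ip="10.244.2.7")  # a backend pod is chosen
reply = Packet(src=forwarded.dst, dst=forwarded.src)  # pod answers as itself
seen_by_client = table.reverse_nat(reply)
print(seen_by_client)
```

The client only ever sees the Service IP as the source of the reply, which is exactly why losing the conntrack entry mid-connection breaks the flow.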

If you are setting up kubeadm clusters, you can tune the conntrack values through the conntrack section of KubeProxyConfiguration, as shown below.
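To get a feel for how those settings turn into a per-node limit, here is a small sketch. It assumes kube-proxy derives the limit from the `conntrack.maxPerCore` and `conntrack.min` fields, with defaults of 32768 and 131072; treat both the field names and the defaults as assumptions to verify against the kube-proxy version you run. The function name is hypothetical.

```python
def effective_conntrack_max(cpu_cores, max_per_core=32768, minimum=131072):
    """Estimate the nf_conntrack_max that kube-proxy would apply on a node.

    Assumption: the limit is max(max_per_core * cores, minimum), and a
    max_per_core of 0 means kube-proxy leaves the existing limit alone.
    """
    if max_per_core <= 0:
        return None
    return max(max_per_core * cpu_cores, minimum)


for cores in (2, 4, 8, 16):
    print(cores, "cores ->", effective_conntrack_max(cores))
```

Under these assumptions, small nodes all end up at the minimum, and the limit only starts scaling with CPU count once `max_per_core * cores` exceeds it.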

Viewing the Node's Conntrack Table
You can inspect the conntrack table on any Linux machine using the conntrack CLI tool.
The conntrack CLI tool isn't installed by default, so you have to install it manually.
For Ubuntu, use the following command to install it.
sudo apt install conntrack
Then run the following command to see the conntrack table on any Linux machine.
sudo conntrack -L
You will get multiple lines of tracked connections.
tcp 6 431984 ESTABLISHED
src=192.167.0.94 dst=172.30.1.2 sport=59860 dport=6443
src=172.30.1.2 dst=192.167.0.94 sport=6443 dport=59860
[ASSURED] mark=0 use=1
This is one of the tracked connections. In this entry:
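The first line shows the protocol (tcp, protocol number 6), the seconds left before the entry expires, and the connection state.
The second line is the original packet direction, from client to service.
The third line is the reply direction, from the pod back to the client, with source and destination reversed. When DNAT has been applied, the source here is the real pod IP rather than the Service IP.
ASSURED means conntrack has seen traffic in both directions, so the connection is fully established.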
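If you want to work with these entries programmatically, a small parser helps. The sketch below assumes the entry is on a single line, which is how the conntrack tool normally prints it, and only extracts the fields discussed above. The function name and the returned dictionary layout are hypothetical.

```python
def parse_conntrack_entry(entry):
    """Parse one line of `conntrack -L` output into a small dictionary."""
    tokens = entry.split()
    parsed = {
        "protocol": tokens[0],
        "timeout_seconds": int(tokens[2]),
        "state": None,  # stays None for protocols without a state, e.g. UDP
        "original": {},
        "reply": {},
        "flags": [],
    }
    for token in tokens[3:]:
        if token.startswith("[") and token.endswith("]"):
            parsed["flags"].append(token.strip("[]"))
        elif "=" in token:
            key, value = token.split("=", 1)
            if key in ("src", "dst", "sport", "dport"):
                # The first tuple is the original direction, the second the reply.
                direction = "original" if key not in parsed["original"] else "reply"
                parsed[direction][key] = value
        elif parsed["state"] is None:
            parsed["state"] = token
    return parsed


entry = (
    "tcp 6 431984 ESTABLISHED "
    "src=192.167.0.94 dst=172.30.1.2 sport=59860 dport=6443 "
    "src=172.30.1.2 dst=192.167.0.94 sport=6443 dport=59860 "
    "[ASSURED] mark=0 use=1"
)
parsed = parse_conntrack_entry(entry)
print(parsed["state"], parsed["original"], parsed["reply"], parsed["flags"])
```

Comparing `original["dst"]` with `reply["src"]` is a quick way to spot NAT: if they differ, the destination was rewritten.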
Conntrack Connection States
Every connection tracked by conntrack is in one of the following states.
NEW - First packet of a connection that conntrack has never seen before.
ESTABLISHED - Both sides have exchanged packets and a two-way conversation is active.
RELATED - A new connection linked to an existing one, like FTP data transfers.
INVALID - Packet does not match any known connection and is usually dropped.
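TIME_WAIT - Connection is closing and conntrack is waiting before removing the entry. Unlike the four states above, this is a TCP-specific state that conntrack records for each TCP entry.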
Conntrack Table Exhaustion Problem
This is an actual production-level issue. The conntrack table has a maximum size.
Kubernetes generates huge amounts of NAT traffic because of ClusterIP Services, kube-proxy iptables mode, readiness/liveness probes, service mesh traffic, and so on.
So, in a busy cluster with hundreds of pods making thousands of connections, that table fills up fast.
Many Linux systems default to values around 131072 entries, though the actual value depends on kernel and system memory.
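When the conntrack table is full, the kernel drops packets that would create new entries and logs "nf_conntrack: table full, dropping packet". This shows up as: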
Random connection timeouts
Intermittent DNS failures
API calls that fail with no clear error
"Works sometimes, fails sometimes" behavior
Services that appear healthy but connections randomly drop
To check whether the conntrack table is full or about to fill up, use the following commands.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
The first shows current usage. The second shows the limit. If count is getting close to max, the node is at risk of dropping new connections.
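The same check can be scripted. This is a minimal sketch that reads the two files above and reports how full the table is. The 80% warning threshold and the function names are assumptions chosen for illustration, not a recommended standard.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(proc_dir=PROC_DIR):
    """Return (count, limit, ratio) read from the nf_conntrack proc files."""
    count = int((Path(proc_dir) / "nf_conntrack_count").read_text())
    limit = int((Path(proc_dir) / "nf_conntrack_max").read_text())
    return count, limit, count / limit


def describe(count, limit, warn_ratio=0.8):
    """Format the usage, flagging it once it crosses warn_ratio (assumed 80%)."""
    ratio = count / limit
    status = "WARNING" if ratio >= warn_ratio else "ok"
    return f"{status}: {count}/{limit} entries ({ratio:.0%} used)"


if __name__ == "__main__":
    if (PROC_DIR / "nf_conntrack_count").exists():
        count, limit, _ = conntrack_usage()
        print(describe(count, limit))
    else:
        print("nf_conntrack is not loaded on this machine")
```

Run it on the node itself (or from a privileged pod that shares the host network namespace), because the values are per node.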
How to Fix Conntrack Table Exhaustion
Once you identify conntrack exhaustion, there are several ways to mitigate and prevent the issue.
1. Increase the conntrack table size
This is the fastest and most common mitigation.
You can increase the limit with sysctl:
sudo sysctl -w net.netfilter.nf_conntrack_max=524288
To make it persistent, add the following line to /etc/sysctl.conf and reload:
net.netfilter.nf_conntrack_max=524288
sudo sysctl -p
For Ubuntu worker nodes, values like 262144 or higher are commonly used in production clusters. Large clusters may require much higher values depending on traffic volume.
Warning:
Increasing the limit alone is not a permanent solution. Larger conntrack tables consume more memory, increase lookup overhead and can hide underlying traffic problems. So this should be treated as an immediate mitigation, not the only fix.
2. Reduce TCP connection timeouts
By default, conntrack keeps idle ESTABLISHED connections for 5 days (432000 seconds) and TIME_WAIT connections for 120 seconds.
In busy Kubernetes clusters, thousands of stale connections can accumulate and waste conntrack table space.
Reducing timeout values helps clean up old entries faster.
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600
sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
Note: Don't set nf_conntrack_tcp_timeout_established too low if your application uses long-lived idle connections such as database connection pools. Conntrack will drop the idle entry, and the next packet on that connection may be treated as INVALID.
3. Scale nodes horizontally
Because nf_conntrack_max is set per node, adding worker nodes and spreading pods across them means fewer connections per node and less pressure on each conntrack table.
For example,
1000 pods on 5 nodes means very high conntrack pressure
1000 pods on 20 nodes means lower conntrack pressure per node.
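A quick back-of-the-envelope calculation makes the difference concrete. The sketch below assumes pods are spread evenly and that each pod holds about 500 tracked connections. Both are hypothetical figures for illustration, and the 131072 limit is the example value mentioned earlier.

```python
def entries_per_node(total_pods, nodes, connections_per_pod):
    """Rough estimate of tracked connections per node, assuming an even spread."""
    pods_per_node = total_pods / nodes
    return pods_per_node * connections_per_pod


TABLE_LIMIT = 131072  # example limit; check nf_conntrack_max on your nodes

for nodes in (5, 20):
    estimate = entries_per_node(1000, nodes, connections_per_pod=500)
    print(
        f"{nodes} nodes: ~{estimate:,.0f} tracked connections per node "
        f"({estimate / TABLE_LIMIT:.0%} of a {TABLE_LIMIT}-entry table)"
    )
```

Real traffic is rarely this even, so treat the result as a lower bound for the busiest node rather than a forecast.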
4. NodeLocal DNSCache
NodeLocal DNS Cache is not enabled by default in Kubernetes. It must be deployed manually as a DaemonSet (except on managed platforms like GKE Autopilot where it's now default).
Without a local cache, every query consumes a conntrack entry.
Pod → kube-dns ClusterIP → (DNAT via kube-proxy rules) → CoreDNS Pod
With NodeLocal DNS cache, DNS queries are answered locally.
Pod → Local DNS Cache (same node) → [cache miss] → kube-dns over TCP

NodeLocal DNSCache also uses NOTRACK iptables rules for local DNS traffic, allowing many DNS requests to bypass conntrack entirely.
Production Issue:
You can read this incident postmortem to understand the real issues caused by conntrack table exhaustion.
Conclusion
Conntrack is one of the most critical and overlooked parts of Kubernetes networking.
Even modern datapaths like Cilium that use eBPF and can replace kube-proxy do not completely eliminate conntrack usage.
Why?
Because external SNAT, kernel NAT, and several stateful networking operations still rely on connection tracking internally.
This means conntrack exhaustion is not limited to iptables-based clusters.
It can also happen in:
nftables environments
eBPF-based clusters
service mesh deployments
high DNS traffic workloads
That is why monitoring conntrack usage is extremely important in production clusters. You can use Prometheus/node-exporter metrics such as:
node_nf_conntrack_entries
node_nf_conntrack_entries_limit
They help you identify overloaded nodes.
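As a rough illustration of how the two metrics combine, the sketch below parses a scrape in the Prometheus text format and computes the fill ratio for one node. The sample values and the function name are made up for the example. In practice you would usually express the same division as an alerting rule in Prometheus rather than in a script.

```python
def conntrack_ratio(metrics_text):
    """Return entries / limit from node-exporter metrics in text format."""
    wanted = ("node_nf_conntrack_entries", "node_nf_conntrack_entries_limit")
    values = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] in wanted:
            values[parts[0]] = float(parts[1])
    return values["node_nf_conntrack_entries"] / values["node_nf_conntrack_entries_limit"]


# Hypothetical scrape output from one node.
sample = """\
# TYPE node_nf_conntrack_entries gauge
node_nf_conntrack_entries 98304
# TYPE node_nf_conntrack_entries_limit gauge
node_nf_conntrack_entries_limit 131072
"""
ratio = conntrack_ratio(sample)
print(f"conntrack table is {ratio:.0%} full")
```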

