👋 Hi! I’m Bibin Wilson. In each edition, I share practical tips, guides, and the latest trends in DevOps and MLOps to make your day-to-day DevOps tasks more efficient. If someone forwarded this email to you, you can subscribe here to never miss out!
✉️ In Today’s Newsletter
Why Istio needed a new architecture
Overview of Istio architecture
Deep dive into key components: Istiod, Ztunnel, Waypoint Proxy, and CNI
Is Ztunnel a single point of failure?
Business use cases of Ambient Mesh, including cost benefits.
Hands-on guide to setting up Istio Ambient Mesh
Also, a list of remote DevOps and Cloud job opportunities.
🎁 For Istio & k8s Certification Aspirants
If you are preparing for Istio Certification (ICA) or Kubernetes certifications, this is a good chance to save money. Use code LUNAR26CT at kube.promo/devops to get a flat 35% discount on individual certifications.
For 50% bundle discounts, extra coupons, and other offers, check this GitHub repository for the full list.
Also, check out the Istio Certified Associate study guide.
Before we dive deep into the Istio Ambient Mesh architecture, let’s first understand the problem with sidecars. This will help you understand the advantages of Ambient Mode much better.
Why Did Istio Need a New Architecture?
In the classic Istio sidecar model, every pod in your cluster runs two containers. Your app, and an Envoy proxy (sidecar) running right next to it. All inbound and outbound traffic for that app pod flows through the sidecar.
The sidecar proxy also handles all the service mesh features like encryption, traffic policies, retries, and observability. But here is the problem. The cost adds up with every pod you deploy. For example, 10,000 pods = 10,000 proxies.
Also, upgrading Istio means restarting every pod in the mesh. That is a big maintenance headache.
The new Ambient mode removes the sidecar architecture entirely. Instead, it uses one proxy pod per node that handles traffic for all pods on that node. Meaning, you get the mesh capabilities without the per-pod sidecar proxy overhead.
The following image illustrates the difference. Same cluster, but huge difference in resource usage.

Note: Ambient mesh reached GA in Istio 1.24 (November 2024). ztunnel, waypoints, and all APIs are Stable and production-ready.
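To make the difference concrete, here is a rough sketch of how the proxy count scales in each model. The 10,000-pod figure comes from the example above. The 100-node figure and the function name are assumptions for illustration only, and optional waypoint proxies are left out.

```python
def proxy_count(pods: int, nodes: int, mode: str) -> int:
    """Return how many data plane proxies a cluster runs in each mode.

    Sidecar mode runs one Envoy per pod; ambient mode runs one ztunnel per node.
    Optional waypoint proxies are ignored in this sketch.
    """
    if mode == "sidecar":
        return pods
    if mode == "ambient":
        return nodes
    raise ValueError(f"unknown mode: {mode}")


pods, nodes = 10_000, 100  # node count is an assumed value

sidecar = proxy_count(pods, nodes, "sidecar")
ambient = proxy_count(pods, nodes, "ambient")
print(f"sidecar proxies: {sidecar}")
print(f"ambient proxies: {ambient}")
```

The sidecar count grows with every pod you deploy, while the ambient count only grows when you add nodes.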
Istio Architecture
Istio consists of the following key components.
Istio Control Plane (Istiod)
Ztunnel (The Per-Node Layer 4 Proxy)
Waypoint Proxy (For Layer 7 Traffic Management)
Istio Gateway (For Handling External Traffic)
Istio CNI (Node agent)
The image below shows a high-level (1000-foot) overview of Istio components.

Let's dive deep into each component and see how it works and how it fits into the overall mesh architecture.
Istiod (The Control Plane)
Think of Istiod as the brain of your service mesh. It doesn't handle any actual traffic. Instead, it tells all the proxy components (ztunnel, waypoint, etc.) what rules to follow.
Initially, the Istio control plane was split into three separate services: Galley for config validation, Pilot for traffic management, and Citadel for security. These are now merged into a single binary called Istiod (the Istio daemon).

Here is what Istiod does.
Istiod watches Kubernetes for Istio custom resources (like VirtualService, DestinationRule, etc.) that are created, updated, or deleted.
It then validates the CRD configs.
It then converts the routing and policy rules from CRDs into detailed configuration, and pushes them to all proxy components (sidecars, gateways, ztunnel, waypoint) using the xDS protocol.
Istiod also creates and manages certificates used for mutual TLS (mTLS) inside the mesh so services can authenticate and communicate securely.
If something changes in Kubernetes that affects the mesh (for example a pod stops), Istiod detects the Endpoint change and immediately pushes updated routing info to all proxies so they stop sending traffic to the dead pod.
💡 What is xDS? It is a set of APIs that Istiod uses to push configuration updates to proxies. If you want to understand more, please read this edition.
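The watch, translate, and push loop described above can be sketched as a toy model. This is not real xDS and not how Istiod is implemented; the class names, service name, and IP addresses are all hypothetical. It only shows the idea that an endpoint change leads to a fresh snapshot being pushed to every connected proxy.

```python
from dataclasses import dataclass, field


@dataclass
class Proxy:
    """Hypothetical stand-in for a ztunnel or waypoint that receives config."""
    name: str
    endpoints: dict[str, set[str]] = field(default_factory=dict)

    def apply(self, snapshot: dict[str, set[str]]) -> None:
        # Replace the local view with a copy of the latest pushed snapshot.
        self.endpoints = {svc: set(ips) for svc, ips in snapshot.items()}


class ToyControlPlane:
    """Toy model of the watch -> translate -> push loop (not real xDS)."""

    def __init__(self) -> None:
        self.proxies: list[Proxy] = []
        self.endpoints: dict[str, set[str]] = {}

    def register(self, proxy: Proxy) -> None:
        # A newly connected proxy immediately receives the current state.
        self.proxies.append(proxy)
        proxy.apply(self.endpoints)

    def on_endpoint_change(self, service: str, healthy_ips: set[str]) -> None:
        # A pod started or stopped: update state, then push to every proxy.
        self.endpoints[service] = set(healthy_ips)
        for proxy in self.proxies:
            proxy.apply(self.endpoints)


cp = ToyControlPlane()
ztunnel = Proxy("ztunnel-node-a")
waypoint = Proxy("waypoint-default")
cp.register(ztunnel)
cp.register(waypoint)

cp.on_endpoint_change("reviews", {"10.0.0.5", "10.0.0.6"})
cp.on_endpoint_change("reviews", {"10.0.0.5"})  # pod 10.0.0.6 stopped

print(sorted(ztunnel.endpoints["reviews"]))
```

After the second change, both proxies only know about the surviving pod, so neither sends traffic to the dead one.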
Ztunnel
Ztunnel stands for Zero Trust Tunnel. It is a Rust-based proxy and the core building block of Ambient Mesh.
Instead of running a proxy inside every pod, Ztunnel runs as a DaemonSet (one pod per node). All traffic from pods on that node flows through the Ztunnel on the same node first.
Ztunnel handles all Layer 3 and Layer 4 features like encryption, identity, and basic access control. It uses the HBONE protocol (HTTP-Based Overlay Network Environment) to create encrypted tunnels between services.

Here is how it works.
When traffic enters a node, ztunnel intercepts it using iptables by default. (You can also enable eBPF-based redirection.)
Once intercepted, it handles the Layer 3 and 4 traffic.
It then uses the HBONE protocol to create secure tunnels between services, ensuring zero trust communication.
It also enforces Layer 3 and 4 mTLS encryption, authentication, and authorization policies covering identity, IP addresses, and ports.
Throughout this process, it collects Layer 4 telemetry including TCP metrics and connection logs.
Behind the scenes, ztunnel communicates with the Istio daemon using xDS APIs to receive configuration updates dynamically.
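To illustrate what "Layer 4 only" means for policy, here is a minimal sketch of an allow check that looks at identity, source IP, and port, and nothing else. The rule shape, the function name, and the sample identities are hypothetical and much simpler than Istio's actual authorization model. Note that there is no HTTP path or header anywhere in the decision.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional


@dataclass(frozen=True)
class L4Rule:
    """Hypothetical allow rule; a field left as None matches anything."""
    source_identity: Optional[str] = None
    source_cidr: Optional[str] = None
    dest_port: Optional[int] = None


def allowed(rules: list[L4Rule], identity: str, source_ip: str, port: int) -> bool:
    """Return True if any rule matches the connection; otherwise deny."""
    for rule in rules:
        if rule.source_identity is not None and rule.source_identity != identity:
            continue
        if rule.source_cidr is not None and ip_address(source_ip) not in ip_network(rule.source_cidr):
            continue
        if rule.dest_port is not None and rule.dest_port != port:
            continue
        return True
    return False


FRONTEND = "spiffe://cluster.local/ns/shop/sa/frontend"  # sample identity
OTHER = "spiffe://cluster.local/ns/shop/sa/other"        # sample identity

rules = [
    L4Rule(source_identity=FRONTEND, dest_port=8080),
    L4Rule(source_cidr="10.0.0.0/16", dest_port=9090),
]

print(allowed(rules, FRONTEND, "10.1.2.3", 8080))
print(allowed(rules, OTHER, "10.1.2.3", 8080))
print(allowed(rules, OTHER, "10.0.4.2", 9090))
```

Anything that needs to inspect the request itself, such as a path or a header, is outside what this layer can decide.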
📈 Ztunnel Performance Benchmark
From Istio's 1.24 official benchmark data: a single ztunnel at 1,000 req/sec consumes approximately 0.06 vCPU and 12 MB of memory. That is a 3x reduction per proxy compared to sidecars.
Is Ztunnel a single point of failure (SPOF)?
If you compare this with the sidecar architecture, one common question people ask is,
Is ztunnel a single point of failure? What happens if ztunnel goes down on a node?
Since ztunnel runs as one pod per node, it may sound like a single point of failure. If ztunnel goes down, traffic to the pods on that specific node will be affected.
However, pods on other nodes are not impacted. Each node has its own ztunnel instance. Because ztunnel runs as a DaemonSet, Kubernetes will automatically restart it, just like it does for any other DaemonSet pod.
So, is ztunnel a SPOF? Not for the mesh as a whole. A failure affects only the pods on that one node.
The design assumes that nodes can fail, which is normal in distributed systems. Recovery is handled automatically by Kubernetes.
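To put a number on the failure domain, here is a small sketch. It assumes pods are spread evenly across nodes, and the 100-node cluster size is an illustrative value. The function name is hypothetical.

```python
def affected_pod_share(total_nodes: int, failed_ztunnels: int = 1) -> float:
    """Share of pods that lose mesh traffic, assuming an even pod spread."""
    if total_nodes <= 0 or not 0 <= failed_ztunnels <= total_nodes:
        raise ValueError("failed_ztunnels must be between 0 and total_nodes")
    return failed_ztunnels / total_nodes


share = affected_pod_share(total_nodes=100)  # assumed cluster size
print(f"{share:.0%} of pods affected until the ztunnel pod restarts")
```

The larger the cluster, the smaller the share of workloads a single ztunnel outage can touch.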
Waypoint Proxy
Ztunnel only understands TCP. Meaning, it has no idea about HTTP headers, request paths, or retry logic.
When you need Layer 7 features like HTTP routing, canary deployments, circuit breaking, rate limiting, or fault injection, you need to deploy a Waypoint Proxy (Envoy), which is an optional component.
Now, you may ask, why is Waypoint an optional component?
Well, not every service needs L7 features. So use it only when you actually need HTTP routing, circuit breaking, rate limiting, etc. (It's a design choice.)
Now, a key thing to understand is that the Waypoint proxy works on top of Ztunnel. It cannot function without Ztunnel being present. Ztunnel handles the L4 secure tunnel, and Waypoint sits inside that tunnel and handles L7 logic.

Here is how it works.
First, you need to enable L7 policies by adding a label to your Service or Namespace.
When a source ztunnel gets traffic destined for a labelled service, it knows from its xDS config to route it to the waypoint address instead of directly to the destination ztunnel.
The ztunnel then builds an HBONE tunnel to the waypoint proxy.
The waypoint proxy (Envoy) performs L7 processing (retries, traffic splitting, etc.).
After processing, the waypoint forwards traffic via another HBONE tunnel to the destination ztunnel, which delivers to the pod.
Note: There are different patterns for using waypoint like per namespace, per service, or multi namespace depending on the use case.
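The waypoint flow above can be sketched as a simple hop list. This is a toy model: the function, the service names, and the waypoint name are hypothetical, and the mapping from service to waypoint stands in for what ztunnel learns from its xDS config.

```python
from typing import Optional


def traffic_path(
    source_node: str,
    dest_service: str,
    dest_node: str,
    waypoint_for: dict[str, Optional[str]],
) -> list[str]:
    """Return the ordered hops for one request in a toy ambient mesh."""
    hops = [f"ztunnel@{source_node}"]
    waypoint = waypoint_for.get(dest_service)
    if waypoint is not None:
        # The destination service is enrolled for L7, so detour via the waypoint.
        hops.append(f"waypoint:{waypoint}")
    hops.append(f"ztunnel@{dest_node}")
    hops.append(f"pod:{dest_service}")
    return hops


# Only "reviews" is enrolled to use a waypoint in this example.
waypoint_for = {"reviews": "shop-waypoint"}

l7_path = traffic_path("node-a", "reviews", "node-b", waypoint_for)
l4_path = traffic_path("node-a", "ratings", "node-b", waypoint_for)
print(" -> ".join(l7_path))
print(" -> ".join(l4_path))
```

Services without a waypoint keep the shorter ztunnel-to-ztunnel path, which is why the waypoint can stay optional.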
Istio Gateway
Ztunnel and Waypoint handle east-west traffic, meaning internal service-to-service communication inside your cluster.
But what about traffic coming in from outside? That's what Istio Gateway handles.
It works similarly to a Kubernetes Ingress controller. When you create an Istio Gateway object, it spins up an external Load Balancer. All traffic entering the cluster goes through this gateway first.

Here is how it works.
External request hits the cloud Load Balancer
Load Balancer forwards traffic to the Istio Gateway pod
Istio Gateway applies ingress/egress rules (TLS termination, routing, etc.)
Traffic is forwarded to the correct internal service via Ztunnel
✅ Important Note: Istio fully supports the Kubernetes Gateway API (not just Istio's own Gateway resource). The Gateway API is more powerful and is the recommended approach for routing external traffic in modern clusters.
Istio CNI
Your app doesn't know Istio exists. There is no proxy injected into the pod.
So how does traffic from your pods actually end up at Ztunnel?
Well, that is the job of Istio CNI. It is a DaemonSet that runs on every node and creates iptables rules inside your pod's network namespace to redirect traffic to the local Ztunnel.

Here is how it works.
You label a namespace with istio.io/dataplane-mode=ambient to make it part of the mesh.
The Istio CNI node agent detects the label change.
For every pod in that namespace on its node, the CNI agent configures iptables rules inside that pod's network namespace to redirect traffic to the local ztunnel.
Pods in unlabeled namespaces get no iptables rules and their traffic flows normally, outside the mesh.
💡 Performance With eBPF: Instead of iptables, you can configure Istio CNI to use eBPF mode for lower latency and less CPU overhead.
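The selection logic in the steps above can be sketched as follows. This is a toy model of the decision only, not the CNI agent's code: the function name, namespaces, and pod names are hypothetical, and the only real identifier is the istio.io/dataplane-mode=ambient label from the text.

```python
AMBIENT_LABEL = "istio.io/dataplane-mode"


def pods_to_redirect(
    namespace_labels: dict[str, dict[str, str]],
    pods: list[dict[str, str]],
    node: str,
) -> list[str]:
    """Return names of pods on `node` whose namespace is labeled for ambient."""
    return [
        pod["name"]
        for pod in pods
        if pod["node"] == node
        and namespace_labels.get(pod["namespace"], {}).get(AMBIENT_LABEL) == "ambient"
    ]


namespace_labels = {
    "shop": {AMBIENT_LABEL: "ambient"},  # enrolled in the mesh
    "legacy": {},                        # unlabeled, stays outside the mesh
}
pods = [
    {"name": "cart-1", "namespace": "shop", "node": "node-a"},
    {"name": "cart-2", "namespace": "shop", "node": "node-b"},
    {"name": "batch-1", "namespace": "legacy", "node": "node-a"},
]

redirected = pods_to_redirect(namespace_labels, pods, "node-a")
print(redirected)
```

Each node agent only acts on pods scheduled to its own node, so cart-2 would be handled by the agent on node-b.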
The Business Case for Ambient Mesh
We used the Ambient mesh cost calculator for resource comparison between the traditional Istio Sidecar architecture and the Ambient Mesh architecture.
The comparison is based on a large-scale cluster with 10,000 pods, 50 namespaces, and 100 nodes.
The following image illustrates the significant infrastructure savings both in terms of compute resources and monetary cost (up to $432,674) that organizations can achieve by switching to Ambient mode.

In a sidecar model, every pod runs its own Envoy proxy. In this example, those 10,000 proxies consume a massive amount of CPU and RAM.
Because Ambient Mesh uses a shared Ztunnel per node (only 100 proxies total in this scenario) instead of one per pod, the CPU requirements drop from 1,000 vCPUs to just 20–74 vCPUs.
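As a quick check of the numbers above, the following sketch turns the quoted vCPU figures into a percentage reduction. The figures are the ones from this example; the helper name is hypothetical.

```python
def reduction(before: float, after: float) -> float:
    """Fractional saving when usage drops from `before` to `after`."""
    return (before - after) / before


sidecar_vcpus = 1_000            # 10,000 sidecar proxies
ambient_vcpus = (20, 74)         # range quoted for 100 ztunnels
pods = 10_000

per_sidecar = sidecar_vcpus / pods
best = reduction(sidecar_vcpus, ambient_vcpus[0])
worst = reduction(sidecar_vcpus, ambient_vcpus[1])

print(f"implied vCPU per sidecar: {per_sidecar}")
print(f"CPU reduction: {worst:.1%} to {best:.1%}")
```

Even at the upper end of the ambient range, the CPU requirement in this scenario drops by more than 90 percent.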
What's Next?
Now that you have a full understanding of how Ambient Mesh works, it's time to get your hands dirty with a sample implementation.
Below, I have shared an end-to-end tutorial where you set up Ambient Mesh and deploy a sample application to test both L4 and L7 traffic capabilities.
🧱 Hands on Guide to Istio Ambient Mode

By the end of this hands-on tutorial, you will have learned how to do the following.
Install Istio Ambient Mode using Helm charts
Set up a sample application with two versions running at the same time
Configure HTTPRoute for canary routing
Enable ROUND_ROBIN traffic policy using Istio DestinationRule
Validate canary traffic routing using requests
Understand Careem’s production case study using Gateway API with Istio
🧰 Remote Job Opportunities
Emplay - Senior GCP DevOps Engineer (5 yrs & above)
Trilogy - Senior DevOps Engineer
First Advantage - Senior DevOps Engineer (5+ yrs & above)
Smart Working - DevOps Engineer (5 yrs)
OpenTable - Site Reliability Engineer II (5+ yrs)
NEXGEN Cloud - DevOps Release Manager (4+ yrs)
Zimperium - Platform Engineer (DevOps) (3 - 5 yrs)
MNJ Software - DevOps Engineer (3 - 6 yrs)
FulfillmentIQ - DevOps Engineer (GCP) (4 - 6 yrs)
Outsource Bigdata - DevOps Engineer (Cloud Automation & IT Infrastructure) (4+ yrs)

