Kubernetes Built-in Features for AI/ML

👋 Hi! I’m Bibin Wilson. In each edition, I share practical tips, guides, and the latest trends in DevOps and MLOps to make your day-to-day DevOps tasks more efficient. If someone forwarded this email to you, you can subscribe here to never miss out!

🚀 Today’s Highlights

Here’s what we’re covering in this edition:

  • Catch up on missed updates and must-read blogs

  • Explore Kubernetes native AI/ML features

  • Key projects from the Kubernetes community for AI/ML

  • Google Cloud Internal Developer Platform (IDP), and more

📚 What You Might’ve Missed

In our last newsletter, we explored a new way to mount AWS S3 buckets using fstab — a handy option for managing large datasets and configs.

Native sidecars are now officially stable in Kubernetes 1.33! Check out our hands-on guide to start using them in your workloads.

Want to try GitOps? Our quick-start guide shows you how to set up ArgoCD in just 9 minutes.

Kubernetes Native AI/ML Features

It’s really important for people in the DevOps space to stay up to date with the latest AI/ML developments, especially when it comes to infrastructure management.

That’s why I wanted to put together this edition to share some insights into where Kubernetes fits into the AI/ML landscape, something I’ve personally been keeping a close eye on.

The Kubernetes community has started offering several built-in features to deploy, manage, and scale AI/ML applications efficiently.

Let's look at each of these features in detail.

1. Kubernetes Device Plugins (Stable)

GPUs are one of the key requirements for AI and ML applications.

To support this need, Kubernetes offers a feature called device plugins.

Here’s how it works:

  • Device plugins run on specific nodes (usually as DaemonSets). They register with the kubelet and communicate with it over gRPC.

  • They let nodes advertise their hardware (like NVIDIA or AMD GPUs) to the kubelet.

  • The kubelet reports this information to the Kubernetes API server, so the scheduler knows which nodes have GPUs.

Once the device plugin is set up, you can request a GPU in your Pod spec, like this:

resources:
  limits:
    nvidia.com/gpu: 1

The scheduler sees your GPU request, finds a node with an available NVIDIA GPU, and schedules the pod to that node.

Once the pod is scheduled, the kubelet invokes the device plugin's Allocate() method to reserve a specific GPU. The plugin returns the necessary details, such as the GPU device ID, and the kubelet uses this information to launch your container with the appropriate GPU configuration.

The following image illustrates the complete workflow of device plugins.

For example, in AWS EKS, you can use GPU-enabled nodes with the NVIDIA device plugin to support GPU-based ML workloads.
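
To tie this together, here's a minimal but complete Pod manifest requesting a GPU. It's a sketch: the container image and command are placeholders, and the nvidia.com/gpu resource name assumes the NVIDIA device plugin is running on the node.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
  - name: cuda-check
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # placeholder CUDA base image
    command: ["nvidia-smi"]                      # lists the GPU(s) allocated to the container
    resources:
      limits:
        nvidia.com/gpu: 1                        # handled by the NVIDIA device plugin

If the device plugin is healthy, the pod lands on a GPU node and nvidia-smi shows exactly one GPU inside the container.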

2. Mounting Container Images as Volumes (Beta)

This feature was introduced as alpha in Kubernetes 1.31 and graduated to beta in Kubernetes 1.33. It lets you mount OCI image volumes directly in Kubernetes pods.

OCI images are images that follow Open Container Initiative specifications.

You can use this feature to store binary artifacts inside images and mount them directly to pods.

This approach is especially helpful for machine learning projects involving large language models. Deploying LLMs often requires pulling models from various sources like cloud object storage or external URIs.

Packaging model data in OCI images makes it much easier to manage and switch between different models. A project already exploring this concept is KServe, which includes a feature called Modelcars.

Modelcars lets you use OCI images that contain model data. With native OCI volume support in Kubernetes, some of the existing challenges are reduced, streamlining the overall workflow.
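
As a rough sketch of how this looks in a Pod spec (the image references are placeholders, and the ImageVolume feature gate has to be enabled on your cluster):

apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  containers:
  - name: server
    image: my-inference-server:latest      # placeholder serving image
    volumeMounts:
    - name: model-weights
      mountPath: /models                   # model files from the OCI image appear here
  volumes:
  - name: model-weights
    image:                                 # OCI image volume source
      reference: registry.example.com/models/llama-3-8b:v1   # placeholder model artifact image
      pullPolicy: IfNotPresent

The serving container never has to bake the model into its own image; swapping models becomes a matter of changing the volume's image reference.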

3. Gateway API Inference Extension (Alpha)

Traditional load balancers treat all traffic equally based on URL paths or round‑robin rules.

But large language models (LLMs) are different:

  • Requests run longer (seconds to minutes) than typical web requests.

  • They often hold things in memory (like token caches or adapters).

  • Some requests need low latency (e.g., chat) while others can wait (e.g., batch).

Instead of treating AI model workloads like normal web traffic, Gateway API Inference Extension (built on top of Kubernetes’ Gateway API) adds model-aware routing to AI inference workloads.

  • It routes based on model identity, readiness, and request urgency.

  • It helps Kubernetes use GPUs more smartly and serve requests faster.

One of the key features it enables is body-based routing, where the system inspects the actual request body (like JSON payloads) to determine how and where to route the request.

💡 Body-based routing is a technique used in API gateways or load balancers where the routing decision is made by inspecting the contents of the HTTP request body, rather than just the method, path, headers, or query parameters.

This is especially useful for LLM workloads, where critical information like model name, priority, or task type often lives in the request body rather than the URL.
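
For example, an OpenAI-style chat completion request sent to /v1/chat/completions carries the model name in its JSON body, not in the URL (the model name and content below are purely illustrative):

{
  "model": "llama-3-8b-instruct",
  "messages": [
    { "role": "user", "content": "Summarize this incident report." }
  ],
  "stream": false
}

A body-based router parses the model field from this payload and forwards the request to a backend that actually serves that model, rather than routing purely on the path.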

Refer to the Gateway API Inference Extension documentation to learn more.

Other Kubernetes Community Projects

The following projects are maintained by the official Kubernetes community to address key challenges in managing AI/ML workloads on Kubernetes:

1. JobSet

When training big AI models, the job is split across many machines (GPU nodes). All parts (called workers) need to start at the same time and stay in sync. If one worker fails, the training state may get corrupted, and you may have to start over.

The JobSet API solves this by coordinating multiple interconnected jobs that must work together as a single unit. It helps you start, manage, and recover all these connected jobs together.
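
Here's a hedged sketch of a JobSet for a small multi-node training run. The API group and fields follow the JobSet project's v1alpha2 API as I understand it, and the image and GPU counts are placeholders:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: llm-training
spec:
  replicatedJobs:
  - name: workers
    replicas: 4                          # four coordinated Jobs, started and recovered together
    template:
      spec:
        completions: 1
        parallelism: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: trainer
              image: my-trainer:latest   # placeholder training image
              resources:
                limits:
                  nvidia.com/gpu: 8      # assumes 8 GPUs per node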

2. Kueue

When multiple teams compete for expensive GPU resources, jobs often wait inefficiently or get scheduled onto suboptimal hardware. Teams face unfair resource allocation, and GPUs can sit idle even while jobs wait in the queue.

Kueue helps fix this by:

  • Using queues with priorities, so more important jobs go first

  • Being aware of the hardware layout (topology), so it picks the best GPUs for each job

  • Making sure all teams get a fair share of the GPU resources
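
In practice, submitting a job through Kueue can be as simple as labeling it with a queue name. This is a sketch: the queue name is hypothetical, and the ClusterQueue/LocalQueue objects it points to are assumed to be set up by the platform team.

apiVersion: batch/v1
kind: Job
metadata:
  name: finetune-run
  labels:
    kueue.x-k8s.io/queue-name: team-ml-queue   # hypothetical LocalQueue name
spec:
  suspend: true                 # Kueue unsuspends the Job once quota is available
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: my-trainer:latest               # placeholder image
        resources:
          requests:
            nvidia.com/gpu: 1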

3. LeaderWorkerSet

When AI models are too big to run on one machine, you need to break them into parts and run them across multiple GPUs or nodes. But doing that is hard.

You have to:

  • Set up each deployment manually

  • Handle complex networking

  • Make sure all parts know how to find and talk to each other (called service discovery and coordination)

LeaderWorkerSet makes this easier. It gives you one simple API to deploy the full setup. It:

  • Automatically handles the networking

  • Coordinates everything

  • Treats the whole thing as one unit

This saves time and reduces errors when running large AI models across many machines.
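
For a feel of the API, here's a rough sketch of a LeaderWorkerSet that serves one model sharded across a leader and three workers. Field names follow the project's v1 API as I understand it; images and sizes are placeholders.

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-serving
spec:
  replicas: 2                    # two independent copies of the whole leader+worker group
  leaderWorkerTemplate:
    size: 4                      # group size: 1 leader + 3 workers
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: my-model-leader:latest    # placeholder image
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: my-model-worker:latest    # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1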

🔥 Keep Yourself Updated

GitHub Actions: Recent incidents highlight how attackers are abusing GitHub Actions to access secrets, poison workflows, and target supply chains. Read more.

Curious about real-world LLM use cases in apps? Swiggy is using small language models (SLMs) to help their app better understand what you're actually craving. Want to learn more? Read on.

For years, platform engineering has meant stitching together disjointed tools. In a new blog post, Google Cloud's Richard Seroter asks: Is this the end of that DIY approach? Get his hands-on take on the new, integrated Cloud Internal Developer Platform (IDP).

Key Kubernetes Cluster Configurations

What did you think of today's email?

Your feedback helps me create better guides for you!
