📦 Kubeflow for MLOps: A Practical Crash Course

✉️ In Today’s MLOps Edition

In the last edition, we set up Feast on Kubernetes as the ML feature store. Today we will look at Kubeflow Pipelines, a key tool used in MLOps, LLMOps, and GenAI workflows.

This edition covers:

Understanding the Kubeflow stack
How Kubeflow Pipelines work
The KFP architecture and execution flow
Creating a KFP pipeline from scratch on Kubernetes
A look into Cloudflare’s in-house MLOps platform
and more.

❝

Before You Continue: This is part of an ongoing MLOps series. You can check this repo to go through all the previous editions in order.

If you're planning to strengthen your Kubernetes skills and prepare for the CKA certification, our Kubernetes & CKA course is designed to help you learn through clear explanations, illustrations, real-world examples, and hands-on practice. It also includes 80+ CKA practice scenarios.

Get it Here: Complete Kubernetes & CKA Course

✉️ Missed Previous MLOps Editions?

We are in Phase 2 of the MLOps series, where we look at enterprise orchestration tools and workflows for machine learning.

In this edition, we will take the same employee attrition project and rebuild it the way production ML systems are actually designed. If you missed the previous editions, I recommend reading these Phase 2 editions first.

Data Versioning Fundamentals
Data Version Control (DVC) with AWS S3 (Hands-on)
Data Versioning using Airflow on Kubernetes (Hands-on)
Feature Store Fundamentals Explained (Hands-on)

In the Phase 1 data preparation and model training edition, we manually performed every step, including feature engineering, data preprocessing, model training, and hyperparameter tuning, to understand what model training involves.

In an enterprise environment, you need an orchestration platform that can automate the entire workflow: data preparation, model training, tuning, and retraining whenever new data becomes available.

That is where Kubeflow comes into the picture.

About Kubeflow

When it comes to running AI/ML workloads on Kubernetes, Kubeflow is a foundational platform.

When the project started (2017–2020), it was primarily built as an MLOps platform for Kubernetes to build and manage traditional ML pipelines.

Fast forward to 2026 and the platform has changed a lot. Kubeflow is no longer just an MLOps platform. It has become a cloud-native AI platform for building and operating GenAI, LLMOps, Agentic AI, distributed AI training, and fine-tuning foundation models.

The Kubeflow Stack

The Kubeflow ecosystem has nine subprojects at the time of writing this edition. Each project solves a specific use case in the AI/ML lifecycle.

The following image illustrates the key projects that are part of Kubeflow and what each does.

The best part is that these subprojects can run independently or as part of the full stack. You can adopt only Pipelines, only Trainer, or only Notebooks instead of installing the whole platform.

This way you are free to choose only the subprojects that work for your project’s requirements.

For example, in our final MLOps capstone project, we will use Kubeflow Pipelines and Kubeflow Trainer from the Kubeflow stack.

We will perform the data preparation and model training workflow using Kubeflow Pipelines. For the model training step, the pipeline delegates the training job to Kubeflow Trainer, as shown in the image below.

❝

Important Note:

The Feast feature store from the last edition and KServe, which we will cover later, are also part of the Kubeflow ecosystem. So you have already been using Kubeflow without realizing it.

This edition focuses on the orchestration subproject, Kubeflow Pipelines.

What is Kubeflow Pipelines (KFP)?

Kubeflow Pipelines (KFP) is one of the core components of the Kubeflow project. It is a standalone DAG runtime that lets you describe your entire ML workflow in Python.

Here is the interesting part.

Kubeflow Pipelines uses Argo Workflows as its workflow execution engine.

Argo Workflows is a Kubernetes-native DAG orchestration engine. However, with Argo you write the DAG declaratively in YAML (as a Workflow custom resource), not as Python code. Argo then executes each task as a Kubernetes pod, similar to how Airflow's KubernetesPodOperator runs a task in a pod.

Kubeflow uses Argo Workflows as the backend and adds an AI/ML layer on top of Argo Workflows.

So instead of just orchestrating containers, it understands concepts like datasets, model artifacts, experiment tracking, metrics, caching, and reproducible ML pipelines.

When it comes to pipelines, instead of writing Argo Workflow YAML directly, we write them in Python, like we did in Airflow.

Kubeflow then compiles your Python pipeline, converts it into an Argo Workflow, and submits it to Kubernetes for execution.

Kubeflow Pipelines Architecture

Before you start learning about the workflow, you need to understand the components that make up Kubeflow Pipelines.

The following image illustrates the high-level architecture of KFP.

Now let's look at each KFP component and what it does.

ml-pipeline - The API server of Kubeflow Pipelines and the entry point for all requests.
ml-pipeline-ui - Serves the Kubeflow Pipelines UI.
workflow controller - The Argo Workflow controller, which runs pipelines and creates a pod for each task.
mysql - Database that stores pipeline definitions, experiment details, run history, and metadata.
seaweedfs server - Object storage for pipeline artifacts. It plays the same role as Amazon S3 or MinIO.
cache-server - Checks whether a task has already run with the same code and inputs. If nothing has changed, it skips the task and reuses the cached output of the previous run.
cache-deployer - Creates TLS certificates for the cache server webhook.
ML Metadata pods - Track the inputs and outputs of each task.

The following pipeline components are created dynamically for every pipeline run. I explain them in detail in the upcoming section.

system-dag-driver-* - Created for each run. It records the details of the run in ML Metadata (MLMD).
system-container-driver-* - Driver pod that resolves the input data for a task and passes it to the container implementation pod.
system-container-impl-* - Runner pod that executes your actual component code.

Kubeflow Pipelines Structure

Let's understand what a Kubeflow Pipelines DAG looks like with our project example.

The employee attrition model data preparation has multiple steps, like data ingestion, data validation, feature engineering, data preprocessing, etc.

These are mostly sequential tasks, but in real-world ML projects, some tasks run in parallel or are repeated.

The following image shows an example KFP data preparation pipeline.

As you can see, the KFP pipeline uses the @dsl.pipeline decorator to define the entire workflow. Inside this pipeline function, you call each pipeline component (task) and connect them using inputs and outputs.

The interesting part is that you never explicitly write a DAG.

❝

A DAG is a workflow where each step (node) depends on the output of one or more previous steps, with no cycles.

Kubeflow creates the DAG automatically from task dependencies. Whenever one task consumes another task's output, a dependency is created.
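To make this concrete, here is a minimal sketch in plain Python of how a DAG can fall out of data dependencies. It does not use the KFP SDK, and the Output and Task classes are hypothetical stand-ins written only for illustration. The idea it mirrors is that in a KFP pipeline function you pass one task's output as an argument to the next component, and that wiring is what creates the dependency. The step names follow the data preparation steps mentioned above, and the file name employees.csv is an assumption.

```python
class Output:
    """Placeholder for a value that a task produces at run time."""

    def __init__(self, producer):
        self.producer = producer


class Task:
    """A pipeline step; any Output passed as input becomes a dependency."""

    def __init__(self, name, *inputs):
        self.name = name
        self.upstream = [i.producer.name for i in inputs if isinstance(i, Output)]
        self.output = Output(self)


# Hypothetical data preparation steps, wired together only through outputs.
ingest = Task("data_ingestion", "employees.csv")
validate = Task("data_validation", ingest.output)
features = Task("feature_engineering", validate.output)
preprocess = Task("data_preprocessing", features.output)

# The DAG edges fall out of the wiring; nobody declared them explicitly.
edges = [
    (upstream, task.name)
    for task in (ingest, validate, features, preprocess)
    for upstream in task.upstream
]
print(edges)
```

Notice that the first task receives a plain string, so it has no upstream dependency, while every other task depends on whichever task produced its input.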

How Kubeflow Pipelines Work

Now that you understand how a Kubeflow Pipeline is structured, let's see what actually happens when you run it.

As we discussed earlier, a Kubeflow Pipeline is executed as a Directed Acyclic Graph (DAG). Each task depends on the outputs of one or more previous tasks. If there is no dependency between two tasks, Kubeflow runs them in parallel.

Also, every pipeline task runs in its own Kubernetes pod. Instead of running everything in a single process like a Python script, Kubeflow schedules each step as an independent containerized workload.
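The scheduling rule above (run a task as soon as everything it depends on has finished) can be sketched with Python's standard graphlib module. This is a toy model, not how Argo or KFP is implemented. The dependency map is hypothetical, and data_profiling is an invented task added only to show two steps that do not depend on each other.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: task -> set of tasks it depends on.
deps = {
    "data_ingestion": set(),
    "data_validation": {"data_ingestion"},
    "feature_engineering": {"data_validation"},
    "data_profiling": {"data_validation"},  # independent of feature_engineering
    "data_preprocessing": {"feature_engineering"},
}


def execution_waves(deps):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every dependency is already done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(deps)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

In this sketch, data_profiling and feature_engineering land in the same wave because neither consumes the other's output. In a real cluster, each task in a wave would be its own pod.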

Pipeline Execution Flow

When you submit the pipeline, the KFP SDK compiles the Python pipeline definition into an Intermediate Representation (IR) YAML. The IR YAML describes the entire workflow, including its tasks, inputs, outputs, and dependencies.

The compiled IR YAML is then submitted to the Kubeflow Pipelines API server, which records the run in MySQL and creates an Argo Workflow custom resource.

The Argo Workflow controller then picks up the Workflow resource and creates the pods that run each task.

The following illustration shows what happens behind the scenes.

A system-dag-driver pod initializes the pipeline run and creates the execution context in ML Metadata (MLMD).
For each pipeline task, a system-container-driver pod checks for cached results. If a cache hit is found, the task is skipped.
If there is no cache, the container-driver gathers the required inputs and generates the executor_input JSON for the task.
A system-container-impl pod is then created to run your actual component code.
When the task finishes, Kubeflow stores the parameters, artifacts, metrics, and lineage in ML Metadata (MLMD) for experiment tracking and future caching.

❝

Lineage in ML is simply the history of how a model, dataset, or artifact was created. It lets you later answer questions like "Which dataset produced Model v5?"
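As a rough illustration of a lineage query, the sketch below walks a small, hand-written lineage table backwards from a model to the data it came from. This is not the MLMD API. The table layout, the artifact names, and the ancestors helper are all hypothetical.

```python
# Hypothetical lineage records: artifact -> (producing task, input artifacts).
lineage = {
    "dataset_v3": ("data_ingestion", []),
    "features_v3": ("feature_engineering", ["dataset_v3"]),
    "model_v5": ("model_training", ["features_v3"]),
}


def ancestors(artifact, lineage):
    """Return every upstream artifact of `artifact`, nearest first."""
    found = []
    queue = list(lineage[artifact][1])
    while queue:
        current = queue.pop(0)
        if current not in found:
            found.append(current)
            queue.extend(lineage[current][1])  # keep walking upstream
    return found


upstream = ancestors("model_v5", lineage)
print(upstream)
```

The last entry in the result is the original dataset, which is the answer to the question in the note above.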

Task-level caching

One of the most useful features of Kubeflow Pipelines is task-level caching.

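Because of caching, if the pipeline fails at any stage, you do not have to start over. On the re-run, KFP reuses the cached results of the steps that already succeeded with the same inputs and skips them, so only the failed and remaining steps are executed. This saves time and avoids unnecessary re-execution.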
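The sketch below is a toy model of task-level caching in plain Python. It is not KFP's actual cache implementation: the cache_key and run_task helpers, the "v1" code version label, and the in-memory dictionary are assumptions made for illustration. It shows the behavior described above: a step that fails is not cached, and on the re-run the steps that already succeeded are served from the cache.

```python
import hashlib
import json

cache = {}     # cache key -> stored output (stands in for the real cache)
executed = []  # names of tasks that actually ran, for illustration


def cache_key(name, code_version, inputs):
    """Fingerprint a task by its name, code version, and inputs."""
    payload = json.dumps(
        {"task": name, "code": code_version, "inputs": inputs}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def run_task(name, code_version, inputs, fn):
    key = cache_key(name, code_version, inputs)
    if key in cache:
        return cache[key]  # cache hit: skip execution
    result = fn(inputs)    # cache miss: run the task (may raise)
    executed.append(name)
    cache[key] = result    # only successful results are cached
    return result


def pipeline(fail_preprocessing):
    raw = run_task("data_ingestion", "v1", {"source": "employees.csv"},
                   lambda i: [3, 1, 2])
    clean = run_task("data_validation", "v1", {"rows": raw},
                     lambda i: sorted(i["rows"]))

    def preprocess(i):
        if fail_preprocessing:
            raise RuntimeError("simulated failure")
        return [x * 10 for x in i["rows"]]

    return run_task("data_preprocessing", "v1", {"rows": clean}, preprocess)


try:
    pipeline(fail_preprocessing=True)  # first run fails at the last step
except RuntimeError:
    pass
first_run = list(executed)

executed.clear()
result = pipeline(fail_preprocessing=False)  # re-run after the fix
print(first_run, executed, result)
```

On the re-run, only data_preprocessing executes. If you changed the inputs or the code version of an earlier step, its key would change and it would run again, along with any step that consumes its new output.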

Triggering the Pipeline

At this point, you might be wondering, how do we actually trigger a pipeline?

During development, it's usually very straightforward. You trigger the pipeline from your local machine using the Kubeflow Pipelines SDK. We will do exactly that in the next hands-on section.

In a production environment, however, you rarely run the pipelines manually. Pipelines are usually triggered automatically in one of three ways:

On a schedule, such as running every night or every week.
On an event, for example when new training data is uploaded to an S3 bucket.
From a CI/CD pipeline, where a CI/CD job calls the Kubeflow Pipelines API to trigger the workflow automatically after code is merged into the main branch.
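The event-based option can be sketched as a small handler that decides whether an upload should start a run. Everything here is hypothetical: the event shape, the bucket and pipeline names, and the dataset_uri argument. The actual submission is passed in as a callable so the sketch runs without a cluster. In a real setup that callable would wrap a call to the Kubeflow Pipelines API.

```python
def on_new_training_data(event, submit_run):
    """Submit a pipeline run when a new CSV file is uploaded.

    `event` is a hypothetical dict with "bucket" and "key" entries.
    `submit_run` is any callable that starts a pipeline run.
    """
    key = event["key"]
    if not key.endswith(".csv"):
        return None  # ignore uploads that are not training data
    return submit_run(
        pipeline="data-preparation",
        arguments={"dataset_uri": f"s3://{event['bucket']}/{key}"},
    )


# Stand-in for the real submission call, so the sketch runs anywhere.
submitted = []


def fake_submit(pipeline, arguments):
    submitted.append((pipeline, arguments))
    return f"run-{len(submitted)}"


run_id = on_new_training_data(
    {"bucket": "training-data", "key": "2026/employees.csv"}, fake_submit
)
ignored = on_new_training_data(
    {"bucket": "training-data", "key": "notes.txt"}, fake_submit
)
print(run_id, ignored, submitted)
```

Keeping the filter logic separate from the submission call also makes the trigger easy to test before it is connected to a real bucket notification.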

Kubeflow Pipelines Setup (Must Do Exercise)

Now it's time to put everything you have learned into practice. I highly recommend completing this hands-on exercise to reinforce the concepts from this guide.

In this hands-on guide, you will learn how to:

Set up Kubeflow Pipelines on Kubernetes
Access the Kubeflow Pipelines UI
Trigger a Kubeflow Pipelines run using the KFP SDK
and more.

👉 Follow this Detailed Guide: Set Up Kubeflow Pipelines

That’s a Wrap!

Before we wrap up, one final thought.

If your goal is to move into AI infrastructure, MLOps, or LLMOps, I think Kubeflow is one of the most useful platforms you can learn today.

The interesting part is that even if your future employer doesn't use Kubeflow, the knowledge you gain won't go to waste. By learning Kubeflow, you will understand how production AI platforms are designed and operated.

You will learn concepts like AI workflow orchestration, model training pipelines, distributed GPU workloads, model serving, experiment tracking, etc.

What's Coming Next?

So far, we have looked into the orchestration layer using Kubeflow Pipelines.

However, in production, model training is typically handled by Kubeflow Trainer because many models require specialized hardware such as GPUs or TPUs.

In the next edition, we'll see how Kubeflow Pipelines integrates with Kubeflow Trainer to offload model training and run distributed training workloads on Kubernetes.

🤔 Airflow Vs Kubeflow Pipelines

This is probably the first question many DevOps engineers ask.

If we already have Airflow, why do we need another workflow orchestration tool?

The answer is simple.

Airflow was designed to orchestrate general-purpose workflows such as ETL pipelines, data engineering jobs, and scheduled tasks. It works really well for those use cases.

Kubeflow Pipelines, on the other hand, was built specifically for machine learning workflows. It understands ML concepts out of the box, such as datasets, model artifacts, metrics, caching, and lineage, and provides capabilities that are essential when building and operating production ML pipelines.

🧱 Cloudflare MLOps Platform

If you are wondering whether companies are actually investing in Kubernetes-based AI platforms, here's a real-world example.

Cloudflare shared how they're building their internal MLOps platform for data scientists and AI engineers. The platform is built on Kubernetes, follows a GitOps approach, and includes open-source tools such as Kubeflow, deployKF, Airflow, and MLflow.

👉 Read it Here: Inside Cloudflare's MLOps Platform
