👋 Hi! I’m Bibin Wilson. In each edition, I share practical tips, guides, and the latest trends in DevOps and MLOps to make your day-to-day DevOps tasks more efficient. If someone forwarded this email to you, you can subscribe here to never miss out!

✉️ In Today’s MLOps Edition

  • Why the dataset changes

  • What data drift is and how it affects model predictions

  • What model decay is

  • What data versioning is

  • Tools for data versioning

The goal of this edition is to cover the fundamentals that set the stage for the Phase 2 pipeline we are going to build.

✉️ Missed the Phase 1 Editions?

Phase 1 covered the complete ML workflow on a local machine, starting from raw data and ending with a deployed model on Kubernetes.

In Phase 2, we take the same employee attrition project and rebuild it the way production ML systems are actually designed. If you are new to this series, I recommend reading the following Phase 1 editions first.

In Phase 1, the dataset for the employee attrition model was a static CSV file sitting on your local machine. You loaded it once, trained the model, and deployed it.

That works fine for learning. In a real organization with 500,000 employees, the dataset is not static; the data changes every month.

That is the first problem Phase 2 addresses. Before we look into enterprise-grade model training pipelines, we need to understand a basic concept: how do we manage the dataset over time?

Why data keeps changing

So why does the data keep changing every month?

The reason is simple. Some employees leave the organization, new employees join, and the data changes accordingly. The following table shows examples of what changes every month in a real organization.

What changes | Example | Effect on the dataset
New employees join | 150 new hires onboard in Q2 | 150 new rows added to the dataset
Performance reviews are completed | Annual review cycle finishes in March | Performance score columns updated for existing rows
Employees leave the company | Attrition happens, exit interviews recorded | The label (stayed vs. left) gets added for those rows
New features are added | Team starts tracking remote vs. office status | New column appears in the dataset
Historical data corrections | Payroll system bug found, salary data patched | Existing row values change retroactively

Scheduled ETL Pipelines

When the data changes, we need to update the dataset used by the ML model as well.

Since the data keeps changing, the data engineering team runs the ETL pipeline on a schedule (for example, monthly) and creates a fresh dataset during each run. Because the ETL pipeline processes different data each time, the output dataset will be slightly different from the previous run.
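One common convention for keeping each run's output distinct is to write every monthly dataset to a date-stamped object key instead of overwriting a fixed path. The bucket name and layout below are hypothetical, just to illustrate the idea:

```python
from datetime import date

def dataset_key(run_date: date, bucket: str = "ml-datasets") -> str:
    # Hypothetical naming scheme: one immutable object per monthly ETL run,
    # so the March dataset never overwrites the February one.
    return f"s3://{bucket}/attrition/{run_date:%Y-%m}/employees.csv"

print(dataset_key(date(2024, 3, 1)))
# s3://ml-datasets/attrition/2024-03/employees.csv
```

With a scheme like this, each scheduled ETL run produces a new, separately addressable dataset rather than mutating a single file in place.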

The following image illustrates the Airflow-based dataset pipeline we discussed in the first edition.

The Data Drift Problem

Every month the ETL pipeline produces a new dataset. This means the data the model was trained on in January is different from the data available in July.

In ML, this is called data drift. It means the properties of the data change over time.

Here is the problem with data drift. The model trained in January has never seen the new data. It does not know what patterns to expect from the updated employee data.

For example, salary ranges may change, new roles may appear, or employee behavior patterns may shift. These changes make the new data different from the data used during training.
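As a rough illustration of the idea (not a production drift test, and the salary numbers are made up), you can compare a feature's summary statistics between the training snapshot and the current data:

```python
import statistics

def mean_shift(reference: list[float], current: list[float]) -> float:
    """Relative shift in a feature's mean between two dataset snapshots."""
    ref_mean = statistics.mean(reference)
    return abs(statistics.mean(current) - ref_mean) / abs(ref_mean)

# "salary" feature: January training data vs. July data (made-up values)
jan_salaries = [52_000, 60_000, 58_000, 61_000, 55_000]
jul_salaries = [63_000, 71_000, 69_000, 74_000, 66_000]

shift = mean_shift(jan_salaries, jul_salaries)
if shift > 0.10:  # hypothetical 10% threshold
    print(f"Possible data drift: salary mean shifted by {shift:.0%}")
```

Real drift detection relies on distribution-level tests (for example, population stability index or Kolmogorov-Smirnov), but the principle is the same: compare the current data against a reference baseline.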

So when you ask the model to predict attrition risk for current employees, the predictions may be less reliable. Notice that the model did not break and the code did not change. The only thing that changed is the data.

Model Decay

Over time, data drift leads to model decay. This means the model's prediction accuracy slowly decreases because it was trained on older data.

This is why production ML systems regularly retrain models using updated datasets and track different dataset versions used for each model training run.

Model decay is a very important concept in machine learning, especially in production systems.

As a DevOps engineer, you are not responsible for detecting drift in model predictions. That is the data scientist's job.

What you are responsible for is building the infrastructure that makes it possible to act on data drift when it is detected.

How Is Model Decay Detected?

Model decay is usually detected by comparing the model's predictions with the actual outcomes.

In our employee attrition example, the model predicts whether an employee will stay or leave. These predictions are logged and stored when the model serves them.
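The logging itself can be as simple as appending each served prediction to a store that can later be joined with the real outcome. The field names and JSONL file below are assumptions for illustration, not part of this series' pipeline:

```python
import json
import time

def log_prediction(employee_id: str, prediction: str, score: float,
                   path: str = "predictions.jsonl") -> None:
    """Append one served prediction so it can be joined with the actual
    outcome once the HR system reports it."""
    record = {
        "logged_at": time.time(),
        "employee_id": employee_id,
        "prediction": prediction,   # "stay" or "leave"
        "score": score,             # model's attrition probability
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_prediction("emp-1042", "leave", 0.83)
```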

After some time, the actual outcome becomes available. This tells us which employees actually stayed and which employees left the company.

So where does the “actual outcome” come from?

In real systems, the actual data comes later from the business system. In our employee attrition example, the source of truth is usually the HR system that keeps track of employee records and resignation events.

Once this data becomes available, the model's predictions are compared with the real outcomes.

Using this comparison, the team calculates metrics such as accuracy, precision, and recall (metrics that we covered in the model training edition).
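With predictions and outcomes side by side, the comparison is straightforward. Here is a minimal sketch with made-up labels (a production system would use a library such as scikit-learn to compute the full set of metrics):

```python
def accuracy(predictions: list[str], outcomes: list[str]) -> float:
    """Fraction of logged predictions that matched the actual outcome."""
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    return correct / len(outcomes)

# Logged predictions vs. outcomes that arrived later from the HR system
predicted = ["stay", "leave", "stay", "stay", "leave", "stay", "stay", "leave"]
actual    = ["stay", "leave", "stay", "leave", "leave", "stay", "leave", "stay"]

print(f"accuracy: {accuracy(predicted, actual):.0%}")  # tracked month over month
```

Computing this number on each month's batch of matured predictions is what produces a trend like the table below.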

For example,

Month | Model accuracy | Status
January | 88% | Healthy
February | 86% | Healthy
March | 81% | Degrading
April | 74% | Retrain needed

If these metrics drop over time, it indicates that the model is decaying. This steady decline is a signal that the model needs retraining with newer data.

So are there tools to measure model decay? Yes.

There are dedicated tools that help teams track model performance over time and detect decay automatically.

For example, Evidently AI is an open source tool that generates reports on data drift and model performance degradation. It compares current predictions against a reference baseline.

Important Note:

Model monitoring is a full topic on its own. In a dedicated edition later in this series, we will set up Evidently AI alongside our attrition model and build a Grafana dashboard that tracks accuracy, data drift, and prediction scores.

For now, the idea is for you to know these concepts and tools exist.

What is Data Versioning?

Data versioning is the practice of tracking every version of a dataset over time, similar to how container images are versioned in a registry.

When you deploy a new application version, the old image is still in your registry. You can roll back to it at any time. Data versioning works the same way for datasets. Every version is stored, tagged, and retrievable.

Each version of the dataset gets a unique identifier. Every model training run records which dataset version it used. This creates a complete link between the model in production and the exact data that trained it.
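One simple way to derive such an identifier is to hash the dataset's contents, which is essentially the idea behind how DVC tracks files (the helper below is an illustrative sketch, not DVC's actual implementation):

```python
import hashlib
import tempfile

def dataset_version(path: str) -> str:
    """Content hash of a dataset file, usable as an immutable version id.
    Identical contents always hash to the same id; any change to the
    data produces a new one."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files don't need to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a tiny CSV written to a temporary file
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"employee_id,salary\n1042,68000\n")
    demo_path = tmp.name

print(dataset_version(demo_path))
```

A training run can then record this id alongside the model artifact, giving the exact data-to-model link described above.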

What if data versioning is not there?

Well, every time the ETL pipeline runs, the new dataset overwrites the old one in S3, and the previous version is gone. With data versioning, every version is preserved and every training run is traceable.

Tool Used for Data Versioning

So how do we actually version the data?

As you know, you cannot just put a 300 MB (or larger) CSV file into Git for tracking. Git is not built for large binary files. Data versioning tools solve this.

DVC (Data Version Control) is the most widely used open source tool for data versioning. It is designed to work alongside Git.

With DVC, the tracking happens in Git, while the actual storage is cloud object storage such as S3, Azure Blob, or GCS.

The following image illustrates how DVC works with Git and S3. It is just to give you a mental model; we will look at it in detail in the next edition.

DVC is not the only option for data versioning. There is also a tool called LakeFS.

For our attrition project, we will use DVC with S3. It is the simpler option to start with, and it covers everything we need for this series.

What's coming next?

In the next edition, we will do a hands-on setup of DVC with S3.

We will deploy Airflow on Kubernetes and walk through an example DAG that uses DVC to version the employee attrition dataset in S3.

This will be the first step in setting up the ML pipeline we are building in this series using Kubeflow + MLflow.
