👋 Hi! I’m Bibin Wilson. Every Saturday, I publish a deep-dive MLOps edition to help you upskill step by step. If someone forwarded this email to you, you can subscribe here so you never miss an update.

✉️ In Today’s MLOps Edition

  • What DVC is and how it solves the large-file problem Git cannot handle

  • How DVC works under the hood (pointer files, cache, and remote storage)

  • Where DVC runs: local machine vs. automated Airflow pipelines

  • Hands-on: versioning employee_attrition.csv with DVC and pushing it to S3

🎓 FinOps for AI (Featured)

AI workloads are expensive and unpredictable. Managing GPU usage, token costs, and model training spend is now a core engineering challenge.

The FinOps for AI certification helps you understand, manage, and optimize AI costs using the FinOps framework.

Use code COMTECHIES_20 to get 20% off today.

📦 MLOps Code Repository

All hands-on code for the entire MLOps series is being pushed to the DevOps to MLOps GitHub repository. We will refer to the code in this repository in every edition.

In the last edition, we covered three concepts that set the foundation for Phase 2: data drift, model decay, and data versioning.

The core problem was this: the employee_attrition.csv that the ETL pipeline produces in January is different from the one it produces in July.

The solution is data versioning. Every version of the dataset should be tracked, stored, and traceable back to the model that used it. Today we build that (hands-on, with DVC and S3).

What is DVC?

DVC (Data Version Control) is an open-source CLI tool that works alongside version control tools like Git to handle data. You can call it "Git for data".

Why can't we use Git alone for this? Git is built for small text files: a 2GB training dataset or a 500MB model can't live in a Git repository. This is where DVC comes in.

DVC provides Git-like version control for data, models, and large files without storing the actual files in Git. It stores lightweight pointer files (.dvc files) in Git, while the actual data resides in remote storage (e.g., Amazon S3).

Simply put, it is the bridge between your Git repo and the storage where your data resides: Git tracks your code and the .dvc pointer files, your actual data lives in remote storage like Amazon S3, and DVC manages the sync between the two.

The following image illustrates how DVC fits in with your local workstation, GitHub, and remote storage.

How DVC Works in a Real MLOps Pipeline

One common question that comes up when working with DVC is:

Where Does DVC Actually Run?

In a real MLOps setup, DVC runs in two key places, each serving a different purpose. Let's look at them.

1. Inside a Workflow System Like Airflow

In a real enterprise setup, the Airflow ETL pipeline produces the final employee_attrition.csv every month as new HR data comes in.

When it finishes producing employee_attrition.csv, the very next task in the Airflow DAG runs dvc add and dvc push. No human involved. The data gets versioned automatically every single run.
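As a sketch of what that task does, the logic is just two CLI calls run in sequence. The function names and dataset path below are illustrative, and the sketch assumes the DVC CLI and AWS credentials are available on the worker (the actual DAG comes in the next edition):

```python
# Hypothetical sketch of the post-ETL versioning task.
# Assumes the dvc CLI and AWS credentials exist on the Airflow worker.
import subprocess

def dvc_version_commands(dataset_path: str) -> list:
    """The two commands the post-ETL task runs, in order."""
    return [
        ["dvc", "add", dataset_path],
        ["dvc", "push"],
    ]

def version_dataset(dataset_path: str) -> None:
    # check=True makes any DVC failure fail the Airflow task as well,
    # so a broken push never goes unnoticed.
    for cmd in dvc_version_commands(dataset_path):
        subprocess.run(cmd, check=True)
```

The important design point is the `check=True`: if `dvc add` or `dvc push` fails, the task fails, and Airflow's retry/alerting machinery takes over.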

The following image illustrates the high level workflow.

📌 Important Note

As a DevOps engineer, DVC is your territory. You are responsible for making sure the S3 remote exists and has the right permissions, and that dvc push/pull works inside tools like Airflow.

For example, the Airflow worker needs the DVC CLI installed and AWS credentials available (via an IAM role or a mounted secret). Provisioning that is your job.

Also, if you are the one building the Airflow DAG that orchestrates the pipeline, the Airflow-DVC integration will be your design responsibility (this depends on the project you are working on).

2. On a Data Scientist's Local Machine

Data scientists work with versioned data; they do not create it. The ETL pipeline owns and versions the datasets. Data scientists simply consume the data for training and experimentation.

So in practice, a data scientist clones the repository and runs dvc pull. This fetches the exact version of employee_attrition.csv that matches the current codebase from storage like Amazon S3.

This ensures that code and data are always in sync. If they need an older dataset for comparison or tuning, they can simply check out a previous commit with git checkout and run dvc pull again.

Versioning Dataset with DVC (Hands-on)

Now let's get hands-on by versioning the employee_attrition.csv dataset with DVC.

In a real setup, this file would be managed by DVC inside an Apache Airflow worker. The ETL pipeline produces the CSV, and the DAG automatically runs dvc add and dvc push. We will cover that in the next edition.

For learning purposes, we will push the dataset version that we have in the repo to Amazon S3 using a local DVC setup. This helps you clearly understand what DVC is doing under the hood.

Note: Once you understand this flow, integrating it into an Apache Airflow DAG in the next edition becomes straightforward.

Step 1: Create the S3 Bucket

Before running any DVC commands, you need an AWS S3 bucket to act as the DVC remote. This is where the actual employee_attrition.csv file will be stored.

We will be using the same bucket to implement the entire MLOps workflow.

Let's get started with the hands-on setup.

Step 2: Pull the Latest Changes

Pull the latest changes to get the latest updates from the mlops-for-devops repo.

$ cd mlops-for-devops

$ git pull origin main

💡 Note on Working Directory

We will execute all commands from the repository's root directory, i.e., mlops-for-devops. Every path in the steps below is relative to this directory.

Step 3: Install DVC Using Pip

Create a virtual environment named dvc-env and install DVC and its S3 support using the following commands.

$ python3 -m venv dvc-env

$ source dvc-env/bin/activate

$ pip install dvc dvc-s3

Verify the version.

$ dvc --version

Step 4: Initialize DVC in the Git Repo

We are initialising DVC at the root of the mlops-for-devops repo. This ensures DVC manages the entire project. While it is technically possible to initialize DVC in a subdirectory, it is not recommended for standard MLOps workflows.

$ dvc init

Initialization creates a .dvc folder in the repository, along with the following configuration and ignore files.

  • .dvc/config is the main DVC configuration file.

  • .dvc/.gitignore ignores DVC cache from Git.

Step 5: Add the S3 Remote

Now, we need to configure the S3 bucket as DVC's remote storage.

Note: Replace the bucket name with yours in the below command before running it.

The following command tells DVC to store all versioned data in the given S3 location. ml-dataset is just a name you give to the remote (like a label). -d sets it as the default remote.

$ dvc remote add -d ml-dataset s3://dcube-attrition-data/datasets

After executing the command, if you open the .dvc/config file, you will see the configuration shown below.

[core]
    remote = ml-dataset
['remote "ml-dataset"']
    url = s3://dcube-attrition-data/datasets

Step 6: Stop Tracking Dataset in Git

In Phase 1, employee_attrition.csv was committed directly to Git. Before DVC can manage it, we need to remove it from Git's tracking.

💡 Key Insight

Git and DVC cannot both track the same file. Git tracks the .dvc pointer file. DVC tracks the actual data.

$ git rm -r --cached phase-1-local-dev/datasets/employee_attrition.csv

Here, we are not deleting the dataset. We are only telling Git to stop tracking it.

Step 7: Add Dataset to DVC Tracking

Our actual dataset that needs to be versioned is present at phase-1-local-dev/datasets/employee_attrition.csv. We need to tell DVC to track the dataset using the following command.

$ dvc add phase-1-local-dev/datasets/employee_attrition.csv

Now, DVC starts tracking the dataset instead of Git.

Under the hood, dvc add hashes the file, moves it into the local cache (.dvc/cache), creates the pointer file phase-1-local-dev/datasets/employee_attrition.csv.dvc, and adds the CSV to a .gitignore in the same directory so Git never picks it up again.
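For the curious, the generated pointer is a tiny YAML file. With DVC 3.x it looks roughly like this (the md5 here matches the object path you will see in S3 in Step 8; the size value is illustrative, and older DVC versions may omit the hash field):

```yaml
outs:
- md5: 8f28b4894c8d5aac17cc23e68127a768
  size: 231804
  hash: md5
  path: employee_attrition.csv
```

This small file is what Git tracks instead of the CSV itself.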

The following image shows the DVC + our MLOps project structure and how responsibilities are split between Git and DVC.

Step 8: Push Dataset to S3

Now run the following command to push the actual CSV data file into the configured S3 bucket.

$ dvc push

And now, if you check your configured S3 bucket, you can see the following structure.

s3://your-bucket/
  └── files/
      └── md5/
          └── 8f/
              └── 28b4894c8d5aac17cc23e68127a768

You can see the same structure in the local cache directory (.dvc/cache) as well.
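The layout above is content-addressed: the object's path is derived entirely from the file's MD5 checksum. A minimal sketch of that mapping in plain Python (no DVC required; the sample bytes are a toy stand-in for the real CSV):

```python
import hashlib
from pathlib import PurePosixPath

def dvc_object_path(data: bytes) -> PurePosixPath:
    """Map file contents to DVC's layout: files/md5/<first 2 hex chars>/<rest>."""
    digest = hashlib.md5(data).hexdigest()
    return PurePosixPath("files") / "md5" / digest[:2] / digest[2:]

# Toy stand-in for employee_attrition.csv
csv_bytes = b"employee_id,age,attrition\n1001,34,No\n"
print(dvc_object_path(csv_bytes))
```

Because the path depends only on the content, the same bytes always land in the same place, in both the local cache and S3.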

💡 Key Insight

When you push, DVC does not blindly re-upload files. It computes a checksum of every tracked file and only uploads objects that are genuinely new; re-pushing an unchanged dataset transfers nothing.

Note that this deduplication works at the file level. If you add 500 new employee records to the CSV, its checksum changes, so the whole new version is stored in S3 as a new object alongside the old one. The previous version is never overwritten, which is what keeps every version retrievable.
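This content-addressed behaviour can be demonstrated in a few lines of plain Python (no DVC required; the dict below stands in for the S3 object store):

```python
import hashlib

store = {}  # stand-in for the S3 object store, keyed by checksum

def push(data: bytes) -> bool:
    """Store data under its MD5 key; return True only if a new object was written."""
    key = hashlib.md5(data).hexdigest()
    if key in store:
        return False  # already present, nothing uploaded
    store[key] = data
    return True

v1 = b"id,age\n1,34\n"
print(push(v1))      # True  -- first version stored
print(push(v1))      # False -- unchanged, deduplicated
v2 = v1 + b"2,29\n"  # append a row
print(push(v2))      # True  -- whole new version stored as a new object
```

Both versions now sit in the store under different keys, which is exactly how old dataset versions stay retrievable in S3.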

Step 9: Commit the changes to Git

After pushing the data to S3, the next step is to commit the changes to Git and push them.

This is the most important step in the workflow. Without committing, the dataset version is not recorded in Git, which means DVC will not know which data version belongs to which code version.

$ git add .
$ git commit -m "Track dataset with DVC"
$ git push origin main

This commits the .dvc file created in Step 7, saving the dataset version metadata in Git and linking the codebase to a specific data version.

Now, anyone can run git checkout plus dvc pull and reproduce the exact setup in the future: the same project state (code, data, and configuration), so the experiment can be rerun if needed.
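This round trip works because the committed pointer is all DVC needs to locate the object in the remote. A minimal sketch of that lookup in plain Python, assuming the files/md5/ layout from Step 8 (the naive line-by-line parser here is for illustration only; real code would use a YAML parser):

```python
def object_key(dvc_pointer_text: str) -> str:
    """Extract the md5 from a .dvc pointer and map it to the remote object key."""
    for raw in dvc_pointer_text.splitlines():
        line = raw.strip().lstrip("- ")
        if line.startswith("md5:"):
            digest = line.split(":", 1)[1].strip()
            return f"files/md5/{digest[:2]}/{digest[2:]}"
    raise ValueError("no md5 field found in pointer file")

pointer = """\
outs:
- md5: 8f28b4894c8d5aac17cc23e68127a768
  path: employee_attrition.csv
"""
print(object_key(pointer))  # files/md5/8f/28b4894c8d5aac17cc23e68127a768
```

Git checkout restores the pointer for that commit; dvc pull then fetches exactly the object the pointer names.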

Important Note: Git does not store the actual dataset. It stores only the pointer file created by DVC.

That's a Wrap!

Now that our dataset is versioned and sitting in S3, the next step is to automate the whole thing so no human has to run dvc push manually.

In the next edition, we will wire the same dvc add and dvc push commands into an Airflow DAG running on Kubernetes.

Every time the ETL pipeline finishes producing a fresh employee_attrition.csv, the data will be versioned and pushed to S3 automatically.
