Managing data, models, and experiments in a structured and scalable way poses major challenges for enterprises, organizations, startups, and individual data scientists. Across the industry, antipatterns such as manually managed datasets or experiment tracking in Excel spreadsheets are widespread. Data Version Control, or DVC for short, addresses these issues and focuses on versioning AI projects. This article shows developers and data scientists the basic functionality of DVC. For this purpose, different models are trained and evaluated on the Kaggle churn prediction dataset. Churn prediction is a predictive analysis that identifies customers at risk of leaving. As a result, it enables targeted and effective churn prevention measures to be taken.
Using Git & DVC for sustainable experimentation
In classic software projects, tools like Git for code management are part of the standard repertoire. In AI projects, data and model management play an essential role alongside code management. All steps of an experiment should be reproducible to avoid errors and to move models from development to production faster; this also makes collaboration within data science teams more efficient. Since Git is not suitable for storing large amounts of data, another solution is needed. This is where DVC comes in: it extends the strengths of Git to track data, models, and experiments. DVC is an open-source Python library and command-line tool.
The full code of the example is in the GitHub repository naxty/dvc-churn-prediction. The author downloaded the dataset directly from Kaggle and saved it in the data directory under the name raw.csv. Furthermore, the scripts prepare.py, preprocess.py, and evaluate.py were created in the src/ directory to process the raw data and evaluate models.
In this article, the steps from data preparation to model development and evaluation are followed with DVC. For this purpose, DVC is first initialized and set up. Basic concepts such as data pipelines and stages are explained and applied to splitting the data into training and test sets and to feature engineering. Two experiments are created, using XGBoost and CatBoost respectively to develop models based on the same features. These are then versioned as Git branches and compared using DVC.
# Overview of the repository naxty/dvc-churn-prediction
├── assets              # Scripts to follow along and run the project end to end
├── data                # Data folder
│   └── raw.csv
├── models              # Model folder
├── params.yaml         # Params file for hyperparameters
├── plots               # Plots folder
├── requirements.txt    # Python packages
└── src                 # Source code
    ├── prepare.py      # Splits raw data into training and test data
    ├── preprocess.py   # Feature engineering
    └── evaluate.py     # Evaluation
Initialization of DVC
DVC goes hand in hand with Git and can therefore only be used in an existing Git project. The Python package is installed via the package manager pip with the command pip install dvc. Afterwards, DVC is initialized via the command dvc init. This command creates the directory .dvc/ and adds several files to the staging area. In the interplay of Git and DVC, Git remains responsible for code management, while DVC takes care of the data, models, and experiments. These are versioned with Git via metafiles. DVC prevents the actual data from ending up in the Git repository by generating .gitignore files when the corresponding commands are executed.
git init
pip install dvc
dvc init
git status
# Changes to be committed:
# new file: .dvc/.gitignore
# new file: .dvc/config
# new file: .dvcignore
git commit -m "DVC Init"
The dvc add <file> command adds files to the repository. It creates a .gitignore file and a <file>.dvc metafile. To detect a potential change to the file, an MD5 hash is calculated, and the file name itself is added to the .gitignore. Through this workflow, developers store only the metadata in the repository; the actual data remains on the local machine or in a remote store.
dvc add data/raw.csv
# After running `dvc add`, DVC automatically
# creates the .gitignore and .dvc files, which need
# to be added to the repository
git add data/.gitignore data/raw.csv.dvc
cat data/raw.csv.dvc
# outs:
# - md5: 7321aa9a1e24a9e139f42124f736b415
# size: 684858
# path: raw.csv
DVC supports storing data in a remote store. This works in a similar way to Git: data is stored centrally, and developers can easily upload and download it. A wide range of remote storage options is available (e.g., Amazon S3, Google Cloud Storage, and HDFS). As a counterpart to Git, the commands dvc push and dvc pull initiate the data transfer.
dvc remote add -d myremote s3://XXXXX/churn-prediction-assets
git commit .dvc/config -m "Add remote storage"
dvc push
Data Pipeline
Besides storing the raw data, there is also the need to store the outputs of data processing, data transformations, training, and evaluation. All of these steps are common to virtually every machine learning project and usually produce some output. In DVC, these inputs and outputs are bundled through the concept of data pipelines.
Data pipelines are a set of language-independent programs, such as Python or shell scripts, that produce outputs from dependencies such as input files.
The command dvc run <script> creates a stage. A stage forms the union of code, dependencies, input, and output data, so all modifiable artifacts are versioned for reproducibility. Developers specify them with the following flags:
-n: Stage name
-d: Dependencies on source code, files, or directories. Typically, the script to be executed is listed among the dependencies so that the stage is re-executed when the source code changes.
-o: Outputs (file or directory)
-m: Metrics; a JSON or YAML file to track metrics
--plots: Charts; tabular (CSV, TSV) or hierarchical (JSON, YAML) data
-p: Parameters; hyperparameters such as the seed for the random variables
dvc run -n prepare \
-d src/prepare.py -d data/raw.csv \
-o data/train.csv -o data/test.csv \
-p seed \
python src/prepare.py
git add dvc.yaml data/.gitignore dvc.lock
git commit -m "Execute prepare stage"
Executing the command dvc run ... writes all specified parameters and flags to a dvc.lock file, together with a hash for each of them. If the hashes are unchanged, the stage is not executed again and the computational overhead is saved. Furthermore, DVC writes the otherwise unwieldy command in a readable form to the YAML file dvc.yaml.
# dvc.yaml after running the prepare stage with dvc run
stages:
prepare:
cmd: python src/prepare.py
deps:
- data/raw.csv
- src/prepare.py
params:
- seed
outs:
- data/test.csv
- data/train.csv
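For orientation, here is a minimal sketch of what src/prepare.py could look like. It is not the exact script from the repository; in particular, the params.yaml layout (a top-level seed and a prepare.test_size entry) and the column handling are assumptions derived from the stage definition above.
# src/prepare.py -- illustrative sketch, not the exact script from the repository
import pandas as pd
import yaml
from sklearn.model_selection import train_test_split

# Hyperparameters are read from params.yaml so that DVC can track them via -p
with open("params.yaml") as f:
    params = yaml.safe_load(f)

df = pd.read_csv("data/raw.csv")

# Split the raw data into training and test sets with the tracked seed
train, test = train_test_split(
    df,
    test_size=params["prepare"]["test_size"],  # assumed params.yaml layout
    random_state=params["seed"],
)

train.to_csv("data/train.csv", index=False)
test.to_csv("data/test.csv", index=False)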
Experiment tracking
A common bad practice in new data science teams is losing track of the experiments that are performed and their associated metrics. Sometimes these parameters end up in Excel sheets, but that is not reliable and requires a lot of manual effort to maintain. Very often this leads to technical debt and bad habits that hurt the productivity of data science teams in the long term.
DVC has a solution for this as well and provides the dvc exp commands to develop experiments in a sustainable way. Experiments and their metrics can easily be tracked and compared with this subcommand. The initialization is done with dvc exp init -i, in which developers interactively define the paths to data, parameters, source code, models, metrics, and plots.
# Default project structure assumed by `dvc exp init`
├── data/          # Data
├── metrics.json   # Metrics
├── models/        # Models
├── params.yaml    # Required to store hyperparameters like the seed
├── plots/         # Charts/plots
└── src/           # Source code
XGBoost and CatBoost experiments with DVC
In the following, we perform two experiments. Each experiment has the same setup but uses a different training algorithm: the models are developed with XGBoost and CatBoost, each in an independent experiment. Feature engineering and evaluation are done with the same scripts.
First of all, we create the Python script xgboost_train.py in the src/ directory. The script uses the XGBClassifier from the xgboost package. It reads the training data, trains the model, and saves it under the name xgboost.joblib in the models/ directory.
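A minimal sketch of such a training script is shown below. The feature/target handling (here the assumed Exited label column of the Kaggle churn dataset) and the default hyperparameters may differ from the script in the repository.
# src/xgboost_train.py -- illustrative sketch of the training script
import pandas as pd
from joblib import dump
from xgboost import XGBClassifier

# Read the preprocessed training data produced by the preprocess stage
train = pd.read_csv("data/train_processed.csv")

# "Exited" is assumed to be the churn label column after preprocessing
X, y = train.drop(columns=["Exited"]), train["Exited"]

model = XGBClassifier()
model.fit(X, y)

# Persist the model so that DVC can track it as a stage output
dump(model, "models/xgboost.joblib")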
dvc run -n xgboost_training \
-d src/xgboost_train.py -d data/train_processed.csv \
-o models/xgboost.joblib \
python src/xgboost_train.py
git add dvc.yaml dvc.lock
git commit -m "Execute xgboost_training stage"
The evaluation of the experiment is done with the src/evaluate.py script. It calculates various metrics (F1 score, accuracy) and creates a confusion matrix and a feature importance chart. When running the script, we pass the model path as an argument, so it can also be used to evaluate other models with the same API. Saving the metrics in JSON or YAML format allows experiments to be compared with DVC out of the box.
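A possible shape of src/evaluate.py is sketched below. The label column, output paths, and plot details are assumptions and may deviate from the repository; the feature importance chart is omitted here for brevity.
# src/evaluate.py -- illustrative sketch of the evaluation script
import json
import sys

import pandas as pd
from joblib import load
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# The model path is passed as a command-line argument, e.g. models/xgboost.joblib
model = load(sys.argv[1])

test = pd.read_csv("data/test_processed.csv")
X, y = test.drop(columns=["Exited"]), test["Exited"]  # assumed label column
predictions = model.predict(X)

# Metrics are written as JSON so that `dvc metrics show` can pick them up
metrics = {
    "accuracy": float(accuracy_score(y, predictions)),
    "f1": float(f1_score(y, predictions)),
}
with open("metrics.json", "w") as f:
    json.dump(metrics, f)

# Confusion matrix as tabular data in the plots/ directory for `dvc plots`
pd.DataFrame(confusion_matrix(y, predictions)).to_csv(
    "plots/confusion_matrix.csv", index=False
)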
dvc run -n xgboost_evaluation \
-d src/evaluate.py -d data/test_processed.csv \
-m metrics.json --plots plots \
python src/evaluate.py models/xgboost.joblib
dvc metrics show
# Path accuracy f1
# metrics.json 0.82467 0.4207
Managing experiments as Git branches
One of the handy features of DVC is the ability to manage experiments using Git branches. Code, data, models, metrics, and plots are conveniently located in one place, which simplifies reviewing experiments. The command dvc exp branch <experiment> <branch> turns an experiment into a Git branch.
The command dvc exp run -n exp-xgboost runs the full experiment; the -n argument specifies the name of the experiment. During execution, DVC checks which artifacts are already present and creates only those that are missing. Developers evaluate the results with the dvc metrics show and dvc plots show commands. The plots command generates a static web page.
dvc exp run -n exp-xgboost
#...
#Stage 'prepare' didn't change, skipping
#Stage 'preprocess' didn't change, skipping
#...
dvc exp branch exp-xgboost exp-xgboost
git checkout exp-xgboost
git add src/xgboost_train.py
git commit -m "Commit training"
Running the CatBoost experiment
Next, we want to try out CatBoost training. We switch back to the main branch and create the file catboost_train.py in src/. The script uses the CatBoostClassifier from the catboost package. Then we run the training and evaluation with DVC and add the exp-catboost branch to the repository.
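Analogous to the XGBoost script, a sketch of src/catboost_train.py might look like the following (again, column names and hyperparameters are assumptions):
# src/catboost_train.py -- illustrative sketch, analogous to xgboost_train.py
import pandas as pd
from catboost import CatBoostClassifier
from joblib import dump

train = pd.read_csv("data/train_processed.csv")
X, y = train.drop(columns=["Exited"]), train["Exited"]  # assumed label column

# verbose=0 keeps the CatBoost training log quiet
model = CatBoostClassifier(verbose=0)
model.fit(X, y)

dump(model, "models/catboost.joblib")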
dvc run -n catboost_training \
-d src/catboost_train.py -d data/train_processed.csv \
-o models/catboost.joblib \
python src/catboost_train.py
dvc run -n catboost_evaluation \
-d src/evaluate.py -d data/test_processed.csv \
-m metrics.json --plots plots \
python src/evaluate.py models/catboost.joblib
dvc exp run -n exp-catboost
dvc exp branch exp-catboost exp-catboost
git add src/catboost_train.py
git commit -m "Catboost experiment"
Comparing experiments
Versioning the experiments in separate branches makes it very easy to run different experiments simultaneously within the same project without much effort and to keep the development structured. In practice, experiments are always developed iteratively; for example, different algorithms are tried out. In this article, a second experiment is performed using the CatBoost package and given the name exp-catboost. Now, in order to choose the best model, the results need to be compared. The command dvc exp show displays the experiment information across the repository.
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Experiment Created f1 accuracy seed prepare.test_size prepare.columns_to_drop data/raw.csv data/test.csv data/test_processed.csv data/train.csv data/train_processed.csv src/catboost_train.py src/evaluate.py src/prepare.py src/preprocess.py src/xgboost_train.py
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
workspace - - - 1337 0.3 ['Gender', 'Age', 'Surname'] 7321aa9 3ff50bf - 26aa741 - - - 45d3824 c03f208 -
main 01:03 AM - - 1337 0.3 ['Gender', 'Age', 'Surname'] 7321aa9 3ff50bf - 26aa741 - - - 45d3824 c03f208 -
├── 59e4a6d [exp-catboost] 01:54 AM 0.40093 0.82767 1337 0.3 ['Gender', 'Age', 'Surname'] 7321aa9 3ff50bf 39d1a9d 26aa741 adbf669 6ed8cec 7c5cbc2 45d3824 c03f208 -
└── cff6b68 [exp-xgboost] 01:28 AM 0.4207 0.82467 1337 0.3 ['Gender', 'Age', 'Surname'] 7321aa9 3ff50bf 39d1a9d 26aa741 adbf669 - 7c5cbc2 45d3824 c03f208 e111c68
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Conclusion
While code management has long been standard in classic software projects, data science and machine learning projects are still in their infancy in this respect. Managing data, models, and experiments plays an important role in developing experiments in a reproducible, scalable, and structured way. DVC is an innovative tool for managing machine learning projects efficiently. Due to its open-source nature and easy integration into existing Git repositories, it stands out from the crowd. Thanks to its easy installation and few dependencies, developers can adopt it in their projects with little effort. For data science teams, it facilitates collaborative work and the painless execution and management of experiments. The tool is aimed more at tech-savvy users and helps develop data science projects in a more automated way. In addition, it is important to keep in mind that DVC has only been around for a short time and does not yet have a long-standing set of best practices.