Introduction
The ML blog series we posted recently focused on the details of creating ML models for VMRay’s defined use case: enhancing its phishing URL detection. In that series, we covered how we engineered features (i.e. feature engineering) for model training, using the clean output of VMRay’s superior technology.
We gave a few examples of the features we extracted, their nature or classification (e.g. URL-content based), and their source (i.e. which artifact from VMRay’s sandbox output they come from). We also briefly discussed how we used ensembling for the final prediction.
Lastly, we discussed how we integrated these trained models into VMRay’s product to complement its existing core detection system, which is the success/exit criterion for VMRay’s ML journey… at least for this phase.
Having something in production, as mentioned, is arguably the best indicator of success in the software industry when it comes to introducing something new or radical into a company’s product portfolio. It shows the company’s innovation, technical and engineering skills, and drive to execute.
However, keeping software in production requires maintainability and sustainability, which is achieved by operationalizing the entire process and workflow… and machine learning systems are no exception. In fact, a sizable portion of the vast advancements in the AI/ML world is focused on designing and building frameworks that let ML development follow an established software development life cycle; this discipline is called machine learning operations, or in short and more common terms, MLOps.
In this post, we will focus on the fundamentals of MLOps, including its purpose, challenges, and components. Let’s dive in, shall we?
Purpose: Why do we need to operationalize Machine Learning?
The main purpose of MLOps is to streamline the development, deployment, and monitoring pipeline for ML systems. This means aggregating the inputs, feedback, and contributions from the different teams involved, and ensuring that all steps in the process are recorded, repeatable, and reproducible.
With the growing importance and scale of ML projects, interactions between teams of data scientists, software developers, IT administrators, and architects are in desperate need of clear guidelines and practices.
MLOps provides these guidelines and enables teams to collaborate toward building and maintaining a robust and scalable ML system. In conclusion, we can summarize MLOps as the set of practices that makes the development, deployment, and monitoring of ML systems recorded, repeatable, and reproducible.
Challenges: Bringing the best of data and code together
ML systems are just another type of software system, one that may follow waterfall, agile, or DevOps development strategies. So why are they more challenging to deploy and sustain?
The answer is simply that ML solution systems are not just code.
It’s DATA + CODE.
ML systems, as we may already know, depend on both the quantity and quality of the data they are fed, which makes the data essential to training the model. The code then enables us to fit (train) a model on that data and to produce predictions and other insights.
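As a minimal, self-contained sketch of this interplay (using synthetic data in place of VMRay’s real phishing features, and a random forest chosen purely for illustration), the code part fits a model to the data part and then produces predictions:

```python
# Minimal sketch: DATA + CODE -> trained model -> predictions.
# Synthetic data stands in for VMRay's real phishing URL features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)  # the DATA
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)  # the CODE
model.fit(X_train, y_train)             # fit (train) the model on the data
print(model.score(X_test, y_test))      # predictions / evaluation on unseen data
```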
Given this relationship between code and data, we need to carefully bridge the two in development and in production, so that they evolve (and are versioned) in a controlled way; only then can we build a scalable ML system.
Data for training, testing, and inference will change over time, across different sources, and needs to be accompanied by code revision whenever necessary. Without an established MLOps approach, there can be a disconnect in how code and data are linked, which causes problems in production, gets in the way of smooth deployment, and leads to results that are hard to trace or reproduce.
Components: CI/CD, Versioning and Testing
Now that we have established the goal and challenges of MLOps, let’s describe the components involved.
One core concept of data engineering is the data pipeline: the procedure of ingesting, transforming, and storing data. For MLOps, we use ML pipelines, which extend this concept with an ML model that consumes the output of a data pipeline.
ML pipelines are commonly visualized as DAGs (directed acyclic graphs), and Airflow is a typical tool for defining and displaying such a representation.
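As an illustrative sketch only (not VMRay’s actual pipeline), a minimal Airflow DAG for such an ML pipeline could look like the following; the task callables and the DAG id are hypothetical placeholders.

```python
# Minimal Airflow sketch of an ML pipeline as a DAG (illustrative only).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest():
    """Pull raw sandbox artifacts (placeholder)."""

def transform():
    """Engineer features from the ingested data (placeholder)."""

def train():
    """Fit the ML model on the transformed data (placeholder)."""

def evaluate():
    """Compute validation metrics for the trained model (placeholder)."""

with DAG(
    dag_id="phishing_url_ml_pipeline",   # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingest", python_callable=ingest)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_train = PythonOperator(task_id="train", python_callable=train)
    t_evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate)

    # Directed, acyclic ordering: each task depends on the previous one.
    t_ingest >> t_transform >> t_train >> t_evaluate
```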
The CI/CD (continuous integration / continuous delivery) practice for MLOps consists of several steps: we have to deliver the DATA + CODE for the ML pipeline, train the ML model, and integrate it into the product. Since the deployment needs to be reproducible, the MLOps practice of versioning covers the trained model together with the DATA + CODE of the ML pipeline.
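As a rough sketch of that idea (not VMRay’s actual tooling), one can record, next to every trained model artifact, the versions of the data and code it was built from; the file paths and manifest layout below are illustrative assumptions.

```python
# Sketch: tie a trained model artifact to the exact DATA + CODE it came from.
# Paths and the manifest layout are illustrative assumptions.
import hashlib
import json
import subprocess
from pathlib import Path


def sha256_of(path: str) -> str:
    """Return the SHA-256 digest of a file, used as a content-based version."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


manifest = {
    "model": sha256_of("artifacts/model.pkl"),
    "training_data": sha256_of("data/train.parquet"),
    # The code version is the git commit the pipeline ran from.
    "code": subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip(),
}
Path("artifacts/version_manifest.json").write_text(json.dumps(manifest, indent=2))
```

Any deployment can then be traced back to, and rebuilt from, exactly one combination of data, code, and model.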
To ensure reliability, the MLOps practice of testing covers every step in the delivery process. We have unit tests for the code in the ML pipeline that transforms the data. For the data itself, we use data validation to ensure the ingested data has the expected format.
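A minimal sketch of such a validation step, assuming hypothetical column names and dtypes rather than VMRay’s real schema, could look like this:

```python
# Sketch of a data validation check run before training.
# The expected columns and dtypes are hypothetical placeholders.
import pandas as pd

EXPECTED_COLUMNS = {"url": "object", "domain_age_days": "int64", "label": "int64"}


def validate(df: pd.DataFrame) -> None:
    """Raise if the ingested data does not match the expected format."""
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for col, dtype in EXPECTED_COLUMNS.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
    if df["label"].isna().any():
        raise ValueError("labels must not contain NaN values")
```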
When validating a new ML model, the critical difficulty is that machine learning only approximates reality. That means we cannot rely on binary pass/fail logic like unit tests; instead, we use statistical measures. The same is true for monitoring the behavior of a deployed model. We need statistical indicators, and, in contrast to model evaluation, we usually do not have accurate labels at this point. Hence we rely on comparing the classification ratio across different time ranges and model versions.
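As a rough sketch of that comparison, one can track the share of samples classified as positive (e.g. phishing) in two time windows or by two model versions, and flag the model when the ratio drifts too far; the 10-percentage-point threshold below is an arbitrary placeholder, not a recommended value.

```python
# Sketch: compare the classification ratio of two time windows or model versions.
def classification_ratio(predictions: list[int]) -> float:
    """Fraction of samples classified as positive (e.g. phishing)."""
    return sum(predictions) / len(predictions)


def ratio_drift_alert(prev_preds: list[int], curr_preds: list[int],
                      threshold: float = 0.10) -> bool:
    """Return True when the positive-classification ratio shifts beyond the threshold."""
    drift = abs(classification_ratio(curr_preds) - classification_ratio(prev_preds))
    return drift > threshold  # placeholder threshold; tune per use case
```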
To summarize, the MLOps practices we have covered are:
CI/CD