The shift towards event-driven and real-time API architecture is accelerating. This means incorporating event-based APIs into a technology strategy while leveraging existing legacy API systems that may have incurred a fair amount of technical debt, especially at historically progressive organizations.
About four months ago, a well-known email service provider (ESP) asked us to build a RealTimeML model for the email industry. We realized that processing 50M emails for a given RealTimeML deployment (a Send Time Optimization model) required us to dive deeper into time trials. We asked ourselves: what if we needed to serve multiple models from a single endpoint during these trials? What if an ESP wanted to use two of our RealTimeML models, or a bundle of models, since there are instances where specific endpoints will require multi-model modality, or MMEs? We tested serving these predictions from a single endpoint as quickly as possible.
To serve inferences with low latency and high accuracy, we would need extensive expertise in ML engineering or MLOPs. Think of an ESP with 100K clients. At a 2% activation rate, 2K clients at that ESP would subscribe to a couple of different RealTimeML Models. Machine Learning engineers would focus on deploying the models to endpoints or, in some cases, to the edge. More importantly, after a successful deployment, they would monitor these models for data, concept, or feature drift.
Just an FYI, we have had 101 applicants for the MLOPs role at our company. While a few have pretty impressive resumes, the role still needs to be defined with greater granularity across the industry. Data Scientists and Software Engineers apply for these roles, and while the role is specific to deployment and monitoring, data scientists do collaborate.
This brought to our attention that, as we supply these Models, each Model becomes a product. This is where systems engineers consider and monitor both the model health and the pipeline health assigned to each endpoint once these models are deployed. If an endpoint had several different models serving thousands of clients, how would we ensure pipeline health and system integrity, and monitor data, feature and concept drift?
In building a prototype dashboard for measuring elements of the ML Lifecycle, we modified it to accommodate different metrics for each type of Model we build. We chose to create a dashboard where we would monitor some aspects of each ML lifecycle of the Machine Learning ecosystem, from Model Registry to Model Monitoring and Business Value. These include:
For prototypes, we know this is not the most efficient way to measure in real time, but since we have had traction in other segments using this approach, we believe these dashboards present value to the end user. Still, we feel that more specific metrics are needed at each level of the ML lifecycle.
We now see the RealTimeML predictive analytics models we build as actual products for different industries, and each product has a particular set of MLOPs requirements and metrics that should be measured. When creating a dashboard for your constituency, enough time should be invested in the design of your API if you plan to stay relevant in this automated decision-making decade.
This article will attempt to provide a focused but limited view of what may be measured for RealTimeML and what an MLOPs dashboard may look like across all ML lifecycles. It is the first of a two-part series covering what has been measured when serving predictions and what we may consider measuring when monitoring the Model’s health and other related components.
This first article will focus on two fundamental aspects of RealTimeML lifecycles: the measurements needed for understanding Data Preparation and Model Development, which, according to Andrew Ng, are the first two phases of the ML Lifecycle.
The second part of the series will focus on the MLOPs structure for the final three design components of the ML Lifecycle: Deployment, Monitoring and Business Value.
Unlike other dashboards, these dashboards will be cross-functional and panel-rich so that you can drill down into the specifics of your Machine Learning lifecycle and model health. However, cost considerations should be taken into account. Each measurement should record the phase of the ML Lifecycle, the metric you want to measure or display, and a description of the metric. For the data preparation lifecycle, you may want to know the data source and how many third-party data sets were introduced to a given model.
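To make the phase, metric and description structure concrete, here is a minimal sketch of how one dashboard entry might be recorded. The class name `DashboardMetric`, the helper `metrics_for_phase` and every value shown are hypothetical, chosen only for illustration; this is not the schema we use in production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DashboardMetric:
    """One panel entry: which lifecycle phase, which metric, what it means."""
    phase: str        # ML Lifecycle phase, e.g. "Data Preparation"
    name: str         # the metric you want to measure or display
    description: str  # plain-language description of the metric
    value: float      # latest measured value (illustrative numbers below)


# Hypothetical entries; names and values are assumptions, not real data.
metrics = [
    DashboardMetric("Data Preparation", "data_sources",
                    "Number of distinct data sources feeding the model", 2),
    DashboardMetric("Data Preparation", "third_party_datasets",
                    "Third-party data sets introduced to the model", 3),
    DashboardMetric("Model Development", "outlier_pct",
                    "Percent of training values flagged as outliers", 1.5),
]


def metrics_for_phase(all_metrics, phase):
    """Return only the metrics that belong to one lifecycle phase."""
    return [m for m in all_metrics if m.phase == phase]


for m in metrics_for_phase(metrics, "Data Preparation"):
    print(f"{m.phase} | {m.name} = {m.value} | {m.description}")
```

Keeping the description next to the value means each panel can explain itself to a reader who is not on the data science team.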
A dashboard you create should be able to toggle between models that ingest data in batches and models that ingest streaming data. You may want to toggle between models served by industry type, or sort by the number of inferences or predictions served per day, hour or minute. However, your design choice is paramount, along with the efficacy of the models themselves, what is available for you to measure, and perhaps if and when a model should ultimately be deprecated as the industry matures.
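The toggling and sorting described above could be sketched roughly as follows. The `ModelSummary` fields, the `select_models` helper, the `image_engagement` model name and all inference counts are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class ModelSummary:
    name: str
    industry: str
    ingestion: str             # "batch" or "streaming"
    inferences_last_hour: int  # illustrative counts, not real traffic


# Hypothetical model inventory for the dashboard.
models = [
    ModelSummary("send_time_optimization", "email", "streaming", 120_000),
    ModelSummary("sentiment_analysis", "email", "batch", 8_000),
    ModelSummary("image_engagement", "e-commerce", "batch", 15_000),
]


def select_models(items, ingestion=None, industry=None):
    """Filter by ingestion mode and/or industry, busiest model first."""
    selected = [
        m for m in items
        if (ingestion is None or m.ingestion == ingestion)
        and (industry is None or m.industry == industry)
    ]
    return sorted(selected, key=lambda m: m.inferences_last_hour, reverse=True)


# Toggle to batch models only, sorted by inferences served in the last hour.
for m in select_models(models, ingestion="batch"):
    print(m.name, m.inferences_last_hour)
```

Passing `None` for a filter leaves that toggle off, so the same function serves both the "all models" view and the narrowed views.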
Detecting and fixing ML system failures in production is a requirement. Models must be monitored so you can understand when and why a model fails in production after working well during development. According to Chip Huyen, system failures would include but are not limited to:
Dependency failure: a dependency your system relies on breaks and might cause your system to break. This failure mode is expected when the dependency is maintained by an organization outside your immediate control, like a third party.
Deployment failure: failures caused by deployment errors, such as when you accidentally deploy an older version of your Model instead of the current version, or when your systems do not have the proper permissions to read or write specific files.
Hardware failure: the hardware you use to deploy your models, such as CPUs or GPUs, does not behave correctly. For example, the CPUs you use might overheat and break down.
Downtime or crashing: if a system component runs from a server somewhere, such as AWS or a hosted service, and that server is down, your system will also be down.
An API that measures RealtimeML metrics should have ways to measure the types of data you are using. For example:
Understanding the distribution of your dataset is a focal point. Before you have an adequately labeled dataset, you have to determine the scope of the data in its entirety: what percent is text, numerical, image, audio or video? What percent is time-series data? How many features are there, and what percent of those features can be augmented to increase the size of the dataset if needed for greater accuracy?
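As a rough illustration of measuring that scope, the sketch below computes what percent of a dataset's features fall into each data type. The feature names, their type tags and the `composition_pct` helper are hypothetical; in practice the tags would come from your own schema or catalog.

```python
from collections import Counter


def composition_pct(feature_types):
    """Percent of features per data type, rounded to one decimal place."""
    counts = Counter(feature_types.values())
    total = sum(counts.values())
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}


# Hypothetical feature-to-type mapping for an email dataset.
feature_types = {
    "subject_line": "text",
    "body": "text",
    "send_hour": "time-series",
    "sent_at": "time-series",
    "open_rate": "numerical",
    "click_rate": "numerical",
    "list_size": "numerical",
    "hero_image": "image",
}

composition = composition_pct(feature_types)
print(composition)
```

The same counting approach works per row or per byte instead of per feature; which denominator you pick is a design choice for your dashboard.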
We have spoken about this before, and it is becoming more significant. Gone are the days when we simply introduced more and more data. This is the age of data freshness and, more importantly, the integrity of the data source. These are items you can measure and customize.
Secondly, where was the data initially sourced? Was it first-party or third-party data? Which version of the dataset was introduced to the specific Model? And, of course, what percent of the dataset is complete, and what percent needs imputation?
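One possible way to put numbers on completeness and imputation is sketched below. It assumes missing values are represented as `None`; the `completeness_report` helper and the sample rows are made up for illustration.

```python
def completeness_report(rows):
    """Return (percent of cells filled, percent of rows needing imputation)."""
    total_cells = filled_cells = incomplete_rows = 0
    for row in rows:
        missing = sum(1 for value in row.values() if value is None)
        total_cells += len(row)
        filled_cells += len(row) - missing
        if missing:
            incomplete_rows += 1
    if not rows or total_cells == 0:
        return 0.0, 0.0
    complete_pct = round(100 * filled_cells / total_cells, 1)
    impute_pct = round(100 * incomplete_rows / len(rows), 1)
    return complete_pct, impute_pct


# Hypothetical sample: 4 rows x 3 fields, with 2 missing cells in 2 rows.
rows = [
    {"opens": 12, "clicks": 3, "region": "EU"},
    {"opens": 7, "clicks": None, "region": "US"},
    {"opens": None, "clicks": 1, "region": "US"},
    {"opens": 9, "clicks": 2, "region": "APAC"},
]

complete_pct, impute_pct = completeness_report(rows)
print(f"{complete_pct}% of cells complete, {impute_pct}% of rows need imputation")
```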
If you are working with images or even audio, you might need to apply data augmentation methods to expand the dataset used to predict engagement rates. If you have 10K images and augment all 10K, you could end up with a dataset that is 5-8x the size, or up to 80K images, with colour and styling variations of each image.
We generally think expanding datasets is helpful, especially when a campaign engineer is deciding which image will have the highest engagement rate prior to sending. In e-commerce, there is data available when using the 360-degree viewport. By introducing this "view prior to purchase" feature to the dataset, you might find that the current buyer segment liked, perhaps, a 45-degree angled view, and thus use a similarly angled view in your emails to infer the engagement rate prior to sending. So, in this case, you could measure the complete dataset and know that, say, 75% of it is based on augmented imagery.
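The arithmetic behind these figures can be checked with a small sketch. It assumes every original image receives the same number of augmented variants, which is a simplification; the function name `augmented_dataset` is hypothetical.

```python
def augmented_dataset(originals, variants_per_image):
    """Return (total images, percent of the dataset that is augmented)."""
    augmented = originals * variants_per_image
    total = originals + augmented
    return total, 100 * augmented / total


# 10K originals with 7 variants each gives the 8x (80K) upper bound.
print(augmented_dataset(10_000, 7))  # (80000, 87.5)

# 10K originals with 3 variants each gives a dataset that is 75% augmented.
print(augmented_dataset(10_000, 3))  # (40000, 75.0)
```

Tracking the augmented share alongside the total keeps the dashboard honest about how much of the dataset is original imagery.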
Measuring model development is not as simple as it sounds. There is a ton to consider. We have been able to break down this lifecycle in a few different ways, but as you might suspect, there are several ways to skin the cat. It all depends on how granular you want your measurements to be. We here at Loxz have started with some general metrics. Some of the vital measurements you want to consider are these elements:
What % of the data are outliers, and what % falls within the normal distribution? With this measure, you can compare against benchmarks on your training data to detect an imbalance of outliers.
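As one way to compute an outlier percentage, the sketch below uses the common interquartile range (IQR) rule. This is an assumed technique chosen for illustration, not necessarily the rule used on our dashboards, and the sample values are made up.

```python
import statistics


def outlier_pct(values, k=1.5):
    """Percent of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return 100 * len(outliers) / len(values)


# Hypothetical feature values with one obvious outlier (95).
values = [10, 11, 12, 12, 13, 13, 14, 14, 15, 95]
print(outlier_pct(values))  # 10.0
```

A z-score threshold would work as well for roughly normal data; the IQR rule is simply less sensitive to the outliers it is trying to find.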
What % of the dataset being introduced has been transformed? If you do not have numerical data, it helps to be able to transform an NLP dataset into numerical features, using tokens to ascertain critical vectors. These features can then be used to build machine learning or deep learning models with higher accuracy. Kudos to Agnes on our team, who built our Sentiment Analysis model using this exact approach and achieved a relatively high accuracy score. She cannot wait to introduce another dataset to this Model, as our discussion was about threshold.
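To show what turning text into numerical features through tokens can look like, here is a minimal bag-of-words sketch. It is a generic illustration, not the approach used in our Sentiment Analysis model; the tokenizer pattern, function names and sample texts are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, one token-count vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors


# Hypothetical subject lines.
docs = ["Great offer, great price", "Price too high"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['great', 'high', 'offer', 'price', 'too']
print(vectors)  # [[2, 0, 1, 1, 0], [0, 1, 0, 1, 1]]
```

Once text columns are vectorized this way, the share of columns that went through such a step gives you the "% transformed" figure for the dashboard.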
What percent of features in the dataset were extracted? Feature extraction is commonly used for deep learning models, where dimensionality reduction is required to maximize GPU usage and minimize the costs of machine learning. While many of these costs have come down over the last 12-18 months, high-performance chips that can run deep learning models on many features in a short time are still in demand and will remain so through 2025 at least. This is one reason why there is a chip shortage on the market for very high-performance hardware.
What percent of your datasets is structured vs. unstructured? It is common to have both structured and unstructured data, but the split is not yet measured accurately. Nevertheless, a high percentage of your datasets will combine both, and it is essential to measure this. It matters because, as you find where the strengths of your data science team lie, you will be able to distribute the workload between those scientists who like to work on structured or semi-structured data and those who prefer unstructured data like text and NLP models.
What types of data-labeling techniques were used, and what % of the labeling was Semi-Supervised or Human Labeling? Also, what % was performed by a third party? Understanding the metrics behind these reports can help you determine your model-building efficiency for the next models you build. Below are other items your MLOPs person might want to measure. Some of these will be covered in the next segments, where we will break down measuring model deployments, model monitoring, and the proper KPIs for measuring business value and figuring out the ROI of each Model deployed. Other ML Lifecycle items to measure:
In the next article in this series, we will examine ways to measure the other three components of the ML Lifecycle, including Model Deployment and Model Monitoring (Feature, Concept, and Data Drift), along with a confidence score that anticipates when concept or data drift might occur. Finally, we will dive into which KPIs you should be monitoring to provide the best ROI for each Model.