Domain Drift: The Problem and Strategies To Address It
Nothing lasts forever – be it youth, beauty, your laptop’s battery, or the accuracy of machine learning models. We’re not here to troubleshoot your battery issues or offer skincare tips, but we are qualified to explain why your models lose accuracy over time. So, let’s get to it.
At the highest level, machine learning models take inputs of data and attempt to infer information and interpret an ever-shifting, dynamic environment. The world around us has innumerable variables that we can’t control and may often struggle to predict. As environments and targets change – sometimes in significant ways over time – our models’ effectiveness will change as well as they struggle to adapt to new conditions. This concept is called Domain Drift. In this piece, we’ll explore how it can (and probably already does) affect your models, its different forms, how to detect it, and how synthetic data can help us address this tricky problem.
Domain drift, also known as Concept Drift, is an occurrence where the properties of the target we are trying to predict change over time. Machine learning is based on the idea that, by inputting data that represents the world and what we are trying to achieve, our model can “learn” from these inputs and teach itself to predict and analyze future data to help us obtain our target. An easy example: we want to train our model to recognize cats. We input datasets with pictures of cats, of all sorts and shapes, from different angles and in different poses. We define where the cat is in the image; this stage is called annotation. With these images and the related annotations, our model teaches itself to recognize cats in future images. Assuming that cats never change, the results should be consistent and a well-trained model should continue to perform well by successfully recognizing cats. Like any deterministic model or function, one input would mean a constant output. Imagine though, that cats starting wearing glasses, or growing a second tail, or wearing sneakers. The model, trained to identify one set of cats, won’t be as efficient at recognizing this new world. Important features of our target domain have changed.
Presumably, cats won’t start wearing glasses or start changing their hairstyles. But, there are many more plausible scenarios where the target can change and our existing model will no longer work well enough. Imagine a modern factory, with fully automated production lines and robotic devices. If we want to detect issues and stoppages, we’ll train our computer vision model with tons of data that reflects the process of assembling our products. But, perhaps a parts supplier has changed and so the process must be adjusted to reflect this new component. Unfortunately, our model has received no training data for this new reality. It will likely be unable to accurately assess issues in this step of the manufacturing process.
Let’s try another example. Take this image from the 1980’s – what do you see?
Now, look at this one –
Watch on-demand “Implementing Data-Centric AI with Synthetic Data” with Microsoft and Meta
Both people are listening to music. Yet, if we trained our CV model to identify people listening to music, but exclusively used images of boomboxes, would it have any chance of classifying the man with the AirPods? Probably not. The visual expression of listening to music has changed; the domain has drifted. But, unless we update our model, there’s no way it can keep up. It doesn’t stand a chance.
While the above example is easy to notice and contextualize, real-life changes are often subtle and nuanced. And they can change in a variety of ways:
- Gradual change over time – For instance, the ways we listen to music. We didn’t switch in one day from boomboxes to wireless earbuds. There were many stages on the way: stereo headphones, cable earbuds, etc. This raises questions about how often we need to update our model along this trajectory, especially in areas that are affected by consumer product design, trends, and fads.
- Recurring or cyclical change – For instance, the changing of seasons. A system trained to recognize people in summer clothing doesn’t necessarily translate into recognizing a figure swaddled in a fur coat and scarf.
- Abrupt change – We need to look no further than COVID – 19 and the sudden omnipresence of masks. The target domain of facial recognition changed suddenly and dramatically, but systems had not been trained with data of people wearing face masks.
Domain drift can be discovered utilizing both active and passive methods. Actively, we need to monitor our model’s predictions and efficiency. By testing our model at regular intervals, we should see similar results. If we see a decrease in the accuracy of our models, it might be time to check and see if domain drift is the reason. One strategy is to invest in dashboards that can help us visualize changes to the accuracy of our models over time. Implementing consistent, automated evaluation and benchmarking processes that are reflected visually can help draw our attention to shifts that require attention.
We can also utilize ensembles to help us detect domain drift. Ensembles are a number of models, trained on different datasets that should work together and cooperatively. If we detect that over time their level of agreement decreases, this could potentially indicate domain drift. The underlying assumption is that domain drift will affect the various trained models differently. By continuously measuring the agreement between the models we can then flag changes in the domain. This is significant because ensembles can help us both detect and correct domain drift by mitigating outliers influence, and correcting small levels of domain drift.
Another interesting way we can detect Domain Drift is by stress-testing our model by using synthetic data to highlight specific aspects of the system. We can test how different parameters like lighting, color, noise, or lens parameters can affect your model. This won’t necessarily help us recognize domain drift, but it can help us be proactive in recognizing areas that might be problematic if the domain were to shift over time. By taking forward-looking action we can make our models more robust.
In Indre Zliobaite’s paper, Concept Drift: an Overview he suggests that making assumptions about the future is a crucial step in dealing with domain drift. Using synthetic data together with dynamic and future-oriented thinking can be a potent combination in minimizing potential data drift. Synthetic data gives us the ability to generate data as opposed to waiting for real-world occurrences to happen. It enables us to control scenarios and act before domain drift. This can be used in the context of the stress test mentioned above, or more broadly in training our models with scenarios that we foresee before they come to pass.
Passively, we can utilize alert tools like Prometheus, or specialized tools like the scikit-multiflow library for Python that use ADWIN to detect drift. Beyond the data workarounds and tools, communication within teams about changes to the product, or the surrounding environment can help alert us to potential drift.
Once data drift has been identified, there are many ways to deal with it. Most of them are not mutually exclusive – in other words, teams can and employ multiple strategies in tandem For many teams, the starting point is a static model that utilizes your data. But, this alone is usually an inefficient way to address domain drift, even if it can show the signs of domain drift over time. So, instead of just relying on noticing and troubleshooting variations in the effectiveness of your static model, here are some more proactive approaches:
Periodical Re-fit or Update
Periodic Updates use recently captured data that can either replace existing data or be added to existing datasets. By consistently manually gathering sufficient amounts of data, you are more likely to capture gradual domain drifts and ensure that new environmental conditions are represented. This works best with gradual and recurring shifts. By replenishing our datasets at set intervals, we help maintain relevance barring any seismic shifts that happen overnight.
ML models allow us to assign a different weighting to different datasets – in this case, we can assign a proportionally higher weight to more recent data. Weighting data can mean a gradual forgetting of old data and is based on the assumption that old information loses relevance over time – in other words, that the domain drift is linear and permanent. In the case of recurring drift – this assumption is not necessarily true. This a blind adaptation model – meaning that there is no explicit detection of change. This weighting technique is related to periodical re-fits in that it also relies on ongoing data collection.
As discussed earlier, this approach both helps to detect and to confront domain drift. Here, we use newer models that can correct our static models based on more recent data, and simultaneously train different models on different datasets. We build several models and dynamically combine them in a way that is best used in the context of a gradual shift, when things may be changing slowly, unevenly, or if there is significant overlap or co-existence of key conditions.
Detect and Choose Model
Using this solution, we build systems to scrutinize incoming data, detect changes and then automatically select a model that will be most effective in solving our current problem. Generally, this approach is more appropriate for recurring drifts because it allows us to build models corresponding to predictable, cyclical conditions and then switch between them seamlessly as conditions change.
By curating data effectively, from initial training through the lifespan of our model, we can help minimize effects to our algorithms. The more thought we put into obtaining and curating our data, the more like that our training data will have enough variation and edge cases to handle gradual domain drift.
No matter how you choose to tackle Domain Drift, you’ll need new, large training datasets recent enough to re-train your model effectively. The application in question will determine just how big that dataset should be and what edge cases it should contain. For instance, in smart city applications the emergence of scooters and hoverboards requires new datasets of those objects to teach models to recognize them. Depending on your goal, you may need scooter image datasets in a wide range of weather conditions. You may need edge cases showing near-accidents or datasets showing a range of makes and models.
We know that building these retraining datasets takes time and money. All the while, your model may be losing effectiveness and accuracy. Whether it is drones making deliveries, or robots walking dogs – our domains are constantly drifting. Our models have to keep up. So, how you obtain this data efficiently and effectively is of the utmost importance.
Simulated data has some key advantages that can help you deal with domain drift problems.
One key advantage of simulated data is that it gives teams agility to deal rapidly with Domain Drift. If you can predict or recognize a shift in your target, you can simulate it without waiting for the drift to be noticeable enough to notice manually in new data. Simulated data also allows you to be as proactive and forward-looking as you desire. Even if a particular domain drift is far off (or unlikely to occur), you can still prepare for it by simulating those shifts just in case. The only thing needed to generate simulated data is your imagination.
Datagen’s simulated data solutions can be a great option for combating data drift. First, they can easily be fine-tuned to reflect gradual, seismic, or cyclical drifts. Using advanced networks, we can simulate real-world imagery so that teams can stay a step ahead of domain drift. We believe that algorithms shouldn’t have to suffer from domain drift just because collecting data and manually annotating it has been so slow in the past.
Second, we also leverage Domain adaptation techniques; a method that uses a model trained on a source distribution in the context of a different target domain. Using this technique we can transform existing datasets so that they are adapted to a new domain. This allows for more efficient utilization of existing resources and knowledge.
At Datagen, we are cognizant of the challenges that domain drift can pose – but, we work together with our partners to enable them to continue their progress in a dynamic world.
Watch on-demand “Implementing Data-Centric AI with Synthetic Data” with Microsoft and Meta