Tadas Baltrusaitis is a principal scientist in the Microsoft Mixed Reality and AI lab in Cambridge, UK, where he leads the human synthetics team. He recently co-authored the groundbreaking paper DigiFace-1M, a dataset of one million synthetic face images for face recognition. His PhD research focused on automatic facial expression analysis in especially difficult real-world settings. He was also a postdoctoral associate at Carnegie Mellon University, where his primary research lay in the automatic understanding of human behavior, expressions, and mental states using computer vision.
This transcript has been edited for length and clarity. Listen to the full episode here.
Tadas B.: I was always interested in how we can apply technologies to solve real world problems and when I started my PhD, I was interested in understanding humans and maybe applying that to smart interfaces. And we were thinking maybe if we could recognize someone’s mental state, emotional state, maybe we could adapt interfaces appropriately if we noticed that someone’s bored or excited.
And when I started that, I realized the technology was actually not there yet. We couldn't do that. So that pushed my interest toward the computer vision and machine learning side of things: let's build technology that can do that, that can track faces, expressions, eye gaze, head pose, and other objective markers of your behavior. I've been in that area ever since, also looking at how we can apply these technologies to various exciting applications.
Inferring Mental States from Facial Expressions
Tadas B.: When I started my PhD, I was slightly naive about the complexity of the problem, and now I'm much more appreciative of the intricacies there. Smiling is a great example.
People smile when they’re embarrassed. People smile when they’re unsure. There are a lot of social rules, and those rules are dictated by the culture you grew up in and the culture you’re in now. You might behave differently based on the context you’re in. So to actually understand mental states, we need to factor all of those in, and that’s really difficult. You need to have all the context, and even then there are different kinds of emotional or mental state. There’s what you feel inside, which no one can read. There’s what you intend to show. There’s what’s perceived by the observer. And they’re all slightly different; noise is added to the system at every step of the way, some of it intentional, some unintentional. So yeah, it’s a fascinating topic. I think we need to be careful when tackling it and really think about why we are doing it. It’s not always obvious what an application of recognizing your internal state would be.
If you attach reflective dots to people’s faces and don’t show the faces at all, but just record those dots and replay them, people are still able to infer the meaning behind them, the intended emotion being expressed.
So there’s so much information in the motion that we’re really not modeling well, not capturing, and probably not even understanding all that well. I think there’s gonna be a lot of exciting work to do in that space.
Fake It Till You Make It
Tadas B.: As I mentioned before, I was fortunate to be in a group that already had a proven track record with synthetic data; Microsoft had already demonstrated the value of synthetic data for full-body tracking and for hand tracking, though in a more specialized domain with specialized types of imagery.
We hadn’t seen proof that this could work for faces and for visible-light cameras, as opposed to depth or infrared cameras. Part of it was wanting to demonstrate internally that this works, because even though we had evidence that it works for Kinect body tracking and for hand tracking, people were still a bit careful and dubious. For full-face applications the images don’t look as good, and I can explain that partly by the fact that we’re so sensitive to facial imagery. If a synthetic body or a synthetic hand doesn’t look quite right, eh, you know, people will be okay with it, and what will matter is evidence that it generalizes.
With faces, even if you have some evidence that it generalizes, people will say, oh, but it does look a bit creepy, it doesn’t look really realistic. And you have to push past that and say, no, it’s fine; it might be in the uncanny valley for us, but for a DNN that might not matter as much.
And that’s why we wanted to build that evidence base, and it was tricky. When you train models on synthetic data, because synthetic data annotations are a bit different, you validate on real data sets. Sometimes predictions from models trained on synthetic data are better than those from models trained on real data, but the real data is annotated by people, which is just a different type of annotation. The key was being able to evaluate on real data and build that evidence base across several tasks and several data sets.
Tadas B.: In the projects we’re involved in, we always need real data to evaluate on, because if you just train on synthetic data, you might not know how well you’re performing. You could test on synthetic data, and there’s some value in that in more niche settings, but the only way to answer how well you generalize to real settings is through real data collection.
And it’s gonna inform you about what gaps might exist in your synthetic data, and there always will be gaps. Human appearance is vast: not only the shapes of faces and how we look, but also things like hairstyles and clothing. And I genuinely hope that I’m never able to build a system that captures all of it, because that would mean human creativity has ended. We’ll always have new types of garments, new hairstyles, new facial tattoos, piercings, what have you. I hope we never hit that point.
Advice for Future CV Engineers
Tadas B.: Don’t be afraid to look at the data, and focus on the data. So often I’ve seen people get distracted by trying a slightly different architecture, or a slight tweak here, maybe even a big tweak to the architecture, without really having an appreciation of what their training data and test data are, because typically performance is driven by the quality of your data. And I know it’s maybe not the most glamorous of activities to clean or organize your training data or your test data, and to check whether some of the annotations are wrong, but often the biggest gains will be there. Once you do that, yes, there’s huge value in improving algorithms as well. But I think a bit of focus on data, and appreciating data work, matters; we often ignore or under-appreciate that work even though it’s hugely important. I know there are movements toward more data-centric AI now, but that’s strangely late considering the state of the field.