Body Models: Driving the Age of the Avatar
This transcript was edited for clarity and length.
Michael J. Black is one of the founding directors of the Max Planck Institute (MPI) for Intelligent Systems in Tübingen, Germany. He completed his PhD in computer science at Yale University and his postdoc at the University of Toronto, and has co-authored over 200 peer-reviewed papers to date.
His research focuses on understanding humans and their behavior in video, working on the boundary of computer vision, machine learning and computer graphics. He has worked on realistic 3D human body models such as SMPL, which has been widely used in both academia and industry. In 2017, the startup that he co-founded to commercialize these technologies was acquired by Amazon. Today, Michael and his teams at MPI are developing exciting new capabilities in computer vision.
History of body model development
Michael J. Black: So we wanted to learn the template shape of the body that would then be deformed. We added these blend shapes for body shape to change the body. We wanted to learn the weights that you use in linear blend skinning. We learned the joint locations because body shapes change. Your joints are in a different place than my joints. So your joints have to be a function of your body shape.
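To illustrate the idea that joint locations are a function of body shape, here is a minimal NumPy sketch. It is not the SMPL implementation: the sizes are toy values, and `template`, `shape_dirs`, and `joint_regressor` are hypothetical names filled with random placeholders where a real model would use learned parameters. It assumes shape blend shapes are added linearly to the template and that joints are a fixed linear combination of the shaped vertices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical); a real body model has thousands of vertices.
N_VERTS, N_JOINTS, N_SHAPE = 6, 2, 3

# Random placeholders standing in for parameters learned from scans.
template = rng.normal(size=(N_VERTS, 3))             # mean body shape
shape_dirs = rng.normal(size=(N_VERTS, 3, N_SHAPE))  # shape blend shapes
joint_regressor = rng.random(size=(N_JOINTS, N_VERTS))
joint_regressor /= joint_regressor.sum(axis=1, keepdims=True)  # rows sum to 1


def shaped_body(betas):
    """Return (vertices, joints) for a vector of shape coefficients."""
    # Linear blend shapes: template plus a weighted sum of shape directions.
    vertices = template + shape_dirs @ betas  # (N, 3, S) @ (S,) -> (N, 3)
    # Joints are regressed from the shaped vertices, so they move with shape.
    joints = joint_regressor @ vertices       # (J, N) @ (N, 3) -> (J, 3)
    return vertices, joints


# Two different bodies get two different sets of joint locations.
_, joints_a = shaped_body(np.zeros(N_SHAPE))
_, joints_b = shaped_body(np.array([1.0, -0.5, 0.2]))
print(joints_b - joints_a)
```

Because the joints are computed from the shaped vertices rather than stored as constants, changing the shape coefficients moves the skeleton along with the surface.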
We learned that, and then we learned the pose correctives, and here was the big problem: how to parameterize pose. The joints or the limbs of the body can be described in lots of different ways. You can use Euler angles, an axis-angle (Rodrigues) representation, or quaternions; there are all kinds of representations.
And what we needed was a representation that would be related to shape deformation in a really simple way, because we wanted it to be a simple model. And there it turned out, oddly, that we could linearly relate the shape deformation of the body to the elements of the rotation matrices describing the part rotations.
And this was the best thing we came up with. We had many different formulations, but this was dead simple. Remarkably, when we trained it, even though it sounds more restrictive in many ways than SCAPE, it was more accurate than SCAPE. So it ticked all the boxes that we needed: accuracy, speed, compactness, and full compatibility.
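The linear relation between part rotation matrices and shape deformation described above can be sketched as follows. This is a hedged illustration, not the trained model: `pose_dirs` is a hypothetical name holding random placeholder values where a real model would use learned pose-corrective blend shapes, and the sizes are toy values. The sketch assumes poses arrive as axis-angle vectors, converts them with the Rodrigues formula, and subtracts the identity from each rotation so that the rest pose produces zero correction.

```python
import numpy as np


def rodrigues(axis_angle):
    """Convert an axis-angle vector of shape (3,) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-12:
        return np.eye(3)  # no rotation
    k = axis_angle / theta
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def pose_feature(axis_angles):
    """Stack the elements of (R - I) for every part: (K, 3) -> (9K,)."""
    return np.concatenate(
        [(rodrigues(a) - np.eye(3)).ravel() for a in axis_angles]
    )


# Toy sizes (hypothetical) and a random stand-in for learned correctives.
N_VERTS, N_PARTS = 6, 2
rng = np.random.default_rng(0)
pose_dirs = rng.normal(size=(N_VERTS * 3, 9 * N_PARTS))


def pose_corrective(axis_angles):
    """Per-vertex offsets that are linear in the rotation matrix elements."""
    return (pose_dirs @ pose_feature(axis_angles)).reshape(N_VERTS, 3)


# Bend the first part by 0.5 rad about the z axis; leave the second at rest.
bent = np.array([[0.0, 0.0, 0.5], [0.0, 0.0, 0.0]])
print(pose_corrective(bent))
```

The whole corrective is a single matrix-vector product, which is what keeps this formulation simple and fast compared with a nonlinear function of joint angles.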
I’ve had this experience many times in my career where there’s an almost innate drive among academics to do something really complicated and fancy.
It’s kind of fun. And I have done that in my career. I was trying to solve a problem and I came up with this very fancy nonlinear particle filtering technique to solve it. And I was describing it to someone and they said, “Have you just tried a Kalman filter?” And it was like, no, but I should have. Of course that’s the first thing I should have done.
And of course the Kalman filter beat my fancy, crazy nonlinear particle filter. And so I’ve had to learn that lesson again and again: start with the simple solution, and then when it breaks, develop something fancier. But if you want people to adopt your stuff, if you want it to have an impact in the world, then if it’s compatible, you have a much higher chance of it actually working out.
The age of the avatar
Michael J. Black: Everybody is going to have an avatar. And in fact, many avatars that you use for different things. You’re gonna have an avatar for shopping for clothing, an avatar for being in virtual meetings, an avatar for going to a concert, an avatar for playing a game. Now, are these all going to be separate avatars or are you gonna have control of them?
Are they going to be like you in some way? That is, will they have your facial expressions? If I see you from a distance, I would recognize you because I know you. It should be the same with your avatar. Whether it’s a Lego character, a Roblox character or something, or it’s physically like you, it should embody you.
And so your emotions, your expressions coming out through all of your avatars, I think, is the future we want.
So what I see as a major shift coming is a seamless technology that allows me to go shop for Nike clothing, and then have that Nike clothing on my avatar in my fitness app, but also have it physically fitting me. … I think we need a unifying avatar technology that can support your animated avatar, can support you in your video game, whatever it happens to be, and could support you doing e-commerce when you’re shopping for clothing or bicycles or whatever else it is.
MetaHumans are super cool avatars, but you’ve gotta really know what you’re doing to create one. And most avatar creation methods are goofy things where you pick your hair color and you pick some makeup and things like that. It makes an avatar, but it’s not really embodying you.
So we need to break down that barrier and make avatar creation absolutely dead trivial, accurate, and low friction. And then we need to be able to animate those avatars and we need to make it so that they plug and play with just about anything in the metaverses. And then there will be real power to this idea of meta around avatars.
Career tips for computer vision engineers
Michael J. Black: Right now everybody is super focused on deep learning and it’s great. It works. And it’s an effort just to keep up with that, but I would really encourage people beginning in this field to not just do that.
Learn linear algebra, probability, physics, whatever it happens to be. The skills that you get in these other disciplines really ground and inform your ideas and can be incorporated in. So there are many properties of the world, for example, that machines don’t have to learn. They just are there.
And if you understand mathematics or physics, you can exploit them when training a machine. So one of the great things about Datagen is providing synthetic data to people to train machines, but in addition to that, there are generic priors that we know about the world.
We know about the physics of the world, and it doesn’t necessarily have to be learned. So I would not just be very narrowly focused on deep learning. I would look a little bit more broadly and learn some geometry. The 3D world is important. And even if your networks are going to learn about the geometry for you, if you don’t have intuitions about perspective cameras and occlusion or properties of light, if you don’t know about illumination and how it interacts with surfaces, then you may not really understand what your method is doing and why it’s not doing the right things.
And you won’t know how to choose the right data to actually solve that problem. So having physical intuitions about the world I think is really critical.