The Next Frontier: Computer Vision on 3D Data

Ofir Zuk (Chakon)

19/10/2022

Or Litany currently works as a senior research scientist at Nvidia. He earned his BSc in physics and mathematics from Hebrew University and his master’s degree from the Technion. After that, he went on to do his PhD at Tel Aviv University, where he worked on analyzing 3D data with graph neural networks. For his postdoc, Or attended Stanford University, where he pushed the cutting edge of 3D data analysis. Or is an extremely accomplished researcher whose work focuses on 3D deep learning for scene understanding, point cloud analysis and shape analysis.

This transcript has been edited for length and clarity. Listen to the full episode here.

Why is 3D Important?

Or Litany: Why is 3D important? I think different fields will give different answers. One could say it’s just interesting, maybe because it’s harder, but really the world is mostly 3D.

Why are we even doing computer vision? A lot of it, of course not all of it, is about having machines help us understand the world, or understand the world themselves and act in it. And if the world is 3D, then they need, even if only implicitly, to be able to understand this 3D-ness of the world.

You don’t want machines to see just the front surface of a person and assume that person is cardboard, right? You want them to assume that the person has some depth. And that’s a bias that you have to bake in.

So that’s one, right? We need an understanding, somewhere inside the machinery, that is 3D. Then comes natural data that comes in 3D. We wanna see how far away things are. We have LIDARs for that. We have depth images for that, and that’s a different type of data to work on. LIDAR data is very different from pixels.

That’s just the way the raw information comes out of the sensor. And much like we have machinery to process pixels, we need to develop machinery to process point clouds.
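To make the difference between pixels and point clouds concrete, here is a minimal sketch of turning a depth image into a point cloud. It assumes an ideal pinhole camera; the function name `depth_to_point_cloud` and the parameters `focal`, `cx` and `cy` are illustrative and do not come from the interview.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: List[List[float]], focal: float, cx: float, cy: float
) -> List[Point3D]:
    """Back-project a depth image into camera-frame 3D points.

    Assumes a pinhole camera with focal length `focal` (in pixels) and
    principal point (cx, cy). Pixels with non-positive depth are treated
    as missing measurements and skipped.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):          # v: pixel row
        for u, z in enumerate(row):          # u: pixel column, z: depth
            if z <= 0:
                continue                     # no valid depth at this pixel
            x = (u - cx) * z / focal
            y = (v - cy) * z / focal
            points.append((x, y, z))
    return points


# A tiny 2x2 depth image with two valid measurements.
cloud = depth_to_point_cloud([[0.0, 2.0], [2.0, 0.0]], focal=1.0, cx=0.5, cy=0.5)
```

Note that the result is an unordered list of points with no grid structure, which is one reason the machinery built for images does not carry over directly.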

So you could call it 3D imagining – what a scene would look like from another perspective – that’s 3D too. So we need it everywhere, from robots cleaning our houses to assisting us in various activities.

3D Generative Models in the Future

Or Litany: I’m thinking about it a lot these days. We are seeing those images from DALL-E and it’s just remarkable. I remember myself reading the first paper and thinking, oh, this is a super cool idea. But look at those rooms, they look terrible. You know, there were some bedrooms there and I was like, this is ages from working.

And I was very wrong. Now I’m looking at images, and in certain classes like faces, I know people say they can tell the difference between AI-generated and real images, and that they can solve deepfakes and stuff. But to my eyes, they all look amazing. You know, they just look real. So really, whether it’s GANs or diffusion models, the machinery has evolved very quickly, but it’s also in a lot of ways very much driven by huge amounts of data.

So this race for more compute and more data just keeps proving itself time after time after time. I’m ignoring all the frustrating aspects of it and focusing on the good. The good is that for a long time now, we’ve been, as a society, collecting lots of images and lots of text that comes with those images.

And we also have the machinery to process this data in as raw a form as possible, and now to generate it – make it out of thin air, just create beautiful images. And that is controllable by text queries. One reason 3D is not there yet, you could say, is that we don’t have, or only recently started having, 3D scanners in our pockets.

So given that we’ve had digital cameras in our pockets for at least a decade now, it’s gonna take a while. It’ll start emerging. We’ll see more and more people uploading their scanned models to the internet and then maybe tagging them, writing some stuff about them. And I’m not talking about artists uploading models to the cloud.

That’s useful, very useful, but that’s not even close to the orders of magnitude of data we’re talking about with images and text. That will come, I’m certain, you know, 10 years from now, but we don’t wanna wait. And there’s a hack here, because one cool thing about 3D is that you can project it and make it into 2D, and that’s a really useful key. And I think the most interesting thing that’s happening now in the 3D community, or one of the most interesting things, is neural rendering of all types and kinds and sorts. So I’m not specifically talking about one or another, but neural rendering is really opening up this aspect of being able to supervise your model with realistic 2D images, but still be working in 3D.
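The "project it and make it into 2D" step can be sketched with a simple pinhole camera. This is only an illustration of the projection itself, not of any particular neural rendering method; the function name `project_points` and the parameters `focal`, `cx` and `cy` are assumptions made for this example.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project_points(
    points: List[Point3D], focal: float, cx: float, cy: float
) -> List[Pixel]:
    """Project camera-frame 3D points onto the image plane.

    Assumes an ideal pinhole camera with focal length `focal` (in pixels)
    and principal point (cx, cy). Points at or behind the camera plane
    (z <= 0) cannot be projected and are skipped.
    """
    pixels: List[Pixel] = []
    for x, y, z in points:
        if z <= 0:
            continue
        pixels.append((focal * x / z + cx, focal * y / z + cy))
    return pixels


# Two points at depth 2: one on the optical axis, one offset to the right.
pixels = project_points([(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)], focal=100.0, cx=50.0, cy=50.0)
```

Because each pixel coordinate is a smooth function of the 3D point, an error measured in the 2D image can be propagated back to the 3D representation, which is roughly what lets 2D images supervise a 3D model.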

Autonomous Driving & Simulation

Or Litany: And in things like driving, sometimes edge cases are the thing you’re actually most interested in.

One thing we did recently is we looked at data of people driving. You can just represent trajectories and, looking at a couple of seconds of the past, train a generative model to try and predict what the next six seconds are gonna look like. The interesting thing about that is now you have some statistical model that can give you a few proposals for certain cars and what their trajectories are gonna look like in the future.

But another cool thing about those generative models is the ability to manipulate them. So now, if you are trying to stay close to the prior that you learned from how humans generally drive, you can start manipulating them, asking them to do certain things that you’ve never seen before. For example, a collision, a collision that was never captured in those huge trajectory datasets like Waymo’s and nuScenes, but you can definitely learn how humans usually drive.

And then if you take some trajectories and you try to push them towards a collision, which is a geometric function, right, you just try to make two trajectories collide or intersect, then you can start seeing some interesting emergent behavior.

Taking this idea and using those beautiful 3D content creation tools, you can actually place real cars and real scenery behind those. And here you’ve generated a collision in a computer graphics simulator that you’ve never captured in real life.
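As a rough illustration of a collision being "a geometric function", the sketch below scores a pair of trajectories by how close they get, plus a penalty for drifting away from the originally predicted trajectory. The penalty is a crude stand-in for staying close to a learned prior over human driving, not the method described in the interview. All names (`min_distance`, `deviation`, `collision_objective`, `prior_weight`) are hypothetical.

```python
import math
from typing import List, Tuple

Waypoint = Tuple[float, float]


def min_distance(traj_a: List[Waypoint], traj_b: List[Waypoint]) -> float:
    """Smallest distance between two cars over time-aligned waypoints."""
    return min(math.dist(a, b) for a, b in zip(traj_a, traj_b))


def deviation(traj: List[Waypoint], reference: List[Waypoint]) -> float:
    """Mean squared displacement of a trajectory from a reference one."""
    return sum(math.dist(p, r) ** 2 for p, r in zip(traj, reference)) / len(traj)


def collision_objective(
    traj_a: List[Waypoint],
    traj_b: List[Waypoint],
    reference_a: List[Waypoint],
    prior_weight: float = 0.1,
) -> float:
    """Lower is better: the cars get close while car A stays near its reference.

    ASSUMPTION: the squared deviation from `reference_a` only approximates
    "staying close to the prior"; a real system would score the edited
    trajectory with the learned generative model instead.
    """
    return min_distance(traj_a, traj_b) + prior_weight * deviation(traj_a, reference_a)


original_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
other_car = [(0.0, 3.0), (1.0, 2.0), (2.0, 1.0)]
nudged_a = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]  # steered towards the other car

before = collision_objective(original_a, other_car, original_a)
after = collision_objective(nudged_a, other_car, original_a)
```

Here `after` is lower than `before`, since the nudged trajectory meets the other car at the last waypoint while moving only slightly away from the original. An optimizer that minimizes such an objective would push trajectories towards plausible near-collisions.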

Tips for the Next Generation of Computer Vision Engineers

Or Litany: Like I said at the beginning of our interview, all of this content has become much more accessible. I feel like we’re now at a time where taking university courses is important, of course, but really, if you can self-educate, there’s no limit to what you can do.

And there are just crazy things that I see out there. You know, even when we work on research, we’ll suddenly discover some GitHub repo from someone we’ve never heard of. And this person is building these amazing tools and you ask yourself, why didn’t they publish it?

And it’s just because it’s accessible, and maybe, you know, getting stars on GitHub is more important than getting your h-index up. So really, get your hands dirty. And, you know, be curiosity driven. Watch Two Minute Papers. I really recommend these things, even though they’re, like, “shallow”, because what I’ve learned in the past year and a half working with interns is that when people are curiosity driven, there’s just no stopping what they can do.

And it’s funny, because whenever we hire an intern, we try to think of what projects they will work on. Maybe we’ll have them meet with a few people and then decide, or maybe they can prototype something and we encourage them to do that. Really, what wins in the end is someone who is excited. Of course, everyone is excited by different things.

But if you can find that thing that excites you, or in my case, if I can find that thing that excites the intern, it’s almost guaranteed that they’ll do magical work. And that’s amazing. And I feel like the reason I’m recommending this is because there used to be a time where you had to finish your degree before you could ask yourself, “Now that I understand what I’m doing, is this interesting to me?”

But because everything is so accessible now, you can understand, even if not 100% then like 80%, what this Two Minute Papers guy is talking about or what that paper you just read on arXiv is trying to do. You can very quickly get to a point where you can identify fields that excite you, whether it’s by the applications or the methodology or anything, really. Or just listen to talks online and see, oh, that person is exciting.

Some people are just driven by their colleagues. They’re just, oh, that person looks exciting. I wanna work with them. Where do they work? Oh, they happen to work for Nvidia. I wanna get hired. You know, so that’s my recommendation: really find your thing, the thing that gets you excited and that you can hopefully also do well, or believe you can do well. And it’s so accessible now that I think it’s much easier to find that than to first commit to, oh, I’m gonna do that.