Gil Elbaz hosts Or Litany, a senior research scientist at NVIDIA. They discuss the impact of 3D on computer vision and where it’s going in the near future. As well, they talk about the impact of industry on academia and vice versa. Or speaks about the future of 3D generative models, NeRF and how multi-modal models are changing computer vision. Together, Gil and Or explore the best ways to succeed in the field of AI.
TOPICS & TIMESTAMPS
[2:01] Starting his journey
[5:03] Heat transfer equation in graphics
[10:21] Multimodal changing Computer Vision
[17:47] Why is 3D Important?
[23:17] 3D Generative Models in the next years
[26:25] Neural Rendering
[29:39] Connection between images/video & 3D
[31:39] Temporal Data
[33:45] Autonomous Driving & Simulation
[36:27] Prof Leonidas Guibas
[41:56] NeRF & Editing 3D information
[46:02] Manipulation of 3D representations
[52:23] Future of NeRF
[1:06:03] Meta [FAIR] experience
[1:10:58] Sanja Fidler
[1:21:31] Career Tips for Computer Vision Engineers
LINKS & RESOURCES
EG3D: Efficient Geometry-aware 3D Generative Adversarial Networks
Our guest is Or Litany. Or Litany currently works as a senior research scientist at NVIDIA. He earned his BSc in physics and mathematics from Hebrew University and his master’s degree from the Technion. After that, he went on to do his PhD at Tel Aviv University, where he worked on analyzing 3D data with graph neural networks under Professor Alex Bronstein. For his postdoc, Or attended Stanford University, studying under the legendary Professor Leonidas Guibas, as well as working as part of FAIR, the research group of Meta, where he pushed the cutting edge of 3D data analysis.
Or is an extremely accomplished researcher with research that focuses on 3D deep learning for scene understanding, point cloud analysis and shape analysis. In 2023, Or will be joining the Technion as an assistant professor.
ABOUT THE HOST
I’m Gil Elbaz, Co-founder and CTO of Datagen. In this podcast, I speak with interesting computer vision thinkers and practitioners. I ask the big questions that touch on the issues and challenges that ML and CV engineers deal with every day. On the way, I hope you uncover a new subject or gain a different perspective, as well as enjoying engaging conversation. It’s about much more than the technical processes – it’s about people, journeys, and ideas. Turn up the volume, insights inside.
[00:00:00] Or Litany: We now only recently started having 3D scanners in our pockets. So, given that we had digital cameras in our pockets for at least a decade now, it’s gonna take a while. It’ll start emerging, we’ll see more and more people uploading their scanned models to the internet and then maybe tagging them, writing some stuff about them.
But that’s not even close to the orders of magnitude of data we’re talking about. And I think the most interesting thing that’s happening now in the 3D community, or one of the most interesting things, is the neural rendering of all types and kinds and sorts.
Gil Elbaz: Welcome to Unboxing AI, the podcast about the future of computer vision, where industry experts share their insights about bringing computer vision capabilities into production. I’m Gil Elbaz, Datagen co-founder and CTO. Ya’lla, let’s get started. Today we have here with us Or Litany. Or Litany currently works as a senior research scientist at NVIDIA.
He earned his BSc in physics and mathematics from Hebrew University and his master’s degree from the Technion. After that, he went on to do his PhD at Tel Aviv University, where he worked on analyzing 3D data with graph neural networks under Professor Alex Bronstein. For his postdoc, Or attended Stanford University, studying under the legendary Professor Leonidas Guibas, as well as working as part of FAIR, the research group of Meta, where he pushed the cutting edge of 3D data analysis.
Or is an extremely accomplished researcher with research that focuses on 3D deep learning for scene understanding, point cloud analysis and shape analysis. In 2023, Or will be joining the Technion as an assistant professor. Or, welcome to Unboxing AI. I’m really glad to have you on our show. Let’s kick this off.
Or Litany: Thanks for inviting me. And this is fun.
Gil Elbaz: Awesome. So maybe to kick this off and take us back in time a bit, can you tell me a little bit about, you know, when you started coding, when you kind of started this journey and how this led to where you are today?
Or Litany: Yeah, sure. So. I never thought I’d be doing computer science.
It’s funny that I’m actually, going all the way to like the final step, or the next step of my career, I’m joining as a faculty to the computer science department, but I never actually studied computer science. In high school, I was interested in physics. I thought that’s gonna be my thing.
And then I joined, which is heavily focused on physics with this premise that if you study something general enough, such as physics and math, then later on you can, it just gives you the tools to later on study, whatever you need on the job training. So I began this journey of like, my entire career has been this sort of like a step by step, sort of like on the job training and everything I picked up, including coding was kind of like only what’s necessary, you know, only this thing that.
Kind of, I was very much curiosity driven and every next step was, oh, that’ll be cool. You know, I wanna try and do that. And oops, looks like I need some understanding of, I dunno, at first it was electrical engineering. Like I was working for the air force and I had to use some digital signal processing, and I knew nothing about that, right, when I just joined.
So I went and took some courses and then, oh, this is cool. Let’s do a master’s degree in that, you know? And then this is how I picked up everything, including coding at some point. But I never felt comfortable to call myself even a coder. Even when I was coding up stuff, it was always like, yeah, but please don’t look at it.
Gil Elbaz: Well, that’s amazing. And I’d love to hear kind of how you see physics connecting to computer vision. It’s very interesting to have that kind of unique background.
Or Litany: Yeah. I mean, the one thing is just the general tools. I guess not being afraid of equations is a very good starting point.
Absolutely. That’s something I noticed, that, especially in the field of computer vision and graphics, maybe even more so in recent years where everything is kind of deep learning, people really try to avoid equations as much as possible. And I appreciate the attempt to gain intuition, but equations are there for a purpose, right?
So there’s some stuff that’s just more intuitive at some point to represent using equations than it is using words. And they’re there for us to use, you know, as a concise representation of what we wanna say with many words, much like images are worth a thousand words or, you know, whatever number you want to throw in there.
So I think this is one thing that definitely, starting from like a good physics and math background, just opened up, you know, my ability to go deeper into maybe less familiar territory and still enjoy those papers and be able to pick up those concepts. Early on in my master’s, when I just started working on computer vision or graphics or whatever you wanna call it, I was working on non-rigid shapes.
And that community surprisingly actually uses a lot of physics. Like you’ll find the heat equation, you know, signatures for non-rigid shapes. And then later on the wave kernel signature, which is actually derived from the Schrödinger equation. And that actually felt nice, to look at those things, that, oh, look, this is actually useful in other fields. So this is actually a super interesting topic.
Gil Elbaz: Maybe you can give us like a short explanation on how the heat equation actually connects with this 3D geometry.
Or Litany: Yeah, sure. So in a way, when one comes to work on non-rigid shapes, and that’s actually probably true more generally in computer vision and graphics, it’s all about symmetries in a way. You know, much like, I dunno, convolutional neural networks exploit some symmetry in the form of shift equivariance, right?
When you come to work on non-rigid shapes, you somehow want to be invariant to their deformation, because otherwise, you know, I dunno, if you’re describing a cat, and a cat can deform. I dunno if you ever owned a cat, but cats can deform in various ways, really infinite ways.
And you don’t want each pose of the cat to suddenly emerge as a new 3D entity. You know, you want to treat them all as one. Actually yesterday, I listened to your interview with Michael Black, who was here a couple weeks ago. A lot of his career’s work is about that, right? Like, let’s model human shapes in a concise way and not waste parameters on, you know, the same person raising their hand, cuz that’s not interesting.
And then when you think of invariants, what stays the same when we move around, you can think of many different things, like limb lengths, right? Just something that is mostly fixed. But also, if you think of a heat source placed on your nose, right, and then you measure how long the heat is gonna take to propagate from your nose to your elbow, you know, that’s largely going to stay invariant to your pose deformation, as long as you don’t touch your hands together and break the topology.
This is a very interesting insight. And based on that, you can say, oh, well, now if I have two shapes, and the signature I can give to any vertex is, like, the amount of heat that will end up being on that elbow after a certain period of time, it’s gonna be largely similar between two deformable shapes.
And that’s a really useful insight that’s been used for later performing shape correspondences. You can match the elbows between two persons and then, yeah, map the texture and whatnot.
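[Editor’s illustration] The heat-based signature described above can be sketched in a few lines. This is a minimal toy, not the method from any specific paper: it assumes a simple graph Laplacian with inverse squared edge-length weights (real mesh pipelines typically use a cotangent Laplacian), and the six-vertex "arm" polyline and the function names are hypothetical. The signature of vertex v at time t is computed as the sum over eigenpairs of exp(-lambda_i * t) * phi_i(v)^2, i.e. the heat that remains at v after time t when a unit of heat starts there.

```python
import numpy as np

def graph_laplacian(points, edges):
    """Weighted graph Laplacian; weights depend only on edge lengths (intrinsic)."""
    n = len(points)
    lap = np.zeros((n, n))
    for i, j in edges:
        w = 1.0 / np.sum((points[i] - points[j]) ** 2)
        lap[i, i] += w
        lap[j, j] += w
        lap[i, j] -= w
        lap[j, i] -= w
    return lap

def heat_kernel_signature(lap, times):
    """hks[v, k] = sum_i exp(-lambda_i * t_k) * phi_i(v)**2."""
    evals, evecs = np.linalg.eigh(lap)
    return (evecs ** 2) @ np.exp(-np.outer(evals, times))

# A toy "arm": six vertices connected in a chain, in two poses.
edges = [(i, i + 1) for i in range(5)]
straight = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0], [5, 0]], dtype=float)
bent = np.array([[0, 0], [1, 0], [2, 0], [2, 1], [2, 2], [2, 3]], dtype=float)

times = np.array([0.1, 1.0, 10.0])
hks_straight = heat_kernel_signature(graph_laplacian(straight, edges), times)
hks_bent = heat_kernel_signature(graph_laplacian(bent, edges), times)

# The vertex coordinates differ between poses, but the signatures agree,
# because bending the chain does not change any edge length.
print(np.allclose(hks_straight, hks_bent))
```

Matching vertices across the two poses by comparing rows of the signature matrix is the basic idea behind the correspondence step mentioned above; in this toy the chain is symmetric, so the two end vertices share a signature.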
Gil Elbaz: Yeah. And this is like a critical step in a lot of these non rigid registration and non rigid analysis of shapes. And I also started off as a mechanical engineer originally. So we learned heat equations for calculating heat in different metals and things like that. And, and really when I saw it connecting back in the academia, I saw it connecting to 3D shape analysis.
I was really blown away by the creativity and kind of connecting these multidisciplinary capabilities from very engineering, heavy engineering background and a computer science-y, 3D shape analysis on some kind of data. So more like a more signal processing background connecting these two together was very, very interesting for me.
Or Litany: Yeah. And this in, I think this interdisciplinary thing is really the key, you know, because a lot of times now that I’m more “senior” or whatever, I’m, you know, I get to do more chairing in conferences and things like that. And I get to oversee some reviewers and you can see the biases that computer vision and graphics give, you know, people study that and they are used to the way papers are written.
And now when they see, like, some attempt from someone, you know, like an alien author coming from some weird field, like physics or mechanical engineering, their initial temptation is to just reject the paper. You know, this is not something that we’re used to seeing. Why did you put the related work at the end and not, you know, section number two, or any of those things.
Right. And especially when it comes to like, like different approaches, I saw this graph, there was a guy, a really interesting guy who came to Stanford as a visiting scholar for a while. And he was working on creativity in AI. And he showed me this graph, showing how people appreciate creativity. And it’s a very interesting graph.
Actually, people appreciate creativity to a certain extent. And then the graph drops rapidly. So if something is too creative, we, as people, are just inclined to, you know, reject it. So you have to be careful, you know. And on the other hand, you don’t wanna be incremental, right?
That’s the number one curse word in reviews of papers. So there’s like a very novelty. Exactly.
Gil Elbaz: Definitely. So yeah. Yeah. Novelty is the number one thing, but not too novel.
Or Litany: Exactly. There’s a sweet spot that you need to learn when you work in this field. And, but I feel like our responsibility is to open up and accept those interdisciplinary, especially when we work with something as general as AI, that basically is applicable to so many things.
And suddenly you don’t wanna reject the paper in vision because it has some language in it. Right. But I remember back in, I don’t know, what was it, 2012, when I saw Fei-Fei giving a talk, and then suddenly she was working on videos and scripts from movies, which is an amazing multimodal type of data.
And there were some discussions in the audience and people were like, why is this vision related? You know, this is not pure vision. Why are we discussing it? Why is this an oral in CVPR or whatever, ECCV? And I think that’s exactly the type of open mind that academics need to maintain and preserve. And maybe, you know, closing back to like the physics background.
Maybe this is something that’s useful, go and study some other stuff for a while.
[00:10:21] Multimodal Changing Computer Vision
Gil Elbaz: And, yeah, definitely. I think that it could be a very big part of like this interdisciplinary push that we’re seeing now. And, and really like you’re mentioning these fields are converging. I’d love to maybe touch on that a bit.
You see, like, these amazing capabilities, whether it’s GPT-3, DALL-E, or different diffusion-based models like GLIDE. There are really very interesting foundational models that are being developed now that are very much interdisciplinary. I’d love to kind of understand how you see the development of this field going forward with respect to these large multimodal models, and also research, meaning how does research
kind of envelop all of these different multimodal disciplines? You need a lot of background in order to deal with this. In many ways, how do you think we go about really progressing with this?
Or Litany: Okay. So I have to wear two hats here, right? One is just being a person in this world. This is very exciting that now the same machinery can be used to do everything down the road.
I don’t know. Everything’s a transformer. Okay, fine. I buy that. I’d love to have that ability to just not overthink what type of data I’m working on. I just throw a huge network at it and enjoy the outcome. And if it’s multimodal, you know, the better, for sure. That’s what we want. We want to be able to have automatic tools that take all of the data in the world, process it and output stuff
that’s maybe too hard for us as people, or some interesting insights. And it’s not that we are there yet. But to put on my other hat as an academic, there is also a problem here, because then you ask yourself, okay, where does my expertise come into play at some point, if everything’s been invented? I mean, in 3D, we see that a lot. A lot of works, even before touching tools, even problem setups, are often inspired by other fields.
So a lot of times you’ll see in 3D that we define problems in the same way they’re defined in 2D. Right? And it’s not necessarily the case that we’re actually interested in the same type of problems, but it’s very tempting to say, oh yeah, this problem has been, has been proposed in 2D. Let’s do the same in 3D.
Just one quick example. Maybe it’s more of a problem than a problem setup. But object detection in 2D is defined by, you know, find me a box. And usually the box is represented by some center point and sizes of that box. And in 3D, the center point of a 3D box is actually, a lot of times, in thin air.
There’s no content there. You know, in 2D, when I detect a person or a cat or whatever object, the central point of that box will often contain the content of the box. You know, that person or that object. And in 3D, a lot of times, if you detect that table, you can imagine that the central point of that table is actually in thin air.
And that’s a problem. The sensor never sensed anything there. So why are we defining this as a canonical point of reference? This is really weird. I think when one works in a certain field, they have to think, okay, what’s unique in that problem to my field? Do I really have to just mimic what’s been done?
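[Editor’s illustration] The table example can be checked numerically. The sketch below is a toy under stated assumptions: the table dimensions, the point counts and the helper `sample_box` are made up for illustration, and points are sampled inside thin slabs rather than on true scanned surfaces. It builds a table-like point cloud, computes the center of its axis-aligned 3D bounding box, and measures how far that center is from the nearest sensed point.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_box(lo, hi, n):
    """Sample n points uniformly inside the axis-aligned box [lo, hi]."""
    return rng.uniform(lo, hi, size=(n, 3))

# A toy table (meters): a thin tabletop slab plus four thin legs.
tabletop = sample_box([0.0, 0.0, 0.70], [1.0, 0.6, 0.75], 2000)
legs = [
    sample_box([x, y, 0.0], [x + 0.05, y + 0.05, 0.70], 300)
    for x, y in [(0.0, 0.0), (0.95, 0.0), (0.0, 0.55), (0.95, 0.55)]
]
points = np.vstack([tabletop] + legs)

# Center of the axis-aligned 3D bounding box, as in a center + size box encoding.
box_min, box_max = points.min(axis=0), points.max(axis=0)
center = (box_min + box_max) / 2.0

# Distance from the box center to the closest point the "sensor" observed.
nearest = np.linalg.norm(points - center, axis=1).min()
print("box center:", center)
print("distance to nearest point:", nearest)
```

In this toy, the box center sits under the tabletop and between the legs, tens of centimeters from any observed point, which is the "thin air" situation described above.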
If masked autoencoders work well for text input, because predicting the next word really turned out to be a really good pretext task, does this mean that also masking regions in images is a useful task? And what’s a token? A word is a token, that kind of makes sense. Is a patch of an image a token too?
It’s not as semantically identifiable as a word token is, right? And then if you want to extend it to 3D, then what is this now? Right. Are we talking just voxels? Is it eight by eight by eight? Like, who’s gonna give me the magic number to work with? Is eight by eight by eight all you need?
Gil Elbaz: Good question.
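[Editor’s illustration] To make the token question concrete, here is a minimal sketch of how an image or a voxel grid might be cut into tokens and randomly masked, in the spirit of masked pretraining. Everything here is an assumption chosen for illustration: the 16 by 16 image patch, the 8 by 8 by 8 voxel block, the 75% mask ratio and the function names are not taken from the conversation or from any particular model.

```python
import numpy as np

def patchify_2d(image, p):
    """Split an (H, W) image into non-overlapping p x p patches -> (num_tokens, p*p)."""
    h, w = image.shape
    return image.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3).reshape(-1, p * p)

def patchify_3d(volume, p):
    """Split a (D, H, W) voxel grid into p x p x p blocks -> (num_tokens, p**3)."""
    d, h, w = volume.shape
    blocks = volume.reshape(d // p, p, h // p, p, w // p, p)
    return blocks.transpose(0, 2, 4, 1, 3, 5).reshape(-1, p ** 3)

def random_mask(num_tokens, ratio, rng):
    """Boolean mask with a fixed fraction of tokens marked as hidden."""
    n_masked = int(round(num_tokens * ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=n_masked, replace=False)] = True
    return mask

rng = np.random.default_rng(0)
image = rng.random((32, 32))          # stand-in for a grayscale image
volume = rng.random((32, 32, 32))     # stand-in for an occupancy grid

image_tokens = patchify_2d(image, 16)   # 4 tokens of 256 values each
voxel_tokens = patchify_3d(volume, 8)   # 64 tokens of 512 values each
mask = random_mask(len(voxel_tokens), 0.75, rng)
visible_tokens = voxel_tokens[~mask]    # what an encoder would get to see
print(image_tokens.shape, voxel_tokens.shape, visible_tokens.shape)
```

The block size is a free parameter with no semantic meaning of its own, which is the point being raised: unlike a word, nothing guarantees that an 8 by 8 by 8 block lines up with a meaningful unit of the scene.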
Right, exactly. I agree. And I think that it’s, it’s a very interesting intuition saying that we tend to borrow a lot from different domains. So from different previous works and different domains, and it has progressed us in a very nice way, but there are limitations there and there is a need to take a step back sometimes and think about, okay, how do I actually want to represent this information?
How do I actually want to go about analyzing this stuff? I love this intuition because it really promotes both pushing forward what we already know, but also taking this new step back. I’d love to hear if you see kind of this taking a step back as a trend that is happening a lot now in the world of 3D computer vision.
Or Litany: Is borrowing concepts and ideas from other modalities a trend?
Gil Elbaz: Both that and the opposite.
Or Litany: Yeah. Oh, and borrowing from 3D to other domains. Yeah. Yeah. So exactly. So one of the good things about having this process of unification of machineries, right, is that all the fields have become much more approachable, more accessible to researchers wherever, right. My good friend, Srinath Sridhar, from Brown University, invited me to be a judge in a course that he gave to his students.
And within no time, students who had zero background in 3D, or very minimal background in the field, they just took his course, basically. And at the end of the course, they were giving state-of-the-art results using neural fields. And that was, like, amazing. Exactly. And all the tools were available to them.
So they could easily use, you know, 3D graphics tools to generate data, and download online code and then modify it a little bit, because once everything’s unified, then everything is familiar as well. You know, you don’t have to study a new lingo when you switch to a new field, and everything’s more accessible. And yeah, maybe people who had some 15 or 20 years of experience in a field kind of feel, where is my leverage? You know, they still have it.
I feel, in a lot of ways, like what I mentioned before, ask the right questions. And in a way it’s even better that you are forced to ask yourself, okay, now what’s unique to 3D about this, or what’s unique to text about this? So making it accessible is like a forcing function for researchers in each field to ask what’s unique to my field and what I can offer. And going back to, like I mentioned, paper reviews previously, I think it’s now becoming the case where, you know, there was a period where just deep learning everything
was accepted to conferences, right? Someone solved problem X, and now there’s a deep X, and that paper would get accepted. And if you’re first to the game, then you’ll have more citations and grow your h-index very quickly. But I think the field became much more interesting once that started saturating, because then you had to think a bit more carefully about the problem.
Okay. So we tried a CNN, but now that’s just a, okay, what else can we say about the problem that’s gonna make it more interesting? Maybe I’m solving optical flow, and maybe just running a CNN and summarizing everything as one global vector is not the way to go. That’s good. So it’s a very natural process, right?
Where you see this unification, everything seems solved, and then you read it again and then you look into the details and then you suddenly realize it’s very far from being solved. Definitely.
Gil Elbaz: I completely agree. And I would wanna maybe dive in and ask a little bit more like deeper into the world of 3D.
I know that you have become really an expert in 3D computer vision, or computer vision on 3D information. I’ve seen very interesting talks that you’ve given in the past. I love the concept, by the way, of the deep Hough transform. And so I’d love to understand, first of all, if you can give us a little bit of motivation of why 3D is so important to work on, and also why is it so hard to deal with?
Or Litany: I see, that’s a very broad question. So why is 3D important? I think different fields will give different answers. One could say it’s just interesting, you know, maybe cuz it’s harder. But really the world is mostly 3D, or add time and you get another one. And if you zoom out and you think, why are we doing all this?
Why are we even doing computer vision? A lot of it, not all of it, of course, but a lot of it is about having machines help us understand the world, or understand the world themselves and act in it. And if the world is 3D, then, even if implicitly, even if not explicitly, they need to be able to understand this 3D-ness of the world.
Right? You don’t want machines to see just the front surface of a person and assume that person is cardboard, right? You want them to assume that the person has some depth, right? And that’s a bias that you have to bake in. And if you wanna call it, you know, 3D or two and a half D or 2.1D, it doesn’t really matter.
So that’s one, right? We need understanding that somewhere inside the machinery is 3D. Then comes natural data that comes in 3D. We wanna see how far things are. We have LIDARs for that. We have depth images for that, and that’s a different type of data to work on. LIDAR is very different than pixels.
That’s just the way the raw information comes out of the sensor. And much like we have machinery to process pixels, we need to develop machineries to process point clouds. And it’s funny, because some criticism about that is that some academic benchmarks actually generate point clouds from surfaces, from reconstructed surfaces, which doesn’t really mimic the actual generator. But it doesn’t matter, you know. At the end of the day, it’s a stimulus
whenever a new benchmark comes out, and a better dataset. Large datasets now for autonomous driving, and autonomous driving in general, really pushed forward this field and the need to process LIDAR information and radar information, and then fusing LIDAR and images. Seeing a pedestrian and knowing how far that pedestrian is, is key to being able to brake in time.
So that’s, you could call it, 3D. Imagining what a scene would look like from another perspective, that’s 3D. Or, you know, from which angle to grasp a water bottle to pick it up. Okay. So we need it everywhere, from robots cleaning our house and assisting us in various activities, to surgery.
And that’s all just the part of acting in the world. Now we’re also entering an era where the world itself is kind of starting to become a mixture of real and virtual. And again, because the physical world is still 3D, you need the virtual content to also play nice with this 3D-ness. If you plug AR entities, right, and models into the world, you need to know where to place them.
You wanna place them where it makes sense. You know, you want the Pokemon to be on the ground and not floating, you know, in the
Gil Elbaz: Pokemon is our motivation. Definitely. Exactly. I agree. And I think that for me, computer vision in an abstract way is really connecting between the real world and the digital world.
Really being able to infer from visual data, could be images, point clouds, whatever, high-level or semantic information that can then be used for downstream applications. It’s kind of a pipeline between the real world and the digital world that we can then use to make this visual information that’s unstructured into something structured, useful
and rich. I see that ultimately, right, ideally the world is 3D or 4D, right? And we have 3D entities moving around, manipulating things, doing different high-level and low-level things in the scene. And I very much connect with what you described, including AR, VR, robotics and autonomous vehicles, and really looking at the world a few years forward.
I think that we’re gonna be able to create much more and more of a natural connection between this real world and digital world. And with the advent of like AR projecting this new digital space on top of the real world with this mixed reality metaverse type applications, this is gonna be extremely fun and extremely interesting.
Or Litany: I fully agree. And I think you, you’re touching another interesting point here, which is beyond the side of understanding the world and maybe copying it and digitizing it and, and acting on it. There’s also the content creation side, which is a huge part of it, right. We’ll start seeing more and more. I mean, we have artists generating data, right?
I mean, you have a company that’s doing that. That will become a very significant part or crucial ingredient in the future. And then AI tools that can help artists generate content. Especially in 3D, which is harder. Very hard. Yes. This is huge, right? I mean, we want everyone to be able to draw. Even building with Lego blocks in Minecraft is hard, right?
Yes. It should be made simpler and it should be made fun, and you should remove the struggle and help artists create, or focus on creativity. We can draw inspiration from audio here. Next to me, there’s an audio recording software, right? You want to remove the struggle and be able to create effects and create beautiful,
like, editing tools that are intuitive to the artist, so that you can just listen to it, or in the case of 3D, just look at it and appreciate it and correct it and fix it. But don’t worry about the little details, like fixing broken manifolds and water tightness and things like that because it wouldn’t print on a real printer, right?
That’s not something you want an artist to, to deal with.
Gil Elbaz: Definitely. I love this connection to the generative kind of aspect, to the place where we put the tools of the artists in the center. I’d love to understand, you know, how do you see these 3D generative approaches evolving in the upcoming years?
Or Litany: Yeah, that’s a great question. I’m thinking about it a lot these days, cuz you know, we are seeing those Imagen and DALL-E, and it’s just remarkable. I remember myself reading the first GAN paper and thinking, oh, this is a super cool idea. But look at those rooms, they look terrible. You know, there were some bedrooms there and I was like, ah, this is clearly not, you know, this is ages from working.
And I was very wrong. I mean, now I’m looking at images, and in certain classes like faces, I know people can say that they can tell the difference between AI-generated and real, and they can solve deepfakes and stuff. But to my eyes, they all look amazing. You know, they just look real. So really the progress, and whether it’s GANs or diffusion models, really the machinery has evolved very quickly. But it’s also, in a lot of ways, very much driven by huge amounts of data.
So this race for more compute and more data just keeps proving itself time after time after time and I’m ignoring all the, you know, the frustrating aspects of it and saying, and focusing on the good, the good is that for a lot of time now we’ve been as a society. I mean, collecting lots of images and lots of text that comes with those images.
And also the machinery to process this data as raw as possible, and now generate them, like, make them out of thin air, just create beautiful images that are controllable by text queries. That’s remarkable. 3D is really, in that respect, unfortunately behind. And one reason, you could say, is because we now only recently started having 3D scanners in our pockets.
So given that we had digital cameras in our pockets for at least a decade now, it’s gonna take a while. It’ll start emerging, we’ll see more and more people uploading their scanned models to the internet and then maybe tagging them, writing some stuff about them. And I’m not talking about artists uploading models to the cloud.
That’s useful, very useful, but that’s not even close to the orders of magnitude of data we’re talking about with images and text. That will come, I’m certain, you know, 10 years from now, but we don’t wanna wait. And there’s a hack here, because one cool thing about 3D is that you can project it and make it into 2D, and that’s a really useful key. And I think the most interesting thing that’s happening now in the 3D community, or one of the most interesting things, is the neural rendering of all types and kinds and sorts. Right? So I’m not specifically talking about one or another, but neural rendering is really, really opening up this aspect of being able to supervise your model with realistic 2D images, but still be working in 3D.
So under the hood, there’s a 3D model.
Gil Elbaz: Yes. So, maybe before we dive deeper, can you talk a little bit about how 3D information was represented in the past and what neural rendering is and, and what it means for the field?
Or Litany: So the representation didn’t necessarily change as much as the ability to connect this 3D representation and the 2D representation in a differentiable manner.
So if you wanted to take a 3D model and render it to an image, you could do that. But if you wanted to now guide that image to change a little bit by manipulating its gradient or moving around pixels, you didn’t have a natural way to propagate that information back to the 3D representation. And I really think this is the key.
When I say neural rendering, this is the key, because it meant that if you wanted to supervise your 3D content directly, you’d have to do it using 3D content. And that we don’t have enough of. What we do have a lot of is images and videos and text even. And on those images and videos and texts also lies another interesting layer, a semantic layer of pre-trained networks
that also know a lot about this content. And using that to guide the 3D supervision, instead of directly supervising it in 3D, that I think is opening up this great opportunity to basically just bypass. We don’t need 3D to supervise 3D generation. And I think this is key, and that’s what’s opening up, or I think this is what will open up, this opportunity to generate 3D content in the next decade, until there comes a point where there’s enough 3D data to maybe do it without it, or maybe it’ll always be a mixed solution of both.
But I think the good news is that I don’t think we have to wait and we’ll start seeing more and more actually in, in our group, we, we already see progress using that and image supervision is already being used to generate 3D content. So I’m excited about that.
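[Editor’s illustration] The core idea, supervising a 3D representation using only 2D images by pushing image-space errors back through a differentiable projection, can be shown with a deliberately tiny toy. This is not neural rendering as used in practice (no NeRF, no volume rendering, no learned network): it assumes an orthographic camera, a point set as the 3D representation, known camera rotations, known point-to-pixel correspondences, and a hand-derived gradient. The names `render`, `rot_y` and `loss_and_grad` are hypothetical.

```python
import numpy as np

# Orthographic projection: keep x and y in camera coordinates, drop depth.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

def rot_y(deg):
    """Rotation matrix about the vertical (y) axis."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def render(X, R):
    """Project 3D points X (N, 3) into the 2D view of a camera with rotation R."""
    return X @ R.T @ P.T

def loss_and_grad(X, views, targets):
    """Sum of squared 2D errors over all views, and its gradient w.r.t. X."""
    loss, grad = 0.0, np.zeros_like(X)
    for R, target in zip(views, targets):
        residual = render(X, R) - target      # error measured in image space
        loss += np.sum(residual ** 2)
        grad += 2.0 * residual @ P @ R        # error propagated back to 3D
    return loss, grad

rng = np.random.default_rng(0)
X_true = rng.normal(size=(5, 3))              # the unknown 3D content
views = [rot_y(0.0), rot_y(90.0)]             # two known camera poses
targets = [render(X_true, R) for R in views]  # only 2D observations are kept

X_est = np.zeros_like(X_true)                 # start with no 3D knowledge
for _ in range(300):
    loss, grad = loss_and_grad(X_est, views, targets)
    X_est -= 0.2 * grad                       # plain gradient descent

final_loss, _ = loss_and_grad(X_est, views, targets)
print("final 2D loss:", final_loss)
print("max 3D error:", np.abs(X_est - X_true).max())
```

No 3D ground truth enters the loss; the 3D points are recovered only because the 2D error can be differentiated with respect to them. Real systems replace the hand-written gradient with automatic differentiation and the point set with a richer representation.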
Gil Elbaz: Yeah, it’s super exciting. And I think that StyleGAN really opened my eyes towards how it’s possible to learn completely in the image space with uncorrelated data and extract 3D information.
And now with StyleGAN3, and there’s a lot of different works that came out recently in CVPR, we see a lot of different ways that are completely different from each other, but achieve a very similar result of extracting 3D information from this network that was trained completely in the image space.
That was quite insightful for me. And so your intuition is this type of approach is going to lead us in the next few years, before we have the actual raw data needed to fully train these robust 3D networks that are more similar to what we see in the pure image space, like DALL-E and those other ones.
So looking forward, you see a world where we’re gonna see very large models that can generate 3D assets, 3D objects at scale, in an easy way, that can also be edited with words, maybe, potentially.
Or Litany: 100%. Yeah. I think we’re not even far from it. And I’m seeing this neural rendering as a huge enabler to do that, with the emergence of more 3D sensors and data, because that’s part of the thing.
And of course, videos I feel are a very interesting intermediate ground, where it’s 2D, but it’s a sequence of 2D. And if you are smart about collecting it or using it, then there is actually some 3D there.
Gil Elbaz: So this is interesting. How do you see the connection between 3D analysis of data in 3D and videos?
Do you see that they are two different domains, two different things that we should focus on, or do they come together in some ways?
Or Litany: So it depends on what we call video. If we think of a static scene and a moving camera, then in a lot of ways, this is just multi-view imagery. As long as you can position yourself and get the pose, then it’s just a better source of 2D inputs.
Even if you want to train models using that type of supervision, instead of having just a single view, you have multiple views, and that’s much more useful. It removes ambiguity in the reconstruction. But videos, they actually introduce another challenge, where the content moves. It’s not just that the observer moves, it’s also the content that’s changing.
And so one way would be, yeah, if you’re just interested in the static background, you can find ways to remove that dynamic content, but I actually think dynamic content is, a lot of the time, the thing we are actually more interested in. And so now you have this really interesting and challenging problem: how do we take in-the-wild, raw dynamic content, maybe people moving around, and how do we take videos of those people and reconstruct accurate models from just the observable views of them, so that we can maybe replace them in the scene, reenact them?
I think this is really, really the exciting part. So it’s not that video is another form of 3D. It adds also this temporal ingredient to it, which I think is really interesting.
Gil Elbaz: Definitely. And, and I see also things that connect, you know, when, when looking at temporal data, there are also these long range connections that are a bit similar to text in different ways.
They’re semantic. Sometimes, you know, there are connections that are very far apart in the temporal space that are important, and they’re very natural for people to understand but very challenging for machine learning algorithms to understand. Can you talk a little bit about the challenges with temporal data and, like, how you see us attempting to progress with them?
Or Litany: So I’m not working primarily on temporal data a lot. What I do think is that there are a couple of interesting things when one comes to work on temporal data. First of all, it’s funny because, going back a little bit, one form of 3D is point clouds, right? And point clouds are assumed to be a set. Set meaning it has no defined order. And that forces you, if you want to process a certain data and you can’t assume canonical ordering, to be smart about the way you process it. Temporal data in a way solves this problem for you, because it has canonical order. That’s the order in which you recorded it.
And if you want to learn physics from videos, then it’s a good cue, right? You want to see that if you drop the bottle, it falls to the floor, and you don’t wanna look at the sequence as a set, you don’t wanna remove the order. That’s actually a really useful cue. There are works studying interesting aspects, again for autonomous vehicles, that are studying intent, for example, and that’s something really useful.
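The contrast between an unordered point cloud and an ordered sequence can be made concrete with a small sketch. The per-point features below are hand-written stand-ins (a real network would learn them), and `pool_point_features` and `last_minus_first` are hypothetical names. The first function uses a symmetric max over points, in the spirit of set-based point cloud networks, so shuffling the points changes nothing; the second depends on recording order, which is exactly the cue a falling bottle provides.

```python
def pool_point_features(points):
    """Order-independent summary: per-point features, then a symmetric max."""
    # Hypothetical hand-written features; a learned MLP would normally go here.
    feats = [(x + y + z, x * x + y * y + z * z) for x, y, z in points]
    return tuple(max(f[i] for f in feats) for i in range(2))


def last_minus_first(frames):
    """Order-dependent summary of a sequence: net change from start to end."""
    return frames[-1] - frames[0]
```

Reversing a point cloud leaves `pool_point_features` unchanged, while reversing a list of bottle heights flips the sign of `last_minus_first`, turning a fall into a rise.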
Or Litany: Being able to look at a lot of data where you get to see some initial state and get a glimpse at the future, what that looked like a couple of seconds in the future, you can learn a lot, right? And you can anticipate before people jump into the road, and things like that. We did recently have, based on these insights, a work trying to model, I think, one usage of the temporal data for, you know, future prediction, and leveraging that in learning. One interesting aspect of looking at some initial state and then getting to see what the future of that initial state looked like is something that we recently looked at for another huge problem in computer vision, which is the long tail. And the long tail is something, it’s a real pain.
It’s a huge pain point, right? Because we’ve assumed that with enough data, we can solve everything, and we keep proving to ourselves that’s the case, but we’re solving things on average, in a way. Okay. So we are very much biased, and even in those, you know, images we generate, we generate the majority of images we’ve seen, and we tend to mode collapse and ignore a lot of important edge cases.
[00:33:45] Autonomous Driving & Simulation
Or Litany: And in things like driving, sometimes edge cases are the thing you actually are most interested in. You know, when I just came into the room, we talked about Sunnyvale being, like, a too sterile environment for driving. And I see those autonomous vehicles driving next to me.
And I’m really hoping every time I see them that they’re also training elsewhere because down here in Tel Aviv, for example, roads are much more crowded and people ride scooters and, and pedestrians are less careful in a way. So I think edge cases are really interesting. You don’t have much of those. And that’s a problem.
One thing we did recently is we looked at data of people driving, and you can just represent it as trajectories, and looking at a couple of seconds of the past, you can train a generative model to try and predict what the future is gonna look like, you know, six seconds into the future. The interesting thing about that is now you have some statistical model that can somehow give you a few proposals of certain cars and what their trajectory is gonna look like in the future.
But another cool thing about those generative models is the ability to manipulate them. So now, if you are trying to stay close to your prior that you learned from how humans generally drive, you can start manipulating them, asking them to do certain things that you’ve never seen. For example, a collision. A collision is something that was never captured in those huge trajectory data sets like Waymo’s and nuScenes, but you can definitely learn what humans usually drive like.
And then if you take some trajectories and you try to push them towards the collision, which is a geometric function, right, you just try to make two trajectories collide or intersect, then you can start seeing some interesting emergent behavior. Taking this idea and using those, you know, beautiful 3D content creation tools,
you can actually place real cars and real scenery behind those. And here you generated a collision in a computer graphics simulator that you’ve never captured in real life. So I think this is something really useful that temporal data can give you.
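The trade-off described here, staying close to a driving prior while pushing two trajectories to intersect, can be sketched as a small optimization. This is a toy under strong assumptions and not the method from the work Or mentions: the "prior" is just a fixed reference trajectory standing in for a learned generative model, and `push_toward_collision`, `gap`, `collide_weight`, and `prior_weight` are hypothetical names.

```python
def gap(a, b):
    """Euclidean distance between two 2D positions."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def push_toward_collision(ego, other_prior, collide_weight=5.0,
                          prior_weight=0.5, lr=0.1, steps=200):
    """Gradient descent on the other car's waypoints.

    Per-step objective:
        w_t * |b_t - ego_t|^2 + prior_weight * |b_t - prior_t|^2
    where w_t grows toward the end of the horizon, so the start of the
    trajectory keeps following the prior and only the end is pulled in.
    """
    n = len(other_prior)
    traj = [list(p) for p in other_prior]
    for _ in range(steps):
        for t in range(n):
            w = collide_weight * (t / (n - 1)) ** 2
            for d in range(2):
                grad = (2 * w * (traj[t][d] - ego[t][d])
                        + 2 * prior_weight * (traj[t][d] - other_prior[t][d]))
                traj[t][d] -= lr * grad
    return [tuple(p) for p in traj]
```

With two cars driving side by side in lanes 3.5 m apart over a seven-step horizon, the optimized trajectory starts in its own lane and drifts into the ego car’s path by the final step. Raising `prior_weight` keeps the result closer to ordinary driving; raising `collide_weight` makes the intersection tighter.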
Gil Elbaz: And then using that simulation, you’d be able to train or add training data to a certain model and teach it, let’s say, not to get into those situations, for example.
Or Litany: Exactly. That’s exactly right.
Gil Elbaz: Interesting. Coming back to the 3D side of things a bit, you’ve been working with Professor Guibas, who’s a very legendary professor, right? He developed a lot of these very substantial algorithms like Delaunay triangulation, red-black tree, PointNet, and a lot of things around 3D-aware visual generation.
And so I’d love to understand from your side, what are the big things that maybe surprised you when you started working with him and what are some of the big takeaways that you might wanna share?
Or Litany: So Leo is, is really really a model. I admired him as a student. The way I came about being his postdoc was almost random.
Definitely, I was planning to reach out at some point. He attended one of my talks at Eurographics. And then the day after, I see an email in my inbox from Leo Guibas, and I’m sure it’s a mistake or maybe some invitation to review a paper. And instead it was an invitation to come visit his lab for some time and work with his students.
And this is how kind of the relationship started, which ended up being a full blown postdoc. So what I love about Leo, and I think this is something that, you know, as I was a postdoc with him, you know, one of the things I, I try to do carefully is take notes. Like, what are the things I like? And what are the things I don’t like that I can take with me to my future academic position.
And definitely the one thing I love about Leo is that he has this courage. He’s not, you know, he’s done stuff in the past. So Leo has works that are so seminal that he could have just paused his career at that point. At many points along his career, he could just stop there and just enjoy the fruits of his labor.
You know, for years he’ll get, you know, tons of people following up on these works, and citations and whatnot. But the beautiful thing about Leo is, like, he’s just so curiosity-driven in everything. It’s very easy to get him engaged in a conversation about nearly anything, really. You jump into his office with a cool new idea, and he will always listen.
He will always engage in a conversation and 99% of the time he’ll have at least one smart thing to say about this, even if it’s like way beyond his field of expertise, if that’s even a thing for him, you know, because he’s an expert in so many things. I really think that’s key. Going back to early days of deep learning, I was still a student back then.
And one of the frustrating things was, walking around, you know, university hallways, a lot of professors felt: yeah, there’s this new thing coming up now, but we don’t really feel it’s our thing. So we’re maybe not gonna start teaching a course about this. And as a student, you take your course on YouTube, cuz that was the only thing available.
And then it’s hard to find a mentor. You wanna work in deep learning, you wanna find, you know, someone that has this open-mindedness in, like, jumping into this new emerging field, because you want to keep up to date. That’s what academia should do, right? It should lead the field and not lag behind industry.
And there was, like, a short period of time where I think it flipped around and was the opposite. And I think some people kept it going, and Leo is one of them. Leo was never afraid to jump head first into, like, this new unexplored territory. Of course, you know, given his background and expertise, this resulted in works like PointNet, for example, which was focused on point clouds.
But every couple of years, there’s some new insight that Leo has. And I can tell you that this is really, really an inspiration, to not dictate what the group works on. We don’t work primarily... And it’s funny, cuz I remember Leo saying at some point, there was a CVPR deadline coming up and the entire group was submitting to CVPR, and he realized at that point that no one was submitting, or only few were submitting, to SIGGRAPH that year.
And that’s really a change, a dramatic change in Leo’s lab because it used to be mostly graphics focused. And I couldn’t say it bothered him. It was, you know, fine. If that’s the new, exciting thing that everyone’s excited about. I’m I’m game, you know, I’m gonna, you know, I’m gonna bring my curious mind into that and, and try to do brilliant things.
Gil Elbaz: Yeah. That’s incredible. And I really appreciate him all the time working and pushing the limits and going to this new domain. I think that at least with PointNet, right. I looked at it originally and I was very impressed by kind of this new representation in a way for point clouds.
I was working on point clouds back in the day, before PointNet. Also, there were almost no deep representations of point clouds. And so we had these very initial, not great representations that we started working with. And we compared against classic methods for analyzing point clouds.
And when I saw PointNet, I understood, okay, someone came and had very, very, very interesting intuitions about how to represent this information. And a few years forward, looking towards neural radiance fields and how we’re moving away from a mesh and texture towards a neural representation of 3D information, and Plenoxels that also came out recently, which are also a new representation.
This is an extremely interesting concept, that we think a lot of times that representations are usually fixed and there are a lot of different problems that we can solve on them. So, you know, we have a PNG to represent an image and we have an MPEG for video, but looking towards 3D, because it’s such, like, a new field in a way,
and there are fewer methods that are standardized for capturing this information, I think there’s a lot of freedom on how to represent this information, and really the PointNet paper opened my eyes towards this insight.
Or Litany: Yeah. Yeah. I share the same sentiment. PointNet was a huge thing for me as well.
And then during my postdoc, I got to work with Charles, the author of PointNet. And I got to also learn a lot from him. And you could say there’s credit to be given. It’s not only Leo Guibas, of course. You know, Charles was his student, himself a brilliant researcher as well.
Or Litany: Amazing. You mentioned neural fields, by the way, neural radiance fields. And I think here enters really, as you say, a new emergent representation that actually is different, different from point clouds or images. It doesn’t fall out naturally from any sensor. So I was myself struggling for a while to understand, like, do we really grant it the place among other, let’s call ’em, primal representations, where I’m using the term primal here to say anything that falls out naturally from a sensor, like raw pixels, like point clouds from a lidar. And I think it’s really interesting to just adopt this mindset and think of, like, okay, what does it mean to be a primal representation?
It means that you are the output of a sensor. But what are neural fields? These are network weights, and this really opens up interesting questions, because here we are entering uncharted territory, for people especially. What are network weights? It’s not something that, I mean, points we understand. If you capture the scene and there was a point floating in the air, you can ask yourself, what is that point?
Did I accidentally hit a butterfly with my lidar? Maybe. You know, you can remove it if you don’t want it. But if that reflected in a slight change in one of your MLP weights, where is it? What do you do with it? How do you remove it? How do you manipulate it? And that is something that I’m recently very fascinated with.
I feel like as we’re working our way towards turning those neural fields into primary representations, maybe even developing sensors that, instead of outputting multi-view images, immediately update weights, right, which opens up other interesting challenges. Yeah, definitely. There’s the question of, okay,
all other primary representations, we have tools to manipulate, so we probably need to have tools to manipulate these radiance fields as well. And because the manipulation is happening really in those semantic channels, or network weights, or whatever you wanna call them, features, we’re actually getting further and further away from these explainable and intuitive representations.
And we need to start developing tools to act on them. And going back to artists and content generation, you know, how do you give your artists now a set of weights and ask them, how do they sculpt using this? What kind of hammer can they use? And I think it’s a huge new field, and it just invites us to study, you know, new toolboxes. You know, what are we going to develop for the future 3D artist?
What are they going to use?
Gil Elbaz: Yeah, I completely agree with you. And I think that, you know, if you look at for instance, image representations, so we look at PNG for example, that you hold the pixels there with JPEG, you already have this compression layer on top of it. And that’s very interesting. If you dive into the JPEG compression and see how it’s built, it shows a lot of interesting intuition because on one hand you can compress very nicely this information, but on the other hand, it’s still compatible with a lot of the editing techniques and the editing capabilities of these standard pixel representations.
And so when we’re looking towards these neural representations, I think that one of the interesting things that will make it much more useful over time is really, like you’re saying, to add this layer of control and this layer of editing, and semantic editing potentially, on top of this 3D data. And that would potentially make it very, very useful.
Even with respect to images, even though there are nice tools, like very simple tools, for instance in Windows, right, for editing images, the brightness, the contrast, it’s still very limited, right? You can’t erase someone from the background without Photoshop or more advanced tools. You can’t change a smile into a very serious face, right? Or remove someone’s glasses in an easy way, at least not yet. But I think that with neural representations, this semantic level of control might be more natural in a way, if you’re representing data as neural radiance fields and not necessarily as 3D meshes or 3D point clouds.
Or Litany: I fully agree. I think there used to be an interesting debate on what’s the holy grail of 3D representations. Like, should it be a point cloud, should it be a mesh, should it be a signed distance field, or whatever? And I think the serious answer is that there’s not gonna be one, because each representation has its own merits.
And sometimes you just want to be able, you know, we should borrow the pros of each representation and just utilize it for whatever application you need. So if it’s visual fidelity or if it’s manipulability, I mean, anything would have its own type of representation. And I think you’re hitting the nail on the head saying, well, now we have this new emerging representation called neural fields. And it seems to already be kind of abstract and already live in feature world, where we are used to image features, you know, living there, and maybe some text features. So could that maybe be the one representation where we are doing more semantic manipulation?
You can imagine the artist, it’s much like if you take an image, like, I think that your example was spot on, right? If you take an image and you try to draw something on that image, that’s a task you can do. I can do it in Paint, right, in Windows, like you said. But if I wanted to take a sad face and make it smile, that’s much harder.
I’m gonna have to paint a lot of pixels, and maybe I wouldn’t even be able to complete that job in a full day of work in Paint. Yes, exactly. But if we took StyleGAN or one of those, you know, manipulable image representations, we could, a lot of times, just, you know, invert, go back to some latent space, take it into, I don’t know, use some CLIP embedding space to take it more into the happy face gradient, and then regenerate that face.
And suddenly it’s smiling. Or add sunglasses or not, you know, all the usual face manipulations. And I think with neural fields, we’re seeing something similar. Everything lives in feature world. It actually gives a big opportunity to start and match those features with other pre-trained latent spaces that are well structured. That really buys a huge opportunity to do more semantic manipulations.
Whereas surprisingly adding pixels becomes the harder problem in a sense, right?
Gil Elbaz: Yes. Yes. One of my takeaways also, from what you’ve mentioned, is the interoperability. Meaning, you’re very much correct. I think that the different representations each have different advantages and disadvantages. I agree, like, for semantic-level editing,
definitely a neural representation would probably lend a lot of ease of manipulation there. But also if I look at meshes, for example, which we deal with a lot, for an artist, controlling 3D meshes is much easier than a neural radiance field, at least now. And I think in the near future, the separability of objects when represented as meshes and, let’s say, skeletons, for example, is quite useful.
And also this connects in a way to Michael Black’s work on SMPL and the whole line of work around that, where we see a very human-specific representation, which is not a mesh and it’s not a neural radiance field. It’s its own representation in a way, which is a light, high-quality human representation.
And that also lends itself to a lot of different, very interesting applications. So I think in a way what your intuition is, or what I’m learning from this, is that there isn’t one representation. It’s not that neural radiance fields are gonna take over everything, but there are a lot of different advantages for these different representations, and potentially connecting between these different representations could be one of the big game changers.
Or Litany: Yeah. Yeah. And I think trying to unify is not a bad idea, but we also need to remember why we have those, right? They’re there for a reason, and they’re still very much useful. And you gave an interesting example earlier with image compression. Actually, with Leo, we had a lot of these chats. There’s this field called homomorphic encryption, where people try to encode, basically, in simple words, it’s trying to get some
form of equivariance, but in more like a coding-decoding type of framework, where you want to manipulate your image and then compress it, and if you decode it, you get back your manipulated image. Or you can first encode your image and then ask, what’s the equivalent operator you develop inside your latent space, in a way, so that when you decode it, you’ll get the same result as if you manipulated your image in primal space?
Gil Elbaz: So it’s manipulation of encoded data, in a way.
Or Litany: Exactly. In simple words, it’s: if you took your JPEG image and you wanted to add a pixel, that’s very easy to do in Paint, but if you already compressed your image and you have a zip file, what’s the operator that works on the zip file and adds a pixel at coordinate 13, 17?
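The question of finding an operator in the encoded space that matches an edit in pixel space can be shown on a code far simpler than JPEG or zip. The sketch below assumes delta encoding of a single row of integer pixels as a stand-in; `encode`, `decode`, `add_in_pixel_space`, and `add_in_code_space` are hypothetical names. The property being illustrated is that editing the code and then decoding gives the same row as decoding first and editing the pixels.

```python
def encode(pixels):
    """Delta-encode a row: first value, then successive differences."""
    return [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]


def decode(code):
    """Invert the delta encoding with a running sum."""
    out, total = [], 0
    for delta in code:
        total += delta
        out.append(total)
    return out


def add_in_pixel_space(pixels, index, amount):
    """The easy edit: brighten one pixel directly."""
    out = list(pixels)
    out[index] += amount
    return out


def add_in_code_space(code, index, amount):
    """The equivalent edit on the encoded row, without decoding it."""
    out = list(code)
    out[index] += amount          # raises this pixel and all later ones
    if index + 1 < len(out):
        out[index + 1] -= amount  # cancels the raise for later pixels
    return out
```

For this simple code the matching operator is easy to derive by hand. For an entropy-coded format such as a zip file no such local operator is obvious, which is the difficulty Or is pointing at.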
Wow. Okay. So, and that sounds hard and challenging and maybe even interesting, right, as a thesis proposal. Yes. But maybe not necessarily practical. And one agenda could be that the compression is there just to deliver data more easily between two sides. And if that’s the case, then maybe we don’t necessarily need to have explicit manipulation abilities inside those.
So if you imagine neural fields as a form of compressing 3D content, maybe we don’t necessarily have to have explicit tools to edit there. Maybe if you want explicit editing, you can go back to a mesh representation or some other textured mesh representation, move the geometry a little bit, add some geometry, and then recompress it, or re-encode it back into a neural field.
So that could be one option. So it’s not necessarily that we need it. On the other hand, like we said earlier, the semantic editing, it’s much more inviting in a way to do it in this already abstract representation.
Gil Elbaz: Yeah. I completely agree. And I think that one of the big questions I have really is what are the next few steps in this field of neural representations?
I’ll keep it general as neural representations. Cause there are so many,
Or Litany: So we’re trying to coin the term neural fields, because this thing is being called so many things.
Gil Elbaz: Yes, yes. Yeah. How do you see this evolving, like the next few steps forward? What are the big milestones that we, we as a community should be looking at?
Or Litany: Yeah, I think my favorite, the one I said before my, you know, my personal kind of point of curiosity to this thing is manipulation. For sure. Like this is to me the main gap or, or once this is done, I’m kind of willing to accept and call it a primary representation. I’m okay with it. So this is, you know, and we’ve talked about that.
So manipulation is there for sure. I think compute, or kind of efficiency, that’s one huge thing. So people are studying different architectures. You mentioned Plenoxels, and NVIDIA released this Instant NGP, which is crazy fast to learn, you know, in the order of seconds you can learn a NeRF, which is amazing.
And I think beyond saving frustration for students trying to, you know, develop a new methodology, I think it also opens up new avenues, because if it takes you a few days or a week to learn just one model, it’s very hard to imagine how we take huge data collections and just convert all of them to NeRF representations, and then perform learning directly in NeRF world, which brings me to the other missing ingredient, which is generalization. Because for now, a lot of these methodologies for neural fields, especially neural radiance fields, it’s just almost like going back to early days: we’re using deep learning machinery.
But really what we’re doing is we’re solving optimization problems. There are no priors. It’s just, I give you a collection of images and you’re trying to do multi-view reconstruction, and that’s all the information you assume to be given, but that’s insane. Right? What have we been doing for the past 15 years?
Right? We’ve been developing huge machinery to gain priors from the world. Why aren’t we using it? And for now, those priors are introduced through images or through the representation that we use to introduce priors through. But I think once we develop capabilities to quickly and efficiently convert huge data collections into a NeRF collection, then enters the point where, again, we’re ready to call it a primary representation.
Namely, we’re ready to learn directly on NeRF and gain insights from NeRF and have the right architectures that play nicely with radiance fields and, and the right priors. So that when enters a new piece of data that we haven’t seen, we’re actually able to generalize and, and do useful things with it.
So we really bring kind of the power of AI into that. So for now, we’re just using the power of PyTorch.
Gil Elbaz: Which is an amazing engineering effort, but yes, it’s very interesting to look at one of these future steps as not only generating NeRFs, but once we have really millions of NeRF data points in a way that each represent a scene with this neural representation that is very dense, being able to then learn those and manipulate them, given a large set of NeRF representations.
This is very interesting intuition. I, I think that it has a lot of promise to be one of the next big things that we work on and
Or Litany: Even learning to generate them, yeah.
Gil Elbaz: And generating them. Definitely. It’s gonna be very interesting to be able to generate full 3D or 4D photorealistic data at scale. I think that this is one of the holy grails, probably.
Or Litany: Yeah. And I think one early example of this is the really beautiful work that was just now in CVPR called EG3D, again by some collaborators of mine at NVIDIA. And that’s really maybe the first example that I’ve seen, at least, of a 3D generative model that immediately generates a neural field you can render from multiple views.
So that’s, that’s really, I think just scratching the surface with what this opens up.
Gil Elbaz: Amazing. Yeah. And I think, at least intuitively for me, these neural fields are great representations of 3D information. I do feel that they’re not the ideal representations for 4D information. It does seem like there is something missing in order to represent this 4D information in a more natural way.
Do you have a sense of if they are actually good representations, or if there are different things that need to be created in order to represent this 4D information?
Or Litany: That’s touching on, I think, exactly one of the frontiers of this field, right? How do we take this into dynamic scenes?
And in a way there’s no problem with the underlying concept of neural fields to represent temporal and spatial data, right? In a way, what we’re trying to do is we’re trying to build a plenoptic function, right? That, given some ray direction and a time of day and a coordinate, just gives us all the information
we need to be able to render that ray into our camera, gives us the intensity. So if you had all that information, if someone, you know, magically gave you that data, you could store that data. No problem. You know, you could have a hash table and maybe even try to overfit that hash table using a huge MLP.
There’s no, like, fundamental flaw in that. I think the problem starts with: where does our data come from? Right? Usually, when we have temporal sequences, if we use a monocular camera, we give up on the multi-view images, and suddenly we have this problem, right? I’ve seen you from one angle, but when I went to the other angle, you already moved, and that kind of ruins my ability to match those pixels. I can’t tell my model
now that, you know, I can match that pixel and say that the same location in space had density, because maybe it’s not there anymore. So really I think you are touching a fundamental point that, again, brings up this issue of priors, right? When we lack data, we compensate with priors.
So we need to see enough things from before. Can we really learn motion priors that are completely in the wild and are not class dependent or class specific? That, I don’t know. I think if we take inspiration from 2D, you know, people have studied optical flow, people do object detection that’s open set, right?
So you can detect objects beyond the closed set of labels that you had. So I feel like there could be a lot of biases and priors we can learn from data to be able to complete that information. Now, how to represent it? Do we really have a good way to take those priors? Cause a lot of priors, you know, there’s data priors, but where do we store them?
When we have an encoder, a lot of times you’d say, oh, the encoder stores them. You know, we take an image and I dunno, 2D to 3D is a good example. You take an image you’ve seen enough or, or monocular depth estimation. You’ve seen enough examples of 2D content and they’re corresponding depth. And now you can take a novel view or a novel scene that you’ve never seen before, but still do quite a good job in, in estimating.
So that gives a good promise. But if you now look at the way, you know, neural fields are being used, a lot of them are based on this auto-decoder framework. And then if you threw away the encoder, then where’s my prior? Where do I take my prior from? Well, enters another option: maybe there’s meta-learning that can help there. Maybe network initialization is another way to bake priors.
It’s, I feel, a much less understood way of baking priors than encoders, but a very useful and elegant way nonetheless. Another way to bake priors is by adding them into the architecture. So thinking of one huge MLP that represents the entire plenoptic function sounds also very strange. And we are actually seeing evidence from past years that breaking down this global MLP into some hybrid representation, with spatial voxels that contain features that, with the proper trilinear interpolation, give you a field where you can query a coordinate and get the feature of that field,
is really efficient. How do we know it’s efficient? Because the decoder has now become super small. So you can use just a two-layer decoder and take that learned feature and decode it into a nice rendered image. But you’re asking, and rightly so, what about temporal information? Now that temporal information is going to move those features around.
Now my voxels are not even fixed in place, so will it be dynamic voxels? You know, do we want to see things move around? Will it be point clouds again? Cuz we also see evidence that, you know, point clouds maybe are more easily manipulable in that sense, right? They can temporally move. If you had those point clouds coming from an actual sensor, like if you’ve driven through a scene with a lidar and you had a pedestrian walking around, you actually have a moving point cloud.
Maybe that’s a good place to store your features on. And then you have kind of a natural, explicit, dynamic moving scene. And on that explicit representation, you can bake some latent features from which you can decode your representations. But I don’t know.
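To make the hybrid representation Or describes a bit more concrete, here is a minimal sketch of a voxel feature grid queried with trilinear interpolation and decoded by a tiny two-layer network. It is an illustration under stated assumptions, not the method of any particular paper: the function names (`make_grid`, `trilinear`, `make_decoder`, `decode`) are hypothetical, the features and weights are random stand-ins for values that would normally be learned, and plain Python lists are used only to keep the example self-contained.

```python
import math
import random


def make_grid(res, feat_dim, seed=0):
    """Random stand-in for a learned feature grid: grid[i][j][k] is a feature vector."""
    rng = random.Random(seed)
    return [[[[rng.uniform(-1.0, 1.0) for _ in range(feat_dim)]
              for _ in range(res)] for _ in range(res)] for _ in range(res)]


def trilinear(grid, x, y, z):
    """Interpolate a feature at a point in [0, 1]^3 from its 8 surrounding grid vertices."""
    res = len(grid)  # assumes a cubic grid with res >= 2
    feat_dim = len(grid[0][0][0])
    # Map the query point to continuous grid coordinates in [0, res - 1].
    coords = [min(max(c, 0.0), 1.0) * (res - 1) for c in (x, y, z)]
    # Lower corner of the enclosing cell, clamped so that base + 1 stays in range.
    base = [min(int(math.floor(c)), res - 2) for c in coords]
    frac = [c - b for c, b in zip(coords, base)]
    out = [0.0] * feat_dim
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Weight of each corner is the product of per-axis linear weights.
                w = ((frac[0] if dx else 1.0 - frac[0])
                     * (frac[1] if dy else 1.0 - frac[1])
                     * (frac[2] if dz else 1.0 - frac[2]))
                corner = grid[base[0] + dx][base[1] + dy][base[2] + dz]
                for i in range(feat_dim):
                    out[i] += w * corner[i]
    return out


def make_decoder(feat_dim, hidden, out_dim, seed=1):
    """Random stand-in weights for a small two-layer decoder (no biases, for brevity)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.5) for _ in range(feat_dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 0.5) for _ in range(hidden)] for _ in range(out_dim)]
    return w1, w2


def decode(decoder, feat):
    """Two-layer MLP: linear + ReLU, then linear + sigmoid."""
    w1, w2 = decoder
    hidden = [max(0.0, sum(w * f for w, f in zip(row, feat))) for row in w1]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w2]


grid = make_grid(res=3, feat_dim=8)
decoder = make_decoder(feat_dim=8, hidden=16, out_dim=4)
# Query an arbitrary coordinate; the four outputs are illustrative (e.g. colour and an opacity-like value).
print(decode(decoder, trilinear(grid, 0.2, 0.7, 0.4)))
```

The point of the sketch is the division of labour: most of the capacity sits in the grid of features, so the decoder can stay very small. How such a grid should move or deform over time is exactly the open question raised in the conversation.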
Gil Elbaz: Yeah, it’s an open question. I mean, I love these intuitions because they really push the way that we’re thinking about how to do these representations.
I think it’s definitely an open question and there’s no one right answer, at least that we’ve found so far. But I do think it’s a very fundamental problem, and that we’re heading in the right direction to solve it, but it’s very much unsolved right now. Ideally, one day we’ll have just one giant MLP that will, you know, get X, Y, Z, T, and be able to produce it.
But like you’re saying, there are a lot of priors that come in when dealing with temporal information that are semantic priors, or that might also be just spatially or temporally local priors, right? Like, we don’t move from one side of a 3D space to the other, at least until we’ve invented teleportation of course, except in a natural way.
Right. And so there are some very interesting priors, I think, that we can insert both into our networks, but ideally also into some kind of representation. Yeah. This is yet to be solved, from my perspective. I think we’ve reached this kind of interesting maturity in 3D information, which we have yet to reach with 4D information, in my opinion. But maybe you can solve this at the Technion. So, just asking a little bit about, and I know you had roles at Google, at Meta and at Nvidia, I’d love to hear a little bit about the culture and kind of what you’ve learned there, a little bit about the differences between them.
Or Litany: Yeah. So I wore different hats in those different places, right? So I interned with Google, I guess, towards the third year of my PhD. That was, I think, my first big experience. Before even talking about Google, just the experience of going abroad for a certain period of time and experiencing
research outside of Israel was really educating. Actually, the first time I did that was when I just started my PhD. My advisor, Alex Bronstein, used to travel every year to Duke University. He had some visiting position there. And when I first started my PhD, he took me with him. And that was really a huge eye opener.
Just seeing how global this thing is, especially since I spent a lot of years in the military, and a lot of the things I was doing I wasn’t allowed to talk about, even at home. And then suddenly going around the world, talking freely with other people who are excited about the same things as I am and read the same papers as I did, was just, you know, an eye opener.
Or Litany: Right. I really loved that. And then coming to Google was for a slightly longer period of time. It was in New York, which is, you know, a really nice kind of place to absorb this culture. And the type of researchers that decide to do research in New York, I feel, is also a type of brand that you don’t see everywhere.
And I worked there with Ameesh Makadia, who was my host, and who is just a brilliant researcher. And that was, I feel, in a lot of ways, the first time I asked myself: is academia really the place to be in? Because it seems Google has everything, you know, they solved it. They have the resources and they have the smart people and you don’t have to write grants to hire students.
And they have the compute, which is, you know, and a lot of times they have. I think this was beyond the experience of just doing an internship and absorbing, you know, the culture of the city and the company. It was the first time that I really asked myself this question. And I tried to ask around, my host, other employees there, about the differences. You know, different people have different opinions.
You get different answers to that question, but it was the first time I asked it. It was a really nice atmosphere. The 3D group there, specifically in New York, I mean, Google does a lot of 3D obviously, but that specific group was a bit small. So I couldn’t say that this was comparable to one of those large groups, like Leo’s group, for example, where everyone is, you know, excited about just, you know, 3D.
So that was, I guess, my experience with Google. That was one of the places where I could also see some first signs of differences between academia and industry, where suddenly you have to write nicer code, cuz other people need to look at it and review it. It has to compile on their systems.
So you can’t just make sure it runs locally on your machine and then never again. So that was also an educating experience. On the other hand, I could say that the ramping up was slower. So there’s also like a learning curve there, where you need to kind of learn how to work with
Gil Elbaz: like a whole suite of tools that they have internally.
Or Litany: Yeah. Yes. Yes. And remember, this is not a research-oriented company. I mean, they have lots and lots of researchers, but at large, the company is not focused on research. So most of the tools they develop are not tools that are designed specifically for research.
So if you need something that’s completely outside of the scope and, you know, if you need to suddenly run things that don’t compile on the kind of local tools that they have, I’m sure they have solutions. I’m sure you can just order another computer or something. But it’s probably less trivial, or you can get less support, and things like that.
Or Litany: Mm-hmm. Yeah. And then the experience at Meta was also very interesting. So I came to Meta as a postdoc. It was almost a mistake, like a happy mistake. I went to Leo to do a postdoc, but then he realized it’s been too long since he took a sabbatical. And I think he had to take it that year or the year after.
And he came to the point where most of his team was mature enough that they could kind of run solo. Not that he went too far. You know, the industry was in Menlo Park, which is one city north of Palo Alto. So he was still there every Wednesday, making sure his students were taken care of and everything, but he had to take the sabbatical.
And then instead of me coming to Stanford while he’s away, he just invited me to join him at FAIR. So that was beautiful, because that year FAIR just had so many 3D researchers that were parked for the year. You know, people that already had academic positions waiting for them, but they decided to take a gap year and spend it at FAIR. People like Manolis Savva and Angel Chang and Justin Johnson and Judy Hoffman, just amazing researchers that are all in different universities now around the world, so that it was almost like doing one postdoc with like six different 3D expert professors, you know, at the same time.
So you can imagine what lunchtime felt like. Just being a fly on the wall there was beautiful. And FAIR is very academic too. So you could barely feel it’s part of Meta, or, you know, they called it Facebook at the time. Yeah, you could barely feel it’s part of the company.
For me, it was a very important thing, cuz I was only looking for publications. I only wanted to do research that’s publishable, and I made sure, every time I wrote something, that this was going to be publicly available, that we could release code, all of that. That was crucial. And that’s relatively easy to do with Meta.
So that was the good thing, I guess. Again, going back to that question that keeps coming up, and I keep hearing it from students as well, this, you know, relationship between industry and academia. In a way, I feel that there I started to develop my own answer to, you know, how to choose between the two.
And I feel like it’s a personal answer. It doesn’t fit everyone. But I feel like for people who are driven by excellence and want to reach the top of wherever they are, you can’t ignore the organization. And if you want to reach the top of a company that earns from ads or earns from, you know, users using their platform, you can’t be the superstar of that company
if you’re doing research that’s maybe 20 years away from being productized. Some people don’t mind. Actually, there’s something charming about that, these companies allowing those islands of innovation to happen. And, you know, maybe once in a blue moon, some emergent technology becomes useful. I could tell for myself that there would be maybe some point of frustration. For software engineers, that’s their dream job.
They’re developing tools that are used by millions and millions of users. That’s amazing. And here I am, like, writing papers, and I probably have to thank them when I meet them in the hallway, because they’re paying for my salary. You know, they’re bringing the value. And I think academia solved that. If you want to write papers, and if those papers become impactful and you’re doing this in a university lab, that’s your job.
You’ve done your job. You could be the superstar of the institute you’re part of. So that’s a very personal answer. I don’t know if, you know, people can easily relate to that, but in this field where I see so many people who are excellent and are driven by excellence and they just want to be the superstars of wherever they go,
I think it’s important to ask: where do I go? And if I do amazing research, will I actually be a superstar at the place I go to?
Gil Elbaz: Yeah. This is really helpful framing. And I think it’s super interesting to hear it, you know, from you, since you’ve had these experiences, you’ve tried, you’ve been on both sides of the aisle.
And so understanding how you look at this is very, very insightful for me.
Or Litany: So enters Nvidia, which confused me again, because at Nvidia, in a lot of ways, the leadership cares so much about research, maybe in particular the type of research that I do, that you actually do get a sense of, wait,
that’s a place that has all the benefits of big companies: you know, access to compute and data and resources and amazing talent. And at the same time, you’re being very much appreciated.
Gil Elbaz: This is very interesting.
Or Litany: Yeah. And one should be careful when I say Google, Meta and Nvidia, because really, I only pinpointed one particular group at each one of these companies.
So it’s very hard to say what those companies look like outside of my very kind of narrow purview of what I got to see, but that was my journey. So that’s what I can talk about. And then Nvidia confused me because then I joined Sanya Fidler’s lab, who’s an amazing researcher and also still a university professor.
Gil Elbaz: Right. Very much active. I just met her at CVPR and she gave an amazing talk, showed off crazy research, just paper after paper. And it was very, very interesting. And I got the opportunity to chat with her a bit, and I can say that the insights that they have, because they’re so focused on research, but also so close to the compute and the actual resources themselves that we use,
right, in order to create these papers, I feel like it’s a very unique place.
Or Litany: Yeah. And I think, just as I tried to learn from Leo about the dos and don’ts of becoming a professor, I’m also, you know, watching Sanya’s moves and trying to pick things up, because I feel like she’s this amazing researcher who has some really unique superpowers. Some of them are just, I feel like, just luck.
I mean, she was born with them, and what can I do? Like, she has an amazing eye for identifying young talent, for example, so she could pick up amazing researchers before they became famous. I think it would be huge if I could have that as a professor, but I don’t know if that’s something one can develop. But also she has an amazing eye for impactful research, because I think that there comes a point where coming up with interesting ideas, things that haven’t been done before and that you think you can do and make a paper out of, becomes almost easy.
And then you enter stage two, which is: okay, but out of all those things that I can do and make papers out of, given that all of them will consume my time, what do I focus on? And focus is probably the number one problem of our generation in general, but definitely for researchers, who get so easily excited about new ideas.
I mean, I’m sure that you and me can talk for 15 minutes and come up with three new paper ideas and, and
Gil Elbaz: Definitely. That’s what we’re doing right after the podcast.
Or Litany: Exactly. And the first thing that comes to mind, in the researcher’s spirit and mind, is like, okay, I’m gonna go home and throw away all of what I’ve been working on so far and just focus on this shiny new paper.
And that’s risky, that’s dangerous. And I think Sanya has this ability to take impact into account. And actually, I may be wrong, but one of the reasons I think she developed this, or, you know, one of the forcing functions to have this ability, is to actually hold these two positions. You know, she’s still a university professor, but she also works at Nvidia.
And I think she’s forced to ask herself this question when starting a project: where should I host this project? Should this be a university project or should this be an Nvidia project? And they can’t be entangled. So it’s an interesting question. Why do a project in a company like Nvidia? Maybe that project needs to be somewhat impactful to the industry rather than, you know, necessarily academia, or maybe this project should rely on heavy compute, which is something that’s easier to get at Nvidia, maybe harder to get in university.
Whereas probably when she starts a university project, she thinks, I don’t know, maybe of something that could be more in the wild, more out there. It’s not a three- or four-month internship with a student where, as smart as that student is, you want to send them home happy, right? You can’t guarantee success in research, but with high chances of publishing something. But when you hire a new PhD student and they have a couple of good years to invest in a problem, they could really go and explore uncharted territory, something that’s very much in the wild, before it
has any signs of success. And I think this exercise that she probably has to do on a daily basis really helped her. Having a company telling you what the problems are really reminds you, day in, day out. And Nvidia has a very open culture. That’s something that I really like. People tend to, like, send out the top five things they work on, and then you can read that and see, oh, what that person is working on
seems like an interesting problem. You can reach out, talk to them, and learn more about their problems. And you see that a lot of things that, as a researcher, you assumed are solved are actually very far from being solved. And that’s a huge resource for working on impactful problems. It doesn’t mean that solving those problems would actually emerge as an impactful paper,
because it’s not just you that thought that the problem is solved, right? You’ll get reviewer number two telling you that the problem is already solved as well. But at least it’s a great resource for inspiring and impactful problems. Because Nvidia’s product is so broad, so many people are using GPUs for AI,
then it’s very easy to, like, or the other way around, it’s hard to find a problem that you cannot contribute to working there. That was some source of confusion. Luckily, I was already convinced that I’m going to academia at that point, so it didn’t really change my mind. But I think now this partnership is actually becoming a very useful tool, a very useful resource for making this decision: given that I can work on 20 different interesting problems, which are the top five, not just interesting, but also most impactful problems that I can invest my time in?
Gil Elbaz: Wow. Very interesting. Yeah. It’s incredible to get a lens that looks inside a little bit of these research teams within these amazing companies that have grown to enormous sizes as well. Right. And really being able to see how at least you perceive these, these different companies is I think both interesting for our audience and, and helpful in many ways.
So I do wanna ask a few last questions to close this off. There’s one kind of big question. Some researchers started talking a little bit about consciousness and what it means, and if we can create something that simulates consciousness, I’d love to get your sense on the topic. And I know that it’s a bit far out there, but at the same time, I think it’s important that we do start conversations on these far out topics.
Or Litany: I’ve been largely ignoring those discussions so far. So, you know, I’ve read some of them. I’ve watched some interesting TED talks about this dystopian future where robots, like, could be a good consciousness of, yeah, staple, or, you know, our faces or lips, so that, cuz the loss function was making a smile, and then that’s the easiest way to do it.
I think planning ahead is a good thing to do. You know, maybe consciousness doesn’t even come from AI. Maybe it comes from a different source. You know, maybe it’s aliens that come to earth and, and we need to, we need to have those discussions. I feel this is, these are problems that are beyond the scope of researchers.
Yes. Researchers have responsibility, but a lot of it is driven by curiosity. You know, maybe it’s a good poll for Twitter, but say we knew that to keep developing this technology we’re developing is gonna bring us to our doom. You know, this is going to end the world. Would we stop? Say we knew it for a hundred percent.
Gil Elbaz: I think a hundred percent we might wanna stop.
Or Litany: I don’t know. And, like, you can ask at what time in the future, like, when is this going to stop? There’s a really interesting book called The Three-Body Problem that kind of sheds light on this. I recently read the first two books out of three, and it sheds light on exactly this type of problem.
And, and it seems like we need to, on the one hand, have some agenda, but we also need to educate people. It’s not an intelligent, you know, speaking of intelligence, right. It’s not an intelligent discussion to just have some article in the newspaper saying like random stuff. People need to start understanding.
It’s funny when, you say, companies call what they do AI cuz it gets money from investors. That’s funny. But when they call things AI and it confuses the public, the general population, yeah, I think we have some responsibility there. First and foremost, just get everyone educated on what that actually is.
There’s something, during my PhD, I visited Munich for a while, and there’s something really nice that they do in the city. They take university professors and they ask them to give lectures to the general public in the city hall, or I forget where, and lots of people attend, and it forces the professors to get rid of all equations, make everything very intuitive, but still be serious about their research and talk seriously about what they do.
And I feel like this is something that we’re lacking, you know. A lot of this research shouldn’t be kind of kept behind the academic walls, you know. And yes, everything’s accessible on arXiv, but that’s not enough, right? We can have more, I don’t know, maybe podcasts, yeah, definitely, that people who haven’t read those papers can listen to and discuss this.
So I think only then we can really start engaging in these discussions. It’s not a bad thing that people are panicking about this. It’s good. It’s a forcing function.
Gil Elbaz: I mean, I honestly think that in the near future, relatively near future, with the improvement of chatbots as they kind of progress, and we recently saw Google’s LaMDA make the news,
of course, we’re gonna see large parts of the population at least talking with these agents as if they are actual kind of conscious beings. I think perceived consciousness is very much possible to reach within the next few years. Real consciousness, I don’t know. I don’t know how to define it even, but I think it’s an interesting concept, maybe less on the computer vision side, but, you know, as we said before, everything’s coming together in many ways. Both for embodied AI and in general, I see it as something that is very much interesting to explore and understand.
Or Litany: I agree. And maybe even before we get to that, and I agree with you about the perceived consciousness and, you know, it’s already tempting to talk to bots as if they’re human beings, we lack the tools to design those barriers, right, as we develop this. So you could complain to a scientist or an engineer training on tons and tons of data,
like, stop, what are you doing? You’re baking consciousness into your computer, you know. But really, even if that’s true, the fault is that, you know, we don’t know how to do things otherwise.
Gil Elbaz: Yeah. We don’t have metrics for it and we don’t know how to measure it. Or even, how do I not bake consciousness?
Or Litany: Let’s say that this thing is happening. How do I stop myself from doing it? Is there a tool for that? And enters explainability, enters constraints. Like, you know, let’s say I design a bot and I want it to talk to my kids. I don’t want them to introduce content, you know? And then where’s the PG-13 kind of defense that I can give my kids?
You know, that’s something way before consciousness that I’d like to encourage and have in my models. And I don’t yet have the tools to.
Gil Elbaz: Very interesting. So last question, for the newer generation, the people that are right now just finishing, let’s say, their first degree, or just starting their second degree, that are very much interested in the computer vision space.
What are some of the recommendations that you would give to them?
Or Litany: Yeah, so I think, like I said in the beginning of our interview, this whole content has become much more accessible. I feel like we’re now at a time where taking university courses is important, of course, but really, if you can self-educate, there’s no limit to what you can do.
And there’s just crazy things that I see out there. You know, even when we work on research, we’ll suddenly discover some GitHub repo of this guy that we’ve never heard of, right? And this person is building those amazing tools, and you ask yourself, why didn’t they publish it?
And it’s just because it’s accessible, and maybe, you know, getting stars on GitHub is more important than getting your h-index up. So really, get your hands dirty. And, you know, be curiosity driven. Two Minute Papers, I really recommend these things, even though they’re, like, “shallow”, because what I learned in the past year and a half working with interns in Sanya’s team is that when people are curiosity driven, there’s just no stopping what they can do.
And it’s funny, because whenever we hire an intern, we try to think of, like, oh, okay, what projects will they work on? And maybe we’ll have them meet with a few people and then decide, or not, or maybe we can prototype something and encourage them to do it. Really, what wins in the end is, you know, everyone is excited by different things, of course.
But if you could find that thing that excites you or in my case, if I can find that thing that excites the intern, it’s almost guaranteed that they’ll do like magical work. And that’s amazing. And I feel like the reason I’m recommending this is because there used to be a time where you had to finish your degree to ask yourself now that I understand what I’m doing. Is this interesting to me?
But because everything is so accessible now, you can, even if not 100%, but like 80%, understand what this Two Minute Papers guy’s talking about or what that paper I just read from arXiv is trying to do. You can very quickly get to a point where you could identify fields that excite you, whether it’s by the applications or the methodology or anything, really. Or just even listen to talks online and see, oh, that person is exciting. You know, some people are just driven by their colleagues.
They’re just, oh, that person looks exciting. I wanna work with them. Where do they work? Oh, they happen to work for Nvidia. I wanna get hired. You know, so that’s my recommendation: to really find your thing, that gets you excited and that you can hopefully also do well, or believe you can do well. And it’s so accessible now that I think it’s much easier to find than first committing to, oh, I’m gonna do that
cuz there’s lots of money in this field now.
Gil Elbaz: Amazing. So get excited, self-educate, learn as much as possible and, you know, find your passion. That’s an amazing piece of advice. Thank you, Or, very much for being on the podcast.
Or Litany: Thanks Gil. This was super fun.
Gil Elbaz: It was a lot of fun. Thanks. This is Unboxing AI.
Thanks for listening. If you like what you heard, don’t forget to hit subscribe. If you want to learn about computer vision and understand where the future of AI is going, stay tuned. We have some amazing guests coming on. It’s gonna blow your minds.