Solving Autonomous Driving At Scale – With Vijay Badrinarayanan, VP of AI, Wayve

About this Episode:

In this episode of Unboxing AI, meet Vijay Badrinarayanan, the VP of AI at Wayve, and learn about Wayve’s end-to-end machine learning approach to self-driving. Along the way, Vijay shares what it was like working for Magic Leap in the early days, and relates the research journey that led to SegNet.

 

TOPICS & TIMESTAMPS

[00:00:47] Guest Intro

[00:02:38] Academia & Classic Computer Vision

[00:08:56] PostDoc @ Cambridge – Road scene segmentation

[00:18:42] Technical Challenges Faced During Early Deep Computer Vision

[00:20:24] Meeting Alex Kendall; SegNet

[00:25:15] Transition from Academia to Production Computer Vision at Magic Leap

[00:27:09] Deep Eye-Gaze Estimation at Magic Leap

[00:33:21] Joining Wayve

[00:36:09] AV 1.0: First-gen autonomy

[00:40:08] On Tesla, LIDAR, and their unique approach to AV

[00:46:37] Wayve’s AV 2.0 Approach

[00:48:42] Programming By Data / Data-as-Code

[00:51:02] Addressing the Long Tail Problem in AV

[00:53:13] Powering AV 2.0 with Simulation

[00:58:30] Re-simulation, Closing the Loop & Testing Neural Networks

[01:01:44] The Future of AI and Advanced Approaches

[01:11:50] Are there other 2.0s? Next industries to revolutionize

[01:13:48] Next Steps for Wayve

[01:14:59] Human-level AI

[01:16:35] Career Tips for Computer Vision Engineers

 

LINKS AND RESOURCES

On The Guest – Vijay Badrinarayanan

Wayve

Papers:

AV2.0

Eye gaze Estimation

Multi-task Learning

Indoor Environments and Homes

Object Detection

Segmentation

Mentioned in the episode

DALL-E 2

StyleGAN2

NeRF

Guest Speaker:

Vijay Badrinarayanan

VP of AI, Wayve

Vijay Badrinarayanan is VP of AI at Wayve, a company pioneering AI technology to enable autonomous vehicles to drive in complex urban environments. He has been at the forefront of deep learning and artificial intelligence (AI) research and product development since the inception of the new era of deep-learning-driven AI. His joint research work on semantic segmentation, conducted at Cambridge University along with Alex Kendall, CEO of Wayve, is one of the most highly cited publications in deep learning. As Director of Deep Learning and Artificial Intelligence (AI) at Magic Leap Inc. in California, he led R&D teams to deliver impactful, first-of-its-kind deep-neural-network-driven products for power-constrained mixed reality headset applications. As VP of AI, Vijay aims to deepen Wayve’s investment in deep learning to develop the end-to-end learnt brains behind Wayve’s self-driving technology. He is actively building a vision and learning team in Mountain View, CA, focusing on actively researched AI topics such as representation learning, simulation intelligence, and combined vision and language models, with a view towards making meaningful product impact and bringing this cutting-edge approach to AVs to market.

Episode Transcript:

[00:00:47] Gil Elbaz: It’s great to have you here, Vijay. I’m excited to present Vijay Badrinarayanan, the VP of AI at Wayve, which he joined in 2020.

Wayve is a cutting-edge autonomous vehicle startup taking a bold new approach to solving autonomy, called AV 2.0. They’ve raised over $250 million to date and are quickly becoming a key player in autonomy that can scale globally.

Before that, Vijay spent five years leading a team at Magic Leap, where he was the Director of Deep Learning and AI, pioneering research for advanced computer vision capabilities in the mixed reality domain. Vijay has an extensive background in academia, earning his first degree from Bangalore University in India, his Master’s at the Georgia Institute of Technology, and his PhD at the INRIA center in Rennes, France, before finally completing a postdoc at Cambridge.

It’s really exciting to have you here. We’ve known each other for quite some time. We met for the first time over three years ago, when you were still at Magic Leap and Datagen was a very, very young startup.

It’s really great to reconnect, especially on this type of platform and hopefully get some interesting insights and dive into some of the cool things that you’ve been working on. It’s really a pleasure to have you here.

[00:02:09] Vijay Badrinarayanan: Thank you so much for inviting me here onto Unboxing AI, a really cool name by the way.

You know, I’m so happy to be here, share whatever I can, and talk through my journey through computer vision and deep learning. Of course, I’m also extremely happy, delighted to see Datagen grow so quickly into something so amazing affecting so many different markets.

[00:02:35] Gil Elbaz: Thank you. Thank you very much.

So let’s dive in.

[00:02:38] Academia & Classic Computer Vision

[00:02:38] Gil Elbaz: In a sense, it seems like you were initially running towards an academic career. What were your thoughts back then, and how have things shifted?

[00:02:49] Vijay Badrinarayanan: I did my undergrad in Bangalore, which is sort of the IT city in India, and then moved to grad school at Georgia Tech. And both of these were completely unrelated to computer vision. It was more electrical engineering, things like that. Although I’d sort of done a project in computer vision during my undergrad; that’s probably stayed with me.

And for my PhD, when I got an offer from INRIA in Rennes, France with a sponsorship from a company called Technicolor, I said yes, because I had a very cool advisor. And I got introduced to some very cool problems.

I worked on the problem of visual tracking of objects in very lengthy video sequences. This was a problem posed to me because Technicolor is a company which works with movies and post-production in movies. Adding objects, removing objects, changing colors of things, the look and feel of the movies once they are shot, that’s post production. And so they posed this problem because it’s a very key, very common problem.

This was in the period from 2006 to 2009. So it was definitely in the hand-engineered era of doing computer vision.

[00:04:02] Gil Elbaz: Up to today, those are still challenges that our industry is trying to deal with: long sequences. Especially with deep learning-based approaches, it’s very hard to get good data. So how did you guys approach it back then, and what were some of the limitations?

[00:04:17] Vijay Badrinarayanan: Well, certainly the approaches we took were almost non-learning-based in some sense, or even if there was learning, there were very few parameters we were really trying to learn. Data was extremely scarce; I remember actually hand-labeling sequences myself to create datasets, downloading things from wherever you could, getting long shots, etc. Very sparse.

So that also influenced our mindset and techniques. Our thinking was, how can you somehow engineer the best features? And in my case, I was thinking of color and geometry or texture and geometry features. And how do you really extract them? How do you track them independently? How do you fuse them? And how can you bring in techniques like graphical modeling and Bayesian inference for doing fusion, and things like that? That was kind of the thought process.

I must add, there was another stream of work coming up then, mostly boosting and random forests, and eventually tracking did turn into more of a tracking-by-detection kind of problem, I must say. But I did quite enjoy the techniques I learned, and I applied them later on as well.

[00:05:28] Gil Elbaz: Classic algorithms are sort of a lost art in a way, but they’re extremely interesting, extremely challenging. And I think that nowadays combining the two, combining deep learning with the insights gained from classic methods, is a very powerful thing. So I have a lot of respect for folks who started well before the deep learning era, and I also think that experience is very, very valuable.

[00:05:51] Vijay Badrinarayanan: Yes. I think there are a number of interesting domains, particularly geometry, where a number of insights are actually used quite deeply in training deep models. A very good example I can cite here is Deep SLAM, something some of my illustrious colleagues at Magic Leap worked on: trying to train models to do pose prediction directly; given a pair of images, can you say what the pose change is between the two, how much the camera has actually moved?

[00:06:22] Gil Elbaz: Like 3D rotation and translation?

[00:06:25] Vijay Badrinarayanan: Exactly. And what we did realize, after quite a bit of effort, is that if you modularize it in some ways, it actually works a lot better. We would still want to do end-to-end learning at some point, but we sort of broke it down into learned feature detection, learned feature matching, and then learned pose estimation, almost identical to what you would actually do in a classical pipeline, but with each piece replaced by a learned component.

I haven’t been tracking this very closely in the recent past, but many approaches are trying to do this these days with success.
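As a rough illustration of the modular pipeline Vijay describes, here is a minimal classical sketch in Python with OpenCV: detect features, match them, then estimate the relative pose. The learned version he mentions replaces each stage with a network; this classical analogue is only meant to make the stages concrete (image paths and parameters are placeholders).

```python
# Minimal classical sketch of the detect -> match -> estimate-pose pipeline described
# above (the learned version swaps each stage for a network).
# Assumes two overlapping images and known camera intrinsics K.
import cv2
import numpy as np

def relative_pose(img1_path: str, img2_path: str, K: np.ndarray):
    img1 = cv2.imread(img1_path, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(img2_path, cv2.IMREAD_GRAYSCALE)

    # 1) Feature detection + description (ORB here; a learned detector could replace it)
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # 2) Feature matching (brute force + ratio test; a learned matcher could replace it)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # 3) Pose estimation: essential matrix with RANSAC, then decompose into R, t
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # 3D rotation and (unit-scale) translation between the two views
```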

[00:07:02] Gil Elbaz: Yeah this is very interesting. And I think it also connects nicely to one of the big topics that we’ll talk about, around AV 2.0, a bit further on in the conversation.

Before that, I’d love to hear like how you moved to Cambridge and what you took on there.

[00:07:17] Vijay Badrinarayanan: In those days, as you rightly pointed out earlier, computer vision wasn’t really a hotshot industry. In 2009, there weren’t as many opportunities in the industry, particularly focused on research as we have now; we weren’t even thinking about it. It was mostly academia. Where’s our next publication going to come from? What are the cool things?

I was definitely focused on an academic career. There was almost a rite of passage that you had to do a postdoc to get into academia. You had to gather all these publications, so really, the question was where and what kind of topics. When Roberto Cipolla, who was my postdoc advisor at Cambridge, proposed to me some of the really cool projects, particularly a project sponsored by Toyota Motor Corporation from Europe, that’s where all my, road scene image work really began for the first time. And the project was more segmentation and tracking of objects through time and things like that. So there was really a nice sort of connection from what I had done before and what I wanted to do.

There’s always this itch when you finish up your PhD that there’s still something left there which you haven’t really exploited, and you want to continue and develop things further. Cambridge was always a very fascinating place for me from my childhood days. I’d read so much about famous scientists there. In fact, I’d even gone on holiday to Cambridge once, just to look at the university, and little did I know I would actually end up there, but I did.

[00:08:45] Gil Elbaz: I hear it’s also beautiful there, right?

[00:08:47] Vijay Badrinarayanan: Yes. It is a magical little place with a lot happening. So if you’re at the center of it, you really quite enjoy it.

[00:08:56] PostDoc @ Cambridge – Road Scene Segmentation

[00:08:56] Gil Elbaz: Yeah, I’d love to hear more about what you did in the postdoc itself.

[00:09:00] Vijay Badrinarayanan: I’ve always worked on projects at the intersection of pure research and applications.

And so in this case, they were really beginning to think about how do we offer intelligence to make the cars safer, so that’s when the whole thing about road segmentation, road scene segmentation started. I had two illustrious seniors, Gabriel Brostow, who had worked on this and who’s now at Niantic, and Jamie Shotton, who’s now the Chief Scientist at Wayve, a colleague of mine. They had done some very interesting work on segmenting road scene images and created this wonderful dataset called CamVid. This was an old dataset, but really the first of its kind, where they actually hooked a camera onto the car and went around Cambridge collecting the dataset, hand-labeling it with 32 classes. It was a grand dataset of 700 images.

[00:09:49] Gil Elbaz: 700?

[00:09:50] Vijay Badrinarayanan: 700.

[00:09:51] Gil Elbaz: Wow.

[00:09:52] Vijay Badrinarayanan: Yes. It seems somewhat, almost cute now. So three generations of folks actually worked on those. It was Jamie, it was Gabriel Brostow, it was me. And in fact, another colleague of mine as well. So they had started on segmenting these things and the techniques they were using were boosted random forest. They were looking at small patches, trying to hand-engineer features, or really, some of the stuff Jamie did was more about actually learning some features through random forest.

The technique was: once you have pixel-wise class likelihoods, how do you create a holistic segmentation using an energy-based model? You would have some priors; you would say, for instance, that neighboring pixels should have similar classes, so there should be very few jumps in terms of boundaries, etc. An energy would be defined using those priors and the output of a classifier, which gave the likelihoods, and then you would optimize it using techniques called graph cuts.

So this was a whole era where things went on, and performance sort of inched from one generation to the next. There was a lot of blood and sweat, but very few real jumps in performance. It was mostly like, who can engineer a better feature? That’s where it was when I started with this.
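For reference, the energy Vijay is describing is usually written in roughly the following form (the notation here is generic, not from the episode):

```latex
% Pixel-wise class likelihoods plus a smoothness prior, minimized with graph cuts
E(\mathbf{x}) \;=\; \sum_{i} -\log P(x_i \mid f_i)
             \;+\; \lambda \sum_{(i,j) \in \mathcal{N}} [\, x_i \neq x_j \,]
```

Here \(x_i\) is the class label of pixel \(i\), \(f_i\) its features, \(\mathcal{N}\) the set of neighboring pixel pairs, and \([\cdot]\) an indicator function. The first (unary) term is the classifier’s likelihood, the second (pairwise, Potts) term is the smoothness prior that penalizes label changes between neighbors, and the minimum is found with graph cuts (alpha-expansion in the multi-class case).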

[00:11:11] Gil Elbaz: In a funny way, during my master’s as well, I focused a lot on superpixels, which are local clusters of pixels, trying to take on a problem that’s not segmentation but a bit similar, meaning creating sub-segments within the actual real segments, and then breaking down the image into these superpixels that had uniform meaning but weren’t necessarily perfect segments of the objects themselves, as a preprocessing step towards a good segmentation.

And so I really connected with a lot of what you mentioned, because we use very similar methods to try to create these superpixels in a meaningful way. The cool thing was we later tried to expand it to 3D, which turned out to be quite a challenge before we moved to kind of deep learning based methods.

So how did you go forward with this?

[00:11:58] Vijay Badrinarayanan: I, like many, like you, was also part of the superpixel era as well. SLIC was the name which comes to mind. One realization was that we need to aggregate context over a larger area to label pixels or in this case, superpixels.

So we were, in a sort of radial way, aggregating context or features from other nearby superpixels to describe a particular superpixel, and then feeding that to an SVM, for instance, in order to figure out what the labels would be, and then probably mixing it up with energy-based models.
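A minimal sketch of that superpixel-plus-SVM recipe, with SLIC from scikit-image, a toy mean-color descriptor standing in for the hand-engineered features of the era, context aggregated from nearby superpixels, and an SVM classifier (all parameter and feature choices here are illustrative):

```python
# Sketch of the pipeline described above: SLIC superpixels, a toy per-superpixel
# descriptor, context aggregated from the k nearest superpixels, then an SVM.
import numpy as np
from skimage.segmentation import slic
from sklearn.svm import SVC

def superpixel_features(image: np.ndarray, n_segments: int = 500, k_context: int = 8):
    """image: (H, W, 3) RGB array. Returns the segment map and one feature row
    per superpixel (its own descriptor concatenated with averaged neighbour context)."""
    segments = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    labels = np.unique(segments)
    feats, centroids = [], []
    for s in labels:
        mask = segments == s
        feats.append(image[mask].mean(axis=0))      # mean colour as a stand-in descriptor
        ys, xs = np.nonzero(mask)
        centroids.append([ys.mean(), xs.mean()])
    feats, centroids = np.array(feats), np.array(centroids)
    # Context: average the descriptors of the k spatially nearest superpixels
    dists = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    ctx = np.array([feats[np.argsort(row)[1:k_context + 1]].mean(axis=0) for row in dists])
    return segments, np.hstack([feats, ctx])

# Training: per-superpixel class labels would come from hand-labelled ground truth, e.g.
# segments, X = superpixel_features(train_image)
# clf = SVC(kernel="rbf").fit(X, y)   # y: one class label per superpixel
```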

So you can see, this was the period between 2009 and 2011, and three things began to stand out for me. One, the size of the datasets was generally really small. Two, the methods were extremely messy to deal with; you would have code all over the place, some libraries for SVM, some for HOG, some for superpixels, just a mess from a development perspective. And three, it was increasingly clear that if you wanted to do good segmentation, you need good context. You need to aggregate context.

In between, in this period, I sort of switched tracks a little bit and started doing things like, how do you segment consistently over time? And this is where a lot of my work on label propagation started. Let’s make this problem a little simpler.

If I give you labels of the first frame and the last frame of a video sequence, can you actually fill out the rest? Can you propagate these labels? This was really the data generation problem we were thinking about: how do you create more labels and then train stuff?
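To make the idea concrete, here is a much simpler stand-in for what Vijay describes: propagating the first frame’s labels forward along dense optical flow. The actual work used Bayesian inference over undirected graphs and propagated from the last frame as well; this sketch only shows the forward direction.

```python
# Sketch of forward label propagation: warp the first frame's label map along
# dense optical flow, frame by frame. A simpler stand-in for the graphical-model
# propagation described above, but it shows the idea.
import cv2
import numpy as np

def propagate_labels(frames: list[np.ndarray], first_labels: np.ndarray) -> list[np.ndarray]:
    """frames: list of grayscale images; first_labels: integer label map for frames[0]."""
    h, w = first_labels.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    labels = [first_labels]
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Flow from the current frame back to the previous one, so each current pixel
        # can look up the label it came from.
        flow = cv2.calcOpticalFlowFarneback(curr, prev, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        map_x = xs + flow[..., 0]
        map_y = ys + flow[..., 1]
        warped = cv2.remap(labels[-1].astype(np.float32), map_x, map_y,
                           interpolation=cv2.INTER_NEAREST)
        labels.append(warped.astype(first_labels.dtype))
    return labels
```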

[00:13:32] Gil Elbaz: Yeah, the scale-up problem. Definitely.

[00:13:35] Vijay Badrinarayanan: So we were asking, how do we use the same supervised methods? How do you use, again, Bayesian inference and graphical models, creating undirected graphs through videos, and then propagating labels from the first frame? We did some of these very interesting things, but over a period of time it was quite clear that those three main issues still stood out. I was almost ready to do something else at this point.

And after messing around with all of these things, interestingly, something happened. In 2011, my project sponsors started providing us with LIDAR data. It was the first time I’d actually dealt with LIDAR data. Now you have LIDAR as well as RGB; can you create better segmentation of the road scenes?

Now, my first thought was, yes, we could do the same old story of taking patches and coding them using some hand-engineered features. And then training a classifier and somehow graph cutting the whole thing. Maybe 3D graph cuts, because that was also sort of coming up.

I sort of held back a little bit at that point and said, I got to do something interesting. And we still have an issue of unlabeled data. We have way more unlabeled data than labeled data. So how can we start doing things semi-supervised? At that point I was not thinking of the word pre-training, but can we actually learn these features using just unsupervised data? Just from observations and then somehow make better segmentations when you have the few labels which come in?

So, one thing which I always did through my research days was, always read beyond just the computer vision literature, probably because it was a Cambridge influence. And the Cambridge machine learning group was very strong, led by Zoubin Ghahramani. We would always sit through their talks and it felt like a different universe completely, but I would often read some machine learning papers, and that’s where I stumbled upon this paper, just really interesting, by Marc’Aurelio Ranzato and Yann LeCun.

It was called Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition.

It was a 2007 CVPR paper; not many people had actually ever heard of this stuff. I was trying to look for methods to train autoencoders, which I knew about. And so my idea was, if I had a patch of a depth image and a patch of RGB, the question I was asking myself was how do you embed them into a feature by training on some kind of task, a pretext task as you’d call it now, but then it was an autoencoding task.

So this paper was very fascinating and this was the first time I saw the name Yann LeCun on a paper. He was the last author on that paper. And the first time I saw the word...

[00:16:19] Gil Elbaz: Not the last time, of course.

[00:16:22] Vijay Badrinarayanan: And the paper was very interesting, very fresh at that time, very different from everything I’d done in the past. And I thought, it may work, it may not work, but I want to just implement it. And I started doing that. And that was really the turning point, late 2011 into 2012, where I started looking at learning features rather than hand-coding them. That was when the first autoencoders started getting built, where you take a 32 x 32 patch of an RGB image,

a 32 x 32 patch of a depth image created from a point cloud which is projected, and then you would train one layer of a network, which was just a matrix multiply and a sigmoid function, and then decode it back. Right. And you’d just train one encoder, one decoder.

And after many attempts, I got it working, because then there was no ReLU, no batch norm, so you had to fiddle around with the weight initialization and stuff like that. And the optimization methods were also not SGD then; it was something else. It was L-BFGS, some stochastic versions of conjugate gradients, things like that.

Once it started working, the idea was you would freeze that, and you would introduce another layer of encoder and decoder and train. That’s how we would make the network deeper, right; that’s when I started using the word network. It was really fascinating, because my whole objective at that point was just to overfit.
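A compact sketch of that greedy, layer-wise training, written in modern PyTorch for readability: one sigmoid encoder/decoder pair is trained with L-BFGS on flattened 32 x 32 patches, frozen, and then the next layer is trained on its codes. Layer sizes, step counts, and the random stand-in data are all illustrative.

```python
# Greedy layer-wise autoencoder training on flattened 32x32 patches, roughly as
# described above: train one sigmoid encoder/decoder pair with L-BFGS, freeze it,
# then train the next layer on its codes. Sizes and data are placeholders.
import torch
import torch.nn as nn

def train_autoencoder_layer(codes: torch.Tensor, hidden: int, steps: int = 25):
    """codes: (N, D) inputs, either raw patches or the previous (frozen) layer's codes."""
    enc = nn.Sequential(nn.Linear(codes.shape[1], hidden), nn.Sigmoid())
    dec = nn.Sequential(nn.Linear(hidden, codes.shape[1]), nn.Sigmoid())
    opt = torch.optim.LBFGS(list(enc.parameters()) + list(dec.parameters()), max_iter=20)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        def closure():
            opt.zero_grad()
            loss = loss_fn(dec(enc(codes)), codes)  # reconstruct the input
            loss.backward()
            return loss
        opt.step(closure)
    return enc, dec

# Stack layers greedily: 32*32*3 RGB patches -> 512 -> 128
patches = torch.rand(2_000, 32 * 32 * 3)            # stand-in for real patch data
enc1, _ = train_autoencoder_layer(patches, hidden=512)
with torch.no_grad():
    codes1 = enc1(patches)                           # freeze layer 1
enc2, _ = train_autoencoder_layer(codes1, hidden=128)
```

The transcoding variant Vijay describes a little later would keep the same structure but reconstruct the matching depth or surface-normal patch instead of the input.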

[00:17:46] Gil Elbaz: So to clarify, overfit on what?

[00:17:49] Vijay Badrinarayanan: On this little dataset of patches of RGB and corresponding depth.

[00:17:53] Gil Elbaz: So you didn’t really separate a training set and a test set, and you just said, let’s first make this work on the training set.

[00:18:00] Vijay Badrinarayanan: Yeah, that’s it. Because it was so raw, so new, I just needed to learn how to make this thing work, how to get the optimization to converge. And I was extremely delighted to see that I could make it overfit, and I was very happy to show this off to my project sponsors, but of course it was not too impressive just to see an overfit job.

Once it began to overfit, I began to ask myself this question: we’re doing autoencoding, but can we actually do transcoding? In the sense that, instead of sending in an RGB or a depth or a surface-normal patch and reconstructing it back, can you just feed in an RGB and reconstruct depth or surface normals on the other side?

[00:18:42] Technical Challenges Faced During Early Deep Computer Vision

[00:18:42] Vijay Badrinarayanan: Again, these were 32 x 32 patches. And that actually happened. And that’s when I got really, really excited, because now that means any labeling problem could actually be done that way. I knew that this was possible; now, how do you actually go from these really small patches to full-image-size things?

And I didn’t have a GPU at that point in time. It was all CPU-based processing. So I had to learn about GPUs and ask my advisor to get me a GPU, because this wouldn’t fit. And there was no Caffe or anything at that point in time, so I had to develop the code base to actually write out all these layers and optimize them, write some CUDA kernels, etc.

[00:19:24] Gil Elbaz: So you actually wrote the CUDA kernels during your post-doc?

[00:19:27] Vijay Badrinarayanan: Yes. Some layers had to be optimized very specifically. And then there was the scaling phase of going from patches to images.

And that’s when we got the SegNet-Basic work, which has four layers of encoding and four layers of decoding, still sort of trained in some weird way. And, you know, once these things began to work, it was really fascinating, because all of the past four or five years of work, which many generations had done, could easily be beaten by just hitting train on these very, very simple architectures.

[00:19:59] Gil Elbaz: It was a game changer, and it’s amazing that you were at the forefront of this, really experiencing it from a first-person perspective, implementing it yourself, implementing the CUDA kernels, and really driving this forward. It’s amazing to hear.

[00:20:15] Vijay Badrinarayanan: Yeah, it was quite a turnaround and it just felt like being at the right place at the right time and focusing on the right things.

[00:20:24] Meeting Alex Kendall; SegNet

[00:20:24] Vijay Badrinarayanan: At that point, something very interesting happened. There was this very young, extremely smart PhD student who came and knocked on my office door, and his name was Alex Kendall. And he was quite fascinated and wanted to join this project and, you know, try to push this further. And he did in a very big way.

That’s when we moved to Caffe, and then there, again, the dataset was a limiting factor because we were still working on those 700 images. At that point, VGG networks had come out of Oxford and they were pre-trained on ImageNet classification tasks. So our idea was, let’s use that backbone, and let’s use the same kind of SegNet decoder, with the max-pooling indices transferred from encoder to decoder, to make it covariant.
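The index transfer is the distinctive bit of SegNet: the decoder upsamples using the locations of the maxima saved by the encoder’s pooling layers, rather than learning the upsampling. Below is a toy, single-stage illustration in PyTorch; the real network stacks several such stages on a VGG16 encoder, and all sizes here are arbitrary.

```python
# Toy illustration of SegNet-style upsampling: the decoder reuses the encoder's
# max-pooling indices to place features back at the locations of the maxima,
# instead of learning a deconvolution. Single stage only.
import torch
import torch.nn as nn

class TinySegNetStage(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int, n_classes: int):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # keep the indices
        self.unpool = nn.MaxUnpool2d(2, stride=2)                   # reuse them here
        self.dec = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1),
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, n_classes, 3, padding=1),             # per-pixel class scores
        )

    def forward(self, x):
        feats = self.enc(x)
        pooled, indices = self.pool(feats)        # downsample, remember where the maxima were
        upsampled = self.unpool(pooled, indices)  # sparse upsampling at those locations
        return self.dec(upsampled)

logits = TinySegNetStage(3, 64, 12)(torch.rand(1, 3, 128, 128))  # -> (1, 12, 128, 128)
```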

[00:21:07] Gil Elbaz: And did you guys have the intuition back then that using a backbone from a different task, for a different computer vision problem, on a different dataset would be valuable for this computer vision task? So, using a backbone that was trained for classification would be valuable for segmentation in this case?

[00:21:26] Vijay Badrinarayanan: I struggle to recollect what exactly went on in our minds at that point in time, but probably one reason was that the encoder of VGG looked so similar to what we had done. Instead of having one conv, one ReLU, one max pool, it’d be like three convolution layers and then one downsampling step with a max pool, and so on. So it just looked very symmetric to what we were doing. So it was like, yeah, you know, it’s just a weight file. And there was still literature coming out where things were getting more robust in terms of weight initializations. ReLU and batch norm had come in in that period as well.

Starting from that point on, I mean, it was just fascinating because I still remember seeing results just go up straight away. 20% in one day.

[00:22:14] Vijay Badrinarayanan: That was a good day. Because even though you don’t see that often in your research career, it was many years of work. And to see that was really amazing.

So that was when I was slowly transitioning out of Cambridge after writing the paper (SegNet). And you know, it was getting a lot of attention in the media, and it probably seeded the origins of Wayve in some sense as well.

[00:22:38] Gil Elbaz: Did you understand back then the scale or the importance of this paper?

[00:22:42] Vijay Badrinarayanan: Definitely not. One reason being, in fact, the reviewers were not happy with it. At the same time there was the FCN paper from Berkeley, which also had amazing performance. That paper was a lot larger in terms of number of parameters. So I had to figure out like, what was so different about this as compared to that.

And it was mostly talking about efficiency, and also sort of benchmarking not just on simple class averages and global average accuracies; I started benchmarking things on F1 scores, which measured how good the precise delineation of the boundaries was, for instance. Because that really matters for a number of segmentation tasks.

So that’s where efficiency came in: in terms of training time and number of parameters, SegNet was far leaner; it was very symmetric, you could train it faster, and it was performing better on other sorts of metrics. So overall it had better performance. That was really my pitch back to the reviewers.
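For readers unfamiliar with boundary-oriented scoring, here is a sketch of a boundary F1 measure in the spirit of what Vijay describes (not the exact formulation from the paper): extract one-pixel boundaries from the predicted and ground-truth masks, then count matches within a small pixel tolerance.

```python
# Sketch of a boundary F1 score for a single class: extract boundaries from the
# predicted and ground-truth masks, then count matches within a pixel tolerance.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_f1(pred: np.ndarray, gt: np.ndarray, tolerance: int = 2) -> float:
    """pred, gt: boolean masks for one class."""
    def boundary(mask):
        return mask & ~binary_erosion(mask)          # one-pixel-thick boundary

    pred_b, gt_b = boundary(pred), boundary(gt)
    # Allow a small spatial slack by dilating the other boundary
    struct = np.ones((2 * tolerance + 1, 2 * tolerance + 1), dtype=bool)
    precision = (pred_b & binary_dilation(gt_b, structure=struct)).sum() / max(pred_b.sum(), 1)
    recall = (gt_b & binary_dilation(pred_b, structure=struct)).sum() / max(gt_b.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```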

[00:23:43] Gil Elbaz: It’s important to know how to pitch. I mean, it’s always challenging, of course, to change a metric. The team at Berkeley is very well respected and did an amazing job with their paper, but this is also, I think, a really fundamental milestone in computer vision history with respect to its advance into deep learning.

So amazing and kudos to you. Really great job.

I’d love to shift the conversation a bit towards the next steps in your career. So you were doing a postdoc after your PhD, an incredible journey through the top of the academic world.

You know, it seemed like you were on the perfect trajectory to become a world-renowned professor. And so I’d love to hear how you pivoted and made your way to Magic Leap.

[00:24:25] Vijay Badrinarayanan: Before I dive into that, I’d be remiss if I didn’t add that during all of these phases, my PhD and postdoc, I was extremely lucky in the sense that I had two wonderful advisors who I remain very close with even today.

And I was also extremely lucky to work with Alex. He was such a breath of fresh air to work with. And so smart, so quick as well.

[00:24:48] Gil Elbaz: It’s really the people that you work with along the way that help. If you want to shout out to a few people, feel free to.

[00:24:55] Vijay Badrinarayanan: Well, I’ve definitely got to shout out all my colleagues in the lab who put up with my code and who had to do many of their first deep learning projects on my code base, which was all MATLAB-based by the way, which sounds absolutely crazy now. But they all added so much, and I’m extremely happy to see them all doing really well today.

[00:25:15] Transition from Academia to Production Computer Vision at Magic Leap

[00:25:15] Vijay Badrinarayanan: On my academia-to-industry transition: I spent over six and a half years at Cambridge. There came a point in time where I think the industry was really gearing up to invest quite deeply in research, right at the edge of applications, while also staying in touch with the actual academic community. And this has been a common thread of thought and work for me.

So one fine day, I got a call from Gary Bradski, a very well-known person in computer vision and the founder of OpenCV. He was building up this group at Magic Leap, a company I’d never heard of.

I’d seen this flyer somewhere on the halls of Cambridge. It was this incredible description that was inviting biologists, rocket scientists, mathematicians, whoever. Come and join this thing. And yeah, we just got talking and Gary is an excellent salesman and a mentor. It was a fantastic journey for the next six and a half years, you know, living in the Bay Area where I continue to live and work now, taking deep learning from academia into production.

And that was really the whole journey of Magic Leap for me. In 2015, when I joined, there was still a significant degree of skepticism about deep learning in the industry from many well-known people. But with good reason, I must add. It was extremely expensive to train, and at that point nearly impossible to envision these models being deployed on things like glasses or head-mounted devices, which are extremely power constrained. So the notion of where it could actually help in terms of boosting performance was still in competition with simpler, well-engineered, hand-engineered methods, particularly in SLAM and things like that. And end-to-end for everything didn’t really work as well. What we always saw was that things were more robust in terms of the predictions, but never as precise as the job really needed.

[00:27:09] Deep Eye-Gaze Estimation at Magic Leap

[00:27:09] Vijay Badrinarayanan: So there, I chose to work on a whole bunch of projects, like hand tracking, eye tracking, SLAM. So I worked on the eye tracking part because one of the things that was lacking was robustness. You really need to figure out the boundaries of the pupil, the cornea, etc. You could just do a segmentation. And that’s where I got into doing eye segmentation, which was sort of an easy transition and really a quick win in the industry coming in.

Then the real challenge was getting those networks to be a thousand times more efficient. And we actually made that happen: you can see that in the Magic Leap 1, Magic Leap 2, all of these actually have those things.

[00:27:44] Gil Elbaz: Wow. That’s incredible. Just to interject here, I’ve met Gary in the past, the founder of OpenCV. Incredible guy; he really blew my mind when I talked to him, the depth of his understanding of business, technology, and open source, and he’s headed up amazing projects throughout his lifetime. I think OpenCV is probably around 25 years old.

And also, what’s interesting is that OpenCV is really what comes to mind when people think about classic computer vision today. That’s the tool. So hearing that he actually convinced you to come to this deep learning role in industry is interesting, and a bit unexpected.

And so, seeing the success with the eye gaze estimation, I actually think that it isn’t that easy of a problem. There are a lot of challenges with eyes, refraction of light, and there are challenges that have to do with many different aspects of the eyeball itself.

I’d love to hear if it was easier or harder than you expected, if there were specific challenges that you thought wouldn’t be possible, or you would need an enormous amount of data to solve.

[00:28:48] Vijay Badrinarayanan: I think it was the first time we really organized a very large-scale data collection effort at Magic Leap. We hired an external company to collect data, and it was a very messy process. I remember setting up these big boards of markers, and you really had to image the eyes (of the subjects) when the focus was on each point in this big board. And to know that they were actually focusing on that point, because we were also taking other kinds of measurements, particularly for the gaze, we had them hold a laser pointer, look at the point, and also point the laser pointer at it.

So it was a very slow, painstaking, and error-prone sort of data collection process. And so a lot of the work, the time-consuming work, was actually to filter this data, clean it out, discover in the process of training that some of it was actually wrong, and then clean up again. So really, MLOps began to show up in a very big way.

In those days, I do remember seeing some samples of eyes, synthetic eyes, which Datagen had created at that point in time. It was extremely fascinating to see the quality of that. I wish we could have used a lot of it to actually get rid of these problems, because it took such a long time.

Really, data collection was planned for six months, nine months. Bringing people in, getting them through the process, signing NDAs, whatnot. And it was a Frankenstein unit, because we had to cover up the whole Magic Leap hardware appearance so that it wouldn’t leak. So it was a really comical thing in some ways. Painful. In fact, I would say the cool stuff, actually training the network, was really a fraction of the time. Most of the time was actually getting the data ready. And the other part was, of course, making things more optimal. That’s what the embedded engineers really did.

[00:30:46] Gil Elbaz: That’s incredible. Yeah. In hindsight, or in 2022, the way to go about this is just, okay, where do I find my synthetic data? We talk to folks working on eye gaze estimation all the time, and they’re like, okay, show me your synthetic data solution. Looks great. Let’s start. Perfect. And we have this very simple platform for them to just click a button and download the data in a few minutes.

So it’s very cool to hear kind of where we were. And I know that if Magic Leap was doing it today, it would look very different, of course. But back then, that was state-of-the-art; it was pushing the boundary. And I think it was really a bold stance: Magic Leap back then was attacking problems that even academia didn’t know how to solve, back when DevOps hadn’t yet evolved into MLOps, and you were already tackling this problem. You know, back when I was in university, Magic Leap was considered the number one most advanced computer vision place in the world, and we had so much respect and so much excitement around it. I remember when I first tried out the Magic Leap, it was incredible. It blew my mind. I felt like I was looking at the first iPhone, maybe three, four years before it was actually released.

So it’s really a kudos to you guys; it’s very, very cool work.

[00:32:02] Vijay Badrinarayanan: Thank you. I was privileged to work with a number of people who can definitely be called visionaries, including Gary, who wanted to invest in things like deep learning so early on in the industry. My manager, Andrew Rabinovich, and his vision to do so many different things with deep learning, including deep SLAM. That was something he really started.

And we could really push things; a number of amazing publications came out on hand tracking, trying to do end-to-end hand tracking and not this very messy way of fitting hand models. There again, having access to synthetic data is phenomenal. I think we all came to the realization, after seeing how much time you spend just cleaning up data or even collecting data, and still trying to figure out how much bias is in that data, learning the hard way, that it would all be more controllable by just simulating data.

[00:32:56] Gil Elbaz: Definitely. So taking this forward a bit, towards the AV space again, into the whole space of autonomous vehicles at Wayve. What made you decide to go onto this next journey, to the opportunity that you see at Wayve?

[00:33:21] Joining Wayve

[00:33:21] Vijay Badrinarayanan: Certainly, after the release of Magic Leap 1, and beginning to make progress toward Magic Leap 2, the kinds of problems, the fundamental features, were beginning to get more mature. You know, we were able to do eye tracking, hand tracking, SLAM. In the last couple of years, I did more scene understanding as well: object segmentation and detection from multiple views, etc.

In my mind, I was looking for the next step. What is it? All of these approaches used passive datasets. You had large passive datasets, you trained models, you optimized, and then you got them to work on these headsets and did inference for a number of different applications.

But the next obvious thing would be, how can you learn through interactions in the real world? How can you learn continually? So there, the first thing which comes to mind is data. Where are you getting this kind of data today in volume? What are the kinds of applications? And that’s where AV comes into play.

[00:34:24] Gil Elbaz: Definitely.

[00:34:24] Vijay Badrinarayanan: And of course, this was just a background process until one day Alex reached out to me. Magic Leap was going through some troubled times between ML1 and ML2, and he said, hey, what do you think about Wayve? Would that be an interesting thing? Because we have a very fresh approach to the problem, which is much more based on machine learning and less on other things.

And we can talk about that. It was a really bold move in some senses. Wayve was then a less-than-50-employee company based out of London; I was in the Bay Area. But the problem space itself was really challenging, and it was becoming increasingly clear, at least in my mind, that embodied intelligence, where you’re learning continually through interactions with the world, is necessary to create robust AI, to train robust AI models which can learn good priors about the world. And it could be used for multiple different applications. That’s how my jump happened.

[00:35:23] Gil Elbaz: It’s very easy to idealize, but to be at the forefront of practical development of real computer vision at the bleeding edge, really pushing it forward together with very, very strong teams...

It’s incredible.

Let’s start by discussing the concept of AV 1.0, and then move into AV 2.0 and the promise of the new concepts there.

[00:35:47] Vijay Badrinarayanan: Yeah, definitely. Before I dive in, I’d really like to give a shout-out to visionary founders like you, Gil, and Alex Kendall, people who have these really strong beliefs and create this ecosystem where people like me can actually come in and contribute. It’s really a massive challenge, and I’m so fascinated; thank you to all of you amazing founders.

[00:36:08] Gil Elbaz: Thank you. Thank you so much.

[00:36:09] AV 1.0: First-gen autonomy

[00:36:09] Vijay Badrinarayanan: AV 1.0 is probably the first generation of autonomous vehicles on the road. And today I’m extremely happy to see many amazing trials happening here and in many other parts of the world, where you’re seeing companies like Waymo and Cruise starting to launch the ride-hailing in big cities. It’s a massive achievement and massive kudos.

It’s been a hard slog for over a decade to get to this point, and it was influenced by ideas of the past. Certainly, I think AV 1.0 is really predicated on the idea that we’ve got to get perception to a really high degree of accuracy, and the rest of it, prediction and planning, which together with perception are really the three main components of any of these AV stacks, will sort of follow through.

[00:36:56] Gil Elbaz: Can we break down those three components for the listeners a bit?

[00:37:00] Vijay Badrinarayanan: Perception is really the labeling problem. You have a bunch of images coming from multiple sensors. It could be radars, could be LIDARs, could be RGB cameras.

How do you identify the key agents and the static structures in a scene? For example, you want to know where the lanes are, where the lane markings are, the traffic lights, cars, other vehicles, pedestrians, trees. The classic problem of detection and segmentation.

Prediction is really understanding the trajectories of objects, moving objects. So you want to know, for instance, where that agent will be in the next few seconds, and based on that, plan your trajectory. When you say your trajectory, it’s really the ego vehicle’s, the autonomous vehicle’s, trajectory; and it gets more interesting when there are interactions, things like that, you know, when you’re nudging a little bit in heavy traffic at intersections.

Planning, of course, is taking all of this information from perception and prediction, and then coming up with the next trajectory, or options of multiple trajectories, whose execution happens using a controller.

There’s a lot of history behind AV 1.0. Really, the DARPA Challenge is what people will always refer to as the origin of autonomous vehicles. And interestingly, the first DARPA Challenge was won by Gary Bradski and Sebastian Thrun, a nice sort of connection for me as well. So the idea was, let’s focus on perception, and that’s when it got into LIDARs, as LIDARs make it a lot easier to perceive and detect objects. AV 1.0 was driven by those kinds of ideas.
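Schematically, the modular AV 1.0 stack Vijay just walked through looks something like the sketch below. All the type and function names are invented for illustration; the point is the hard hand-offs between stages.

```python
# Schematic of the modular AV 1.0 stack described above. All names are invented;
# the point is the fixed interfaces between stages that each team has to maintain.
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class DetectedObject:
    class_name: str              # e.g. "pedestrian", "vehicle", "traffic_light"
    position: tuple              # position in the ego frame
    velocity: tuple

@dataclass
class PredictedTrajectory:
    obj: DetectedObject
    future_positions: List[tuple]    # where this agent will be over the next few seconds

@dataclass
class PlannedTrajectory:
    waypoints: List[tuple]           # ego-vehicle path handed to the controller

class Perception(Protocol):
    def detect(self, sensor_frames: dict) -> List[DetectedObject]: ...

class Prediction(Protocol):
    def predict(self, objects: List[DetectedObject]) -> List[PredictedTrajectory]: ...

class Planner(Protocol):
    def plan(self, predictions: List[PredictedTrajectory]) -> PlannedTrajectory: ...

def drive_step(perception: Perception, prediction: Prediction, planner: Planner,
               sensor_frames: dict) -> PlannedTrajectory:
    # Each hand-off below is an interface that must be reworked whenever a stage changes,
    # and a place where uncertainty is hard to propagate downstream.
    objects = perception.detect(sensor_frames)
    trajectories = prediction.predict(objects)
    return planner.plan(trajectories)
```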

[00:38:36] Gil Elbaz: So it’s very sensor-heavy, very perception-heavy trying to solve a lot of different deep learning tasks, through many, many different perception-focused neural networks. And then the planning happens after that. And this is part of where we see things starting to break down. We see an enormous amount of engineering there. Right?

[00:38:56] Vijay Badrinarayanan: Yes. I would really say that the key characteristics are being hardware-intensive and being driven by, say, perception metrics. Really, the end task is planning and control, but in practice you’re optimizing for another task, which is perception, with the justification that you need to see an object quickly enough to have enough time to react. Now, that creates these modular stacks.

So there are a bunch of engineers working on making lots of different neural networks for perception, as you rightly pointed out. Then there’s a prediction group, and then there’s a whole planning group. And now obviously you need to stitch them all together. So there are interfaces. So as you change one, then you need to rework the interfaces, right?

So there’s just more engineering effort. The other issue is, how do you propagate uncertainty in terms of inference, downstream from perception all the way to planning, right? It’s not very clear. How do you do it?

In general, you get caught in this microcosm of techniques to improve perception, lots of different loss function engineering, architecture engineering, all of this kind of stuff happens there, right? So these are all issues which motivate AV 2.0.

[00:40:08] On Tesla, LIDAR, and their unique approach to AV

[00:40:08] Gil Elbaz: One thing that I saw that caught my eye was when Tesla took off their LIDARs, they said that there is some uncertainty that’s being propagated, that we don’t know how to deal with the uncertainty from the LIDARs as well as the uncertainty from the vision. But it seems that it’s not adding value at the end of the day, or it’s not adding a substantial improvement in performance. And so they removed the LIDARs.

Now, they took a lot of heat on the internet for this. And it’s very hard to prove whether it is or isn’t valuable, of course, because with LIDAR, on one hand, you are adding 3D geometric information.

On the other hand, if in practice, they weren’t able to prove out the performance improvements, then it does make sense to remove it and lower the costs and maybe simplify the training stack and everything. So I’d love to get your take on this. Is this kind of AV 1.5? Or should they really maintain the LIDAR as long as they are in this AV 1.0 stack?

[00:41:11] Vijay Badrinarayanan: Well, there are sort of two things in what you’re asking. One is the need for LIDARs, which I can address, and the other one I can’t address, regarding Tesla itself and what they’re trying to do. From the point of view of whether we need LIDARs or not, to be honest, the jury’s still out there. Now, my view about this is, as I pointed out earlier, the use of LIDARs is predicated on the fact that you need really high perception under different conditions, in the nighttime, for instance. Now, what we’re really missing here is that when you bring in deep nets, deep nets are extremely good context aggregators, and a lot of the decisions they make are based on overall context. Can you compensate for lack of SNR with just pure camera-based decision-making by using deep nets, which are really good context aggregators, and then get rid of LIDARs? There is an argument to be made that, yes, it is indeed possible to do it. And also remember: the cameras are getting better year after year. You’re getting 4K, which is pretty common and will keep increasing; you have HDR, things like that. So there’s an argument here about the balance between the two, and it needs to be proven out, of course.

I think Tesla is showcasing some of this quite well. They’re able to run their FSD in many, many different parts of the world, whereas most of the other players are really focused on map building, which we’ll come to, and operate in very limited areas. That’s really part of the AV 1.0 stack as well.

Now, of course, robustness is a whole different story there. The other thing to remember is LIDARs are not a cure for everything. For example, snow, heavy rain, fog. All of these are not really great situations for LIDAR and they need to improve there as well.

So in principle, an argument can be made that with cameras and radars, you really can have complete coverage over a vast operational design domain, or ODD as we call it. So that’s mainly where we are focusing our attention, because the AV problem, I personally believe, is a combination of business and technology.

[00:43:13] Gil Elbaz: Definitely. To get it to the masses. You need it to be affordable. You need it to be something that people trust. It’s well beyond just a computer vision problem.

[00:43:24] Vijay Badrinarayanan: Exactly. One good example motivates not going the LIDAR route. Let’s say you want to partner with a fleet for data collection, a fleet which already exists out there with a very successful business. If you go to them and say, I need to mount lots of different LIDARs on your fleet, they’d be like, hmm, that seems like a lot of work for an unproven technology. Right. So things like this also need to be kept in mind when you start to think about and plan for your entire stack, which is sensors plus the brain.

One thing to realize is that a Tesla is not an L4 autonomous vehicle. Tesla is a commercial vehicle, and the liability lies with you as the owner of a Tesla. So it is somewhat AV 1.5, as I think you rightly pointed out, because you really use your camera imagery to get to a local high-definition map, purely constructed from cameras. And they don’t even use radars these days. And then from that, you plan. That is somewhat similar in a sense to AV 1.0, where the local HD map at any point in time is just downloaded from the cloud, because you localize, and then you have your HD maps built out there and saved. When you say HD, it’s high definition, meaning you know exactly where the lane lines and the lane markings are, which lane you will need to be in even before you can actually see it. One thing I would add on Tesla’s approach is that Tesla creates local HD maps from imagery, but there’s no downloading. So they don’t really map any cities.

[00:44:56] Gil Elbaz: So to clarify, in AV 1.0, one of the important parts is having a super high fidelity, 3d environment that you can then download while you’re driving or download the area that you’re driving in and then localize within it using GPS.

[00:45:14] Vijay Badrinarayanan: Yes. I think the localization is an interesting one; it goes beyond GPS, because you also use LIDARs there. Because you want very precise localization, you’re really thinking about centimeter accuracy, and then you can just superimpose the downloaded map, and then you know exactly where the lines are, which lanes you need to be in. And you can plan ahead even without seeing it.

While this can be made to work and some companies like Waymo and Cruise have shown that this can be made to work, there’s still the requirement that you actually need to go and map out cities and continue to maintain these maps. This also introduces this extra requirement on hardware so you can ensure very accurate localization. Now, imagine you’re a meter off, you could be in the wrong lane.

So that’s basically AV 1.0, and real kudos to all the companies who have brought it onto the road. And I think it really is a turbocharge in some ways; all this news just coming out is extremely good for the industry and definitely fantastic for AV 2.0, which is really the next generation.

[00:46:18] Gil Elbaz: So the challenge with the AV 1.0 approach is scalability. This approach can work in specific cities, and you have to invest a lot of capital in maintaining it, but it’s not really a scalable approach across the entire world.

[00:46:37] Wayve’s AV 2.0 Approach

[00:46:37] Gil Elbaz: Now we’re going to talk about the AV 2.0 approach and why it seems that the other approaches are going to need to converge to AV 2.0 in order to really scale across the world.

[00:46:48] Vijay Badrinarayanan: AV 2.0 is a bold approach. Two highlights are a very lean sensor stack–just cameras and radars–and really unlocking the potential of machine learning where you don’t have to rely on crutches, like localizing to an HD map, but really making decisions based on how the scene is really evolving. And of course, by training on enormous amounts of data.

Fundamentally, AV 2.0 is something which is being pioneered by Wayve. It is driven by the very strong belief that, in the future, it is indeed possible to train very robust, very performant and safe neural networks which can really control the autonomous vehicle and not be so reliant on the specific geometry itself. So, I mean, it could be applied to heterogeneous vehicles, whether cars or vans or trucks, for instance, with different geometries.

There are certain key differences, but intelligence wise, it could be shared, and that’s really the whole idea.

Now there are two principles which we work on within AV 2.0. One is, how do you do planning. We spoke earlier about the modular kind of stack, perception, prediction and planning which happens in AV 1.0.

In AV 2.0, we’re really thinking about how to do this in an end-to-end way. You have images from multiple cameras, which go into a neural network, and out come control decisions, meaning this is how much you should be speeding up and this is how much you should be turning. These are the kinds of control decisions we make.

We develop many different techniques within this end-to-end planning approach. And that’s where things like imitation learning, reinforcement learning, all of these very new things come into play.
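To make the input/output contract concrete, here is a toy end-to-end policy in PyTorch: multi-camera images in, two control values out. The architecture and sizes are purely illustrative and say nothing about Wayve’s actual model.

```python
# Toy end-to-end driving policy: images from multiple cameras in, two control values
# (speed and steering commands) out. Purely illustrative of the input/output contract.
import torch
import torch.nn as nn

class EndToEndPolicy(nn.Module):
    def __init__(self, n_cameras: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(              # shared per-camera image encoder
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(                 # fuse cameras, predict controls
            nn.Linear(128 * n_cameras, 256), nn.ReLU(),
            nn.Linear(256, 2),                     # [speed command, steering command]
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, n_cameras, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.encoder(images.view(b * n, c, h, w)).view(b, -1)
        return self.head(feats)

controls = EndToEndPolicy()(torch.rand(1, 3, 3, 128, 256))  # -> tensor of shape (1, 2)
```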

[00:48:42] Programming By Data / Data-as-Code

[00:48:42] Vijay Badrinarayanan: But what is really fascinating, and we see this every day on the roads in London, where we test on our internal fleets, is that you can take the robot (we call these cars robots) and teach it something really new, for example lane changes or moving to a higher speed, etc., just by what we call programming by data: you carefully create data curriculums, you train these neural networks to do this, and lo and behold.

They’re amazing. And London traffic is no mean task to drive through. It’s messy, the weather changes constantly, but you can have these networks do this. So it really powers our belief even more to see this happening.
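One very simple way to picture programming by data: instead of writing new driving rules, you change what the network trains on, for example by upweighting a curated slice of lane-change examples in the training sampler. The sketch below is a minimal illustration of that idea; the dataset fields, tags, and weights are made up.

```python
# Minimal sketch of "programming by data": teach a new behaviour (say, lane changes)
# by upweighting a curated data slice in the training sampler, rather than writing
# new driving rules. Dataset structure and weights are placeholders.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_curriculum_loader(dataset, tags, target_tag="lane_change",
                           boost=10.0, batch_size=32):
    """dataset[i] is an (observation, expert_action) pair; tags[i] labels the scenario."""
    weights = torch.tensor([boost if t == target_tag else 1.0 for t in tags])
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```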

[00:49:29] Gil Elbaz: This is incredible. I want to connect to some of the points that you mentioned. The programming by data: at Datagen, we call it Data as Code. We see data as the way to encode logic into neural networks, I think very similar to the approach that you’re describing. And so this is really something that we, as a company that generates data, see as the ideal way to develop computer vision applications in 2022. And so we’re very connected in this approach.

To recap a little bit, AV 2.0 is trying to learn how to do the control problem, the perception problem, and the planning problem all together in an end-to-end system and really leveraging the power of the deep learning representations that aren’t necessarily broken down into modules, but are actually part of one big system, similar to how we see things in our brain, right? We don’t have three parts of our brain that are separately trying to drive the vehicle, but they’re trying to create one system to drive the vehicle in a coherent way. And by creating this end to end approach, you’re able to actually train new behaviors or new capabilities directly from the data, which is extremely interesting.

I’d love to dive into, okay, how much data, what kind of data? How do you get your data? I think this is also one of the big questions.

[00:50:53] Vijay Badrinarayanan: Right. I mean, it’s a great question. It’s really part two of what we do. Part one being investigating methods to train efficient planners. The other one is data.

[00:51:02] Addressing the Long Tail Problem in AV

[00:51:02] Vijay Badrinarayanan: Now really the heart of the autonomous vehicle project is, how do you tackle this long tail of scenarios? And you’d hear this all the time.

[00:51:12] Gil Elbaz: The long tail in the AV setting. Could you describe that and the challenges there?

[00:51:16] Vijay Badrinarayanan: A long tail is basically more and more rare scenarios you could encounter. The cartoonish examples would be a piano falling down from an apartment somewhere.

[00:51:27] Gil Elbaz: Happens all the time, happens all the time in Tel Aviv.

[00:51:30] Vijay Badrinarayanan: That’s sort of an exaggerated example, you know, but for instance a child running out from behind a vehicle into the road, or suddenly encountering roadworks with new signs, etc., or completely different weather conditions; these are all examples of long tail scenarios. So the long tail is whatever you can imagine that could possibly happen on the road when you’re driving from point A to point B.

The AV 2.0 approach is basically data driven, and as we discussed, using Data-as-Code or programming by data, you have to create this data. Now, the way we do this is really a multi-step approach. The first step is that we created a hardware platform. We supply hardware platforms, meaning the sensors and the compute stack all together, to our partner fleets. And the partner fleets are constantly driving all around the UK. So we’re really gathering a lot of data of scenes and also actions: what a human does.

[00:52:24] Gil Elbaz: So this is human-driven data: real humans driving in London.

[00:52:29] Vijay Badrinarayanan: Right. This is just human driven data. And you can get a ton of this and this really powers our training of neural networks. But you can also think of this as powering your simulators and let’s come to that.

The other kind of data is what we call on-policy data, which is a model actually driving the vehicle. And you have human operators sitting behind the wheel, watching out for any mistakes. And as soon as the model makes a mistake, they take over and correct it. This is also very valuable data. This basically tells us what the model does not yet know how to do confidently. So we have expert data and we have on-policy data; these are the two data sources from which we’re bootstrapping the training of the models.
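A sketch of how those two sources might be represented and combined, in the spirit of DAgger-style data aggregation; the record fields and the simple aggregation function below are invented for illustration, not Wayve’s actual pipeline.

```python
# Sketch of the two data sources described above: expert (human-driven) logs and
# on-policy logs where a safety driver's takeover marks what the model can't yet do.
from dataclasses import dataclass
from typing import List

@dataclass
class DrivingSample:
    observation: dict              # camera frames, speed, etc.
    action: tuple                  # (speed command, steering) actually executed
    source: str                    # "expert" or "on_policy"
    intervention: bool = False     # True where the safety driver took over

def build_training_set(expert_logs: List[DrivingSample],
                       on_policy_logs: List[DrivingSample]) -> List[DrivingSample]:
    # Expert data bootstraps the policy; intervention moments are kept (and could be
    # upweighted) because they show exactly where the model is not yet confident.
    corrections = [s for s in on_policy_logs if s.intervention]
    return expert_logs + corrections
```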

[00:53:13] Powering AV 2.0 With Simulation

[00:53:13] Vijay Badrinarayanan: The other ways you could use this data which we believe will be the key for us is simulation.

There’s an extent to which you can collect real data. But ultimately, if I could plot a graph with useful training data on the Y axis and time on the X axis: you bootstrap with real data and you sort of plateau at some point, but then simulation, which is driven by some of this logged (real) data, will eventually take over. That’s where you can really exponentiate the same scenario. You could create counterfactuals: what if this car was not there? What if there was a child running behind it? Things like this, counterfactual data. You really want to get to a point where simulation is really the main chunk of your training corpus. That’s where you hit the long tail really quickly. There are a number of challenges here, and this is where the focus over the next few years will be for simulation.

[00:54:08] Gil Elbaz: I completely agree with you: the only way to deal with the exponentially long tail is really with exponential data, and that’s only available through simulation or synthetic data. At Datagen we also call it simulated synthetic data. Imagine that you have this simulated scenario of a car driving down a road, very high quality, and you want to multiply the variance.

So you’ve created the simulation. Now you can suddenly change all of the colors of the vehicles, play with that, and create a thousand different variants of vehicle colors. You can change the colors of the road: another thousand variants. You can change the lane markings, you can change the sky, and you can add the moon.

We saw a bug a while back in Tesla’s system where it perceived the moon as a yellow traffic light. So that’s a funny one, but in practice, each of these variants doesn’t stand on its own; rather, it’s a multiplication of them. So you can actually create all of the different permutations of these variants.

So you have a thousand times a thousand times a thousand, and onwards. This is actually where the exponential factor comes from. This is what allows Vijay and his team, in the end, to deal with the long tail: capture very high quality data from the real world and use it to power their simulations.
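
The multiplication Gil describes is literally a Cartesian product over independent axes of variation. A minimal, runnable sketch (the axis values are just examples):

```python
from itertools import product

vehicle_colours = ["red", "white", "black", "silver"]
road_surfaces   = ["dry asphalt", "wet asphalt", "concrete"]
lane_markings   = ["solid", "dashed", "faded"]
skies           = ["clear", "overcast", "night with moon"]

# Every combination of the independent axes is a distinct rendering
# configuration derived from the same base scenario.
variants = list(product(vehicle_colours, road_surfaces, lane_markings, skies))
print(len(variants))   # 4 * 3 * 3 * 3 = 108 configurations; add axes and it multiplies further
```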

And this is also similar to our approach: we collect a lot of very high quality human-centric data, super high quality scans, super high quality motion capture, then incorporate it into our generators and simulators, leverage this power of multiplication, and put it into a very simple platform for computer vision engineers. What you guys are doing is very similar in the AV space. But you’re solving two very challenging things here, both the end to end aspect of driving and also simulation, which is an enormous challenge in and of itself; we at Datagen are only doing the simulation part. So this is extremely impressive, and seeing it connected together is really the basis for a substantial company, a multi-billion dollar, super successful company. It’s amazing to see.

What do you see as the next steps of the AV 2.0 space?

[00:56:23] Vijay Badrinarayanan: Let me take this opportunity to thank many of my colleagues who are in the simulation team and also my colleague, Jamie Shotton, who’s really a champion at Wayve for simulation. Simulation is our superpower, because the use of simulation is not zero or one. It’s not binary. We use it for many different purposes. Simulation can also be used for validation. Even today, you train your models in the real world, and then you test them in a simulator, see how they perform, gain insights, and drive model development. So that’s one of the use cases. It’s really helpful in sensor design and geometry design, for instance.

How do you approach newer markets? You don’t have any real data collected there, but can you procedurally create environments and prove out whether a new platform, be it a van versus a truck, will work?

So there are many different use cases for simulation, even before we hit the ultimate pot of gold: completely sim-trained models that are super performant in the real world, where you constantly feed them any number of long tail cases, way faster than any fleet could possibly collect. That’s really the ultimate thing.

Really the exciting thing, talking about next steps, is machine learning-aided simulation. I think this is really the amazing meeting point. For instance, you have procedural simulators, which are fantastic, but how can you scale them up? How can you add diversity? How can you add realism? How can you add dynamism, and how do you add interaction? These are all really the challenges there, and this is where machine learning could really help. You have, for instance, your logged data from real runs. To that you could, for instance, apply 3D reconstruction and import the results into your procedural simulators, on which you can start adding different assets and then exponentially multiply scenarios, right? That’s one of the things you can do. You can use that to examine the validity of the models under different counterfactuals, “what if” kinds of situations you can drive through. There’s a nice connection between machine learning and procedural simulation.

[00:58:30] Re-simulation, Closing the Loop & Testing Neural Networks

[00:58:30] Vijay Badrinarayanan: Then there’s the other concept which we call resimulation: you take an incident and you simulate it back again, and now you want to verify that your new model would not have made the same mistake; you want to test it. So that’s the resimulation aspect.

[00:58:47] Gil Elbaz: Amazing. So this reminds me a little bit of classic software engineering, of the concept of testing, right?

Doing unit tests around specific scenarios and validating that the code runs correctly in those scenarios. In computer vision it’s a bit harder to test, of course, but what we understand is that data is both the code and also what’s needed to test it; that’s part of Data-as-Code for us, at least. And I think it’s also similar for you guys, really being able to test your networks through specific scenarios. I love the re-simulation concept because you see a real-world failure case and you don’t just create a lot of other data around it, you simulate that exact scenario and then test various models over time to make sure that the same mistake doesn’t happen again. With every test you build, you’re creating stronger and stronger infrastructure for the entire company and fleet.
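
Taking the unit-test analogy literally, a resimulation check can be written like a regression test: replay the logged incident with the candidate model driving and assert that the failure no longer occurs. The `resimulate` helper and the toy one-dimensional geometry below are invented for illustration; they are not a real simulator API.

```python
def resimulate(model, incident):
    """Replay a logged incident with `model` driving (toy 1-D geometry:
    the 'action' is the ego's next position). Return the minimum gap
    kept to the other agent over the whole incident."""
    min_gap = float("inf")
    for observation, other_agent_position in incident:
        ego_position = model(observation)
        min_gap = min(min_gap, abs(other_agent_position - ego_position))
    return min_gap

def test_no_regression_on_incident_0042(model, incident):
    # The original model got too close in this logged scenario; every new
    # model must keep at least a 2 m gap when the incident is resimulated.
    assert resimulate(model, incident) >= 2.0

# Toy example: a two-step logged incident and a model that stays put.
incident_0042 = [(0.0, 5.0), (0.0, 3.0)]
test_no_regression_on_incident_0042(lambda obs: obs, incident_0042)
```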

[00:59:47] Vijay Badrinarayanan: Exactly. This is where I’m really excited about future developments: how can you use this resimulated data, and how do you make it editable so you can create lots of different counterfactuals? The ultimate dream for me would be that you take a long piece of real data and, through machine learning and, let’s say, some semi-automated human involvement, you can create exponentially more scenes. And really the ultimate thing would be to have the model in the loop, telling you “I can deal with this scenario,” and the points where it actually fails are what you add back to your training set. So it’s really model-in-the-loop, active-learning resimulation.

[01:00:28] Gil Elbaz: Would the model drive the simulation? So the model would automatically say, please simulate this scenario for me, create it with variance, and then retrain. And it would also automatically create a test from that scenario, from that failure case?

And then you would have both a new test and data to retrain with, and you could iterate this automatically. So that could really be exponential, even relative to the number of engineers or the amount of actual input that you’re putting in. It’s kind of like a flywheel that you’re developing.

[01:01:04] Vijay Badrinarayanan: Exactly. So this is really where we’re making investments. In fact, I’d like to point out that for the past couple of years I’ve been leading autonomy at Wayve, and now I’m transitioning to be the VP of AI for Wayve. In our new office in Mountain View, we’re really bringing up teams investing in some of the cutting edge of data generation, model-in-the-loop training, active learning, and representation learning, using both sim and real data.

But the focus is primarily the long tail and how you really deal with it: how do you compose new scenes using all of the tools machine learning is providing us with today? And maybe we’ll have to develop new ones, for sure.
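
Putting the last few exchanges together, here is a deliberately tiny, runnable toy of that model-in-the-loop flywheel: mine scenarios the current model fails on in “simulation,” fold them into both the training set and a growing regression suite, retrain, and repeat. The one-parameter “model” and the made-up expert function exist only to make the loop executable; none of this is Wayve’s implementation.

```python
import random

random.seed(1)

def expert(obs):
    # Ground truth to imitate: a gentle correction normally, a hard correction
    # near the road edge (|obs| > 2). A linear model can't capture both
    # regimes, so failures keep appearing at the edges.
    return -0.5 * obs if abs(obs) <= 2 else -1.5 * obs

def fit(dataset):
    # One-parameter least-squares "training" on (observation, action) pairs.
    num = sum(o * a for o, a in dataset)
    den = sum(o * o for o, _ in dataset) or 1e-9
    return lambda obs, k=num / den: k * obs

def fails(model, obs, tol=0.3):
    # "Simulate" a single scenario and check whether the model handles it.
    return abs(model(obs) - expert(obs)) > tol

training_set = [(o, expert(o)) for o in (random.uniform(-1, 1) for _ in range(20))]
regression_suite = []
model = fit(training_set)

for round_index in range(3):
    proposals = [random.uniform(-3, 3) for _ in range(200)]      # simulator proposes variants
    failures = [o for o in proposals if fails(model, o)]
    training_set += [(o, expert(o)) for o in failures]           # retrain on what it got wrong
    regression_suite += failures                                 # every failure also becomes a test
    model = fit(training_set)
    print(f"round {round_index}: mined {len(failures)} new failure cases")
```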

[01:01:44] The Future of AI and Advanced Approaches

[01:01:44] Vijay Badrinarayanan: But there’s a lot of interesting science from the community which is supporting our beliefs as well. I mean, you may have seen some fascinating results from DALL-E 2 as recently as yesterday, which just blow my mind with what you can actually do.

[01:01:57] Gil Elbaz: It’s absolutely incredible, amazing team, amazing scale. And of course the visual results are really mindblowing. Just two years ago this would have been unthinkable, and now you can put a flamingo anywhere you want in your house, and it looks very realistic and adapts according to the context. For anyone who wants to check it out, definitely go look at DALL-E 2. At the time of recording, it had just come out the day before, so I’ve yet to read the paper in full, but it’s definitely on my list.

[01:02:28] Vijay Badrinarayanan: So I’m very excited about some of these machine learning-aided simulators, which are really generative models, because if we can get them into a model-in-the-loop active learning setup, there’s really no limit to what you could do with large-scale models and very simple control signals to train from; it becomes a massive supervised learning problem then.

[01:02:49] Gil Elbaz: There are various ways to go about simulation itself, right? You have the graphics-based approach and you have NeRF-based approaches, the neural radiance field representations of 3D and also 4D scenes using neural networks. There are also other kinds of hybrid approaches, of course, and there are amazing generative methods separate from this, like StyleGAN, for example, which generates amazing photorealistic images. So I’d love to hear where you’re currently focusing with respect to simulation and where you see things developing over time.

[01:03:27] Vijay Badrinarayanan: Currently, our focus is still very much on precision simulation for validation, which is more human-generated, more classical simulation, but we’re definitely bringing machine learning into that for a number of reasons, including training jointly with simulated and real data to kickstart that flywheel, and taking influence from some of the things you’ve mentioned, like GANs, for instance, for realism.

But independently, as I pointed out, resimulation is one of the big things. And there are very interesting connections between our planning approaches and simulation as well, in that our methodologies, like imitation learning, are thought to need a lot of data from different viewpoints, what we call states and actions, right? States are like different images from different viewpoints of where the ego vehicle is. But all of this can now actually be rendered using the latest technologies like ADOP or NeRFs; if you want to know what would happen if the car were placed in a different position on the road, you can actually get that.

So these things are also beginning to power our approaches to planning, to training our planners. The investments are getting deeper into adding realism, diversity, and dynamics to procedural simulators, but also into using resimulation for validation and counterfactuals, and for actually training planners as well.
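
One way to picture the connection Vijay draws between neural rendering and planner training: render views from ego poses the car never actually visited and pair them with corrective actions, as in imitation-learning data augmentation. The sketch below is purely illustrative; `render_view` stands in for a NeRF/ADOP-style renderer and is not a real API.

```python
def corrective_action(lateral_offset_m):
    """Label for the synthesised view: steer back toward the logged human trajectory."""
    return -0.5 * lateral_offset_m

def augment_with_novel_views(logged_frames, render_view, offsets=(-1.0, -0.5, 0.5, 1.0)):
    """logged_frames: iterable of (scene_representation, ego_pose) from a real drive,
    with ego_pose = (x, y, heading). Returns extra (image, action) training pairs."""
    augmented = []
    for scene, (x, y, heading) in logged_frames:
        for dx in offsets:                                       # laterally shift the ego pose
            image = render_view(scene, (x + dx, y, heading))     # "what would the camera have seen?"
            augmented.append((image, corrective_action(dx)))
    return augmented
```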

[01:04:52] Gil Elbaz: So the connection between leveraging NeRF-based technology for planning and leveraging the same technology for simulation seems like a very elegant solution; it makes a lot of sense on the conceptual level. At Datagen, to give some perspective on that as well, we also leverage a lot of generative models in the 3D space: generative mesh algorithms, generative models for high-quality textures, generative models for motion, so motion capture. And by doing this, we’re actually able to expand a lot of classic graphics pipelines in an exponential way.

And we also have a lot of research on GANs that add realism at the end of the pipeline, but we do see NeRF-based approaches as very powerful as well. Some of the research that we’re doing is really based on the latest advances in NeRFs. We do have one NeRF-based capability in production which creates descriptive vectors for the various assets in our asset library, using NeRF-learned descriptions that include both the geometric descriptions and the textural and color descriptions.

And so by leveraging GIRAFFE in its previous iteration, with a few extra steps, we were able to create these very nice descriptors that allow us to search enormous asset libraries of millions and millions of assets to find similar assets or different assets, and to then use this downstream to challenge the models.
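
For a sense of what searching a large asset library with learned descriptors can look like, here is a minimal, runnable nearest-neighbour sketch. The random vectors below simply stand in for the NeRF-learned descriptors Gil mentions, and the scale and dimensionality are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
asset_ids = [f"asset_{i:06d}" for i in range(100_000)]
descriptors = rng.normal(size=(len(asset_ids), 128)).astype(np.float32)
descriptors /= np.linalg.norm(descriptors, axis=1, keepdims=True)   # unit-normalise rows

def most_similar(query, k=5):
    """Cosine similarity = dot product of unit vectors; return the top-k asset ids."""
    scores = descriptors @ (query / np.linalg.norm(query))
    top = np.argsort(-scores)[:k]
    return [(asset_ids[i], float(scores[i])) for i in top]

query = descriptors[42]              # "find assets similar to asset_000042"
print(most_similar(query))
```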

So it’s very interesting to see all of this come together. From our side, we have seen so much advancement in NeRF. We feel like it’s moving so fast that it’s still not mature enough to bring to production, but it’s critical to start investing in it as much as possible right now, because the promise is so enormous.

[01:06:50] Vijay Badrinarayanan: I think that’s really fascinating to hear, Gil, that you’re able to use NeRFs to search and embed these assets, to create exponentially more data, and to render them with as much fidelity as you would in the past with advanced volume rendering techniques.

Yes, I completely agree with you on the state of NeRFs. I’m really impressed by how much 3D structure, texture, and modeling of illumination and reflections can be done with an MLP. It just blows my mind that you can do so much. Secondly, I mean, with just a few sparse views, you can get complete 3D structure.

It’s amazing. That means it’s really learning structure and somehow not “cheating,” let’s say. It significantly simplifies the volume rendering approach, which is great, and it’s also getting cheaper to do. Everything points to: okay, let’s focus on this and start making investments, because this is going to be there.

And all of this is learned fully self-supervised, which is also fantastic. There’s no real labeling involved; all you have is posed images, and you can actually learn this way. I think we need more, let’s say, non-aesthetic applications of NeRFs coming in. One of the areas I’m really interested in is how you can learn good representations of the real world using training methodologies like the ones NeRFs actually use.

You want this end to end neural network to know many things about the world. It needs to know about the 3D structure of the world. It needs to know about objects, that they are permanent, that they don’t just vanish between frames. There’s a smoothness of motion, but you also want it to know about geometric consistency across poses: if you look at objects from different angles, there’s a regularity to all of these things.

So I’m really excited to see all of these different reconstruction and self-supervised tasks come up, because it feels to me like we can use them to train even better representations, on which you can then train planners very quickly. So that’s another perspective on where NeRFs can be quite influential as well.

[01:08:54] Gil Elbaz: And once the simulation itself is differentiable, this is going to open up maybe a 3.0; it’s pretty much the next step. Really, if you have a fully differentiable simulator as well as a fully differentiable autonomous vehicle, you can reach amazing closed loops, I think.

[01:09:14] Vijay Badrinarayanan: Yeah that’s absolutely how we all feel.

[01:09:16] Gil Elbaz: Just asking briefly about explainability and editing: it seems like, alongside the amazing advantage of differentiability, there is also a challenge that arises around explainability.

[01:09:31] Vijay Badrinarayanan: It’s really a very important question we think about all the time, particularly with end to end approaches being fully black boxes.

Now, I think you’re right to use the word explainable. Explainability is a concept where if there is an incident, can you actually figure out why such a decision was made in that particular instance?

So there’s this adjacent word: interpretability. Interpretability is trying to understand what the neural network is learning.

Really, for me, the problem comes down to good representations. How do you find correlations between good representations and decision making in a neural network? Some of the things we’re investigating are: what are the different benchmarks we can come up with to understand how good a representation or embedding of a set of images is for the trained neural network? For instance, given a similar scenario, does it make a similar decision? Or developing a suite of methods where you try to break the neural network and understand which concepts relate to decisions, etc.

It’s really a work in progress, and it will probably develop into a whole suite of ways we can investigate, or poke at, correlations between representations and decision-making.
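
A toy version of the kind of benchmark Vijay describes might correlate distances in the model’s embedding space with distances between its decisions, so that “similar scenarios get similar decisions” becomes measurable. The `embed` and `decide` callables below are hypothetical stand-ins for a real network’s representation and planner head.

```python
import numpy as np

def consistency_probe(scenarios, embed, decide):
    """Correlate pairwise embedding distance with pairwise decision distance;
    a high correlation suggests similar scenarios lead to similar decisions."""
    emb = np.stack([embed(s) for s in scenarios])
    dec = np.stack([decide(s) for s in scenarios])
    emb_dists, dec_dists = [], []
    for i in range(len(scenarios)):
        for j in range(i + 1, len(scenarios)):
            emb_dists.append(np.linalg.norm(emb[i] - emb[j]))
            dec_dists.append(np.linalg.norm(dec[i] - dec[j]))
    return float(np.corrcoef(emb_dists, dec_dists)[0, 1])

# Example with stand-in functions: embeddings and decisions are just slices of a
# random scenario vector, so the correlation comes out strongly positive.
rng = np.random.default_rng(0)
scenarios = list(rng.normal(size=(20, 8)))
print(consistency_probe(scenarios, embed=lambda s: s[:4], decide=lambda s: s[:2]))
```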

[01:10:50] Gil Elbaz: So in your eyes, the path towards interpretability, explainability, or both really, lies in the representation itself, in the embedding itself.

And so if the important information is held within the embedding, it can later be extracted and disentangled, and we can actually understand downstream what happened and why, which is extremely interesting. It reminds me of similar challenges in what we’re doing around identities.

Similar to that, human identities are hard to define, of course, and you really want to understand what breaks the network. If you have two people that are very similar, are they the same identity or not? Can we actually poke at it, can we change the identity in a local embedding space to see how that affects the neural network? It’s a different domain, but very similar methodologies.

[01:11:47] Vijay Badrinarayanan: That’s a great example as well. Similar thinking.

[01:11:50] Are there other 2.0s? Next industries to revolutionize

[01:11:50] Gil Elbaz: Just to close off on the AV 2.0 space, do you see other spaces needing to shift into a 2.0? On a fundamental level, the ability to take a step back, look at all of these companies that have advanced a lot in the past 10 years, and say: okay, I’m going to come in with a different approach, a new approach that makes sense at the essence of what we’re doing, really bring it forward, raise substantial funds, grow the team, and create this new solution.

Are there other 2.0s that you see as very important?

[01:12:25] Vijay Badrinarayanan: I think at the heart of it all is really Data-as-Code or programming by data, so it has very wide applicability. I really feel that the entire spectrum of robotics applications is going to get disrupted at some point; I think there are 2.0s to everything, including simulation itself. I call it ML-aided simulation; you could call it “Sim 2.0”, and that would be an appropriate moniker for it as well. These are the two spaces where I see there’s going to be a tremendous amount of disruption.

I’m also really fascinated to see code generation. I haven’t really dug deep into it, but it also seems extremely disruptive as a technology.

[01:13:11] Gil Elbaz: Definitely. Just thinking about that. It made me think that the graphics engines themselves, Unity, Unreal Engine, those will also need to be disrupted at a fundamental level going forward. So maybe Sim 2.0 is the next Unity or the next Unreal Engine.

What are the next steps for Wayve?

[01:13:41] Vijay Badrinarayanan: Thank you for all of these very interesting questions and allowing me to share my thoughts on this platform.

[01:13:48] Next Steps for Wayve

[01:13:48] Vijay Badrinarayanan: When it comes to Wayve itself and its AV 2.0 approach, our next step is to showcase that we are going to be platform agnostic. Really, we are going from cars to vans, for instance: the same brain on a different platform, with different control systems, but we can still make it work. That’s really where we are headed, because it really expands our data collection ability in the real world at very low cost. And this will eventually drive both model training and simulation, creating this virtuous flywheel of data that keeps going. Those are really our next steps, and that’s what we’re invested in.

[01:14:27] Gil Elbaz: Wonderful. This is amazing. Really, a company that has grown in an extremely impressive way has built its fundamentals on such advanced, groundbreaking technology.

This is quite amazing. It’s not the second iteration, it’s not the third iteration; it’s a company that was built from scratch, based on true fundamentals and really looking forward into the future at how to create AV the right way.

[01:14:59] Human-level AI

[01:14:59] Gil Elbaz: A classic question: General AI, or even simplifying it a little bit, human level AI; do you foresee human level AI in our lifetimes?

[01:15:10] Vijay Badrinarayanan: It’s a really interesting question because, for one, defining what it is is really open-ended. But I definitely see beyond-human-level capabilities in our lifetimes in a number of different applications, and I personally would be quite happy to see that happen.

Human level AI, where you’re really thinking about a higher-level consciousness, etc., is something I’m not too certain about, in the sense of whether the models have a realization of what they are doing, for instance. Those kinds of things I’m not really certain about.

But in terms of the ability to do tasks, I think it will definitely exceed human level capabilities in our lifetime. For sure.

[01:16:01] Gil Elbaz: I think we see a lot of narrow AI today, doing really well at one or very few tasks, and that’s going to get wider and wider as we progress; I’m definitely certain about that. The human-level consciousness question is always one that we ponder, but it’s probably impossible to know today whether it’s going to happen or not. For me personally, though, it’s extremely interesting to get the perspectives of top professionals in our field, take a nice poll, and understand how they see this.

[01:16:35] Tips for Computer Vision Engineers

[01:16:35] Gil Elbaz: Last question. For the younger folks just starting off in the computer vision space: what do you recommend to do career-wise? It could be, what they should learn, how they should approach their first jobs. What do you think is very important early on in the career?

[01:16:52] Vijay Badrinarayanan: I think I’m biased, but they should all work on AV 2.0, that’s the thing! And Sim 2.0, I feel.

Really, my advice would be the thing which has always held me in good stead, which is to read widely and connect widely, because you never know what will spark your next idea and where it will come from. It could be some really faraway thing. Whether you’re a PhD student or already in industry, I think this is a practice you should continue.

And it’s increasingly harder today because there’s so much information being thrown at you. Find ways to stay on top without getting overwhelmed; each person has a different approach to this. I always tell my teams, you know, you should follow these people on Twitter, because you’re going to get a nicely curated flow of information which will keep you well-informed.

[01:17:44] Gil Elbaz: I think that’s a perfect way to summarize this. Read very widely, broaden your vision, and see what this industry is really doing beyond the narrow scope of a specific project. I think that’s amazing. And use Twitter: it’s an amazing tool, and I also get a lot of really useful information from top ML engineers there. So that’s definitely a great takeaway from my perspective.

Thank you, Vijay, very, very much for joining me today. It was a really great conversation; I personally learned a lot. And I think that you guys at the cutting edge are really inspiring not just me personally, but an entire industry, and hopefully all of the folks listening here. So thank you very, very much, Vijay.

[01:18:26] Vijay Badrinarayanan: Thank you, Gil. I think this was a fantastic podcast. We covered so much ground, and I love the way you summarize everything at different junctures. Hopefully our listeners like what we’ve been talking about. Thank you once again for inviting me over.