[00:00:00] Andrew Davison: I think I was always interested in trying to build a system that could understand the 3D world around a device and enable intelligent action. Some people think of SLAM as purely that localization problem, whereas for me, it’s this ever-growing target where what we want is a robot to build a representation of the scene around it.
What we’re really trying to do in Spatial AI, I think, is to invert that process to go from the raw images back to that sort of scene graph, which I think is what in the end really enables intelligent action.
[00:00:34] Gil Elbaz: Welcome to Unboxing AI, the podcast about the future of computer vision, where industry experts share their insights about bringing computer vision capabilities into production.
I’m Gil Elbaz, Datagen co-founder and CTO. Yalla, let’s get started.
This week I’m happy to host Andrew Davison, Professor of Robotic Vision at the Department of Computing, Imperial College London. In addition, he’s the director and founder of the Dyson Robotics Laboratory. Andrew pioneered the cornerstone algorithm for robotic vision, gaming, drones, and many other applications.
Yes, you know what I’m referring to: SLAM, simultaneous localization and mapping. He has continued to develop the SLAM direction in substantial ways since then. His research focus is on improving and enhancing SLAM in terms of dynamics, scale, detail level, efficiency, semantic understanding of real-time video, and more.
Under Andrew’s guidance, SLAM has evolved into a whole new domain of Spatial AI, leveraging neural implicit representations and a suite of cutting-edge methods to create a full, coherent representation of the world from video. Welcome to Unboxing AI. I’m so happy to have you here with us, Andrew.
Andrew Davison: Oh, a pleasure to be here.
Gil Elbaz: So I would love to maybe take a step back and really dive into the concept and story around SLAM, in maybe the first versions of this, starting out back in the nineties. Really, if you could tell us how you got into this and what were the first steps that you took in this?
[00:02:07] Early Research Leading to SLAM
Andrew Davison: Sure. So I started my PhD in 1994, having previously done an undergraduate degree in physics.
So I had a mathematical background. I’d always had an interest in computers. I’d done quite a bit of programming on my own time and I was thinking about doing research, but there was nothing particular in physics that attracted me to do a PhD. And I heard that there was a robotics group in the Department of Engineering in Oxford where I was studying.
So I invited myself over and ended up doing a PhD supervised by Professor David Murray. And really what he was working on at the time was active vision. So this means robotic cameras, which could be controlled and moved and pointed. They’d done quite a lot of work on using an active vision system to track moving targets.
And this was a stereo active head. It was called Yorick, and it was mounted on a fixed base, and they would track things moving past it. But really the start of my project was a new version of Yorick, which was a bit smaller and which was able to be mounted on a mobile robot platform. And really the start of my PhD was, what can we do with this?
And people had tried to do a few things with it, most obviously active vision things like, while the robot is moving, use the active cameras to track some kind of goal or target and therefore have a steering behavior. But what David and I were interested in was how could you enable more sort of long-term intelligent action from a robot that had this visual capability. And that really led us quite soon into mapping: how does this device actually understand the scene around it in a more persistent way? And so how could it use these cameras to make some sort of map of landmarks around it? And that was what really led me into SLAM.
Gil Elbaz: That’s super interesting. And I think that bringing together both localization and mapping is not necessarily trivial, right? Until the robot can actually move around complex environments, you don’t need to do both at the same time. But once you actually combine this new hardware paradigm in a way, it leads to new use cases and needs on the algorithm side of things, which I think is quite fascinating to see happen together.
And so to get a bit of context, I would love to understand: what was the first version of SLAM? What was it able to do? And I’d love to maybe dive from there into MonoSLAM in 2003, and afterwards we can talk more about the modern SLAM variants.
[00:04:49] Why SLAM
Andrew Davison: Sure. Yeah. So I think as soon as we started thinking about this robot building a map of the space around it, you realize quite soon that these two problems of localization and mapping are very much tangled together.
There’s loads of work on either part of that. So localization is the problem of estimating where you are, given that you have a map, and you can think about that as a triangulation sort of problem. If you have something like a camera that can measure the angles to some points in the world, and you know where those points are (we were actually using a stereo camera, so you get some observation of depth as well from stereo),
then it’s a question of working out where you are. And then there’s the converse question of mapping. If I know where I am, let’s say I’m walking around a space and I have a GPS sensor giving me accurate position, and then I notice some things around me and I want to try and build them into a map, I can read off where I currently am, I can maybe add on the distance that I’ve measured onto that, and then I can put things into a map.
But when you are trying to do both of those two things at once, the two estimates become very tangled together. [00:06:00] And to really make progress on that, you need to think about it as a joint estimation problem. So we started to discover the literature, essentially in probabilistic estimation.
So at the time things like Kalman filtering were very much the tools that we were using. There were other great researchers around the world starting to get interested in these problems. We were inspired by people like Hugh Durrant-Whyte and John Leonard, who’d worked on similar problems in the past, and really started to try and put that sort of system together.
And so the earliest versions of the SLAM systems that we built were very sparse, feature-based mapping systems. So when we were mainly working on indoor robots, we would try and make a map of a room that a robot was moving around with maybe 30 or 40 landmarks. So these were very sparse points, like the corner of a door, automatically found by the robot as things that might make good long-term landmarks.
We would make many measurements of those as the robot moved and try and get a good probabilistic estimate of their locations. So we produced one of the very first systems that actually worked in real time to do this. Certainly one of the very first that was using vision as the primary sensor.
But then there were several other groups around the world back in the nineties that were doing quite similar work, usually with different sorts of sensors like sonar or early versions of laser range finders as well.
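Editor's sketch: the "joint estimation" point above can be made concrete with a toy example. The following is a minimal one-dimensional Kalman filter, not the system described in the interview; the state layout, function names, and noise values are all assumptions chosen for illustration. The robot position and a single landmark position are kept in one state vector with a shared covariance, so a relative measurement updates both at once and makes their errors correlated.

```python
# Toy 1D SLAM with a linear Kalman filter (illustrative sketch only).
# State x = [robot position, landmark position]; P is its 2x2 covariance.
# All names and noise values here are hypothetical.

def predict(x, P, u, q):
    """Robot moves by odometry u with motion noise variance q; landmark is static."""
    x = [x[0] + u, x[1]]
    P = [[P[0][0] + q, P[0][1]],
         [P[1][0], P[1][1]]]
    return x, P


def update(x, P, z, r):
    """Fuse a relative measurement z = landmark - robot with noise variance r."""
    # Measurement model is linear with Jacobian H = [-1, 1].
    innovation = z - (x[1] - x[0])
    PHt = [P[0][1] - P[0][0], P[1][1] - P[1][0]]   # P * H^T
    S = -PHt[0] + PHt[1] + r                       # H * P * H^T + r
    K = [PHt[0] / S, PHt[1] / S]                   # Kalman gain
    x = [x[0] + K[0] * innovation, x[1] + K[1] * innovation]
    HP = [P[1][0] - P[0][0], P[1][1] - P[0][1]]    # H * P
    P = [[P[i][j] - K[i] * HP[j] for j in range(2)] for i in range(2)]
    return x, P


# Robot starts at a known position; the landmark is initially unknown.
x, P = [0.0, 0.0], [[0.0, 0.0], [0.0, 1e6]]
x, P = update(x, P, z=5.0, r=0.01)     # first sighting initialises the landmark
x, P = predict(x, P, u=1.0, q=0.04)    # moving adds robot uncertainty
x, P = update(x, P, z=4.1, r=0.01)     # re-observation corrects both estimates
print(x, P)
```

After the second update the off-diagonal entries of `P` are non-zero: the robot and landmark estimates have become coupled, which is the "tangling" that a separate localizer and mapper would ignore.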
Gil Elbaz: Back then, did you think about vision as the kind of ultimate sensor in a way because of how similar it is to how humans perceive the world?
Or was that by chance that you were working on vision in that aspect?
Andrew Davison: I think I always had an idea that it would be the long-term thing that we would end up using. So yeah, clearly there’s that biological argument that humans and animals seem to be able to navigate spaces using primarily vision.
So that was very convincing.
Gil Elbaz: And the spaces were also built for us in a way. They were built for, yeah, visual systems in many ways.
Andrew Davison: Definitely, if you’re thinking about indoor robotics and systems that will operate alongside humans, of course. And then I’ve always had one foot in computer vision and one foot in robotics, and I was working in a group where most of the research was in computer vision.
[00:08:20] Computer Vision Based SLAM
Andrew Davison: I actually happened into this amazing environment in Oxford, where there was our active vision group, we had Andrew Zisserman’s Visual Geometry Group in the room upstairs, and we had Andrew Blake’s group doing visual tracking in the room next door. So I was surrounded by people interested in computer vision, and in particular the sort of geometrical computer vision that was going on at the time, which was very much the same sort of problems of cameras moving through spaces, estimating camera positions, estimating the positions and geometry of things. That was progressing, and it was such a big thing. And I think our kind of niche and interest in particular was how to make that real time, how to actually put it onto a robot. That was where we thought there was a special opportunity.
Gil Elbaz: And can you tell us a little bit about MonoSLAM? So the algorithm from 2003?
[00:09:18] MonoSLAM Breakthrough
Andrew Davison: Yeah, so after my PhD, which I finished in 1998 and in which we built this robotic SLAM system, I then actually went to do a two-year postdoc in Japan, where I continued to work on active vision for SLAM in one of the national research labs over there. And I think it was really during that time that I thought what we developed was very interesting.
I think one thing was it was hard to get people interested in it, because it was obviously quite a complicated robotic system. It was not something that other people could easily re-implement or even try something similar to, because it seemed like you needed this really [00:10:00] complicated hardware.
And so one thing that was very much inspiring me towards MonoSLAM was can we strip away the hardware here? Can we do it with the simplest hardware system possible, which would be just a camera and a computer. So that was always in, in my mind, could you do real time SLAM just by plugging in a webcam to a standard computer?
And I think the other thing on my mind was, having interacted with all these computer vision people, trying to move into a world that they had to take an interest in, had to pay attention to. And there they were, working purely with images and showing what you could do. So yeah, the reason to simplify the hardware was for both of those reasons. But anyway, actually making that work took me quite a few years.
So between finishing my PhD and the original version of MonoSLAM, published in 2003, was a five-year period, because there were quite a lot of technical challenges to solve. So because we didn’t have a robot, first of all, some of the things that make it easier with the robot were gone. So a robot was moving on a 2D ground plane.
That was one strong assumption we always had. Even though the map was always in 3D, the robot motion was always in 2D. Also, we had robot odometry, so you can count wheel turns and you’ve got some pretty reasonable estimate of where your robot is before you even look at the images. So in order to replace those things, first of all, I had to learn much more about 3D.
How do you properly model the 3D motion of a camera, rotations? We used a kind of smooth motion model in MonoSLAM. So what do I know about the motion of a camera without the odometry? I’ve got some knowledge just based on an assumption that it moves in a fairly smooth way. And then we wanted this system to also work fast, so the camera’s moving around at some handheld speed, or it might be later on attached to a humanoid or a wearable system.
So we needed to run 30 frames per second was always my target. So how are we going to do that? And that was interestingly where an idea from the earlier work with active cameras really came in. So even though we were not controlling the direction of the camera in MonoSLAM, like we were in the active camera system, what we could control was the processing.
So we had this concept of active selection of measurements. So whereas with the active camera you had to really think about where you were going to point the camera, which feature to measure at each time step, in MonoSLAM we could control the sort of processing focus. So we would, on each new frame, think about which features are likely to be visible here, and then make a prediction of where they’re going to be.
And then specifically search small regions of the image. And that was the thing that really allowed us to make this system run fast. So the outcome of all of that was, yeah, this system we could demo, where I could just hand-hold a camera, wave it around in a room, do real-time SLAM, build a map. It was somewhat more detailed than in the earlier days.
So maybe we would have a hundred landmark features in a room. We could track the position of the camera maybe to a few centimeters accuracy, something like that. But that was good enough to really prove, I think, that we were onto something interesting. And one of the things I really did with MonoSLAM was it was all about the live demo.
And I would try and show this demo to as many people as I could. And that was where a lot of the application interest really came from. So people realized, Oh, I could put this on a vacuum cleaner robot. I could put it on an augmented reality system and track its motion, that kind of thing.
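Editor's sketch: the active-search idea described above (predict where a feature should appear, then look only in a small region) can be illustrated in a few lines. This is a simplified stand-in, not MonoSLAM's implementation: MonoSLAM applies its smooth motion model to the camera's 3D state, whereas here the constant-velocity guess is made directly in pixel space, and the function names, the 3-sigma box, and the sum-of-squared-differences score are all assumptions.

```python
# Active search sketch: match a feature patch only inside a predicted window.
# Names and parameters are hypothetical; images are plain lists of rows.

def predict_pixel(prev, velocity, dt):
    """Constant-velocity guess of where a feature will appear in the next frame."""
    return (prev[0] + velocity[0] * dt, prev[1] + velocity[1] * dt)


def search_window(pred, sigma, image_size, n_sigma=3.0):
    """n-sigma box (u_min, u_max, v_min, v_max) around the prediction, clipped to the image."""
    (u, v), (su, sv), (w, h) = pred, sigma, image_size
    return (max(0, int(u - n_sigma * su)), min(w - 1, int(u + n_sigma * su)),
            max(0, int(v - n_sigma * sv)), min(h - 1, int(v + n_sigma * sv)))


def ssd(image, patch, u, v):
    """Sum of squared differences for the patch placed with its top-left corner at (u, v)."""
    return sum((image[v + j][u + i] - patch[j][i]) ** 2
               for j in range(len(patch)) for i in range(len(patch[0])))


def active_search(image, patch, window):
    """Try every top-left corner inside the window; return (score, u, v) of the best match."""
    u_min, u_max, v_min, v_max = window
    ph, pw = len(patch), len(patch[0])
    h, w = len(image), len(image[0])
    best = None
    for v in range(v_min, min(v_max, h - ph) + 1):
        for u in range(u_min, min(u_max, w - pw) + 1):
            score = ssd(image, patch, u, v)
            if best is None or score < best[0]:
                best = (score, u, v)
    return best  # None if the window contains no valid placement
```

The saving comes from the window size: a 7x7 box holds 49 candidate placements, against hundreds of thousands for an exhaustive search over a VGA frame, which is the kind of reduction that makes a 30 frames per second target plausible on modest hardware.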
[00:13:47] Applications of SLAM
Gil Elbaz: I’d love to dive even more into the applications themselves, because they are very wide and it is extremely interesting. The first applications that you guys were looking at, what were they? Were they really around cleaning robots, or were they around something else?
Andrew Davison: So at the time we were building MonoSLAM, I think I had a fairly broad view about what the applications could be. So certainly robotics was always the primary interest.
And in Japan, for instance, when I’d been working there, there was a big group working on humanoid robots, and we were really thinking, okay, how would we give this robot a scene understanding capability? A general 3D SLAM system seemed like what it needed. But there were other people around who were very inspiring in terms of applications.
So one of my colleagues in Oxford was Walterio Mayol-Cuevas, and he was building a wearable vision system. So it was a collar that he wore that had a little camera system on it, and it was actually even motorized and could point in different directions.
So the forerunner of a lot of recent, reinterest again in egocentric vision. So that could be some sort of assistive device which could help a human, maybe some non expert in a domain was getting advice on how to build something or how to dismantle something from this assistive device.
So that was another thing. And yeah, as I said, as we went around just demoing this system, more and more people would come up to us with ideas about what could be done with it. So for me the most concrete thing that came out of that was when I spoke to researchers at Dyson, who at the time were working actually on robot vacuum cleaners.
So they’ve actually been interested in robot vacuum cleaners for a long time, but had this idea that their robot should be better than a random bounce system, which was what you could get at the time from companies like iRobot. They wanted a system that could really clean a room systematically, know what it had cleaned and what it hadn’t, know when it had finished, and be able to pass each part of the floor exactly once.
So that was very motivating for SLAM, and the fact that I was able to show them this real-time SLAM system that worked with a single camera, again, just allows you to think that could be something that we could make cheap enough to actually put into a consumer product. So I ended up then working very closely with Dyson for a number of years after that to help them design the SLAM system within their first robot vacuum cleaners.
[00:16:27] Modern Versions of SLAM
Gil Elbaz: Yeah. In many cases, lowering the barrier of use for some of these larger algorithms, these more complex algorithms, suddenly enables these new waves of applications. And it’s completely amazing to think that beforehand it was just too heavy, it wasn’t practical, wasn’t good enough. You invested five years, a lot of effort in really improving and iterating and creating this initial algorithm and showing it to the public in such a kind of front facing way, in such a forward looking demo.
That’s really powerful. I think. It’s also quite interesting that you did it so early on back in 2003, I’m sure that not every algorithm had a fully working demo that was with a handheld camera that other people could also use and play with. That’s very much akin to things that we’re seeing now in 2022, that you have these online demos that can be worked with and played with and you can actually see the results.
So doing that 20 years prior is incredible. I would love to maybe, with respect to SLAM, touch a little bit on the modern variants of SLAM. Back then we had a small number of key points in a room, a single camera. We don’t have that many assumptions, but you did mention assumptions around a uniform height for some of the applications.
What are the more modern variants of SLAM? What do they give us? What are the assumptions there? I looked at dense semantic SLAM prior to this, and SLAM based on event cameras, which is also super interesting for me. And then the real-time learned representations for real-time monocular SLAM, that also was extremely interesting.
If you could touch on some of those, that would be great.
Andrew Davison: So there’s been so much going on since that time. So one thing that’s clearly happened is SLAM has become productizable in various areas. Elements of SLAM technology have definitely gone into things like autonomous driving. I’ve never personally worked in that area, but in the things I know more about,
consumer robotics, AR/VR systems, drones, pretty much all use SLAM. Yeah, drones, and all sorts of emerging robotics types of applications. So interestingly, the way that those systems work, mostly, if the main interest is really accurate motion estimation, the methods I would say are not that different from the ones that we were using 20 years ago or so.
So of course there have been many kinds of developments and improvements. The way that you [00:19:00] detect and track the key points, the way that you do the probabilistic estimation behind the scenes, the way that you fuse it with other sensors, and especially inertial sensors, which have turned out to be super important in doing general visual SLAM.
Those are all developments which have come in more recent years. But in a high-quality SLAM system that is, for instance, now built into an iPad or is running on a drone from Skydio or DJI or something like that, the main sort of stuff behind that is quite similar to that sort of feature-based geometric estimation type of SLAM system.
Meanwhile, in the research world, and I think gradually moving towards applications, is a vision that SLAM can be about much more than just estimating position, and increasingly about discovering more and more useful information about the world that can actually be used for higher-level [00:20:00] behavior or intelligence of things like robots.
So then you have the concept of dense mapping, where you’re not just trying to build a sparse set of landmarks, you’re trying to find a full detailed geometric map of the scene, and then also semantic scene understanding, where you are trying to understand objects and their locations and context and all of those things.
I would say work in those areas is still very much ongoing: how to do it well, how to do it accurately, how to do it efficiently. There’s lots of people trying things out over time. Yeah.
[00:20:34] Gil Elbaz: We’ve seen in the past. So we’ve seen both a big push from the companies working on AR/VR, for example, pushing very hard on the capabilities of SLAM.
We, of course, generate synthetic data. We’ve also, in the past worked with some of these companies and we’ve helped them create different 3D synthetic environments that are fully labeled and a full 3D ground truth to help also improve their SLAM systems. So we [00:21:00] started working on it like around probably 2018, we had our first projects around SLAM.
Since then, we’ve seen a lot of amazing progress happen, especially with the combination of these SLAM-based algorithms and deep learning coming together in a kind of coherent way. So key points now are often found with deep neural networks, and many times the descriptors of these key points are features that are extracted from deep neural networks.
And so this is an extremely kind of interesting domain. When we talk about SLAM and we’re talking about localization and mapping. So the map, you mentioned semantic understanding as well as the 3D environment, but what is like the ideal map when we look forward in time? What is the ideal SLAM system supposed to be?
[00:21:50] Spatial AI
Andrew Davison: Yeah, certainly something I’m thinking about all the time and still don’t have a good answer to. So recently I’ve started using this term Spatial AI to describe the thing that I’m interested in, which in my mind is not really different from what I was always interested in. I think I was always interested in trying to build
a system that could understand the 3D world around a device and enable intelligent action. It’s just, yeah, some people think of SLAM as purely that localization problem, whereas for me, it’s this ever-growing target where what we want is a robot to build a representation of the scene around it. So that word representation I think is crucial.
So what are the properties of that representation of the world that you would really like to have? So I think it should be geometric and semantic. I believe that a lot of the things we would like robots to do require some sort of persistent, close to 3D representation. So this is always a kind of point of discussion in AI, maybe in embodied AI
in general: can you just train a system end to end to do tasks? And maybe you can in the very long term. But I would strongly say, think about something like a robot that might be able to clear up a kitchen. Think about what it would need to understand about that scene in order to, let’s say, repetitively move backwards and forwards, pick up all the cutlery and put it in the right place.
It’s very hard for me to imagine a system like that operating without something close to a 3D model of that scene being stored and updated. And that’s not to say that it should be a millimeter-precise representation. I actually think of it as something much more, maybe locally precise, but globally more sort of soft and graph-like, which I think maps to our human sort of experience of understanding places.
When you are looking at that scene right in front of you, I do think you have a very detailed, maybe even millimeter precision understanding of where [00:24:00] those objects in front of you are. And you can think about, really tricky novel things that you could do with the objects in front of you and you can mentally simulate that scene.
Whereas as soon as you turn away from part of the scene, I think it fades down to something much more soft and graph-like. My vision of the sort of representation we’re trying to build is something like an object kind of scene graph with probably locally quite precise information.
Globally something a bit softer. So one of my former PhD students, actually, Renato Salas-Moreno, had this interesting thing. Something you’ll often hear is that computer vision is inverse graphics. And he would say, no, I think computer vision is like inverse video game development. So he’d actually worked in the video game industry before coming to do a PhD.
And if you are a video game designer, laying out a level in some [00:25:00] 3D game, you have a big set of 3D assets, you would design the rooms, you would lay out which sort of objects are gonna be present in this room. And then the system can render it and give you photorealistic views.
Somehow, what we’re really trying to do in Spatial AI, I think, is to invert that process, to go from the raw images back to that sort of scene graph, which I think is what in the end really enables intelligent action.
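Editor's sketch: to make "locally precise, globally soft and graph-like" a little more tangible, here is one possible data layout. It is purely hypothetical and is not a representation proposed in the interview: every class, field, and method name is invented for illustration. Objects carry precise coordinates only in the local frame of the place they belong to, while places are related to each other by rough, unoriented links.

```python
# Hypothetical object-level scene graph: precise inside a place, loose between places.
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    label: str            # semantic class, e.g. "fork"
    position_m: tuple     # (x, y, z) in the local frame of its place, in metres


@dataclass
class PlaceNode:
    name: str
    objects: dict = field(default_factory=dict)   # object id -> ObjectNode


@dataclass
class SceneGraph:
    places: dict = field(default_factory=dict)    # place name -> PlaceNode
    links: list = field(default_factory=list)     # (place_a, place_b, rough_distance_m)

    def _place(self, name):
        return self.places.setdefault(name, PlaceNode(name))

    def add_object(self, place, obj_id, label, position_m):
        self._place(place).objects[obj_id] = ObjectNode(label, position_m)

    def link(self, place_a, place_b, rough_distance_m):
        """Record a soft connection between two places; no global coordinates are stored."""
        self._place(place_a)
        self._place(place_b)
        self.links.append((place_a, place_b, rough_distance_m))

    def find(self, label):
        """Return (place, object id) pairs for every object with the given label."""
        return [(p.name, oid) for p in self.places.values()
                for oid, obj in p.objects.items() if obj.label == label]


graph = SceneGraph()
graph.add_object("kitchen", "fork_1", "fork", (0.412, 0.130, 0.905))
graph.link("kitchen", "hallway", rough_distance_m=4.0)
print(graph.find("fork"))
```

A query such as "where are the forks?" is answered from the graph structure alone, and millimetre-level geometry is consulted only once the robot is in the relevant place.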
[00:25:26] Gil Elbaz: Very interesting. Yeah, there are two points that are two things that I wanted to dive into.
One is that you mentioned together with localization and mapping, you also mentioned simulation as something that was coherent and it actually struck me as an interesting point, like it could be that part of really understanding the scene as you mentioned before. Part of really understanding it is also the ability to simulate different scenarios within that scene and maybe that holds a place in the future of these kind of SLAM systems or semantic understanding of [00:26:00] the environment in a way that was just an interesting point that struck me.
[00:26:04] Implicit vs. Explicit Scene Representations
[00:26:04] Gil Elbaz: I will say that it seems to me that there is a big question around whether we want to really define an explicit scene graph representation, or whether it can be an implicit representation that doesn’t have any real queryable structure. So if we take, for instance, the graphics engines, the way that they hold all of the information in them is super structured, right?
It’s something that we can program. It’s something that is very well defined. And so we have a tree of, let’s say, a person, and that has eyeballs and it has irises, and you could find the entire graph and understand all of the assets that make up that 3D simulated person, for example. And then on the flip side, we have systems like NeRF that have these internal representations of 3D information that are represented within a neural network in a way that is much softer, much [00:27:00] harder to query,
but does actually hold semantic information in a strong way. And so maybe the question for you is, do you see that there’s a need for an explicit representation, or is it some kind of combination between NeRF and SLAM, in a way, that is the future of this Spatial AI space?
[00:27:17] Andrew Davison: Yeah, that, that’s an excellent question.
I come from a background of sort of estimation and designed algorithms, probability and everything. Machine learning is actually a relatively new thing for me, but which we’ve very much been embracing recently. So yeah, this question of explicit versus implicit representation I find fascinating.
And we’ve actually been working with these neural field methods like NeRF and trying to build SLAM systems out of them exactly for that reason, that this whole question of how do you represent scenes has been very difficult, and the [00:28:00] obvious, explicit ways of representing scenes, like meshes or point clouds or grid-based signed distance functions, or maybe explicit maps of CAD models of objects, for instance,
they all clearly have problems. So this idea that you can represent a scene in a very general way within a neural network is fascinating. In the work that we’ve been doing, we’ve had a couple of systems: one called iMAP, which was a real-time neural field SLAM system, and then, building on from that,
recently we had some papers called Semantic-NeRF and iLabel. So in SLAM you don’t necessarily think that you want to build a photorealistic model of the scene. So when we saw NeRF, it seemed overkill somehow for scene understanding. But what we investigated was, what if you try and simplify that a bit?
So take a smaller network. So that’s the main thing that iMAP does. It uses a network very similar to NeRF, but it’s a bit smaller. It also makes it [00:29:00] a bit easier by using, instead of raw color input, RGB-D input. So we have a depth camera moving around the scene. And then, to enable the NeRF thing that would take hours to train to be trained in a few minutes and to work in real time,
we also do some active sampling. So in each frame we’re choosing a set of pixels to render. So the outcome of that is a NeRF-like system that can build a map of a room with reasonable quality. It’s not super high definition, but it’s within a couple of minutes. And then what turned out to be interesting with that is there’s a sort of automatic compression and coherence that happens when that network learns to represent a room.
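Editor's sketch: "choosing a set of pixels to render" can be done in many ways, and the interview does not spell out the rule. One plausible scheme, assumed here for illustration, is to split the image into a coarse grid, track a recent rendering loss per cell, and draw more pixels from cells where the loss is high. The grid layout, function names, and the proportional weighting are assumptions of this sketch, not a statement of how iMAP is implemented.

```python
# Loss-guided pixel sampling sketch (hypothetical scheme, stdlib only).
import random


def cell_probabilities(cell_losses):
    """Turn per-cell losses into sampling probabilities; fall back to uniform if all are zero."""
    total = sum(cell_losses)
    if total <= 0:
        return [1.0 / len(cell_losses)] * len(cell_losses)
    return [loss / total for loss in cell_losses]


def sample_pixels(cell_losses, grid, image_size, n_samples, rng):
    """Pick n_samples pixels: cells are chosen in proportion to loss, pixels uniformly within a cell.

    cell_losses is a flat row-major list of length cols * rows.
    """
    cols, rows = grid
    width, height = image_size
    cell_w, cell_h = width // cols, height // rows
    probs = cell_probabilities(cell_losses)
    chosen = rng.choices(range(cols * rows), weights=probs, k=n_samples)
    pixels = []
    for cell in chosen:
        cx, cy = cell % cols, cell // cols
        pixels.append((cx * cell_w + rng.randrange(cell_w),
                       cy * cell_h + rng.randrange(cell_h)))
    return pixels


rng = random.Random(0)
losses = [0.1, 0.1, 0.1, 0.7]          # bottom-right cell is currently rendered worst
print(sample_pixels(losses, grid=(2, 2), image_size=(640, 480), n_samples=5, rng=rng))
```

Rendering a few hundred well-chosen pixels per frame, rather than every pixel, is what keeps the per-frame cost of a neural field small enough for real-time use.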
So one of the examples I remember, this was my student Edgar Sucar, in his room that he was running this in, there was a football on the floor. And as this network that’s representing the room trains, the whole representation kind of wobbles, because stochastic gradient descent is going on around the weights of the network.
But that is not causing just uniform noise around the whole map. You actually see that certain things in the room wobble in a really coherent way. So the football on the floor was like getting bigger and smaller in a very coherent way. So that sort of really indicated to me that there are probably not many weights in this network that are involved in representing the 3D shape and position of that ball.
So maybe there’s even one neuron which is like the size of the spherical part of that. That really indicated that there’s a kind of automatic compression into a low-dimensional kind of decomposition of this scene going on here. And so that led, in our later work called iLabel, into a scene understanding system where all that you do is you add an extra output to your neural field that’s representing this scene to represent semantic classes.
So in fact, you add N extra outputs to represent one-hot semantic classes. And then at run time you give them very sparse supervision. So as a user you put a few clicks in the scene and say, this pixel is part of a wall, this pixel is part of a table. And then you ask it to predict what’s the highest of all of those outputs for every part of the scene.
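Editor's sketch: the supervision scheme just described (N class outputs, a loss only where the user clicked, and an argmax everywhere else) can be written down compactly. The code below operates on per-pixel class scores that are assumed to come from the extra outputs of some neural field; the field itself is out of scope, and the function names and data layout are hypothetical.

```python
# Sparse click supervision sketch: cross-entropy on clicked pixels, argmax for prediction.
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_click_loss(logits_by_pixel, clicks):
    """Mean cross-entropy over clicked pixels only; unlabelled pixels contribute nothing.

    logits_by_pixel: {(u, v): [score per class]}
    clicks:          {(u, v): class index chosen by the user}
    """
    if not clicks:
        return 0.0
    total = 0.0
    for pixel, class_index in clicks.items():
        total += -math.log(softmax(logits_by_pixel[pixel])[class_index])
    return total / len(clicks)


def predict_labels(logits_by_pixel, class_names):
    """Label every pixel with the class whose output is highest."""
    return {pixel: class_names[max(range(len(scores)), key=scores.__getitem__)]
            for pixel, scores in logits_by_pixel.items()}


scores = {(0, 0): [2.0, 0.0], (1, 0): [0.0, 3.0]}
print(sparse_click_loss(scores, {(0, 0): 0}))
print(predict_labels(scores, ["wall", "table"]))
```

Only the clicked pixels enter the loss, so any spreading of labels to the rest of an object has to come from the structure the field has already learned for geometry and appearance.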
And you get this remarkable spreading of those semantic indicators across objects, obeying boundaries between objects. So somehow this decomposition into something a bit like a scene graph has automatically happened in that system. That’s ongoing research and I wouldn’t immediately want to throw all my eggs in that basket.
I remain very interested in more explicit representations as well. But that’s been really fascinating for me to see.
[00:31:57] Gil Elbaz: Yeah, it’s really fascinating. I think that also on the [00:32:00] graphics side of things, you can see these amazing capabilities coming out of companies that are creating programs like Unreal Engine,
so Epic Games, for example, or Blender, which is this open source capability, or Unity. And you have these amazing graphics engines today, but they’re not really leveraging the NeRF capabilities at all. And like you said, there’s a very interesting inherent kind of compression that we get from it, and semantic control maybe that we can create using these NeRF systems.
But right now there is a gap between the graphics side of things, these 3D explicit representations, and the implicit representations that we get from NeRF-like systems.
[00:32:41] Andrew Davison: I wonder if there’s a sort of forking through most of my career I would’ve said that this capability to move a camera around a scene and build a 3D representation is very common.
It’s a single sort of technology and you should use essentially the same methods whether you are interested in. Modeling accurate [00:33:00] scenes that you are then gonna use in a telepresence operation, or whether it’s a drone building, a map of a scene, cause it wants to fly through the trees. And now maybe I see there’s a bit of a, an interesting fork in the road where there, there’s people who are interested in, photorealistic modeling of scenes.
And then there’s the things that an embodied real-time device would want to do. So to build a representation which can’t capture all of that detail, but it captures somehow the semantic and geometric properties that are useful for what it actually needs to do. So a NeRF is really trying to go very hardcore. It’s representing the color and intensity of all of the light in the scene. And I don’t think a robot needs that. So something much more implicit and abstract. So another thing we’ve been working on is NeRF-like networks that render feature maps rather than rendering explicit color, which could be a much more reasonable representation for robotics.
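[Editor's sketch] To make the idea of rendering feature maps rather than color concrete, here is a minimal version of the standard volume-rendering compositing step along a single ray. The function name `render_along_ray` and the toy numbers are assumptions for illustration, not code from the work mentioned. The point is that the same per-sample weights can composite either 3-channel colors or feature vectors of any width.

```python
import numpy as np

def render_along_ray(densities, values, deltas):
    """Composite per-sample values along one ray.

    densities: (S,) non-negative density at each sample
    values:    (S, C) per-sample color (C=3) or feature vector (any C)
    deltas:    (S,) distance between consecutive samples
    """
    alpha = 1.0 - np.exp(-densities * deltas)        # opacity of each sample
    # Transmittance: fraction of the ray that reaches each sample unblocked.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return weights @ values, weights

# Four samples along a ray; only the third one lies on a surface.
densities = np.array([0.0, 0.0, 50.0, 0.0])
deltas = np.full(4, 0.1)
features = np.arange(32, dtype=float).reshape(4, 8)  # 8-D features, not RGB

rendered, weights = render_along_ray(densities, features, deltas)
```

Swapping `features` for an `(S, 3)` color array gives the usual color rendering, so the choice between photorealism and a more abstract representation sits in what the network outputs per sample, not in the compositing.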
[00:34:01] Gil Elbaz: Yeah, definitely. And maybe diving back into robotics a bit, I’m always interested in what are the kind of state-of-the-art capabilities of today and where are they progressing in Dyson? Pretty much the state of today’s commercial robots. Can you tell us a little bit about what is the state of the current generation of publicly available robots, and then maybe a bit about how you see this progressing in the next five years, next 10 years?
[00:34:32] Impact on Robotics
Andrew Davison: Yeah, so as I said, there are certain robot products that exist. Drones, robot vacuum cleaners, things like that. I actually think then there’s quite a big gap to other products that you might think about making. So in indoor robotics, an area I’ve been especially interested in, let’s imagine a general home-help type of robot that could tidy up a room or something like that.
So various companies have been interested in that. So Dyson, who I continue to collaborate with, are actively working in that area and doing some, yeah, amazing stuff. A big internal research team that we collaborate with, but I think they and others would openly say these are very difficult problems.
So manipulation especially is what you need to do almost anything in robotics that isn’t just patrolling or cleaning the floor. I would say progress in manipulation has been harder than people expected. So definitely there’s been some progress recently, and machine learning, reinforcement learning coming into robotics, simulators being used to train algorithms, that’s very promising.
Actually, there’ve been some real breakthroughs in things like quadruped walking in the last year or two, based on RL in simulation, that have really surprised me, actually. But I think manipulation is still so hard because it’s this meeting point of tricky hardware with the advanced scene understanding that you need.
So you just have to think for a second or mentally inspect yourself as you do something like pick up a pen, and just all these kind of compound, complicated motions that you do, just to remember how hard manipulation actually is. So some of the things we’ve seen in manipulation: reinforcement learning has enabled, for instance, grasping of objects in quite a varied, cluttered situation.
So things like the arm farm type of training that Google did, showed that you could really train a robot to pick up lots of different objects, but mostly what that was about was picking up objects and dropping them. Whereas what if you want to pick up objects and actually use them? So I want to pick up an object and place it precisely in some place, or use it as a tool to operate something else.
I still think that motivates scene [00:37:00] understanding capabilities that we don’t quite have yet. So I think that may still be the hardest part of robotics. So tons of progress, and I think robotics has always been useful in very controlled situations like factories.
And we’ll see that that concept of what a controlled environment is will gradually become more and more general. And we’ll see robots that can have more freedom and roam around, but the really general robot that you could expect to operate and do general things in your home, I think there’s still a lot of work to do on that.
[00:37:37] Gil Elbaz: So that’s super interesting. Pretty much in these unstructured environments that have a lot of uncertainty, a lot of noise, a lot of challenges, what I’m sensing is that we really need to be able to create a coherent scene understanding, understand really the whole full 3D space and the semantics of it, as well as some [00:38:00] kind of physics-based understanding.
So it could be, going back to the point on simulation, it could be the ability to simulate different changes in the scene or different manipulations to the scene before actually doing them in practice. But yeah, it seems like there needs to be also some kind of physics understanding here. And RL through simulation was able to take this knowledge and insert it into these neural network-based systems in a way, yeah.
Do you see RL as really the leading way to go forward with robotics? For example, with humanoid robotics, a few steps in the future, is RL going to be one of the more promising directions, or should we look towards other kinds of methods as well?
[00:38:46] Reinforcement Learning (RL)
[00:38:46] Andrew Davison: So I’m not an expert in RL, but I’ve worked with people who are. I think it’s a super important component.
I think when you’ve got a really well-defined local task for which it’s hard to [00:39:00] design a solution, like a quadruped walking over bumpy ground, or let’s say, putting a peg in a hole that requires a lot of kind of force feedback and stuff, those are definitely the right areas to apply RL in.
My instinct is that people have been reaching for RL a bit too generally in some other areas. So to think, for instance, that RL could solve a problem like moving a robot all the way across a room and tidying up some objects on a desk seems too much to expect it to do. You just have to think of, what do I actually expect this network to learn?
And I think you’re expecting that network to essentially solve a SLAM-type problem: build a persistent representation of the scene. And while that may work in the long term, at some scale of network, some scale of training data, my instinct is let’s do something more explicit for now for that kind of scene understanding part.
[00:40:00] Let’s call on RL for lots of these more local, highly sort of contact-rich tasks that are just really hard to design controllers for. Simulation, I think, will continue to be more and more important. I’d say that’s another thing in my career that has crept up that I didn’t necessarily expect, just how useful and how good simulation can be.
I came from a background where most roboticists were, I would say, very mistrustful of simulation. And simulation can be not just about running a simulation beforehand, training something on it, and then running that network sim-to-real in the real world. I like the idea of real-time simulation.
So as you are running your system there’s the real world and what’s going on, and then this kind of digital twin of that world. It’s your hopefully always up-to-date simulation of whatever it is that’s really there. And then that may allow you to map into [00:41:00] behaviors that you’ve learned previously
offline in the simulation, or you can actually do real-time simulation. So here’s some completely new object I’ve picked up and I want to figure out, can I use this to manipulate this other object? Simulators are so good and so fast now, you could just run a thousand-x real-time simulation, try out a thousand different things, see which one works, and then actually execute it.
So yeah, I’m very bullish about simulation in general, but for me still, probably the hardest part is that mapping from here’s my real observations from a camera of the real world to getting good enough knowledge to actually instantiate and keep up to date that simulation. For me, that’s the kind of SLAM part.
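[Editor's sketch] The "try out a thousand different things in simulation, then execute the one that works" idea above can be sketched as a simple sample-and-score loop. Everything here is hypothetical: `simulate_push` is a toy stand-in for a real physics simulator, and the friction model, action range, and function names are assumptions chosen only to keep the example self-contained.

```python
import numpy as np

def simulate_push(start, velocity, steps=20, dt=0.05, friction=0.9):
    """Toy simulator: a pushed object slides and slows down under friction."""
    pos = np.array(start, dtype=float)
    vel = np.array(velocity, dtype=float)
    for _ in range(steps):
        pos = pos + vel * dt
        vel = vel * friction
    return pos

def best_action_by_simulation(start, goal, n_candidates=1000, seed=0):
    """Simulate many candidate pushes and return the one ending nearest the goal."""
    rng = np.random.default_rng(seed)
    goal = np.asarray(goal, dtype=float)
    candidates = rng.uniform(-5.0, 5.0, size=(n_candidates, 2))
    costs = np.array([np.linalg.norm(simulate_push(start, v) - goal)
                      for v in candidates])
    best = int(np.argmin(costs))
    return candidates[best], float(costs[best])

start, goal = [0.0, 0.0], [0.5, -0.3]
action, cost = best_action_by_simulation(start, goal)
```

The loop itself is trivial; as the interview stresses, the hard part is keeping the simulated state (`start` here) faithful to the real scene, which is where the SLAM-style estimation comes in.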
[00:41:45] Gil Elbaz: That’s super interesting and definitely there are so many parts that come together and definitely the SLAM is one of the major pieces. Without it, you can’t instantiate this simulation like you’re saying. And then these capabilities that require manipulation [00:42:00] are gonna be extremely hard,
if not impossible, to create. And so just on the, like, last question around robotics: do you see, let’s say by 2030, for example, do you think we’ll be able to reach the coffee test? Will a robot be able to walk into our house and make coffee for us, even if it’s never been in that house?
[00:42:20] Andrew Davison: I would hope so. Yeah. It’s always hard as a researcher to jump out of the local problems and the linear thinking that you see and try to remember that the long term is usually exponential progress. Yeah. As I sit here right now, that feels like long enough that we might see some quite good progress.
[00:42:39] Gil Elbaz: I made a bet with one of my friends, who’s also a researcher, on a bowl of hummus, that it’s gonna be possible by 2030. So I’m hoping for it. He thinks it’s gonna take at least till 2035. We’ll see if he’s right. Maybe just diving back into the algorithms a bit, talking a little bit on classic versus deep learning: I just recently went [00:43:00] over Gaussian belief propagation and I’d love to dive into it with you if possible, a little bit on when it is useful, what it does and how it works.
[00:43:10] Belief Propagation Algorithms for Parallel Compute
[00:43:10] Andrew Davison: So as you may have seen from some of my recent talks and publications, yeah, belief propagation is something I’m incredibly excited about at the moment. So that really came from thinking about the future of robotic applications, and this interaction between algorithms and hardware, I think, is so important and is under-discussed, actually.
So through my career, I’ve seen hardware coming along that has completely changed the game. So I’ve seen that a few times. So one was definitely GPUs. So we’d been working with serial processors for most of my career, up till about 2010, a little before that. And then we started to see people using GPUs for vision.
And in particular it was some of the students in my lab. So Richard [00:44:00] Newcombe and Steve Lovegrove at the time started to delve into things like real-time stereo using GPU acceleration. And that became the key element of some of the first real-time dense SLAM systems that could actually build really detailed 3D representations in real time.
So another technology that came along around that time was depth cameras. So the Kinect camera came out around that time, and it’s hard to remember now that depth cameras were things we’d heard of before, but they cost thousands of pounds and they were frankly not that good. And then suddenly you could go down to a local shop, a gaming shop, and buy for a hundred pounds
this really amazing device. So that will continue to happen, and I think people are very much in a GPU mindset, especially at the moment. But that is a passing phase, I think. So specifically talking about processors, I think we’re at the dawn of a real sort of bursting [00:45:00] onto the scene of lots of different ideas in how processors should work, and especially processors for AI.
And I think the long-term trend has to be towards parallelism. So you can’t make single processors faster anymore, or it’s too power hungry to do that anyway. So you have to embrace parallelism. But I think it will be generally a much more general, heterogeneous parallelism than we are used to on GPUs at the moment, which are very good at doing the same thing all at the same time.
So I think that graph-like processing, thinking of many quite independent processing cores with their own memory and processing capability, and then graph-like connectivity that enables communication and message passing, I think that’s where computing is going. And that can happen within single chips.
So there’s a company in the UK that I’ve been collaborating with called Graphcore, which has these amazing, they call it an IPU, an intelligence processing [00:46:00] unit. So it’s a many-core chip that has this kind of all-to-all communication capability and very general parallelism. Or it can happen across many devices.
So another area we’ve become very interested in is many-robot systems. So think about hundreds or thousands of independent robot devices or sensors or wearable units that might be within a space, all with their own independent sensing and compute and storage capability. And then some mesh-like communication that joins them all together.
How do all of those devices actually coordinate to do things together? So my interest in belief propagation came out of thinking, what sort of algorithms for Spatial AI can I imagine working with that sort of compute infrastructure? So you have to give up on the idea of central memory, you have to give up on building big matrices and inverting them and that kind of thing.
So the standard algorithms that we use to solve geometric [00:47:00] estimation problems in SLAM can usually be thought of in terms of a factor graph, which is a big graphical model that essentially describes all of the variables of interest and how they’re connected together by observations.
And there are very well-known algorithms for doing inference on factor graphs, which usually involve efficient ways to invert big matrices, but belief propagation allows you a way to do inference on factor graphs in a purely distributed, message-passing way. And what I’ve become especially interested in is Gaussian belief propagation, which is where you make the assumption that most of the probabilities you’re interested in are Gaussian most of the time.
But actually it’s general enough to be really useful, because you can use it with non-linear factors and you can use it with robust kernels, which are the same sort of assumptions we use all the time in SLAM-type problems.
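[Editor's sketch] As a minimal illustration of inference on a factor graph by message passing, here is scalar Gaussian belief propagation on a simple chain, loosely like a 1D robot trajectory with a position measurement at each step and an odometry measurement between steps. The function name `gbp_chain`, the synchronous update schedule, and the purely linear factors are assumptions made for the example; it leaves out the non-linear factors and robust kernels mentioned above.

```python
import numpy as np

def gbp_chain(z, var_z, d, var_d, iters=None):
    """Scalar Gaussian belief propagation on a chain x_0 .. x_{n-1}.

    Unary factors:    x_k           ~ N(z[k], var_z)
    Pairwise factors: x_{k+1} - x_k ~ N(d[k], var_d)
    Returns the marginal mean and variance of every variable.
    """
    z, d = np.asarray(z, float), np.asarray(d, float)
    n = len(z)
    iters = n if iters is None else iters
    w = 1.0 / var_d                        # precision of each pairwise factor
    prior_eta = z / var_z                  # information vectors of unary factors
    prior_lam = np.full(n, 1.0 / var_z)    # precisions of unary factors

    # Messages in information form (eta, lam).
    # fwd[k]: factor k -> variable k+1,  bwd[k]: factor k -> variable k.
    fwd_eta, fwd_lam = np.zeros(n - 1), np.zeros(n - 1)
    bwd_eta, bwd_lam = np.zeros(n - 1), np.zeros(n - 1)

    for _ in range(iters):
        nf_eta, nf_lam = np.zeros(n - 1), np.zeros(n - 1)
        nb_eta, nb_lam = np.zeros(n - 1), np.zeros(n - 1)
        for k in range(n - 1):
            # Variable k -> factor k: its belief without factor k's own message.
            eta_in = prior_eta[k] + (fwd_eta[k - 1] if k > 0 else 0.0)
            lam_in = prior_lam[k] + (fwd_lam[k - 1] if k > 0 else 0.0)
            # Marginalise x_k out of the factor: message to x_{k+1}.
            nf_lam[k] = w - w * w / (w + lam_in)
            nf_eta[k] = w * d[k] + w * (eta_in - w * d[k]) / (w + lam_in)

            # Other direction: variable k+1 -> factor k -> variable k.
            last = k + 1 == n - 1
            eta_in = prior_eta[k + 1] + (0.0 if last else bwd_eta[k + 1])
            lam_in = prior_lam[k + 1] + (0.0 if last else bwd_lam[k + 1])
            nb_lam[k] = w - w * w / (w + lam_in)
            nb_eta[k] = -w * d[k] + w * (eta_in + w * d[k]) / (w + lam_in)
        fwd_eta, fwd_lam, bwd_eta, bwd_lam = nf_eta, nf_lam, nb_eta, nb_lam

    # Belief = product of incoming messages = sum in information form.
    eta, lam = prior_eta.copy(), prior_lam.copy()
    eta[1:] += fwd_eta
    lam[1:] += fwd_lam
    eta[:-1] += bwd_eta
    lam[:-1] += bwd_lam
    return eta / lam, 1.0 / lam

means, variances = gbp_chain(z=[0.0, 10.0], var_z=1.0, d=[1.0], var_d=1.0)
```

Each update only touches a factor and its two neighbouring variables, so no global matrix is ever built or inverted. Because a chain has no loops, the result after enough iterations should match what a dense linear solve of the same problem gives.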
[00:47:55] Gil Elbaz: So this paradigm, where are we like right now [00:48:00] with respect to the research?
Is this one of the earlier methodologies that you see working in such a way? Or is this something that you’re bringing back from a while back?
[00:48:11] Andrew Davison: Yeah. In my view it’s an algorithm that’s existed since the eighties, has definitely had periods of strong interest from researchers, but has been out of favor recently.
And that may be for good reasons, that it doesn’t work that well in the end, and we don’t quite know yet. I think it should work well, but I think one of the key reasons could be just that the hardware paradigm we’ve been in has not suited that style of algorithm and other things have come to the fore.
It’s not a sensible algorithm to use on a CPU necessarily. There are more efficient things to do, and not necessarily on a GPU either, except for some kind of special cases. So at the moment it’s, I think, reemerging as a niche. It’s largely been [00:49:00] me and a small group of people who’ve been trying to tell this story and show more and more things that we can do with it.
Yeah, I’m just really interested to see how it will go and, what we’ve been trying to do recently is just try different things with this. So we did bundle adjustment running on this Graphcore IPU. We’ve done many-robot systems. We had a paper called Robot Web, which is a many-robot localization system.
We’ve also done a many-robot planning application to show that you can use this, for instance, for traffic flows through a junction using peer-to-peer communication, where with all of these individual robots or vehicles, you want them to do coordinated planning so that they move through each other smoothly.
And to show that you can actually do that really well without necessarily needing a central base station. Another thing that we’ve done in trying to just push this idea out there, so with my [00:50:00] student Joe Ortiz, we wrote an online interactive article. It’s called A Visual Introduction to Gaussian Belief Propagation.
So in the style of the online journal Distill. Yeah, I think the way publishing is happening at the moment is also pretty interesting, that there’s always various ways to get your ideas out there, which aren’t necessarily just scientific papers.
[00:50:20] Gil Elbaz: Definitely. We’ll also link to that.
I also checked it out beforehand and it’s very cool, very visual, easy to understand. So I highly recommend checking that out. Definitely. And do you see any connection here with cellular automata in any way? This is also one of those kind of crazy and super interesting ideas and concepts that have not really made it to a lot of real-world production capabilities, but is also fascinating to think about what is possible through those kind of systems.
[00:50:51] Connection to Cellular Automata
[00:50:51] Andrew Davison: Yeah, I think there’s a really close link. What interests me about cellular automata is it’s an interesting [00:51:00] distributed computational machine, so every cell is doing its own thing, and on each step is receiving some information about its local environment. So for me that’s exactly the same sort of assumption that excites me about belief propagation.
So there’s been these brilliant, again, online Distill papers recently on learning cellular automata, from Alex Mordvintsev especially, that I’ve been absolutely fascinated by. So what he’s really been looking at is how can you learn the local rules that would work on a cellular automaton, such that interesting global behavior emerges, like being able to create a picture or being able to do classification.
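[Editor's sketch] For readers unfamiliar with cellular automata, the snippet below shows the basic structure being described: every cell updates using only its immediate neighbourhood. It uses the fixed, hand-written rule of Conway's Game of Life on a wrap-around grid, not a learned rule as in the Distill work mentioned above; the function name `life_step` is an assumption for the example.

```python
import numpy as np

def life_step(grid):
    """One synchronous update: each cell looks only at its 8 neighbours."""
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    born = neighbours == 3
    survives = (grid == 1) & (neighbours == 2)
    return (born | survives).astype(grid.dtype)

# A "blinker": three cells in a row that flip between horizontal and vertical.
grid = np.zeros((5, 5), dtype=int)
grid[2, 1:4] = 1
after_one = life_step(grid)
after_two = life_step(after_one)
```

In the learned variant discussed in the interview, the hand-written `born`/`survives` rule would be replaced by a small trainable function of the same local neighbourhood.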
And actually we are actively working now on similar things, so possibly in a somewhat more general way based on general graphs and thinking about how belief propagation really folds into that as well. So can you actually design a general [00:52:00] learning capability into these factor graphs that we then optimize with belief propagation?
So I actually think there might be a very interesting way to redesign or to think of a new type of deep learning, if you like, that copies to some extent the structures that we know are useful in deep learning. So I think with deep learning, what’s really interesting about it is this massive over-parameterization of what you’re trying to learn.
And when you train your network, a lot of what’s going on is pruning away stuff that doesn’t do anything useful to discover the bits that do. I don’t see why we shouldn’t design a factor graph that has a similar sort of structure, but where instead of weights that we train, we have variables that we are trying to infer.
And as you optimize that factor graph, you switch parts of it on and off in a way that’s quite similar to neural network training, but which might have really nice [00:53:00] properties. And I think the property I’m most interested in is how that might enable much better continual learning.
Because I think in a lot of our experience in SLAM and Spatial AI of trying to put neural network methods together with the estimation stuff that we normally have, we normally don’t just have a standalone neural network, we are trying to connect it to some SLAM system. So we might have a neural network that’s predicting the depth of single images, but we want to then use that over multiple views and fuse them all to estimate the 3D shape of a room.
It’s always been hard to do that, partly because we don’t really understand the uncertainty coming out of neural networks properly. So everything in my background says if you want to do continual estimation properly, be Bayesian and be probabilistic. Understand the uncertainty on everything, then you know exactly the right way to weigh everything properly,
such that you don’t become over- or under-confident. And that’s been difficult because neural networks [00:54:00] are not Bayesian. And then if we think about the properties we would actually like for our robot, they are not that you would pre-train it completely and then just set it in the world and it has to run.
We would like it to be able to learn in situ. There will be things about this scene I’m in now, objects I’ve never seen before. They didn’t exist in my training set, priors about the shape of this scene that I should really learn and use locally. So how can we build networks that can learn continually in that proper way and making them more Bayesian seems like the right thing to do to me.
And there’s a possibility that this sort of factor graph and belief propagation type idea might enable that. So that’s something we’re working on.
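[Editor's sketch] The point above about knowing "the right way to weigh everything" once uncertainties are understood can be shown with the simplest Bayesian fusion rule: combining independent Gaussian estimates of the same quantity by inverse-variance weighting. The function name and the example numbers are hypothetical, and the sketch assumes the estimates are independent and that each reported variance is trustworthy, which is exactly what is hard to get from a neural network.

```python
import numpy as np

def fuse_gaussian_estimates(means, variances):
    """Fuse independent Gaussian estimates of one scalar quantity."""
    means = np.asarray(means, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    fused_var = 1.0 / precisions.sum()              # precisions add up
    fused_mean = fused_var * (precisions * means).sum()
    return fused_mean, fused_var

# Two depth predictions (metres) for the same surface point from two views:
# a confident one (std 0.1 m) and a vaguer one (std 0.3 m).
depth, depth_var = fuse_gaussian_estimates([2.0, 2.4], [0.01, 0.09])
```

The fused depth sits much closer to the confident prediction, and the fused variance is smaller than either input. If a network reports variances that are too small or too large, the same formula silently produces an over- or under-confident result.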
[00:54:44] Gil Elbaz: That’s incredible. And you can clearly see this unique background and a lot of, like, vast knowledge of the space and all of the modern methodologies coming together towards
this new potential paradigm. And I think [00:55:00] that it’s amazing to see these things at an early stage, before everything works and it’s all just applications. To see it at such an early stage, I think, is extremely powerful. Conceptually, the ideas make a ton of sense.
And of course, the hardware lottery that we got today is one big parameter on the methods and the way that the methods were developed up until now. And so looking forward, looking at how the hardware is gonna be evolving and then together with that, understanding the future of these algorithms and what can be done, what should be done, I think this is incredible work.
I’d love to maybe finish off with a final kind of question. We ask all of our guests to provide a recommendation for the younger generation, the next folks that are coming into the space of computer vision, of machine learning: what should they maybe be doing, studying, focusing on that will help them out in their careers?
[00:55:55] Recommendations for the Next Generation of Researchers
[00:55:55] Andrew Davison: Yeah, great question. I’ve listened to a couple of your former interviews [00:56:00] and heard some really great suggestions there. I definitely agree with the idea that keeping your background broad and maybe doing something unusual early on is a great thing to do. So study physics or some type of engineering that doesn’t necessarily seem like machine learning, but that will give you a unique angle.
Definitely. I think there’s some sort of groupthink going on at the moment in our field, particularly focused around deep learning where, yeah, there are quite a lot of people who can only think of one way to attack a problem, and giving yourself more breadth can really help there.
I think it’s important to try and keep an eye on things that will actually be useful one day. And I feel like I’ve always done that, but I don’t quite know where that instinct came from. So a vision that you’re working on something that you can imagine at least being part of some long-term [00:57:00] solution.
And I think another thing that I often say to students is, are you sure you’re working on the hardest part of the problem here? So there’s really long-term challenges like making a robot that could tidy your kitchen. There’s many things that need to be solved there. And yes, maybe it’s true that the localization system, for instance, we could make it a few percent better.
And I’m not saying that’s not worthwhile work, but it’s not really the deal breaker at the moment in terms of actually allowing that. So probably the deal breaker is manipulation, for instance. So that’s probably the most important thing to work on and where you’d have the most chance to make an impact in your career. Or even there might be things to do that might seem quite orthogonal to what other people are doing.
So we are used to seeing these tables in papers that are all saturated on benchmarks for some sort of accuracy. But then you might consider, but for that algorithm to actually be useful, let’s say in an [00:58:00] application like AR/VR, it’s probably gotta be a thousand times more efficient than it is now.
So why not work on that instead? And it can be hard to work on things like that because other people might not really agree that it’s important or interesting. So you have to have the faith, I think, to stick at something like that for a while. But I do believe that if you are really working on something which ultimately is important to the application, then the time will come when people will be interested in that and you’ll get the credit you deserve if you’ve done something good there.
[00:58:34] Gil Elbaz: Amazing. Amazing. So keeping it broad and really focusing on the hardest parts of the problem. Focusing on the most interesting bottlenecks right now and understanding that things take time, I think is also a big takeaway that I have from this. Things take time, but if you are working on the interesting part, it will converge and come together.
Andrew, thank you very much for your time. It was a pleasure talking.
Andrew Davison: [00:59:00] I’ve really enjoyed it. Thanks very much.
[00:59:02] Gil Elbaz: Thank you. This is Unboxing AI. Thanks for listening. If you like what you heard, don’t forget to hit subscribe. If you want to learn about computer vision and understand where the future of AI is going, stay tuned.
We have some amazing guests coming on. It’s gonna blow your minds.