Our guest this episode is Glenn Jocher, CEO and founder of Ultralytics, the company that brought you YOLO v5 and v8. Gil and Glenn discuss how to build an open-source community on GitHub, the history of YOLO, and even particle physics. They also talk about the progress of AI, diffusion and transformer models, and the importance of simulated synthetic data today. The first episode of season 2 is full of stimulating conversation about the applications of YOLO and the impact of open source on the AI community.
TOPICS & TIMESTAMPS
2:03 First Steps in Machine Learning
9:40 Neutrino Particles and Simulating Neutrino Detectors
21:09 History of YOLO
25:28 YOLO for Keypoints
29:00 Applications of YOLO
30:48 Transformer and Diffusion Models for Detection
35:00 Speed Bottleneck
37:23 Simulated Synthetic Data Today
42:08 Sentience of AGI and Progress of AI
46:42 ChatGPT, CLIP and LLaMA Open Source Models
50:04 Advice for Next Generation CV Engineers
Glenn Jocher is currently the founder and CEO of Ultralytics, a company focused on enabling developers to create practical, real-time computer vision capabilities with a mission to make AI easy to develop. He has built one of the largest developer communities on GitHub in the machine learning space with over 50,000 stars for his YOLO v5 and YOLO v8 releases. This is one of the leading packages used for the development of edge device computer vision with a focus on object classification, detection, and segmentation at real-time speeds with limited compute resources. Glenn previously worked at the United States National Geospatial Intelligence Agency and published the first ever Global Antineutrino map.
ABOUT THE HOST:
I’m Gil Elbaz, co-founder and CTO of Datagen. In this podcast, I speak with interesting computer vision thinkers and practitioners. I ask the big questions that touch on the issues and challenges that ML and CV engineers deal with every day. On the way, I hope you uncover a new subject or gain a different perspective, as well as enjoying engaging conversation. It’s about much more than the technical processes – it’s about people, journeys, and ideas. Turn up the volume, insights inside.
CEO and Founder, Ultralytics
[00:00:00]Glenn Jocher: So for real life applications, you want an AI model to always be living and breathing and improving just like we are. So you get outta bed, you accidentally learn something new that you didn’t intend to, but you learn it anyways. And that’s how the human brain works. And so we want models to behave the same way.
[00:00:17]Gil Elbaz: Welcome to Unboxing AI, the podcast about the future of computer vision, where industry experts share their insights about bringing computer vision capabilities into production. I’m Gil Elbaz, Datagen, co-founder and CTO. Yalla. Let’s get started.
Glenn Jocher is currently the founder and CEO of Ultralytics, a company focused on enabling developers to create practical, real-time computer vision capabilities with a mission to make AI easy to develop.
[00:00:50]Gil Elbaz: He has built one of the largest developer communities on GitHub in the machine learning space with over 50,000 stars for his YOLO v5 and YOLO v8 releases. [00:01:00] This is one of the leading packages used for the development of edge device computer vision with a focus on object classification, detection, and segmentation at real-time speeds with limited compute resources.
[00:01:13]Gil Elbaz: He has previously worked at the United States National Geospatial Intelligence Agency, the NGA, and published the first ever Global Antineutrino map. Glenn, welcome to Unboxing AI. It’s really great to have you here. [00:01:27]
Glenn Jocher: Yeah, thanks a lot Gil. Thank you for having me.
[00:01:29]Gil Elbaz: Great. So maybe to kick this off, I actually dove into your GitHub and back in 2018, I saw that you had your first kind of big public commit, and I wanted to ask you a little bit about your journey into the world of open source.
[00:01:42]Gil Elbaz: So, first public commit in 2018. If you could take us back there. What was going on back then? It was part of a velocity estimation computer vision algorithm for cars, developed in MATLAB, all that good stuff early on. So I’d love to get your take on what was the sense and what was the idea [00:02:00] in really releasing this open source, and the motivation and all that.
2:03 First Steps in Machine Learning
[00:02:03]Glenn Jocher: Yeah, thanks for asking. That’s a great question. The particular repo that you’re mentioning, it’s called the velocity repo, and it was my take on using visual methods to determine the speed of cars. I used to get a lot of speeding tickets back in the day, and I thought maybe we could do this a better way, since mostly we used expensive radar and hardware.
[00:02:20]Glenn Jocher: But this all started because of my efforts at the NGA, the National Geospatial Intelligence Agency, where Ultralytics was a subcontractor. There we did particle physics, and it was very exciting. We did some work with antineutrino detection, which is a whole nother story. Very interesting. The particle behaves very strangely, and we’re trying to understand it a little better.
[00:02:43]Glenn Jocher: Of course, the government, especially the intelligence agencies, don’t like their work being published. So back then we didn’t have any open source repositories. We didn’t even use Python, actually. We used MATLAB because it was the sort of language that they teach at university in the US and when you get out in the job market, it’s all you know, so you need to use that.
[00:02:59]Glenn Jocher: But [00:03:00] once all those contracts wrapped up, I had a lot more free time on my hands, and a lot of the graduate students that we worked with were using Python, and so I started investing in that. And I’d taken an interest in AI. So on the particle physics programs, I started using AI for regression and classification of antineutrinos in detectors.
[00:03:20]Glenn Jocher: We wanted to know where these particles were coming from, what their energies were and their directions. And so the interesting thing is that this type of problem has really clear analogs in the vision space. And when I looked over at the vision space, I saw that it was immensely more popular than the antineutrino world.
[00:03:38]Glenn Jocher: So antineutrinos are a very specific type of science, and what I saw was that when we’d go from one conference to another around the world, we’d have a conference in Honolulu and then another one in Paris, and the same hundred scientists would show up each time, the same exact people. And I thought, wow, this is a very niche science.
[00:03:57]Glenn Jocher: And I noticed also when I published a paper, it’d get maybe like [00:04:00] 12 citations. So I just wasn’t having the impact that I thought I could have. And more importantly, the tools that I was designing, we couldn’t put them in the hands of regular people. It wasn’t changing lives. And I thought I could have a bigger impact in open source.
[00:04:14]Glenn Jocher: And the most popular architectures of the day were these YOLO models that were coming out. And I took a good look at these. And these were originally designed by a student called Joseph Redmon and his academic advisor called Ali Farhadi. And they designed and released three publications, which were YOLO and then YOLO v2 and YOLO v3.
[00:04:35]Glenn Jocher: And I started replicating these when I got into open source. And so this was all new to me. So I’d never used Python before. I had never done vision models before. I never used PyTorch. And I’d never even used GitHub. And so all these new things were happening to me at the same time. This was 2018, like you mentioned just about four to five years ago when I got into this.
[00:04:56]Glenn Jocher: So I think the outsider perspective really helped me and [00:05:00] really gave me perspective on what the AI field was good at and where it possibly needed some more help.
[00:05:06]Gil Elbaz: Definitely. And like coming at it with such an interdisciplinary and such a unique background, I think, is definitely good grounds for creating [00:05:16] unique innovation that might not be as trivial to the folks that are already active in the field. If we maybe take a few steps back, and we’re gonna get back to YOLO, and we’re gonna get back to a lot of these different topics. I saw a while back that you were a simulation engineer in the army, simulating and building defense systems using Kalman filters and [00:05:37] Monte Carlo methods, threat tracking and identification. I’d love it if you could take us through a little bit of the interesting topics here, both from a motivation perspective and also an algorithmic perspective.
[00:05:50]Glenn Jocher: I went to the Naval Academy, straight outta high school. I wanted to be a fighter pilot, and it turns out that I’m terrified of flying.
[00:05:57]Glenn Jocher: So it was a, it was a bad combination.
Gil Elbaz: Did you [00:06:00] get a chance to fly a few times?
Glenn Jocher: I did. It was the scariest thing I’ve ever done.
Gil Elbaz: Amazing. Okay.
Glenn Jocher: I realized this was probably not the right career track for me, but I did get this aerospace engineering degree, and they heavily focused on simulation there, simulating aircraft.
[00:06:16]Glenn Jocher: And this is when I got into simulating some army ground systems as part of my first job out of college. So I was working on this tank, and we built the simulation and we had some missiles flying around, some missile defense. And it was kind of interesting, but I ultimately didn’t like it. I didn’t have a very warm, fuzzy feeling inside about what I was working on.
[00:06:33]Glenn Jocher: Cause people might get hurt because of it. And when I had an opportunity to transition to particle physics, I took it. This was sort of an aspirational goal of mine. I was sort of enamored by the potential for learning more about the universe. It seemed a noble goal. And luckily I could take the simulation tools and know-how that I’d developed on the Army systems, and I took that straight into particle physics.
[00:06:59]Glenn Jocher: It’s kind [00:07:00] of a strange transition, but it turns out that modeling tanks on the battlefield is really similar to modeling particles bouncing around a physics detector. They’re actually not that different. Pretty much any simulation is a numerical simulation where you start from some initial conditions and then you start incrementing and iterating towards the future.
[00:07:18]Glenn Jocher: And this is really the whole reason you run a simulation: because you don’t know what the future is going to hold. There are some other systems and particles that are linear, and for those you can define their behavior for all time into the future. And these are nice. That part of the world kind of fits into a clean, well-understood model.
[00:07:36]Glenn Jocher: But the world that we live in is not like that. It’s unstable. Sometimes it introduces random aspects. And because of this, we end up in this stochastic system where you don’t really know what’s gonna happen, say, tomorrow or the next day. And I think that resonates a lot with all of us, cuz that’s true in real life for us also. We don’t really know
[00:07:54]Glenn Jocher: if we’re gonna be here; a bus might run over us when we’re crossing the street. So to figure out if the bus is gonna run over you, [00:08:00] you can set up a simulation. And to do that, you put the players in the game where they start, and you put in some initial conditions, like where they’re moving, what the weather looks like, things like that.
[00:08:11]Glenn Jocher: And then you simply start to track them using well-known kinematics: you can say, this guy is walking in this direction, so one second from now he’ll be a little further in that direction, and so on and so forth. And so you take these little baby steps, and it’s almost like a movie where you have these frames and you sort of propagate that forward in time.
[00:08:31]Glenn Jocher: It’s really interesting. And like using this numerical simulation method, you can really simulate pretty much anything as long as you understand its behavior, you can create its initial conditions and you can create the steps that it takes into the future. And you can sort of hit a play button and watch that happen.
[00:08:46]Glenn Jocher: And so for particle physics, that’s really useful, because you simulate an interaction, simulate what comes out of it, and then try and put those outputs together and figure out if you can make sense of them to, say, reconstruct that particle, which is [00:09:00] what you really wanna do. And that’s the whole point that you’re building the detector for.
[00:09:03]Glenn Jocher: If it turns out you can’t reconstruct the particle very well, well, no problem, you just abandon that design, simulate a different type of detector, maybe a smaller one, or maybe a more accurate one, and run the simulation again, and you might get better results. And this way you can start to iterate on a design for the actual hardware that you’re gonna spend a lot of money on, maybe millions of dollars, before you actually have to spend the money.
[00:09:25]Glenn Jocher: Cuz you don’t wanna discover that it’s doing a bad job once you’ve already built it. Cuz then you’re kind of backed into a corner. You’re stuck with the investment you have that sunk cost. [00:09:34]
9:40 Neutrino Particles and Simulating Neutrino Detectors
Gil Elbaz: Cool. So what is the motivation for detecting neutrinos or antineutrinos? Can you give us a little bit of insight on that?
[00:09:40]Gil Elbaz: I mean, we don’t get a chance to talk with someone in the space every day. So I think it could be interesting.
[00:09:47]Glenn Jocher: Yeah, this is actually the most fascinating part of my life. So neutrinos are a type of fundamental particle, and by fundamental, we think that theoretically they can’t be broken down into anything smaller.
[00:09:58]Glenn Jocher: Other types of fundamental particles, for [00:10:00] example, are electrons, but not protons, which are composed of quarks. So some particle interactions were sort of missing energy, and this is when a new type of particle called the neutrino was proposed, one that was running off with some of the energy that scientists couldn’t see.
[00:10:15]Glenn Jocher: So I think originally Enrico Fermi proposed the neutrino. He’s an Italian physicist, and he said something interesting. He said: I’ve done a very bad thing. I’ve proposed a particle that can never be proven. And in particle physics, that’s a very big sin, because you’re essentially making a proposition that nobody can ever verify.
[00:10:35]Glenn Jocher: They can’t ever say that it’s true or that it’s false. And if you can’t do that, then whatever you say doesn’t mean anything. It’s really just people talking. So Enrico thought that nobody would ever be able to prove these nutrient notes, but we fast forward about half a century to 1970. He’s a scientist in the US and he’s creating a big experiment to try and capture neutrinos.
[00:10:54]Glenn Jocher: And he turns on his experiment, and he predicted he should get so many. And then at the end [00:11:00] of a few months, he does a few calculations and he sees that he got one third of what he expected, almost exactly one third. And it’s very strange; he thinks he must have done something wrong. And so he goes back and he tweaks his experiment and he takes another measurement for a few more months.
[00:11:15]Glenn Jocher: And again, he gets like one third of what he’s looking for, and he keeps doing this. And like over the course of several years, he starts to lose his mind, because he’s double-checked every part of the experiment, everything seems to be fine, but he just keeps collecting one third of the amount of particles he should be getting.
[00:11:30]Glenn Jocher: And the reason this was happening was because, unknown to him, neutrinos actually arrive in three different flavors that behave in three different ways. And his experiment was only capable of capturing one of the flavors. And so this is a tremendous discovery. This meant that this particle was actually a type of multifaceted particle that, depending on how you look at it and when you look at it, is gonna appear differently.
[00:11:55]Glenn Jocher: It’s gonna have different properties, such as mass, speed, and so on. [00:12:00] And so from this, now we have the modern understanding of a three-flavor antineutrino matrix. And what happens is that the three different flavors interact differently. And so the way I usually explain it to people is this: if you imagine that you’re leaving an ice cream store with a vanilla scoop, and you start eating it and it’s vanilla, it’s great cause you like vanilla, and then you start walking down the block.
[00:12:21]Glenn Jocher: And as you get to the end of the block, you look down and suddenly you have a chocolate ice cream scoop in your hand. And not only that, but it’s a different size. It can be like 10 times bigger. And so this is very strange. So a different mass in the same particle,
Gil Elbaz: That sounds like a good thing, right?
Glenn Jocher: It could be a good thing, depending on what flavor you’re interested in.
[00:12:39]Glenn Jocher: And then we have the third one, which we could say is strawberry. And so these three different flavors come and go somewhat randomly, but also somewhat stochastically. So if you understand very well the laws that govern them, then you can start to gain an understanding of where and when these neutrinos were created, and ultimately that’s what the [00:13:00] intelligence agencies were interested in.
[00:13:01]Glenn Jocher: They wanted to build a detector, place it somewhere in the earth, and just start to gain intelligence about where neutrinos were coming from in the earth, because one of the biggest sources of neutrinos on the surface of the earth is nuclear reactors. Nuclear reactors are a political hot topic these days, and if you can sort of gain intelligence from afar about who’s building what and what they’re doing with it.
[00:13:21]Glenn Jocher: Then you might not have to send in inspectors from the IAEA and so on and so forth. So ultimately, the idea was way ahead of its time, because our detector ended up being a disaster. It was too small. The neutrinos were too hard to catch, just like for that guy in the seventies. And the electronics that we put together to try and do this were a bit too immature.
[00:13:40]Glenn Jocher: So we bit off way more than we could chew. We spent several million dollars in government funding, and after several years, a lot of promises, and little to deliver, they sort of got really frustrated, and they realized this is more of a 10- or 20-year-horizon technology, and not something that they could whip out in a couple of years.
[00:13:57]Glenn Jocher: And so that’s when they gave the acts of the program and [00:14:00] Ultras was cast free and I had more free time on my hands.
[00:14:03]Gil Elbaz: Amazing. Okay, so that’s super interesting. Sounds like a very unique project, very unique directions, and also a lot of interesting additional information that you kind of learned along the way, which is quite cool.
[00:14:18]Gil Elbaz: So can you take us into Ultralytics a little bit? The goal of Ultralytics? What problems are you solving today from simulation to machine learning? Maybe talk a little bit about this process.
[00:14:30]Glenn Jocher: Yeah, absolutely. So after our particle physics efforts went down, I was cast off and I had all my eggs in that one basket.
[00:14:39]Glenn Jocher: All the Ultralytics contracts were with the NGA effort. And so now I’d developed a lot of AI expertise, I had a lot of simulation work under my belt, but ultimately a bit of dissatisfaction with the impact that I’d been able to have. So I was always frustrated that the work I did was stuck in a laboratory. It wasn’t part of a cool product that I could just take in my hands and show my friends. But I saw that the [00:15:00] vision AI space in particular was different, and it’s really similar
to the particle AI inference that I was trying to run. Um, so in the particle physics space, we didn’t have a lot of real-world examples, and so I relied on simulated data to train AI models, and then I turned around and I applied those trained models to the real-world data. And the result was a disaster, because there were subtle differences between my simulation and the real world, and in retrospect, this made a lot of sense.
[00:15:27]Glenn Jocher: Since the problem is not solved, we don’t understand completely what these antineutrinos are doing, and because of that, we can’t create an accurate simulation. But again, the vision space is different in this respect. We’ve got very mature companies building very accurate simulations of the real world, including in the game space with engines like Unreal.
[00:15:46]Glenn Jocher: And we also see a lot of companies investing heavily in that, companies like Waymo under Google, and I think even Tesla is in the simulation space for edge cases. So it’s always best to train on real-world data if you have it. But once your [00:16:00] model is relatively mature, for most cases you are gonna end up with a long tail where it hasn’t been exposed to a lot of examples.
[00:16:07]Glenn Jocher: So this could be like a kid running across the street in front of a Tesla car, and rather than collect real-world examples of that (which they probably have some of, but not enough), simulation now is probably at sufficient fidelity to contribute to that dataset and to training that model, since, of course, AI models need many, many examples to get the point across.
[00:16:26]Glenn Jocher: And in that respect, they’re not like us at all. So a human being, we can see one example and we get the picture pretty well, and we can extrapolate that really well. An AI model is not like that. It needs thousands of examples and to generate that many examples, this is where simulations can come in and lend a hand.
[00:16:43]Glenn Jocher: But now that I’d transitioned to the vision space, I didn’t need simulations anymore, because we had a world of examples in terms of big data sets that companies have put together, like Microsoft, which put together COCO, and other companies that have put together other driving data sets, for example.
[00:16:58]Glenn Jocher: We’ve got drone data sets like VisDrone, [00:17:00] and there’s just a whole great amount of open-source data sets you can get started from. And so I dove right in. I started recreating YOLO in Python, which is a big shift, because up until then the authors had been doing it in a framework called Darknet, which is written in C, which is very impressive that they put that together.
[00:17:17]Glenn Jocher: But it’s a bit hard to work with if you’re another user. So the original author, interestingly enough, split ways and sort of abandoned the YOLO project. And this was sort of an opportune moment, since I had also arrived on the scene at just about the same time. And so my open source efforts started to pick up traction, which wasn’t my original intention.
[00:17:36]Glenn Jocher: My original intention was simply to build examples for myself. And I thought GitHub was an interesting experiment, and I should try and open source my work, just to help me track it better. Because before that, I had no version control. It was just sort of Glenn code files all over my laptop. And if I spilled some coffee on that at Starbucks, my career was gone.
[00:17:53]Glenn Jocher: So I thought this was a nice backup plan for my work, and it might help me organize myself also, which I [00:18:00] always need help with. And so that was the original intention of creating repos on GitHub for me. But what happened, of course, is in the open source world, people stop by and see what you’re doing.
[00:18:11]Glenn Jocher: And it looked like a few people got interested. They started lending a hand, submitting a pull request, which was like a new thing for me too. So usually I hadn’t had anybody looking over my shoulder at the code. So this was like a little humbling. People were sort of pointing things out, they’re like, Glenn, this is bad.
[00:18:25]Glenn Jocher: You did this wrong. Here’s the typo. And at first it was a little frustrating, you know. But then I realized, well wait, this is actually really good. Like, they’re helping with the code, and we’re working together, and I think more eyes on the same problem are always a good thing. And so I started tidying up my open source work, and it turned from Glenn code into more usable code for the world.
[00:18:48]Glenn Jocher: And now there’s several hundred contributors, I think almost 400 or 500 across the different Ultralytics repos, that have jumped in on the effort here. And they’ve proven very valuable to us. Not only that, but they’ve also [00:19:00] brought in expertise.
[00:19:01]Gil Elbaz: Yeah. Just to add a little bit of context, you have a pretty impressive GitHub, with like an A++ rating, 12,000 contributors, a decently long streak, right?
[00:19:12]Gil Elbaz: Like 60 days in a row. All these cool parameters, right, that show the richness of the contributions and the amount of traction that this has gotten. And it’s quite impressive, and it seems like there’s a real community built around these repos, which is not easy to do. And it’s quite nice, because when open source communities take a code base and embrace it, it really means that the code base provides real value, that it’s unique, that there aren’t a thousand copies of it, and that the code is written well enough for people to add to it, right?
[00:19:46]Gil Elbaz: So it’s like kudos to you. Really, really nice job, and definitely hope to see this continue.
[00:19:51]Glenn Jocher: Yeah, thanks Gil. I think you’re right, like now it’s turned into something that’s pretty impressive. I think we just crossed 50,000 stars as an organization a few days ago, but when I [00:20:00] started, of course, nothing like this was in my mind.
[00:20:02]Glenn Jocher: I think if I’d known, say, three years ago what this was gonna take and the work I’d have to put in, I think I might never have started it. Because originally it was just a very small intention, just a few footsteps here, just a small project there, and sort of one footstep led to another footstep, and that turns into a journey over several years.
[00:20:19]Glenn Jocher: So what we have is, like you said, a community that we built up that’s very useful, and very mature code now for vision AI, for YOLO models. And a lot of people ask how we did this, and kind of what the recommendations would be if they wanted to reproduce something similar.
[00:20:36]Glenn Jocher: And at the end of the day, usually my answer is always the same. It’s just hard work. There’s no way around it, there’s no shortcuts. And I think building software code and open source repos is similar to any other aspect of life in that sense. It’s been pretty much seven days a week for me.
[00:20:51]Glenn Jocher: Maybe like 12-hour days, for the last three years almost in a row now. So I don’t really take vacations, and I don’t know, I try and answer [00:21:00] almost every issue that pops up too. So it’s not just writing code, it’s also kind of engaging with the users. Now that’s getting more and more difficult cuz there’s more and more questions, but I still try.
21:09 History of YOLO
and so forth. And it’s still widely used, actually. So if you go to, for example, Facebook’s detection repositories, you can see implementations of it, and now it’s been generalized over into [00:25:00] segmentation also. So it’s got a big user base and it’s widely used these days, though I think it’s definitely being supplanted by the YOLO models.
[00:25:06]Glenn Jocher: And what we’re trying to do is cover all the bases with our YOLO models that Faster R-CNN used to. So for example, now we do segmentation, now we’re doing classification, we’re doing tracking, and now we’re going over to keypoints. And so pretty soon there’s gonna be really fast, really accurate, simple-to-use YOLO models that do everything that R-CNN does,
[00:25:26]Glenn Jocher: but much faster and much simpler. Awesome.
25:28 YOLO for Keypoints
[00:25:28]Gil Elbaz: And can you dive into a little bit of how you take a YOLO model that was designed for detection and then leverage it for key point estimation, for example, or a different task like segmentation?
[00:25:39]Glenn Jocher: Yeah, that’s a good question. So it’s actually kind of simple. So the models are always composed of two things: training data,
[00:25:48]Glenn Jocher: and then the labels for that training data. So this is a type of neural network training which requires labeled data. And so the labels can really be anything you want. They can be boxes around objects, they can be [00:26:00] contours, they can be classes for the entire image, or they can be a combination of those.
[00:26:04]Glenn Jocher: And you can also add in some regression values. Like for example, the regression value could be things like maybe what time of year this image was taken, and it could be a value from one to 365. So you can get really creative here: as long as you have data and it has labels, you can train on that and have the model really spit out any type of thing that you want.
[00:26:23]Glenn Jocher: And so keypoints are one of those things. And these are points for locations, perhaps on the human body, on the bodies of animals, on the human face. And when you apply keypoints in those different domains, they’re called different things: when you apply them to the human body, it’s called pose; on the face,
[00:26:38]Glenn Jocher: it can be called facial landmarks. And there’s a lot of farming applications where they wanna do this for animals too. So if your cow is very sick, he’s not gonna raise his head up. And if you can give him some medicine, then you’re gonna make more money off that cow if he’s happy and healthy.
[00:26:53]Glenn Jocher: And so there’s just a lot of applications where you need to know pose, for animals, for human beings. [00:27:00] And that’s where keypoints come in. So for keypoints, you label the different vertices on the human body. So this could be people’s feet, knees, hips, face. You end up with, say, 10, 20, 30, maybe even more points on the human body.
[00:27:11]Glenn Jocher: And in every image, you label these on every person. And in the model itself, you simply append additional outputs to each output grid. So YOLO takes an image and it chops it up into, say, a chessboard. And at each square in the chessboard, it can potentially identify objects, classify them, and spit out more information, like the age of the person, for example, or perhaps their emotions on an emotive range.
[00:27:38]Glenn Jocher: And also the keypoints. You can get really creative here. And as long as you have labeled data, you update the loss function for how you want that loss to behave for the new data, and sometimes the visualization tools; you probably need to update those also, to see how you’re doing on that side. But if you update these three places, then a YOLO model is really capable of outputting anything that you [00:28:00] have labeled data for.
[00:28:01]Glenn Jocher: So there’s no ceiling here. And of course, we sort these into well-defined bins, like classification, detection, segmentation, just to keep things simple. But in reality, there’s a long tail of additional applications that you can do also. It’s really up to you and your imagination. And this has been another recurring theme in my work at YOLO and Ultralytics.
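The recipe Glenn outlines (append extra outputs to the label format, then extend the loss to cover them) can be sketched abstractly. This is a hypothetical toy illustration in plain Python, not Ultralytics code: the flat label layout and the simple squared-error terms are made-up simplifications of what a real YOLO-style pose head and its loss do.

```python
def make_label(class_id, box, keypoints):
    """Flatten a YOLO-style training label: class id, a normalized
    (cx, cy, w, h) box, then appended (x, y) pairs, one per keypoint."""
    row = [float(class_id), *box]
    for x, y in keypoints:
        row += [x, y]
    return row

def combined_loss(pred, target, n_box=4, kpt_weight=1.0):
    """Squared error over the box terms, plus a weighted squared error
    over whatever extra outputs (keypoints, regressions) were appended."""
    box_loss = sum((p - t) ** 2
                   for p, t in zip(pred[1:1 + n_box], target[1:1 + n_box]))
    extra_loss = sum((p - t) ** 2
                     for p, t in zip(pred[1 + n_box:], target[1 + n_box:]))
    return box_loss + kpt_weight * extra_loss
```

With 17 COCO-style body keypoints, the same row just grows by 34 numbers; nothing else about the scheme changes, which is the point being made here: the architecture doesn’t care what the appended values mean, only that labels and loss agree on them.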
[00:28:24]Glenn Jocher: People always ask us what YOLO is good for. And the funny thing is, like, we don’t actually create end solutions. We don’t have, like, a pornography detector straight off the shelf that you can buy and plug into Twitter or anything like that. But we create the tools to allow you to do anything.
[00:28:38]Glenn Jocher: And so it’s really the community that comes out with the applications and like all the fascinating use cases for it. And so that gives me this nice, really warm, fuzzy feeling. Like I mentioned before when I worked with the Army contract, I didn’t get that. Now I do. I really feel like I’m making tools that people are turning into magical products and really helping the world in a lot of different ways.
[00:28:56]Glenn Jocher: Awesome.
[00:28:57]Gil Elbaz: Sounds great.
29:00 Applications of YOLO
And can you actually talk about some of [00:29:00] these cool tools that have come out using the Ultralytics platform or the Ultralytics capabilities?
[00:29:06]Glenn Jocher: Yeah, yeah. There's a few interesting ones. So sometimes we get messages from people and they say, Hey Glenn, we just wanted to let you know we've created this really amazing product.
[00:29:14]Glenn Jocher: Or like, we're doing this with YOLO and it's working really well. And some of the more interesting ones are, well, I always prefer the ones that are making a positive impact on the world. So, there's a student group that is building underwater submersibles, and they're using YOLO to identify plastic in the ocean.
[00:29:31]Glenn Jocher: And fish it out with a robotic arm and scoop it up. And so if they build enough of these, then you could actually start cleaning up the ocean, sort of reversing a lot of the damage we're doing to it. There's a group called the Kashmir Conservation Society [Kashmir World Foundation], and they're working in different countries like Qatar and Pakistan, and they're building these cameras to detect endangered animals.
[00:29:52]Glenn Jocher: And they're helping track, for example, snow leopards in Kashmir, and endangered animals over in Qatar. [00:30:00] And they're tracking their movements. They're helping keep them safe from poachers. And so this is another amazing application. There's a student group that's created an app where you can count currency.
[00:30:09]Glenn Jocher: And so if you're blind and you get a bunch of bills back at the store, you don't know what's sitting in your hand. And there's an app that uses YOLO to count that currency for you and let you know how much you have sitting in your hand. There's people that are putting cameras into walking sticks to let elderly people know if cars are coming when they're crossing the street.
[00:30:27]Glenn Jocher: There's just so many things happening that I would've never imagined myself.
[00:30:32]Gil Elbaz: Cool. Very, very cool. So we’re gonna jump into general AI in a sec. One more topic on the detection side of things that is quite interesting is kind of where detection is going and where we see these practical applications going with respect to the technical side of things.
30:48 Transformer and Diffusion Models for Detection
[00:30:48]Gil Elbaz: For example, the end-to-end object detection with transformers (DETR) that was released by Facebook AI a while back seemed very promising. Diffusion seems like an extremely [00:31:00] promising capability as well. So how have transformers and diffusion models changed detection?
[00:31:09]Gil Elbaz: How do you see them impacting kind of the future of this trajectory?
[00:31:13]Glenn Jocher: Yeah, okay, that's an interesting question. So we saw a lot of buzz about transformers last year and they've been very successfully applied to different domains, language obviously, generative models also. But when you look at these types of applications, the FLOPs that go into an inference, into a training run, are tremendous.
[00:31:33]Glenn Jocher: And they’re much higher than the YOLO models that we have. And so it’s kind of simple to sort of like put the different models on a table and say, oh, these are AI models that do different things. But in reality, some of them are worlds apart and the resources that are required to train them, the training data that’s required.
[00:31:50]Glenn Jocher: And so transformer models are much more resource intensive in every stage, at the training stage, we need more data for them to generalize correctly. In the inference stage, they cost a [00:32:00] lot more to run and they’re generally much larger. And so we see models like DETR, which include transformers, but they actually underperform our own YOLO models in terms of speed and accuracy.
[00:32:10]Glenn Jocher: And the YOLO models that we have right now have not included any transformers. We've done a lot of R&D, so we just launched YOLO v8, and in the run up to that we did several months of R&D. And we tried to introduce attention layers in different parts of the model. We tried to introduce transformers and we weren't able to extract performance gains from those, again, because we go back to the compromise that I talked about before.
[00:32:33]Glenn Jocher: So if we're sitting in the laboratory and we wanna publish in an academic journal, we want the very highest accuracy we can get. So we don't care if our model balloons up really big, if it costs a million dollars to train, if it needs like 10 million images, all that's fine. We just want the very highest accuracy so we can publish this thing in Nature and get some sort of award recognition.
[00:32:52]Glenn Jocher: But if I’m, say, an app developer and I read that and I say, oh wow, this model’s super accurate. Let me try and use it in my app. Suddenly I realize that’s [00:33:00] not possible. But for me to realize that’s not possible is kind of difficult. Like I might need to actually try and implement it myself. I might actually waste some time.
[00:33:06]Glenn Jocher: And we've never actually been in the business of, say, pushing the maximum accuracy in a model, because what we wanna do is we always want the intersection of accuracy, speed, and ease of implementation, the simplicity of using that model. And so for that reason, we're never gonna be on one extreme.
[00:33:25]Glenn Jocher: We always wanna be in the middle ground, nothing in extreme. Actually, there's a temple in Greece, I think it's the temple at Delphi, and it has these maxims. There's a lot of maxims, but of the three most famous ones, I think the very first is "nothing in excess." And it's true in life. And it's also true in the work that we're trying to do here, because we want these models to be in millions of apps and thousands of products, and people using them for all kinds of things.
[00:33:50]Glenn Jocher: And so this is the same reason, for example, a Tesla car doesn't have an enormous battery that lets it go a thousand kilometers, because they could do that. But then it's not optimized for real world applications. It's [00:34:00] just optimized to win a competition in a sort of abstract sense. So I think there's a lot of potential there, but we need to explore it further.
[00:34:08]Glenn Jocher: And I think also kind of understand the differences between language models and generative models, which typically live in the cloud with access to a lot of resources, and these YOLO models, which are usually on edge devices. They're right there in your iPhone and they're not using much battery power and they're running in real time.
[00:34:23]Glenn Jocher: So these are two different classes of models and there's no right answer. It's just that in different domains, different types of models are, I think, best placed. And so some companies are doing amazing work, I think, in the generative space, in the NLP space, but a lot of that is in the cloud. And so a lot of these companies are well financed.
[00:34:40]Glenn Jocher: They’re well backed, and they’ve got access to enormous cloud resources that just your average developer or student doesn’t have. And so we want models that everybody can use. Not just as an API to request something, but to train the model, to place it in your app, to have it in your hand. That’s where CNN architectures continue to dominate.
35:00 Speed Bottleneck
[00:35:00]Gil Elbaz: Very, very interesting perspective. And speaking of speed, by the way, what is the speed bottleneck today? Why isn't YOLO v8 10 times faster than it is, for example? Or what do you think is the next breakthrough that might help us make it 10 times faster?
[00:35:15]Glenn Jocher: That’s a good question.
[00:35:16]Glenn Jocher: So we see a trend in the industry of larger, slower models, actually, not the other way around. And the reason for this is that usually when you compare two models, you look at the accuracy. This metric is head and shoulders above everything else. And so if you wanna come out with a new model and you want it to make a splash, you want it to be shared on Twitter and LinkedIn, then the accuracy needs to be pretty high, ideally higher than everybody else's.
[00:35:40]Glenn Jocher: And to get that, ultimately your model's gonna have to be a little bigger and a little slower. But if it's a little bigger and a little slower, people will forgive that. If it's a little less accurate, nobody's gonna forgive that. And so what we see is this sort of unfortunate trend where every new version of an architecture that comes out, you could call it an advance, but it's more an advance in the [00:36:00] direction of accuracy, and typically you're gonna have something that's slower and a little bigger.
[00:36:04]Glenn Jocher: And we've tried to balance these out in our own efforts too. We've tried to put a cap on what we're willing to accept. But also, on the other side, the hardware keeps advancing, right? So every year you have accelerators that are capable of more FLOPs, and that means things are going in the right direction, and the number of applications that a larger model is capable of serving will continue increasing in the future.
[00:36:27]Glenn Jocher: So I think it's the right direction to accept a bit of reduced speed and a bit of increased size for accuracy, but ultimately the trend is that you can't have everything at the same time. You can sometimes, but the typical rule of thumb is that if your model is faster and smaller, it's not gonna be more accurate.
[00:36:45]Glenn Jocher: If you accomplish all three of those, you've done something really amazing, and that does happen sometimes, but those types of situations are becoming less common since we're approaching a bit more of a plateau. So we've made staggering advances in the last few years, not just [00:37:00] on the software side, but also on the hardware side.
[00:37:01]Glenn Jocher: It's the coupling of the two that's really produced some of these amazing applications that are possible today. But I think as we get closer to, say, hundred-percent-accuracy models, of course, each new year is gonna bring incrementally less improvement. And the improvement that you do get is sort of like when you start getting closer to the speed of light: every additional bit of speed costs a lot more.
37:23 Simulated Synthetic Data Today
[00:37:23]Gil Elbaz: That’s super interesting. Yeah. From my perspective, we’ve seen a few things that do impact kind of quality without impacting speed too much for the most part. That’s on the data side, right? So we’re, you know, at DataGen we generate synthetic data that’s tailored for, you know, our customer’s needs.
[00:37:41]Gil Elbaz: They use our Python SDK and just generate whatever data they need with full ground truth. And so the cool thing is, using this, you could leverage, let's say, YOLO v8 and together ideally create a much more focused dataset, or a dataset that has much cleaner ground truth because it's all synthetic, [00:38:00] and pretty much generate the exact data you need to up the performance without changing the underlying architecture at all.
[00:38:06]Gil Elbaz: So we've seen different success cases here, both on facial recognition, with amazing performance, and on a bunch of different applications, like hair segmentation for example, where you have synthetic ground truth and the ability to get the very fine details around the hair segments.
[00:38:22]Gil Elbaz: Mm-hmm. Yeah, and eye gaze estimation. So there are a bunch of cool applications, but we see that on the data side, connecting it with the right architectures, you can get really, really nice performance.
[00:38:33]Glenn Jocher: Yeah, that's right. I haven't mentioned the dataset side too much until now, but if you go to the YOLO v8 repo, smack in the center you'll see an iterative loop that shows you how you should ideally be training and deploying models.
[00:38:45]Glenn Jocher: And there's only three parts. You start with a dataset, you train it, and then you deploy it. But this is a feedback loop, and it goes back to the dataset. The idea is that into your deployed model you should embed capabilities to detect, say, low-confidence [00:39:00] detections, edge cases. Feed those back into your dataset to help you augment those areas, and then retrain and redeploy.
[00:39:07]Glenn Jocher: So for real life applications, you want an AI model to always be living and breathing and improving just like we are. So you get outta bed, you accidentally learn something new you didn’t intend to, but you learn it anyways. And that’s how the human brain works. And so we want models to behave the same way.
[00:39:22]Glenn Jocher: Right now, they’re not like that, of course. They don’t actually learn unless you retrain. So this loop is the ideal way for a model to exist, and the tighter the loop and the more often you do this, then the better your results will be. So you’re absolutely right there. A lot of the accuracy metrics are dominated sort of by edge cases.
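The loop Glenn describes, dataset to training to deployment and back, can be sketched in a few lines. Everything below is a hypothetical stand-in, not an Ultralytics API, and a real system would deduplicate and relabel the collected edge cases before retraining.

```python
# Hedged sketch of the dataset -> train -> deploy feedback loop described
# above. All functions here are toy stand-ins, not real APIs.

def train(dataset):
    # stand-in: "training" just records the dataset size as the model
    return {"trained_on": len(dataset)}

def run_inference(model, image):
    # stand-in: return a fake (prediction, confidence) pair;
    # here "difficulty" doubles as the confidence the model would report
    return ("object", image["difficulty"])

def feedback_loop(dataset, incoming_images, conf_threshold=0.5, rounds=3):
    model = train(dataset)
    for _ in range(rounds):
        hard_cases = []
        for img in incoming_images:
            pred, conf = run_inference(model, img)
            if conf < conf_threshold:      # low-confidence detection: an edge case
                hard_cases.append(img)     # queue it for (re)labeling
        dataset.extend(hard_cases)         # augment the dataset with edge cases
        model = train(dataset)             # retrain and redeploy
    return model

dataset = [{"difficulty": 0.9}] * 100
stream = [{"difficulty": d} for d in (0.2, 0.4, 0.8)]
print(feedback_loop(dataset, stream)["trained_on"])  # 106
```

The tighter this loop runs in practice, the sooner the deployed model sees its own weak spots reflected in the training set, which is exactly the "living and breathing" behavior described above.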
[00:39:39]Glenn Jocher: Like for example, in the COCO dataset, there's a million people, but there's only, say, a hundred hair dryers. And so in this COCO accuracy metric that all the published papers use, if you could just improve your hair dryer detection a bit, you could jump a lot in those metrics. And so, if you could simulate more hair dryers, say pictures from different angles, lighting conditions, quality of the image, [00:40:00] then you'll be doing yourself a great favor and having a much more balanced model.
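The leverage of rare classes follows from how COCO's headline metric is computed: mAP is an unweighted mean of per-class average precision, so a hundred-image class moves the mean as much as a million-image class. A toy illustration (the AP values are invented):

```python
# Toy illustration: mAP is an unweighted mean over classes, so improving a
# rare class ("hair drier") moves the metric as much as improving "person".
# The AP values below are made up for illustration.

per_class_ap = {"person": 0.80, "car": 0.70, "hair drier": 0.20}

def mean_ap(aps):
    return sum(aps.values()) / len(aps)

before = mean_ap(per_class_ap)
per_class_ap["hair drier"] = 0.50   # better rare-class data lifts its AP
after = mean_ap(per_class_ap)
print(round(before, 4), round(after, 4))  # 0.5667 0.6667
```

A 0.3 gain on the rare class lifts the overall metric by 0.1 here, which is why targeted synthetic data for underrepresented classes can pay off so disproportionately.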
[00:40:04]Glenn Jocher: A lot of people think that to get the best result you need the best quality pictures, but that’s not true at all. To get the best result you need the most representative distribution of images that the model’s gonna encounter in the deployed space. And that requires a lot of variations. So it’s not just about the number of images, but it’s about the variation.
[00:40:22]Glenn Jocher: So if you're talking about hair, you'd like to see all colors, all angles, different lighting conditions, people wearing different things. And oftentimes that's hard to collect in real life. So I think simulated data, if the fidelity is sufficient, and fidelity is just how well it looks, if it's close enough to the real world, then I think it can definitely contribute to solving real world problems.
[00:40:43]Glenn Jocher: Definitely.
[00:40:44]Gil Elbaz: Yeah, we agree. And one of the interesting aspects is that early on, simulated synthetic data was of relatively low fidelity, and so you saw researchers lose confidence in it very early on, back in 2018, et cetera. Graphics has [00:41:00] made enormous leaps in the last few years, and in general, DataGen has pushed the boundaries of generating simulated synthetic data significantly.
[00:41:15]Gil Elbaz: And so what we've seen really is that the domain gap, or the visual domain gap, is almost negligible right now. It's at the level where it's not really a thing. We get exactly comparable results to real data if we use the same amount of data, and when we expand the data significantly, we can create a hundred x the amount of data and boost results significantly, which is quite cool.
[00:41:34]Gil Elbaz: And yeah, so just taking it back to the future, looking at this space, at AI at a much more abstract, high level, right? We see ChatGPT, we see the amazing work coming out of Stability AI, and the other amazing teams at Google, Facebook, and Apple are also coming out with really, really interesting things [00:42:00] on the edge device.
[00:42:01]Gil Elbaz: What do you see kind of as some of the next big milestones coming out and what are you excited about?
42:08 Sentience of AGI and Progress of AI
[00:42:08]Glenn Jocher: This is interesting. So I sort of sleep with one eye open, looking at the space, but for the most part I have my head down in YOLO, and so we're totally focused on improving YOLO models, doing everything we can to make them the best in the world at what they do.
[00:42:25]Glenn Jocher: But you're right, there's other domains where really fascinating things are happening. And I think one of my ultimate goals in my work in AI is to be able to contribute to something that can further human knowledge. And so this is what I was trying to do in particle physics. Ultimately, I was hoping that the work we would do might lead to new discoveries.
[00:42:42]Glenn Jocher: We might be able to publish research and this might expand the boundary of human knowledge in the particle physics space, help us understand more about the universe and how it came to be and our role in it. And I think that AGI can have a role in that. [00:43:00] Naturally the human intellect is, well, it's amazing.
[00:43:04]Glenn Jocher: It's also limited to what can fit inside your skull. And there's definitely a bottleneck from one person communicating to another person. I think Elon Musk has also mentioned the same, and I think this is a big part of his effort in trying to connect human minds directly to one another, electronically.
[00:43:21]Glenn Jocher: And so I think AGI does not have that same limitation. So I think you could put together, as I mentioned, organizations that are working with enormous resources, and the only limit there is the amount of funding that they have and the amount of hardware advances that accelerator companies like Nvidia can make.
[00:43:38]Glenn Jocher: And so in that sense, if you can pair a model that's trained on the corpus of human knowledge, like Wikipedia and all the published journals and all the research datasets, then I think it can possibly have the capability to generalize and extrapolate more than any single human being or team of human beings.
[00:43:59]Glenn Jocher: And I think [00:44:00] this could really help expand our understanding of the universe. This is definitely one of the trends that I'd like to see and hope comes out of AGI. I think a lot of the conversation is typically about whether it would be sentient. Would it be an actual individual? Would it have rights?
[00:44:17]Glenn Jocher: And so on. And I had an interesting conversation a few months ago where I was kind of surprised, but my own conclusion was that it doesn't matter. Just in the same way as if you put an artificial sweetener in your coffee, a lot of people just don't care. They simply care about the end result. And so if the end result of this product is that as a society it helps us improve, I think that's what matters.
[00:44:37]Glenn Jocher: And I think a lot of the other questions are maybe not up to us and more on the philosophical side. We just don't have the answers. But I'm excited about the potential. The technology has just advanced so, so rapidly. Again, not just the software, not just the algorithms, but also the hardware.
[00:44:55]Glenn Jocher: When I first started working with YOLO models, one of my main goals was to put one into an app. [00:45:00] And so we've got this really cool YOLO app. It's an awesome way for people to just try it and see what YOLO does right in your hand. And when I created the first one, about three or four years ago, Apple had just started getting into neural engines, and the iPhone had, I think, the first generation neural engine.
[00:45:13]Glenn Jocher: And so I created the app. I exported a YOLO model to CoreML, put it in, and I opened it up and it was so slow. It was ridiculous. It was basically useless. I was like, nobody's gonna download this on the App Store. It was running at about one frame per second. It was very frustrating. The lag was terrible, and the user experience was just not there.
[00:45:33]Glenn Jocher: And fast forward just four years, and now our largest models are running in real time on the latest iPhones right there in your hand, in excess of 30 frames per second. And these are the largest YOLO models on edge devices. And I think the battery capability of the latest iPhones will let them do that for at least maybe an hour or two.
[00:45:54]Glenn Jocher: So, this leap in performance in just a few years is just incredible. [00:46:00] And I'm really excited to see where this goes, not just in terms of hardware and software, but the synergy of the two, which is greater than the sum of its parts. And so we're living in very interesting days and I think the field we're in is very fascinating.
[00:46:12]Glenn Jocher: So I always wake up interested in what's gonna happen that day, what I'm gonna read, and what companies are doing.
[00:46:18]Gil Elbaz: Well said. Yeah, definitely. I'm also extremely excited about the progress. Every day there's a new story, a new model, new amazing results that just come out. Both visual results that blow your mind and chat results that are just incredible.
[00:46:32]Gil Elbaz: Definitely. What are your thoughts on ChatGPT, for example? I mean, that one, I have to say, surprised me. The quality of it.
46:42 ChatGPT, CLIP and LLaMA Open Source Models
Maybe on the side, like a traffic light, red, green, yellow or something like that, to say: this answer may be totally incorrect, watch out.
[00:47:39]Glenn Jocher: But other than that, I think they're doing amazing work there. On the other hand, I'd like to see more companies put technology out there for everybody to use. And so what we do with YOLO models is we try and make them accessible to everybody. And so we've got students with no resources in poor countries doing projects with the smallest YOLO models.
[00:47:57]Glenn Jocher: And they can do that. They can train them on CPU, [00:48:00] they can run them. And of course, companies like OpenAI, the name of the company simply caters to that concept. But it's meaningless. Nobody can see the underlying technology. Nobody can tinker with it unless you go through their APIs. And so this is the opposite of open.
[00:48:14]Glenn Jocher: This is one company, and not just a small company, but one of the largest companies in the world, Microsoft, controlling the technology. And that raises serious ethical concerns. We're doing the opposite, where we're releasing everything we do into the wild and we're letting society run with it. So I think these are two completely different games.
[00:48:32]Glenn Jocher: What we do is much more challenging from a business perspective, obviously, since we don't have tie-ins to enormous corporations and we don't have significant financial backing. But I think from an ethical perspective, I sleep much easier at night and I feel much better knowing that we're putting the capabilities into everybody's hands and we're letting them decide how to use it.
[00:48:52]Glenn Jocher: Definitely.
[00:48:52]Gil Elbaz: I mean, yeah, OpenAI, they started out by publishing, for example, CLIP, which was an amazingly [00:49:00] powerful model that was made completely public and super easy to use. But definitely it's a shame that we don't have access to the weights of ChatGPT, for example. Recently Facebook did release open source weights for their recent LLaMA model, which is supposed to be almost as powerful, if not more powerful, than ChatGPT with a much smaller footprint, so fewer parameters for their model.
[00:49:25]Gil Elbaz: But yes, it's very exciting to see this field. And from my perspective, I mean, there are quite a few forces that are pushing the open source side of things, including Stability, including Meta in this case. And so I'm hopeful that we're gonna see more and more of this come out and become more and more public.
[00:49:44]Gil Elbaz: But this is a great take on this, and definitely a very different approach to what you guys are focusing on. So just to wrap things up a bit, I'd love for you to maybe provide some recommendations for people starting out, folks that are at the [00:50:00] beginning of their journey. You had a very non-trivial journey, right?
50:04 Advice for Next Generation CV Engineers
[00:50:04]Gil Elbaz: Like we sometimes interview professors and we interview these super senior scientists at some of the big companies, but you have a very non-trivial path and journey. I'd love for you to provide some recommendations for folks starting out in the computer vision space. How do you recommend they start out?
[00:50:20]Glenn Jocher: Yeah, that's a great question. I get asked a bit for recommendations, and ML is an empirical science, and so that means that you essentially have to run an experiment to figure out what the result is gonna be. And so oftentimes I get questions from people who say, what would be the effect of training for fewer epochs?
[00:50:40]Glenn Jocher: Or if I tune this hyperparameter, or maybe I augment my dataset with this, what's that gonna do? And the funny thing is I really don't know. I've trained thousands of models and I've talked to, I think, over 10,000 individual people about their issues. But I think human assumptions and reasoning only go so far in this space.
[00:50:58]Glenn Jocher: And really [00:51:00] the answers to those questions are only gonna be answered by running that experiment and then looking at the result. And the result is the answer. This is different from theoretical sciences where, for example, we have a nice equation for gravity and we know how a ball is gonna behave. So if you go and ask me, if I drop this ball here on the surface of the earth, where's it gonna be in one second, you can just use a nice equation and get the answer.
[00:51:20]Glenn Jocher: But if you ask me, if I take out half my dataset and retrain, what's that gonna do, then I really don't know. You have to actually do that to get the result. And so what that means is that you should get your hands dirty. You should set up a simple experiment, make a very baby model, just train it and change a few things and see the result.
[00:51:41]Glenn Jocher: And so this is how you start to gain an understanding of how these models work, what they're capable of, and what their limitations are. It's made really easy these days by frameworks like PyTorch. You can create a really simple feed-forward model that's just maybe a hundred neurons, you can train it on a very simple tabular dataset, like house prices, and you can do that in just a few [00:52:00] seconds, and then you can take some things out and start to see how changes of the input affect the output.
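The baby experiment Glenn describes might look like the sketch below: a one-hidden-layer, hundred-neuron feed-forward net on a tiny synthetic "house price" table. It's written in plain NumPy with manual gradients rather than PyTorch to stay self-contained; the data, weights, and hyperparameters are all arbitrary illustrative choices.

```python
import numpy as np

# Tiny experiment: fit a ~100-neuron MLP to synthetic tabular data and
# confirm that training reduces the mean squared error.

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))              # e.g. area, rooms, age (scaled)
true_w = np.array([3.0, 2.0, -1.0])
y = X @ true_w + 0.05 * rng.standard_normal(200)  # synthetic "prices"

H = 100                                           # hidden width
W1 = rng.standard_normal((3, H)) * 0.1
b1 = np.zeros(H)
W2 = rng.standard_normal(H) * 0.1
b2 = 0.0
lr = 0.05

def forward(X):
    h = np.maximum(0.0, X @ W1 + b1)              # ReLU hidden layer
    return h, h @ W2 + b2

_, pred0 = forward(X)                             # predictions before training
for _ in range(500):
    h, pred = forward(X)
    err = pred - y                                # gradient of MSE w.r.t. pred
    gW2 = h.T @ err / len(y)
    gb2 = err.mean()
    dh = np.outer(err, W2) * (h > 0)              # backprop through ReLU
    gW1 = X.T @ dh / len(y)
    gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2                # gradient descent step
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred1 = forward(X)                             # predictions after training
print(np.mean((pred0 - y) ** 2) > np.mean((pred1 - y) ** 2))  # loss decreased
```

From here the experiment loop is exactly what the transcript suggests: change one thing (hidden width, learning rate, dataset size), rerun, and compare the final error; the answer comes from the run, not from intuition.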
[00:52:04]Glenn Jocher: And once you do that, say, a few hundred times, you'll start to become a semi-expert in ML. And then people ask you questions, and then you'll turn around and tell 'em you don't know. So that's my main recommendation, I think: just dive in, get your hands dirty, don't be scared, and start small, because a lot of the lessons that you can learn on tiny models extrapolate to really big models.
[00:52:24]Gil Elbaz: Very cool. Thank you very much. So the big takeaway is: get your hands dirty. Thank you very much, it's been great chatting. A super interesting conversation, and I wish you guys the best of luck.
[00:52:36]Glenn Jocher: Thank you, Gil. The best of luck also to you guys at Datagen, and if we’re ever in Tel Aviv, I’ll give you a call.
This is Unboxing AI. Thanks for listening. If you like what you heard, don’t forget to hit subscribe. If you want to learn about computer vision and understand where the future of AI is going, stay tuned. We have some amazing guests coming on. It’s gonna blow your minds.[00:53:00]