[00:00:00] Tadas B.: So the team really believed in the power of synthetic data, because they had huge success in the body tracking and hand tracking domains. Can we expand that to faces? When I joined, we were starting to look into research into faces. What can we do with faces? Can we track them? Can we track expressions, maybe eye gaze, head poses and things like that?
[00:00:25] Gil E.: Welcome to Unboxing AI, the podcast about the future of computer vision, where industry experts share their insights about bringing computer vision capabilities into production. I’m Gil Elbaz, Datagen co-founder and CTO. Yalla, let’s get started.
[00:00:47] Gil E.: Tadas Baltrusaitis is a principal scientist working in the Microsoft Mixed Reality and AI lab in Cambridge, UK, where he leads the human synthetics team. He recently co-authored the groundbreaking [00:01:00] paper DigiFace-1M, a dataset of one million synthetic face images for face recognition. Tadas is also the co-author of Fake It Till You Make It: Face Analysis in the Wild Using Synthetic Data Alone, among other outstanding papers.
[00:01:16] Gil E.: Tadas has pushed forward the field of simulated synthetic data in a significant way throughout his career. His PhD research focused on automatic facial expression analysis in especially difficult real-world settings. Beforehand, he was also a postdoctoral associate at Carnegie Mellon University, where his primary research lay in the automatic understanding of human behavior, expressions and mental states using computer vision. Tadas, I’m happy to have you join us today. Welcome to Unboxing AI.
[00:01:48] Tadas B.: Thanks for the introduction and great to be here.
[00:01:51] Gil E.: Wonderful. So maybe to dive back and kind of take us back into the early days, can you tell us a little bit about how you started out in this field of computer science, and when you got interested in diving into this direction, going into academia?
[00:02:06] Getting Started in Computer Science
[00:02:06] Tadas B.: Yeah, I was always interested in how we can apply technologies to solve real world problems and when I started my PhD, I was interested in understanding humans and sort of maybe applying that to smart interfaces. And we’re thinking, oh, maybe if we could recognize someone’s mental state, emotional state, maybe we could adapt interfaces appropriately if we noticed that someone’s bored or excited.
[00:02:31] Tadas B.: And when I started that I realized, hmm, actually the technology’s not there yet. We can’t actually do that. So that pushed my interest into more the computer vision, machine learning side of things. Let’s build technology that can do that, that can track faces, that can track your expressions, your eye gaze, head pose and other sort of objective markers of your behavior and been in that area ever since.
[00:02:54] Tadas B.: And also looking at how we can apply these technologies for various exciting applications.
[00:02:59] Gil E.: Very cool. So I wanted to dive into the challenge of inferring mental states from facial expressions and upper body. You had a few interesting papers on this topic that I had the opportunity to run through before this interview, and it seemed like the technology back then was very early relative to what exists today, which is one large challenge.
[00:03:20] Gil E.: And then there’s also the kind of subjective challenge, right? Understanding if someone’s happy, you know, they might be smiling, but are they happy? What is their actual mental state? And so I’d love to maybe understand both the challenges of these early days as well as the subjectivity and the actual challenge of understanding what the mental state really is.
[00:03:40] Inferring Mental States from Facial Expressions
[00:03:40] Tadas B.: Yeah, indeed. Even if we had perfect technology that recognized exactly which muscles are moving by how much and where exactly you’re looking, we might not be able to infer the mental state, because of things like you mentioned. So maybe when I started my PhD, I was slightly naive about the complexity of the problem, and now I’m sort of much more appreciative of the intricacies there, and smiling is a great example.
[00:04:04] Tadas B.: People smile when they’re embarrassed. People smile when they’re unsure. There’s a lot of social rules. Those social rules are dictated by the culture you grew up in. The culture you’re in. You might be behaving differently based on the context you’re in. So I think to be able to actually understand your mental states.
[00:04:24] Tadas B.: We need to factor all of those in and that’s really difficult. You need to have all the context, and even then there’s different types of your emotional or mental state. One is what you feel inside, and no one can read that. That’s, I don’t believe that anyone can read that. There’s the one that you intend to show.
[00:04:41] Tadas B.: There’s the one that’s perceived by the observer and they’re all slightly different. There’s noise added to the system at every step of the way. Some intentional noise, some unintentional noise. So yeah, it’s a fascinating topic. I’m sort of thinking maybe we need to be careful when tackling it and also really think about why are we doing it.
[00:05:00] Tadas B.: It’s not always obvious what an application would be of recognizing your internal state.
[00:05:05] Gil E.: Yeah, I saw that one of the papers was written with Rana el Kaliouby who went on to found Affectiva, which was a company kind of focused on understanding expressions and kind of the mental state of the person.
[00:05:20] Gil E.: It’s interesting. How is your experience kind of working with that, and do you think that that was really a precursor for the company itself, or was that one of the kind of initial directions that got, I guess you both interested in this direction?
[00:05:33] Tadas B.: I was fortunate enough to share the same PhD advisor as Rana.
[00:05:39] Tadas B.: And I was sort of the next PhD on that topic with Peter Robinson, and when I started I was continuing from her work, which was very exciting and, as we can see, has led to a lot of exciting real-world applications, focusing on a lot of really interesting topics. But it’s also interesting to see how maybe at the beginning it was more on, let’s detect emotions.
[00:06:03] Tadas B.: Let’s try detecting emotions, but then the focus moved more to, okay, for which applications would detection of particular emotions be important? And I think automotive is one good example, where you can say, okay, there are certain types of mental states that can occur in this situation. It’s a bit more constrained scenario.
[00:06:22] Tadas B.: It’s much easier to actually have actionable recognition of mental states.
[00:06:28] Gil E.: Definitely. And I think that that’s one really interesting thing. I mean, if we talk a little bit about faces in general and the space of understanding faces, I remember that, you know, when I started dealing with StyleGAN and StyleGAN 2,
[00:06:41] Gil E.: I really had, I think, an eye-opening moment where I suddenly understood what the distribution of frontal faces actually looked like, and how big it actually was relative to the capacity of the networks. And it was quite insightful for me to just start to understand this domain. And you’ve had an enormous focus, I think, around faces and understanding people throughout your career.
[00:07:05] Gil E.: I’d love to hear maybe some of the intuitions you had early on about this, and kind of how that’s evolved from your side, and some of the challenges that you still see today when dealing with faces.
[00:07:16] Challenges of Facial Expressions
[00:07:16] Tadas B.: Great question. And maybe one thing taken from my earlier work: you talked about StyleGANs and representations of faces, and that’s all just single images of faces.
[00:07:27] Tadas B.: And they look really realistic. They look great. We are able to model that quite well through synthetic data and StyleGAN-style approaches. But there’s so much in the motion of the face that we’re not really capturing. And one of the earlier works that I did was building on work in the psychology literature.
[00:07:46] Tadas B.: Understanding what signal motion carries. So if you attach reflective dots on people’s faces and don’t show the faces at all, but just record those dots and replay them back, people are still able to infer the meaning behind them. They’re still able to infer the intended emotion expressed behind them.
[00:08:04] Tadas B.: So there’s so much information in the motion that we’re really not modeling that well or not capturing or probably not even understanding all that well. So I think there’s gonna be a lot of exciting work to do in that space.
[00:08:17] Gil E.: Yeah. So you also worked a lot on OpenFace and the open-source project around creating this set of capabilities.
[00:08:25] Gil E.: If you wouldn’t mind, could you maybe describe OpenFace a little bit, how this project came about, and a little bit of the journey there? Because it was a multi-year project that had a lot of impact and is used by a lot of different applications.
[00:08:40] OpenFace
[00:08:40] Tadas B.: It started during my PhD. As I mentioned, I was interested in recognizing mental states and emotions and realized the technology wasn’t there, and there were some tools out there for tracking faces.
[00:08:54] Tadas B.: So internally we were using a tracker from a company called Neven Vision. Then they were purchased by Google, and, oh well, the tracker was no longer available. And I was realizing, okay, every team in the research world has their own tracker. No one’s really sharing. So for a newcomer in the field, it’s really difficult to get started without having this massive computer vision and machine learning background and building the tracker.
[00:09:16] Tadas B.: So I was thinking, well, that’s not a great situation to be in. And that’s what motivated me to build the trackers and actually release them to the academic community. And as I was doing research on facial landmark detection, gaze estimation, head pose tracking and action unit recognition, every small bit of that, I realized, oh, actually that could be put into a single tool and released to the academic community.
[00:09:40] Tadas B.: And it has seen a lot of success, because there really wasn’t anything out there like that. There were bits, there were some projects that, oh well, we do head pose tracking, there were some projects that do eye gaze, but not in a unified way, and sometimes in a very difficult-to-use way.
[00:09:59] Tadas B.: So a lot of practitioners in the field, they might not even be computer scientists, they just want a signal about your face, and if you give them an executable with a GUI to do that, that’s a huge boost for them. It came out of necessity, but we wanted to share it with the rest of the community.
[00:10:20] Gil E.: One thing, by the way, that was also interesting that I saw when I was going through it is that it’s built at least partially, but I think substantially, on MATLAB, which also kind of keys back to the era that it was developed in. Can you talk a little bit about how you transitioned? I mean, I’m assuming that you transitioned to Python at some point, but a little bit about that time and how it was looked at in the computer vision community and how it maybe transitioned over time.
[00:10:46] MATLAB to Python
[00:10:46] Tadas B.: Yeah, yeah. Like, of course, if I were to start this all over again, I probably wouldn’t be using MATLAB, but at the time, that was a tool I was comfortable with. All the training was done in MATLAB, and we had implementations of some algorithms in MATLAB, but there was also a C++ version to make it run fast and without having to have a MATLAB license.
[00:11:03] Tadas B.: Yeah. Like actually, even when I started at Microsoft, I was still using MATLAB quite heavily, because that was just my comfort zone. And then of course, it made sharing with others much more difficult. Reimplementing some of the work was more difficult, because people would ask, oh yeah, so how is the algorithm trained, and, well, here’s this MATLAB script and no one really knows how to run it.
[00:11:28] Tadas B.: In hindsight, it might not have been the best choice. But I enjoyed working with MATLAB. I actually like the GUI they have, like seeing all the variables and values and easily plotting things. And Python has that now. It didn’t as much before, but now there are these nice Python IDEs. I’m not one of those developers who is just happy with the command line.
[00:11:46] Tadas B.: I’m a very visual person, so I need these visual interfaces and maybe that was the driver.
[00:11:50] Gil E.: Yeah, definitely. I think that early on there was a strong appeal because of the visual aspects, and definitely now with Jupyter, but also many other tools, you can get a lot of this nice syntactic sugar and these visual aspects with inline code.
[00:12:06] Gil E.: But yeah,
[00:12:07] Tadas B.: There’s also step-by-step debugging. That’s what I liked in MATLAB, and now Python supports it quite nicely, being able to just step line by line and dive in. Oh, okay, what are the variables?
[00:12:18] Gil E.: That and also matrix multiplication, like there were a lot of nice features and I think really it stopped working at a point because it’s just, it wasn’t really a language at the end of the day, right?
[00:12:28] Gil E.: It was more of a program that had an internal programming language, but it wasn’t a programming language in and of itself. And of course, having the super expensive licenses did not do it any favors. I also made the shift; I started working with TensorFlow 0.12 early on, and since then it was pure Python.
[00:12:49] Gil E.: Yeah, super interesting. And maybe shifting a little bit: I saw you’ve also done quite a bit of work in multimodal machine learning, and we were talking before a bit about taking additional context when trying to understand facial expressions and emotions. So I’d love to understand how you got into multimodal ML, some of the challenges you dealt with there, and how you see this progressing.
[00:13:17] Multimodal Machine Learning
[00:13:17] Tadas B.: That’s exactly the reason why I started looking at multimodal machine learning, because of the importance of context: you can’t interpret a facial expression in a vacuum, on its own. Body posture is important,
[00:13:29] Tadas B.: speech tone, the language you’re using, that’s all very important. Humans process information multimodally. We have all our senses and we look at things more holistically. And I was very fortunate to do my postdoc in the multimodal research lab led by Louis-Philippe Morency, and they were doing a lot of really, really exciting work on multimodal machine learning.
[00:13:56] Tadas B.: And that’s where I started looking into it and combining different signals, and it’s really challenging. Sometimes the signals are contradictory. Sometimes you go into it thinking, oh yeah, I’m just gonna combine the signals and it’s gonna definitely boost my model performance, and it turns out, well, no, not really; maybe both signals carry redundant information.
[00:14:14] Tadas B.: You might be a bit more robust in noisy environments, but actually, they might not help that much. It was a really interesting field to work in, and it’s a huge field as well. That’s why we wrote the survey on multimodal machine learning: to try to provide a bit more structure for others, but also for ourselves, to
[00:14:38] Tadas B.: better appreciate this field. A lot of people think of multimodal machine learning as just, stack audio, video and language and do prediction, but it’s not only that. It’s the ability to move between modalities, to translate from one to the other, aligning the modalities, exploiting the signal between them. So it’s huge and really exciting.
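The naive "just stack the modalities" baseline mentioned here can be sketched in a few lines. Everything below (the feature dimensions, the five-class head, and the random vectors standing in for learned encoders) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for learned per-modality encoders (dimensions invented).
audio_feat = rng.standard_normal(64)    # e.g. prosody / tone embedding
video_feat = rng.standard_normal(128)   # e.g. facial-expression embedding
text_feat = rng.standard_normal(32)     # e.g. sentence embedding

# Naive late fusion: concatenate everything and feed one linear head.
fused = np.concatenate([audio_feat, video_feat, text_feat])
W = rng.standard_normal((5, fused.size)) * 0.01   # 5 hypothetical classes
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # softmax over the classes

print(fused.shape)                      # (224,)
```

If audio and video carry redundant information, this concatenation buys little over either modality alone, which is exactly the surprise described above; translating and aligning across modalities needs richer machinery than a single stacked vector.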
[00:14:59] Gil E.: I definitely agree that it started with very naive and simple approaches and, and as we go forward we understand that in different contexts you wanna use different modalities and you need to somehow take things that are in completely different structures of data and combine them, fuse them together, and yeah, there’s a lot of very, very interesting work.
[00:15:20] Gil E.: There's also a lot happening today in this field. As a company working on synthetic data generation, of course, and having very, very rich ground truth for all of the data that we’re creating, we definitely see at some stage the need to move into more multimodal data, meaning audio together with video, for example, together with maybe 3D input, etcetera.
[00:15:45] Gil E.: And then I think it could be quite promising. Do you see kind of a flow from multimodal to synthetics in a way?
[00:15:52] Multimodality and Synthetic Data
[00:15:52] Tadas B.: Yeah, that’s a great point. And yes, there’s a huge body of work on synthesizing audio, adding noise to it, and combining it with visual signals. That’s powerful, and I don’t know if we’re there yet.
[00:16:08] Tadas B.: Those communities have been mostly separate: people who synthesize audio and people who synthesize visual imagery. Being able to bring that together would be hugely powerful, because there are a lot of interactions there, and you could definitely build much more exciting things on top of that.
[00:16:26] Gil E.: Definitely. Maybe rolling into synthetics in a way: starting back in 2015, you started with a very cool morphable model of eyes together with Errol Wood, if I’m not mistaken, and you two have been working together for quite some time since then. Could you maybe describe a bit for our audience what a morphable model is, and why you wanted to build one, first for eyes and later for faces and all of these different things?
[00:16:54] Morphable Models
[00:16:54] Tadas B.: Yeah, so this was Errol’s PhD work. As you mentioned, I worked really closely with Errol Wood; we were in the same lab at Cambridge University, both doing our PhDs at the same time. And we were interested in eye tracking, and one of the approaches we decided to take for recognizing eye gaze, or tracking the eye region, was to use more computer graphics, or analysis-by-synthesis, approaches, partly because there just wasn’t that much training data available to tackle the problem.
[00:17:26] Tadas B.: So we were thinking, okay, can we try approaching it from a more graphics perspective? And for that we built a parametric model of an eye region. That’s what you refer to as a 3D morphable model, which is a statistical model that allows us to capture the variation of the region, sort of, what is the shape of the eye.
[00:17:44] Tadas B.: You have a couple of parameters that correspond to the width or the height of the eye. Some of the parameters may not be semantically meaningful, but it’s just a statistical model. Some of your listeners might be familiar with principal component analysis; it’s a very similar approach. You just do it on the geometry of the face and on the texture of it,
[00:18:03] Tadas B.: and there was a lot of work that Errol did actually manually retopologizing and aligning the textures. So there was a lot of manual work, but once that was done, we got quite a good morphable model and we could do analysis by synthesis. What that means is, given a new image, we’re trying to explain it using our statistical model, using rendering techniques, sort of, oh, what should the dials in my model be
[00:18:30] Tadas B.: to match that image. And that includes the shape of the eye, the texture, the direction of the gaze, the lighting environment and a couple of others. It’s quite a slow approach, because we’re using numerical derivatives rather than analytical ones, so you actually have to render a couple of images
[00:18:51] Tadas B.: with shifts in the model parameters and then see, oh, okay, I should make the eye a bit wider or the skin a bit wrinklier. But it led to quite a few exciting applications, and we released a dataset generated using our morphable model for training landmark detection algorithms for the eye region and for recognizing the gaze of the user.
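The two ideas just described, a PCA-based morphable model and analysis by synthesis with numerical derivatives, can be shown in a toy sketch. The "shapes" below are random stand-ins for real scan data, and the optimizer evaluates the model twice per parameter with shifted values, mirroring the render-twice approach from the conversation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a toy morphable model with PCA: 50 training "eye shapes",
# each 10 landmarks flattened to 20 numbers (random stand-in data).
shapes = rng.standard_normal((50, 20))
mean = shapes.mean(axis=0)
_, _, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
basis = Vt[:3]                        # keep three "dials"

def synthesize(params):
    """Generate a shape from the model parameters."""
    return mean + params @ basis

# Analysis by synthesis: explain a target with our statistical model.
true_params = np.array([1.5, -0.7, 0.3])
target = synthesize(true_params)      # pretend this is the observed image

def loss(p):
    return np.sum((synthesize(p) - target) ** 2)

p = np.zeros(3)
eps, lr = 1e-5, 0.01
for _ in range(500):
    grad = np.empty(3)
    for i in range(3):
        step = np.zeros(3)
        step[i] = eps
        # Numerical derivative: "render" twice with a shifted parameter.
        grad[i] = (loss(p + step) - loss(p - step)) / (2 * eps)
    p -= lr * grad

print(np.round(p, 2))                 # recovers roughly [1.5, -0.7, 0.3]
```

A real system replaces `synthesize` with an actual renderer producing pixels, which is why each gradient step is so expensive: every parameter needs extra full renders.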
[00:19:13] Gil E.: Yeah. And as you started with the eye region, it kind of evolved over time from your work that has followed up. I’d love to understand maybe the evolution of this. I think in Microsoft there was a clear need around the HoloLens. So maybe if you could guide us through that journey it would be quite interesting.
[00:19:34] Tadas B.: Yeah, so after that, Errol was one of the reasons I joined Microsoft, because he started there working on HoloLens a couple of years before me. And that team, which was led by Jamie Shotton, has done a lot of work using synthetic data. They’re the ones who built Kinect body tracking, which was trained on depth images,
[00:19:55] Tadas B.: trained on synthetic data. They also built articulated hand tracking for HoloLens 2, and by articulated I mean actually tracking the finger joints and how the arm articulates, not just recognizing gestures. And that was also trained on synthetic data. And Errol was heavily involved in that synthetic data generation.
[00:20:15] Tadas B.: So the team really believed in the power of synthetic data. And when I joined, we were starting to look into research into faces: what can we do with faces? Can we track them? Can we track expressions, maybe eye gaze, head poses and things like that? And because they had huge success in the body tracking and hand tracking domains, can we expand that to faces?
[00:20:40] Tadas B.: And mine and Errol’s collaboration seemed like a natural fit for pushing that forward. Maybe an interesting anecdote: we released the eye region model and synthetic data using that, and everyone was sort of, oh yeah, well, what about the rest of the face? Can you just zoom out?
[00:20:56] Tadas B.: Well, yeah, sounds easy, you know, we just built the eye model, so the rest of the face was just gonna fall out. Simple. But it was a big investment. It was a lot of work to build out our capabilities with faces, but I believe it paid off. And as you’ve seen from our published work, both Fake It Till You Make It and more recent work that we can talk about later, it definitely paid dividends.
[00:21:18] Gil E.: Definitely, definitely. Yeah, it’s not just zooming out, of course, and I appreciate that a lot. I did wanna ask, have you been hands-on on the 3D graphics side as well? You mentioned before that Errol did some retopology work himself. It was intriguing for me, because there are a few different skill sets that are needed.
[00:21:40] Gil E.: To really work on synthetic data well: one is understanding 3D graphics. Another can be categorized as algorithmic work, geometry and deep learning. And then also software engineering; there’s quite a bit of software engineering that goes into this. How have you built out your skill set to map out all of these different directions, or have you focused on a part of it and had different team members focus on each part?
[00:22:07] Skill Sets for CV Engineers
[00:22:07] Tadas B.: So I was fortunate enough to get my hands on all parts of it, and I was learning from a lot of members of the team. I learned a lot about graphics from Errol, and about machine learning from other members of the team. And yes, I’ve retopologized so many faces, ranging into the hundreds. It was great fun, and it’s highly recommended, because doing that makes you really appreciate the data.
[00:22:32] Tadas B.: We talked at the beginning about the diversity of faces. You start understanding what features appear, what’s difficult to model, what’s easier to model. So I think it’s a really useful exercise to do. Similarly, if you are involved in all parts of the synthetic data generation pipeline, you will know your data really well. Errol sometimes says, you know, it’s important to become one with the data. If you know your data really well, and you train a machine learning model yourself to generalize on real data and see the error cases,
[00:23:06] Tadas B.: you almost instantly know why that might be the case. You’d say, oh yeah, clearly, we have no glasses in any of our data; there’s something missing. If you’ve gone through all the stages, that becomes quite obvious. If you’ve only been given a dataset and told, okay, train something with it, you might not appreciate that, and you might not really have a good understanding of what’s in your data.
[00:23:29] Tadas B.: Even if you don’t dive deeply into all the subsurface scattering equations in Blender Cycles or something, that’s fine, but generating the data yourself and deciding what should be part of the dataset is really important.
[00:23:43] Gil E.: Definitely, and I love this deep intuition and kind of being one with your data. I definitely agree with that.
[00:23:49] Gil E.: Dealing myself with synthetic data for the past five, six years. I can definitely say that there’s an enormous amount of intuition that you can gain over time, but it takes a lot of time and a lot of both hands on work and a lot of failure cases that you need to see and understand and kind of really go through the motions many, many times to, to fully appreciate.
[00:24:10] Gil E.: And there are many similarities from what you’re saying to kind of how we experience it. And specifically in this domain. The experience is something that has a material impact on the quality of the data. So as we progress, we can really understand from real world customers that we work with, in our case, what is missing from the data and improve it substantially each time.
[00:24:33] Gil E.: Before we dive into the more modern works that you guys have, I’d love to maybe take a step back and ask, at a very high level, what is synthetic data, right? Just to frame it a bit. There’s the simulated synthetics that we’ve been dealing with, and we’ve been talking a lot about graphics-engine-based simulated data.
[00:24:52] Gil E.: There are generative models, like diffusion models now, but StyleGAN before, and other generative models before that. And then there’s also maybe this different approach of mixed reality: mixing real imagery together with some kind of simulated imagery. That can also include NeRFs inside of it; it can include other different capabilities.
[00:25:15] Gil E.: I’d love to maybe take a step back and just ask. What is synthetic data and maybe how these different parts may come together now in the future.
[00:25:25] What is Synthetic Data?
[00:25:25] Tadas B.: And in the past, yeah. I think you provided an excellent summary of that, and I definitely agree with it. Maybe an interesting aspect is: what are you generating the synthetic data for? You choose which kind of synthetic data makes more sense for you, and maybe the computer vision and machine learning fields were
[00:25:46] Tadas B.: dominated more by the StyleGAN, diffusion, or fully automatic, more generative methods. They create great results, great visually, aesthetically appealing results, but they may not lend themselves very well to solving downstream tasks. And that’s not necessarily their purpose, which is perfectly fine. With more graphics-driven synthetic data,
[00:26:11] Tadas B.: you have the ground truth annotations, which make it much more amenable to downstream applications. And as you mentioned, there’s also the mixture, where some of it is driven by generative approaches, some of it is driven by graphics, and you cleverly mix them together. Sometimes maybe some early stages of your pipeline are more driven by generative models.
[00:26:33] Tadas B.: But the entire pipeline is put together with graphics pipelines. It’s interesting how internally we always have to make that distinction clear: that we’re not doing generative synthetic data generation, we’re not doing GANs and Stable Diffusion. We actually call it visual effects at scale internally, because we use visual effects techniques.
[00:26:53] Tadas B.: That’s what it is in the end.
[00:26:53] Gil E.: Why exactly are direct generation techniques less suitable than simulated synthetics, or visual-effects-based techniques, for these downstream tasks? Why is that, in your opinion?
[00:27:07] GANs and Diffusion Models
[00:27:07] Tadas B.: So yes, the generative models are great at, as mentioned, creating visually appealing imagery that often you will not be able to tell
[00:27:16] Tadas B.: isn’t a real photograph. However, it doesn’t come with the extra metadata around it. You want to train a landmark detection model? Well, you don’t really have landmarks. There may be ways of getting that information, but it won’t be perfect, it won’t be consistent, it won’t be of the quality that you could get with visual-effects-style computer graphics.
[00:27:37] Tadas B.: The same goes for segmentation masks, normals, depth imagery, whatever ground truth you want to create. And it’s the flexibility of it: I can dream up a new ground truth label and it will be doable in the visual-effects-style world. With GANs or Stable Diffusion, you’ll need to think really hard about how to implement that, if it’s possible at all.
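The "free ground truth" point can be made concrete with a tiny sketch: in a graphics pipeline the 3D geometry and camera are known exactly, so perfect 2D landmark labels are just a projection away. The landmark positions and camera intrinsics below are invented for illustration:

```python
import numpy as np

# Hypothetical 3D facial landmarks in camera space (metres); in a real
# pipeline these come straight from the synthetic head mesh, so they
# are exact by construction.
landmarks_3d = np.array([
    [-0.03,  0.02, 0.50],   # left eye corner
    [ 0.03,  0.02, 0.50],   # right eye corner
    [ 0.00, -0.02, 0.48],   # nose tip
])

# Simple pinhole camera (focal length and principal point invented).
f, cx, cy = 800.0, 320.0, 240.0

def project(points):
    """Project known 3D points through the known camera: the resulting
    2D landmark labels are pixel-perfect, with no annotator noise."""
    x = f * points[:, 0] / points[:, 2] + cx
    y = f * points[:, 1] / points[:, 2] + cy
    return np.stack([x, y], axis=1)

labels_2d = project(landmarks_3d)
print(labels_2d.shape)   # (3, 2)
```

Segmentation masks, normals and depth fall out the same way, because the renderer knows exactly which surface produced each pixel; with a GAN, none of this bookkeeping exists.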
[00:28:02] Tadas B.: And the other aspect is the controllability of it. I can ask that 10% of my images should have glasses, 20% of them should have lipstick, and the like. With a StyleGAN, that’s really difficult, and it’ll be biased towards the data you trained it with.
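The controllability described here amounts to sampling scene descriptions from a distribution you specify yourself. A minimal sketch, with hypothetical attribute names mirroring the glasses and lipstick example:

```python
import random

# Requested marginals, as in the example above: 10% glasses, 20% lipstick.
SPEC = {"glasses": 0.10, "lipstick": 0.20}

def sample_scene(spec, rng):
    """Draw one scene description; a real pipeline would hand this to a
    renderer, which also emits matching ground truth for each image."""
    return {attr: rng.random() < p for attr, p in spec.items()}

rng = random.Random(42)
scenes = [sample_scene(SPEC, rng) for _ in range(10_000)]

glasses_rate = sum(s["glasses"] for s in scenes) / len(scenes)
lipstick_rate = sum(s["lipstick"] for s in scenes) / len(scenes)
print(round(glasses_rate, 2), round(lipstick_rate, 2))
```

With a GAN trained on an uncurated dataset there is no such dial: the attribute frequencies are whatever the training set happened to contain.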
[00:28:17] Gil E.: Yeah, I definitely agree. If you wanna touch on the biases in the data, that’s great as well.
[00:28:23] Tadas B.: Yeah. A lot of the generative models are trained on datasets like FFHQ or CelebA, which are people at red carpet events, or really highly curated, high-quality imagery of attractive-looking people, which is not always gonna match the real world. It’s great if you want to generate data like that. It’s not great if you want to represent everyone in the world.
[00:28:50] Gil E.: Definitely. Yeah, and these are all very much aligned with how we see things as well. Ideally, in the future, we do see some convergence allowing us to create additional realism on top of the graphics-generated data. But we really think that the ability to control data with code is a very powerful thing.
[00:29:13] Gil E.: And an important thing in order to actually solve real-world tasks. And so we’ve built, on top of our data generation capabilities, a simple API and then an SDK, a Python library in practice, that allows you, with very simple object-oriented code, to actually code the data that you want to be generated in a few lines, right?
[00:29:36] Gil E.: And we definitely believe in datapoint-level control, meaning you can control any data point at the most granular level, or you can keep it high level and request percentages of different things in the data. This is definitely very much aligned with how we see things.
[00:29:56] Tadas B.: One interesting thing, and this is how the visual effects industry has moved as well, is that they are actually using GANs or diffusion models, but at earlier stages of their pipeline. They use them for look development, sort of seeing what an actor would look like in a particular situation, but not necessarily plugging that directly into their pipelines; it acts as a visual reference, which is really helpful.
[00:30:19] Tadas B.: And maybe those are the first stages of how we can integrate these really exciting developments in the field into this type of tooling. Maybe it's for texture generation, maybe it's for hair simulation. There are a lot of places where you can plug them in without losing that control or losing the ground truth quality.
[00:30:42] Gil E.: That's very interesting. And now, connecting it to the Fake It Till You Make It paper, I'd love to understand. This was a big milestone in the history of synthetic data, especially graphics-based synthetic data. I think it opened a lot of people's eyes to the power of synthetic data.
[00:31:00] Gil E.: It was also kind of a stamp of approval from a major company like Microsoft that this approach can work for real world applications. But I’d love to hear a little bit on the inside, like how it was to develop this paper and if you guys understood ahead of time how unique this was and how important it was in the kind of history of synthetic data.
[00:31:24] Fake It Til You Make It
[00:31:24] Tadas B.: Yeah, as I mentioned before, I was fortunate to be in a group that already had a proven track record with synthetic data. Microsoft had already demonstrated the value of synthetic data for full-body tracking and for hand tracking, though in a more specialized domain with specialized types of imagery.
[00:31:43] Tadas B.: So we hadn't seen proof that this could work for faces and for visible-light cameras, as opposed to depth cameras or infrared cameras. We wanted to demonstrate internally that this works because, even though we had evidence that it works for Kinect body tracking and for hand tracking,
[00:32:02] Tadas B.: People were still a bit careful and dubious about face applications, because the images don't look as good. I can explain that partly by the fact that we're so sensitive to facial imagery. If a synthetic body or a synthetic hand doesn't look right, eh, people will be okay with it, and what will matter is evidence that it generalizes.
[00:32:24] Tadas B.: With faces, even if you have a bit of evidence that it generalizes, people will say, oh, but it does look a bit creepy, it doesn't look really realistic. You have to push past that and say, no, it's fine that it might be in the uncanny valley; for a DNN, that might not matter as much.
[00:32:42] Tadas B.: And that's why we wanted to build that evidence base, and it was tricky. When you train models on synthetic data, because the annotations are a bit different, you validate on real datasets and it can look like it doesn't generalize that well, but often that's because you're measuring the wrong thing. Sometimes your predictions based on synthetic data are better than the ones
[00:33:04] Tadas B.: based on real data, but the real data's annotated by people, who can't annotate strand-level hair with pixel-perfect annotations, or can't annotate facial landmarks in a very consistent way, or in an occlusion-aware way, or in 3D; they annotate in image space, which is perfectly valid, it's just a different type of annotation. And
[00:33:28] Tadas B.: being able to evaluate on real data and build that evidence base across several tasks and several datasets was a lot of work, but it definitely was worth it. And I think the community received it quite well. Though I should say that at first, getting the paper published was difficult.
[00:33:46] Tadas B.: Reviewers were hesitant about it as well.
[00:33:50] Gil E.: Why do you think that they were hesitant about it, by the way?
[00:33:53] Tadas B.: Because it was a combination of a lot of graphics techniques, and we were publishing it in a computer vision venue. So it's a bit more difficult to see the novelty, or reviewers see it more as a systems paper.
[00:34:05] Tadas B.: You put a lot of these things together and, look, it works, so what? You have to say, well, this is actually novel information. We're not just presenting how to put a system together; we're saying, look, this is doable and you can do it as well. And that type of work doesn't always get the response you expect from reviewers.
[00:34:26] Gil E.: Definitely. And there were two things that really stood out. When we saw this paper, we also got a lot of people approaching us about it, with two kinds of responses. One was: this is the end of computer vision, you can solve everything with synthetics, we don't need any more research in computer vision.
[00:34:45] Gil E.: We just need really good synthetic data. And then they asked us to help them out with this, which is great, and I'd love to touch on that in a second. The second response we got was: wow, there's a crazy domain gap here. There are images here — and you released, I think, around a hundred thousand images in the initial GitHub release.
[00:35:04] Gil E.: And so they sent me some of the images that looked crazy, quite scary actually, and said, look at this domain gap, how does this make sense? So I'd love to hear how you tackle both sides of this question. Is this the end of computer vision? And on the flip side, what do we do about this domain gap?
[00:35:21] Gil E.: What is the meaning of the domain gap? Can we measure it in any way?
[00:35:25] Domain Gaps
[00:35:25] Tadas B.: Yeah. It's not the end of computer vision. Even if you had perfect data, you'd still need to find approaches that can train on that data. And it's not the end of real data collection either; in the projects we're involved in, we always need real data to evaluate on,
[00:35:40] Tadas B.: because just training on synthetic data, you might not know how well you're performing. You could test on synthetic data, and there's some value in that in more niche settings, but the only way you're gonna learn how well you generalize to your settings is through real data collection.
[00:35:58] Tadas B.: And it's gonna inform you what gaps there might be in your synthetic data, and there always will be gaps. Human appearance is vast: not only the shapes of faces and how we look, but also things like hairstyles and clothing. And I genuinely hope that I'm never able to build a system that captures all of it, because that would mean human creativity had ended. We'll always have new types of garments, new hairstyles,
[00:36:26] Tadas B.: new facial tattoos, piercings, what have you. I hope that we never hit
[00:36:30] Gil E.: A very long tail, you're saying?
[00:36:32] Long Tails (Edge Cases)
[00:36:32] Tadas B.: Exactly, and one that's always expanding. So on that aspect, definitely not the end of computer vision; there's still a lot of work for us to do here. The other aspect is the domain gap, and I think NVIDIA did a great job at decoupling types of domain gaps in their Meta-Sim work.
[00:36:55] Tadas B.: They identified an appearance domain gap and a content domain gap. I don't remember if those are the exact terms they use, but one is the visual appearance: whether it looks like it was taken from a camera. We're not there yet. Hollywood is there, but they focus on an individual actor for a particular scene.
[00:37:14] Tadas B.: We need to do it for everyone. If you squint and look from far away, maybe you can confuse it for a real image, but we're not there. The other gap, which is equally if not more important, is the content domain gap. Does your data have lipstick? Does it have glasses? Does it have gray hair?
[00:37:31] Tadas B.: And that is also really important. If you have a lot of content, a lot of dimensions, it becomes a question of how to sample it. In our work so far, we've been sampling it all independently, so you might have faces that are not impossible, but maybe unlikely.
[00:37:51] Tadas B.: And people tend to focus on them. Internally we have to think, oh look, this person doesn't look very realistic, because its hair color, eye color, and skin tone may not co-occur in real life. And you think, well, it's not impossible, maybe just unlikely. I actually think that's good for your DNN: because it sees more unlikely data, it might generalize better.
[00:38:13] Tadas B.: I don't have evidence to support it; it's something we're interested in building more support for, but I don't see it as a bad thing. You want to sample those bookends, the ends of the long tail, in your data.
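The independent-sampling effect Tadas describes can be shown with a toy example. The attribute frequencies below are made up; the point is only that independently drawn attributes produce combinations that are possible but rarer than real-world co-occurrence would suggest:

```python
import random

# Toy illustration of independent attribute sampling. The frequencies are
# invented; independently drawn attributes yield "unlikely but not
# impossible" combinations regardless of real-world correlations.
HAIR = {"black": 0.75, "blond": 0.20, "red": 0.05}
EYES = {"brown": 0.80, "blue": 0.15, "green": 0.05}

def sample_face(rng: random.Random) -> tuple:
    """Draw hair and eye color independently of each other."""
    hair = rng.choices(list(HAIR), weights=list(HAIR.values()))[0]
    eyes = rng.choices(list(EYES), weights=list(EYES.values()))[0]
    return hair, eyes

rng = random.Random(0)
draws = [sample_face(rng) for _ in range(100_000)]
# Under independence this combination appears ~0.05 * 0.05 = 0.25% of the
# time, no matter how strongly the traits correlate in real populations.
rate = draws.count(("red", "green")) / len(draws)
print(f"red hair + green eyes: {rate:.4%}")
```

Sampling jointly from measured co-occurrence tables would match reality more closely, but, as discussed above, deliberately keeping the independent draws spreads coverage over the ends of the long tail.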
[00:38:27] Gil E.: I agree. And I think that there are some challenges, of course, supporting that with evidence.
[00:38:32] Gil E.: So for example, if you have women with beards in your synthetic dataset, that might be good for generalizing to very wide audiences, but you might not have any real data that can support that. So you don't really have quantifiable evidence about that part of the long tail — that it's even part of the long tail, or that it's an important part of it.
[00:38:56] Gil E.: And so it becomes a very tricky question. I did wanna maybe ask also about training versus testing. So you can use synthetic data for training models and you guys have done that across many, many different models. It’s also interesting to look at the flip side where you can test for very specific scenarios.
[00:39:15] Gil E.: You can test for biases, you could test for very different things. We actually released a paper at the end of 2021 at NeurIPS showing how to leverage our synthetic data to analyze biases between different ethnicities, genders, et cetera. But I'd love to hear from your perspective how you would go about using it for testing, maybe unit testing in a way, or have you guys not dealt with that side of synthetic data?
[00:39:42] Training vs. Testing
[00:39:42] Tadas B.: Yes, there are multiple parts to it; you can slice this in many ways. One of them is, as you mentioned, unit testing. Synthetic data is sometimes great for that, even when developing an algorithm, because you have perfect annotations of all the geometry. So you can just run it on the data and you know
[00:39:59] Tadas B.: how you expect it to work, and that's great. It helps you debug things, but that's not necessarily gonna give you a good indication of how well you're performing on real data. The other aspect is checking those extreme cases, as you mentioned. How do we react to pose variation? How do we react to lighting variation?
[00:40:17] Tadas B.: And even if your data's not perfect, one hypothesis we have is that the cliffs will be in similar places. A degradation in performance on synthetic data is likely to correspond to a degradation on real data; the absolute numbers will probably be different, but the relative trend is likely to stay similar, and that's where there's a lot of value. There's also a lot of value in prototyping hardware, where you want to know, okay, what part of the face will be visible?
[00:40:47] Tadas B.: Even though we may not have full confidence in the illumination properties of our synthetic skin, we have a good understanding of geometry variability. We know we're capturing a reasonable proportion of people in our parametric face model, so we can actually sample them and see how the extremes look under different cameras or different camera setups.
[00:41:10] Tadas B.: Do we still see the lips in those setups, or do we not? So yes, there's value, but you have to be very aware of the limitations, because it's so easy to overtrain: everything looks perfect on synthetic data, then you push it out on real data and it doesn't work.
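The "cliffs in similar places" idea can be sketched as a simple stress test: sweep a nuisance factor on synthetic data with perfect ground truth and look for where the error jumps. The toy error function below is purely illustrative, standing in for a real model scored against rendered ground truth:

```python
import math

# Sketch of a synthetic "stress test": sweep head yaw and look for the
# cliff where error jumps. toy_landmark_error stands in for a real model
# evaluated against the renderer's perfect annotations; its numbers are
# purely illustrative.

def toy_landmark_error(yaw_degrees: float) -> float:
    """Pretend mean landmark error (pixels) that degrades past ~60 degrees."""
    base = 2.0 + 0.01 * abs(yaw_degrees)
    cliff = 8.0 if abs(yaw_degrees) > 60.0 else 0.0
    return base + cliff

def find_cliff(step: float = 5.0, jump_threshold: float = 4.0) -> float:
    """Return the first yaw where error jumps by more than the threshold."""
    yaws = [step * i for i in range(int(90.0 / step) + 1)]
    errors = [toy_landmark_error(y) for y in yaws]
    for prev_error, yaw, error in zip(errors, yaws[1:], errors[1:]):
        if error - prev_error > jump_threshold:
            return yaw
    return math.nan  # no cliff found in the sweep

print(find_cliff())  # the toy model falls off a cliff just past 60 degrees
```

Per the hypothesis above, the absolute error values on synthetic data are not to be trusted, but the location of the jump is the signal worth carrying over to real-data testing.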
[00:41:24] Gil E.: Definitely, definitely. Yeah, it's important not to use only synthetic data for testing, but it's interesting how it can be one tool in the toolbox for testing, validating, and debugging different networks.
[00:41:37] Gil E.: If we look forward towards the future of synthetics, how do you see technologies like NeRF and diffusion, and maybe new advanced domain adaptation techniques? How do you see these coming together and aiding the use of synthetic data in the future?
[00:41:53] Future of NeRF and Diffusion Models
[00:41:53] Tadas B.: Yeah, so one of the use cases, as I already mentioned, is helping earlier stages of the pipeline, say generating textures, or maybe cleaning textures, or maybe
[00:42:06] Tadas B.: despecularizing textures or geometry. So there's help in the early stages of the pipeline. For later stages, that's a bit trickier. We played around with it a bit, where you pass the synthetic image through; we actually published at ECCV on that in the past. And yeah, you can get really nice, realistic-looking faces, but they change slightly.
[00:42:27] Tadas B.: So your ground truth annotations are no longer valid. They might change identity, they might change where they're looking, and it's gonna be a change driven by the real data the network was trained on. We've had StyleGANs trained on FFHQ make synthetic faces look at the camera, which is not what you want; you explicitly wanted them to maybe look to the side.
[00:42:55] Tadas B.: So there's still gonna be a lot of work to understand how to do that. But one other aspect that's maybe sometimes underappreciated: you pass your synthetic imagery through a StyleGAN, or through some diffusion network, and that allows you to debug a bit. Okay, so what is missing? What would make it look more real?
[00:43:16] Tadas B.: And that can drive your synthetic pipeline a bit. It's almost like a debugging tool.
[00:43:22] Gil E.: So you could debug on one hand the synthetic pipeline by running that through. And on the flip side, you could also understand the biases in the generative algorithm by running it through. So you kind of get both sides of the equation depending on what’s interesting for you.
[00:43:37] Gil E.: If I roll back a little bit: we definitely use diffusion in the earlier stages of our pipeline. For example, when generating the skin of people, we have a diffusion-based pipeline for doing that. We're leveraging these different technologies earlier on in the process, but of course combining everything to maintain a perfect ground truth, without eyes that look in different directions, et cetera.
[00:44:03] Gil E.: One interesting aspect that we saw when trying to leverage StyleGAN as a post-processing technique is that, because the domain is limited by what you trained it on, it actually hurts the ability to add new features over time. For example, we added masks when COVID came, and there weren't any masks in the StyleGAN dataset.
[00:44:22] Gil E.: And so it produced very, very scary images as output. And really the issue was that for every new feature we wanted to add to our synthetic data, we would need to gather real data of those features, and validating it was also kind of a nightmare. So we saw that this was not yet at a place that was really scalable.
[00:44:44] Gil E.: And that's one of the reasons we're not doing post-processing, or at least the post-processing realism layer that we've tried to build a few times already is not part of the main pipeline.
[00:44:58] Tadas B.: That's really interesting, and that's a great point. Sometimes presenting that to an audience that is more familiar with just GANs or Stable Diffusion becomes challenging.
[00:45:07] Tadas B.: I gave a talk at a CMU seminar presenting the Fake It Till You Make It work, and at the end some of the questions were: but where's the GAN? Which part of your model is the GAN? And you're like, none of it. It's really difficult breaking that association between synthetic data and generative models.
[00:45:26] Gil E.: Yeah, definitely. I think that, on my side, Stable Diffusion has shown for the first time that you can reach a broad enough domain that can be generated in a realistic way. StyleGAN before covered a very narrow domain that could be generated realistically. I think Stable Diffusion is the first time you see a very, very broad domain that can lend itself nicely to the intricacies of synthetic data generation.
[00:45:53] Gil E.: Running it in a naive way does not work; we've tried. But I am hopeful that this same technology can be the basis for the photorealistic, fully indistinguishable version of our synthetic data that still maintains all of the programmatic control that comes with the data itself.
[00:46:12] Gil E.: And so I know that you're also a big proponent of 3D human avatars leveraging these kinds of capabilities, potentially powering digital humans interacting. Instead of, let's say, us having this conversation over a Zoom-like experience on a 2D screen, we could potentially be in a 3D world talking to each other.
[00:46:37] Gil E.: But instead of being simple puppets, we would be fully photorealistic versions of ourselves. How do you see this playing out in the next kind of few years? What are some of maybe the big challenges as well that you see happening?
[00:46:50] Tadas B.: There are technological challenges that we're all aware of. We're still not
[00:46:55] Tadas B.: perfect at tracking bodies or faces from limited visual input. But even if we resolve those technical aspects, there are a lot of interesting questions. How do we place the environments? Do we both have to be in the same-shaped room to interact? Do we have to have the furniture in the same places?
[00:47:14] Tadas B.: Maybe not. How would that work? How will we present ourselves? If it's a stylized room, should our avatars be stylized in a similar style? What about merging different styles? How do we interoperate between those? I believe we'll address the technology eventually, but even when we do, it's
[00:47:36] Tadas B.: still gonna be a really challenging UI and user research question. What makes sense here? And how do you deal with, say, two users joining in through their HoloLenses or virtual reality glasses of some sort while another user is just on a Teams call? How do you integrate them? How do they avoid a degraded or asymmetric experience?
[00:47:59] Tadas B.: We already have that with remote work. Some people are in the conference room, some people are remote, and often the remote people have a worse experience; it's an asymmetric interaction. We want to minimize that; we want everyone to have the same feeling of presence in the meeting.
[00:48:17] Gil E.: When do you think we'll reach a point where we have really photorealistic avatars in this kind of
[00:48:24] Gil E.: VR or AR setting?
[00:48:26] Avatars and VR/AR
[00:48:26] Tadas B.: There are different axes to that. One is availability and scale, because I think we sort of already have some of this, as demonstrated by recent work from Meta, but those systems are really, really difficult to scale. If you have access to a rig that will capture a couple of minutes of your performance with hundreds of cameras, you can build something like that for yourself.
[00:48:49] Tadas B.: Of course, not everyone can build that. If you have just a mobile phone camera or a webcam, you don't have access to it. So I think the big question is when we will get to a stage where we can build this from really simple enrollments and have it accessible and available to a lot of people.
[00:49:08] Gil E.: And what do you think is the timeline on that?
[00:49:10] Gil E.: Like, more or less, where would you put your money?
[00:49:14] Tadas B.: I'm hesitant with predictions because we always get them wrong, but we're definitely not there yet.
[00:49:22] Gil E.: I mean, I think the proof is that we're still all using these 2D screens to talk with each other, and not using VR or AR devices.
[00:49:30] Gil E.: So I think that's really the proof. When I see that everyone in my company is asking for VR or AR devices instead of computers, that's probably gonna be the turning point.
[00:49:46] Tadas B.: And it's interesting, because it's a really high bar to beat. As a technologist, you approach it and think, oh yes, we'll put people in VR and it's gonna be amazing, you'll have this 3D world. But the baseline people compare with is video calls, which, you know, have their problems, but are pretty good.
[00:50:04] Gil E.: Yeah, yeah. You can communicate efficiently; they're definitely pretty good. Amazing. So to wrap up, I'd love to ask the question that we ask all of our guests. What would you recommend for someone new coming into the space of computer vision today? What are some of the things that have really helped you along your journey? Advice that you think would help people in their careers, in the early days of their academic careers, or when they move to industry?
[00:50:39] Advice for Future CV Engineers
[00:50:39] Tadas B.: Don't be afraid to look at the data, and focus on the data. So often I've seen people get distracted: oh, I'll just try this slightly different architecture, or this slight tweak to the optimizer.
[00:50:51] Tadas B.: Oh, maybe even a big tweak to the architecture. Really have an appreciation of what your training data and your test data are, because performance is typically driven by the quality of your data. I know it's maybe not the most glamorous of activities to clean or organize your training or test data, and to check whether some of the test annotations are wrong, or some of the training annotations are wrong, but often
[00:51:15] Tadas B.: the biggest gains will be there. And once you do that, yes, there's huge value in improving algorithms as well. But I think a bit of focus on the data, and appreciating data work, matters; we often ignore or underappreciate that work even though it's hugely important. I know there are movements for more data-centric AI now, but that's strangely late considering the state of the field.
[00:51:43] Gil E.: Thank you very much. So your main takeaway is: focus on the data, understand the data, and appreciate the data. I love it. Thank you very, very much, Tadas. It was a pleasure to have you on Unboxing AI.
[00:51:55] Tadas B.: Great. Thank you for having me. It’s been a pleasure.
[00:51:58] Season One Wrap-Up
[00:51:58] Gil E.: That’s a wrap on season one. [00:52:00] We’ve come a long way this season and covered many topics.
[00:52:02] Gil E.: We started with our inaugural session, a great discussion on solving machine learning problems with Anthony Goldbloom, the co-founder and former CEO of Kaggle. Next up was a chat with Lihi Zelnik-Manor, a professor at the Technion and the former general manager of Alibaba Israel, about the intersection of academia and industry.
[00:52:23] Gil E.: We took a turn into medical AI with Idan Bassuk, the VP of AI at Aidoc, who spoke to us about saving lives with deep learning and robust infrastructure. And of course, no CV podcast could go without talking about autonomous driving: Vijay Badrinarayanan, the VP of AI at Wayve, joined us to talk about Wayve's end-to-end machine learning approach to self-driving.
[00:52:45] Gil E.: On our fifth episode, we hosted Michael J. Black from the Max Planck Institute for Intelligent Systems. Michael brought us back to the early days of body models, and to how avatars will revolutionize our everyday lives. Or Litany, a senior research [00:53:00] scientist at NVIDIA, focused on 3D, the future of 3D generative models, NeRF, and how multi-modal models are changing computer vision.
[00:53:09] Gil E.: In our second-to-last episode, we entered the world of SLAM by talking to the father of SLAM, Andrew Davison, a professor of Robot Vision at Imperial College London. And that brings us to today's episode, where I want, again, to thank Tadas for joining us and for his insightful conversation on synthetic data.
[00:53:27] Gil E.: I wanna give a big shout out to the team at Datagen for all of their assistance and support on producing this podcast. Thanks to everyone who is a big believer in this concept from the beginning. It was my pleasure to host these conversations with such top tier professionals, pick their brains and learn from them.
[00:53:43] Gil E.: I’m looking forward to the future of computer vision now more than ever. Join us next season when we continue to bring you the most interesting and brightest practitioners and thinkers in computer vision. It’s gonna blow your minds. Yalla. I’ll see you all at season two.