Body Models Driving the Age of the Avatar – with Michael J. Black

About this Episode:

In this episode of Unboxing AI, I host Michael J. Black from the Max Planck Institute. We speak about body models, his journeys in industry and academia, representing all human body types, and the age of the avatar. Michael talks about the early days of computer vision, his experience commercializing body models through his startup, Body Labs, and how the metaverse and our avatars will revolutionize our everyday lives.

TOPICS & TIMESTAMPS

[00:39] Guest Intro
[01:41] What are body models and why are they so useful?
[04:17] Human interpretability – important or not?
[05:32] Real use cases for body models
[10:54] History of body model development leading to SMPL
[19:21] Body model development beyond SMPL: MANO, FLAME, SMPL-X, and more
[22:11] Edge cases: dealing with unique body shapes
[24:45] Early days of computer vision
[27:37] Working at Xerox PARC
[29:59] Shifting to academia
[31:30] The vision for Perceiving Systems at MPI-IS
[34:15] Innovation and team structure at Perceiving Systems
[37:40] Perceiving Systems – similarities to a startup
[40:38] Founding Body Labs
[45:30] Body Labs’ Acquisition by Amazon
[47:24] Distinguished Amazon Scholar role
[49:03] About Meshcapade
[50:05] What is the metaverse?
[51:52] The age of the avatar
[56:32] Career Tips for Computer Vision Engineers

LINKS AND RESOURCES

Michael J. Black @ MPI-IS
LinkedIn
Google Scholar
Twitter
YouTube

Papers at CVPR 2022
BEV: Monocular Regression of Multiple 3D People in Depth
OSSO: Obtaining Skeletal Shape from Outside
EMOCA: Emotion Driven Monocular Face Capture and Animation

Body Models
SMPL
FLAME
MANO
SMPL-X
STAR
SCAPE

About Meshcapade
Website
GitHub
Instagram

About Perceiving Systems
Website
Overview Video

GUEST BIO

Our guest is Michael J. Black, one of the founding directors of the Max Planck Institute for Intelligent Systems in Tübingen, Germany. He completed his PhD in computer science at Yale University, his postdoc at the University of Toronto, and has co-authored over 200 peer-reviewed papers to date. His research focuses on understanding humans and their behavior in video, working at the boundary of computer vision, machine learning, and computer graphics. His work on realistic 3D human body models such as SMPL has been widely used in both academia and industry, and in 2017, the start-up he co-founded to commercialize these technologies was acquired by Amazon. Today, Michael and his teams at MPI are developing exciting new capabilities in computer vision that will be important for the future of 3D avatars, the metaverse and beyond.

ABOUT THE HOST

I’m Gil Elbaz, Co-founder and CTO of Datagen. In this podcast, I speak with interesting computer vision thinkers and practitioners. I ask the big questions that touch on the issues and challenges that ML and CV engineers deal with every day. Along the way, I hope you uncover a new subject or gain a different perspective, as well as enjoy an engaging conversation. It’s about much more than the technical processes – it’s about people, journeys, and ideas. Turn up the volume, insights inside.

Guest Speaker:

Michael J. Black

Director of the Perceiving Systems Department, Max Planck Institute for Intelligent Systems

Michael Black received his B.Sc. from the University of British Columbia (1985), his M.S. from Stanford (1989), and his Ph.D. from Yale University (1992). After post-doctoral research at the University of Toronto, he worked at Xerox PARC as a member of the research staff and area manager. From 2000 to 2010 he was on the faculty of Brown University in the Department of Computer Science (Assoc. Prof. 2000-2004, Prof. 2004-2010). He is an Honorary Professor at the University of Tübingen and one of the founding directors at the Max Planck Institute for Intelligent Systems in Tübingen, Germany, where he leads the Perceiving Systems department. He was also a Distinguished Amazon Scholar (VP, 2017-2021). His work has won several awards including the IEEE Computer Society Outstanding Paper Award (1991), Honorable Mention for the Marr Prize (1999 and 2005), and all three major test-of-time awards: the 2010 Koenderink Prize, the 2013 Helmholtz Prize, and the 2020 Longuet-Higgins Prize. He is a member of the German National Academy of Sciences Leopoldina and a foreign member of the Royal Swedish Academy of Sciences. In 2013 he co-founded Body Labs Inc., which was acquired by Amazon in 2017.

Episode Transcript:

Guest Intro [00:39]

Gil Elbaz: Today’s guest is Michael J. Black, one of the founding directors of the Max Planck Institute (MPI) for Intelligent Systems in Tübingen, Germany. He completed his PhD in computer science at Yale University, did his postdoc at the University of Toronto, and has co-authored over 200 peer-reviewed papers to date.

His research focuses on understanding humans and their behavior in video, working at the boundary of computer vision, machine learning, and computer graphics. His work on realistic 3D human body models, such as SMPL, has been widely used in both academia and industry.

And in 2017, the startup that he co-founded to commercialize these technologies was acquired by Amazon. Today, Michael and his teams at MPI are developing exciting new capabilities in computer vision that will be important for the future of 3D avatars, the metaverse and beyond. It’s so great to have you on the podcast, Michael.

Michael J. Black: Oh, it’s my pleasure. It’s great to be here. 

What are body models and why are they so useful? [01:41]

Gil Elbaz: Wonderful. I think that one of the most interesting topics in your research that I’ve had the opportunity to read very, very deeply about is the topic of body models. Can you explain for the audience a little, what are body models and why are they so useful? 

Michael J. Black: Happy to. So in the field of computer vision, traditionally, people thought about humans in terms of very simple representations, like just the location of their joints in the image, very sparse. Or maybe humans were represented as a set of cylinders. So each limb of the body could be approximated by a cylinder or some other very simple geometric structure like that.

But that’s really not what humans look like. And it turned out to be difficult to get computers to estimate the body when they didn’t really know anything about the body. So, back in 2007, we started working on estimating detailed 3D bodies from images. That’s really difficult. If you think about a traditional 3D graphics mesh of a human body, it’s got, you know, thousands of vertices, thousands or tens of thousands of triangles, and the idea that you would take an image and somehow get back all of these vertices and triangles is kind of crazy. So, to make that possible, you need what we call a low dimensional model, a parametric model, something with a few parameters, on the order of tens, not thousands, that you can reliably estimate from images. So, the first body models we built and the body models we build today are based on 3D scans of humans.

So thousands of them; now we’re up to over a million 3D scans of people. And from these 3D scans, we take a traditional graphics mesh of a body, a template mesh, and we register or align it to all of the scans to bring them all into correspondence. And once we’ve done that, we can learn a statistical model using standard machine learning techniques, even very simple ones, to produce a model that captures the variation across the human population, both in terms of our body shapes, how you and I differ, and how we deform as we move, which is really critical for many problems. So that, in a nutshell, is a body model: it’s a function, a mathematical function, that takes a small number of parameters, typically pose and shape parameters, and gives you back a 3D mesh.
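
To make that concrete, here is a minimal sketch of calling a body model as a function, using the open-source smplx Python package from MPI (a sketch, assuming you have separately downloaded the SMPL model files; the path below is a placeholder):

```python
import torch
import smplx

# Load SMPL; "models" is a placeholder path to the downloaded model files.
model = smplx.create(model_path="models", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)         # shape parameters (10 by default)
body_pose = torch.zeros(1, 69)     # 23 body joints x 3 axis-angle values
global_orient = torch.zeros(1, 3)  # root orientation

# The model is just a function: tens of parameters in, a full 3D mesh out.
output = model(betas=betas, body_pose=body_pose, global_orient=global_orient)
print(output.vertices.shape)       # torch.Size([1, 6890, 3]) for SMPL
```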

Human interpretability – important or not? [4:17]

Gil Elbaz: So from your perspective, is it important for these body models to have parameters that are human readable and interpretable? Or is it the priority just to have a very dense representation of the low dimensional human structure?

Michael J. Black: Great question. Traditionally, these models were built by humans in computer graphics, for example, for the entertainment industry to make Spider-Man or the Hulk.

In that case, they are human interpretable. There are some controls that distort various parts of the body, and when you’re hand animating something, that’s important. So our insight was really that those models are great. The graphics industry, the games industry, everything is based on them. So we wanted to build a model that’s completely standard and plugs and plays with everything else.

But there’s no reason for a computer vision system to necessarily have this interpretability. And so our models don’t necessarily have that. What we’re looking for is really a compact model that makes it easy for a computer vision algorithm to take pixels and produce the parameters accurately.

Real use cases for body models [5:32]

Gil Elbaz: These examples from the entertainment industry that inspired body models are perfect ways to illustrate their usefulness.

It would be great if you could provide us with a few additional use cases, examples of things outside of the entertainment industry that have been done with body models. 

Michael J. Black: Yeah, absolutely. Just one little thing about models like the Hulk: the way this works is that an animator uses the dead standard technique in graphics called linear blend skinning, and linear blend skinning has all kinds of artifacts.

And so what an animator does is they go in and add correctives that depend on a particular keyed pose and that fix the artifacts. And then they’re animating the Hulk and they see another artifact, and they add another little corrective that fixes that for another pose. And pretty soon they have hundreds of these correctives, all triggered by different poses.

And it’s unwieldy at some point. So the approach we took with our SMPL body model, which stands for Skinned Multi-Person Linear model, uses the same kind of correctives, but rather than, you know, building them one at a time, it learns them all together to make a compact set of them.
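
As a rough, schematic illustration of the mechanics (the function and variable names here are illustrative, not SMPL’s actual code): linear blend skinning transforms each vertex by a weighted blend of the body part transforms, and the correctives are offsets added in the rest pose first.

```python
import numpy as np

def skin_with_correctives(template, weights, part_transforms, correctives):
    """Schematic linear blend skinning with pose correctives.

    template:        (V, 3) rest-pose vertices
    weights:         (V, K) skinning weights; each row sums to 1
    part_transforms: (K, 4, 4) rigid transforms of the K body parts
    correctives:     (V, 3) corrective offsets for the current pose
    """
    # Correctives are applied in the rest pose, before skinning.
    v = template + correctives

    # Homogeneous coordinates so the 4x4 transforms apply directly.
    v_h = np.concatenate([v, np.ones((v.shape[0], 1))], axis=1)   # (V, 4)

    # Blend the part transforms per vertex, then transform each vertex.
    blended = np.einsum("vk,kij->vij", weights, part_transforms)  # (V, 4, 4)
    posed = np.einsum("vij,vj->vi", blended, v_h)[:, :3]
    return posed
```

A hand animator authors the correctives pose by pose; SMPL learns them all at once from scan data.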

So, in terms of applications of this, there are many of them. The early adopters of these accurate body models came because, remember, in graphics quite often you’re trying to make Spider-Man or the Hulk, which isn’t very much like a real human, and the models we built were all trying to model humans really accurately.

And so there was a lot of interest from the psychology community and from the medical community. In psychology, we had collaborators who were working on anorexia nervosa. And there was a theory that anorexia was a disease of perception: that people would look at their body and they would see it incorrectly.

They would see themselves as fatter than they actually were. And if it were a perceptual problem, maybe you could treat it by training people to perceive their body differently. So, together with our collaborators, we did VR experiments in which the patients looked at their bodies in 3D, and we were able to show that they actually perceived their body shape correctly, their body shape and the shapes of others.

And this was really useful for that community to move on from a hypothesis that didn’t really work out. 

Gil Elbaz: Seems like a very data driven approach for one of these more psychological soft use cases. Very unique to see this cutting edge technology of body models applied in this way. 

Michael J. Black: Yeah. There’s another one which I really like.

And it’s also in the medical domain, and that’s our body models being used for the early diagnosis of cerebral palsy in infants. So, infants, you know, they move around in what looks like kind of a random way, but it turns out that if you are a trained observer, the kinds of motions that an infant makes reveal whether they have motor difficulties that could be a sign of cerebral palsy.

And if it’s detected early enough, early intervention can really make a difference in the lives of these people as they grow up. But there are very few people who are trained as experts in detecting these motor movements. And so we, again, worked with a bunch of collaborators who are experts in this.

And we developed a system with a child body model called SMIL (pronounced “smile”) that tracks the babies in 3D using RGB-D, so color and depth imagery. And then we developed a system that takes their motions and classifies whether they might be at risk for cerebral palsy or not. And I think that’s another neat example. I would never have come up with that on my own, but by making this kind of technology available, we get people coming to us and saying, I’ve got an idea, and I know how I could maybe use this for solving my problem, which is very exciting.

Gil Elbaz: That’s extremely exciting. The moment that you said babies. Wow. How did you guys create a body model with babies? 

Michael J. Black: Oh, yeah, that was really hard actually. So, the student who was working on it was very clever, because getting babies into your lab and putting them in a scanner is tough: they don’t stand still.

They won’t lie still. They’re constantly moving, and it’s actually a real challenge to capture the data. So he developed a method to learn a body model directly from RGB-D data, not full 3D scans, but just the RGB-D. And that worked out really nicely. So, I’m pretty happy with the results.

And there are lots of other applications, you know, in industry; many people are using this model. I think something like six of the top 10 NASDAQ companies have licensed it; it’s in wide use. And a lot of the interest there is in clothing design, clothing sizing, and applications in that space.

History of body model development leading to SMPL [10:54]

Gil Elbaz: Can you share the history of these different body models even before SMPL? What led to that? What was the novelty introduced by each new model along the way? And, you know, any color that you can add to the development of these body models. 

Michael J. Black: Yeah. So it all really started with the release of a dataset called CAESAR.

This was about the year 2000, and it was actually the US Army that had scanned a lot of people. They brought them in and they used an old-fashioned scanner, but they scanned the whole body and created a dataset of several thousand people. And the first group to really exploit this and work with it was a group out of the University of Washington.

This guy, Brett Allen, did his PhD on this, and they had the very first 3D body models trained from this dataset. But the real breakthrough then came in 2005 with the PhD thesis of a guy named Drago Anguelov at Stanford. He came up with a model called SCAPE, and SCAPE did something really important. It took 3D scans of static people, like the CAESAR dataset, though we didn’t have access to that at the time, plus scans of people in different poses, and learned a model that was factored. By factored, I mean you could represent body shape variation independently of the changes in your body shape due to your posture. This was really critical, and all following body models have exploited this idea.

It made training possible, and it made inference from images better, though they didn’t explore that yet. So SCAPE was revolutionary, and that’s what got me thinking about this. I was consulting for Intel at the time, and a former postdoc of mine there said, hey, you know, there’s this thing at Stanford, this SCAPE body model, and it looks kind of interesting.

And so I said, oh my gosh, yeah, this would solve so many problems. So we implemented a version of SCAPE and used it in our first paper on this, in 2007, to estimate body shape from multi-view imagery. That was the first time, I think, that anybody had estimated a parametric model of the body from images. But SCAPE had a whole bunch of problems. It was a very interesting model, but it was based on triangle deformations.

And these deformations were applied effectively independently to all of the triangles in a big soup. And then you had to reconstruct the body by solving a least squares optimization problem. And you were never guaranteed that the body that came out actually preserved any of the properties, like limb lengths, that you might wanna preserve.

So it was the case that limbs would change length. It looked good to the eye, but it was problematic to work with, and it wasn’t compatible with any of the standard game engines or graphics software.

Gil Elbaz: It sounds like it would be hard to do anything temporal with that kind of model. The moment that you have someone moving their hand, the length of their limbs changes, that’s really not ideal.

Michael J. Black: Yeah, definitely, and we tried to fix it up. We had a version called BlendSCAPE and we had something called Lie bodies, which improved on it. But fundamentally we needed, I thought, a better body model. And the criteria I had in mind were that it had to, first and foremost, be, well, accurate, at least as accurate as SCAPE. Second, it had to be compatible with existing software.

So I wanted it to plug and play in game engines, in Maya and Blender and Unity and everything. It had to be lightweight and fast to compute, because we wanted it to work in VR and other environments. And it had to be something that you could easily infer from images. I didn’t want this least squares optimization step in the middle.

And we tried all kinds of things. Over several years, we had a model called fast that never got published. We did all sorts of stuff until we just went back to: what is it that everyone is using in the graphics industry today? And it was linear blend skinning with these pose correctives that I was describing earlier, and the problem with that was that it was all hand designed. And so our insight was really to take such models and say, how can we learn this from data?

Gil Elbaz: So, to clarify, you guys were trying to build a data driven version of these corrective blend shapes. 

Michael J. Black: Exactly. So we wanted to learn the template shape of the body that would then be deformed.

We added these blend shapes for body shape to change the body. We wanted to learn the weights that you use in linear blend skinning. We learned the joint locations, because body shapes change: your joints are in a different place than my joints, so your joints have to be a function of your body shape.

We learned that, and then we learned the pose correctives. And here was the big problem: how to parameterize pose. The joints of the body, or the limbs of the body, can be described in lots of different ways. You can use Euler angles, or an axis-angle representation (Rodrigues), or quaternions; there are all kinds of representations.

And what we needed was a representation that would be related to shape deformation in a really simple way, because we wanted it to be a simple model. And there it turned out, oddly, that we could linearly relate the shape deformation of the body to the elements of the rotation matrices describing the part rotations.

And this was the best thing we came up with. We had many different formulations, but this was dead simple. Remarkably, when we trained it, even though it sounds more restrictive in many ways than SCAPE, it was more accurate than SCAPE. So it ticked off all the boxes that we needed: accuracy, speed, compactness, and full compatibility.
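
For readers who want the math: roughly, in the notation of the SMPL paper, the posed template adds learned shape and pose blend shapes to a mean shape, with the pose blend shapes linear in the elements of the part rotation matrices, and the result is then posed by standard skinning (a sketch that omits details such as the learned joint regressor):

```latex
T_P(\beta, \theta) = \bar{T} + B_S(\beta) + B_P(\theta), \quad
B_P(\theta) = \sum_{n=1}^{9K} \left( R_n(\theta) - R_n(\theta^*) \right) P_n, \quad
M(\beta, \theta) = W\!\left( T_P(\beta, \theta),\, J(\beta),\, \theta,\, \mathcal{W} \right)
```

Here R_n(θ) are the 9K entries of the K part rotation matrices, θ* is the rest pose, P_n are learned corrective offsets, and W is linear blend skinning with weights 𝒲 applied at the shape-dependent joints J(β).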

Gil Elbaz: There are two really impressive aspects of this from my perspective. You guys had the foresight, especially back then, to connect this to standard tools like Blender and Unreal, and even Unity, and I would probably attribute some of the success to that. And the other super impressive part, from my perspective, is trying really, really hard to create something simple.

Yeah. I think that many times in academia, we see very complex structures for representing various things. Creating a model that’s actually simple is sometimes much more of a challenge.

Michael J. Black:  I’ve had this experience many times in my career where there’s an almost innate drive among academics to do something really complicated and fancy.

Like, it’s kind of fun, and I have done that in my career. I was trying to solve a problem, and I came up with this very fancy nonlinear particle filtering technique to solve it. And I was describing it to someone and they said, have you just tried a Kalman filter? And it was like, no, but I should have; of course, that’s the first thing I should have done.

And of course the Kalman filter beat my fancy, crazy nonlinear particle filter. And so I’ve had to learn that lesson again and again: start with the simple solution, and then when it breaks, develop something fancier. But yes, you’re absolutely right. It’s a disease in academia to overcomplicate. But if you want people to adopt your stuff, if you want it to have an impact in the world, then if it’s compatible, you have a much higher chance of it actually working out.

Gil Elbaz: I completely agree with that. That’s a very important insight. So before we dive into the body models themselves, each of these models has its own website with information and videos on each of them, super high quality. And so for all of my listeners, you can find the links to these sites in the show notes.

Body model development beyond SMPL: MANO, FLAME, SMPL-X, and more [19:21]

Gil Elbaz: Michael, starting with SMPL, can you take us through the history of these body models that you’ve released?

Michael J. Black: Yeah, so we started, of course, with a limited set of data and the SMPL model. But one of the reasons I was interested in 3D body models for computer vision in the first place is because humans interact with the world. They touch things to manipulate them. Previous models were focused on skeletons, and we don’t often touch the world with our skeletons, right? So we needed surface models. But the thing we manipulate the world with most is our hands. So we needed to add hands to the model. And we did that by scanning a lot of hands, the same way we did with bodies, and building a model just like SMPL, but focused on hands.

And that’s called MANO. Then we combined MANO with SMPL and we had SMPL-H, so now we had bodies and hands. But the other important thing we do with our bodies is communicate with each other, and we do that through gestures, of course, but also a lot through our face. So we needed models that had articulated faces that were expressive.

And so we did the same thing. We scanned lots of heads, we built a model just like SMPL from head scans, and called that FLAME. And then we put all of those together into a model called SMPL-X, where the X stands for expressive. And that’s a nice package, because now we have something that really can represent a lot of the details of what humans do with their bodies.
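
As a rough sketch of what that combined parameterization looks like in code, again using the smplx package (a sketch, assuming separately downloaded SMPL-X model files; the path and parameter sizes are illustrative defaults):

```python
import torch
import smplx

# Load SMPL-X; "models" is a placeholder path to the downloaded files.
model = smplx.create(model_path="models", model_type="smplx",
                     gender="neutral", use_pca=False)

output = model(
    betas=torch.zeros(1, 10),            # body shape
    expression=torch.zeros(1, 10),       # facial expression (from FLAME)
    body_pose=torch.zeros(1, 63),        # 21 body joints, axis-angle
    left_hand_pose=torch.zeros(1, 45),   # 15 hand joints each (from MANO)
    right_hand_pose=torch.zeros(1, 45),
    jaw_pose=torch.zeros(1, 3),
)
print(output.vertices.shape)             # torch.Size([1, 10475, 3])
```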

Along the way, of course, we also did this SMIL model for infants, and we also did similar things for animals. The animals were a challenge because, even if you think scanning babies is hard, scanning animals is even worse. So we had to come up with ways of building an animal model without having access to animals.

And it’s continuing. You know, SMPL had some issues. That’s the great thing about getting it out and having it used in industry: companies will come back to you and say, well, look, I did this and this weird thing happened, and you’re like, oh yeah, that’s a problem. So we have trained another model called STAR, and it fixes a bunch of the issues that SMPL had.

It’s trained on much more data. And these pose-corrective blend shapes are spatially localized. They weren’t local in SMPL: you could wiggle the foot and something would happen on the neck, or something like that. So you could model long-range spurious correlations in the data, and STAR fixes that.

And we have one more model coming that’s still under review. So we hope that we’ll get another model out.

Edge cases: dealing with unique body shapes [22:11]

Gil Elbaz: We’re definitely looking forward to seeing that. How do you guys think about edge cases, people with unique body shapes? Can your model work on those types of people, or do we need a separate solution for each edge case, each unique body type?

Michael J. Black: Yeah. That’s a really important question. You know, we started with the CAESAR dataset, which is what most people in the field have used, and then we added SizeUSA. But these datasets were recorded several years ago, and they’re normative in the sense that they represent sort of the majority of the population.

They don’t represent everybody. For example, they classify gender as male and female; many among us do not classify gender as male or female. So one of the things we did early on was provide three body model representations, one male, one female, and one gender neutral that could represent anybody. But the model can’t represent everybody.

It can’t represent people who are amputees. It can’t represent little people. It can’t represent people with scoliosis. You know, there are just many things that it can’t do. And it is trained largely on 18 to 65 year olds, so it doesn’t even represent really elderly people. But, you know, we are an academic institution, and what we are doing is showing what you can do, right?

And I have had groups come to me and say, look, our population is different from what your model represents; can you help us build a model for this? And indeed, you’ve gotta collect the data, and then you can turn the crank and build these specialized models.

So I think if a company is really going to push this at scale and really wants to cover the whole world, they should, you know, go do that. In academia, you can’t really afford to do this, but a big company could, and could generate a broader model.

Gil Elbaz: Interesting. 

Michael J. Black: Yeah, I think it’s a really interesting and important issue.

Interestingly, CAESAR was collected in 2000, but it was based on the US census that was taken in the nineties. You know, even the US population has changed dramatically since then. It’s gotten heavier, for example. So there are really not a lot of obese people in the CAESAR dataset.

So it underrepresents a very common population today. 

Gil Elbaz: Yeah. This is a lot of food for thought. 

Early days of computer vision [24:45]

Gil Elbaz: So you started your academic research back in the eighties and nineties. Can you tell us about the early days of research and your work at Xerox? In these interviews, I love taking a step back to before everyone was all about deep learning and, you know, constantly heads down on the next paper.

I think that it provides a lot of perspective.

Michael J. Black: Yeah, that’s a great question. And I feel very fortunate. I started my career at a time when the field was sort of wide open. And nothing worked, right? Everyone was doing things that they thought were interesting, and it was very open because nothing worked. You know, you’d do a little experiment, it would work on one or two images.

Somebody else would be doing an experiment on a couple of other images. It was hard to get data, and no one was making any money, right? There was no market for this stuff; it just really didn’t work. Fast forward to today, where the stuff works. It works at scale; it’s being used in multiple industries. I think it’s really rare in a scientific career to get to see that arc.

The problem isn’t solved yet, so there’s still fun stuff to do, which is great. But you know, to really go from not understanding it at all, to having things that are out there being used every day, is super exciting. I started out my career looking at optical flow. Optical flow is the 2D motion that you observe in an image between frames.

How did all the pixels move from one frame to the next? This is what my advisor had done. And I found it a fascinating problem, but after I graduated, I thought, should I keep just doing what my advisor did? Well, it’s probably not the best career move. So I thought about it a lot, and I decided that really the thing that’s most interesting to me is how humans move.

It’s human motion that’s really interesting, and I could bring my knowledge about optical flow to this problem. So I started working on facial motion and expression recognition, and in ’94, ’95, with Yaser Yacoob from the University of Maryland, we had a system that would track a face in films and television.

It would compute the motion of the parts of the face and, from that, really reliably estimate facial expression. And this is, like, the mid nineties, and this thing worked. I was actually pretty excited about that, because getting computers to understand us, to understand our emotions, feels like a critical step to making computers full partners with us.

So I got hooked on it at that point and have been working on it ever since.

Gil Elbaz: Just getting these things to work at that time was a challenge. Even the concept of working with digital images at a certain scale was non-trivial. 

Working at Xerox PARC [27:37]

Michael J. Black: Yeah. So when I was a graduate student, I worked on optical flow, and I needed video sequences, right? But I had no way to get a video sequence into my computer. There were a handful of people who had done this at great expense with special equipment, and everybody used just those, like a handful, maybe five or six sequences. I got sick of them. I was working on the same stupid sequences again and again, and they were tiny, like 64 by 64 pixels.

But 64 by 64 pixels is what would fit on a Connection Machine, this massively parallel architecture. It was a revelation to me when I was at Xerox: I finally got a Silicon Graphics workstation, and this SGI had the ability to take video input and digitize it. That opened up my world.

I could actually start to do experiments. It was very hard to be a computer vision person in the old days. 

Gil Elbaz: So you founded the digital video analysis group at Xerox. What were you guys working on back then? 

Michael J. Black: So I had started there as a member of the research staff, and my boss was Dan Huttenlocher. Dan was great, but he also had a job at Cornell and eventually left the leadership position to me.

And even though I was young, it was a great learning experience working in a big company like that, and I learned a lot about how to manage research and how not to manage research. It’s a tough thing in a corporate environment to really do basic science. But I managed to make the case for creating a new group, because Xerox had rebranded itself as The Document Company.

And I was adamant that documents are not just pieces of paper that come out of a printer or get photocopied, right? A video can be a document. But for a video to be a document, you have to have the same concepts that you have for other kinds of documents. You have to be able to edit it. You have to be able to search it. You have to find things in it. And so you need some sort of semantic analysis of what’s in the video to really start treating it as a document. And so I made this case to Xerox, and they thought, okay, good enough, I guess. And they let me start a group on that. And so I was able to hire some really smart people, and I had a lot of fun.

Shifting to academia [29:59]

Gil Elbaz: So in 2000 you then left Xerox and you shifted back to academia to become a professor. Can you tell us a little bit about that journey and why you moved back to academia? 

Michael J. Black: I had a lot of fun at Xerox until I didn’t. It’s almost a tautology that great research labs happen in industry when that industry has something close to a monopoly, you know, whether it’s Bell Labs or Xerox or Microsoft. When they’re just really flush with money, they really start thinking about, well, how can I do something for the public good? And they invest in research.

Those monopoly-like situations tend to wane over time, and then research environments change in industry. They become much more project focused, much less creativity driven, much more bottom line. This is just a natural cycle. And when Xerox PARC became more like that, it became a much less interesting place for me.

But I had always wanted to be in academia, and I had always tried at Xerox to continue publishing and maintaining an academic record that would allow me to make that move if I needed to. So I was very fortunate to be able to do it, and Brown was a super place to work. It’s a fantastic school with absolutely amazing students.

The vision for Perceiving Systems at MPI [31:30]

Gil Elbaz: That sounds amazing. So you’re one of the founding directors at the Max Planck Institute, where you lead the Perceiving Systems department. What’s the vision there? And what was your motivation for opening it?

Michael J. Black: Several things led me there. I had no connection to Germany before I came here. I had visited once or twice, you know, the good beer, nice cakes.

I didn’t really know much else about it, and I didn’t know anything about the Max Planck Society. But a colleague and I at Brown had written a grant proposal to create what we thought was the world’s first four dimensional body scanner. It had a very cool technology, and we submitted it to the National Science Foundation.

And we wrote a good proposal that came back very highly rated. It said the proposal’s great, the principal investigators are great, but we just don’t see that anybody would ever want a four dimensional body scanner. Even though we had explained why, people just couldn’t imagine it. And it was really frustrating for me, because I could see that this could change a field and could open up new fields, and I couldn’t get the money to build it.

You know, it was gonna be a million dollars or something, and I couldn’t get the money to build it. And then Max Planck came along. At Max Planck, in addition to a tenured academic position, you also get a budget for the rest of your career. This is really unusual and very special. It’s a substantial scientific budget, and you can use it at your discretion.

You don’t have to ask anyone for permission. That’s the idea of the Max Planck Society. They call it the Harnack principle, named after one of the early presidents. And the principle is: you hire people, and you are not focused on projects. You just hire the people you think are the best, get out of their way, and let them do stuff.

So that was a big motivator for me to come to Germany. And one of the first things I did was try to build a 4D body scanner, because there wasn’t one. And I brought in a couple of companies that at the time had pieces of the technology that I thought could solve this. I had a meeting with these two CEOs and I said, look, I want you guys to work together and build me this thing.

And then it turned out that 3dMD was able to build it, and they built the first one for us. And now lots of people have 4D body scanners. Many of the big companies that are interested in the metaverse, for example, have technology that’s way better than what we have now, and it doesn’t seem crazy anymore.

Innovation and team structure at Perceiving Systems [34:15]

Gil Elbaz: That sounds incredible. A good kind of crazy. So you have amazing students with advanced degrees working on all sorts of different problems. And every conference we see a whole new wave of papers come out of your team. How do you keep up this velocity of innovation? Also, how do you kind of structure the workforce that you’re in charge of?

Michael J. Black: That’s a good question, and it’s another reason I came to Max Planck. In the US, back in 2005, ’06, ’07, I was looking at how the field was evolving, and companies like Microsoft were beginning to dominate the conferences with really excellent research. Microsoft had opened up a research lab in Asia, in China, that was producing a tremendous amount of good work.

And I saw that these companies had something that we didn’t have in academia. They had engineers who could dedicate themselves to a project and work together with scientists to make something. And in the US academic system, you get grants that last for three years, you can’t hire permanent staff on them.

There’s no way to go out and hire programmers to build that kind of thing. And I thought, well, academia’s gonna fall behind industry if we don’t figure out how to solve this, but I had no way to solve it. But at Max Planck, I’m free to use my budget however I want. And, so that’s one of the things that I do in my department.

And what we do for the Institute as a whole is, in addition to scientists, we hire software engineers. We have something called a software workshop where we hire professionals. They can become permanent staff members, and their job is to help take science and make it real, make it something we can distribute, make it robust, licensable, but more importantly, something that we can build on ourselves. And so I have a philosophy in my department, which is: build what you need and use what you build. This is in contrast to many academic projects where, you know, a student comes in, does this little thing, and then within six months nobody knows how to use it anymore.

It’s just a little one-off. And today, the problems we wanna solve in computer vision and AI are bigger than that. You can’t afford little one-off solutions. You have to have things that build on each other, and that you can then support and maintain, so that every next generation of students has a bigger foundation to play with.

I have four or five programmers in my department, and I’ll probably be hiring more. I think the mix of students to programmers is gonna change; I think there could be more programmers over time. I also have a data team that focuses on capture in our 4D scanner, our mocap lab, our foot scanner, our face scanner, hand scanner, you know, all kinds of scanners we’ve got. But they also work on capturing data from the internet and labeling it. To have a staff that can do that is really critical.

And I think that’s the reason we’re so productive. I have amazingly talented students and postdocs, but they’re also supported by the data team and the software team. And that lets them just do much, much more.

Perceiving Systems – similarities to a startup  [37:40]

Gil Elbaz: It seems, from my perspective as the CTO of Datagen, that you are running a type of startup, or something much more similar to a startup than a classic academic group. We have similar roles on our side.

We do infrastructure, creating different code bases that we can build off of, giving our algorithm engineers, who are focused on the deep learning aspects of the different capabilities that we’re creating, the environment that they need in order to work productively. And of course, in order to create the ability to generate data at scale, as you know, we also need people that don’t only come from an engineering background; we have amazing 3D artists as an integral part of our team.

And I think that this kind of interdisciplinary mix is really important.

Michael J. Black: Yeah. I think the difference between a company like Datagen and me is that the ratios are exactly flipped. You’re gonna have more software engineers, graphic artists and things like that. And a small number of people publishing.

I’ve got mostly people publishing and a small number of people supporting. It’s just kind of flipped because what’s our output. Our output is really papers. It’s new ideas that get written up as papers and published. Though, we also spin off companies. We’ve spun off two companies. we’ve licensed a bunch of technology and I think that’s really important, you know, as I said, I feel very fortunate to have the position I have and to have the research funding, I have the German taxpayer pays for me to do this crazy research I do, and they get out of my way and they trust me to do it. And I feel with that kind of freedom comes some responsibility to make sure that what I do has some impact and it comes back to society in some way.

And so I take that very seriously. One way we can have an impact, of course, is through collaborations with our medical colleagues and so on, but also through this commercialization. You know, I have another hat I wear, which is the speaker of something called Cyber Valley. And I’m really dedicated to helping Germany create a new industry around AI and AI startups that can be a real driver for the country going forward. There’s a tremendous amount of talent in Germany, particularly in the area of AI, and what’s been missing is a sort of venture-backed startup mentality for deep tech. And we’re trying to make that happen.

Gil Elbaz: I love this positive connection between government and innovation.

It’s super non-trivial in Israel. In comparison to Germany, we have a very strong venture capital industry helping the space a lot, so there’s a ton of funding for deep tech, and it’s really the specialty here. Next time you’re in Israel, I’d love to connect you with some of the relevant folks from the industry.

Founding Body Labs [40:38]

Gil Elbaz: This is actually a good opportunity to segue into Body Labs, understand the motivation, the journey. Body Labs was your startup that you spun off from the base technology of body models that your team developed, which was later successfully acquired by Amazon. 

Michael J. Black: Yeah. So during my PhD, I wrote some code for optical flow and it turned out to work.

And, as I said, at the time nothing really worked, but this actually worked well enough that people could use it. And so a bunch of people used it to do things like make movies, and I didn’t get anything out of it. People got Academy Awards and made money; I got nothing out of that. And I thought, hmm, the next time I do something that might be useful, I’m gonna try to protect it and commercialize it, because that’s what’s gonna have a big impact. And so when I started working with the 3D body models and we had our first success estimating a 3D body model from images, I thought, okay, this is the next thing. This could really be big.

And so together with my collaborators, we wrote a big patent on the whole suite of things we were developing at the time. It was like my second thesis in many ways; it was a huge monster document. And I started building a tech team at Brown. I was super fortunate that I had a bunch of discretionary funding there, unrestricted gifts I had been given.

So I was able to hire some programmers, smart kids who had just graduated from Brown, and I got a couple of business school students to develop a bunch of case studies. And then we started building a prototype back in 2009. We actually had a system at Brown where we had women shopping locally, taking pictures of themselves, and they had 3D body scans.

We had a recommendation system that would recommend clothing and sizes to people based on their body shape. And the whole thing worked, until we said, okay, now take this camera home, take pictures of yourself, and we’ll build your body model from that. Okay: disaster. So we had the ideas, and when we had an accurate 3D scan, everything worked. But the computer vision part didn’t work; people took terrible pictures.

You know, the exposures were bad, the lighting was terrible, the clothing was wrong. We couldn’t get an accurate body model in somebody’s home. And so then we realized it needed more time to incubate. When I moved to Max Planck, the team came with me, and they spent another few years developing the technology before we spun it out in 2013 as Body Labs. There was no Cyber Valley back then.

So the company spun out in New York, where we thought it would be closer to the fashion industry. We started out largely looking at problems in the fashion industry, but that’s a very old-fashioned industry and very slow to change. And to really change the fashion industry in the ways it needs to change, you have to attack many pieces of the problem simultaneously, all the way through from design to the point of sale. For a little startup, it’s like poking an elephant.

You can’t really effect much change in an industry like that. But fortunately, deep learning came along, and things like Snapchat filters came along, and everybody got very interested in how you could analyze 3D bodies from a cell phone and do fun things with it. And so we pivoted a bit in that direction and generated a lot of interest.

And then several companies were interested in us, and we had done a proof of concept with Amazon, and it was really great that it worked out. It was my dream, actually. Long before I’d incorporated the company, which was originally called Anybody, Inc., I went to visit Amazon and I pitched them this idea, and, you know, they were very patient.

Amazon was actually a much smaller company when I pitched this idea. They were like, okay, if you could make this work, it would be good, but you’re just an academic. We don’t believe you can do this. Go find yourself some business people, build an actual team, build this thing up, and then come back and talk to us.

And so I was delighted that it actually all worked out in the end. I went and found a CEO, we built a team, we got some funding, we built the company, and then they did actually buy us.

Body Labs’ Acquisition by Amazon [45:30]

Gil Elbaz: And so why Amazon, from your perspective, out of all of the tech giants, why was it a great fit for them? 

Michael J. Black: So for me, I’m very driven by real problems in the world.

And, to me, the internet is this thing where, you know, Amazon figured out how you could buy books online. Okay, it’s not the same experience that you have in a bookstore, but it’s better in some ways: you get all these reviews. It’s really a good experience. Many things work this well on the internet; clothing shopping doesn’t. And I’m tall and skinny.

Literally, one reason I had to move to Germany is that I cannot buy clothes in the United States. Really, nobody makes anything for my body, or I can’t find it. So I wanna discover the clothes that would fit me and be able to buy them online, ’cause I don’t actually like going into stores. So this is still a huge customer problem.

It’s a major pain point. And I saw Amazon as a tech company that has this problem and is ideally positioned to solve it. So that’s why I thought they were the best choice. When I got there, of course, Amazon is an enormous company with all kinds of things going on, and it turned out there were many, many other uses of 3D body models, and we contributed to all kinds of different things within the company. So it was a great experience to go from the idea, through starting a company with my co-founders, to becoming part of a big company, and then seeing how the technology gets pulled in and incorporated, how the people get incorporated into the company, and then how the products eventually get out into the marketplace.

It’s been a wonderful journey, really. 

Distinguished Amazon Scholar role [47:24]

Gil Elbaz: Wow. And you were a Distinguished Amazon Scholar. Can you tell us a little bit about that unique role? I don’t think that there are too many of those.

Michael J. Black: When I left, there were only two of us. So yeah, it was a new thing that they came up with, this idea of a Distinguished Amazon Scholar.

Amazon had an idea to engage more with the academic community, and they were looking for a model of how to do that. And this was just the beginning of a program that grew really significantly into something called the Scholars program. Now Amazon has many academic scholars. These are academics who might take a sabbatical and do some time at Amazon, or they might work part-time on a project.

It’s a way for the academics to get exposed to, you know, what it is like to actually work on real world problems in a big company. And it’s an opportunity for Amazon to get exposed to these people and how they think, and their students and so on. So it’s a much bigger program now, but I was very early in it.

Yeah, it was a nice role. I came in at the vice president level, so that gave me a lot of access to senior people in the company. And there were a lot of fascinating conversations about computer vision and machine learning. Many times, I thought I was like a preacher, proselytizing about how the world was changing, and, you know, just educating: here’s what’s happening out there.

Here’s how that’s gonna change Amazon’s business. Here’s how we can respond to it. It was fun. 

About Meshcapade [49:03]

Gil Elbaz: So Body Labs was acquired in 2017. Jumping forward, I know that there is a company called Meshcapade, another MPI spinoff, which is currently licensing out the SMPL model and doing a lot of amazing work productizing various aspects of body models.

I’ve also talked with them, and they’re super nice people as well. Specifically, Naureen, the CEO, she’s amazing, and Talha, the CTO, is also great. And I’d love to understand how you see Meshcapade going forward. I do understand that it is connected to the metaverse in some way; I’d love to get a better grasp of what your vision is here.

Michael J. Black: So Naureen Mahmood was actually one of these software engineers in my department. She was an author on the original paper and many other papers that followed. So that’s kind of the nature of these software engineering positions: people get directly involved in research, they contribute to it, they publish papers. And then she spun out to lead this company.

What is the metaverse? [50:05]

Michael J. Black: So there is a major change coming. There’s all this hype about the metaverse, but there’s one fundamental thing that is going to define it, I think, and that is a blurring of the boundary between the physical and the virtual. This is why companies are investing in AR glasses and so on. People are, I think, misusing the term metaverse to describe some game with a social component where you can buy some clothing or something like that.

And in fact, they’re talking about multiple metaverses. They’re missing the point: fundamentally, it’s this blurring of the physical and the virtual. And the most important boundary to break down between the physical and the virtual has to do with the human: our bodies, our movements, our expressions. What we are coming towards is what I call the age of avatars.

The age of the avatar [50:56]

Michael J. Black: Everybody is going to have an avatar, and in fact many avatars that you use for different things. You’re gonna have an avatar for shopping for clothing, an avatar for being in virtual meetings, an avatar for going to a concert, an avatar for playing a game. Now, are these all going to be separate avatars, or are you gonna have control of them?

Are they going to be like you in some way? That is, will they have your facial expressions? You know, if I see you from a distance, I recognize you because I know you. It should be the same with your avatar: whether it’s a Lego character, a Roblox character, or something physically like you, it should embody you.

And so your emotions, your expressions, coming out through all of your avatars, I think, is the future we want. And what this will allow is, I think, the vision of meta-commerce, or eCommerce in the metaverse. Right now, it all sounds like it’s NFTs and digital clothing and digital sneakers or something, but more people buy real clothing in the real world, and that’s a real need, and we haven’t yet blended these things.

So what I see as a major shift coming is a seamless technology that allows me to go shop for Nike clothing, and then have that Nike clothing on my avatar in my fitness app, but also have it physically fitting me. So how do we enable eCommerce in the traditional sense, as well as this new meta-commerce? And there, I think, we need a unifying avatar technology that can support your animated avatar, can support you in your video game, whatever it happens to be, and can support you doing e-commerce when you’re shopping for clothing or bicycles or whatever else it is.

And we think that the SMPL model can underlie all of those things. It allows us to get, out of video and images, information about people’s 3D body shape, their facial expressions, their motions, and to use that to animate characters of any type.

And so I think the vision of Meshcapade is really to provide a platform that enables the age of avatars, that enables easy creation. Like I say, there are many friction points today in why we don’t have this age of avatars, and creation is a big one. Most people don’t have 3D and 4D scanners. Most people are not digital graphics artists.

You know, MetaHumans are super cool avatars, but you’ve gotta really know what you’re doing to create one. So most avatar creation methods are goofy things where you, you know, pick your hair color and you pick some makeup and things like that. And it makes an avatar, but it’s not really embodying you.

So we need to break down that barrier and make avatar creation absolutely dead trivial, accurate, low friction. Then we need to be able to animate those avatars, and we need to make it so that they plug and play with just about anything in the metaverses. And then there will be real power to this idea of a metaverse built around avatars.

Gil Elbaz: That’s an incredible vision, and I definitely connect with it on many different levels. I get asked a lot of times, you know, when is this metaverse happening? What is the main killer app in the metaverse, and why will people be interested in the metaverse? And from my perspective, it’s the day that we choose not to talk through Zoom, but prefer to talk through a VR device, between our embodied selves, our avatars.

That’s the sign that the metaverse is really ready for mass adoption. And the killer app is talking with each other, communicating with each other in a fully embodied, fully present way through some digital medium. There are definitely a few technical capabilities, leveraging synthetic data and body models, that need to be developed in order to get to that point.

But we’re on our way. And like you said, we know what we need to do; we just need to execute on it and move forward. And this is something that I think was highlighted during the time of Corona. We all learned the importance of having face to face communication once we couldn’t anymore.

Having a real embodied experience through some type of digital medium is so important, and we understand the importance of it now, not only the technical folks, but everyone that used Zoom. And meta-commerce, as you framed it, I think is another amazing addition on top of this embodiment itself. I definitely see this not only as a nice-to-have; it’s gonna happen.

And I think that having the right technologies and the right connections between the different layers is definitely gonna be a critical part of it. Yeah, I agree. Well, the last question that we ask our guests each podcast is for the newer folks coming into computer vision, starting their careers. Maybe they just finished their master’s degree, or they’re in their first job.

Career Tips for Computer Vision Engineers [56:32]

Gil Elbaz: What do you recommend for them, on their career path, in their own journey? 

Michael J. Black: It’s a super question. You know, right now everybody is super focused on deep learning, and it’s great, it works, and it’s an effort just to keep up with that. But I would really encourage people beginning in this field to not just do that.

Learn linear algebra, probability, physics, whatever it happens to be. The skills that you get in these other disciplines really ground and inform your ideas and can be incorporated. There are many properties of the world, for example, that machines don’t have to learn; they just are there.

And if you understand mathematics or physics, you can exploit them when training a machine. So one of the great things about Datagen is, you know, providing synthetic data to people to train machines. But in addition to that, there are generic priors: we know things about the world.

We know about the physics of the world, and it doesn’t necessarily have to be learned. So I would not be just very narrowly focused on deep learning. I would look a little bit more broadly, learn some geometry. The 3D world is important. And even if your networks are going to learn the geometry for you, if you don’t have intuitions about perspective cameras and occlusion, or properties of light, you know, if you don’t know about illumination and how it interacts with surfaces, then you may not really understand what your method is doing and why it’s not doing the right things.

And you won’t know how to choose the right data to actually solve that problem. So having physical intuitions about the world I think is really critical. 

Gil Elbaz: Thank you. And I also connect to that on a few levels myself. Starting out in mechanical engineering gave me a physics-heavy background, and then later, dealing with 3D graphics and deep learning, I always worked in the context of this physics background and intuition. So I can definitely relate to that. Thank you, Michael Black. It was a pleasure having you on this interview. It was extremely interesting for me.

Michael J. Black: Thanks for having me. It was my pleasure. Great talking with you.