Sorted and Sifted Machine Learning – with Anthony Goldbloom

About this Episode:

Are you up to solving a machine learning problem? If so, start on Kaggle. Anthony Goldbloom, co-founder and CEO of Kaggle, joins us to talk about what it takes to found a machine learning community.  We answer the question of what happens when you cross a domain expert with a pragmatic problem solver. Anthony talks about the future of AI and computer vision, how important it is to learn through doing and what he is looking forward to in the next year.

Guest Speaker:

Anthony Goldbloom

Co-founder and CEO, Kaggle

Episode Transcript:

Gil Elbaz: Thank you very much for being with us today. We have Anthony Goldbloom, the co-founder and CEO of Kaggle, a large machine learning community with more than seven and a half million members.
Kaggle was acquired by Google back in 2017. Before that, Anthony was a statistician at the Australian Treasury and the Reserve Bank of Australia. Anthony, it’s a pleasure to have you on.
Anthony Goldbloom: Thanks Gil, thanks for having me.
Gil Elbaz: So Kaggle is definitely one of the best and most accessible ways to gain real practical experience in solving challenging data science and computer vision problems.
Everything from supermassive black holes to diagnosing COVID from x-rays to predicting the Olympics. There are so many amazing talents on this platform. Was this the original vision for Kaggle, or did you have something else in mind? It’s incredible, and I would say unexpected, that it grew to this kind of level.
Anthony Goldbloom: Yeah. One axiom that I have heard said about startups is to solve a problem that you want solved, because then you at least have one user. Hopefully you represent a larger user base than just yourself. That’s definitely true in the original Kaggle case. My background was as a statistician, and the idea for Kaggle came from when I was a journalism intern at the Economist magazine. I wrote an article about predictive analytics, which is basically statistics, or machine learning, applied to business problems. Predictive analytics was the term used at the time. I was interviewing all these companies and thinking that I would love to work on some of the problems I was interviewing people about. But I knew that if I was calling them as Anthony Goldbloom the statistician, rather than from the Economist, I wouldn’t necessarily have been given the opportunity to speak with them. The idea behind Kaggle was to give companies access to people like me, and people like me access to the interesting problems that companies post.
So that was where it came from. And I think it has largely fulfilled that goal. Now when you go onto the Kaggle site at any point in time, there’s a whole bunch of challenges and datasets from many different companies, and they’re often very interesting. When you work on a competition, you see how you’re performing relative to others.
And at the end of the competition, very often, the winners will share their methods. So how interesting and exciting, and what a learning opportunity, to try a problem, get to a certain level of performance, and then at the end see what the winners did that you didn’t do. The next time you face a problem like it, you can try some of the methods the winners tried.
And with each challenge that you participate in, your performance gets better and better. So yeah, I think Kaggle has largely fulfilled what I had originally hoped it would do.
Gil Elbaz: Amazing. And what I see is that it really is one of the best ways. I have a lot of friends who are engineers in a bunch of different areas, and everyone’s interested in data science today because it seems like a practical way to move the needle in many different companies. Even companies or teams that aren’t focused on data science are now learning and trying to hone these capabilities, bring them in-house and create value through them. I’ve heard of a ton of mechanical engineers, chemical engineers, people from very diverse backgrounds, all coming together and working in this data science space. Is this a trend that you feel is going to continue? Or are we going to see more specialized people with a very strong computer science background taking over these more advanced machine learning methods as we go?
Anthony Goldbloom: Yeah, it’s interesting. I would say that the way I like to define Kaggle is we are behind where academia is, but ahead of where most of industry is.
Papers come up with new methods. Those new methods get tried out on Kaggle competitions. And then what we often find is the techniques that do well in Kaggle competitions have been demonstrated as pragmatically useful and then start to spread out in industry.
And we’ve seen many examples of that, starting with XGBoost and then Keras. Both of those spread through the Kaggle community, as did methods like U-Net for segmentation, and so on and so forth. Now, what we find is that very often the very strong performers on Kaggle do come from some of the backgrounds that you mentioned: mechanical engineering, electrical engineering, physics. In those fields you have to know a decent amount of statistics and you have to be able to code, but you also tend to be a pragmatic problem solver, right? You don’t fall in love with the elegance of a method; you are looking for something that will solve your problem.
There’s some base level of statistical understanding and some base level of coding that you have to have to be a good data scientist. Once you’re at that base level, you have what you need to learn. And those backgrounds really orient people towards being problem solvers, who in many cases end up doing exceptionally well on Kaggle, often with stronger performances than computer scientists and statisticians.
Gil Elbaz: Super interesting. Yeah. Because they’re very close to the problem at hand, and they have the relevant background, they’re able to home in on what is actually important to focus on within the data, or how to enrich the data in the best way.
And yeah, being close to the business understanding of what they’re actually trying to solve definitely gives an advantage, especially in medical cases or mechanical engineering. There are very low-level details that are valuable to understand, and you do need a significant background to bring those in.
Anthony Goldbloom: I actually would slightly contradict what you just said. I think we often find that those who win challenges don’t have a background in the domain that they win the challenge in. Now, there’s one reason for this, partly artificial and partly a feature of Kaggle, which is that by the time a challenge is on Kaggle, the interesting question, which is where the domain expert is strongest, has already been asked. You mentioned the COVID challenge in the introduction: taking a chest x-ray and trying to diagnose COVID. You have to be a domain expert to know that’s an interesting question to ask.
You have to be a domain expert to know what data you want to bring to bear on that problem. But once you have asked the question and structured the problem, what matters is being a good, pragmatic problem-solver. Whether or not you have that background in medicine or radiology is not so important.
So not falling in love with computer science, but being able to iterate quickly and try a lot of things, ends up being what’s successful in Kaggle competitions. And I would argue it’s also what works for prototyping on real world problems.
Gil Elbaz: Very interesting. Yeah, I do see how the business logic can actually be encapsulated in the question that’s being asked and in the data itself. And then I guess where Kaggle differs a little from the real world is that in the real world you’re constantly gathering more and more data, and you have the ability to define what data you need in order to solve a problem. Here, we want to fix the data and try to home in on the best method of extracting valuable information from it. And yeah, I can see how in the real world, outside the competition environment, there could be more emphasis on specialization, whereas on Kaggle the problem is purposely formulated in a way that can be solved by anyone. And then I guess, like you’re saying, the strong problem solvers, from your experience, actually have a good advantage.
Anthony Goldbloom: My suspicion is for real world problems, you really do want a pairing between a domain expert and then a very strong, pragmatic problem-solver.
So if you can take somebody who has done very well in Kaggle competitions and have them partner up with somebody who has good domain context, it often is very powerful. We have a wonderful story out of our community, where somebody who was a Kaggle Grandmaster, a very high performer on Kaggle, started out at an autonomous vehicle company.
It was week one, and he had a fairly junior role at this company. He took home a problem that a bunch of Stanford PhDs had been working on for about three months, and he made massive progress on it over the weekend. This is not an uncommon story, but the way this happened was that the Stanford PhDs did a lot of the context bits.
But researchers are typically very slow and deliberative about the changes they might make or how they might iterate on a problem. Whereas what Kaggle is really training somebody for is iterating very quickly and having a bit of a sense for which experiments are likely to work and which aren’t, because he has competed in so many competitions, always under time constraints.
The muscle that Kaggle trains is being able to iterate very quickly. And so him being paired with the domain experts was an extremely powerful combination. I think that’s the combination that companies solving machine learning problems should really try to emulate.
Gil Elbaz: Yeah. I think that story illustrates the previous point really well. And it’s amazing to see these people who have learned so much on Kaggle bring what they’ve learned to the real world and move the needle within these companies, especially in an environment like that, full of Stanford PhDs.
And so maybe it would be interesting to understand how Kaggle has evolved over the years, what it started from and really what is the future of Kaggle.
Anthony Goldbloom: Where Kaggle started is running machine learning competitions. As mentioned, companies post problems and people compete to build the best algorithms.
Some of the biggest changes we’ve made to Kaggle over the years have been the introduction of a hosted notebook product. We introduced that because we noticed that when people were competing in competitions, they were sharing code a lot in our forums, but it was a crappy experience. Somebody shares, say, some Python code, and in order to get it running you need the same versions of everything. At the time, it was frameworks like Lasagne and Theano and whatever else. Lasagne had a breaking API change, so you had to make sure you were on the right side of that API change. For people coming to learn, code should be the most interesting thing that anybody could share.
But sharing code was a very crappy experience. So a really large change was when we introduced these hosted notebooks, so that people in our community can share code in our hosted notebook. We had a canonical Docker container with most of the major libraries. You could still install additional libraries, and then you could basically take somebody else’s notebook, fork it and run it immediately without having to replicate their environment. So that was a huge change. It also allowed us to find more.
Gil Elbaz: It seems like a big leap. It seems like something that would both level the playing field and accelerate the learning experience, because you’re able to really interact with existing code that works, improve it, and jump up from there.
Anthony Goldbloom: Totally. And then the other thing it enabled was running more interesting competitions. In the past, we had been limited to static train and test datasets. Now we can more easily do things like run time series forecasting challenges.
We can run computer vision challenges where we have an unseen test set, so you don’t have to worry about people hand labeling the test set. And machine learning, or AI, is not only supervised machine learning, so we can now run reinforcement-learning-based simulation challenges.
So that was a really big change. We launched it in 2015, but we launched it basically as a text box with a run button. Over time it has evolved into a really beautiful hosted notebook environment: really stable, very powerful, a very nice environment.
The second big part of Kaggle that we launched is what we call our public data platform. What we noticed when we launched the hosted notebook is that people weren’t only using it for competitions; they were using it on other datasets as well.
Or to do a kind of free-form analysis on competition datasets. So we thought: let’s say we launch five new challenges every month. It seems ridiculous that we have this powerful notebook that you can only use on these five challenges. And so we allowed anybody in our community to share any dataset with each other, without needing a challenge associated with it. Interestingly, the public data platform on Kaggle is now about 10x larger and growing much faster than the competition platform. I’m really proud of this: we built the muscle where we’re not just riding on the coattails of the thing we invented in 2010, but have built things that have gone on to outgrow the original. And just quickly, on some of the big things that we’re focused on over the next 12 months: one is that, just as a public data platform where anyone can post a dataset has been really powerful, we’re exploring this idea of allowing our users to create their own competitions.
So it’s not just the competitions that companies bring us that we put up on the site. To increase the volume of competitions, we can have community members start to post competitions as well. One of the reasons for this is that, ultimately, Kaggle’s mission statement is to help the world learn from data.
We look at ourselves as really trying to provide learning by doing. And at the moment we’re constrained in the challenges we can offer. People want more tabular data competitions, for instance, and we can’t launch those unless a company brings those challenges to us.
This way, we’re disconnecting the challenges available on Kaggle from what companies bring us and allowing more opportunities for our community. The second big thing we’re exploring is this: at the moment, all the datasets in the public data platform are static. We want to have good infrastructure to allow the community to manage updating datasets as well, because there are a lot more interesting things you can do with live, updating datasets than with stale, static ones. So those are the two big things that we’re rolling out over the next 12 months.
Gil Elbaz: Very interesting. Yeah, it sounds amazing. And I understand now that the core is allowing data scientists to learn through doing. This is something that I connect with on many levels. I really think that the best way to dive into what works, how to do it, and how to get things across the finish line is to create it yourself: write the code, dive into the low-level bugs and understand what’s going on.
Anthony Goldbloom: I think some people learn really well by reading textbooks. Others learn well by watching Coursera courses. We really are catering to those who learn by doing, and that is our core target audience. That’s how we look at ourselves as different from other ways to learn and improve at machine learning.
You can read academic papers, and every paper says, my method is best and I’ve benchmarked it against these other six methods. Every single paper will say that. But if you want to stay up with what actually works pragmatically, it’s as I said: continually compete in competitions, see what others did that you didn’t do, and try those methods on the next one. That ends up being a much more objective way to keep learning what really works than trying to keep up with the flood of papers, all of which advertise their method as best.
Gil Elbaz: Yeah, I completely agree. I think that there are amazing papers coming out now. Many of them are very hard to either implement or iterate on. The first reason is the compute resources required to actually retrain, and the second is that many of the datasets aren’t available. You see an amazing paper by one of the big companies that worked on, let’s say, a billion images, and you don’t have access to those, of course. So I’d love to get your take on how you see this evolving going forward, given the need for very large compute in order to train, let’s say, generative methods or various reinforcement learning methods. And then there’s the need for access to data: we’re only scratching the surface of what’s possible with video analysis, but that would be a very big challenge with regards to the sheer amount of data.
Anthony Goldbloom: Yeah. Certainly if you’re trying to do novel fundamental research, very often breakthroughs are happening on massive datasets. But when it comes to solving pragmatic, real world problems, transfer learning is showing itself again and again to be a dominant strategy.
The Googles and the Facebooks of this world spend a lot of money on training. You’ve got Quoc Le at Google, who comes up with many computer vision architectures, for instance, but then they’re available to all of us and we can all fine-tune them.
And so it depends on the goal. As I said, at Kaggle we consider ourselves not the place where new machine learning gets invented, but where it gets sifted and sorted, and where we figure out what’s real and what isn’t on real pragmatic problems. And so I think, as I said, being able to fine-tune an existing model, and not having to spend all the money yourself on training from scratch, ends up being a dominant strategy in very many, if not most, use cases.
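Editor's sketch: the simplest form of the transfer learning Anthony describes is to keep a pretrained model frozen and train only a small new head on your own data. The toy below illustrates that idea in plain Python and is not anything shown on the podcast. `frozen_backbone` is a hypothetical stand-in for a real pretrained network, and the six data points are made up for illustration.

```python
import math

def frozen_backbone(x):
    """Hypothetical stand-in for a pretrained network: fixed features, never updated."""
    return [x[0] + x[1], x[0] - x[1]]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(inputs, labels, epochs=200, lr=0.1):
    """Fit only a logistic-regression head on top of the frozen features."""
    feats = [frozen_backbone(x) for x in inputs]  # backbone is only evaluated, not trained
    w = [0.0] * len(feats[0])
    b = 0.0
    n = len(feats)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for f, y in zip(feats, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            grad_w = [g + err * fi for g, fi in zip(grad_w, f)]
            grad_b += err
        # Full-batch gradient descent step on the head parameters only.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b):
    f = frozen_backbone(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# Made-up toy data: label 1 when the two inputs sum to a positive number.
inputs = [(1, 2), (2, 0.5), (1.5, -0.5), (-1, -2), (-2, -0.5), (-1.5, 0.5)]
labels = [1, 1, 1, 0, 0, 0]
w, b = train_head(inputs, labels)
predictions = [predict(x, w, b) for x in inputs]
```

In practice the backbone would be a published model loaded from a library, and fine-tuning often also updates some of the backbone's own weights; this sketch only covers the frozen-backbone case.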
Gil Elbaz: Yeah. I totally get what you’re saying, and I agree. You don’t need to create the new version of CLIP or the new version of GPT-3 on Kaggle. Of course, those models are going to be very valuable for future competitions, though. Ideally, if you can use one, fine-tune it or do some prompt engineering so that it helps you address your problem, it’s definitely going to be a nice kind of marriage between the two. I love how you described Kaggle as somewhere in between industry and academia. I don’t see academia as really ahead of industry; I see it as different from industry, with a different focus and different optimization. And I see Kaggle as probably the best way to understand whether something’s working in the most unbiased way, because you have a distributed network of data scientists who are all focused on winning a specific competition. I’d love to go into the, I don’t know if you’d call it gamification or game theory or distributed collaboration or competitive collaboration, but how would you describe the community as a whole and its ability to solve problems together?
Anthony Goldbloom: Yeah. It’s interesting. When you put up a challenge, and we have nuances on this, particularly because we can now run competitions that rely on code in a notebook, the basic architecture of a typical competition is that you have a training set and you have a test set.
With the training set, we give you the features and the labels. Say it’s a computer vision problem: we’ll give you the images and the labels. Then on the test set, we give you the images, but we don’t tell you the labels.
And so what happens is that when people compete in a challenge, they can see how they’re performing on a live leaderboard, and they can make up to five submissions per day. But then at the end of the competition, we throw away that portion of the test dataset, and participants need to select two entries to get rescored on a second test dataset that they’ve never had any feedback on.
And so ultimately the challenge is decided based on performance on that second test dataset. What we typically find is that for the really good machine learners there’s really no discrepancy between their score on the public leaderboard and their score when we retest them, because they’re very good about preventing overfitting, doing cross validation, regularizing and all that sort of stuff.
It’s interesting. What we find is that those who are new to Kaggle will very often totally optimize based on the public leaderboard score. Then they get a really big shock when we switch over to retest them on the second test dataset and their performance plummets. It’s actually, inadvertently, a very powerful way to teach people the lesson of overfitting, because it’s a crushing experience. You think you’re in the top 10 and then you finish in the hundreds. And it’s interesting, just as an aside, how many people who compete on Kaggle for the first time, maybe with great academic credentials or great industry credentials, still overfit against that public leaderboard dataset. It makes me wonder how many of the world’s academic papers actually suffer from overfitting. So I think a very powerful thing that we do is teach this lesson of cross validation and not overfitting. And then the winner, or the top three, typically win prize money.
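Editor's sketch: the public and private scoring Anthony describes can be illustrated with a toy simulation. This is not Kaggle's scoring code. The function names, the 30% public fraction and the use of accuracy as the metric are all assumptions, and for simplicity it picks one best-on-public submission where a real competition lets you select two. Every submission here is a pure guess, so a high public score can only come from chance.

```python
import random

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def split_scores(preds, labels, public_idx):
    """Score one submission separately on the public and the private test rows."""
    public = set(public_idx)
    pub_rows = [i for i in range(len(labels)) if i in public]
    priv_rows = [i for i in range(len(labels)) if i not in public]
    pub = accuracy([preds[i] for i in pub_rows], [labels[i] for i in pub_rows])
    priv = accuracy([preds[i] for i in priv_rows], [labels[i] for i in priv_rows])
    return pub, priv

def simulate(n_rows=400, n_submissions=200, public_fraction=0.3, seed=0):
    """Pick the submission with the best public score and report its private score."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    public_idx = rng.sample(range(n_rows), int(n_rows * public_fraction))
    # Each submission is random guessing, so none is truly better than 50%.
    submissions = [[rng.randint(0, 1) for _ in range(n_rows)]
                   for _ in range(n_submissions)]
    scored = [split_scores(s, labels, public_idx) for s in submissions]
    return max(scored, key=lambda pair: pair[0])

best_public, its_private = simulate()
print(f"best public score: {best_public:.3f}, same entry on private: {its_private:.3f}")
```

Choosing among many entries by their public score inflates that score, while the private score of the chosen entry remains an honest estimate. That gap is the shock newcomers get when the private leaderboard is revealed.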
But I would say our users, or at least the most active Kagglers, probably care more about their Kaggle rankings and points than prize money, to be honest. There are different tiers on Kaggle, and the highest tier is Grandmaster. Out of about seven and a half million members, I think we have about 200 grandmasters in total. So it’s really a very rare kind of credential. The tier under that is Master, and I think, if I’m not mistaken, we have somewhere in the range of 1,500 to 2,000 masters. So those are very hard tiers to achieve. We also have rankings, so you can be number one on competitions, as an example. But if you are at the Grandmaster or the Master tier, this is a pretty well-recognized credential at this point, and a lot of employers like DeepMind and Nvidia hire a large bunch of top Kaggle performers.
One way to get a credential is to get a PhD from Stanford or MIT or the Technion. Another is to prove yourself through this Kaggle credential.
Gil Elbaz: Yeah, I definitely see the value in that. I think I know one Grandmaster, and he’s brilliant. He has something like 25 years of experience, and he’s still constantly on Kaggle doing competitions today, constantly trying to stay up to date. The people I know who succeed at Kaggle are very strong. It is a good indicator, at least from my experience as well.
Anthony Goldbloom: The other thing is it’s open to anybody, right? If you have an internet connection, Kaggle is accessible to you as a learning opportunity and as a way to get credentialed.
And I think the fact that in some of our challenges you can only use notebooks in order to compete, so everybody is on the same playing field, is another really nice feature of this community. I really think that if you’re willing to put in the work, anybody can get to Master status. It might take a lot of work, but it’s within reach.
Gil Elbaz: Interesting. And on competition-driven collaboration as a whole: as a startup founder myself, I wonder, do you think that we can use these kinds of competition-driven collaborations within companies as well, both small and large?
Anthony Goldbloom: I’m going to answer your question somewhat indirectly, and tell you where Kaggle really shines. Where we really shine is in a situation where you want to know what is possible on your own problem. Here’s a really good example of this.
We ran a challenge with Zillow. They are a real estate tech company in the United States, and they have an algorithm that values your house. You put in the address of your house, they pull the features from public data sources, and they tell you their estimate of the value of your house. And they ran a challenge. Over the years they had made massive improvements to this algorithm. It’s called the Zestimate algorithm. They had got maybe 12% improvements since the first launch. And then they were stuck at a certain level of error. The question they had was: am I stuck because I’m missing some suite of solutions, or am I stuck because there’s a limit to how much signal there is in this dataset? What Kaggle competitions are really amazing for is that at the end of the competition, you know what is possible given the data that you have. So you get a very clear answer. In the Zillow case, it’s possible that they could have been disappointed, that we gave them exactly what they already had, and all they got was assurance that they had tried everything there was to try. Actually, in the Zillow case, and I’d say in probably about two thirds of cases, there’s some improvement; sometimes it’s a large improvement, sometimes it’s small. I think it’s actually probably more than two thirds. I’d say in the majority of cases, maybe 90%, there’s some improvement. And in probably about a half to two thirds of cases, the improvement is meaningful enough that the challenge was worth running for the additional lift.
Gil Elbaz: Very interesting. And so these companies, when they come to you, obviously have proprietary data, and they don’t want to share all of it. It’s very valuable, of course. Does the signal that they get from the data they share actually encapsulate the potential, or the real signal, in their entire dataset? Is it a good enough sample? Or do you see it more as showing the potential of state-of-the-art methods in practice, in a distributed way, and giving them directions to improve their internal algorithms?
Anthony Goldbloom: It’s hard to give you a general answer to that question. There is absolutely a trade-off between the degree to which you disguise the data and how interesting the results you get back are. If the data is very heavily anonymized, a lot of its interesting characteristics will be stripped out, and the signal that you get out of a challenge is less than in a case where you can release the dataset that is used internally. I don’t have some rule of thumb or guidance on what the trade-off is between the degree of anonymization and how applicable the methods are.
But you framed it quite nicely, in that even if the dataset is heavily anonymized, the thing that you do get out of it is some indication of methods you might want to try. The degree to which you can lift and shift the winning algorithm is a lot less in a case where the data is heavily anonymized, particularly because we find that in tabular data competitions, feature engineering still rules. So if there’s a limit to the features you can generate, because the data is disguised in such a way that it’s hard to create the right features, you’re going to strip out a lot of the interesting things.
And then also, perhaps surprisingly, since the great hope of neural networks was no feature engineering, you do get people doing really clever things even in the computer vision and natural language processing challenges. One example I like, a method that I think has since made it out of the Kaggle community but I believe started there, is for when you have a natural language processing dataset, a text dataset, that’s not as large as you want it to be. What people were doing on Kaggle is using Google Translate to translate from one language to another, say from English to Russian, and then translate back to English again. Because Google Translate is not symmetrical, you get back a slightly different version of the text, which you can add to your training data. So you get some clever tricks like that.
They’re the kinds of things you don’t necessarily find in an academic paper, but they’re the kinds of things that help when you’re in a Kaggle challenge.
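Editor's sketch: the round-trip translation trick described above is commonly called back-translation. The code below shows only the shape of the idea. It does not call Google Translate or any real translation API. `toy_translate` and its two word tables are hypothetical stand-ins, deliberately built so that English to Russian and back does not return the original wording.

```python
from typing import Callable, List

# Assumed interface: (text, source_language, target_language) -> translated text
Translator = Callable[[str, str, str], str]

def back_translate(text: str, translate: Translator,
                   source: str = "en", pivot: str = "ru") -> str:
    """Translate into a pivot language and back to get a paraphrase."""
    pivoted = translate(text, source, pivot)
    return translate(pivoted, pivot, source)

def augment(texts: List[str], translate: Translator) -> List[str]:
    """Return the original texts plus any new paraphrases from back-translation."""
    augmented = list(texts)
    seen = set(texts)
    for text in texts:
        paraphrase = back_translate(text, translate)
        if paraphrase not in seen:
            augmented.append(paraphrase)
            seen.add(paraphrase)
    return augmented

# Hypothetical word-level tables that are intentionally not inverses of each other.
EN_TO_RU = {"the": "этот", "movie": "фильм", "was": "был", "great": "отличный"}
RU_TO_EN = {"этот": "this", "фильм": "film", "был": "was", "отличный": "excellent"}

def toy_translate(text: str, source: str, target: str) -> str:
    table = EN_TO_RU if (source, target) == ("en", "ru") else RU_TO_EN
    return " ".join(table.get(word, word) for word in text.split())

reviews = ["the movie was great"]
augmented_reviews = augment(reviews, toy_translate)
```

With a real translation service in place of `toy_translate`, each paraphrase would keep the label of the sentence it came from, which is what makes it usable as extra training data.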
Gil Elbaz: Actually, diving into that for a sec, is there a way to take these kinds of Kaggle insights and turn them into papers? Like Kaggle papers, for instance. They don’t need an academic conference; maybe a Kaggle conference. I feel like there are a lot of really valuable insights from competitions that I might not have been in, but that I definitely want to hear about.
Anthony Goldbloom: Yeah, totally. We ran some conferences pre-COVID called Kaggle Days, and they were wonderful. I learned about that translation trick at a Kaggle conference, for instance. So yeah, we do run conferences.
We have toyed with this idea. We experimented with this idea of a Wiki at one point where distilled knowledge could be kept up to date by the community on an evolving kind of Wiki based knowledge base. It didn’t really work.
We haven’t really figured out a way to nail this. I will say that one thing we are seeing is that companies like DataRobot and H2O hire a huge number of elite Kagglers. And I think part of the reason they do it is they’re trying to capture some of these recipes in their software and automate them.
People are trying to do different things to capture some of the learnings that happen on Kaggle. Kaggle is extremely useful to the 10-to-20-hour-a-week data scientist, the person who spends 10 to 20 hours a week on Kaggle. We’re not very useful to the two-hour-a-week data scientist who doesn’t have the 10 to 20 hours a week to really put in. On the flip side, a site like Stack Overflow is amazing for a two-hour-a-week software engineer, right? You get a quick answer to the question you’re looking for. We have spoken internally about what we can do to be more useful to the two-hour-a-week data scientist and machine learner, and we have not, to be honest, cracked the code on that.
Gil Elbaz: I think that there’s a huge opportunity. In more and more conversations that I have with industry leaders in the field of computer vision, and even with people from academia, everyone understands that there’s a problem with the academic model as it is today. It really doesn’t seem to have been built for this kind of explosion in the number of people coming in and the number of papers trying to get accepted into these conferences.
And then at the same time, you have these huge companies that are also investing a ton of people, money and resources into these same kinds of competitions. There are many different kinds of forces at work here, and many different things that are going on, but it seems like you get on one hand, a lot of iterative papers and everyone’s cherry picking showing that they’re better than the other one.
And so you see these kinds of trends in computer vision going up and down and up and down, and really, they’re all very nice pieces of progress in the story of machine learning. But just from this conversation, I see an enormous opportunity here: a paper based on actual results from an unbiased competition, results that are reproducible within one of your notebooks.
I’m not exactly sure what the constellation is, but these insights could be very useful for the industry, for academia as well. And for everyone.
Anthony Goldbloom: Yeah. There was this website a while back called MLcomp, which basically had a whole lot of datasets and allowed people inventing new approaches to run their approach across a whole lot of datasets and produce benchmark results. And I think, if I’m not mistaken, that has evolved into CodaLab, which has got a connection with Microsoft. It was a nice goal, but I’m not sure it has actually realized that particular goal. I think maybe one of the challenges, whether it be Kaggle or the MLcomp idea, is that researchers want and need the opportunity to be creative, not just work on benchmarks. Benchmarks like ImageNet have their time.
But then the next paradigm might not fit into an ImageNet-type setup or a Kaggle-type setup. I don’t know. I think progress is messy. And maybe the way that the state of knowledge in machine learning progresses, yes, it feels messy and inefficient, but there might not be a better way. You have researchers coming up with lots of different approaches, and then Kaggle acts as a funnel that, for some subset of those ideas, helps filter out which ones are real and which ones aren’t. And then that helps them spread into the industry.
Maybe that’s not bad.
Gil Elbaz: If we dive into a little bit of future things, do you see Kaggle as something that can be used or as this community that can be put together for some positive impacts?
It could be for medical solutions, but it could also be for finding biases in datasets, finding adversarial tasks, finding different problems or bottlenecks in existing methods that are being used.
Anthony Goldbloom: Yeah. I’m proud. I think, to be honest, this is a place where Kaggle really lights up, where we do nice work. Just to pick some examples that map to the kinds of ideas you have in your question: we’ve worked closely with Facebook on a deepfake detection challenge, trying to detect deepfakes, as an example.
We have the field of medicine, and we work very closely with an organization called RSNA, which is the Radiological Society of North America, basically the radiology industry group in North America, on challenges ranging from taking chest X-rays to diagnose COVID, to CT scans to diagnose lung cancer, to a large range of medical challenges.
I’m really proud of a challenge we did with a well-known researcher at Stanford University to take mRNA designs and try to predict their stability, for mRNA vaccines for COVID. The initial motivation was COVID, but it’s applicable beyond that. The problem with mRNA vaccines, famously, is that they’re not very stable. They have to be stored at very low temperatures and so on and so forth. Being able to predict how stable a design will likely be is a very nice tool for somebody designing an mRNA vaccine to have. We’ve done a lot of diabetic retinopathy work with the California Healthcare Foundation, which was a really nice challenge, one of the very early use cases.
So I do think that raising awareness of what’s possible on public good type challenges, and putting solutions out there on public good type challenges, is definitely an area where Kaggle has done good work in the past. And I think this is an area where I expect us to continue.
And if anything, this self-service competition direction is a powerful one there, because hopefully, if it’s successful, it will disconnect the number of important, impactful challenges we can launch from the size of our team, who can process and launch those challenges.
Gil Elbaz: Honestly, if there was a way to sponsor someone else’s challenge, to solve some of these medical issues...
Anthony Goldbloom: Yeah, like a crowdfunding type thing. It’s something we’ve toyed with for sure. We’ll start off with self-service competitions, but it’s a nice idea to tack on to it.
Gil Elbaz: Very cool. So let’s close with a few visionary questions. One question is, looking a little bit longer term, what do you see with regards to machine learning and the potential for human level interactions with AI in our lifetimes?
Anthony Goldbloom: I gotta tell you, this is not an area of tremendous passion for me, and I think you’ll speak to much more thoughtful people than me on this particular question. Where I get excited is about the technology that we have today having a real impact.
I’m probably ultimately skeptical of AGI in our lifetimes, but this is not a skepticism that comes out of really deep thinking so much as just that what we have today seems a long way off. It seems we’re probably a few inflection points away from that.
They seem like very hard inflection points. My bias is to think that we probably won’t have success with these inflection points in my lifetime.
Gil Elbaz: Yeah. I’m completely with you on AGI. I think that in contrast to AGI, it’s at least in my opinion a lot easier to simulate a human level interaction with people and provide them with a digital person to talk to that will respond and be interesting and be engaging over time. At least from my perspective, I think that the technological building blocks to get there aren’t too far away. But again, it’s all very subjective.
So it’s very interesting to get your perspective on that as well. And maybe the last question that we ask all of our guests is what would you recommend to new people starting out in the machine learning space or the computer vision space?
Anthony Goldbloom: My answer is probably somewhat predictable, but get on Kaggle, in all seriousness. I’m very much a learning-by-doing type. I think it’s important to learn some basic Python as a starting point. Kaggle has some really nice courses where we try to teach you the basics of Python and the basics of supervised machine learning. Each of these is a four-hour, bite-sized course. Those courses are not supposed to be a really rigorous grounding in any of these topics, but they’re supposed to teach you just enough that you can start rolling up your sleeves and playing by yourself.
And then, challenges I think are a really good way to learn. We have an entry-level challenge to predict who survived the Titanic based on maybe 10 simple features. And you should start very simple. Maybe the first thing you submit is that you guess that everybody survived.
And then perhaps your second submission is that you predict: if female then survived, if male then didn’t survive. And have this goal of putting in a new submission every day, and have the one you submit tomorrow be better than today’s. Every day you want to make a little improvement, and you’ll be amazed.
Maybe after two days you’ve got an if-then-else statement, if female then survived, and then on day three maybe you try a simple random forest, just naively putting in the 10 features. And if you just try and do a little bit better each day, I think you’ll find that over a six month period you’ve learnt a huge amount, and it’s kind of cool.
Like you probably don’t want to spend more than half an hour or an hour a day on a challenge. You’ll find you’re in the shower or you’re taking a jog and you have an idea for what you want to try.
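[Editor's sketch] The first two submissions Anthony describes can be written in a few lines of Python. This is only an illustration under stated assumptions: the rows in TOY_PASSENGERS are invented for this example and are not the real Titanic data, and the function names are hypothetical. The third step he mentions, a random forest over the roughly 10 features, is left out because it needs the actual competition data and a machine learning library.

```python
# Toy rows invented for illustration only; this is NOT the real Titanic data.
# Each row is (sex, survived), where survived is 1 (yes) or 0 (no).
TOY_PASSENGERS = [
    ("female", 1),
    ("female", 1),
    ("female", 0),
    ("male", 0),
    ("male", 0),
    ("male", 1),
    ("male", 0),
    ("female", 1),
]


def predict_all_survived(sex):
    """Day one: guess that everybody survived, ignoring the input."""
    return 1


def predict_by_sex(sex):
    """Day two: if female then survived, otherwise did not survive."""
    return 1 if sex == "female" else 0


def accuracy(predict, rows):
    """Fraction of rows where the rule's guess matches the known outcome."""
    correct = sum(1 for sex, survived in rows if predict(sex) == survived)
    return correct / len(rows)


if __name__ == "__main__":
    for name, rule in [
        ("everybody survived", predict_all_survived),
        ("female survived, male did not", predict_by_sex),
    ]:
        print(f"{name}: accuracy {accuracy(rule, TOY_PASSENGERS):.2f}")
```

The point of the sketch is the habit rather than the numbers: keep one scoring function fixed, and make each day's rule a small change that you can compare against the previous day's score.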
Gil Elbaz: The best ideas, definitely.
Anthony Goldbloom: And maybe that one didn’t work, but then the one after that makes an improvement. It’s a very nice way, in my view, to learn.
Gil Elbaz: Thank you very much, Anthony. It was a pleasure to have you on the podcast.
Anthony Goldbloom: Thanks for having me.