The Future of AI Artistry with Suhail Doshi from Playground AI

Primary Topic

This episode explores the development and future of AI in creating and editing digital artwork, featuring insights from Suhail Doshi, founder of Playground AI.

Episode Summary

In this engaging episode of "No Priors," host Sarah Guo converses with Suhail Doshi, founder of Playground AI, about the evolution of AI tools in artistic creation. Doshi shares his journey from his early tech ventures to focusing on AI-driven image generation with Playground AI. The discussion delves into the technical aspects of AI development, including the challenges of training AI models and the future directions of AI in visual arts. Doshi emphasizes the shift from basic text-to-image models to more sophisticated applications that blend creativity with utility, such as editing and enhancing existing images.

Main Takeaways

  1. Evolution of AI Art Tools: Doshi recounts the rapid advancements in AI art tools, from DALL-E to more sophisticated models like Playground 2.5.
  2. Challenges in AI Development: The complexity of training AI models and the necessity of a detailed understanding of aesthetics and user needs.
  3. Future Directions: Emphasis on editing capabilities and enhancing utility in AI-generated images.
  4. Community and Open Source: The role of the community in developing and refining AI tools.
  5. Impact on Other Creative Fields: Insights into how AI technology is influencing other areas like music and video.

Episode Chapters

1. Introduction

Sarah Guo introduces Suhail Doshi and discusses the inception of Playground AI. Doshi reflects on his motivations for starting the company and his vision for the future of AI artistry. Suhail Doshi: "When I saw DALL-E 2, it was a big, eye-opening moment."

2. Technical Deep Dive

The conversation shifts to the technical challenges and achievements in developing Playground AI's latest models. Suhail Doshi: "It's more complicated than just mixing architecture and data; there's an art to it."

3. Industry Impact and Future

Doshi discusses the broader impact of AI on the creative industry and outlines potential future developments in AI technology. Suhail Doshi: "We're focusing on scaling pixels, not just text."

Actionable Advice

  1. Explore AI tools for creative projects to understand their capabilities.
  2. Stay informed about the latest advancements in AI to leverage new technologies.
  3. Participate in communities focused on AI development to contribute and learn.
  4. Consider the ethical implications of AI in creative work.
  5. Experiment with different AI models to find the best fit for your creative needs.

About This Episode

"Multimodal models are making it possible to create AI art and augment creativity across artistic mediums. This week on No Priors, Sarah and Elad talk with Suhail Doshi, the founder of Playground AI, an image generator and editor. Playground AI has been open-sourcing foundation diffusion models, most recently releasing Playground V2.5.

In this episode, Suhail talks with Sarah and Elad about how the integration of language and vision models enhances the multimodal capabilities, how the Playground team thought about creating a user-friendly interface to make AI-generated content more accessible, and the future of AI-powered image generation and editing."

People

Suhail Doshi

Companies

Playground AI

Books

None

Guest Name(s):

Suhail Doshi

Content Warnings:

None

Transcript

Sarah Guo
Hi, listeners, and welcome to another episode of No Priors. Today we're talking to Suhail Doshi, the founder of Playground AI, an image generator and editor. They've been open-sourcing foundation diffusion models, most recently Playground v2.5. We're so excited to have Suhail on to talk about building this model in conjunction with the Playground community and the future of AI pixel generation. Welcome, Suhail.

Suhail Doshi
Thanks for having me. So this is your third company. You started Mixpanel and Mighty, and now you're working on Playground. How did you decide this was the next thing?

I think, like, back in April of 2022, that was around the time GPT-3.5 kind of came out, and then DALL-E 2 came out, and I was actually working on the second company, Mighty. And at that time, I was trying to figure out how to, like, do something with AI inside of a browser address bar. But when I saw DALL-E 2 come out, it was just this very big, strange, eye-opening moment where, I think, a lot of people didn't think that we'd be able to do weird, interesting art things so soon.

Then soon after that, I think Stable Diffusion came out around June or July of that same year, and I got early access, maybe a couple weeks of early access to SD 1.4. And it just kind of blew my mind what people could do with that. And I just thought it seemed odd that all of this was being done in a Google Colab notebook. Shouldn't there be, like, a UI that makes it really easy, that sort of thing? From the start, were you just thinking, we will open source, we will train our own models from scratch?

Sarah Guo
Did you think about other modalities? Yeah. I mean, there have been a lot of people that thought I should do something in music.

Suhail Doshi
Because music has been, like, a huge hobby of mine for, like, six years or so. I, like, produce music, but I just couldn't wrap my brain around, like, what useful thing I would end up making for people. Although now there's, like, a lot of very interesting, cool, useful things for music. And then it seemed like a lot of people were very focused on language.

And I had really enjoyed working with lots of creative tools. Like, when I was in high school, I used to make logos, or I would make music or whatever. So I was excited that finally I could find something where it was a combination of creativity and tooling. Images have really amazing built-in distribution. People want to share those kinds of things.

So it ended up just being this perfect thing that I was excited to work on. How do you think the overall landscape for competition is different in language versus images versus music? How do you think about the ways you guys would want to build advantage and stand out? I think with language, I don't know how many language companies there are.

You guys would probably know better than me, but it seems like there's over 20, and then maybe five or eight of them have a billion dollars' worth of funding. I also didn't want to work on something if there were already extremely passionate people really working hard at that thing, people that I really respected. At the time, with images, I think there was Midjourney, there was OpenAI doing some DALL-E stuff, and then you saw Stable Diffusion. But for some of these companies, it didn't seem like there was going to be a longstanding, concerted effort to keep making them better. It was unclear who was doing this as a fun demo versus who was doing this as something they would spend and invest tons of their time in. Once I had figured out to what extent OpenAI was going to invest in it, it seemed like the folks at Stability AI were sort of focused on seven different kinds of things.

And I just thought, like, hmm, there are just not enough people that want to do this one thing and do it really, really great. So I think for me, it was just about, were there enough capable people that wanted to do this? Can you talk a little bit about the specific direction you decided to take with Playground as well? I know you thought really deeply about some of the applications or use cases for it, so I was just curious if you could share a bit more about that. Yeah, I think one thing that has sort of been surprising, and it hasn't changed too much, actually, from maybe around June or July or August 2022, is that a lot of people think about it as text to image, and right now, it's kind of not.

It's not even text to image, it's more like text to art. Sorry, what's the difference? The difference is that these models haven't quite reached the potential of what their utility could be. Right now, for the most part, we formulate a prompt, which is really just a caption of what the image is, and then it diffuses into an image, a set of pixels.

And a lot of those pixels are primarily used for art. But what we haven't done is anything beyond that. We haven't really done something like editing, for example. Why can't we take an image that you already have, and why can't we insert something into that with the correct lighting and stuff? Why can't we stylize an existing thing?

Why is there not a blend of real and synthetic imagery into a single image that could then be used for a lot more things than just pure art? And so right now, it's a lot of just people making art, and sometimes that reduces its practicality or its utility. Yeah, that makes sense. I think one of the things that you folks did as well that I thought was really interesting is you built your own models, or you trained your own models, while a lot of people in this space just take Stable Diffusion and fine-tune it, or do other approaches like that.

Elad Gil
You just launched v2.5. The model is performing incredibly well, really beautiful imagery, and it's super high quality. And I'm a little bit curious if you could share a bit more about how you went about training your model and hiring a team specifically for that purpose, and how you thought about it and approached it. Yeah, it turns out that, I think, with a set of strong engineers, their first thought is that you just take a model architecture, you find a lot of data, you fund yourself with enough compute, and you just sort of throw these things into a mixture of sorts, and, like, out comes something like DALL-E 2 or DALL-E 3.

Suhail Doshi
It turns out that it's just way more complex than that, and way more complex than I even imagined. I had a sense that it was more complicated than that, but it goes still further; it's more complicated than even that. So I think there are a couple of things that we did. One of the things that we were really focused on with that model was that we wanted to see how far we could push the architecture of something that already existed.

This was mostly like a test. It was a test to see how far we could get as a research team before the next model change. We wanted to take something that we knew was a recipe that worked already, which was Stable Diffusion XL's architecture, which is a U-Net and CLIP, and the same VAE that Robin Rombach trained, all this stuff. And then we sort of said, OK, what if we try to get something that's at least better than SDXL, better than the open-source model? We weren't really sure by how much.
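
For context on what that recipe looks like in practice, the SDXL-style stack (U-Net, CLIP text encoders, VAE) is runnable today through the open-source diffusers library. Here's a minimal sketch using the public SDXL base checkpoint, not Playground's own weights:

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the public SDXL base checkpoint: a U-Net, two CLIP text encoders, and a VAE.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Generate a 1024x1024 image from a plain caption-style prompt.
image = pipe(
    "a lighthouse on a cliff at golden hour, dramatic lighting",
    num_inference_steps=30,
).images[0]
image.save("lighthouse.png")
```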

Our only goal was to just be better and try to deliver on the number one, state of the art, open source model that we could release. We learned two things. One is that when we looked at some of the images from something like SDXL, we noticed that there was this average brightness. It was really confusing. It didn't quite have the right kind of color and contrast.

And, in fact, I had become so used to this that I was surprised about the average brightness when comparing it to the images of our model; I thought it was a bug. During evaluation, I literally was looking at the images, and I was like, these cannot be the right images. My team was sort of like, hey, I think you're actually just getting used to the images of the new model. And so we employed this thing called the EDM formulation, which samples the noise slightly differently. And it's a really clever kind of math trick, and there's a paper that you could probably read on it.

But it's surprising how this one little, very clever trick can produce images that have incredibly great color and contrast. The blacks are really vibrant with a bunch of different colors, and this average brightness goes away. So that's one thing. That's a really interesting example of really optimizing for one aspect of creating aesthetically pleasing imagery. And there are a few other aspects like that.
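
The EDM formulation Suhail is referring to comes from Karras et al.'s "Elucidating the Design Space of Diffusion-Based Generative Models." A minimal sketch of the relevant piece, how training noise levels are sampled and weighted, using the paper's default hyperparameters rather than anything Playground has disclosed:

```python
import torch

def sample_edm_sigmas(batch_size: int, p_mean: float = -1.2, p_std: float = 1.2) -> torch.Tensor:
    """Draw per-example noise levels from the log-normal distribution used by EDM."""
    return torch.exp(p_mean + p_std * torch.randn(batch_size))

def edm_loss_weight(sigma: torch.Tensor, sigma_data: float = 0.5) -> torch.Tensor:
    """EDM's loss weighting, which balances gradient contributions across noise levels."""
    return (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2

# Hypothetical training step: x0 is a batch of image latents, denoise() is the model.
# sigma = sample_edm_sigmas(x0.shape[0]).view(-1, 1, 1, 1)
# noisy = x0 + sigma * torch.randn_like(x0)
# loss = (edm_loss_weight(sigma) * (denoise(noisy, sigma) - x0) ** 2).mean()
```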

Elad Gil
So I'm just curious how much you have to sort of hand-tune different parameters, versus it just being something that you get as you train a model or post-train a model. Yeah, I mean, there are just so many different dimensions of these models. One is just, like, its understanding of knowledge, but then for, like, aesthetics, it's really tricky, honestly.

Suhail Doshi
I think the field itself is just so nascent that, like, every month there's, like, a new trick, a new thing that we all sort of develop or find out. I think there's an element of some of that being, like, a lot of different tricks. Like, there's this new trick that hasn't been, like, well employed or well exploited yet, by this guy named Tero Karras, and he basically does this weird thing called power EMA.

Anyway, it basically helps training converge really fast. So that's one trick. And then there's this EDM trick, and there's this thing called offset noise. And so there are a lot of tricks for things like color and contrast. There's even a trick called DPO that, I think, works in the language model world and also the image world.

Right? So I think there are lots of tricks that sometimes get you, like, 10 or 20 percent, sometimes 2x improvements. But I think the number one trick is really just that last phase of a supervised fine-tune, where you're finding really great curated data. It's hard to say how much of that is a trick, because it's actually just a lot of meticulous work.
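
Offset noise, one of the color-and-contrast tricks named above, is small enough to show in full. This is a sketch of the commonly shared community version, not Playground's exact implementation:

```python
import torch

def offset_noise(latents: torch.Tensor, offset_scale: float = 0.1) -> torch.Tensor:
    """Gaussian noise plus a small per-channel constant offset.

    The constant shifts the mean of each channel's noise, which lets the
    denoiser learn images far from "average brightness" (very dark or very
    bright overall tones).
    """
    noise = torch.randn_like(latents)
    offset = torch.randn(
        latents.shape[0], latents.shape[1], 1, 1,
        device=latents.device, dtype=latents.dtype,
    )
    return noise + offset_scale * offset
```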

So I think there's a combination of some of these things being tricks and techniques, and then there's just this other thing that's really hard, meticulous work. There has to be deep care, and with images, maybe more so than language, there has to be taste and judgment. Yeah. How do you think about that from the perspective of the evals you do? Because not that many people have amazing taste aesthetically.

Elad Gil
Right. And so I'm a little bit curious how you end up determining what is good taste, or is it just user feedback? Thumbs up, thumbs down. How do you think about that? One thing that I've noticed is that every time we do an eval, we try to make our evals better, and we try to make them better than the predecessor eval.

Suhail Doshi
And so one thing I always notice, though, with each successive run, is that I find out much later, after the eval, that the model has all these gaps. So an example of a gap that we recently had was that we did well in our eval, but one area where I thought we did poorly, that I wish we had done better on, was photorealism. Sometimes it would make faces look like they hadn't gone to sleep for three days or something. And so I think that most evals in the industry are relatively flawed; a lot of them are doing benchmarks on things that maybe are valuable for the purposes of marketing but are not necessarily well correlated with what users care about.

A simple example would be with large language models. There's a reason why they're probably good at homework: it's because a lot of the evals are related to things that could be homework, like solving an LSAT or a bio test or a math test. Some of these evals just don't have the necessary coverage. I think with things like judgment and taste, my feeling is that, overall, the evals need to get way stronger. One thing that we tend to do is just really look at a lot of images across a lot of grids, and we're really being exacting about what could be off.

But you have to look at thousands of images across lots of different grids, across different checkpoints, to basically find and pick release candidates. But I still think that our own evals are not sufficiently strong, and they could be better at world knowledge, whether that's the ability to reproduce a celebrity, if that's what you want, or paintings. Sometimes paintings are difficult, or 3D, or illustrations, or logos. Those kinds of things are all...

Overall, I think coverage is pretty tricky. One of the things that you guys do is have voting schemes or user studies within the product itself. So I don't know if it's grids, but you're asking users to express preferences more so than I think perhaps other research efforts are. Can you talk about, just generally, your data curation strategy, whether there's some sort of overall framework, or whether community is a big piece of it? Generally, we try to keep things very simple, because we know that users are there to make images; they're not there to necessarily help us label images, or annotate things, or tell us everything about their preferences.

And so we have a very sophisticated process for how we curate images and how we're collecting data from these users, to help us rank and make sure we're choosing the right sort of things that we want to curate. And so I think these things might seem very simple when you encounter them, but beneath that is something very, very complex. But, yeah, it's a little tough to go into it too deeply because, yeah, it does feel like a little bit of a secret sauce, I suppose. Yeah. Well, at least I feel good that my guess as to what's interesting is right.

Sarah Guo
Can you just characterize where Playground does really well today, where you stand out, and what sort of use cases you're focused on winning? We're maybe, probably, number two, I suspect, at text to art at the moment, just because we're training these models from scratch, and we're closing the gap really rapidly, as rapidly as we can, around all the various kinds of use cases. But I think that we'll probably diverge from some of the other companies, in part because we're going to care a lot more about editing. People just have a lot of images on their phone, or they want to take some image that they love, whether that's made as art or something that they found, and they want to tweak it a little bit.

Suhail Doshi
It's a little annoying that you make this image and then you can't really change too much about it. You can't change the likeness of it. Maybe there's like a dog or your face or something. Character consistency issues. It feels a little like a loot box right now.

And I think that because it's so much of a loot box, it feels like it's too much effort, I guess, to get something that you really, really want. So I think where we're navigating more is, like, how can we help you take an image that you love, maybe your logo, or incorporate something like your logo, or put it in some sort of situation that you would prefer? Text synthesis is something that we want to do, for example.

Those are some areas that we want to head towards, where there's higher utility, unless you make an image and you just post it to Instagram or something like that. Where do you want to take the company and the product over the next few years? What is the long-term vision of what you're doing? If people are out there working on scaling text, we're basically trying to focus on scaling pixels.

And the first area that we've basically started on is just images. And the reason why we're working on images instead of, say, something like video or 3D is that one issue with 3D is that it tends to be better to work on 3D if you're making the content, like you're making Pixar movies; the tools in 3D tend to not make as much money. The other thing with video is that video is just extraordinarily computationally expensive to do inference or even training on.

And a lot of the video models pre-train with a billion images first anyway, to have a rich semantic understanding of pixels. We just think that images are maybe the most obvious place to start because, A, the utility is quite low, and, B, it's actually somewhat efficient computationally to do so. Long term, I think that we're trying to make a large vision model.

There's not really, like, a word for it, I guess; we have LLMs, but I'm not really sure what the word is for vision or pixels if you're trying to make, you know, a multitask vision model. And so the goal would be to do three areas of a large vision model: to be able to create things, edit things, and then understand things. Understanding would be like GPT-4V, or if you're using something open source, like CogVLM, or all these amazing vision language models that are happening; and then editing and creating are things that we've kind of talked about. But it would be really amazing at some point, if you made this really amazing large vision model, that it could do things like not just create art, but maybe help some kind of robot traverse some sort of path or maze. And then there are things in the middle that are sort of like, maybe you have a video camera or surveillance system or something, and it's able to understand what's going on in that.

But I think right now, we're just really focused on graphics. And then how do you think about the underlying architecture for what you're doing? Because traditionally, a lot of the models have been diffusion model based. And then increasingly, you see people now starting to use transformer based architectures for some aspects of image gen and things like that. How do you think about where all this is heading from an architectural perspective?

Elad Gil
And what sort of models will exist in the next year or two? My kind of controversial take, perhaps, is that there's this thing called DiT, which people allegedly believe Sora is based on. And then there are variants of DiT. There's this thing called, I think, MM-DiT, which I think Stable Diffusion 3 is supposed to be based on, by that research team at Stability AI.

Suhail Doshi
And my overall feeling is that transformers are definitely the right direction. But I don't think that we're going to get enough utility if we're not somewhat trying to figure out a way to combine the great, amazing knowledge of a language model, rather than just using something like DiT, which is completely trained from some kind of video caption or image caption to an image, because there's not enough interpretable knowledge. I suppose you're not able to interpret anything about the input, which a language model is really great at. But then there are these models that are just trained on these captions that emit images.

And it's kind of unclear how we might marry these two things. It sure would be nice if somehow we could combine them. So I think the architecture is most likely going to change. I don't think that DiT is the right architecture, but transformers, certainly. And just for people who are listening, DiT just stands for diffusion transformer.

Elad Gil
So in case people are wondering. One belief held by some of the large labs focused mostly on language today is that, in the end, we end up with, like, one truly multimodal general model, right? That is, we don't end up with a language model and a video model and an audio model and an image model. It is any modality in, giga-brain knowledge, reasoning, long context, and any modality out. Do you believe in that worldview, or how do you see it differently?

Suhail Doshi
I definitely think the models are going to be multimodal. In fact, that's what I mean about some of these models that are just strictly trained through a diffusion transformer that's only taking caption and image inputs; they just completely lack some knowledge. Conversely, if you look at just the language models, we know that language is at a much lower dimensionality than, say, an image, which has all these pixels that tell us about lighting or physics, or spatial relationships, or size and shapes. So, for example, if you were to take a glass and shatter it on the floor, and then I asked you to describe it, and then I described it, we would both come up with completely different descriptions if Elad had to go and draw it.

So we know that pixels have an enormous amount of information density compared to language. And language is just really between me and you; it's like a compressed way that you and I can converse with each other at somewhat of a higher bandwidth, right? Like, we have an abstract view of what those words mean.

So I think that with these models, language is really great because it's compressed information, and then, like, vision is really great because it's so information-rich, but it's been hard to annotate until recently. It's only because vision language models exist that it's now suddenly a lot easier to sort of label or annotate or understand what's going on in an image. So I think that these two things are very likely going to be married. The only question is, does language...

To me, it's kind of a question about language. Language has this wonderful trait where you can use it to control things, which is pretty cool because of its low dimensionality. But my question would be, like, I wonder if language will hit a ceiling, if it has a lower ceiling than, say, vision, because it's very easy to get lots of pixel data, and that pixel data is very, very high density. It's very easy to get additional pixel data on top of the data already collected from the Internet that's gone into these models.

Yeah, I mean, there's an assumption, maybe. One assumption I tend to question is whether the Internet data is sufficient. The Internet's very big, but maybe there's some kind of mode collapse even with Internet data. With vision, at least you can make a robot that just travels down the street and just keeps taking pictures of everything. You can get infinite training data with vision, but it might be trickier to filter and clean Internet data, especially as more synthetic data ends up on the Internet.

Elad Gil
One other area that I know you spend a lot of time on is music. You make your own music and produce it. There have been a number of different applications, Riffusion, et cetera, that have come up on the music side. I was just curious how you've been paying attention to that, what you think of it, and where you think that whole space evolves to. Yeah, I love audio.

Suhail Doshi
That'd be the other kind of thing that I would go work on if it weren't for Playground. Partly I didn't work on music because the whole music industry is only, like, $26 billion. So it was a little hard for me to figure out how big a music thing could be. But I definitely think audio is going to be enormous. Things like ElevenLabs are very interesting.

But anyway, yeah, I mean, I've been trying to find ways to figure out how to use it as a user, because that gives me a stronger sense of maybe where things are going. And so one thing that I've been waiting for for many years: instrumentals in music are actually very easy to get or to make. You know, there's a wide variety of quality, of course, but generally instrumentals in a song, like if you hear a song from Taylor Swift or whoever, or a rap song, those beats or those instrumentals are fairly easy to make. What's hard is to get lyrics and vocals. And that's always been, like, a difficulty of mine.

Like, how do I find a singer, and then how do I get them to write lyrics and then sing it? That's a much more scarce resource in the music world. And so for the first time, with something like Suno AI, it was really cool, because it's the first time that I heard it be able to make, like, a rap song where the rapper has good flow. Flow is just, like, the swing of lyrics to a beat. Or you hear actually really good lyrics that feel very emotional, have the right breathiness.

It doesn't sound like it's all made on, like, Auto-Tune, I guess. And so I have this little flow where I make a song in Suno and then I use a different AI tool, it's AI tools all the way down, I guess, to split the stems and just grab the vocals, but then throw away the instrumental, and then I get to make a song with my own instrumental and that vocal. Anyway, so I put some songs on my Twitter where I basically tried to do this, and I can get to a higher-quality song, I guess, because I make the instrumental. There are still some weird errors in the songs, but that's been, like, a really cool way to use AI, in my opinion.
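
Suhail doesn't name the stem-splitting tool he uses, but the workflow he describes (keep the generated vocal, drop the generated instrumental) can be approximated with an open-source separator such as Demucs. A rough sketch under that assumption; the output path depends on the Demucs model version:

```python
import subprocess
from pathlib import Path

def extract_vocals(song_path: str, out_dir: str = "separated") -> Path:
    """Split a rendered song into vocal and accompaniment stems with Demucs,
    then return the path of the isolated vocal stem."""
    subprocess.run(
        ["demucs", "--two-stems=vocals", "-o", out_dir, song_path],
        check=True,
    )
    # Demucs writes <out_dir>/<model_name>/<track_name>/vocals.wav;
    # "htdemucs" is the current default model name (may differ by version).
    return Path(out_dir) / "htdemucs" / Path(song_path).stem / "vocals.wav"

# Usage: export the Suno track, pull out the vocal, then lay it over your own beat.
# vocal = extract_vocals("suno_track.mp3")
```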

Elad Gil
Suhail, thanks so much for sharing everything that you're working on at Playground with us. Find us on Twitter at @NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.
