I know 29 minutes is long but theoretically you can have all the images you ever want in a small 6gb package and run inference on (nearly) everything. That's fucking amazing.
But honest question, if this is your goal, why not use a GAN instead? You should still be able to produce high quality images but at a much faster rate (I'd guess around 10 minutes?). Sure, you'll have a bit lower diversity and maybe not SOTA quality image generation, but neither is this thing. Or you could reduce quality. This reddit user seems to be doing fast inference on a pi[0] using StyleGAN, but that's before MobileStyleGAN came out, which uses <1GB for inference. (It is a distilled StyleGAN2 model; we could distill more recent models.)
Just seems like different models, different contexts. Certainly you'd want diffusion on the computer you're doing photoshop on, but random images? Different context.
I presumed it was safe to let users infer that a 2018 (StyleGAN) or 2019 (StyleGAN2) model was not going to compete with the performance of a 2023 model, regardless of architecture. There have in fact been improvements in GANs in the last 5 years. Text conditioning is not unique to diffusion; it is just a subnetwork for conditioning (in fact, StyleGAN's whole innovation was creating a subnetwork to condition synthesis).
There are definitely modern GANs that are T2I and computationally cheaper than a latent diffusion model at a comparable benchmark score. (Recognizing that our metrics are limited and only rough qualifications of image quality, and that their meaningfulness decreases with increased realism, so that's not really a hindrance for our specific use case, alongside the decreased diversity mentioned before.)
> The quality is not really close, also StyleGAN is not conditioned on text.
Some examples of T2I GANs with comparable quality (not even something I claimed, for context of the request...):
- GigaGAN (2023). Base generator @ 652.5M params and upsampler @ 359.1M params. While 512x512 generation has similar model size to SD 1.5 the inference speed is 16x faster. The gap widens for text conditioned super-resolution 128->1024 https://arxiv.org/abs/2303.05511
- StyleGAN-T (2023) (notably by Sauer who recently joined Stability and their first paper hit the front page a few weeks ago. Also includes the main SG authors from Nvidia) Figure 1 speaks for itself, noting that this is on T2I. It's also worth noting Sauer's previous work (StyleGAN-XL (2022)) did text conditioned experiments. https://arxiv.org/abs/2301.09515
- LAFITE (2022) comes in at a tiny 75M params for 256x256 generation and has quality comparable to the 12B param (autoregressive) DALL-E while being 1,600x faster in inference https://arxiv.org/abs/2111.13792
There are plenty more too. I'm not even suggesting we take a proven architecture and train it under settings comparable to what the popular diffusion models received (which would be a fair one-to-one comparison), but pointing at what has already been done and demonstrated, because the context is an engineering project, not research. Certainly all the above works, including many diffusion methods, would improve vastly were they given the same treatment as Stable Diffusion, but that's not the context here.
Stop buying into hype. There isn't one model to rule them all, there are models that are better in differing contexts.
>Stop buying into hype. There isn't one model to rule them all, there are models that are better in differing contexts.
I didn't say any of that. There is simply no open source GAN model that can compete with the open source diffusion models we have today, and the fact that these models can be distilled down to 1/2/4 steps makes GANs less attractive.
I'm sorry, then I don't know what you're saying, because what I read was 2 claims: 1) GAN quality is less than diffusion. 2) GANs can't do T2I. I think I adequately showed that both these assumptions were wrong. I'm not sure what else your comment could have meant, as that covers all of its words...
> there is simply no open source GAN model that can compete with the open source diffusion models we have today
Again, I provided citations. Do you want the github links? Here's GigaGAN's: https://github.com/mingukkang/GigaGAN (checkpoints under the evaluation folder).
What do you mean by compete? Because the metric scores are quite comparable. I think that's a reasonable interpretation of the word "competitive" but I'm a generative researcher so we might be using the terms to mean different things. (I even like diffusion more fwiw, but I'm particularly more interested in tractable density models)
Do you mean "the open community has rallied around Stable Diffusion and sunk more time into tuning this model and producing textual inversions and LoRAs, which far surpass those of any other model (even other diffusion models), a phenomenon nearly exclusive to Stable Diffusion and mostly accomplished in the last year"? Because if so, yeah, I agree.
But that's not a really good argument for saying diffusion is better than GANs; it's a completely different argument. It says that __Stable Diffusion__ (not the class of diffusion models, or even more specifically latent diffusion models (the two are different)) has better community support. That's a reasonable argument, but a different one, because it really just shows what you can do to __any__ model. These techniques and efforts are not architecture dependent, nor are they even modality dependent. We see similar community effort around OpenLLaMa and GPT but less so around Bard/Gemini or Claude. So what? That's not really relevant to the conversation, nor to the specific context we're discussing, which is very likely not going to include many of those TIs or LoRAs (or if it did, you'd probably be training a custom one, so the point is again moot since LoRAs are not unique to diffusion models).

I'm really happy a lot of people have entered the community and are effectively doing research, but it'd be quite naive to say that such a thing isn't possible around other architectures. Evangelism is quite useful, but not if it turns into religious belief. Hell, we could argue that Linux is better than Windows because Linux has a bigger hacker community, but I'm not sure that's a great or even meaningful argument because it's void of context in what "better" actually means. The better OS is clearly situationally dependent, which makes a lot of the OS holy wars silly.
> and the fact that these models can be distilled
Literally any model can be distilled, so I'm not understanding your argument. Are you just arguing that such efforts have already taken place? Sure, I'll agree to that. But it's worth noting that the distilled diffusion models are still quite large, and much larger than some of the aforementioned works.
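For what it's worth, the mechanics of distillation are the same regardless of architecture: train a small student to match a frozen teacher's outputs. A minimal numpy sketch (the teacher and student here are hypothetical toy functions, not any real generator):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "teacher": any fixed pretrained function works; here a random
# nonlinear map (scaled down so the tanh isn't fully saturated).
W_teacher = 0.1 * rng.standard_normal((64, 32))
def teacher(z):
    return np.tanh(z @ W_teacher)

# Much smaller "student": a plain linear map trained to mimic the teacher.
W_student = np.zeros((64, 32))

lr = 0.05
losses = []
for _ in range(200):
    z = rng.standard_normal((128, 64))      # shared latent inputs
    target = teacher(z)                     # teacher outputs become the targets
    err = z @ W_student - target
    losses.append(float(np.mean(err ** 2)))
    W_student -= lr * (z.T @ err) / len(z)  # gradient step on the MSE
```

The loss drops steadily as the student absorbs the teacher's behavior; nothing in the loop cares whether either network is a GAN, a diffusion model, or anything else.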
> 1/2/4 steps makes GANs less attractive.
I even addressed the one-step process. Single-step diffusion generation represents a significant decrease in quality, so this seems to run counter to your prior argument, which is why I'm a bit confused. If we're going with a low-step diffusion model then the case for the GAN becomes clearer, because even a non-distilled GAN's inference speed is still higher and its quality is definitely superior.
So really, I am confused. I'm not sure what you're arguing.
We can get into the weeds and discuss diversity, recall, memorization, density estimation, and all manner of things but these are quite open questions and frankly understudied. Plus we'll have to be extra nuanced because the metrics are proxies and incorporate different biases that different architectures are going to suffer from, making it difficult to compare in a more fair sense. But that's okay.
Let's also be clear: GAN != StyleGAN{1,2,2-ada,3,XL,T,*} and diffusion != StableDiffusion. There are other GANs and other diffusions and even other image synthesis models, the vast majority of them being open source.
Again, I'll assert, there are no universal models that are best for all situations. There are only models that are best at specific situations. You can translate this to "AGI doesn't exist but we have narrow AI" if that is clearer. But my claim is a bit broader still because if that AGI took a warehouse to run it still wouldn't be contextually appropriate.
Just FYI, it will be less obvious that you're entrenched and "dug in" to the field of GAN research if you're less defensive when people say fairly reasonable things about different ideas.
Problem is, none of what you believe about me is true. My main focus area is explicit density estimation; GANs are implicit. I like math...
But you can check my comment history to see that I really hate hype. I said there are better models for different tasks. Here's a breakdown:
- Autoregressive: Best for time sequential data
- Normalizing Flows: Best for density estimation and statistics
- Diffusion: Best at general image synthesis, editing, and diversity.
- GANs: Best when working on edge devices or throughput is critical
- VAEs: Best for situations between diffusion and flows, where an implicit PCA is desired
You can tell why I discussed GANs: they hit the specific use case. Remember, my first comment literally says the motivation is trading image quality and diversity for generation time. Because again, the claim is that there is no one-size-fits-all model; such a notion is silly. You don't use a diffusion model to do real-time image upscaling (e.g. for video), nor do you want it for RTX supersampling, but you do want diffusion for general image synthesis tasks in areas like Photoshop. Inpainting, outpainting, T2I, unconditional, I2I, and all that: you want diffusion because it is much better for those tasks.
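To make the speed trade-off concrete: a GAN generator produces a sample in one forward pass, while a diffusion sampler calls its network once per denoising step. A toy sketch counting network calls (everything here is a hypothetical stand-in, not any real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def net(x):
    # Stand-in for one forward pass of either generator network;
    # only the number of calls matters for this sketch.
    net.calls += 1
    return np.tanh(x)
net.calls = 0

def gan_sample(z):
    # GAN sampling: one forward pass per image
    return net(z)

def diffusion_sample(z, steps=50):
    # Diffusion sampling: one forward pass per denoising step
    x = z
    for _ in range(steps):
        x = x - 0.1 * net(x)
    return x

z = rng.standard_normal(8)
net.calls = 0
gan_sample(z)
gan_calls = net.calls      # 1 call

net.calls = 0
diffusion_sample(z)
diff_calls = net.calls     # 50 calls
```

Distillation narrows that 50x gap in network evaluations, but on a pi every remaining call is paid for in wall-clock time.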
Just... diffusion requires big models and they're slow and the raspberry pi is small. So you make sacrifices.
Sorry, but at the link you provided there are no model weights, only inferred images offered for examination, so at this point it is at the vaporware stage compared to Stable Diffusion tools.
Sorry, you're right. That is a really weird thing to do: just provide 6GB of images...? They don't even have an issues page. Looks to be a common thing by that author. You're right, very suspicious.
Lafite has their checkpoints at least. Results aren't great, but it is small and fast.
They have a colab but it's broken (lol). Fix it by removing the torch version pins and adding gdown, then replace the wget line with "!gdown https://drive.google.com/u/0/uc?id=17ER7Yl02Y6yCPbyWxK_tGrJ8..." (the checkpoint from their github). Then everything works fine. It took some time to get decent outputs (but then again, so did my first time with any diffusion model. This is definitely lower quality though). At least the authors look engaged in the github issues and do show how to get better results. (Always be suspicious of the images shown in papers... this one is certainly no exception.)
I mean it was pretty simple: the quality is not really close, and StyleGAN2 is not conditioned on text (since you talked about this model in your first comment). In the future there could be one that is competitive, but not today.
GigaGAN, the best GAN model by far, is not open source, and it's not competitive yet: even the images cherry-picked for the paper and project page do not look that coherent. The FID is relatively low because the Inception v3 model used to calculate it doesn't care that much about global coherency, more about texturing; if the FID were calculated using DINOv2 (like some recent papers do) instead of Inception v3, it would really show the gap between GAN models and diffusion models today.
Look, in every single message I've said that you're trading quality for speed. That does in fact mean GANs are worse. I'm not sure why you think I've said anything short of that; I've explicitly agreed with you that diffusion produces higher quality images, and I'm not interested in repeating myself any further.
We're talking about a fucking raspberry pi, it is definitely reasonable to want to __trade quality for speed__ when you're working on a tiny computer.
I'm glad you're aware that FID has limits; for some reason that awareness is uncommon. But there are more limits than the backbone classifier. Yes, DINOv2 helps (so do CLIP-FID and clean-FID, which is backbone independent. I feel I need to explain because we're having difficulties communicating, and going to their code and seeing they use Inception isn't going to mean anything to you because that's not what that work is about). But fundamentally, the difference between two normalization layers of a classifier is not actually a measurement of image quality. It correlates, but these metrics are also fundamentally about the distributional nature of the outputs. There actually is no method that does particularly well, but we're over here doing our best. You just have to be aware of the limits of your metrics because (as is the theme of our entire conversation) context matters. Here's a paper you may find interesting: https://arxiv.org/abs/2306.04675
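For concreteness, FID itself is nothing more than the Fréchet distance between two Gaussians fitted to feature embeddings, so everything the metric "sees" is two means and two covariances; that's why the choice of feature backbone dominates. A minimal sketch, assuming numpy and scipy (the feature arrays are stand-ins for real Inception/CLIP/DINOv2 embeddings):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * covmean))

# Hypothetical "features" standing in for real embedding outputs
rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 8))
```

Identical feature sets score ~0 and a shifted copy scores high, but note that any two image sets whose embeddings share first and second moments are indistinguishable to this number, whatever the images look like.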
Running SD v1.5 with a single step is faster than sampling with GigaGAN, and you can achieve better coherency in my opinion: https://arxiv.org/abs/2311.09257
Then it does make more sense to run this on a raspberry pi.
Yeah, I'll agree to that. But neither UFOGen nor GigaGAN has released models (I've admitted to being wrong about GigaGAN's checkpoints. In fairness, who the fuck releases 6GB of generated images and removes the issues tab from GitHub? Can we agree that's sketchy as fuck?[0]).
LAFITE is the only one of the three with released checkpoints, but it's definitely not at the same quality (Nvidia backed down from placing it in products, imo a bad move). But it is 75M params compared to the 1B of UFOGen and GigaGAN. I was able to get somewhat better images with some prompt engineering, but yeah, classic paper painting a much better picture of its model than it actually is (pun intended). Then again, I got shitty images the first time I used SD, so YMMV. There are probably better works out there, but honestly my focus is elsewhere and no one's got time to keep on top of and try everything coming out.
I'm literally just saying that you can trade quality for inference. Do you disagree?
Because I'm not sure why you think I believe GANs are better for quality, as I've said the opposite many times. Why are you hyper-focused on quality and thinking I've said GANs do better? This is why we're talking past one another: you're attributing assumptions to me that I ('m doing my fucking best to clarify that I) don't have. If you think I think GANs produce higher quality images, I assure you that this is from your imagination, as I've never stated such and it is not an opinion that I hold. Sorry, I said I wouldn't say this again, so last time for real.
[0] There's too much sketchy shit going on in ML research right now and honestly, that's why I've been more passionate about trying to get people to think harder and about context. Again, this is about context. (I also really hate this railroading as it stifles community innovation and sweeps important problems under the rug by saying to just rely on large companies for checkpoints. The community doing so much around Stable Diffusion is awesome, but I want to see that around lots of works because there's a lot that can be accelerated by even a hundredth of this community effort)