So it's safe to assume that in the next 10 years AI will be running locally on every device: phones, laptops, and even many embedded devices. Even robots, from street-cleaning bots to helpful human assistants?
The bottleneck is probably the availability of lithography machines that can make ubiquitous chips able to process that much data quickly enough without overheating or drawing too much power.
In the not-too-distant future, every device will have an LLM chip built on 5 nm or better process technology inside, and devices that understand natural language will be the norm.
By "dumb machines", people will mean machines that have to be programmed by humans using ancient techniques, where everything the machine is supposed to do is written out step by step in a low-level computer language like JavaScript.
Nerds will be making demos of doing something incredibly fast by writing the algorithms directly by hand, and will be annoyed that something that can be done in 20 lines of code on a few hundred MB of RAM in NodeJS now requires a terabyte of RAM.
A "dumb phone" will be something like an iPhone 15 Pro or Pixel 8 Pro, where you have separate apps for each thing you do and you can't simply ask the device to do it for you.
I really don't buy that last paragraph. It's the idea the AI Pin founders keep repeating: you don't want apps, the device chooses for you, etc. In a world where you have economic choice, that's quite hard to imagine: would there be no competition? What good is it for you if you can't even choose what music app you want? Without _something_ like an app, how do you even know what you can interact with?
> Without _something_ like an app, how do you even know what you can interact with?
You just say what you need or want and the AI will suggest options.
>What good is it for you if you can’t even choose what music app you want?
You don't want a music app, you want to listen to music. The AI will simply play it for you, and any additional features that you might like will be available. Feel like doing karaoke? Just tell the AI. Want more songs like this? Tell the AI. Not sure what you want? Tell the AI that you are not sure and ask for suggestions, maybe vaguely describe how you feel. You get the idea.
> would there be no competition?

There would, but not in the current sense.
Maybe there would be experience designers and knowledge brokers. The AI might tell you that you will need a weather info subscription if you need something specific that is not provided by your current subscriptions. Maybe you want a specific recipe for a home lighting style from a designer? The AI can help you purchase and apply that. Maybe you want to try dating? It wouldn't be an app but a subscription you can join, and the AI will handle the UX.
However, I think there will be games because games are apps whose value comes from interaction with the app itself.
I absolutely, 100%, want YouTube Music, because it has the music I want and I have my playlists there. That entire argument of "you don't want X" is misguided.
Maybe for someone whose first computer ever is that AI panopticon, sure. But will all music be free? What about Netflix? Will it also be free? What happens to competition? Or choice of brands?
These are already under threat; people have been editing images using text for a year now. No tools needed: you can describe what you want, or tap on an area of the image and describe what changes you want. These days they've started doing video too.
They're still apps, and nobody who does anything minimally professional uses "no tools" (in fact it's quite the opposite; AI added a mind-numbing number of tools to the toolset). I really don't see how these will just "disappear" and become some amorphous "talk to the computer" interface.
It's not at a professional level yet, but it was barely OK a few months ago. It's moving very fast.
The gist is: if something is learnable by practice, the AI can do it, because by training these machines we actually teach them patterns and methods. Any "blue collar" job like editing images is going away.
Talking is waaaaay more time consuming than tapping or typing, having to memorize what I’m able to do on my device instead of just taking a glance at its screen is a lot of cognitive overload, not having any shared UI with others is a recipe for eternal confusion…
I really don't see any clear advantages of a one-size-fits-all, shapeless interface driven by prompts (verbal or not). It doesn't feel like a UX that makes sense to me, and I haven't heard a case where it does yet, tbh. (I do think the "personal assistant" makes sense, and might replace some stuff, but I don't see how it'd become the next interface for everything.)
There's no reason why an AI app wouldn't be able to interact with you through the most suitable UI.
For example, when playing music it can show you the basic buttons but also let you type or speak for more advanced stuff, like "let's do karaoke" or "why don't we go through movie soundtracks, showing me iconic scenes from each movie while playing the songs".
Every Google Home device is already running an ML model to do speech recognition to recognize the "hey Google" wake word, so sooner than 10 years. The Raspberry Pi Zero is a particularly underpowered device for this. Doing it on the Coral TPU accelerator plugged into a Pi Zero would take less than 30 mins. Doing it on an iPhone 15 would take less time. Doing it on a Pixel 8 would be faster. Not to diminish getting it to work on a Pi Zero, but that future is already here, just as soon as we figure out what to do with them.
There's an ocean of difference between optimizing for a single wakeword and the class of models that are taking off today. I'm excited for more on-board processing, because it will mean less dependency on the cloud.
I'm not going to argue that there isn't a difference when going from 0 -> 1 and 1 -> 10, or in this case, from 1.5B (Whisper-large) -> 1.7T parameters (GPT-4). But it's not like we don't know how to do it, so it won't take 10 years to get there.
Siri’s wake word stuff is also terrible, she gets constantly activated whenever I have my Apple Watch near running water, frying food or anything else that makes a white noise-type sound.
Yeah, that's happening right now, really. There have been loads of developments in the mobile space already; in many ways lower-powered ARM devices are way more optimised for AI applications than the current crop of Intel machines.
This example, whilst impressive, feels way more in the "Doom running on a calculator" vein of progress though.
Will models of similar quality to the current LLaMA, GPT, and Stable Diffusion be running locally on devices and edge systems? Very likely.
Will much higher quality models, still requiring compute beyond the capability of such edge or consumer devices, be available, sold as a service, and in high use? Also very likely.
So expect current quality to make it to your devices, but don't necessarily expect everything to move local, because the whole ecosystem will improve too. The Overton window will shift; it's like asking if gaming will move to phones. In some ways yes; in other ways you're still going to want to buy that PlayStation/Xbox/PC.
“AI” is already running on every single laptop and phone. If you mean diffusion models (like this), then I’d say it’s 100% guaranteed they’ll run everywhere too, since they’ll get faster and more refined, and processing power keeps growing (it doesn’t even have to grow fast)
I mean, diffusion models tend to be less computationally expensive than, say, CNNs or LLMs, so probably? And before that, people ran SVMs, random forests, and other forms of non-GPU-intensive ML algorithms locally as well...
This project is a fun POC but it's not very practical for that type of application.
A 4090 can generate over 100 images a second with turbo+LCM and a few techniques; you can make 2 days' worth of images in 1 second. You could make a year's worth in roughly 3 minutes and put them on the SD card.
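The arithmetic works out if the display swaps its image every ~30 minutes, which is about the cadence this kind of Pi photo-frame project implies. A quick sketch, with the 100 img/s throughput and 30-minute rotation taken as assumptions:

```python
# Back-of-envelope: pre-generating a year of background images offline,
# assuming ~100 img/s throughput (turbo + LCM on a high-end GPU) and a
# new image on the display every 30 minutes. Both numbers are assumptions.
SECONDS_PER_DAY = 24 * 60 * 60
ROTATION_SECONDS = 30 * 60            # one new image every half hour
THROUGHPUT_IMGS_PER_SEC = 100         # assumed generation rate

images_per_day = SECONDS_PER_DAY // ROTATION_SECONDS        # 48 images/day
images_per_year = images_per_day * 365                      # 17,520 images
generation_seconds = images_per_year / THROUGHPUT_IMGS_PER_SEC

print(images_per_day, images_per_year, round(generation_seconds / 60, 1))
# -> 48 17520 2.9  (i.e. roughly 3 minutes for a year's worth)
```

So at that rate the "2 days in 1 second" claim is actually slightly conservative (96 images vs. 100).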
> I found this claiming an A100 can generate 1 image/s.
The article you linked is over a year old. Needless to say, there have been a LOT of optimizations in the last year.
Back then it was common to use 50+ steps for many of the common samplers. Current methods use a few steps, sometimes just 1. This OnnxStream demo is using SDXL-Turbo, and you can combine LCM and a few other methods to go very fast.
The reason it's so much faster now is that OnnxStream is only using a single step.
However, even if you only get 1 image/s with whatever GPU you have, I stand by my original statement that unless you want to do it for the cool factor (which is very valid), pre-calculating them makes more sense.
I actually get around 100 imgs/s on my 3080 Ti. Three things to note: 1) you gotta run the max-perf code to get the high throughput, 2) the images in this setting are absolute garbage, 3) you don't save the images, so you're going to have to edit the code to extract them.
Definitely agree that this project is much more about the cool factor. I suggested a GAN in another comment for similar reasoning (because it's a Pi...), but if you want quality images, well, I'm not sure why anyone would expect to get those out of a Pi. High-quality images take time and skill. But it's also HN; I'm all for doing things for the cool factor (as long as we don't sell them as things they aren't; ML is cool enough that it doesn't need all the hype and garbage).
> Back then it was common to use 50+ steps for many of the common samplers. Current methods use a few steps, sometimes just 1.
The "look how fast we can go" method (turbo model with 1 step and without CFG) is blindingly fast, but the quality is...nothing close to what was being done in normal 50+ steps with normal setitngs gens.
Realistically, even with Turbo+LCM, you're still going to need 4+ steps (often 8+), with CFG, for reasonable one-generation quality anywhere close to the images people generated at 50+ steps without Turbo/LCM.
> Realistically, even with Turbo+LCM, you're still going to need 4+ steps (often 8+), with CFG, for reasonable one-generation quality anywhere close to the images people generated at 50+ steps without Turbo/LCM.
For sure; the only reason I considered comparing it that way was because the linked repo appears to be going for a similar approach, with 1 step/image on the Pi.
From my own experience, I've had a hard time ever getting a decent image below 6-8 steps, but this repo seems more focused on getting it to run in a reasonable amount of time at all, which understandably requires the minimal "maybe passable" settings.
They might be talking about this[0], as it has been popular recently. It can definitely do >60 imgs/s on my 3080 Ti, but you're not going to want any of those images; they are absolute garbage.[1] I can do a little under an image a second and some may be quite usable, but nowhere near what you're going to get from the standard model.
The whole point was that you'd be getting random pictures just-in-time, at a leisurely rate suitable for background image rotation, without other interaction.
I think you're just missing the point, which most certainly isn't buying compute to generate zillions of images ahead-of-time and then replaying them at a rate of one every half hour or whatever. Anyone can do this. The idea of having a tiny instance of SD crammed onto a tiny computer, taking its time to compute the images just-in-time (so even in theory you don't know what you're going to get next), is simply much more fun and original, never mind way more aligned with the hacker ethos.
I was just using that as a reference. Stable Diffusion will run well with almost any relatively modern GPU.
You don't have to use a 4090, you'll still get double digit performance with a 3060 or whatnot.
> for people who can only otherwise afford Raspberry Pis ;)
You can rent a 4090 for 0.70 USD/hr, or get an A100 for 1.10 USD/hr. And if your project is a display + Raspberry Pi, then the hardware cost will dwarf the rental cost.
I know 29 minutes is long, but theoretically you can have all the images you ever want in a small 6 GB package and run inference on (nearly) everything. That's fucking amazing.
But honest question: if this is your goal, why not use a GAN instead? You should still be able to produce high-quality images but at a much faster rate (I'd guess around 10 minutes?). Sure, you'll have a bit lower diversity and maybe not SOTA image generation quality, but neither is this thing. Or you could reduce quality. This Reddit user seems to be doing fast inference on a Pi[0] using StyleGAN, but that's before MobileStyleGAN came out, which uses <1 GB for inference. (It is a distilled StyleGAN2 model; we could distill more recent models.)
Just seems like different models, different contexts. Certainly you'd want diffusion on the computer you're doing Photoshop on, but for random images? Different context.
I presumed it was safe to let users infer that a 2018 (StyleGAN) or 2019 (StyleGAN2) model was not going to compete with the performance of a 2023 model, regardless of architecture. There have in fact been improvements in GANs in the last 5 years. Text conditioning is not unique to diffusion; it's just a subnetwork for conditioning (actually StyleGAN's whole innovation was creating a subnetwork for conditioning synthesis).
There are definitely modern GANs that are T2I and computationally cheaper than a latent diffusion model for a comparable benchmark score (recognizing that our metrics are limited and only rough qualifications of image quality, and that the meaningfulness of the metrics decreases with increased realism; that's not really a hindrance for our specific use case, though, and as mentioned before there's decreased diversity).
> The quality is not really close, also StyleGAN is not conditioned on text.
Some examples of T2I GANs with comparable quality (not even something I claimed, for context of the request...):
- GigaGAN (2023). Base generator @ 652.5M params and upsampler @ 359.1M params. While 512x512 generation has similar model size to SD 1.5 the inference speed is 16x faster. The gap widens for text conditioned super-resolution 128->1024 https://arxiv.org/abs/2303.05511
- StyleGAN-T (2023) (notably by Sauer who recently joined Stability and their first paper hit the front page a few weeks ago. Also includes the main SG authors from Nvidia) Figure 1 speaks for itself, noting that this is on T2I. It's also worth noting Sauer's previous work (StyleGAN-XL (2022)) did text conditioned experiments. https://arxiv.org/abs/2301.09515
- LAFITE (2022) comes in at a tiny 75M params for 256x256 generation and has quality comparable to the 12B param (autoregressive) DALL-E while being 1,600x faster in inference https://arxiv.org/abs/2111.13792
There are plenty more too. I'm not even suggesting we take a proven architecture and train it under comparable settings to those of the popular diffusion models (which would be a fair one-to-one comparison), but pointing at what has already been done and demonstrated, because the context is an engineering project, not research. Certainly all the above works, including many diffusion methods, would vastly improve were they given the same treatment as Stable Diffusion, but that's not the context here.
Stop buying into hype. There isn't one model to rule them all, there are models that are better in differing contexts.
>Stop buying into hype. There isn't one model to rule them all, there are models that are better in differing contexts.
I didn't say any of that. There is simply no open-source GAN model that can compete with the open-source diffusion models we have today, and the fact that these models can be distilled down to 1/2/4 steps makes GANs less attractive.
I'm sorry, then I don't know what you're saying, because what I read was two claims: 1) GAN quality is less than diffusion's, and 2) GANs can't do T2I. I think I adequately showed that both assumptions were wrong. I'm not sure what else your comment meant, as that contains all of its words...
> there is simply no open source GAN model that can compete with the open source diffusion models we have today
Again, I provided citations. Do you want the github links? Here's GigaGAN's: https://github.com/mingukkang/GigaGAN (checkpoints under the evaluation folder).
What do you mean by compete? Because the metric scores are quite comparable. I think that's a reasonable interpretation of the word "competitive" but I'm a generative researcher so we might be using the terms to mean different things. (I even like diffusion more fwiw, but I'm particularly more interested in tractable density models)
Do you mean "the open community has rallied around Stable Diffusion and sunk in more time to tuning this model and producing textual inversions and LoRAs which far out surpass that of any other model, even including other diffusion models and is a phenomena nearly exclusive to Stable Diffusion and has been mostly accomplished in the last year"? Because if so, yeah, I agree.
But that's not really a good argument for saying diffusion is better than GANs; it's a completely different argument. It just says that __Stable Diffusion__ (not the class of diffusion models, or even latent diffusion models more specifically; the two are different) has better community support. That's a reasonable argument, but a different one, because it really just shows what you can do to __any__ model. These techniques and efforts are not architecture dependent, nor are they even mode dependent.

We see similar community effort around OpenLLaMA and GPT but less so around Bard/Gemini or Claude. So what? That's not really relevant to the conversation, nor to the specific context we're discussing, which very likely isn't going to include many of those TIs or LoRAs (or if it did, you'd probably be training a custom one, so the point is again moot, since LoRAs are not unique to diffusion models). I'm really happy a lot of people have entered the community and are effectively doing research, but it'd be quite naive to say that such a thing isn't possible around other architectures.

Evangelism is quite useful, but not if it turns into religious belief. Hell, we can argue that Linux is better than Windows because Linux has a bigger hacker community, but I'm not sure that's a great or even meaningful argument, because it's void of context in what "better" actually means. The better OS is clearly situationally dependent, which makes a lot of the OS holy wars silly.
> and the fact that these models can be distilled
Literally any model can be distilled. I'm not understanding your argument. Are you just arguing that such efforts have already taken place? Sure, I'll agree to that. But it's worth noting that the distilled diffusion models are still quite large and much larger than some of the aforementioned works.
> 1/2/4 steps makes GANs less attractive.
I even addressed the one-step process: single-step diffusion generation represents a significant decrease in quality, so this seems to run counter to your prior argument. This is why I'm a bit confused. If we're going with a low-step diffusion model, then the case for the GAN becomes clearer, because even your non-distilled GAN's inference is still faster and its quality is definitely superior.
So really, I am confused. I'm not sure what you're arguing.
We can get into the weeds and discuss diversity, recall, memorization, density estimation, and all manner of things but these are quite open questions and frankly understudied. Plus we'll have to be extra nuanced because the metrics are proxies and incorporate different biases that different architectures are going to suffer from, making it difficult to compare in a more fair sense. But that's okay.
Let's also be clear: GAN != StyleGAN{1,2,2-ada,3,XL,T,*} and diffusion != StableDiffusion. There are other GANs and other diffusions and even other image synthesis models, the vast majority of them being open source.
Again, I'll assert, there are no universal models that are best for all situations. There are only models that are best at specific situations. You can translate this to "AGI doesn't exist but we have narrow AI" if that is clearer. But my claim is a bit broader still because if that AGI took a warehouse to run it still wouldn't be contextually appropriate.
Just FYI, it will be less obvious that you're entrenched and "dug in" to the field of GAN research if you're less defensive when people say fairly reasonable things about different ideas.
Problem is, none of what you believe about me is true. My main focus area is explicit density estimation; GANs are implicit. I like math...
But you can check my comment history to see that I really hate hype. I said there are better models for different tasks. Here's a breakdown:
- Autoregressive: Best for time sequential data
- Normalizing Flows: Best for density estimation and statistics
- Diffusion: Best at general image synthesis, editing, and diversity.
- GANs: Best when working on edge devices or throughput is critical
- VAEs: Best for situations between diffusion and flows, where an implicit PCA is desired
You can tell why I discussed GANs: they hit the specific use case. Remember, my first comment is literally saying the motivation is trading image quality and diversity for generation time. It's because, again, the claim is that there isn't a one-size-fits-all model. Such a notion is silly. You don't use a diffusion model to do real-time image upscaling (e.g. for video), nor do you want it for RTX supersampling, but you do want diffusion for general image synthesis tasks in areas like Photoshop. Inpainting, outpainting, T2I, unconditional, I2I: for all of those you want diffusion, because it is much better at those tasks.
Just... diffusion requires big models, and they're slow, and the Raspberry Pi is small. So you make sacrifices.
Sorry, but at the link you provided there are no model weights, only inferred images offered for examination, so at this point it is at the vaporware stage compared to Stable Diffusion tools.
Sorry, you're right. That is a really weird thing to do: just provide 6 GB of images...? They don't even have an issues page. Looks to be a common thing by that author. You're right, very suspicious.
LAFITE has its checkpoints at least. Results aren't great, but it is small and fast.
They have a colab but it's broken (lol). Fix it by removing the torch versions and adding gdown. Replace the wget line with "!gdown https://drive.google.com/u/0/uc?id=17ER7Yl02Y6yCPbyWxK_tGrJ8..." (checkpoint from their GitHub). Then everything will work fine. It took some time to get some decent outputs (but then again, so did my first time with any diffusion model; this is definitely lower quality though). At least the authors look engaged in the GitHub issues and do show how to get better results. (Always be suspicious of the images shown in papers... this one is certainly no exception.)
I mean, it was pretty simple: the quality is not really close, and also StyleGAN2 is not conditioned on text (because you talked about this model in your first comment). In the future there could be one that is competitive, but not today.
GigaGAN, the best GAN model by far, is not open source, and it's not competitive yet: even the images cherry-picked for the paper and project page do not look that coherent. The FID is relatively low because the Inception-v3 model used to calculate FID doesn't care that much about global coherency, more about texturing; if FID were calculated using DINOv2 (like some recent papers do) instead of Inception v3, it would really show the gap between GAN models and diffusion models today.
Look, in every single message I've said that you're trading quality for speed. That does in fact mean GANs are worse. I'm not sure why you think I have said anything short of that; I've explicitly agreed with you that diffusion produces higher quality images, and I'm not interested in repeating myself any further.
We're talking about a fucking raspberry pi, it is definitely reasonable to want to __trade quality for speed__ when you're working on a tiny computer.
I'm glad you're aware that FID has limits; for some reason that awareness is uncommon. But there are more limits than the backbone classifier. Yes, DINOv2 helps (so does CLIP-FID, and clean FID, which is backbone independent; I explain because we're having difficulties communicating, and going to their code and seeing they use Inception isn't going to mean anything to you, because that's not what that work is about). But fundamentally, the difference between two normalization layers of a classifier is not actually a measurement of image quality. It correlates, but these metrics are fundamentally about the distributional nature of the outputs. There actually is no method that does particularly well, but we're over here doing our best. You just have to be aware of the limits of your metrics because, as is the theme of our entire conversation, context matters. Here's a paper you may find interesting: https://arxiv.org/abs/2306.04675
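For anyone following along: FID itself is just the Fréchet distance between two Gaussians fitted to feature embeddings; the whole backbone debate is about which network produces those embeddings, not about the formula. A minimal sketch (numpy assumed; real implementations like clean-fid also handle image resizing and quantization details that this ignores):

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # FID between N(mu1, sigma1) and N(mu2, sigma2):
    #   ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})
    # Tr((S1 S2)^{1/2}) equals the sum of square roots of the eigenvalues
    # of S1 @ S2, which are real and non-negative when S1, S2 are PSD.
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_covmean = np.sqrt(np.clip(eigvals.real, 0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_covmean)
```

Swapping Inception v3 for DINOv2 or CLIP changes the mu/sigma fed into this, not the distance itself, which is exactly why the metric inherits whatever the backbone does or doesn't "see".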
Running SD v1.5 using a single step is faster than sampling with GigaGAN, and you can achieve better coherency in my opinion: https://arxiv.org/abs/2311.09257
Then it does make more sense to run this on a raspberry pi.
Yeah I'll agree to that. But neither UFOGen nor GigaGAN have released models (I've admitted to being wrong about GigaGAN's checkpoints. In fairness, who the fuck releases 6GBs of generated images? And removes the issues tab from GitHub? Can we agree that's sketchy as fuck?[0]).
LAFITE is the only one of the three with released checkpoints, but it's definitely not at the same quality (Nvidia backed away from placing it in products, imo a bad move). But it is 75M params compared to the 1B of UFOGen and GGAN. I was able to get somewhat better images with some prompt engineering, but yeah, classic paper painting a much better picture of their model than it actually is (pun intended). But then again, I got shitty images the first time I used SD, so YMMV. There are probably better works out there, but honestly my focus is elsewhere and no one's got time to keep on top of and try everything coming out.
I'm literally just saying that you can trade quality for inference. Do you disagree?
Because I'm not sure why you think I think GANs are better for quality, as I've said the opposite many times. Why are you hyper-focused on quality and convinced I've said GANs do better? This is why we're talking past one another: you're attributing assumptions to me that I ('m doing my fucking best to clarify that I) don't have. If you think I think GANs produce higher quality images, I assure you that this is from your imagination, as I've never stated such and it is not an opinion that I hold. Sorry, I said I wouldn't say this again, so last time for real.
[0] There's too much sketchy shit going on in ML research right now and honestly, that's why I've been more passionate about trying to get people to think harder and about context. Again, this is about context. (I also really hate this railroading as it stifles community innovation and sweeps important problems under the rug by saying to just rely on large companies for checkpoints. The community doing so much around Stable Diffusion is awesome, but I want to see that around lots of works because there's a lot that can be accelerated by even a hundredth of this community effort)
Nice to see people finding ways to get the square peg through the round hole.
Something I wondered when the Raspberry Pi 5 came out is what weirdness might be possible now that they have their own chip doing I/O cleverness.
On the Pi 5, the two MIPI interfaces have been enhanced to do either output or input. It made me wonder if the ports are now generalized enough that you could daisy-chain a string of Pi 5s, connecting MIPI to MIPI. Then you could run inference layers on individual Pis and pass the activations down the MIPI link. Ten 8 GB Pi 5s might not be the speediest way to get an 80 GB setup, but it would certainly be the cheapest (for now).
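Purely as a thought experiment, the topology described is classic pipeline parallelism: each board holds a contiguous slice of the model's layers and forwards its activations to the next. A toy sketch of the partitioning (everything here is illustrative; real MIPI-to-MIPI transport between Pi 5s would need driver work that doesn't exist):

```python
# Toy pipeline-parallel sketch: 10 stand-in "layers" split across 5 "devices",
# with activations handed from stage to stage. All names are hypothetical.
def make_stage(stage_layers):
    def stage(x):
        for layer in stage_layers:
            x = layer(x)
        return x
    return stage

# Stand-in layers: each just adds its index to the "activation".
layers = [lambda x, i=i: x + i for i in range(10)]

# Five toy boards, two layers each; a real 80 GB model would be sharded
# the same way across ten 8 GB Pi 5s.
devices = [make_stage(layers[i:i + 2]) for i in range(0, 10, 2)]

def pipeline_forward(x):
    for device in devices:
        x = device(x)  # this hop is where the MIPI transfer would sit
    return x
```

Worth noting: a single request still pays the full serial latency of the chain; throughput only improves if you stream multiple requests through so every board stays busy.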
Submitted title was "Stable Diffusion Turbo on a Raspberry Pi Zero 2 generates an image in 29 minutes", which is good to know in order to understand some of the comments posted before I changed the title.
Submitters: if you want to say what you think is important about an article, that's fine, but do it by adding a comment to the thread. Then your view will be on a level playing field with everyone else's: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so...
It looks like the reason it was submitted today is a newly added feature: "Added support for Stable Diffusion XL Turbo 1.0! (thanks to @AeroX2)", from the news section of the README.
Oh well, I had the 16 kB RAM expansion card. At least with only 64 by 48 pixels to fill, we were gaining back some of the time we were losing to the 1 MHz clock.
I would've loved it if this were more portable. It requires XNNPACK, which has no generic C implementation. I'd've loved to see Stable Diffusion running on an Alpha, SPARC, or m68k.