Hacker News

It mostly depends on "how" the models work. Unified multimodal text/image sequence-to-sequence models can do this pretty well; diffusion models can't.




Multimodal certainly helps, but "pretty well" is a stretch. I'd be curious to know which multimodal model in particular you've tried that could consistently handle generative prompts of the above nature (without human-in-the-loop corrections).

For example, to my knowledge ChatGPT is unified and I can guarantee it can't handle something like a 7-legged spider.



