A modified BLIP2 that works with diffusion models rather than vision transformers would be pretty cool. Using Vicuna-13B or another large language model as the conditioning text encoder for Stable Diffusion, in place of CLIP's text encoder, would be a game changer: it would shift prompting Stable Diffusion from something like writing a tag list to actually being able to follow instructions in plain English.
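Just to sketch what the wiring might look like: the appeal of the BLIP2 approach is that both big models stay frozen and you only train a small Q-Former-style adapter between them. Everything below is an assumption, not a real implementation: `LLMToUNetAdapter` is a made-up, untrained stand-in for a proper Q-Former, the 5120 dim comes from LLaMA/Vicuna-13B's hidden size, and the 77-token / 768-dim output matches what SD 1.x's UNet cross-attention expects from CLIP.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM
from diffusers import StableDiffusionPipeline

# Hypothetical Q-Former-flavored adapter (NOT BLIP2's actual module):
# learned queries cross-attend over the LLM's hidden states, then get
# projected down to the 77-token, 768-dim sequence SD 1.x expects.
class LLMToUNetAdapter(nn.Module):
    def __init__(self, llm_dim=5120, unet_dim=768, num_queries=77):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(llm_dim, unet_dim)

    def forward(self, llm_hidden):                 # (batch, seq_len, llm_dim)
        q = self.queries.unsqueeze(0).expand(llm_hidden.size(0), -1, -1)
        out, _ = self.attn(q, llm_hidden, llm_hidden)
        return self.proj(out)                      # (batch, 77, 768)

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-13b-v1.5")
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5")
adapter = LLMToUNetAdapter()                       # would need to be trained

prompt = "Draw a cat wearing a top hat, but make the hat slightly too small."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    hidden = llm(**inputs, output_hidden_states=True).hidden_states[-1]
embeds = adapter(hidden.float())

# diffusers already lets you bypass the CLIP text encoder via prompt_embeds,
# so the swap itself is mechanically easy -- the hard part is training the
# adapter so the UNet actually understands the new conditioning.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe(prompt_embeds=embeds,
             negative_prompt_embeds=torch.zeros_like(embeds)).images[0]
```

Untrained, this would just produce noise-conditioned garbage, but it shows the interface is already there: you'd train the adapter (and probably lightly fine-tune the UNet) on captioned images with the LLM and VAE frozen, which is basically the BLIP2 recipe pointed at a diffusion model instead of a ViT.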