Yes, unfortunately these models take a lot of VRAM. An 8GB version may be possible, but it would have to compromise on the quality of both the voice recognition and the language model, so it might not be a good experience.
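For what it's worth, one common way people trade quality for VRAM is 4-bit quantization. Here's a minimal sketch of what that could look like with Hugging Face `transformers` + `bitsandbytes` (the model ID is just an illustrative placeholder, not something this project ships):

```python
# Hypothetical sketch: fitting a small LLM into ~8 GB of VRAM via 4-bit
# quantization. Quality drops versus fp16, which is the compromise mentioned above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative; a 7B model is ~3.5 GB at 4-bit

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # run the math in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # requires the accelerate package; places layers on the GPU
)
```

Rough arithmetic: 7B parameters at 0.5 bytes each is ~3.5 GB of weights, leaving headroom for the KV cache and a speech model on an 8 GB card.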
Yes, it absolutely could. You're right that this configuration is rare, although people have been putting together machines with multiple 24GB cards in order to split and run larger models like llama2-70B.
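As a sketch of how that splitting is usually done (again assuming Hugging Face `transformers` + `accelerate`, with illustrative model ID and memory limits): `device_map="auto"` shards the layers across all visible GPUs, so no single card has to hold the whole model. Note that a 70B model still needs quantization to fit two 24GB cards, since fp16 weights alone are ~140 GB while 4-bit weights are ~35 GB.

```python
# Hypothetical sketch: sharding a quantized 70B model across two 24 GB cards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-70b-chat-hf"  # illustrative

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",                     # spread layers across all visible GPUs
    max_memory={0: "22GiB", 1: "22GiB"},   # leave per-card headroom for activations
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

With `device_map="auto"`, tokens flow through the layers GPU by GPU (pipeline-style placement), so it's slower than a single large card but lets the model fit at all.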