Generative models yes, since there are terabytes of audio available. High qualit...

Generative models yes, since there are terabytes of audio available. High quality contextual info is much harder to obtain. It’s like saying that we could easily build a model for X if we had training data available.

With LLMs we can leverage human insight to e.g. caption or describe images (which was what made CLIP and successors possible). With animals we often have no idea beyond a location. There is work to include kinematic data with audio to try and associate movement with vocalisation but it’s early days.

https://cloud.google.com/blog/transform/can-generative-ai-he...