
>Ferret: A Multimodal Large Language Model

What I thought when reading the title: a new base model trained from the ground up on multimodal input, on hundreds to thousands of GPUs.

The reality: a finetune of Vicuna (itself a finetune of Llama 13B), trained on 8xA100. It also reuses parts of LLaVA, an existing multimodal project built on Vicuna. It's not really as exciting as one might think from the title, in my opinion.



This seems like a good but small research project by a research team at Apple, far from what product teams are working on for the next generation of Apple products.


The innovation is the modification of the neural-network architecture to incorporate a spatial-aware visual sampler; the data and the reused models are not the interesting part.
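For intuition, the idea behind a spatial-aware visual sampler can be sketched roughly as follows. This is a toy, stdlib-only illustration, not Ferret's actual implementation: the function name, the grid/mask representation, the random point sampling, and the average pooling are all my assumptions for the sake of a runnable example.

```python
import random

def sample_region_feature(feature_grid, region_mask, num_samples=32, seed=0):
    """Toy spatial-aware visual sampler (illustrative only): given a 2D
    grid of feature vectors and a binary mask marking a free-form region,
    randomly sample points inside the region and average-pool their
    features into a single region embedding."""
    rng = random.Random(seed)
    # Collect the grid coordinates covered by the region.
    coords = [(y, x)
              for y, row in enumerate(region_mask)
              for x, inside in enumerate(row) if inside]
    if not coords:
        raise ValueError("empty region")
    # Sample with replacement, so regions of any shape or size
    # always yield a fixed number of feature vectors.
    picks = [rng.choice(coords) for _ in range(num_samples)]
    dim = len(feature_grid[0][0])
    pooled = [0.0] * dim
    for y, x in picks:
        for d in range(dim):
            pooled[d] += feature_grid[y][x][d]
    return [v / num_samples for v in pooled]

# Usage: a 4x4 grid of 2-d features, region = top-left 2x2 block.
grid = [[[float(y), float(x)] for x in range(4)] for y in range(4)]
mask = [[y < 2 and x < 2 for x in range(4)] for y in range(4)]
emb = sample_region_feature(grid, mask)
```

The point of the sketch is that the sampler turns an arbitrary region (point, box, or free-form scribble) into a fixed-size embedding that the language model can consume alongside text tokens.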


Thanks for the summary.



