Maybe the abstract of the paper is a better introduction to what this is:
> We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination.
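The "hybrid region representation" bit is the interesting part to me. As far as I can tell, it means each referred region enters the LLM as both discrete coordinate tokens and one continuous feature vector. Here's a rough Python sketch of that idea; the bin count, the simple average pooling (the paper's spatial-aware visual sampler is fancier), and all the names are my own assumptions, not their code:

```python
import numpy as np

NUM_BINS = 1000  # assumption: coordinates quantized into a fixed vocab of bins

def quantize_coords(box, img_w, img_h, num_bins=NUM_BINS):
    """Map a pixel-space box (x1, y1, x2, y2) to discrete coordinate tokens."""
    x1, y1, x2, y2 = box
    return [
        int(x1 / img_w * (num_bins - 1)),
        int(y1 / img_h * (num_bins - 1)),
        int(x2 / img_w * (num_bins - 1)),
        int(y2 / img_h * (num_bins - 1)),
    ]

def region_feature(feat_map, box, img_w, img_h):
    """Average-pool the image encoder's feature map inside the region
    to get one continuous vector (a stand-in for the visual sampler)."""
    H, W, C = feat_map.shape
    x1, y1, x2, y2 = box
    c1, r1 = int(x1 / img_w * W), int(y1 / img_h * H)
    c2, r2 = max(c1 + 1, int(x2 / img_w * W)), max(r1 + 1, int(y2 / img_h * H))
    return feat_map[r1:r2, c1:c2].mean(axis=(0, 1))

# A referred region becomes discrete tokens plus one continuous embedding,
# both spliced into the LLM's input sequence at the mention site.
feat_map = np.random.rand(24, 24, 768)   # fake ViT-style feature map
box = (120, 80, 360, 300)                # region in a 640x480 image
tokens = quantize_coords(box, 640, 480)
embedding = region_feature(feat_map, box, 640, 480)
print(tokens, embedding.shape)           # four coordinate tokens + a (768,) vector
```

Presumably the coordinate tokens share the text vocabulary, which would also be why grounding (boxes in the model's output) comes along for free.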
This is going to be great for accessibility! Imagine being blind, loading up a video game, and using this to figure out what's around, with everything described locally. I mean, um, well, that's what I'd use it for anyway. But knowing Apple, we won't be able to prompt the LLM directly, so that probably won't happen until 5 years from now.
Yes, I wondered whether "referring" had some special meaning, since the way they use it suggests the word "reference" would normally be more appropriate there (unless it's a special meaning that warrants the different word).
I'm just inferring myself, but I believe it refers to discussing things in the foreground/background or in a specific location in the provided image (such as "top right" or "behind the tree") in user queries.
It sounds like the "region inputs" are raster or vector inputs. So I'm imagining highlighting a region of the photo with my finger and having it tell me "that's the Duomo in Florence."
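Right, the abstract says it handles points, boxes, and free-form shapes, so a finger-drawn highlight would presumably arrive as a binary mask, and the "spatial-aware visual sampler" would pull features from points inside it. A minimal sketch of that input path (the mask handling and random sampling are my assumptions; the actual sampler is smarter about varying point density):

```python
import numpy as np

def sample_region_points(mask, feat_map, num_points=512, rng=None):
    """Sample feature vectors at random points inside a free-form region.

    mask:     (H, W) boolean array, True where the user highlighted.
    feat_map: (H, W, C) per-pixel (or upsampled patch) features.
    Returns   (num_points, C) features for the sampler to aggregate.
    """
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)
    idx = rng.choice(len(ys), size=min(num_points, len(ys)), replace=False)
    return feat_map[ys[idx], xs[idx]]

H, W, C = 480, 640, 768
mask = np.zeros((H, W), dtype=bool)
mask[100:300, 150:400] = True            # pretend finger scribble
feat_map = np.random.rand(H, W, C)
pts = sample_region_points(mask, feat_map)
print(pts.shape)                         # (512, 768)
```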
This could make drone-based AI image understanding extremely powerful, especially if the MLLM's spatial handling becomes precise enough for situational awareness during autonomous movement, and ultimately for decision making WRT interacting with humans (both positive and negative interactions).
Is it just me, or doesn't this MLLM seem particularly useful for flying objects with vision?
https://arxiv.org/abs/2310.07704