Saying we should tokenize different modalities the same way would be analogous to saying that in order to be really smart, a human has to listen with their eyes. At some point there has to be SOME modality-specific preprocessing. The thing is, in all current SOTA architectures this modality-specific preprocessing is very, very shallow, almost trivially so. I feel this is the piece of information that may be missing for people who hold this view. In multimodal models, everything moves to a shared representation very rapidly; that's clearly already happening.
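To make the "trivially shallow" point concrete, here's a rough sketch (PyTorch; the widths, vocab size, patch size, and layer count are made up for illustration) of what the modality-specific part of a typical multimodal stack amounts to: a token embedding table for text and a single linear projection of flattened image patches, with everything after that shared. Real systems vary (conv stems, pretrained vision encoders with adapters, etc.), but the modality-specific part stays thin.

```python
# Minimal sketch of "shallow modality-specific preprocessing, deep shared trunk".
# All sizes are illustrative, not from any particular model.
import torch
import torch.nn as nn

D_MODEL = 768             # shared representation width (assumed)
VOCAB_SIZE = 50_000       # assumed tokenizer vocabulary
PATCH_DIM = 16 * 16 * 3   # 16x16 RGB patches, flattened

# Modality-specific "preprocessing": one lookup table and one linear layer.
text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
patch_embed = nn.Linear(PATCH_DIM, D_MODEL)

# Everything after this point is shared across modalities (kept tiny here).
shared_trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=12, batch_first=True),
    num_layers=2,
)

tokens = torch.randint(0, VOCAB_SIZE, (1, 32))   # a text sequence
patches = torch.randn(1, 64, PATCH_DIM)          # an image as 64 flattened patches

# Both modalities land in the same D_MODEL space before any real computation happens.
fused = torch.cat([text_embed(tokens), patch_embed(patches)], dim=1)
out = shared_trunk(fused)
print(out.shape)  # torch.Size([1, 96, 768])
```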
On the ‘we need an RL loop rather than a generative model’ point - I’d say this is the consensus position today!
For sure, we can't process images the same way we process sound, but the author argues for processing images and text the same way, and text is fundamentally a visual medium of communication. The author makes a good point about how VLMs can still struggle to determine the length of a word, or to generate words that start and end with specific letters, which is an indicator that an essential aspect of the modality (its visual aspect) is missing from how it is processed. Surely a unified visual pipeline for text and images would not have such failure points.
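To illustrate that failure mode on the text side, here's a small sketch using tiktoken (an open-source BPE tokenizer; the exact splits depend on the cl100k_base vocabulary and aren't the point). A token-based model receives opaque subword ids rather than letter sequences, so questions like "how many letters?" or "does it end in y?" have to be answered indirectly.

```python
# Sketch: subword tokenization hides character-level structure from the model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["strawberry", "disestablishment"]:
    ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(i).decode("utf-8", "replace") for i in ids]
    # The model sees the integer ids, not the individual letters.
    print(word, "->", ids, "->", pieces)
```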
I agree that modality-specific processing is very shallow at this point, but it still seems not to respect the physicality of the data. Today's model modalities are not really analogous to human senses: they would need to be handled by a different assortment of "sense" organs, e.g. one for things visual, one for things audible, and so on.
I don't think you can classify reading as a purely visual modality, despite text being a visual medium. People with dyslexia may see perfectly fine; it's only the translation layer processing the text that gets jumbled. Granted, we are not born with the ability to read, so that translation layer is learned. On the other hand, we don't perceive everything in our visual field either; magicians and YouTube videos exploit this limitation to trick and entertain us, and those limits are presumably innate, given that they're a shared human trait. Evidently, some of the translation layers involved in processing our vision evolved naturally and are part of our brains, so why would we not allow artificial intelligence similarly advanced starting points for processing data?