Saying we should tokenize different modalities the same way would be analogous to saying that in order to be really smart, a human has to listen with their eyes. At some point there has to be SOME modality-specific preprocessing. The thing is, in all current SOTA architectures this modality-specific preprocessing is very, very shallow, almost trivially so. I feel this is the piece of information that may be missing for people who hold this view. In multimodal models, everything moves to a shared representation very rapidly; that's clearly already happening.
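To make the "trivially shallow" point concrete, here's a rough sketch (PyTorch; the widths, vocab size, patch size, and layer count are made up for illustration) of what the modality-specific part of a typical multimodal stack amounts to: a token embedding table for text and a single linear projection of flattened image patches, with everything after that shared. Real systems vary (conv stems, pretrained vision encoders with adapters, etc.), but the modality-specific part stays thin.

```python
# Minimal sketch of "shallow modality-specific preprocessing, deep shared trunk".
# All sizes are illustrative, not from any particular model.
import torch
import torch.nn as nn

D_MODEL = 768             # shared representation width (assumed)
VOCAB_SIZE = 50_000       # assumed tokenizer vocabulary
PATCH_DIM = 16 * 16 * 3   # 16x16 RGB patches, flattened

# Modality-specific "preprocessing": one lookup table and one linear layer.
text_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
patch_embed = nn.Linear(PATCH_DIM, D_MODEL)

# Everything after this point is shared across modalities (kept tiny here).
shared_trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=12, batch_first=True),
    num_layers=2,
)

tokens = torch.randint(0, VOCAB_SIZE, (1, 32))   # a text sequence
patches = torch.randn(1, 64, PATCH_DIM)          # an image as 64 flattened patches

# Both modalities land in the same D_MODEL space before any real computation happens.
fused = torch.cat([text_embed(tokens), patch_embed(patches)], dim=1)
out = shared_trunk(fused)
print(out.shape)  # torch.Size([1, 96, 768])
```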
On the ‘we need an RL loop rather than a generative model’ point - I’d say this is the consensus position today!
For sure, we can't process images the same way we process sound, but the author argues for processing images and text the same way, and text is fundamentally a visual medium of communication. The author makes a good point about how VLMs can still struggle to determine the length of a word, or to generate words that start and end with specific letters, which is an indicator that an essential aspect of the modality (its visual aspect) is missing from how it is processed. Surely a unified visual pipeline for text and images would not have such failure points.
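To illustrate that failure mode on the text side, here's a small sketch using tiktoken (an open-source BPE tokenizer; the exact splits depend on the cl100k_base vocabulary and aren't the point). A token-based model receives opaque subword ids rather than letter sequences, so questions like "how many letters?" or "does it end in y?" have to be answered indirectly.

```python
# Sketch: subword tokenization hides character-level structure from the model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["strawberry", "disestablishment"]:
    ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(i).decode("utf-8", "replace") for i in ids]
    # The model sees the integer ids, not the individual letters.
    print(word, "->", ids, "->", pieces)
```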
I agree that modality-specific processing is very shallow at this point, but it still seems not to respect the physicality of the data. Today's model modalities are not really analogous to human senses: they would need to be handled by a different assortment of "sense" organs, e.g. one for things visual, one for things audible, and so on.
I don't think you can classify reading as a purely visual modality, despite text being a visual medium. People with dyslexia may see perfectly fine; it's only the translation layer processing the text that gets jumbled. Granted, we are not born with the ability to read, so that translation layer is learned. On the other hand, we don't perceive everything in our visual field either; magicians and YouTube videos exploit this limitation to trick and entertain us, and those limits are presumably innate, given that they're a shared human trait. Evidently, some of the translation layers involved in processing our vision evolved naturally and are part of our brains, so why would we not allow artificial intelligence similarly advanced starting points for processing data?