
I don't think you're quite right. The author is arguing that images and text should not be processed differently at any point. Current early-fusion approaches are close, but they still treat modalities differently at the level of tokenization.

If I understand correctly, he would advocate for something like rendering text and processing it as if it were an image, alongside other natural images.
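Concretely, a minimal sketch of that idea (my own illustration, not code from the post; the helper names are made up) would rasterize the string and push it through the exact same patch pipeline a ViT-style front end uses for photos:

    # Render text into pixels, then treat it exactly like any other image.
    from PIL import Image, ImageDraw
    import numpy as np

    def render_text(text, size=(224, 224)):
        # Rasterize a string onto a white canvas so it becomes an ordinary image.
        img = Image.new("RGB", size, "white")
        ImageDraw.Draw(img).text((8, 8), text, fill="black")
        return np.asarray(img, dtype=np.float32) / 255.0

    def patchify(img, patch=16):
        # Split an HxWxC array into flat patch vectors, as a ViT front end would.
        h, w, c = img.shape
        img = img[: h - h % patch, : w - w % patch]
        p = img.reshape(h // patch, patch, w // patch, patch, c)
        return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

    # Rendered text and a natural image now enter the model through one route.
    text_patches = patchify(render_text("the quick brown fox"))
    photo_patches = patchify(np.random.rand(224, 224, 3).astype(np.float32))
    assert text_patches.shape[1] == photo_patches.shape[1]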

Also, I would counter and say that there is some actionable information, but it's pretty abstract. In terms of uniting modalities, he is bullish on tapping human intuition and structuralism, which should give people pointers to actual books for inspiration. In terms of modifying the learning regime, he's suggesting something like an agent-environment RL loop, not a generative model, as a blueprint.

There's definitely stuff to work with here. It's not totally mature, but not at all directionless.


Saying we should tokenize different modalities the same way would be analogous to saying that in order to be really smart, a human has to listen with their eyes. At some point there has to be SOME modality-specific preprocessing. The thing is, in all current SOTA architectures this modality-specific preprocessing is very, very shallow, almost trivially shallow (see the sketch below). I feel this is the piece of information that may be missing for people who hold this view. In multimodal models, everything is moving to a shared representation very rapidly - that's clearly already happening.
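To make "trivially shallow" concrete, here's a rough PyTorch sketch (the sizes and layer choices are mine, not any specific SOTA model): one projection per modality, then a fully shared trunk.

    import torch
    import torch.nn as nn

    d_model = 512

    # Modality-specific "sense organs": each is a single, shallow projection.
    image_stem = nn.Linear(16 * 16 * 3, d_model)   # flattened 16x16 RGB patches
    text_stem = nn.Embedding(32000, d_model)       # token IDs from a tokenizer

    # Everything after this point is shared, modality-agnostic computation.
    shared_trunk = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
        num_layers=12,
    )

    img_tokens = image_stem(torch.randn(1, 196, 16 * 16 * 3))  # (1, 196, 512)
    txt_tokens = text_stem(torch.randint(0, 32000, (1, 64)))   # (1, 64, 512)
    fused = shared_trunk(torch.cat([img_tokens, txt_tokens], dim=1))

The only place the modalities are treated differently is those two one-line stems; all twelve shared layers see a single mixed token stream.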

On the ‘we need an RL loop rather than a generative model’ point - I’d say this is the consensus position today!
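For anyone unfamiliar with what that loop looks like in practice, here's the bare skeleton (using Gymnasium's standard API; the random policy is just a placeholder, not a claim about any particular system):

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)

    for _ in range(1000):
        action = env.action_space.sample()   # stand-in for a learned policy
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            obs, info = env.reset()          # the learning signal is the
    env.close()                              # environment's feedback, not
                                             # next-token prediction alone

The point of the blueprint is that the model learns by acting and getting pushed back by an environment, rather than by fitting a static data distribution.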


For sure, we can't process images the same way we process sound, but the author argues for processing images and text the same, and text is fundamentally a visual medium of communication. The author makes a good point about how VLMs can still struggle to determine the length of a word, or to generate words that start and end with specific letters, etc., which indicates that an essential aspect of the modality (its visual aspect) is missing from how it is processed. Surely a unified visual process for text and images would not have such failure points.

I agree that modality-specific processing is very shallow at this point, but it still seems not to respect the physicality of the data. Today's modalities are not actually akin to human senses; they would be better served by a different assortment of "sense" organs, e.g. one for everything visual, one for everything audible, etc.


I don't think you can classify reading as a purely visual modality, despite its being a visual medium. People with dyslexia may see perfectly fine; it's the translation layer processing the text that gets jumbled. Granted, we are not born with the ability to read, so that translation layer is learned. On the other hand, we don't perceive everything in our visual field either; magicians and YouTube videos use this limitation to trick and entertain us, and it's one we are presumably born with, given that it's a shared human trait. Evidently, some of the translation layers involved in processing our vision evolved naturally and are part of our brains, so why would we not allow artificial intelligence similarly advanced starting points for processing data?


Concepts within modalities are potentially consistent, but the point the author is making is that the same "concept" vector may lead to inconsistent percepts across modalities (e.g. a conflicting image and caption).


This is a great analogy, I totally agree!

