
I think it’s just that affine transforms in high dimensions are surprisingly expressive. Since the functions are sparsely defined they’re much less constrained compared to the low dimensional affine transformations we usually think of.
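A rough numpy sketch of the "sparsely defined" point (the dimensions are made up to match the thread, not any actual model): with fewer paired points than input dimensions, an affine map can hit essentially arbitrary targets exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d_in, d_out = 500, 768, 5120        # fewer points than input dimensions

    X = rng.normal(size=(n, d_in))         # stand-in source embeddings
    Y = rng.normal(size=(n, d_out))        # completely arbitrary targets

    # Affine map: append a constant column so the bias is fit too.
    Xb = np.hstack([X, np.ones((n, 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)

    print(np.linalg.norm(Xb @ W - Y) / np.linalg.norm(Y))   # ~0 (machine precision): an exact fit

With n <= d_in there are more free parameters per target coordinate than constraints, so the system is underdetermined and the fit is exact. Real paired embeddings have far more structure than random noise, which presumably is what lets the map generalize beyond the training pairs.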


Good point, I hadn't thought of that. It's a plausible explanation here because the dimensionalities of the spaces are so different: 5120 vs. 768. Unsurprisingly, the trained weight matrix has rank 768: it's using every feature of the lower-dimensional space.

Still, it's kind of shocking that it works so well!

I'd be curious to see if the learned weight matrix ends up being full-rank (or close to full-rank) if both spaces have the same dimensionality.
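Something like this is what I'd check, as a minimal sketch (the .npy files are hypothetical placeholders for paired embeddings of the same items from the two models):

    import numpy as np

    emb_a = np.load("embeddings_a.npy")   # (N, d_a), e.g. d_a = 768
    emb_b = np.load("embeddings_b.npy")   # (N, d_b), e.g. d_b = 5120

    # Fit an affine map emb_a -> emb_b by least squares.
    Xb = np.hstack([emb_a, np.ones((len(emb_a), 1))])
    W, *_ = np.linalg.lstsq(Xb, emb_b, rcond=None)

    linear = W[:-1]                        # drop the bias row
    print(np.linalg.matrix_rank(linear))   # reportedly 768 in the 5120 vs. 768 case above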


It would be full rank, because the entire embedding space is used: there are no large unused pockets.


The weight matrix's rank would decrease for each feature in the target space that cannot be expressed as a linear combination of features in the input space (plus a bias). For example, if the target space has a feature representing a non-visual quality like "smelliness," it would not be expressible as a linear combination of features representing visual attributes like "redness," "blueness," and "greenness" in the input space.

If both spaces have the same dimensionality, the learned weight matrix would be full-rank only if every feature in the target space is expressible as a linear combination of features in the input space (plus a bias). Which brings me back to my original question: WHY would that be the case when the two models are trained independently on data that is so different?
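To make the "smelliness" example above concrete, here's a toy sketch (tiny made-up feature spaces, nothing to do with the real models): if one target coordinate is unrelated to the inputs, its column of the ideal linear map is zero, which drops the rank by one.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    X = rng.normal(size=(n, 3))              # "redness", "blueness", "greenness"

    Y = np.empty((n, 3))
    Y[:, 0] = 2 * X[:, 0] - X[:, 1]          # a linear combination of the inputs
    Y[:, 1] = X[:, 1] + 0.5 * X[:, 2]        # a linear combination of the inputs
    Y[:, 2] = rng.normal(size=n)             # "smelliness": unrelated to the inputs

    Xb = np.hstack([X, np.ones((n, 1))])
    W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
    linear = W[:-1]                          # the 3x3 linear part, bias dropped

    print(np.linalg.svd(linear, compute_uv=False))
    # roughly [2.3, 1.0, 0.005]: the "smelliness" column is close to zero, so the
    # ideal map has rank 2, though finite data never gives an exact zero.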


A random n×n matrix is full rank with probability 1... So it's kinda the default: any amount of noise in the embeddings is going to result in a full-rank transformation.
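Quick sanity check, as a numpy sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 768

    print(np.linalg.matrix_rank(rng.normal(size=(n, n))))   # 768: random is full rank

    # Even an exactly low-rank matrix becomes numerically full rank
    # once a small amount of noise is added.
    low_rank = rng.normal(size=(n, 10)) @ rng.normal(size=(10, n))
    print(np.linalg.matrix_rank(low_rank))                                    # 10
    print(np.linalg.matrix_rank(low_rank + 1e-3 * rng.normal(size=(n, n))))   # 768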

So it's really a less-than-full-rank matrix that would require an explanation, i.e., why would this image representation project into a perfectly isolated subspace of the language representation (or vice versa)?

If that happened I would start looking for things like a vocabulary of smell that is completely distinct from, and non-overlapping with, any visual context. But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose') so you wouldn't expect any clean separations for different modalities... Maybe there's some branch of analytic philosophy which has managed to completely divorce itself from the physical world...


> But we use cross-modal analogies in language /constantly/ (many smells are associated with things we can see - 'smells like a rose') so you wouldn't expect any clean separations for different modalities...

That's a really good point. Thank you!



