Search by image is conceptually easier because you don't have to map between text and images, but it's a very different product. It is something we've considered.

Encoding words and images into the same space and doing ANN is kind of what the current system is, if you look at it right. The ANN is framed in terms of similarity rather than distance -- and is approximate because of the sparseness approximation. But the big difference from the papers you linked is what we use as the encodings: not the traditional penultimate layer of a network, but classifier scores for images and projected word vectors for text. This gives us a space with semantically meaningful dimensions, which lets us build the system without a large multimodal training set; our text and image models are independently trained on different datasets.
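To make that concrete, here is a rough sketch of the shared "concept score" space I'm describing. The label set, the stub classifier, and the random word vectors are all placeholders standing in for the real, independently trained models:

    import numpy as np

    rng = np.random.default_rng(0)

    # Shared label set: each encoding dimension means "how much does this concept apply?"
    LABELS = ["dog", "beach", "sunset", "car"]

    # Stand-ins for the two independently trained models: a per-label image
    # classifier and a word-vector lookup. Real models would go here.
    def classifier_scores(image_path):
        return rng.random(len(LABELS))

    WORD_VECS = {w: rng.standard_normal(300) for w in LABELS + ["puppy"]}
    LABEL_MATRIX = np.stack([WORD_VECS[w] for w in LABELS])

    def text_encoding(query_word):
        # Project the query's word vector onto the label vectors, so text lands
        # in the same per-concept score space as the images.
        q = WORD_VECS[query_word]
        sims = LABEL_MATRIX @ q
        norms = np.linalg.norm(LABEL_MATRIX, axis=1) * np.linalg.norm(q)
        return sims / (norms + 1e-8)

    def rank_images(query_word, image_paths):
        # Similarity-based (not distance-based) ranking; in production this is
        # an approximate/sparse search rather than a full scan.
        t = text_encoding(query_word)
        return sorted(image_paths, key=lambda p: -float(t @ classifier_scores(p)))

    print(rank_images("puppy", ["a.jpg", "b.jpg"]))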



Interesting. Thanks for responding.

1. Did you look at CLIP? It provides a common (to images & text) embedding; see the rough sketch after these questions.

2. Do your models need specialized training (vs. open models)?
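For context, this is roughly what I mean by a common embedding, using the Hugging Face transformers CLIP API (the model name and "photo.jpg" are just placeholders):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("photo.jpg")  # placeholder local file
    texts = ["a dog on a beach", "a car at night"]

    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)

    # Both modalities are projected into the same space, so plain cosine
    # similarity ranks the captions against the image.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    print((txt @ img.T).squeeze())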



