I’ve also used a similar approach to build a Q&A system for PDF files (my main use case: board game manuals). OpenAI’s embeddings are nice to play with. There is also an easy technique for getting better results with dense search - Hypothetical Document Embeddings (https://arxiv.org/abs/2212.10496).
So instead of finding docs that are semantically similar to the question, you find docs that are semantically similar to a fake answer to the question.
Is the intuition that an answer (even if wrong) will be closer to the target documents than the question itself?
Exactly. The fake (hypothetical) answer is usually longer than the question and often contains words that match the real answer, even when the question and the hypothetical answer come from different domains - e.g. “how to get out of jail?” asked in the context of Monopoly, while the hypothetical answer is about a real jail. It sounds stupid, but it works and is super easy to implement.
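The whole trick can be sketched in a few lines. This is a toy illustration, not the paper’s implementation: a hashed bag-of-words function stands in for a real embedding model (in practice you’d call an embedding API such as OpenAI’s), and the hypothetical answer is hard-coded where an LLM would normally generate it:

```python
import hashlib
import math

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model: hash each word into
    # one of 256 buckets, then L2-normalize the count vector.
    vec = [0.0] * 256
    for word in text.lower().split():
        word = word.strip(".,?!:$")
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % 256
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def search(query_text: str, docs: list[str]) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query_text)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)

docs = [
    "Jail: pay a $50 fine or roll doubles to get out of jail",
    "Go: collect a $200 salary as you pass",
]

question = "how to get out of jail?"
# HyDE: instead of embedding the question directly, embed a
# hypothetical answer. Normally an LLM would generate this from the
# question; it is hard-coded here to keep the sketch self-contained.
hypothetical_answer = (
    "To get out of jail you can pay a $50 fine or roll doubles."
)
results = search(hypothetical_answer, docs)
print(results[0])
```

The hypothetical answer shares far more vocabulary with the jail rule than the bare question does, which is exactly why it retrieves the right passage.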
That paper is interesting, but it doesn’t necessarily work better. I have OpenAI vectors for about 250k podcast episode descriptions, and just searching “pyramids” works about the same as asking GPT to write 500 words about pyramids and then doing a vector search against that essay. So it’s worth testing out, but not guaranteed to be better.
Oh, yes - in my tests it worked better most of the time, but there were some cases where the results were worse. Regarding “pyramids”: I think it might work better with actual questions than with single-keyword queries.