Query2vec: Search query expansion with query embeddings (grubhub.com)
81 points by eggie5 on Nov 5, 2019 | 23 comments


As a frequent user in Albuquerque, I've found that GrubHub search returns very strange results and that categorizations are unreliable. For example, a restaurant whose name contains my query (for a query like "Chinese" or "Thai") may not even appear in the results for that query, and relevant results get buried beneath completely irrelevant ones (a taco place and an ice cream joint might appear above a Thai restaurant in that example).

Further, restaurants will be miscategorized, e.g. have the word "sushi" in the name and yet not appear in the sushi category (not sure if there's actually a sushi category in the app, just using that as an example).


Author here: We hope to fix this w/ the techniques described in the post! The search engine will first collect a high-recall set of candidates, which is then passed to a high-precision ranker. This should help w/ the cuisine and dish query misclassifications you describe. It will also introduce semantic understanding into the query engine, meaning that if you type "French" it will give you French cuisine rather than "French fries".
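In rough form, the retrieve-then-rank flow looks something like this sketch (my illustration with placeholder names, not the production pipeline):

    import numpy as np

    def two_stage_search(query_vec, restaurant_vecs, rank_fn, k_candidates=200, k_final=20):
        # Stage 1: high-recall candidate generation via similarity in the embedding space.
        sims = restaurant_vecs @ query_vec / (
            np.linalg.norm(restaurant_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
        candidates = np.argsort(-sims)[:k_candidates]
        # Stage 2: a high-precision ranker (placeholder here) re-scores the small candidate set.
        scores = np.asarray(rank_fn(candidates))
        return candidates[np.argsort(-scores)][:k_final]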


This seems like a pretty good write-up, but I do wonder if it's necessary. I use GrubHub and Uber Eats fairly often. If I want to search for a restaurant, I would hope it's just doing a match on exactly what I search for, with minor spell checking.

If I'm looking for a dish or cuisine, why not display it in a visual fashion like Uber Eats? There usually aren't that many options to actually sift through.

For example, the first graphic of the embedding space shows a very clear mixture of restaurant names and food items. It feels a bit messy.


"The difference between this and word2vec is that the primitive atomic unit we are embedding isn’t a word, but a search query. Therefore this isn’t a language model, but a query model."

If the primitive embedding unit is a search query instead of a word, then I assume the query vector model is trained on a limited dictionary of queries. Does that imply the trained vector model can only encode search queries that are already present in its dictionary? If not, I think it would be interesting to know more about how the closed-dictionary problem was solved.


Correct, the model has a fixed vocabulary set at train-time. However, we can benefit from the fact that the distribution of queries in most search systems follows the Zipfian distribution. That means we can capture the vast majority of queries that are ever issued in a fixed vocabulary. Daily retraining helps us pick up new queries outside of the vocabulary.
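To put a number on that, here's a quick way to sanity-check vocabulary coverage against a query log (illustrative sketch, hypothetical `query_log`):

    from collections import Counter

    def vocab_coverage(query_log, vocab_size):
        # Fraction of total query volume covered by the top-`vocab_size` unique queries.
        counts = Counter(query_log)
        covered = sum(c for _, c in counts.most_common(vocab_size))
        return covered / sum(counts.values())

    # For Zipf-like logs, a modest fixed vocabulary typically covers the vast majority
    # of issued query volume; only rare tail queries fall outside it.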


How is your model fundamentally different from doc2vec though? The idea of swapping a vocab of words for a vocab of tokens, sentences, paragraphs, queries, articles, pages, etc., has been around and used in research and production virtually since the same time as word2vec itself.

One of the most common “off the shelf” search solutions is to train an embedding for X (where X is whatever you want to search) using some implicit similarity or triplet loss for examples that have natural positive labels (like due to proximity in the word2vec case) plus random negative sampling, and use the embedding in an exact or approximate nearest neighbor index.
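In sketch form, the positive/negative construction looks something like this (illustrative names; in this domain the natural positive signal could be two queries converting to the same restaurant):

    import random

    def build_triplets(query_to_restaurants, seed=0):
        # query_to_restaurants: hypothetical dict of query -> set of converted restaurant ids.
        rng = random.Random(seed)
        queries = list(query_to_restaurants)
        triplets = []
        for a in queries:
            for b in queries:
                if a != b and query_to_restaurants[a] & query_to_restaurants[b]:
                    neg = rng.choice(queries)  # random negative sampling
                    if not (query_to_restaurants[a] & query_to_restaurants[neg]):
                        triplets.append((a, b, neg))  # (anchor, positive, negative)
        return triplets

The triplets feed whatever embedding model you like, and the resulting vectors go into an exact or approximate nearest neighbor index.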

In fact, it’s even very “off the shelf” to use Siamese networks for multi-modal embeddings, for example simultaneously learning embeddings that put semantically similar queries and food images into the same ANN vector space.

I think the blog post is very cool, don't get me wrong, but no part of this is novel (if that's what you were going for; I can't tell). This exact end-to-end pipeline has been used for image search, collaborative filtering (customer & product embeddings), and various recommender systems where the unit of embedding is something like “pages” or “products” or “blog posts” or other product primitives for whatever type of business, in production across several different companies I’ve worked for.


I agree:

In the IR community you will see image2image search implemented by passing an image through a headless CNN and then doing ANN search in the embedding space. That can lead to discussion of cross-modal hashing, which you suggested. I think you also touched on LTR techniques w/ siamese networks; for example, we can use a siamese network to learn the difference between two embeddings.

In the RecSys community, for collaborative filtering you can take interaction data between users and items and decompose that matrix with SVD into two factors: the user and item embeddings. Also in the RecSys community you see people doing item2vec (what I did) for the item embeddings and then aggregating those for the user representation (embeddings).
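For the SVD flavor, a toy sketch (random interaction matrix, not real data):

    import numpy as np

    interactions = np.random.default_rng(1).integers(0, 2, size=(100, 50)).astype(float)  # users x items

    k = 8  # embedding dimension
    U, S, Vt = np.linalg.svd(interactions, full_matrices=False)
    user_embeddings = U[:, :k] * np.sqrt(S[:k])      # (100, k)
    item_embeddings = Vt[:k, :].T * np.sqrt(S[:k])   # (50, k)

    # A user's predicted affinity for an item is the dot product of their embeddings.
    pred = user_embeddings @ item_embeddings.T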

Maybe you will grant me the novelty of this method in making the connection between queries and converted restaurants as a similarity metric, and then applying it to the IR use case of search query expansion? Should the post be clearer that I am not claiming to have invented item2vec or ANN (approximate nearest neighbors)?


Your final paragraph is reasonable, and I do think the post would benefit from more clearly talking about this.

The “trick” is to find some naturally occurring property that indicates a pair of examples has the positive label, such as two users selecting a given product, two text queries leading to clicks on the same image, two different blog posts with the same keyword, etc. That is combined with an efficient way to sample acceptable negative pairs that do not express the trait indicating a positive label.

That, plus deciding how to approach the loss function (e.g. cosine similarity, Euclidean distance, distance in a hash space, triplet loss, whether it should have a margin), is the whole trick of the problem.
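E.g., the triplet-with-margin variant boils down to something like this minimal sketch:

    import numpy as np

    def triplet_margin_loss(anchor, positive, negative, margin=0.2):
        # Penalize when the positive isn't closer to the anchor than the negative by `margin`.
        d_pos = np.linalg.norm(anchor - positive)
        d_neg = np.linalg.norm(anchor - negative)
        return max(0.0, d_pos - d_neg + margin)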

The black box that produces embeddings (DNN, doc2vec, sparse vector from tf-idf-like features, whatever) and the nearest-neighbor piece are standard plug-and-play.


Thanks, I think we're in agreement.


Excellent post. Is this a candidate for the https://sigir.org/sigir2020/ industry track?


Author here. Thanks for the recommendation. I actually got the inspiration for this idea from chatting w/ industry colleagues at SIGIR. At the last SIGIR, in Paris, I spoke to some people from Apple, where they are using a technique like this for App Store search.


Neat! eggie5 and friends, I did something related for Caviar that you might be interested in: https://developer.squareup.com/blog/caviars-word2vec-tagging...


Although your post is orthogonal to what we present in this work, I think it still merits discussion.

The high-level concept of a product is important for any commerce company w/ an unbounded and unstructured product catalog. In this case, the concept of a Dish is a very valuable primitive for Caviar and GH to understand.

You propose an interesting technique for dish tagging:

1. Train word2vec on menu item text.
2. Generate menu item embeddings by aggregating the respective word vectors.
3. Curate a dish (label) set.
4. Generate embeddings for the dish labels by aggregating word vectors.
5. Tag menu items w/ their dish label via NN search.
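If I follow, a minimal sketch of steps 2, 4, and 5 would be something like this (toy word vectors standing in for the trained word2vec model, not your actual code):

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["spicy", "tuna", "roll", "pad", "thai", "noodles", "sushi"]
    word_vecs = {w: rng.normal(size=16) for w in vocab}  # placeholder for word2vec vectors

    def embed(text):
        # Steps 2/4: aggregate (average) word vectors for a menu item or dish label.
        vecs = [word_vecs[w] for w in text.lower().split() if w in word_vecs]
        return np.mean(vecs, axis=0)

    dish_labels = ["sushi roll", "thai noodles"]            # step 3: curated dish set
    label_vecs = np.stack([embed(d) for d in dish_labels])  # step 4

    def tag(menu_item):
        # Step 5: nearest dish label by cosine similarity.
        v = embed(menu_item)
        sims = label_vecs @ v / (np.linalg.norm(label_vecs, axis=1) * np.linalg.norm(v) + 1e-9)
        return dish_labels[int(np.argmax(sims))]

    print(tag("Spicy Tuna Roll"))  # whichever curated label is closest in this toy space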

Then once you have this high-level concept of dish you can drive all kinds of interesting product innovations:

* trending Asian noodle dishes
* tacos in San Diego
* It also helps w/ the item sparsity problem in matrix factorization

Do I understand?

I'm going to try this on our next learning Friday.

We have taken a different route for dish tagging. First we have a graphical model tag some items w/ low coverage but high precision, and then we feed that into a classifier to generalize and provide full coverage. It's like a generative model fed into a discriminative model.
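In sketch form (toy rules standing in for the graphical model, an sklearn classifier for the discriminative half; illustrative only):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    menu_items = ["spicy tuna roll", "california roll", "dragon roll",
                  "pad thai noodles", "green curry", "thai basil fried rice"]

    def precise_label(item):
        # Stand-in for the graphical model: high precision, low coverage (it abstains a lot).
        if "roll" in item:
            return "sushi"
        if "thai" in item or "curry" in item:
            return "thai"
        return None

    labeled = [(x, y) for x in menu_items if (y := precise_label(x)) is not None]
    texts, labels = zip(*labeled)

    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

    # The classifier generalizes beyond the precise labeler, giving full coverage.
    print(clf.predict(vec.transform(["crunchy shrimp roll", "drunken noodles"])))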


>> items w/ low coverage but high precision

Newbie NLP dev here. What's this concept of coverage vs precision that you speak of? It sounds intriguing.


Precision: of the items we do tag, how many are tagged correctly (i.e. how many false positives we admit). Coverage: how many items get a tag at all (the untagged ones are effectively false negatives).
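In sketch form, for a tagger run over a catalog:

    def precision(true_positives, false_positives):
        return true_positives / (true_positives + false_positives)

    def coverage(n_tagged, n_items):
        return n_tagged / n_items

    # e.g. a rule that tags 300 of 1000 items, 290 of them correctly:
    # coverage(300, 1000) == 0.30, precision(290, 10) ~= 0.97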


This is a fantastic post. I would love to see an even more detailed walkthrough with some code examples, and a discussion of taking these models from development to production. Do you have any other resources you would suggest?


For example, what ANN do you use in production, and what are the reasons for using it? Another question: how much data does it take to make these models work? Can they be primed effectively with little data by doing transfer learning from a more general model?


Hi, I'm the author of the post. We actually just presented this work, and more, at PyData NYC, where we shared more implementation details. Here are the slides: https://www.slideshare.net/AlexEgg1/discover-yourlatentfoodg...

But to answer your question here: for production ANN we have Annoy integrated into our backend as a service. Annoy was an easy choice bc it checked our box for JVM support.
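For anyone unfamiliar with Annoy, the basic usage pattern looks roughly like this (Python side, toy vectors standing in for query2vec output; the production integration is on the JVM, so this is just illustrative):

    from annoy import AnnoyIndex
    import numpy as np

    rng = np.random.default_rng(0)
    queries = ["thai", "pad thai", "sushi", "french", "burrito"]
    query_vectors = {q: rng.normal(size=64) for q in queries}

    index = AnnoyIndex(64, "angular")  # angular distance ~ cosine
    for i, q in enumerate(queries):
        index.add_item(i, query_vectors[q].tolist())
    index.build(10)             # 10 trees; more trees -> better recall, bigger index
    index.save("queries.ann")   # the saved index can be memory-mapped by a serving process

    # Approximate nearest neighbors of a query's embedding.
    neighbor_ids = index.get_nns_by_vector(query_vectors["french"].tolist(), 3)
    print([queries[i] for i in neighbor_ids])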

For training the models we have endless amounts of behavioral data, so we didn't even need to look at transfer learning. This query2vec example was trained on 1 year of queries, which takes 15 min/epoch on an AWS p2 GPU instance. We do all our preprocessing (heavy normalization) in PySpark.


Neat tech write-up, but ironically GrubHub has just lost 50% of its market cap due to poor earnings, where the company basically admitted its business model is indefensible and borderline unsustainable.


Author here. Thanks for the comment. This is a tough business to be in, but very exciting at the same time.

I'm trying to effect change by making meaningful contributions to our search pipeline, applying novel techniques from the information retrieval literature.

For example, with this query expansion technique, we can serve search queries in small markets (not a lot of restaurants) more completely by showing something instead of nothing. Or in big markets, we can serve obscure tail queries by finding something similar. This helps increase the overall recall of the search system, making it better!
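In sketch form, the expansion step is basically this (illustrative names, not the production code): expand a tail query into its nearest known queries and union their results.

    def expand_and_search(query, embed, ann_index, id_to_query, search_fn, k=5):
        # embed: query -> vector; ann_index: Annoy-style index over known query vectors;
        # id_to_query: index id -> query string; search_fn: the existing keyword search.
        neighbor_ids = ann_index.get_nns_by_vector(embed(query), k)
        expanded = {query, *(id_to_query[i] for i in neighbor_ids)}
        results = []
        for q in expanded:
            results.extend(search_fn(q))
        return list(dict.fromkeys(results))  # de-duplicate, keep order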

Don't lose faith in GH yet!


I'm happy to read about the tech. Not so much about the business, but as long as you're working on what you like, who am I to say anything? Good luck!


Curious, what's the underlying persistence? Are you actually using a graph database?


The latent graph structure is encoded into the embeddings. Each node in the graph is a vector, and you traverse the graph by performing NN search in the embedding space.



