Hacker News
Plain old gzip+kNN outperforms BERT and other DNNs
52 points by albert_e on July 13, 2023 | 7 comments
As per this paper:

“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

Zhiying Jiang, Matthew Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, Jimmy Lin

https://aclanthology.org/2023.findings-acl.426/

via: twitter.com/goodside/status/1679358632431853568



Very interesting, but I just tested this out and got better performance with a text-similarity process that uses a Word2Vec model to represent text documents as vectors and then computes the cosine similarity between those vectors. Here is that code: https://github.com/jimmc414/document_intelligence/blob/main/... It does require a 3 GB download of a pretrained word2vec embedding model. An explanation is provided in https://github.com/jimmc414/document_intelligence/blob/main/...
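
For reference, a minimal sketch of that general approach (not the linked repo's code): average the word2vec vectors of each document's tokens, then compare documents with cosine similarity. It assumes gensim and a locally downloaded pretrained binary such as GoogleNews-vectors-negative300.bin; the file names are placeholders.

    import numpy as np
    from gensim.models import KeyedVectors

    # Illustrative path; any word2vec-format embedding file works here.
    wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def doc_vector(text):
        # Represent a document as the mean of its in-vocabulary word vectors.
        tokens = [t for t in text.lower().split() if t in wv]
        if not tokens:
            return np.zeros(wv.vector_size)
        return np.mean([wv[t] for t in tokens], axis=0)

    def cosine_similarity(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    # Compare two whole files, as described above (paths are placeholders).
    sim = cosine_similarity(doc_vector(open("doc1.txt").read()),
                            doc_vector(open("doc2.txt").read()))
    print(sim)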

Here is the gzip+kNN implementation I tested: https://github.com/jimmc414/document_intelligence/blob/main/...
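
For anyone who wants to try the idea without the repo, a compact sketch of the gzip+kNN scheme from the paper: gzip-based normalized compression distance between documents, then a majority vote over the k nearest training examples. The training data and k below are placeholders.

    import gzip
    from collections import Counter

    def clen(s):
        return len(gzip.compress(s.encode()))

    def ncd(x, y):
        # Normalized compression distance via gzip.
        cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    def knn_predict(query, train, k=3):
        # train: list of (text, label) pairs; majority vote over the k nearest.
        neighbours = sorted(train, key=lambda tl: ncd(query, tl[0]))[:k]
        return Counter(label for _, label in neighbours).most_common(1)[0][0]

    # Toy usage with placeholder data.
    train = [("the match ended two to one", "sports"),
             ("shares fell sharply after earnings", "business")]
    print(knn_predict("the striker scored twice", train, k=1))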

I will note that I am comparing entire text files in these implementations, not sentences.


This seems to be about easier classification tasks with not too many samples, for which TF-IDF also works well (Table 3). But more generally, gzip for text modeling might make sense. Quoting http://bactra.org/notebooks/nn-attention-and-transformers.ht... :

> Once we have a source-coding scheme, we can "invert" it to get conditional probabilities; we could even sample from it to get a generator. (We'd need a little footwork to deal with some technicalities, but not a heck of a lot.) So something I'd really love to see done, by someone with the resources, is the following experiment:

> - Code up an implementation of Lempel-Ziv without the limitations built in to (e.g.) gzip; give it as much internal memory to build its dictionary as a large language model gets to store its parameter matrix. Call this "LLZ", for "large Lempel-Ziv".

> - Feed LLZ the same corpus of texts used to fit your favorite large language model. Let it build its dictionary from that. (This needs one pass through the corpus...)

> - Build the generator from the trained LLZ.

> - Swap in this generator for the neural network in a chatbot or similar. Call this horrible thing GLLZ.

> In terms of perplexity, GLLZ will be comparable to the neural network, because Lempel-Ziv does, in fact, do universal source coding.

Maybe someone on HN will have resources for such an experiment?
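
Not at that scale, but here is a toy character-level sketch of the "invert the coder to get a generator" idea, using an LZ78-style dictionary. Everything here is illustrative (including the corpus path), and it skips the technicalities the quote alludes to.

    import random
    from collections import defaultdict

    def build_lz78_followers(corpus):
        # LZ78-style parse: whenever a phrase is extended by a new character,
        # record that character as a follower of the phrase.
        followers = defaultdict(lambda: defaultdict(int))
        seen = {""}
        phrase = ""
        for ch in corpus:
            if phrase + ch in seen:
                phrase += ch
            else:
                followers[phrase][ch] += 1
                seen.add(phrase + ch)
                phrase = ""
        return followers

    def generate(followers, length=200):
        # "Invert" the parse: sample the next character from the follower
        # counts of the current phrase, backing off to the empty phrase.
        out, phrase = [], ""
        for _ in range(length):
            dist = followers.get(phrase) or followers[""]
            chars, counts = zip(*dist.items())
            ch = random.choices(chars, weights=counts)[0]
            out.append(ch)
            phrase = phrase + ch if (phrase + ch) in followers else ""
        return "".join(out)

    # Placeholder corpus path; a real run would use an LLM-scale corpus.
    table = build_lz78_followers(open("corpus.txt").read())
    print(generate(table))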


This is really interesting. How would this compare to neural networks in terms of performance and the resources needed for training and inference?

Would this be leaner and run on fewer resources, or would it reach the same complexity eventually?


Jimmy Lin and his lab are pretty amazing. I highly recommend a lot of their other writings. They've been simultaneously skeptical and innovative, cutting through a lot of the problematic, hard-to-reproduce BS of academic Machine Learning / Search, and contributing a lot to open source. Not something you see from most academic labs these days.




This is completely unbelievable, I love it! The paper is also written in a very accessible way.



