Hacker News
Plain old gzip+kNN outperforms BERT and other DNNs
52 points by albert_e on July 13, 2023 | 7 comments
As per this paper:

“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors

Zhiying Jiang, Matthew Yang, Mikhail Tsirlin, Raphael Tang, Yiqin Dai, Jimmy Lin

https://aclanthology.org/2023.findings-acl.426/

via: twitter.com/goodside/status/1679358632431853568



Very interesting, but I just tested this out and got better performance with a text-similarity process that uses a Word2Vec model to represent text documents as vectors and then computes the cosine similarity between those vectors. Here is that code: https://github.com/jimmc414/document_intelligence/blob/main/... It does require a 3 GB download of a pretrained word2vec embedding model. An explanation is provided in https://github.com/jimmc414/document_intelligence/blob/main/...
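
For reference, a minimal sketch of that general approach (not the linked repo's code): average the word2vec vectors of each document's tokens, then compare documents with cosine similarity. It assumes gensim and a locally downloaded pretrained binary such as GoogleNews-vectors-negative300.bin; the file names are placeholders.

    import numpy as np
    from gensim.models import KeyedVectors

    # Illustrative path; any word2vec-format embedding file works here.
    wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    def doc_vector(text):
        # Represent a document as the mean of its in-vocabulary word vectors.
        tokens = [t for t in text.lower().split() if t in wv]
        if not tokens:
            return np.zeros(wv.vector_size)
        return np.mean([wv[t] for t in tokens], axis=0)

    def cosine_similarity(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    # Compare two whole files, as described above (paths are placeholders).
    sim = cosine_similarity(doc_vector(open("doc1.txt").read()),
                            doc_vector(open("doc2.txt").read()))
    print(sim)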

Here is the gzip+kNN implementation I tested: https://github.com/jimmc414/document_intelligence/blob/main/...
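
For anyone who wants to try the idea without the repo, a compact sketch of the gzip+kNN scheme from the paper: gzip-based normalized compression distance between documents, then a majority vote over the k nearest training examples. The training data and k below are placeholders.

    import gzip
    from collections import Counter

    def clen(s):
        return len(gzip.compress(s.encode()))

    def ncd(x, y):
        # Normalized compression distance via gzip.
        cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    def knn_predict(query, train, k=3):
        # train: list of (text, label) pairs; majority vote over the k nearest.
        neighbours = sorted(train, key=lambda tl: ncd(query, tl[0]))[:k]
        return Counter(label for _, label in neighbours).most_common(1)[0][0]

    # Toy usage with placeholder data.
    train = [("the match ended two to one", "sports"),
             ("shares fell sharply after earnings", "business")]
    print(knn_predict("the striker scored twice", train, k=1))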

I will note that I am comparing entire text files in these implementations, not sentences.


This seems to be about easier classification tasks with not too many samples, for which TF-IDF also works well (Table 3). But more generally, gzip for text modeling might make sense. Quoting http://bactra.org/notebooks/nn-attention-and-transformers.ht... :

> Once we have a source-coding scheme, we can "invert" it to get conditional probabilities; we could even sample from it to get a generator. (We'd need a little footwork to deal with some technicalities, but not a heck of a lot.) So something I'd really love to see done, by someone with the resources, is the following experiment:

> - Code up an implementation of Lempel-Ziv without the limitations built in to (e.g.) gzip; give it as much internal memory to build its dictionary as a large language model gets to store its parameter matrix. Call this "LLZ", for "large Lempel-Ziv".

> - Feed LLZ the same corpus of texts used to fit your favorite large language model. Let it build its dictionary from that. (This needs one pass through the corpus...)

> - Build the generator from the trained LLZ.

> - Swap in this generator for the neural network in a chatbot or similar. Call this horrible thing GLLZ.

> In terms of perplexity, GLLZ will be comparable to the neural network, because Lempel-Ziv does, in fact, do universal source coding.

Maybe someone on HN will have resources for such an experiment?
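
Not at that scale, but here is a toy character-level sketch of the "invert the coder to get a generator" idea, using an LZ78-style dictionary. Everything here is illustrative (including the corpus path), and it skips the technicalities the quote alludes to.

    import random
    from collections import defaultdict

    def build_lz78_followers(corpus):
        # LZ78-style parse: whenever a phrase is extended by a new character,
        # record that character as a follower of the phrase.
        followers = defaultdict(lambda: defaultdict(int))
        seen = {""}
        phrase = ""
        for ch in corpus:
            if phrase + ch in seen:
                phrase += ch
            else:
                followers[phrase][ch] += 1
                seen.add(phrase + ch)
                phrase = ""
        return followers

    def generate(followers, length=200):
        # "Invert" the parse: sample the next character from the follower
        # counts of the current phrase, backing off to the empty phrase.
        out, phrase = [], ""
        for _ in range(length):
            dist = followers.get(phrase) or followers[""]
            chars, counts = zip(*dist.items())
            ch = random.choices(chars, weights=counts)[0]
            out.append(ch)
            phrase = phrase + ch if (phrase + ch) in followers else ""
        return "".join(out)

    # Placeholder corpus path; a real run would use an LLM-scale corpus.
    table = build_lz78_followers(open("corpus.txt").read())
    print(generate(table))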


This is really interesting. How would this compare to neural networks in terms of performance and the resources needed for training and inference?

Would this be leaner and run on fewer resources, or would it reach the same complexity eventually?


Jimmy Lin and his lab are pretty amazing. I highly recommend a lot of their other writings. They've been simultaneously skeptical and innovative, cutting through a lot of the problematic, hard-to-reproduce BS of academic Machine Learning / Search, and contributing a lot to open source. Not something you see from most academic labs these days.




This is completely unbelievable, I love it! The paper is also written in a very accessible way.



