Don't use all-MiniLM-L6-v2 for new vector embeddings datasets. Yes, it's the ope...

xfalcox · 2025-11-28T19:28:14 1764358094

I am partial to https://huggingface.co/Qwen/Qwen3-Embedding-0.6B nowadays.

Open weights, multilingual, 32k context.

SteveJS · 2025-11-28T20:26:07 1764361567

Also matryoshka and the ability to guide matches by using prefix instructions on the query.

I have ~50 million sentences from english project gutenberg novels embedded with this.

dleeftink · 2025-11-28T20:53:54 1764363234

Why would you do that and I'd love to know more

SteveJS · 2025-11-29T06:09:14 1764396554

The larger project is to allow analyzing stories for developmental editing.

Back in June and August i wrote some llm assisted blog posts about a few of the experiments.

They are here: sjsteiner.substack.com

Tostino · 2025-11-28T20:53:56 1764363236

What are you using those embeddings for, If you don't mind me asking? I'd love to know more about the workflow and what the prefix instructions are like.

greenavocado · 2025-11-28T21:49:39 1764366579

It's junk compared to BGE M3 on my retrieval tasks

simonw · 2025-11-28T23:48:47 1764373727

It's a shame EmbeddingGemma is under the shonky Gemma license. I'll be honest: I don't remember what was shonky about it, but that in itself is a problem because now I have to care about, read and maybe even get legal advice before I build anything interesting on top of it!

(Just took a look and it has the problem that it forbids certain "restricted uses" that are listed in another document which it says it "is hereby incorporated by reference into this Agreement" - in other words Google could at any point in the future decide that the thing you are building is now a restricted use and ban you from continuing to use Gemma.)

minimaxir · 2025-11-29T02:21:54 1764382914

For the use cases of embeddings anyways, the issues with the Gemma license should be less significant.

wanderingmind · 2025-11-29T04:37:25 1764391045

Can someone explain what's technically better in the recent embedding models. Has there been a big change in their architecture or is it lighter on memory or can handle longer context because of improved training?

tifa2up · 2025-11-29T09:01:41 1764406901

https://agentset.ai/leaderboard/embeddings good rundown of other open-source embedding models

melvinodsa · 2025-12-01T15:54:41 1764604481

I am trying sentence-transformers/multi-qa-MiniLM-L6-cos-v1 for deploying a light weight transformer on CPU machine -its output dimension is 384. I want to keep the dimension low as possible. nomic-embed-text offers lower dimensions upto 64. I will need to test my dataset. Will comeback with the results.

spacecadet · 2025-11-29T12:33:46 1764419626

Great comment. For what its worth, really think about your vectors before creating them! Any model can be a vector model, you just use the final hidden states... with that, think about your corpus and the model latent space and try to pair them appropriately. For instance, I vectorize and search network data using a model trained on coding, systems, data, etc.

kaycebasques · 2025-11-28T20:31:37 1764361897

One thing that's still compelling about all-Mini is that it's feasible to use it client-side. IIRC it's a 70MB download, versus 300MB for EmbeddingGemma (or perhaps it was 700MB?)

Are there any solid models that can be downloaded client-side in less than 100MB?

intalentive · 2025-11-28T21:57:00 1764367020

This is the smallest model in the top 100 of HF's MTEB Leaderboard: https://huggingface.co/Mihaiii/Ivysaur

Never used it, can't vouch for it. But it's under 100 MB. The model it's based on, gte-tiny, is only 46 MB.

nijaru · 2025-11-28T22:06:28 1764367588

For something under 100 MB, this is probably the strongest option right now.

https://huggingface.co/MongoDB/mdbr-leaf-ir

stingraycharles · 2025-11-29T01:17:00 1764379020

How do the commercial embedding models compare against each other? Eg Cohere vs OpenAI small vs OpenAI large etc?

I have troubles navigating this space as there’s so much choice, and I don’t know exactly how to “benchmark” an embedding model for my use cases.

dangoodmanUT · 2025-11-28T19:35:49 1764358549

yeah this, there's much better open weights models out there...

SamInTheShell · 2025-11-28T21:48:04 1764366484

I tried out EmbeddingGemma a few weeks back in AB testing against nomic-embed-text-v1. I got way better results out of the nomic model. Runs fine on CPU as well.