On the human ratings, three different 7B LLMs (two different OpenChat models and a Mistral fine-tune) beat a version of GPT-3.5.
(The top 9 chatbots are GPT and Claude versions. Tenth place is a 70B model. While it's great that there's so much interest in 7B models, and it's incredible that people are pushing them so far, I selfishly wish more effort would go into 13B models... since those are the biggest that my MacBook can run.)
I think the current approach (train 7b models and then do MoE on them) is the future. It'll still only be runnable on high-end consumer devices. As for 13b + MoE, I don't think any consumer device could handle that in the next couple of years.
My years-old M1 MacBook with 16GB of RAM runs them just fine. Several GeForce 40-series cards have at least 16GB of VRAM. MacBook Pros go up to 128GB of RAM and the Mac Studio goes up to 192GB. Running regular CPU inference on lots of system RAM is cheap-ish and not intolerably slow.
These aren't especially common configurations, but they're not out of reach the way buying an H100 for personal use is.
1. I wouldn't consider the Mac Studio ($7,000) a consumer product.
2. Yes, and my MBP M1 Pro can run quantized 34b models. My point was that once you do MoE, memory requirements suddenly become too challenging. A 7b model at Q8 is roughly 7GB (7b parameters × 8 bits each), but 8x of that would be 56GB, and all of it must be in memory to run.
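For anyone who wants to redo the arithmetic, here's a rough back-of-the-envelope sketch (weights only; it ignores KV cache, activations, and any parameter sharing between experts):

```python
# Back-of-the-envelope weight memory for quantized models (weights only;
# ignores KV cache, activations, and any sharing between experts).

def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

print(weight_memory_gb(7, 8))       # ~7 GB  : one 7b model at Q8
print(weight_memory_gb(8 * 7, 8))   # ~56 GB : naive 8x7b MoE at Q8
print(weight_memory_gb(8 * 7, 4))   # ~28 GB : the same MoE at Q4
```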
I have no formal credentials to say this, but intuitively I feel this is obviously wrong. You couldn't have taken 50 rats' brains, "mixed" them, and expected the result to produce new science.
For some uninteresting regurgitation, sure. But size, both width and depth, seems like an important piece of the ability to extract a deep understanding of the universe.
Also, MoE, as I understand it, inherently won't be able to glean insight into or reason about cross-expert areas, and certainly won't be able to come up with novel understanding in them.
The MoE models are essentially trained as a single model. It's not several independent models; individually (AFAIK) they are all totally useless without each other.
It's just that each expert picks up different "parts" of the training more strongly, and those parts can be selectively activated at runtime. This is actually kind of analogous to animals, which don't fire every single neuron all the time the way monolithic models do.
The tradeoff, at equivalent quality, is essentially increased VRAM usage for faster, more splittable inference and training, though the exact balance of this tradeoff is an excellent question.
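To make the routing idea concrete, here's a toy sketch of a sparse-MoE layer in plain NumPy; the dimensions, the `router`/`experts` names, and the single-token forward pass are made up for illustration and don't correspond to any particular model's implementation:

```python
import numpy as np

# Toy sparse-MoE layer: a router scores 8 expert MLPs per token and only the
# top-2 are evaluated, so most of the parameters sit idle for any given token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router = rng.normal(size=(d_model, n_experts))                  # gating weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    """x: a single token's hidden state, shape (d_model,)."""
    logits = x @ router
    chosen = np.argsort(logits)[-top_k:]                         # top-2 experts
    gate = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum() # softmax over chosen
    # Only the chosen experts run; the other six are skipped entirely.
    return sum(w * (x @ experts[i]) for w, i in zip(gate, chosen))

print(moe_forward(rng.normal(size=d_model)).shape)               # (16,)
```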
But it's not totally irrelevant. They are still a datapoint to consider, with some correlation to real performance. YMMV, but these models actually seem quite good for their size in my initial testing.
More or less. The automated benchmarks themselves can be useful when you weed out the models which are overfitting to them.
That said, anyone claiming a 7b LLM is better than a well-trained 70b LLM like Llama 2 70b chat for the general case doesn't know what they're talking about.
Will it be possible in the future? Absolutely, but today we have no architecture or training methodology that would allow it.
You can rank models yourself with a private automated benchmark, which models don't have a chance to overfit to, or with a good human evaluation study.
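As a rough illustration of the "private benchmark" idea, something as simple as this is enough to start with; `ask_model` and the substring grading are placeholders for whatever inference and scoring you actually use:

```python
# Minimal sketch of a private benchmark harness: keep the questions to
# yourself so no public model could have trained on them, then score models
# side by side. ask_model() stands in for whatever API or local inference
# you use; grading here is simple substring matching.

private_questions = [
    {"prompt": "What is 17 * 23?", "answer": "391"},
    # ... your own held-out questions, never published anywhere ...
]

def score(model_name: str, ask_model) -> float:
    correct = 0
    for q in private_questions:
        reply = ask_model(model_name, q["prompt"])
        correct += q["answer"].lower() in reply.lower()
    return correct / len(private_questions)
```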
Edit: also, I'm guessing OP is talking about Mistral finetunes (ones overfitting to the benchmarks) beating out 70b models on the leaderboard, since Mistral 7b itself ranks lower than Llama 2 70b chat.
> today we have no architecture or training methodology which would allow it to be possible.
We clearly see that Mistral-7B is superior to Falcon-180B in some important, representative respects (e.g. coding), and superior across the board to stuff like OPT-175B or Bloom-175B.
"Well trained" is relative. Models are, overwhelmingly, functions of their data, not just scale and architecture. Better data allows for yet-unknown performance jumps, and data curation techniques are a closely-guarded secret. I have no doubt that a 7B beating our best 60-70Bs is possible already, eg using something like Phi methods for data and more powerful architectures like some variation of universal transformer.
I mean, I 100% agree size is not everything. You can have a model which is massive but not trained well so it actually performs worse than a smaller, better/more efficiently trained model. That's why we use Llama 2 70b over Falcon-180b, OPT-175b, and Bloom-175b.
I don't know how Mistral performs on codegen specifically, but models which are finetuned for a specific use case can definitely punch above their weight class. As I stated, I'm just talking about the general case.
But so far we don't know of a 7b model (there could be a private one we don't know about) that can beat a modern 70b model such as Llama 2 70b. Could one have been created that we simply don't know about? Yes. Could we apply Phi's technique to 7b models and reach Llama 2 70b levels of performance? Maybe, but I'll believe it when we have a 7b model built on it and a human evaluation study to confirm. It's been months since the Phi study came out, and I haven't heard of any new 7b model built on it. If it really were such a breakthrough, allowing a 10x parameter reduction and a 100x dataset reduction, it would be dumb for these companies not to pursue it.
UNA: Uniform Neural Alignment.
Haven't you noticed yet? Each model that I apply UNA to behaves like a pre-trained model... and you can likely fine-tune it again without damaging it.
If you've chatted with them, you know... that strange sensation, you know what it is... intelligence.
Xaberius-34B is the highest performer on the board, and is NOT contaminated.
In addition to what was said, if it's anything like DPO you don't need a lot of data, just a good set. For instance, DPO requires a "good" and a "bad" response for each given prompt.
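For illustration, here's roughly what a single preference pair and the DPO loss on it look like; the record fields and log-probability numbers are made up, and a real implementation would sum per-token log-probs from the policy and a frozen reference model:

```python
import math

# Sketch of the DPO objective on one preference pair. Each prompt needs a
# "chosen" (good) and "rejected" (bad) response; the loss nudges the policy
# toward the chosen response relative to a frozen reference model.
# The log-probabilities passed in below are made-up numbers for illustration.

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log(sigmoid(beta * margin))

example_pair = {
    "prompt": "Explain MoE in one sentence.",
    "chosen": "A router activates a few expert sub-networks per token.",
    "rejected": "MoE just glues several finished models together.",
}
print(dpo_loss(-42.0, -55.0, -44.0, -54.0))  # ~0.55
```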
Quick to assert an authoritative opinion, yet the one word "better" belies the message? Certainly there are more dimensions worth including in the rating?
Certainly, there may be aspects where a particular 7b model could beat another particular 70b model, and greater detail on the pros and cons of different models is worth considering. But people are trying to rank models, and if we're ranking (calling one "better" than another), we might as well do it as accurately as we can, since it can be so subjective.
I see too many misleading "NEW 7B MODEL BEATS GPT-4" posts. People test those models a couple of times, come back to the comments section, and declare it true, and onlookers know no better than to believe it. In my opinion this has led to many people claiming 7b models have gotten as good as Llama 2 70b or GPT-4, when that isn't the case once you account for the overfitting these models exhibit and actually put them to the test via human evaluation.