With M2, yes - I’ve used it in Claude Code (e.g. native tool calling), Roo/Cline (e.g. custom tool parsing), etc. It’s quite good, and for some time it was the best model to self-host. At 4-bit it can fit on 2x RTX 6000 Pro (~200GB VRAM) with about 400k context at FP8 KV cache. It’s very fast due to its low active parameter count, stable at long context, and quite capable in any agent harness (its training specialty). M2.1 should be a good bump beyond M2, which was undertrained relative to even much smaller models.
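Rough napkin math for why that fits, with illustrative model-shape numbers (layer count, KV heads, and head dim below are assumptions, not official specs):

    # 4-bit weights plus an FP8 KV cache for ~400k tokens
    total_params_b = 230                      # assumed total parameter count, in billions
    weight_gb = total_params_b * 0.5          # 4-bit ~ 0.5 bytes/param -> ~115 GB

    # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * 1 byte (FP8)
    layers, kv_heads, head_dim = 60, 8, 128   # assumed model shape
    kv_gb = 400_000 * 2 * layers * kv_heads * head_dim / 1e9   # ~49 GB

    print(round(weight_gb + kv_gb))           # ~164 GB, with headroom on 2x 96 GB cards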
You might have 1A rights as an American, but it seems to me the manner in which this person protested would be grounds for termination in many jurisdictions.
1A doesn't apply to private entities anyway. 1A protects against government prosecution for your speech, and the government may make no laws "abridging the freedom of speech."
But your employer? They can put whatever rules and restrictions they want on your speech, and with at-will employment, they can fire you for any reason anyway, at any time.
You can say whatever you want, but you aren't free from the consequences of that speech.
This comment sums up well how the spirit of the law is not being upheld, given that the biggest players in government, finance, and the corporate world are working together hand in glove.
>“Corporations cannot exist without government intervention”
>“Some private companies and financiers are too big to fail / of strategic national importance”
>“1A does not apply to private entities (including the above)”
>“We have a free, competitive market”
I find it very difficult to resolve these seemingly contradictory statements.
This is the right take. You might be able to get decent token generation (2-3x slower than a GPU rig), which is adequate, but prompt processing is more like 50-100x slower. A hardware solution is needed to make long context actually usable on a Mac.
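What that gap means in practice, with assumed prefill speeds (illustrative, not benchmarks):

    prompt_tokens = 100_000
    mac_prefill, gpu_prefill = 150, 10_000      # assumed prompt-processing tokens/sec
    print(prompt_tokens / mac_prefill)          # ~670 s before the first output token
    print(prompt_tokens / gpu_prefill)          # ~10 s on a GPU rig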
The problem with OpenAI models is the lack of a Max-like subscription for a good agentic harness. Maybe OpenAI or Microsoft could fix this.
I just went through the agony of provisioning my team with new Claude Code 5x subs 2 weeks ago after reviewing all of the options available at that time. Since then, the major changes include a Cerebras sub for Qwen3 Coder 480B, and now GPT-5. I’m still not sure I made the right choice, but hey, I’m not married to it either.
If you plan on using this much at all then the primary thing to avoid is API-based pay per use. It’s prohibitively costly to use regularly. And even for less important changes it never feels appropriate to use a lower quality model when the product counts.
Claude Code won primarily because of the sub, and because they have a top-tier agentic harness and models that know how to use it. Opus and Sonnet are fantastic agents and very good at our use case, and were our preferred API-based models anyways. We can use Claude Code basically all day with at least Sonnet after using up our Opus limits. Worth noting that Cline built a Claude Code provider that the derivatives aped, which is great, but I’ve found Claude Code to be as good or better anyways. The CLI interface is actually a bonus for ease of sharing state via copy/paste.
I’ll probably change over to Gemini Code Assist next, as it’s half the price with more context length, but I’m waiting for a better Gemini 2.5 Pro and for gemini-cli/the Code Assist extensions to have first-party planning support. You can get some form of that through third-party custom extensions with the cli, but without it they’re incomplete as an agent harness.
The Cerebras + Qwen3 Coder 480B with qwen3-cli is seriously tempting. Crazy generation speed. There’s some question about how big the rate limit really is, but it’s half the cost of Claude Code 5x. I haven’t checked, but I know qwen3-cli, which was introduced alongside the model, is a fork of gemini-cli with Qwen-focused updates; I wonder if they landed a planning tool?
I don’t really consider Cursor, Windsurf, Cline, Roo, Kilo et al as they can’t provide a flat rate service with the kind of rate limits you can get with the aforementioned.
GitHub Copilot could be a great offering if they were willing to really compete with a good unlimited premium plan but so far their best offering has less premium requests than I make in a week, possibly even in a few days.
Would love to hear if I missed anything, or somehow missed some dynamic here worth considering. But as far as I can tell, given heavy use, you only have 3 options today: Claude Max, Gemini Code Assist, Cerebras Code.
> If you plan on using this much at all then the primary thing to avoid is API-based pay per use.
I find there's a niche where API pay-per-use is cost effective. It's for problems that require (i) small context and (ii) not much reasoning.
Coding problems with 100k-200k context violate (i). Math problems violate (ii) because they generate long reasoning streams.
Coding problems with 10k-20k context are well suited, because they generate only ~5k output tokens. That's $0.03-$0.04 per prompt to GPT-5 under flex pricing. The convenience is worth it, unless you're relying on a particular agentic harness that you don't control (I am not).
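A sketch of that per-prompt math; the flex rates here are my assumption of roughly half the standard GPT-5 price, so check current pricing:

    in_rate, out_rate = 0.625 / 1e6, 5.00 / 1e6   # assumed $/token under flex
    cost = 15_000 * in_rate + 5_000 * out_rate    # ~15k context, ~5k output
    print(round(cost, 3))                         # ~$0.034 per prompt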
For large context questions, I send them to a chat subscription, which gives me a budget of N prompts instead of N tokens. So naturally, all the 100k-400k token questions go there.
16 hours ago the README for the Codex CLI was updated. Codex CLI now supports OpenAI login the way Claude Code does; no API credits needed.
From the readme:
> After you run codex, select Sign in with ChatGPT. You'll need a Plus, Pro, or Team ChatGPT account, and will get access to our latest models, including gpt-5, at no extra cost to your plan. (Enterprise is coming soon.)
> Important: If you've used the Codex CLI before, you'll need to follow these steps to migrate from usage-based billing with your API key:
> 1. Update the CLI with codex update and ensure codex --version is greater than 0.13
> 2. Ensure that there is no OPENAI_API_KEY environment variable set. (Check that env | grep 'OPENAI_API_KEY' returns empty)
> 3. Run codex login again
Is this actually true? Last I checked (a week ago?), Codex the agent was free at some tiers in a preview capacity (with future rate limits based on tier), but the Codex CLI was not. With the Codex CLI you can log in, but the purpose of that is to link it to an API key where you pay per use. The sub tiers give one-time credits you would burn through quickly.
> Availability and access
> GPT‑5 is starting to roll out today to all Plus, Pro, Team, and Free users, with access for Enterprise and Edu coming in one week. Pro, Plus, and Team users can also start coding with GPT‑5 in the Codex CLI by signing in with ChatGPT.
I believe it could be true because I think the training dataset contained a lot more YAML than JSON. I mean... you know how much YAML gets churned out every second?
DDR3 workstation here - R1 generates at 1 token per second. In practice, this means that for complex queries, the speed of replying is closer to an email response than a chat message, but this is acceptable to me for confidential queries or queries where I need the model to be steerable. I can always hit the R1 API from a provider instead, if I want to.
Given that R1 uses 37B active parameters (compared to 32B for K2), K2 should be slightly faster than that - around 1.15 tokens/second.
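The scaling assumption spelled out, as a rough heuristic (generation is memory-bandwidth bound, so tokens/sec scales roughly with the inverse of active parameter count):

    r1_tps, r1_active_b, k2_active_b = 1.0, 37, 32
    print(round(r1_tps * r1_active_b / k2_active_b, 2))   # ~1.16 tokens/sec for K2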
The full thing, 671B. It loses some intelligence at 1.5 bit quantisation, but it's acceptable. I could actually go for around 3 bits if I max out my RAM, but I haven't done that yet.
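Napkin math on the weight footprint at those quantisation levels (weights only, ignoring KV cache and runtime overhead):

    params_b = 671
    for bits in (1.5, 3, 4):
        print(bits, round(params_b * bits / 8), "GB")   # ~126, ~252, ~336 GB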
If you mean clearly, noticeably erratic or incoherent behaviour, then that hasn't been my experience for >=4-bit inference of 32B models, or in my R1 setup. I think the others might have been referring to this happening with smaller models (sub-24B), which suffer much more after being quantised below 4 or 5 bits.
My R1 most likely isn't as smart as the output coming from an int8 or FP16 API, but that's just a given. It still holds up pretty well for what I did try.
The user should be able to enable/disable individual tools or an entire tab’s toolset. Some users keep hundreds of tabs open, and that’s simply too many potential tools to expose. Deduping doesn’t make sense for the reasons you say, and because one logical task could lead to a series of operations mis-sequenced across a range of tabs.
If the primary use case is input heavy, which is true of agentic tools, there’s a world where partial GPU offload with many channels of DDR5 system RAM leads to an overall better experience. A good GPU will process input many times faster, and with good RAM you might end up with decent output speed still. Seems like that would come in close to $12k?
And there would be no competition for models that do fit entirely inside that VRAM, for example Qwen3 32B.
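To sketch the hybrid idea with assumed numbers (not benchmarks): the GPU handles prefill quickly, while system RAM bandwidth bounds generation for whatever is offloaded.

    channels, mt_s = 8, 4800
    ram_bw_gbs = channels * 8 * mt_s / 1000        # ~307 GB/s for 8-channel DDR5-4800
    active_weights_gb = 16                         # e.g. a ~32B-active MoE at 4-bit
    print(round(ram_bw_gbs / active_weights_gb))   # ~19 tokens/sec upper bound from RAM alone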