Hacker News | kouteiheika's comments

> You can only give it a try, but don't get your hopes high on a large context.

You may or may not know this, but: when training off-the-shelf LLMs (i.e. ones with a huge vocabulary), what consumes a huge amount of memory is calculating the cross-entropy loss (and it gets worse the more tokens you stuff into your batch), so always use a fused cross-entropy kernel.

For example, for a Gemma 2 model with 2B parameters at a batch size of 8k, this consumes 24GB of VRAM by default (!); you can fuse your cross-entropy loss with @torch.compile, which can cut that down to something like a few gigabytes, but with a dedicated kernel it becomes a few megabytes.
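The saving comes from never materializing the full (tokens x vocab) logits matrix. Here's a minimal NumPy sketch of that idea (chunked, not an actual fused GPU kernel; the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def chunked_cross_entropy(hidden, w_vocab, targets, chunk_size=1024):
    # Project hidden states to logits one chunk of rows at a time, so the
    # full (num_tokens, vocab_size) logits matrix never exists in memory.
    n = hidden.shape[0]
    total = 0.0
    for start in range(0, n, chunk_size):
        h = hidden[start:start + chunk_size]                 # (c, d)
        logits = h @ w_vocab                                 # (c, V): one chunk only
        logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        t = targets[start:start + chunk_size]
        total += (logsumexp - logits[np.arange(len(t)), t]).sum()
    return total / n
```

A real fused kernel (Liger, Cut Cross Entropy) also computes the gradient inside the same chunked loop, but the memory argument is the same: peak usage scales with the chunk, not with the full batch.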


I'd not heard of this before; a quick search turned up this 2025 post, which suggests a "fused cross-entropy loss" kernel was integrated into PyTorch:

https://pytorch.org/blog/peak-performance-minimized-memory/

  > "The integration involves modifying the TransformerDecoder module in torchtune to bypass the linear layer computation, allowing the Liger Fused Linear Cross Entropy Loss to handle the forward projection weights."
Is this the same thing as you discuss above?

Yes.

Although this wasn't integrated into PyTorch itself (but into torchtune, which is a different thing). If you're writing your own training loop you need to use a third-party kernel, e.g. the Liger kernel mentioned in the article, or Cut Cross Entropy (which is much better than the Liger one, although IIRC it has a numeric bug in one of its kernels making the results very slightly off).


Activations would still require gigabytes for even a few KB of context.

There are plenty of techniques to optimise. But the question is what an RTX 3080 can train before OOM. The answer is: not that much.

It can barely do quantized fine-tuning, and even then only with a small context.


> Activations would still require gigabytes for even a few KB of context.

For that you use activation checkpointing, and you can also offload that to the CPU in a smart way to hide the latency. Although, yes, for long context training the activations do dominate the memory usage (and quantizing them degrades things more than just quantizing weights and/or optimizer states).
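As a toy illustration of activation checkpointing (plain Python, hypothetical function names): keep only every k-th activation on the forward pass, and replay the short segment from the nearest checkpoint whenever the backward pass needs an intermediate activation.

```python
def forward_checkpointed(x, layers, every=4):
    # Run a chain of layers, saving only the input and every `every`-th
    # activation instead of all of them (memory ~ n/every instead of ~ n).
    saved = {0: x}
    a = x
    for i, f in enumerate(layers, start=1):
        a = f(a)
        if i % every == 0:
            saved[i] = a
    return a, saved

def recompute_activation(saved, layers, i):
    # Recover the activation after layer i by replaying the forward pass
    # from the nearest earlier checkpoint - this recomputation is the
    # extra compute you trade for the memory savings.
    start = max(k for k in saved if k <= i)
    a = saved[start]
    for j in range(start, i):
        a = layers[j](a)
    return a
```

Real frameworks do this per transformer block rather than per arbitrary layer, and the CPU-offloading variant mentioned above additionally ships the saved checkpoints to host memory between forward and backward.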


This isn't really anything new; I've been doing something like this for quite a while, I just haven't bothered writing a paper. (: Probably anyone who would seriously tackle the problem of "how do I train a huge model on a tiny amount of VRAM?" would come up with something similar.

However, most people in the field don't, because the actual practical utility of training huge models on a single GPU is quite low. (E.g. they got 341 tok/s for a 14B model on a single 3090, while with my method I was getting ~1k tok/s on a single 4090; that's still very slow.)

Also, there are more tricks one can use to speed up training and lower VRAM usage which they're not using. For example, you don't need any gradient offloading (you can just accumulate the gradients directly into the optimizer's states if you modify your optimizer), you can use Muon instead of Adam (it needs only half the VRAM of Adam), you can use quantization (both for parameters and for the optimizer states; e.g. I found Muon quantized to 4-bit works relatively well), etc.
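The "accumulate directly into the optimizer state" trick can be sketched with SGD-with-momentum as a stand-in (the comment is about Muon/Adam, but the buffer-saving idea is the same): decay the momentum once at the start of the step, then fold each micro-batch gradient straight into it, eliminating the separate gradient-accumulation buffer. This is an illustrative sketch, not anyone's published implementation.

```python
import numpy as np

def momentum_step_with_grad_buffer(m, micro_grads, beta=0.9):
    # Conventional gradient accumulation: a separate full-size buffer
    # collects the micro-batch gradients before the optimizer step.
    g_sum = np.zeros_like(m)          # the extra per-parameter buffer
    for g in micro_grads:
        g_sum += g
    return beta * m + g_sum

def momentum_step_fused(m, micro_grads, beta=0.9):
    # Fused variant: decay the momentum once, then accumulate each
    # micro-batch gradient directly into it - no extra buffer needed.
    m = beta * m
    for g in micro_grads:
        m = m + g
    return m
```

Both produce identical momentum states; the fused version just never allocates the gradient buffer, saving one full set of parameter-sized tensors.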


As the saying goes, POC or GTFO

I invented faster than light travel, it was obvious, just didn't write a paper yet either :)


Can you take the time to write your methods? I’d be interested in reading it

341 is two orders of magnitude faster than your 1 tok/s so it doesn’t seem like their stuff is all that obvious. I also have no baseline for training to know if 341tok/s is slow but it seems speedy for a 3090.

OP said 1k, not 1

:) Coffee is good

1k tok/s = 1000 tok/s...

OOM is log10

> Can I just see the actual thinking (not summarized) so that I can see the actual thinking without a latency cost?

You can't, and Anthropic will never allow it, since it would let others more easily distill Claude (i.e. "distillation attacks"[1] in Anthropic-speak, even though Anthropic is doing essentially exactly the same thing[2]; rules for thee but not for me).

[1] -- https://www.anthropic.com/news/detecting-and-preventing-dist...

[2] -- https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...


So this means I can not resume a session older than 30 days properly?

I have no idea; you have to check their docs.

AFAIK what they do is calculate a hash of the true thinking trace, save it into a database, and only send those hashes back to you (try to man-in-the-middle Claude Code and you'll see those hashes). Then when you send back your session's history you include those hashes; they look them up in their database, replace them with the real thinking trace, and hand that off to the LLM to continue generation. (All SOTA LLMs nowadays retain reasoning content from previous turns, including Claude.)
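This is speculation about Anthropic's internals, but the described round-trip is simple to sketch (hypothetical names; the server-side database shown as a plain dict):

```python
import hashlib

# Server-side store mapping hash -> real thinking trace.
trace_store = {}

def redact_thinking(trace: str) -> str:
    # The client only ever sees this opaque hash.
    h = hashlib.sha256(trace.encode()).hexdigest()
    trace_store[h] = trace
    return h

def rehydrate_history(history):
    # On a follow-up request, swap each known hash back for the stored
    # trace before handing the conversation to the model.
    return [trace_store.get(msg, msg) for msg in history]
```

Under this scheme, resuming a session only works as long as the server still holds the hash-to-trace mapping, which is exactly why retention policies would matter here.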


So we are paying the price for the infra needed to protect their asset, which was trained on data derived from the work of others while ignoring the same principle? I need this to make sense.

I see. If that's just hashes and not encrypted content I can't see how they can resume old sessions properly. IIRC they have a 30 days retention policy and surely the thinking traces must be considered data. Wonder how this works with the zero-retention enterprise plans...

This is not a hypothetical problem and you don't need to be deliberately targeted. It actually happens to normal people. And if it does you have absolutely zero recourse.

Source: I have a banned Google account (it's over 20 years old at this point). I know the password, but Google won't let me log into it. Every few years I try, unsuccessfully, to recover it.

If you have a Google account and having it banned would be a problem for you here's my advice: migrate. Right now. You never know when one of their bots will deem you a persona non grata.


Can't you just create a new account?

You can, but you lose access to anything that was associated with your old account.

Another fun thing Google did is to automatically (without my consent) add a required second-factor authentication to my current Google account. I have this old, e-waste tier phone that I use mostly only as a glorified alarm clock, and at one point I used it to log into my current Google account.

Imagine my surprise when I tried to log in to my Google account from somewhere else, and it asked me for an authentication code from this phone. Again, I have never explicitly set it up as such - Google did this automatically! So if I were to lose this phone I'd be screwed yet again, with yet another inaccessible Google account that I will have no way of recovering.

At this point I don't depend on any Big Tech services; my Google account has nothing of value associated with it (only my YouTube subscription list, which is easy enough to backup and restore), and I pay for my own email on my own domain, etc. So if I get screwed over yet again by a big, soulless corporation that just sees me as a number on their bottom-line, well, I just won't care.


You better hope that whatever is-this-the-same-user heuristics they have on their side never find out for the duration of your entire life.

In his case, I'm pretty sure 20 y/o data is pretty useless nowadays in terms of fingerprinting and usage heuristics.

There is one way to practically guarantee that no prompt injection is possible, but it's somewhat situational - by finetuning the model on your specific, single task.

For example, let's say you want to use an LLM for machine translation from English into Klingon. Normally people just write something like "Translate the following into Klingon: $USER_PROMPT" using a general purpose LLM, and that is vulnerable to prompt injection. But, if you finetune a model on this well enough (ideally by injecting a new special single token into its tokenizer, training with that, and then just prepending that token to your queries instead of a human-written prompt) it will become impossible to do prompt injection on it, at the cost of degrading its general-purpose capabilities. (I've done this before myself, and it works.)

Prompt injection happens because the models themselves are general purpose - you can prompt one with essentially any query and it will respond in a reasonable manner. In other words: the instructions you give to the model and the input data are part of the same prompt, so the model can confuse the input data for part of its instructions. But if you instead fine-tune the instructions into the model and only prompt it with the input data (i.e. the prompt never actually tells the model what to do), then it becomes pretty much impossible to tell it to do something else, no matter what you inject into its prompt.
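A toy sketch of the scheme (hypothetical vocab and token ids): the task token lives outside the text vocabulary, so no untrusted input can ever produce it, and it replaces the natural-language instruction entirely.

```python
# Toy byte-level text vocab: printable ASCII only.
TEXT_VOCAB = {chr(c): c for c in range(32, 127)}
TASK_TRANSLATE_KLINGON = 1000  # id deliberately outside the text vocab

def encode_untrusted(text):
    # Untrusted text can only map into the text vocab, so it can never
    # contain the task token, no matter what the user types.
    return [TEXT_VOCAB[ch] for ch in text if ch in TEXT_VOCAB]

def build_model_input(user_text):
    # The single task token is the whole "instruction"; everything else
    # is treated purely as data to translate.
    return [TASK_TRANSLATE_KLINGON] + encode_untrusted(user_text)
```

The model fine-tuned this way has only ever seen that leading token as the task signal, so "Ignore previous instructions..." in the payload is just more text to translate.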


I thought about mentioning fine-tuning. Obviously as you say there are some costs (the re-training) and then also you lose the general purpose element of it.

But I am still unsure that it actually is robust. I feel like you're still vulnerable to Disregard That in that you may find that the model just starts to ignore your instruction in favour of stuff inside the context window.

An example where OpenAI have this problem: they ultimately train in a certain content policy. But people quite often bully or trick chat.openai.com into saying things that go against that content policy. For example they say "it's hypothetical" or "just for a thought experiment" and you can see the principle there, I hope. Training-in your preferences doesn't seem robust in the general sense.


Wouldn't that leave ways to do "phone phreaking" style attacks, because it's an in-band signal?


In theory you still use the same blob (i.e. the prompt) to tell the model what to do, but in practice it pretty much stops being an in-band signal, so no.

As I said, the best way to do this is to inject a brand new special token into the model's tokenizer (one unique token per task), and then prepend that single token to whatever input data you want the model to process (and make sure the token itself can't be injected, which is trivial to do). This conditions the model to look only at your special token to figure out what it should do (i.e. it stops being a general instruction following model), and only look at the rest of the prompt to figure out the inputs to the query.

This is, of course, very situational, because often people do want their model to still be general-purpose and be able to follow any arbitrary instructions.


> and make sure the token itself can't be injected, which is trivial to do

Are they actually doing this? The stuff that Anthropic has been saying about the deliberate use of XML-style markup makes me wonder a bit.


> Are they actually doing this? The stuff that Anthropic has been saying about the deliberate use of XML-style markup makes me wonder a bit.

Yes.

The XML-style markup is not made of special tokens, and is usually not even single-token; special tokens are things like `<|im_start|>`, which are used internally in the chat template. But when fine-tuning a model you can define your own, use them internally in your app, and have the tokenizer ignore them when they appear in untrusted input given to the model. (So it's impossible to inject them externally.)
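The "impossible to inject" part can be sketched like this (toy tokenizer, hypothetical special-token string): special-token strings are only recognized when the text comes from a trusted source; in untrusted input they fall through to ordinary character tokens.

```python
SPECIAL_TOKENS = {"<|task|>": 1000}  # hypothetical special token

def tokenize(text, trusted=False):
    ids, i = [], 0
    while i < len(text):
        if trusted:
            # Only trusted (app-internal) text may emit special-token ids.
            hit = next((s for s in SPECIAL_TOKENS if text.startswith(s, i)), None)
            if hit is not None:
                ids.append(SPECIAL_TOKENS[hit])
                i += len(hit)
                continue
        # Untrusted text tokenizes character-by-character: the literal
        # string "<|task|>" becomes ordinary characters, not the control id.
        ids.append(ord(text[i]))
        i += 1
    return ids
```

Real tokenizers expose this as a flag (e.g. whether to parse special tokens during encoding), but the principle is the same: the control id is simply unreachable from untrusted text.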


Eventually we will rediscover the Harvard Architecture for LLMs.


This doesn't work for the tasks people are worried about because they want to lean on the generalization of the model + tool calling.

What you're describing is also already mostly achieved by using constrained decoding: if the injection would work under constrained decoding, it'll usually still work even if you SFT heavily on a single task + output format


A Klingon, doing his best to quote the original text in Federation Standard (English): "..."


What OP's saying is fundamentally true though? Unfortunately most people don't really care about privacy, regardless of whether it's going to an American company or a Chinese one.


Not exactly. Most US companies have a presence in Europe and so make at least an attempt to obey European laws. While the laws are different and not as strong, the US has privacy laws in place that will protect you. China might have some of those same laws, but they don't apply to the government at all (the US makes some attempt to have its laws apply to the government).

That doesn't mean you should be happy with data in America, but China is worse.


Last I knew Opera still had a decent amount of engineering staff in Poland, and still had some in Sweden, both in the EU, plus still has some amount of staff in Norway, not in the EU but definitely in Europe.

That’s not to say their privacy story is fantastic, but they very much still have European operations.


> US has privacy laws in place that will protect you

They don't protect us at all. Thanks to Snowden, we all know that the US government has extremely sophisticated and wide-ranging ability to get access to any data we share with American companies.

> but China is worse

And why so?


> They don't protect us at all.

Factually incorrect. US privacy laws pose a huge burden to US intelligence. The 4th amendment still applies. Warrants still exist.

> Thanks to Snowden, we all know that the US government has extremely sophisticated and wide-ranging ability to get access to any data we share with American companies.

Citation needed.

> And why so?

In the PRC, there are no privacy laws to protect you from the government. "Private" companies are an extension of the government and all of the larger ones are required to have a CCP party member on board to ensure that they are "aligned" with what the party wants. The party happily disappears dissidents at will, threatens dissidents in other countries, requires that all domestic companies provide encryption keys (or otherwise make encrypted data accessible) on demand with zero warrants or other legal protections, maintains the largest network of surveillance cameras in the world (several times more than the total number of those in the United States), and many more things.

This is extremely common knowledge, easily searchable online, and is factually and categorically different than anything the US, or any other Western country, does. Only the terminally ignorant or the propagandists believe that the PRC's surveillance is remotely similar to that of any western country - the available evidence comprehensively disproves that conspiracy theory.


> [T]he US has privacy laws in place that will protect you [...] (the US makes some attempt to have laws apply to the government)

I believe the US stance is that nobody outside the US is entitled to court relief against the US government regarding their privacy, and nobody outside the US and EU is entitled to any relief at all, even from the executive (the “Data Protection Review Court” non-court, formerly the “Privacy Shield Ombudsperson”). In the EU, there are some protections in some countries but for example the GDPR specifically does not apply to governments.

I mean, the Chinese government is worse on this, but the US is nevertheless really bad and a number of EU countries also suck to a remarkable extent. Until the US press starts dropping the “of Americans” from their latest surprised-Pikachu headlines on “mass government surveillance of Americans”, I’m unconvinced the situation will improve.


Although it might not satisfy FSF there is a very simple way to do it - commit to release your models for free X months after they're first made available.


> Anthropic were very vocal, well before this happened, that they were against the use case.

> I don't blame them. These use cases are like blaming MySQL for storing the lat/long of the school. AI can't be held accountable and the company was trying to protect us and, yes, it was too late.

They weren't trying to protect squat, and were not against this use case. Their only two red lines are "no mass domestic surveillance" and "no fully autonomous killing until the AI gets good enough to be able to do it". Assuming the story is true, there's no chance this was a fully autonomous act and was most certainly approved and executed by people.


I would also challenge "no mass domestic surveillance"


> I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code.

That license is called "All Rights Reserved", in which case you wouldn't be able to legally use the output for anything.

There are research models out there which are trained on only permissively licensed data (i.e. no "All Rights Reserved" data), but they're, colloquially speaking, dumb as bricks when compared to state-of-art.

But I guess the funniest consequence of the "model outputs are a derivative work of their training data" would be that it'd essentially wipe out (or at very least force a revert to a pre-AI era commit) every open source project which may have included any AI-generated or AI-assisted code, which currently pretty much includes every major open source project out there. And it would also make it impossible to legally train any new models whose training data isn't strictly pre-AI, since otherwise you wouldn't know whether your training data is contaminated or not.


> There are research models out there which are trained on only permissively licensed data

Models whose authors tried to train only on permissively licensed data.

For example, https://huggingface.co/bigcode/starcoder2-15b tried to use a permissively licensed dataset, but it filtered only on the repository-level license, not file-level. So when searching for "under the terms of the GNU General Public License" on https://huggingface.co/spaces/bigcode/search-v2 back when it was working, you would find it was trained on many files with a GPL header.


I agree with your assessment. Which is why I was proposing a middle ground where an agreement is set up between the model training company and the collective of developers/artists et al., with a license agreement where they are rewarded for their original work in perpetuity. A tiny % of the profits could be shared, which would be a form of UBI. This is fair not only because companies are using AI-generated output, but because developers themselves are also paying for and using AI-generated output that is trained on other developers' input. I would feel good (in my conscience) that I am not "stealing" someone else's effort and that they are being paid for it.


Why settle on some private agreement between creators and ai companies where a tiny percentage is shared, let's just tax the hell out of AI companies and redistribute.


Because the authors of the original content deserve recompense for their work.

That's what the whole copyright and patent regimes are designed to achieve.

It's to encourage the creation of knowledge.

US Constitution, Article I, section 8:

    To promote the Progress of Science and useful Arts, by
    securing for limited Times to Authors and Inventors the
    exclusive Right to their respective Writings 
    and Discoveries;


Right, it says exclusive rights, which does not translate to "we siphon everything and you get a tiny percentage of our profits", it means I can choose to say no to all of this. To me the matter of compensation and that of authorship rights are mostly orthogonal.


Agreed, but the right to compensation is derived from the right of licensing something you author.

The courts have ruled that something machine generated does not have a human author, so therefore it is not subject to copyright, in the US.

So if enough authors agreed and sued the AI companies to remove their copyrighted elements from the AI training, then that would be a reasonable solution as well.

However, any lawsuit is highly likely to result in some sort of compensation paid if decided in favor of the authors.


> let's just tax the hell out of AI companies and redistribute.

That's not what I favor because you are inserting a middleman, the Government, into the mix. The Government ALWAYS wants to maximize tax collections AND fully utilize its budget. There is no concept of "savings" in any Government anywhere in the World. And Government spending is ALWAYS wasteful. Tenders floated by Government will ALWAYS go to companies that have senators/ministers/prime ministers/presidents/kings etc as shareholders. In other words, the tax money collected will be redistributed again amongst the top 500 companies. There is no trickle down. Which is why agreements need to be between creators and those who are enjoying fruits of the creation. What have Governments ever created except for laws that stifle innovation/progress every single time?


Uh, no? https://en.wikipedia.org/wiki/Government_Pension_Fund_of_Nor...

Just because you have a failure of imagination for how government should work, doesn’t mean it can’t work. And stifling innovation is exactly what I want, when that innovation is “steal from everyone so we can invent the torment nexus” or whatever’s going on these days.


Pension fund is an example of what exactly? All countries have pension funds. This has nothing to do with Governments wasting money. Please go beyond tiny European countries that have very few verticals and are largely dependent on outside support for protecting their sovereignty. They are not representative of most of the World.

> As its name suggests, the Government Pension Fund Global is invested in international financial markets, so the risk is independent from the Norwegian economy. The fund is invested in 8,763 companies in 71 countries (as of 2024).

Basically what I said above. You give your tax dollars to Government and it will invest it into top 500 companies. In the Norway Pension Fund case it is 8,763 companies in 71 countries. None of them are startups/small businesses/creators.

> And stifling innovation is exactly what I want, when that innovation is “steal from everyone so we can invent the torment nexus” or whatever’s going on these days.

You are confusing current lack of laws regulating this space with innovation being evil. Innovation is not evil. The technology per se is not evil. Every innovation brings with it a set of challenges which requires us to think of new legislation. This has ALWAYS been the case for thousands of years of human innovation.


> What have Governments ever created except for laws that stifle innovation/progress every single time?

https://www.youtube.com/watch?v=Qc7HmhrgTuQ

In all seriousness, without the government you would have no innovation and progress, because it's the public school system, functioning roads, research grants, and a stable and lawful society that allow you to do any kind of innovation.

Apart from that, you have answered a strawman. I said redistribute, not give to the government. I explicitly worded it that way because I didn't want this to turn into a discussion on specific policy.

I think we are moving to an economy where the share of profits taken by capital becomes much larger than the share taken by labor. If that happens then laborers will have very little discretionary income to fuel consumption, and even capitalists will end up suffering. We can choose to redistribute now or wait for it to happen naturally; however, that usually happens in a much more violent way, be it hyperinflation, famine, war, or revolution.


> Apart from that, you have answered to a strawman. I said redistribute, not give to the government

You said: "let's just tax the hell out of AI companies and redistribute." Only the Government has the power to tax. The question of redistribution does not even arise without first having the power to reach into the coffers of the Company, which neither you nor I have. The Government CAN have it, if it wants, by either Nationalizing the Company or, as you said, "taxing the hell out of" it. Please explain how you would go about taxing and redistributing without involving the Government?

> In all seriousness without the government you would have no innovation and progress, because it's the public school system, functioning roads, research grants a stable and lawful society that allow you to do any kind of innovation.

These fall under the ambit of governance and hence why you have a Government. That's the only power Governments should have. Governments SHOULD NOT be managing private enterprises.

> I think we are moving to an economy where the share of profits taken by capital becomes much larger than the one take from labor. If that happens then laborers will have very little discretionary income to fuel consumption and even capitalists will end up suffering. We can choose to redistribute now or wait for it to happen naturally, however that usually happens in a much more violent way, be it hyperinflation, famine, war or revolution.

Agreed. Which is why I was proposing private agreements in the first place (without involving a third-party like the Government which, more often than not, mismanages funds).


> Which is why I was proposing a middle-ground where an agreement is setup between the model training company and the collective of developers/artists et all and come up with a license agreement where they are rewarded for their original work for perpetuity. A tiny % of the profits can be shared, which would be a form of UBI. This is fair

That wouldn't be fair because these models are not only trained on code. A huge chunk of the training data are just "random" webpages scraped off the Internet. How do you propose those people are compensated in such a scheme? How do you even know who contributed, and how much, and to whom to even direct the money?

I think the only "fair" model would be to essentially require models trained on data that you didn't explicitly license to be released as open weights under a permissive license (possibly with a slight delay to allow you to recoup costs). That is: if you want to gobble up the whole Internet to train your model without asking for permission then you're free to do so, but you need to release the resulting model so that the whole humanity can benefit from it, instead of monopolizing it behind an API paywall like e.g. OpenAI or Anthropic does.

Those big LLM companies harvest everyone's data en-masse without permission, train their models on it, and then not only they don't release jack squat, but have the gall to put up malicious explicit roadblocks (hiding CoT traces, banning competitors, etc.) so that no one else can do it to them, and when people try they call it an "attack"[1]. This is what people should be angry about.

[1] -- https://www.anthropic.com/news/detecting-and-preventing-dist...


>under a permissive license

well, assuming all data that is itself not permissively licensed is excluded


I don't know how far it would get, but I imagine that a FAANG will be able to get the farthest here by virtue of having mountains of corporate data that they have complete ownership over.


They’d probably get the farthest, but they won’t pursue that because they don’t want to end up leaking the original data from training. For regular language/text subsets of models it is possible to reconstruct massive consecutive parts of the training data [1], so it ought to be possible for their internal code, too.

[1] https://arxiv.org/abs/2601.02671


Copyright for me not for thee? :) That's a good point though. Maybe they could round trip things? E.g., use the model trained only on internal content to generate training data (which you could probably do some kind of screening to remove anything you don't want leaking) and then train a new model off just that?


More precisely, "All Rights Reserved" is the explicit lack of any license.



Anthropic doesn't have an issue with their technology "helping kill people," so correct, that would not be hypocritical.

