Per your point 4, some current hyped work is pushing hard in this direction [1, 2, 3]. The basic idea is to think of attention as a way of implementing an associative memory. Variants like SDPA or gated linear attention can then be derived as methods for optimizing this memory online such that a particular query will return a particular value. Different attention variants correspond to different ways of defining how the memory produces a value in response to a query, and how we measure how well the produced value matches the desired value.
Some of the attention-like ops proposed in this new work are most simply described as implementing the associative memory with a hypernetwork that maps keys to values with weights that are optimized at test time to minimize value retrieval error. Like you suggest, designing these hypernetworks to permit efficient implementations is tricky.
It's a more constrained interpretation of attention than you're advocating for, since it follows the "attention as associative memory" perspective, but the general idea of test-time optimization could be applied to other mechanisms for letting information interact non-linearly across arbitrary nodes in the compute graph.
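To make the "memory optimized at test time" framing concrete, here's a minimal toy sketch (my own illustration, not any of the cited papers' actual update rules): the memory is just a linear map from keys to values, and each write is one gradient step that reduces the retrieval error for that key/value pair.

    import numpy as np

    # Toy associative memory: a linear map M from keys to values, updated
    # online at "test time" by one gradient step per write so that reading
    # with a key approximately returns the value stored for it.
    d = 8      # key/value dimension (illustrative)
    lr = 0.1   # test-time learning rate
    M = np.zeros((d, d))

    def write(M, k, v, lr):
        err = M @ k - v                    # retrieval error under current memory
        return M - lr * np.outer(err, k)   # gradient step on 0.5 * ||M k - v||^2

    def read(M, q):
        return M @ q

    rng = np.random.default_rng(0)
    keys = rng.normal(size=(16, d))
    vals = rng.normal(size=(16, d))
    for k, v in zip(keys, vals):
        M = write(M, k, v, lr)

    print(read(M, keys[3]))  # a rough approximation of vals[3]

Different choices of memory parameterization (eg, swapping the linear map for a small hypernetwork), retrieval loss, and optimizer recover different attention variants.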
The trick is that the vision tokens are continuous valued vectors, while the text tokens are elements from a small discrete set (which are converted into continuous valued vectors by a lookup table). So, vision tokens can convey significantly more bits per token than text tokens. This allows them to pack the content of multiple text tokens into a single vision token.
Couldn't you do something like add a bidirectional encoder after your embedding lookup table to compress your text into some smaller-token-count semantic space before feeding your transformer blocks, to get a similar effect, then?
Yes, you can get good compression of a long sequence of "base" text tokens into a shorter sequence of "meta" text tokens, where each meta token represents the information from multiple base tokens. But, grouping a fixed number of base tokens into each meta token isn't ideal, since that won't align neatly with sensible semantic boundaries, like words, phrases, sentences, etc. So, the trick is how to decide which base tokens should be grouped into each meta token...
This sort of "dynamic chunking" of low-level information, perhaps down to the level of raw bytes, into shorter sequences of meta tokens for input to some big sequence processing model is an active area of research. Eg, one neat paper exploring this direction is: "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" [1], from one of the main guys behind Mamba and other major advances in state-space models.
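As a toy illustration of the compression idea (not the paper's dynamic chunking method; the names and sizes below are made up), here's a sketch where a fixed stride of 4 base-token embeddings is mean-pooled into each meta token. The paper's contribution is essentially to replace that fixed stride with learned, data-dependent boundaries.

    import numpy as np

    # Fixed-ratio "chunking": every 4 base-token embeddings are mean-pooled
    # into one meta token before the transformer blocks. Dynamic chunking
    # would instead predict the boundaries so chunks align with words,
    # phrases, sentences, etc.
    vocab_size, d_model, stride = 1000, 64, 4
    embedding = np.random.randn(vocab_size, d_model) * 0.02  # lookup table

    def compress(token_ids, embedding, stride):
        x = embedding[token_ids]                  # (T, d_model) base embeddings
        T = (len(token_ids) // stride) * stride   # drop the ragged tail for simplicity
        chunks = x[:T].reshape(-1, stride, x.shape[-1])
        return chunks.mean(axis=1)                # (T // stride, d_model) meta tokens

    token_ids = np.random.randint(0, vocab_size, size=37)
    print(compress(token_ids, embedding, stride).shape)  # ~4x fewer tokens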
A breakthrough in image generation speed literally came from applying better differential equations for diffusion, taken from statistical mechanics physics papers:
I think there's an implicit assumption here that interaction with the world is critical for effective learning. In that case, you're bottlenecked by the speed of the world... when learning with a single agent. One neat thing about artificial computational agents, in contrast to natural biological agents, is that they can share the same brain and share lived experience, so the "speed of reality" bottleneck is much less of an issue.
Yeah I'm envisioning putting a thousand simplistic robotic "infants" into a vast "playpen" to gather sensor data about their environment, for some (probably smaller) number of deep learning models to ingest the input and guess at output strategies (move this servo, rotate this camshaft this far in that direction, etc) and make predictions about resulting changes to input.
In principle a thousand different deep learning models could all train simultaneously on a thousand different robot experience feeds; not 1-to-1, but 1-to-many, with each neural net training on data from dozens or hundreds of the robots at the same time, and different neural nets sharing those feeds for their own rounds of training.
Then of course all of the input data, paired with the outputs that were tested and the further inputs that serve as ground truth for the predictions, can be recorded for continued training sessions after the fact.
Never thought I’d get to do this, but this was my master's research! Simulations are inherently limited and I just got tired of robotics research being done only in simulations. So I built a novel soft robot (notoriously difficult to control) and got it to learn by playing!!
Here is an informal talk I gave on my work. Let me know if you want the thesis
A very interesting idea. I am curious about this sharing and blending of the various nets; I wonder if something as naive as averaging the weights (assuming the neural nets all have the same dimensions) would actually accomplish that?
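For what it's worth, here's roughly what I mean by the naive version (a sketch assuming identical architectures; it's basically the federated-averaging idea, and whether it actually blends what the nets learned is exactly the open question):

    import numpy as np

    # Naive weight averaging across nets with identical architectures.
    # Tends to work best when the nets started from the same initialization.
    def average_weights(models):
        # models: list of dicts mapping parameter name -> ndarray (same shapes)
        return {name: np.mean([m[name] for m in models], axis=0)
                for name in models[0]}

    net_a = {"w1": np.random.randn(4, 4), "b1": np.random.randn(4)}
    net_b = {"w1": np.random.randn(4, 4), "b1": np.random.randn(4)}
    blended = average_weights([net_a, net_b])
    print(blended["w1"].shape)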
Basically everything applicable to the playpen of a human baby is applicable to the playpen of an AI robot baby in this setup, to at least some degree.
Perhaps the least applicable part is that "robot hurting itself" has the liability of some cost to replace the broken robot part, vs the potentially immeasurable cost of a human infant injuring themselves.
If it's not a good idea to put a "glass vessel" in a human crib (strictly from an "I don't want the glass vessel to be damaged" sense) then it's not a good idea to put that in the robot-infant crib either.
Give them something less expensive to repair, like a stack of blocks instead. :P
> In that case, you're bottlenecked by the speed of the world
Why not have the AI train on a simulation of the real world? We can build those pretty easily using traditional software and run them at any speed we want.
But, if empirically our current system for net wealth creation tends to also produce wealth concentration, it makes sense to consider ways of modifying the system to mitigate some of the wealth concentration while maintaining as much of the wealth creation as possible.
The target you should look for is how much wealth gets created for the least well-off (or for some low-percentile representative person). Just don't worry about what the rich people are doing at all. No need to punish them.
Where is the "wealth created for the least well off" going to come from?
Necessarily, that must be wealth that didn't go to the rich instead (it could have!). So, necessarily, you are "punishing" them by doing so.
You mainly seem to be against some kind of hypothetical Robin-Hood-esque redistribution because you worry it's unfair to the rich. Any solution, though, will have to take this shape, whether it targets the existing wealth or wealth generated going forward. It's all about redistribution of access no matter how you slice it.
You don't need to be so protective of the rich. They are doing just fine and they have plenty of resources and mechanisms in place to protect themselves. If the world's wealthiest people were made even just a tiny bit less wealthy by redistribution of assets they would still be living like absolute kings.
> Where is the "wealth created for the least well off" going to come from?
Well, mostly where everyone's wealth is coming from: from the fruits of their own labour.
> You mainly seem to be against some kind of hypothetical robinhoodesque style redistribution because you worry it's unfair to the rich.
No, I haven't started worrying about fairness, yet. No, I'm afraid that a tax system designed by what sounds good instead of what works will leave the poor even worse off.
Only a tiny fraction of a billionaire's wealth tends to be the fruit of their personal labor. It's the labor of their employees and machines that creates the wealth. To my understanding, this is broadly accepted.
Now, billionaires do supply a different key ingredient to the wealth creation: risk. Without investment and risk, wealth cannot be created. In terms of $ investment, billionaires take on the vast majority of the risk and deserve the bulk of the rewards, the argument goes. Workers take on far less risk with their guaranteed* paycheck.
But which is the bigger risk? A billionaire's $100,000,000? Or your home, your health, and your retirement savings were you to lose your job in a bad market?
I'm interested in company structures that incentivize distributing risk, profit, and power across a larger group than we tend to see in modern companies.
> I'm interested in company structures that incentivize distributing risk, profit, and power across a larger group than we tend to see in modern companies.
Please feel free to start your own company or cooperative.
> But which is the bigger risk? A billionaire's $100,000,000? Or your home, your health, and your retirement savings were you to lose your job in a bad market?
This parallels the diminishing marginal utility of wealth, which states that with extreme wealth, you can't buy any more to get more utility or happiness.
In a way, the risk phenomenon picks up where that phenomenon leaves off, where the need for normal "utility" gives way to the desire for amassing power over society at large.
The mistake they make is not realizing how much of their wealth and welfare relies on the welfare of the masses.
> I'm interested in company structures that incentivize distributing risk, profit, and power across a larger group than we tend to see in modern companies.
Ironically this is a tiny bit of what we saw with employee stock options in the early days of the internet industry, reflected in the historically outsized power and voice of workers. Arguably, that is a part of the rationale behind the big tech layoffs - to put labor back in its place.
> This parallels the diminishing marginal utility of wealth, which states that with extreme wealth, you can't buy any more to get more utility or happiness.
You can buy more. The utility just diminishes, but doesn't go to zero.
> Ironically this is a tiny bit of what we saw with employee stock options in the early days of the internet industry, reflected in the historically outsized power and voice of workers. Arguably, that is a part of the rationale behind the big tech layoffs - to put labor back in its place.
The bigger relative risk is precisely why the billionaire is so rich - their surplus wealth may be wagered against longer odds when it would be suicidally reckless to yolo your life's savings into a start-up. Those sorts of bets are the Venture Capital strategy.
The relative value of money being lower is what enables riskier investments and essentially what 'justifies' inequality in a bloodless utilitarian sort of way. You know how in economics trades may be net positives due to different valuations between individuals? The same applies to current certain money vs future risky unbounded returns. That taking such bets is consistently a successful strategy breeds inequity even without any winner-takes-all effects or high barriers to entry.
Hypothetically, if the VCs kept on 'gambling' on failed start-ups and always losing without any offsetting huge wins, not quitting because they think a win is just around the corner, it would be a trend that reduces inequity, as it puts money into the pockets of employees and smaller suppliers of necessary capital production goods.
I am afraid you would find it harder to get larger groups of people to agree to high-growth-potential, high-risk enterprises, because they tend to lack the spare capital to afford to risk it. I think ironically the most probable tolerable risk profile for larger groups (who are presumably more precarious) is something big and secure being sold out of by larger players. (Small traders panic buying and selling and doing worse is its own separate problem.)
One form of company structure that technically does a better job of paying labor well is the partnership typically used by law firms. It works for them because they have no real capital requirements and high per-hour productivity, and labor expenses dominate, since the lion's share of profits goes to the lawyers whose names are in the company name.
You clearly believe you're very objective and applying very "rational" thinking to the problem. It's about the dollar value of the income of the least well-off, so why are these stupid people even talking about inequality? Don't they realise making a poor person 10% worse off and Bezos 11% worse off reduces inequality but lowers the floor (the pedestrian argument you've made several times in this thread)?
But please consider that the problem is slightly (i.e. a lot) more complicated than you think. Economics is a very, very hard discipline, and perhaps more closely related to philosophy than the natural sciences. There have been countless books written on the topic of inequality by people smarter than you or me, so it's highly unlikely it's all as simple as your dismissive "just do X" line imagines it to be.
A simple, almost trivial observation: very high inequality of wealth also means very high inequality of power, meaning the rich elite can and will influence the political process to enrich themselves further at the expense of the "low percentile" less well-off, which will be denied political power. This is one example of why you should care about inequality.
> But please consider that the problem is slightly (i.e. a lot) more complicated than you think. Economics is a very very hard discipline [...]
Yes, and that's why I am saying that it's far from an obvious conclusion that making rich people worse off is a good thing for poor people.
And once you admit that this ain't trivial, you can look at topics like deadweight losses or tax incidence.
Different tax and redistribution systems have different effects. It's not just 'more tax = more revenue to redistribute'.
For example, I actually think you can drive overall tax rates (eg as percentage of GDP) a lot higher than they are today in most countries without harming the economy, _if_ you switch to something as efficient as land value taxes for the vast majority of your government revenue (and lower other taxes). Property taxes are a second best approximation.
In contrast, capital gains taxes and income taxes are less efficient. Tariffs are even worse (by a large margin!), even if they could theoretically raise some revenue. Stamp duties or other taxes on transactions are also pretty bad. And silly things like price controls just hurt the economy without raising any revenue at all.
But that's all vastly simplified. As you suggest, there's lots of theory and practice you can investigate for the actual effects. They might also differ in different times and places.
> There have been countless books written on the topic of inequality by people smarter than you or me, so it's highly unlikely it's all as simple as your dismissive "just do X" line imagines it to be.
That's why I'm saying exactly the opposite: I'm arguing against the naive 'just tax the rich'.
Wealth redistribution has this positive effect: If you take $1000 from a billionaire and give it to a very poor person, total happiness increases.
It also has a negative effect: a high level of redistribution can inhibit production.
The optimal level of redistribution depends on what you're optimizing, it's usually a mix of societal happiness and some notion of fairness. (I personally would want to optimize happiness and prosperity.)
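To make the first point concrete, here's the usual toy calculation, assuming logarithmic utility of wealth (a textbook simplification I'm using for illustration, not a real welfare model):

    import math

    # Transfer $1000 from a billionaire to someone with $10k, under log utility.
    billionaire, poor, transfer = 1_000_000_000, 10_000, 1_000
    before = math.log(billionaire) + math.log(poor)
    after = math.log(billionaire - transfer) + math.log(poor + transfer)
    print(after - before)  # positive: the poor person's utility gain outweighs the billionaire's loss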
Most of the people pursued in these "AI talent wars" are folks deeply involved in training or developing infrastructure for training LLMs at whatever level is currently state-of-the-art. Due to the resources required for projects that can provide this sort of experience, the pool of folks with this experience is limited to those with significant clout in orgs with money to burn on LLM projects. These people are expensive to hire, and can kind of run through a loop of jumping from company to company in an upward compensation spiral.
Ie, the skills aren't particularly complicated in principle, but the conditions needed to acquire them aren't widely available, so the pool of people with the skills is limited.
I'd say superintelligence is more about producing deeper insight, making more abstract links across domains, and advancing the frontiers of knowledge than about doing stuff faster. Thinking speed correlates with intelligence to some extent, but at the higher end the distinction between speed and quality becomes clear.
If anything, "abstract links across domains" is the one area where even very low intelligence AI's will still have an edge, simply because any AI trained on general text has "learned" a whole lot of random knowledge about lots of different domains; more than any human could easily acquire. But again, this is true of AI's no matter how "smart" they are. Not related to any "super intelligence" specifically.
Similarly, "deeper insight" may be surfaced occasionally simply by making a low-intelligence AI 'think' for longer, but this is not something you can count on under any circumstances, which is what you may well expect from something that's claimed to be "super intelligent".
I don't think current models are capable of making abstract links across domains. They can latch onto superficial similarities, but I have yet to see an instance of a model making an unexpected and useful analogy. It's a high bar, but I think that's fair for declaring superintelligence.
In general, I agree that these models are in some sense extremely knowledgeable, which suggests they are ripe for producing productive analogies if only we can figure out what they're missing compared to human-style thinking. Part of what makes it difficult to evaluate the abilities of these models is that they are wildly superhuman in some ways and quite dumb in others.
It is really more of a value judgement of the utility of the answer to a human.
Some kind of automated discovery, across all domain pairs, of answers a human would find utility in seems almost like the definition of an intractable problem.
Superintelligence just seems like marketing to me in this context. As if AGI is so 2024.
> It's a high bar, but I think that's fair for declaring superintelligence.
I have to disagree because the distinction between "superficial similarities" and genuinely "useful" analogies is pretty clearly one of degree. Spend enough time and effort asking even a low-intelligence AI about "dumb" similarities, and it'll eventually hit a new and perhaps "useful" analogy simply as a matter of luck. This becomes even easier if you can provide the AI with a lot of "context" input, which is something that models have been improving at. But either way it's not superintelligent or superhuman, just part of the general 'wild' weirdness of AI's as a whole.
I think you misunderstood what I meant about setting a high bar. First, passing the bar is a necessary but not sufficient condition for superintelligence. Secondly, by "fair for" I meant it's fair to set a high bar, not that this particular bar is the one fair bar for measuring intelligence. It's obvious that usefulness of an analogy generator is a matter of degree. Eg, a uniform random string generator is guaranteed to produce all possible insightful analogies, but would not be considered useful or intelligent.
I think you're basically agreeing with me. Ie, current models are not superintelligent. Even though they can "think" super fast, they don't pass a minimum bar of producing novel and useful connections between domains without significant human intervention. And, our evaluation of their abilities is clouded by the way in which their intelligence differs from our own.
Comparing the process of research to tending a garden or raising children is fairly common. This is an iteration on that theme. One thing I find interesting about this analogy is that there's a strong sense of the model's autoregressiveness here in that the model commits early to the gardening analogy and then finds a way to make it work (more or less).
The sorts of useful analogies I was mostly talking about are those that appear in scientific research involving actionable technical details. Eg, diffusion models came about when folks with a background in statistical physics saw some connections between the math for variational autoencoders and the math for non-equilibrium thermodynamics. Guided by this connection, they decided to train models to generate data by learning to invert a diffusion process that gradually transforms complexly structured data into a much simpler distribution -- in this case, a basic multidimensional Gaussian.
I feel like these sorts of technical analogies are harder to stumble on than more common "linguistic" analogies. The latter can be useful tools for thinking, but tend to require some post-hoc interpretation and hand waving before they produce any actionable insight. The former are more direct bridges between domains that allow direct transfer of knowledge about one class of problems to another.
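For the curious, here's a minimal sketch of the forward half of that diffusion idea (illustrative numbers only): data gets progressively noised toward an isotropic Gaussian according to a fixed schedule, and the generative model is trained to invert that process.

    import numpy as np

    # Forward diffusion: x_t ~ N(sqrt(abar_t) * x_0, (1 - abar_t) * I),
    # where abar_t is the cumulative product of (1 - beta_t).
    T = 1000
    betas = np.linspace(1e-4, 0.02, T)      # noise schedule (illustrative)
    alphas_bar = np.cumprod(1.0 - betas)

    def q_sample(x0, t, rng):
        eps = rng.normal(size=x0.shape)
        return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

    rng = np.random.default_rng(0)
    x0 = rng.normal(loc=3.0, scale=0.5, size=5)  # stand-in for structured data
    print(q_sample(x0, T - 1, rng))              # nearly pure standard Gaussian noise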
> The sorts of useful analogies I was mostly talking about are those that appear in scientific research involving actionable technical details. Eg, diffusion models came about when folks with a background in statistical physics saw some connections between the math for variational autoencoders and the math for non-equilibrium thermodynamics.
These connections are all over the place but they tend to be obscured and disguised by gratuitous divergences in language and terminology across different communities. I think it remains to be seen if LLM's can be genuinely helpful here, even though you are restricting to a rather narrow domain (math-heavy hard sciences) and one where human practitioners may well have the advantage. It's perhaps more likely that, as formalization of math-heavy fields becomes more widespread, these analogies will be routinely brought out as a matter of refactoring.
You wouldn't get 5 years to noodle -- maybe 1 or 2 at best. You're competing for your next thing against other smart folks who are going hard on maximizing publication rate and grant winning in their current thing. To continue with your riskier, bigger thinking you'd have to be ready to bet that: (i) you'll produce a highly impactful result before you start applying for your next thing and (ii) the high impactfulness of that result will be recognized in time to support your applications.
The most successful folks tend to mix talent and hard work with a bit of luck in terms of early gold striking to gain a quick boost of credibility that helps them draw other people into their fold (eg, grad students in a big lab) who can handle a lot of the metric maxxing to free up some (still not enough) time for more ambitious thinking.
One challenge with this line of argument is that the base model assigns non-zero probability to all possible sequences if we ignore truncation due to numerical precision. So, in a sense you could say any performance improvement is due to shifting probability mass towards good reasoning behaviors and away from bad ones that were already present in the base model.
I agree with your general point though. Ie, we need more thorough empirical investigation of how reasoning behavior evolves during RL training starting from the base model. And, current RL training results seem more like "amplifying existing good behavior" than "inducing emergent good behavior".
While it's true that the model assigns non-zero probabilities to all sequences by design, those probabilities can get a lot smaller. E.g. replace that 99% per-step success probability with 10% and suddenly the overall chance of a correct result is truly astronomically small.
For a novel reasoning strategy, I would expect at least a few individual tokens where the base model assigns much smaller probabilities than the reinforcement-learning trained one, as opposed to just being a little smaller but spread out over many tokens. (Which would better fit a "death by a thousand cuts" scenario.)
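A quick back-of-the-envelope on how fast per-token probabilities compound (the sequence length is made up for illustration):

    import math

    # "Non-zero probability" can still be astronomically small once it is
    # spread across many tokens.
    tokens = 1000
    print(0.99 ** tokens)             # ~4.3e-5: rare, but samplable in practice
    print(0.10 ** tokens)             # underflows to 0.0 in float64
    print(math.log10(0.10) * tokens)  # log10 probability: -1000, i.e. 1e-1000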
The best way to hit 3000 is cycling. A reasonably fit (70kg-100kg) cyclist should burn 600-800 cal/hr riding at a moderate pace, so 3000 is a 4-5hr ride. It wouldn't be unusual for an enthusiastic amateur cyclist to hit that 1-2x/week.
However, if you assume that 2000 calories is pretty much maintenance and you'll burn that anyway, then you only need somewhere around an hour and a half or two hours cycling. Also if you can replace a medium commute with cycling, then it's not that difficult to hit that target just through active travel. (I used to regularly cycle commute approx 37kms each way and I could easily hit 1000 calories on just one of the journeys).
Yeah. It's easy to get over 3000 total daily calories if you have, eg, an hour of cycle commute per day and then add some purposeful gym or running on top.
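The back-of-the-envelope version, using the burn rates quoted above:

    # Hours of riding needed to burn 3000 kcal at the rates quoted in this thread.
    target_kcal = 3000
    for kcal_per_hour in (600, 700, 800):
        print(f"{kcal_per_hour} kcal/hr -> {target_kcal / kcal_per_hour:.1f} hours")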
You could do it in a far more biomechanically efficient way on an elliptical, but overdoing cardio risks fewer type IIb fibers (wiry appearance) and hypertrophic cardiomyopathy.
Or incorporate more strength training that increases type IIb adaptations and greater BMR.
[1] https://arxiv.org/abs/2501.00663
[2] https://arxiv.org/abs/2504.13173
[3] https://arxiv.org/abs/2505.23735