Calling it now - RL finally "just works" for any domain where answers are easily verifiable. Verifiability was always a prerequisite, but the difference from prior generations (not just AlphaGo, but any nontrivial RL process prior to roughly mid-2024) is that the reasoning traces and/or intermediate steps can be open-ended with potentially infinite branching, no clear notion of "steps" or nodes and edges in the game tree, and a wide range of equally valid solutions. As long as the quality of the end result can be evaluated cleanly, LLM-based RL is good to go.
As a corollary, once you add in self-play with random variation, the synthetic data problem is solved for coding, math, and some classes of scientific reasoning. No more mode collapse, no more massive teams of PhDs needed for human labeling, as long as you have a reliable metric for answer quality.
This isn't just neat, it's important - as we run out of useful human-generated data, RL scaling is the best candidate to take over where pretraining left off.
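To make "reliable metric for answer quality" concrete, here's a toy sketch of the loop I have in mind; sample_solutions and verifier are stand-ins for an LLM and a domain-specific checker, not any particular system. Sample many open-ended attempts, keep only the ones the verifier accepts, and feed those back in as training data.

    import random

    def verifier(problem, answer):
        # Stand-in for a clean end-result check (unit tests, a proof checker, numeric ground truth...).
        return answer == problem["x"] + problem["y"]

    def sample_solutions(problem, k=8):
        # Stand-in for an LLM sampling k open-ended attempts; most are right, some are off by one.
        return [problem["x"] + problem["y"] + random.choice([-1, 0, 0, 1]) for _ in range(k)]

    def generate_synthetic_data(problems):
        accepted = []
        for p in problems:
            for ans in sample_solutions(p):
                if verifier(p, ans):
                    accepted.append((p, ans))  # only verified traces survive to the next training round
                    break
        return accepted

    problems = [{"x": random.randint(0, 9), "y": random.randint(0, 9)} for _ in range(100)]
    print(len(generate_synthetic_data(problems)), "verified examples")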
Skimmed the paper quickly. This does not look like RL. It's a genetic algorithm. In a previous life I worked in compbio (protein structure prediction); we built hundreds of such heuristic-based algorithms (Monte Carlo simulated annealing, GAs, ...). The moment you have a good energy function (one that provides some sort of gradient) and a fast enough sampling function (LLMs), you can do lots of cool optimization with sufficient compute.
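To illustrate, here's a minimal sketch of that family of methods, with mutate() stubbed out where a system like this would call an LLM to rewrite a candidate; nothing here is specific to the paper.

    import random

    def energy(candidate):
        # Stand-in scoring function; lower is better (runtime, loss, RMSD, ...).
        return abs(candidate - 42)

    def mutate(candidate):
        # Stand-in for the sampling step; in the LLM setting, a model proposes a rewritten program here.
        return candidate + random.randint(-5, 5)

    population = [random.randint(0, 100) for _ in range(20)]
    for generation in range(50):
        children = [mutate(random.choice(population)) for _ in range(20)]
        # Plain (mu + lambda) selection: keep the best of parents plus children. No gradients anywhere.
        population = sorted(population + children, key=energy)[:20]

    print("best:", population[0], "energy:", energy(population[0]))

With a good energy function and a fast sampler, that loop alone gets you surprisingly far.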
> This does not look like RL. It's a genetic algorithm.
Couldn't you say that if you squint hard enough, GA looks like a category of RL? There are certainly a lot of similarities, the main difference being how each new population of solutions is generated. I wouldn't be at all surprised if they're using a GA/RL hybrid.
This depends quite a bit on what you're trying to optimize.
Gradient descent is literally following the negative of the gradient to minimize a function. It requires a continuous domain, either analytical or numerical derivatives of the cost function, and has well-known issues in narrow valleys and other complex landscapes.
It’s also a local minimization technique and cannot escape local minima by itself.
_Stochastic_ gradient descent and related techniques can overcome some of these difficulties, but are still more or less local minimization techniques and require differentiable and continuous scoring functions.
In contrast, genetic algorithms try to find global minima, do not require differentiable scoring functions, and can operate on both continuous and discrete domains. They have their own disadvantages.
Different techniques for different problems. The field of numerical optimization is vast and ancient for a reason.
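A toy side-by-side, purely for illustration: gradient descent needs the derivative and follows it locally, while a crude evolutionary search only evaluates the function and can hop between basins.

    import math, random

    def f(x):   # multimodal objective with many local minima
        return x * x / 20 + math.sin(3 * x)

    def df(x):  # analytical derivative, required by gradient descent
        return x / 10 + 3 * math.cos(3 * x)

    # Gradient descent: local, derivative-based.
    x = 8.0
    for _ in range(200):
        x -= 0.05 * df(x)

    # Evolutionary search: derivative-free, keeps the best of random mutations.
    pop = [random.uniform(-10, 10) for _ in range(30)]
    for _ in range(200):
        pop = sorted(pop + [p + random.gauss(0, 1) for p in pop], key=f)[:30]

    print("gradient descent:", round(x, 2), "f =", round(f(x), 2))
    print("evolutionary search:", round(pop[0], 2), "f =", round(f(pop[0]), 2))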
You also need a base model that can satisfy the verifier at least some of the time. If all attempts fail, there's nothing there to reinforce. The reinforcement-learning algorithms themselves haven't changed much, but LLMs got good enough on many problems that RL could be applied. So for any given class of problem you still need enough human data to get initial performance better than random.
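One way to see the "nothing to reinforce" point, using group-relative rewards (GRPO-style advantages) as a representative example, a toy illustration rather than any particular implementation: a batch where every attempt fails gives zero advantage to every sample, so the update is a no-op.

    def advantages(rewards):
        # Group-relative advantage: each reward minus the group mean.
        mean = sum(rewards) / len(rewards)
        return [r - mean for r in rewards]

    print(advantages([0, 0, 0, 0]))  # all attempts fail -> [0.0, 0.0, 0.0, 0.0], no learning signal
    print(advantages([0, 0, 1, 0]))  # one verified success -> that sample is pushed up, the rest down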
This technique doesn't actually use RL at all! There’s no policy-gradient training, value function, or self-play RL loop like in AlphaZero/AlphaTensor/AlphaDev.
As far as I can read, the weights of the LLM are not modified. They do some kind of candidate selection via evolutionary algorithms for the LLM prompt, which the LLM then remixes. This process then iterates like a typical evolutionary algorithm.
IMO RL can only solve "easy" problems. The reason RL works now is that unsupervised learning is a general recipe for transforming hard problems into easy ones. But it can't go all the way to solutions; you need RL on top for that. Yann LeCun's "cherry on top" analogy was right.
Are there platforms that make such training more streamlined? Say I have some definition of success for a given problem and its data; how do I go about generating said RL model as quickly and easily as possible?
We're working on an OSS industrial-grade version of this at TensorZero but there's a long way to go. I think the easiest out of the box solution today is probably OpenAI RFT but that's a partial solve with substantial vendor lock-in.
This isn't quite RL, right...?
It's an evolutionary approach on specifically labeled sections of code optimizing towards a set of metrics defined by evaluation functions written by a human.
I suppose you could consider that last part (optimizing some metric) "RL".
However, it's missing a key concept of RL, which is the exploration/exploitation tradeoff.
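For reference, that tradeoff in its simplest form is an epsilon-greedy bandit, roughly like the toy below; as I read it, the evolutionary loop described above doesn't manage anything equivalent explicitly.

    import random

    true_means = [0.2, 0.5, 0.8]   # unknown to the agent
    counts = [0, 0, 0]
    estimates = [0.0, 0.0, 0.0]
    epsilon = 0.1                  # fraction of pulls spent exploring

    for _ in range(10_000):
        if random.random() < epsilon:
            arm = random.randrange(3)              # explore: try a random arm
        else:
            arm = estimates.index(max(estimates))  # exploit: pull the current best
        reward = 1 if random.random() < true_means[arm] else 0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean update

    print(estimates)  # converges toward [0.2, 0.5, 0.8]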
Most things are verifiable, just not with code. I'm not particularly excited for a world where everything is predictable. This is coming from a guy who loves forecasting/prediction modeling too, but one thing I hate about prediction modeling, especially from a hobbyist standpoint, is data. It's very hard to get useful data. Investors will literally buy into hospital groups to get medical data, for example.
There are monopolies on the coolest sets of data in almost all industries, all the RL in the world won't do us any good if those companies doing the data hoarding are only using it to forecast outcomes that will make them more money, not what can be done to better society.
Bravo! Planning your life in order to minimize deathbed regrets has always bothered me, because the nature of humanity is to want what it hasn't got. If you assume that, on average, people make correct decisions to work hard and pursue what matters to them at the opportunity cost of not enjoying quite enough free time, then their final wishes will naturally include the time they gave up to live the life they had. If, however, they had fully indulged the desire to enjoy and maximize free time, their wishes might instead have revolved around the unfulfilled potential thereby relinquished.
The problem, of course, is that the feeling of regret considers what may have been gained without reflecting on what would have been lost.
Now the right way to deal with this is some sort of self-consistent closure, where present you and past you with the same values and access to the same information (which could be anything from zero to complete knowledge of then-future outcomes and downstream effects) would make the same choices including both upside and downside. But that would be too complex for motivational advice, which is primarily about creating an inspirational mood, somewhat about positive first-order consequences, and not even a little bit about recursive self-consistency.
I doubt that the reflex to want what we don't have is in our nature. It wouldn't be selected for. That's how you kill off the herd in the spring and starve the following winter.
We work to ensure that others want what they don't have because we've built systems that rely on them continuing to do so. It creates a sort of logic that defines for us what counts as rational behavior. But when that logic meets another one and they each evaluate the other as irrational, there's no reason to expect that the want-what-you-dont-have logic is somehow more valid. If it seems so, it's just that more of us are under its spell than the other.
Who reassigned the species Brachiosaurus brancai to its own genus, and when?
---
Reasoned for 8 seconds
The user is asking about the reclassification of Brachiosaurus brancai to its own genus. Originally, Janensch described it as Brachiosaurus brancai in 1914. However, in 2009, Michael P. Taylor and Matt Wedel reassigned it to Giraffatitan brancai. Their paper, titled “Brachiosaurus brancai and the early evolution of titanosauriform sauropods,” argued for this change, distinguishing Giraffatitan brancai from Brachiosaurus. I’ll check the exact references to confirm the specifics.
It seems “Giraffatitan” was first proposed by Gregory S. Paul in 1988, when he reassigned Brachiosaurus brancai to its own genus. Although some confusion exists, it’s generally accepted that Paul was the main figure behind this change. Other references, such as a 2007 paper by Sereno and a 2009 paper by Taylor and colleagues, also discuss this reclassification. The 1988 reshuffling is usually seen as the turning point for Brachiosaurus brancai, although later research helped solidify this decision.
The species was reassigned to its own genus—Giraffatitan—by paleontologist Gregory S. Paul in 1988.
> 3. If a problem can be framed in a way that a coding agent can solve...
This reminds me of the South Park underwear gnomes. You picked a tool and set an expectation, then just kind of hand wave over the hard part in the middle, as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem.
Does it sometimes take 50x effort to understand a problem and the agent well enough to get that done? Are there classes of problems where it can't be done? Are either of those concerns something you can recognize before they impact you? At commercial quality, is it an accessible skill for inexperienced people, or do you need a mastery of coding, the problem domain, or the coding agent to be able to rely on it? Can teams recruit people who can reliably achieve any of this? How expensive is that talent? And so on.
> as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem.
It's not, but if you can (a) make it cheap to try out different types of framings (not all of them have to work) and (b) automate everything else, then the labor intensity of programming decreases drastically.
> At commercial quality, is it an accessible skill for inexperienced people
I'd expect the opposite: it would be an extremely inaccessible skill requiring high skill and high pay. But if 2 people can deliver as much as 15 people at higher quality and they're paid triple, it's still way cheaper overall.
I would still expect somebody following this development pattern to routinely discover a problem the LLM can't deal with and have to dive under the hood to fix it - digging down below multiple levels of abstraction. This would be Hard with a capital H.
> as though framing problems "in a way coding agents can solve" is itself a well-understood or bounded problem
It is eminently solvable! All that is necessary is to use a subset of language easier for the machine to understand and use in a very defined way; we could call this "coding language" or something similar. Even build tools to ensure we write this correctly (to avoid confusing the machine). Perhaps we could define our own algorithms using this "language" to help them along!
We've had failed projects since long before LLMs. I think there is a tendency for people to gloss over this (3.) regardless, but working with an LLM it tends to become obvious much more quickly, without investing tens/hundreds of person-hours. I know it's not perfect, but I find a lot of the things people complain about would've been a problem either way - especially when people think they are going to go from 'hello world' to SaaS-billionaire in an hour.
I think mastery of the problem domain is still important, and until we have effectively infinite context windows (that work perfectly), you will need to understand how and when to refactor to maximize quality and relevance of data in context.
Well, according to xianshou's profile they work in finance, so it makes sense to me that they would gloss over the hard part of programming when describing how AI is going to improve it.
Working in one domain does not preclude knowledge of others. I work in cybersec but spent my first working decade in construction estimation for institutional builds. I can talk confidently about firewalls or the hospital you want to build.
No need to make assumptions based on a one-line hacker news profile.
> 5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.
Not necessarily. You can get the same benefits you described in (1)-(3) by using clearly defined modules in your codebase; they don't need to be separate microservices.
I think the argument is that the extra value provided is a small enough context window for working with an LLM. Although I'd suggest making it a library if one can manage it; that gives you the desired context reduction, bounded by interfaces, without taking on the complexities of adding an additional microservice.
I imagine throwing a test at an LLM and saying:
> hold the component under test constant (as well as the test itself), and walk the versions of the library until you can tell me where they're compatible and where they break.
If you tried to do that with a git bisect and everything in the same codebase, you'd end up varying all three (test, component, library) which is worse science than holding two constant and varying the third would be.
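A rough sketch of that workflow, with made-up package and test names; it just reinstalls each library version and reruns the same pinned test, which an agent could drive or a human could run directly:

    import subprocess

    VERSIONS = ["1.0", "1.1", "1.2", "2.0"]           # hypothetical releases of "somelib"
    TEST = "tests/test_component.py::test_contract"   # the one test we hold constant

    results = {}
    for v in VERSIONS:
        subprocess.run(["pip", "install", f"somelib=={v}"], check=True, capture_output=True)
        proc = subprocess.run(["pytest", "-q", TEST], capture_output=True)
        results[v] = "pass" if proc.returncode == 0 else "fail"

    for v, outcome in results.items():
        print(v, outcome)  # the first version that flips to "fail" is where compatibility breaks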
> I think the argument is that the extra value provided is a small enough context window for working with an LLM.
I'm not sure moving something that could work as a function to a microservice would save much context. If anything, I think you are adding more context, since you would need to talk about the endpoint and have it route to the function that does what you need. When it is all over, you still need to describe what the input and output are.
Oh certainly. I was arguing that if you need more isolation than a function gives you, don't jump to the conclusion that you need a service. Consider a library as a middle ground.
I wonder if we'll see a return of the kind of interface file present in C++, OCaml, and Ada. These files, well commented, are naturally the context window to use for reference for a module.
Even if languages don't grow them back as a first class feature, some format that is auto generated from the code and doesn't include the function bodies is really what is needed here.
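Something in that direction is already cheap to approximate. A rough sketch using Python's ast module to dump just the signatures, class names, and first docstring lines of a module, skipping the function bodies:

    import ast, sys

    def interface(path):
        # Emit a module's surface: signatures, class names, and docstring summaries, no bodies.
        tree = ast.parse(open(path).read())
        out = []

        def emit(node, indent=""):
            for child in node.body:
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    kw = "async def" if isinstance(child, ast.AsyncFunctionDef) else "def"
                    out.append(f"{indent}{kw} {child.name}({ast.unparse(child.args)}):")
                    doc = ast.get_docstring(child)
                    if doc:
                        out.append(indent + '    """' + doc.splitlines()[0] + '"""')
                elif isinstance(child, ast.ClassDef):
                    out.append(f"{indent}class {child.name}:")
                    emit(child, indent + "    ")

        emit(tree)
        return "\n".join(out)

    if __name__ == "__main__":
        print(interface(sys.argv[1]))  # e.g. python interface.py mymodule.py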
Indeed; I think there's a strong possibility that there are certain architectural choices where LLMs can do very well, and others where they would struggle.
There are with humans, but it's inconsistent; personally I really dislike VIPER, yet I've never felt the pain others insist comes with too much in a ViewController.
Yeah, I think monorepos will be better for LLMs. Easier to refactor module boundaries as context grows or requirements change.
But practices like stronger module boundaries, module docs, acceptance tests on internal dev-facing module APIs, etc are all things that will be much more valuable for LLM consumption. (And might make things more pleasant for humans too!)
So having clear requirements, a focused purpose for software, and a clear boundary of software responsibility makes for a software development task that can be accomplished?
If only people had figured out at some point that the same thing applies when communicating to human software engineers.
If you know what you are doing, then yes. If you are a domain expert and can articulate your thoughts clearly in a prompt, you will most likely see a boost—perhaps two to three times—but ten times is unlikely. And if you don't fully understand the problem, you may experience a negative effect.
I think it also depends on how much yak-shaving is involved in the domain, regardless of expertise. Whether that’s something simple like remembering the right bash incantation or something more complex like learning enough Terraform and providers to be able to spin up cloud infrastructure.
Some projects just have a lot of stuff to do around the edges and LLMs excel at that.
You don't need microservices for that, just factor your code into libraries that can fit into the context window. Also write functions that have clear inputs and outputs and don't need to know the full state of the software.
One framing is that effective context window (i.e., the length the model is able to effectively reason over) determines how useful the model is. A human new-grad programmer might effectively reason over hundreds or thousands of tokens but not millions - which is why we carefully scope the work and explain where to look for relevant context only. But a principal engineer might reason over many millions of tokens of context - code, yes, but also organizational and business context.
Trying to carefully select those 50k tokens is extremely difficult for LLMs/RAG today. I expect models to get much longer effective context windows but there are hardware / cost constraints which make this more difficult.
50K context is an interesting number because I think there's a lot to explore with software within an order of magnitude of that size. With apologies to Richard Feynman, I call it "There's plenty of room in the middle." My idea there is that the rapid expansion of computing power during the reign of Moore's law left the design space of "medium-sized" programs under-explored. These would be programs in the range of hundreds of kilobytes to low megabytes.
Might be a boon for test-driven development. Could turn out that AI coding is the killer app for TDD. I had a similar thought about a year ago but had forgotten; appreciate the reminder.
It doesn't have to be microservices.
You can use modular architecture. You can use Polylith. You can have boundaries in your code and mock around them.
> 5. Therefore all new development should be ~~microservices~~ modules written from scratch and interacting via cleanly defined APIs.
We figured this out for humans almost 20 years ago. Some really good empirical research. It's the only approach to large scale software development that works.
But it requires leadership that gives a shit about the quality of their product and value long-term outcomes over short-term rewards.
By large scale do you mean large software or large numbers of developers? Because there's some absolutely massive software out there - in terms of feature set, usefulness, and even LoC (not that that's a useful measurement) - made by very small teams.
I'm not sure that you've got the causal relationship the right way around here re: architecture vs. team size.
What does team size have to do with this? Small teams can (and should) absolutely build modularized software ...
You simply cannot build a [working/maintainable] large piece of software if everything is connected to everything and any one change may cause issues in conceptually unrelated pieces of code. As soon as your codebase is bigger than what you can fully memorize, you need modules, separation of concerns, etc.
I think you might be missing that Swizec edited the quote, crossing out microservices and correcting it to modular systems. It seems to me you're both in violent agreement.
The newline thing is the motivating example in the introduction, using Llama 3 8B Instruct with up to 200 newlines before the question. If you want to reproduce this example with another model, you might have to increase the number of newlines all the way to the context limit. (If you ask the API to give you logprobs, at least you won't have to run multiple trials to get the exact probability.)
But the meat of the paper is the Shapley value estimation algorithm in appendix A4. And in A5 you can see that different models giving different results is to be expected.
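For anyone who hasn't run into Shapley values before, the generic Monte Carlo flavor looks roughly like this (the textbook estimator, not necessarily the exact A4 algorithm): sample random orderings of the prompt components and average each component's marginal contribution to the score. The value function below is a toy stand-in for whatever the paper actually scores, something like the model's probability of the correct answer given that subset of the prompt.

    import random

    def shapley_estimate(features, value, n_samples=2000):
        # Monte Carlo Shapley values: average marginal contribution over random orderings.
        contrib = {f: 0.0 for f in features}
        for _ in range(n_samples):
            order = random.sample(features, len(features))
            included = set()
            prev = value(included)
            for f in order:
                included.add(f)
                cur = value(included)
                contrib[f] += cur - prev
                prev = cur
        return {f: c / n_samples for f, c in contrib.items()}

    # Toy stand-in for the real scoring step, which would query the LLM.
    def value(subset):
        return 0.6 * ("question" in subset) + 0.3 * ("newlines" in subset) + 0.05 * len(subset)

    print(shapley_estimate(["question", "newlines", "system_prompt"], value))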
One of those cases where the act of building the system serves as sufficient qualification in itself, even when the results of the system are mediocre.
I've thought about the optics of talking about my own automatic job application system or including it on my resume, but I thought it'd be risky given the topic (like how listing your own startup is frowned upon, if I'm not mistaken?). Is it normally considered a bit risky or taboo?
Well, it was during the technical interview, and it was just a couple of engineers doing the interview. I was just geeking out on all the problems I was solving (and creating).
And to be honest, I was too nervous to think about whether it was risky or taboo.
SWE-bench with a private final eval, so you can't hack the test set!
In a perfect world this wouldn't be necessary, but in the current research environment where benchmarks are the primary currency and are usually taken at face value, more unbiased evals with known methodology but hidden tests are exactly what we need.
$200 per month means it must be good enough at your job to replicate and replace a meaningful fraction of your total work. Valid? For coding, probably. For other purposes I remain on the fence.
The reality is more like: The frothy american economy over the past 20 years has created an unnaturally large number of individuals and organizations with high net worth who don't actually engage in productive output. A product like ChatGPT Pro can exist in this world because it being incapable of consistent, net-positive productive output isn't actually a barrier to being worth $200/month if consistent net-positive productive output isn't also demanded of the individual or organization it is augmenting.
The macroeconomic climate of the next ~ten years is going to hit some people and companies like a truck.
For 2024 the prediction is 2.6% for the US and 4.8% for China. I don't see how that's low compared to the US.
> high unemployment
5.1% China vs 4.1% USA
> huge debt from infrastructure spending
What do you mean by "huge" and compared to whom? The U.S. is currently running a $2 trillion deficit per year, which is about 6% of GDP, with only a fraction allocated to investments.
> weakening capital markets and real estate
China's economy operates differently from that of the U.S. Currently, China records monthly trade surpluses ranging between $80 billion and $100 billion.
The real estate sector indeed presents challenges, leading the government to inject funds into local governments to manage the resulting debt. The effectiveness of these measures remains to be seen.
There is a lot of wishful thinking on HN regarding the rivalry between China and the U.S.
The comparison is not between the US and China. I don't understand why people keep making that comparison when it's not at all apples-to-apples. It's featured in headlines constantly, but it's honestly a stick-measuring contest. For starters, the US is a free economy and China is a centrally planned one. There's significant chatter about China's numbers being massaged to suit the state's narrative, leaving would-be investors extra cautious, whereas in the US data quality and availability are state-of-the-art.
The real questions are: can China deliver on long-term expectations for its economy? Do the trends support the argument that it will become a leading developed economy? I don't think they do. If they don't, then is it an issue with the current economic plan that can be solved with a better plan, or is it a systemic issue that can't be solved in the near to medium term? These are way more useful questions than "who's going to win the race?"
>> Low growth
> For 2024 the prediction is 2.6% for the US and 4.8% for China. I don't see how that's low compared to the US.
> What do you mean by "huge" and compared to whom?
To answer in reverse: yes, the US also has a debt problem. That doesn't make China's problem less of an issue. The China debt crisis has been widely reported and is related to the other point about real estate. Those articles will definitely do a better job of explaining the issue than me, so here's just one: https://www.reuters.com/breakingviews/chinas-risky-answer-wa...
> There is a lot of wishful thinking on HN regarding the rivalry between China and the U.S.
I'm arguing there's no rivalry. Different countries, different problems, different scales entirely. China is in dire straits and I don't expect it to recover before the crisis gets worse.
> For starters, the US is a free economy and China is a centrally planned one.
USSR was a centrally planned economy, China is not. Do you mean subsidies (like the IRA and CHIPS Act in the US) for certain industries, which act as guidance to local governments and state banks? Is that what you call "centrally planned"?
> can China deliver on long-term expectations for its economy? Do the trends support the argument that it will become a leading developed economy? I don't think they do. If they don't, then is it an issue with the current economic plan that can be solved with a better plan, or is it a systemic issue that can't be solved in the near to medium term?
That's your opinion that they can't, and it's your right to have one. There were people 10 years ago saying exactly what you’re saying now. Time showed they were wrong.
> China is growing slower than historically and slower than forecasts, which had it at 5%. Look at this chart and tell me if it paints a rosy picture or a problematic one:
Oh come on, 4.8% vs. 5%? As for the chart, it's the most incredible growth in the history of mankind. No country has achieved something like this. It's fully natural for it to decline in percentage terms, especially when another major power is implementing legislation to curb that growth, forcing capital outflows, imposing technology embargoes, etc.
> China is in dire straits and I don't expect it to recover before the crisis gets worse.
Time will tell. What I can say is that in 18 of the last 20 centuries, China was the cultural and technological center of the world. So from China's perspective, what they are doing now is just returning to their natural state. In comparison, the US is only two centuries old. Every human organization, whether a company or a state, will sooner or later be surpassed by another human creation; there are no exceptions to this rule in all of human history. We have had many empires throughout our history. The Roman Empire was even greater at its peak than the US is now, and there were also the British Empire, the Spanish Empire, etc. Where are they now? Everything is cyclical. All of these empires lasted a few centuries and started to decline after around 200-250 years, much like the US now.
> I'm arguing there's no rivalry.
Come on, there is obvious rivalry. Just listen to US political elites and look at their actions—legislation. It's all about geopolitics and global influence to secure their own interests.
I wouldn't consider it a major problem, especially with the coming robotic revolution. Even if the population declines by half, that would still leave 700 million people, twice the population of the U.S. According to predictions, the first signs of demographic challenges are expected to appear about 15–20 years from now. That's a long time, and a lot can change in two decades. Just compare the world in 2004 to today.
It's a major mistake to underestimate your competition.
That's a long way out. We're barely past the first innings of the chatbot revolution and it's already struggling to keep going. Robotics are way more complex because physics can be cruel.
Show me what was possible 20 years ago versus what we can do now. I think you have enough imagination to envision what might be possible 20 years from now.
I don't really follow this line of thinking. $200 is nothing—nothing—in the context of the fully loaded cost of an employee for a month (at least, for any sort of employee who would benefit from using an LLM).