Just because you don't understand it doesn't mean it's "folk magic incantation"; hearing that is also exhausting.
I don't know the merit of what the parent is saying, but it does make some intuitive sense if you think about it. As the context fills up, the LLM pays less and less attention to things further back in the context; that's why the LLM seems dumber and dumber as a conversation goes on. If you put 5 instructions in the system prompt or initial message, where one acts as a canary, you can more easily see when exactly it stops following the instructions.
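To make that concrete, a system prompt of that shape might look something like this (purely a hypothetical illustration of the idea, not a tested recipe; the canary instruction is deliberately useless for the task):

    1. Write all new code in TypeScript.
    2. Keep functions under 40 lines.
    3. Never modify files under vendor/.
    4. Ask before adding any dependency.
    5. End every reply with the word "KINGFISHER".

The first reply that doesn't end with "KINGFISHER" is your hint that the instructions at the top are no longer being followed reliably.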
Personally, I always go for a one-shot answer, and if it gets it wrong or misunderstands, I restart from the beginning; if it doesn't get it right, I need to adjust the prompt and retry. It seems to me that all current models get a lot worse quickly once there is some back and forth.
> Just because you don't understand it, doesn't mean it's "folk magic incantation"
It absolutely is folk magic. I think it is more accurate to impugn your understanding than mine.
> I don't know the merit of what the parent is saying, but it does make some intuitive sense if you think about it.
This is exactly what I mean by folk magic. Incantations based on vibes. One's intuition is notoriously inclined to agree with one's own conclusions.
> If you put 5 instructions in the system prompt or initial message, where one acts as a canary, you can more easily see when exactly it stops following the instructions.
This doesn't really make much sense.
First of all, system prompts and things like agent.md never leave the context regardless of the length of the session, so the canary has absolutely zero meaning in this situation, making any judgements based on its disappearance totally misguided and simply a case of seeing what you want to see.
Further, even if it did leave the context, that doesn't then demonstrate that the model is "not paying attention". Presumably whatever is in the context is relevant to the task, so if your definition of "paying attention" is "it exists in the context" it's actually paying better attention once it has replaced the canary with relevant information.
Finally, this reasoning relies on the misguided idea that when the model produces an output that doesn't correspond to an instruction, the instruction must have escaped the context, rather than this simply being a case of the model doing the wrong thing, which is a regular occurrence even in short sessions that are obviously within the context.
> First of all, system prompts and things like agent.md never leave the context regardless of the length of the session, so the canary has absolutely zero meaning in this situation, making any judgements based on its disappearance totally misguided and simply a case of seeing what you want to see.
You're focusing on the wrong thing, ironically. Even if things are in the context, attention is what matters, and the intuition isn't about whether that thing is included in the context or not; as you say, it always will be. It's about whether the model will pay attention to it, in the Transformer sense, which it doesn't always do.
> It's about whether the model will pay attention to it, in the Transformer sense, which it doesn't always do.
Right... Which is why the "canary" idea doesn't make much sense. The fact that the model isn't paying attention to the canary instruction doesn't demonstrate that the model has stopped paying attention to some other instruction that's relevant to the task - it proves nothing. If anything, a better performing model should pay less attention to the canary since it becomes less and less relevant as the context is filled with tokens relevant to the task.
> This is exactly what I mean by folk magic. Incantations based on vibes
So, true creativity, basically? lol
I mean, the reason why programming is called a “craft” is because it is most definitely NOT a purely mechanistic mental process.
But perhaps you still harbor that notion.
Ah, I suddenly realized why half of all developers hate AI-assisted coding (I am in the other half). I was a Psych major, so code was always more “writing” than “gears” to me… It was ALWAYS “magic.” The only job where literally writing down words in a certain way produces machines that eliminate human labor. What better definition of magic is there, actually?
I’ll never forget the programmer _why. That guy’s Ruby code was 100% art and “vibes.” And yet it worked… Brilliantly.
Does relying on “vibes” too heavily produce poor engineering? Absolutely. But one can be poetic while staying cognizant of the haiku restrictions… O-notation, untested code, unvalidated tests, type conflicts, runtime errors, fallthrough logic, bandwidth/memory/IO costs.
Determinism. That’s what you’re mad about, I’m thinking. And I completely get you there- how can I consider a “flagging test” to be an all-hands-on-deck affair while praising code output from a nondeterministic machine running off arbitrary prompt words that we don’t, and can’t, even know whether they are optimal?
Perhaps because humans are also nondeterministic, and yet we somehow manage to still produce working code… Mostly. ;)
> I was a Psych major, so code was always more “writing” than “gears” to me… It was ALWAYS “magic.”
The magic is supposed to disappear as you grow (or you’re not growing). The true magic of programming is you can actually understand what once was magic to you. This is the key difference I’ve seen my entire career - good devs intimately know “a layer below” where they work.
> Perhaps because humans are also nondeterministic
We’re not, we just lack understanding of how we work.
I’m not talking about “magic” as in “I don’t understand how it works.”
I’m talking “magic” as in “all that is LITERALLY happening is that bits are flipping and logic gates are FLOPping and mice are clicking and keyboards are clacking and pixels are changing colors in different patterns… and yet I can still spend hours playing games or working on some code that is meaningful to me and that other people sometimes like because we have literally synthesized a substrate that we apply meaning to.”
We are literally writing machines into existence out of fucking NOTHING!
THAT “magic.” Do you not understand what I’m referring to? If not, maybe lay off the nihilism/materialism pipe for a while so you CAN see it. Because frankly I still find it incredible, and I feel very grateful to have existed now, in this era.
And this is where the connection to writing comes in. A writer creates ideas out of thin air and transmits them via paper or digital representation into someone else’s head.
A programmer creates ideas out of thin air that literally fucking DO things on their own (given a general-purpose computing hardware substrate).
> so code was always more “writing” than “gears” to me… It was ALWAYS “magic.”
> I suddenly realized why half of all developers hate AI-assisted coding (I am in the other half).
Thanks for this. It helps me a lot to understand your half. I like my literature and music as much as the next person, but when it comes to programming it's all about the mechanics for me. I wonder if this really does explain the split that there seems to be in every thread about programming and LLMs.
That is an artful quality, not an engineering one, even if the elegance leads to superior engineering.
As an example of beauty that is NOT engineered well, see the quintessential quicksort implemented in Haskell. Gorgeously simple, but not performant.
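For reference, the snippet people usually pass around is roughly this (a from-memory sketch of the folklore version, not anyone's canonical implementation):

    quicksort :: Ord a => [a] -> [a]
    quicksort []     = []
    quicksort (p:xs) =
      -- partition around the head, recurse on each side
      quicksort [x | x <- xs, x < p] ++ [p] ++ quicksort [x | x <- xs, x >= p]

Lovely to read, but it allocates fresh lists at every level and walks the remainder twice per partition, so performance-wise it bears little resemblance to an in-place quicksort.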
Creativity is meaningless without well-defined boundaries.
> it is most definitely NOT a purely mechanistic mental process.
So what? Nothing is. Even pure mathematics involves deep wells of creativity.
> Ah, I suddenly realized why half of all developers hate AI-assisted coding
Just to be clear, I don't hate AI-assisted coding; I use it, and I find that it increases productivity overall. However, it's not necessary to indulge in magical thinking in order to use it effectively.
> The only job where literally writing down words in a certain way produces machines that eliminate human labor. What better definition of magic is there, actually?
If you want to use "magic" as a euphemism for the joys of programming, I have no objection, when I say magic here I'm referring to anecdotes about which sequences of text produce the best results for various tasks.
> Determinism. That’s what you’re mad about, I’m thinking. And I completely get you there- how can I consider a “flagging test” to be an all-hands-on-deck affair while praising code output from a nondeterministic machine running off arbitrary prompt words that we don’t, and can’t, even know whether they are optimal?
I'm not mad about anything. It doesn't matter whether or not LLMs are deterministic; they are statistical, and vibes-based advice is devoid of any statistical power.
I think Marvin Minsky had this same criticism of neural nets in general, and his opinion carried so much weight at the time that some believe he set back the research that led to the modern-day LLM by years.
I view it more as fun and spicy. Now we are moving away from the paradigm that the computer is "the dumbest thing in existence", and that requires a bit of flailing around, which is exciting!
Folk magic is (IMO) a necessary step in our understanding of these new.. magical.. tools.
I won't begrudge anyone having fun with their tools, but folk magic definitely isn't a necessary step for understanding anything, it's one step removed from astrology.
I see what you mean, but I think it's a lot less pernicious than astrology. There are plausible mechanisms, it's at least possible to do benchmarking, and it's all plugged into relatively short feedback cycles of people trying to do their jobs and accomplish specific tasks. Mechanistic interpretability work might help make the magic more transparent & observable, and (surveillance concerns notwithstanding) companies like Cursor (I assume also Google and the other major labs, modulo self-imposed restrictions on using inference data for training) are building up serious data sets that can pretty directly associate prompts with results.

Not only that, I think LLMs in a broader sense are actually enormously helpful specifically for understanding existing code: when you don't just order them to implement features and fix bugs, but use their tireless abilities to consume and transform a corpus in a way that helps guide you to the important modules, explains conceptual schemes, analyzes diffs, etc. There are a lot of critical points to be made, but we can't ignore the upsides.
I'd say the only ones capable of really approaching anything like a scientific understanding of how to prompt these for maximum efficacy are the providers, not the users.
Users can get a glimpse and can try their best to be scientific in their approach, but the tool is of such complexity that we can barely skim the surface of what's possible.
That is why you see "folk magic", people love to share anecdata because.. that's what most people have. They either don't have the patience, the training or simply the time to approach these tools with rational rigor.
Frankly, it would be enormously expensive in both time and API costs to get anywhere near best practices backed up by experimental data, let alone coherent and valid theories about why a prompt technique works the way it does. And even if you built up this understanding or set of techniques, it might only work for one specific model; you might have to start all over again in a couple of months.
> That is why you see "folk magic", people love to share anecdata because.. that's what most people have. They either don't have the patience, the training or simply the time to approach these tools with rational rigor.
Yes. That's exactly the point of my comment. Users aren't performing anything even remotely approaching the level of controlled analysis necessary to evaluate the efficacy of their prompt magic. Every LLM thread is filled with random prompt advice that varies wildly, offered up as nebulously unfalsifiable personality traits (e.g. "it makes the model less aggressive and more circumspect"), all delivered with the matter-of-fact confidence of a foregone conclusion. Then someone always replies with "actually I've had the exact opposite experience with [some model], it really comes down to [instructing the model to do thing]".
> As the context fills up, the LLM pays less and less attention to things further back in the context; that's why the LLM seems dumber and dumber as a conversation goes on.
This is not entirely true. They pay the most attention to whatever is earliest in the context and whatever is most recent, while the middle between the two is where the dip is. Which basically means that the system prompt (which is always on top) is always going to get attention. Or, perhaps, it would be more accurate to say that because they are trained to follow the system prompt, which comes first, that's what they do.
Larger contexts are inherently more attention-taxing, so the more you throw at the model, the higher the probability that any particular thing is going to get ignored. But that probability still varies: lower at the beginning, higher in the middle, and back to lower at the end.