One of my agents is kinda like this too. The only operation is SPARQL query, and the only accessible state is the graph database.
Since most of the ontologies I'm using are public, I just have to name-drop them in the prompt; no schemas and little structure introspection are needed. At worst, the agent can walk and dump triples to figure out the structure; it's all RDF triples and URIs.
One nice property: using structured outputs, you can constrain the model to emit only syntactically valid RDF for certain queries, avoiding syntax errors. You can probably do something similar with GraphQL.
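For what it's worth, here's a minimal Python sketch of the cruder fallback when you don't have grammar-constrained decoding: parse whatever Turtle the model emits with rdflib and retry on syntax errors. This is post-hoc validation rather than true constrained generation, and `ask_llm` / `generate_rdf` are hypothetical names, not any particular API.

    # Validate LLM-emitted Turtle before using it; retry if it doesn't parse.
    from rdflib import Graph

    def valid_turtle(text: str) -> bool:
        try:
            Graph().parse(data=text, format="turtle")
            return True
        except Exception:  # rdflib raises parser-specific exceptions on bad syntax
            return False

    def generate_rdf(prompt: str, ask_llm, max_tries: int = 3) -> str:
        # ask_llm is a placeholder for whatever completion call you use.
        for _ in range(max_tries):
            candidate = ask_llm(prompt)
            if valid_turtle(candidate):
                return candidate
            prompt += "\n\nThat output was not syntactically valid Turtle; please fix it."
        raise ValueError("model never produced syntactically valid Turtle")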
Idk, `o3-mini-high` was able to pop this Prolog code out in about 20 seconds:
solve(WaterDrinker, ZebraOwner) :-
% H01: Five houses with positions 1..5.
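% Each house is a term house(Position, Color, Nationality, Drink, Pet, Cigarette).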
Houses = [ house(1, _, norwegian, _, _, _), % H10: Norwegian lives in the first house.
house(2, blue, _, _, _, _), % H15: Since the Norwegian lives next to the blue house,
house(3, _, _, milk, _, _), % and house1 is Norwegian, house2 must be blue.
house(4, _, _, _, _, _),
house(5, _, _, _, _, _) ],
% H02: The Englishman lives in the red house.
member(house(_, red, englishman, _, _, _), Houses),
% H03: The Spaniard owns the dog.
member(house(_, _, spaniard, _, dog, _), Houses),
% H04: Coffee is drunk in the green house.
member(house(_, green, _, coffee, _, _), Houses),
% H05: The Ukrainian drinks tea.
member(house(_, _, ukrainian, tea, _, _), Houses),
% H06: The green house is immediately to the right of the ivory house.
right_of(house(_, green, _, _, _, _), house(_, ivory, _, _, _, _), Houses),
% H07: The Old Gold smoker owns snails.
member(house(_, _, _, _, snails, old_gold), Houses),
% H08: Kools are smoked in the yellow house.
member(house(_, yellow, _, _, _, kools), Houses),
% H11: The man who smokes Chesterfields lives in the house next to the man with the fox.
next_to(house(_, _, _, _, _, chesterfields), house(_, _, _, _, fox, _), Houses),
% H12: Kools are smoked in a house next to the house where the horse is kept.
next_to(house(_, _, _, _, horse, _), house(_, _, _, _, _, kools), Houses),
% H13: The Lucky Strike smoker drinks orange juice.
member(house(_, _, _, orange_juice, _, lucky_strike), Houses),
% H14: The Japanese smokes Parliaments.
member(house(_, _, japanese, _, _, parliaments), Houses),
% (H09 is built in: Milk is drunk in the middle house, i.e. house3.)
% Finally, find out:
% Q1: Who drinks water?
member(house(_, _, WaterDrinker, water, _, _), Houses),
% Q2: Who owns the zebra?
member(house(_, _, ZebraOwner, _, zebra, _), Houses).
right_of(Right, Left, Houses) :-
    nextto(Left, Right, Houses).

next_to(X, Y, Houses) :-
    (   nextto(X, Y, Houses)
    ;   nextto(Y, X, Houses)
    ).
Seems ok to me.
?- solve(WaterDrinker, ZebraOwner).
WaterDrinker = norwegian,
ZebraOwner = japanese .
That's because it uses a long CoT. The actual paper [1] [2] talks about the limitations of decoder-only transformers predicting the reply directly, although it also establishes the benefits of CoT for composition.
This has all been known for a long time and makes intuitive sense: you can't squeeze more computation out of the model than it can provide. The authors just formally proved it (which is no small deal). And Quanta is being dramatic with its conclusions and headlines, as always.
LLMs using CoT are also decoder-only; it's not the paradigm shift some people now claim it is so they don't have to admit they were wrong. It's still next-token prediction, just forced to explore more possibilities within the space it contains. And with R1-Zero we also know that LLMs can train themselves to do so.
gpt-4o, asked to produce SWI-Prolog code, gets the same result with very similar code. gpt-4-turbo can do it with slightly less tidy code. gpt-3.5-turbo struggled to get the syntax right, but I think it could manage with better prompting.
CoT is definitely optional here. Although I am sure all LLMs have seen this problem explained and solved in their training data.
This doesn't cover encoder-decoder transformer fusion for machine translation, or encoder-only models like BERT used for text classification and named-entity recognition.
The LLM doesn't understand that it's doing this, though. It pattern-matched against your "steering" in a way that generalized, and it didn't hallucinate in this particular case. That's still cherry-picking, and you wouldn't trust this to turn a $500k screw.
I feel like we're at 2004 DARPA Grand Challenge level, but we're nowhere near solving all of the issues required to run this on public streets. It's impressive, but it leaves an enormous amount to be desired.
I think we'll get there, but I don't think it'll be in just a few short years. The companies hyping that this accelerated timeline is just around the corner are doing so out of an existential need to keep the funding flowing.
I'm certain models like o3-mini are capable of writing Prolog of this quality for puzzles they haven't seen before; it feels like a very straightforward conversion operation for them.
My comment got eaten by HN, but I think LLMs should be used as the glue between logic systems like Prolog, with inductive, deductive, and abductive reasoning handed off to a tool. LLMs are great at pattern matching, but forcing them to reason seems like an out-of-envelope use.
Prolog is also how I would solve puzzles like that. It's like calling someone weak for using a spreadsheet or a calculator.
I actually, coincidentally, tried this yesterday on variants of the "surgeon can't operate on boy" puzzle. It didn't help; LLMs still can't reliably solve it.
(All current commercial LLMs are badly overfit on this puzzle, so if you try changing parts of it they'll get stuck and try to give the original answer in ways that don't make sense.)
If the LLM’s user indicates that the input can and should be translated into a logic problem, and the user then runs that definition in an external Prolog solver, what’s the LLM really doing here? Probabilistically mapping a logic problem to Prolog? That’s not quite the LLM solving the problem.
Not the user you’re replying to, but I would feel differently if the LLM responded with “This is a problem I can’t reliably solve by myself, but there’s a logic programming system called Prolog for which I could write a suitable program that would. Do you have access to a Prolog interpreter, or could you give me access to one? I could also just output the Prolog program if you like.”
Furthermore, the LLM does know how Prolog’s unification algorithm works (in the sense that it can explain how Prolog and the algorithm work), yet it isn’t able to follow that algorithm by itself the way a human could with pen and paper, even for simple Prolog programs whose execution would fit within the resource constraints.
This is part of the gap that I see to true human-level intelligence.
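To make concrete what "following the algorithm with pen and paper" would involve, here is a toy first-order unifier in Python. It is my own illustrative sketch with a made-up term representation; like Prolog's default, it skips the occurs check, and it leaves out everything else a real Prolog adds (backtracking, clause indexing, and so on).

    # Terms: variables are capitalized strings, atoms are lowercase strings or
    # numbers, compound terms are tuples ("functor", arg1, arg2, ...).

    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def walk(t, subst):
        # Follow variable bindings until we hit a non-variable or an unbound variable.
        while is_var(t) and t in subst:
            t = subst[t]
        return t

    def unify(a, b, subst=None):
        if subst is None:
            subst = {}
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            return subst
        if is_var(a):
            return {**subst, a: b}   # no occurs check, as in standard Prolog unification
        if is_var(b):
            return {**subst, b: a}
        if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b) and a[0] == b[0]:
            for x, y in zip(a[1:], b[1:]):
                subst = unify(x, y, subst)
                if subst is None:
                    return None
            return subst
        return None  # functor or atom clash: unification fails

    # Example: unify house(Pos, green, Who, coffee) with house(4, green, japanese, coffee)
    print(unify(("house", "Pos", "green", "Who", "coffee"),
                ("house", 4, "green", "japanese", "coffee")))
    # -> {'Pos': 4, 'Who': 'japanese'}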
If an LLM can solve a riddle of arbitrary complexity that is not similar to an already-solved riddle, have the LLM solve the riddle "how can this trained machine-learning model be adjusted to improve its riddle-solving abilities without regressing in any other meaningful capability".
It's apparent that this particular riddle is not presently solved by LLMs; if it were, humans would already be having LLMs improve themselves in the wild.
So, constructively, there exists at least one riddle that doesn't have a pattern similar to existing ones, where that riddle is unsolvable by any existing LLM.
If you present a SINGLE riddle an LLM can solve, people will reply that that particular riddle isn't good enough. To succeed, LLMs need to solve all the riddles, including the one I presented above.
It's quite the opposite. Putting it in terms like yours, the argument is "could a powerful but not omnipotent god make themselves more powerful?", and the answer is "probably".
If the god cannot grant themselves powers, they're not very powerful at all, are they?
Good point. LLMs can be treated as "theories", and then they definitely meet the falsifiability criterion [1], letting researchers find "black swans" for years to come. The theories in this case can differ. But if the theory is that of a logical or symbolic solver: Wolfram's Mathematica may struggle to understand human language as input, but when it comes to evaluating the results, well, I think Stephen (Wolfram) can sleep soundly, at least for now.
Care to elaborate? It’s pretty reasonable to be emotional if you feel like your loved ones are being manipulated, especially if they’re children. And the article isn’t even that grumpy; it still deserves reasonable consideration from readers.
I like popcount for converting a 2^N-bit uniformly-distributed random number into an N-bit binomially-distributed one. Each bit of the input simulates a random coin flip.
Not if you already have 2^N bits at hand. In fact, if you have 2^N bits of entropy, popcount is probably more efficient than generating N more bits randomly.
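For concreteness, a tiny Python version of the trick described above (function name is mine):

    import random

    def binomial_via_popcount(n: int) -> int:
        # Sample Binomial(2**n, 0.5) by popcounting 2**n uniformly random bits;
        # each bit is an independent fair coin flip. Values range over 0..2**n.
        word = random.getrandbits(2 ** n)
        return bin(word).count("1")  # popcount

    # e.g. n = 6: popcount of a 64-bit random word ~ Binomial(64, 0.5)
    print([binomial_via_popcount(6) for _ in range(5)])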