isaacimagine's comments | Hacker News

No mention of Decision Transformers or Trajectory Transformers? Both are offline approaches that tend to do very well at long-horizon tasks, as they bypass the credit assignment problem by virtue of having an attention mechanism.

Most RL researchers consider these approaches not to be "real RL", as they can't assign credit outside the context window, and therefore can't learn infinite-horizon tasks. With 1m+ context windows, perhaps this is less of an issue in practice? Curious to hear thoughts.

DT: https://arxiv.org/abs/2106.01345

TT: https://arxiv.org/abs/2106.02039


TFP cites decision transformers. Just using a transformer does not bypass the credit assignment problem. Transformers are an architecture for solving sequence modeling problems, e.g. the credit assignment problem as it arises in RL. There have been many other such architectures.

The hardness of the credit assignment problem is a statement about data sparsity. Architecture choices do not "bypass" it.


TFP: https://arxiv.org/abs/2506.04168

The DT citation [10] is used on a single line, in a paragraph listing prior work, as an "and more". Another paper that uses DTs [53] is also cited in a similar way. The authors do not test or discuss DTs.

> hardness of the credit assignment ... data sparsity.

That is true, but not the point I'm making. "Bypassing credit assignment", in the context of long-horizon task modeling, is a statement about using attention to allocate long-horizon reward without a horizon-reducing discount, not about architecture choice.

To expand: if I have an environment with a key that unlocks a door thousands of steps later, Q-Learning may not propagate the reward signal from opening the door to the moment of picking up the key, because of the discount of future reward terms over a long horizon. A decision transformer, however, can attend to the moment of picking up the key while opening the door, which bypasses the problem of establishing this long-horizon causal connection.

(Of course, attention cannot assign reward if the moment the key was picked up is beyond the extent of the context window.)
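
To put toy numbers on the discount point (purely illustrative, not from either paper):

    # the door pays out 1.0, thousands of steps after the key is picked up
    gamma, horizon, door_reward = 0.99, 5_000, 1.0

    # credit a discounted one-step method would eventually assign to the key pickup
    print(door_reward * gamma ** horizon)   # ~1.5e-22, effectively zero

    # a return-to-go-conditioned model (e.g. a DT) sees the undiscounted return
    print(door_reward)                      # 1.0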


You can do Q-Learning with a transformer. You simply define the state space as the observation sequence. This is in fact natural to do in partially observed settings. So your distinction does not make sense.
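
A minimal sketch of that framing (assuming PyTorch; the class name and dimensions are made up, the point is just that the Q-function conditions on the whole observation history):

    import torch
    import torch.nn as nn

    class HistoryQNetwork(nn.Module):
        """Q-values conditioned on the full observation sequence."""
        def __init__(self, obs_dim, num_actions, d_model=64):
            super().__init__()
            self.embed = nn.Linear(obs_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.q_head = nn.Linear(d_model, num_actions)

        def forward(self, obs_seq):              # obs_seq: (batch, time, obs_dim)
            h = self.encoder(self.embed(obs_seq))
            return self.q_head(h[:, -1])         # Q(s, .) where s = the history so far

    q = HistoryQNetwork(obs_dim=8, num_actions=4)
    print(q(torch.randn(1, 100, 8)).shape)       # torch.Size([1, 4])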


The distinction is DT's reward-to-go conditioning vs. Q-Learning's discounted Bellman backup, not the choice of architecture for the policy. You could also do DTs with RNNs (though those come with their own memory problems).

Apologies if we're talking past one another.


The article is about using nalgebra to create an intuitive library for transforming between Earth's various coordinate systems. Not "another Rust matrix library".


OK, it's atop nalgebra. Of course, if you're using glam...


Very cool, thank you for sharing!


+10 respect, thank you <3


There are 163 lines of C. Of them, with -O3, 104 lines are present in the assembly output. So the C compiler is able to eliminate an additional ~36.2% of those lines. It doesn't do anything fancy, like autovectorization.

I profiled just now:

    |     | instrs (aarch64) | time 100k (s) | conway samples (%) |
    | -O0 |              606 |         19.10 |             78.50% |
    | -O3 |              135 |          3.45 |             90.52% |
The 3.45s surprises me, because it's faster than the 4.09s I measured earlier. Maybe I had a P core vs an E core. For -O0, the compiler is emitting machine code like:

    0000000100002d6c ldr x8, [sp, #0x4a0]
    0000000100002d70 ldr x9, [sp, #0x488]
    0000000100002d74 orn x8, x8, x9
    0000000100002d78 str x8, [sp, #0x470]
Which is comically bad. If I try with e.g. -Og, I get the same disassembly as -O3. Even -O1 gives me the same disassembly as -O3. The assembly (-Og, -O1, -O3) looks like a pretty direct translation of the C. Better, but also nothing crazy (e.g. no autovectorization):

    0000000100003744 orr x3, x3, x10
    0000000100003748 orn x1, x1, x9
    000000010000374c and x1, x3, x1
    0000000100003750 orr x3, x8, x17
Looking more closely, there's actually surprisingly little register spilling.

I think the real question you're asking is, as I wrote:

> If we assume instruction latency is 1 cycle, we should expect 2,590 fps. But we measure a number nearly 10× higher! What gives?

Part of this is due to counting the instructions in the disassembly wrong. In the blogpost I used 349 instructions, going off Godbolt, but in reality it's 135. If I redo the calculations with these new numbers, I get 2.11 instructions per bit, or 0.553 million instrs per step; dividing that into 3.70 Gcycles/s gives 6,690 fps. Which is better than 2,590 fps, but still 3.6× slower than 24,400. But I think 3.6× is a factor you can chalk up to instruction-level parallelism.
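
Spelled out (assuming a 512×512 grid, which is what makes these numbers line up):

    instrs_per_word = 135                      # aarch64 instructions per 64-cell word
    instrs_per_cell = instrs_per_word / 64     # ~2.11, the "instructions per bit" above
    cells = 512 * 512
    instrs_per_step = instrs_per_cell * cells  # ~0.553 million
    clock_hz = 3.70e9                          # cycles per second
    print(clock_hz / instrs_per_step)          # ~6,690 fps at 1 instruction/cycle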

Hope that answers your questions. Love your writing Gwern.


Thanks for checking. It sounds like the C compiler isn't doing a great job here of 'seeing through' the logic gate operations and compiling them down to something closer to optimal machine code. Maybe this is an example of how C isn't necessarily great for numerical optimization, or the C compiler is just bailing out of analysis before it can fix it all up.

A full-strength symbolic optimization framework like an SMT solver might be able to boil the logic gates down into something truly optimal, which would then be a very interesting proof of concept to certain people, but I expect that might be an entire project in its own right for you and not something you could quickly check.

Still, something to keep in mind: there's an interesting neurosymbolic research direction here in training logic gates to try to extract learned 'lottery tickets' which can then be turned into hyper-optimized symbolic code achieving the same task performance but possibly far more energy-efficient or formally verifiable.


Something like this should be hitting the instruction-level vectoriser, the basic-block-at-a-time one, nearly bang on. It's a lot of the same arithmetic op interleaved. It might be a good test case for LLVM - I would have expected almost entirely vector instructions from this.


z3 has good Python bindings, which I've messed around with before. My manual solution uses 42 gates; I would be interested to see how close to optimal it is. I didn't ask the compiler to vectorize anything; doing that explicitly might yield a better speedup.
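
For the curious, here's roughly what such a check looks like with the Python bindings: a throwaway adder-tree version of the Life rule proved equivalent to the spec (not my 42-gate circuit). Verifying a candidate is the easy half; searching for a minimal one is the hard half.

    from z3 import Bool, And, Or, Xor, Not, If, Sum, Solver, unsat

    c = Bool('c')                          # centre cell
    n = [Bool(f'n{i}') for i in range(8)]  # eight neighbours

    # spec: alive next step iff 3 live neighbours, or 2 live neighbours and already alive
    count = Sum([If(b, 1, 0) for b in n])
    spec = Or(count == 3, And(c, count == 2))

    # candidate: count neighbours with half/full adders, then decode the 2-or-3 cases
    ha = lambda a, b: (Xor(a, b), And(a, b))
    def fa(a, b, cin):
        s, c1 = ha(a, b)
        s, c2 = ha(s, cin)
        return s, Or(c1, c2)

    s0, c0 = fa(n[0], n[1], n[2])
    s1, c1 = fa(n[3], n[4], n[5])
    s2, c2 = ha(n[6], n[7])
    b0, ca = fa(s0, s1, s2)                # bit 0 of the neighbour count
    t,  cb = fa(c0, c1, c2)
    b1, cc = ha(t, ca)                     # bit 1
    b2, b3 = ha(cb, cc)                    # bits 2 and 3

    candidate = And(b1, Not(b2), Not(b3), Or(b0, c))   # count is 3, or 2 with c alive

    solver = Solver()
    solver.add(spec != candidate)
    print("equivalent" if solver.check() == unsat else "not equivalent")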

Re: neurosymbolics, I'm sympathetic to wake-sleep program synthesis and that branch of research. In a draft of this blog post, I had an aside about the possibility of extracting circuits and reusing them, and another about the possibility of doing student-teacher training to replace stable subnets of standard (e.g. dense ReLU) networks with optimized DLGNs during training, to free up parameters for other things.


Glad you enjoyed it, and thanks for the tip!


Agree, it's much better to write up a journal at times when your colleagues would be https://xkcd.com/303


Chaotic energy haha, I like it. Thanks for the tips re: keeping a journal, I will do this more in the future. I usually keep development notes, though normally in markdown files scattered across the codebase or in comments, never by date in the README. In the future, I might make JOURNAL.md a standard practice in my projects? Re: w&b, I used w&b when it first came out and I liked it, but I'm sure it's come a lot further since then. I will have to take a look!

Also lol "pretentious perfectionist" I'm glad to finally have some words to describe my design aesthetic. I like crisp fonts, what can I say.


> Chaotic energy haha, I like it

My boss says I'm eccentric. I say that's just a nice word for crazy lol

> normally in markdown files scattered across the codebase or in comments

I used to do that too but they didn't end up helping because I could never find them. So I moved back to using a physical book. The wandb reports were the first time I really had something where I felt like I got more out of it than a physical book. Even my iPad just results in a lot of lost stuff and more time trying to figure out why I can't just zoom in on the notes app. I mean what is an iPad even for if it isn't really good for writing?

But the most important part of the process I talked about is the logging of all the parameters and options. Those are the details you tend to lose and go hunting for. So even if you never write a word you'll see huge benefits from this.

> re:w&b

Wandb's best feature is that you can email them requesting a feature and they'll implement it or help you implement it. It's literally their business model. I love it. I swear, they have a support agent assigned to me (thanks Art! And if wandb sees this, give the man a raise. Just look at what crazy people he has to deal with)

> lol "pretentious perfectionist" I'm glad to finally have some words to describe my design aesthetic

To be clear, I'm actually not. Too chaotic lol. Besides, perfectionism doesn't even exist. It's more a question about personal tastes and where we draw the line for what is good enough. I wish we'd stop saying "don't let perfectionism get in the way of good" because it assumes there's universal agreement about what good enough is.


Parameters and options, got it. I try to keep all configuration declarative and make building and running as deterministic as possible. Then I can commit whenever I do something interesting, and just check out that commit to revisit it.


I think these are the two main headaches with experimenting, no matter what kind of experiment you're doing (computation, physics, chem, bio, whatever):

  - Why the fuck aren't things working
  - Why the fuck are things working
The second is far more frustrating. The goal is to understand and explain why things are the way they are. To find that causal structure, right? So in experimenting, getting things working means you're not even halfway done.

So if you are "organized" and flexible, you can quickly test different hypotheses. Is it the seed? The model depth? The activation layers? What?

Without the flexibility it gets too easy to test multiple things simultaneously and lose track. You want to isolate variables as much as possible. Variable interplay throws a wrench into that, so you should make multiple modifications at once to search configuration space optimally, but how can you do any actual analysis if you don't record this stuff? And I guarantee you'll have some hunch and be like "wait, I did something earlier that would be affected by that!" and you can go check to see if you should narrow down on that thing or not.

The reason experimenting is hard is because it is the little shit that matters. That's why I'm a crazy pretentious "perfectionist": because I'm lazy and don't have the budget or time to be exhaustive. So free yourself up to quickly launch experiments and spend more time working on your hypotheses, because that task is hard enough. You don't want to do that while also debugging and making big changes to code, where you're really just going to accidentally introduce more errors. At least that's what happens to my dumb ass, but I haven't yet met a person that avoids this, so I know I'm not alone.


Author here. Any questions, ask away.


Is there an expanded explanation coming for "Of course it is biased! There's no way to train the network otherwise!"?

I'm still struggling to understand why that is the case. As far as I understand the training, in a bad case (probably mostly at the start) you could happen to learn the wrong gate early and then have to revert from it. Why isn't the same thing happening without the biasing toward pass-through? I get why pass-through would make things faster, but not why training would fail to converge without it.


That part about passthrough strongly reminded me of Turing’s Unorganized Machines (randomly wired NAND-gate networks): https://weightagnostic.github.io/papers/turing1948.pdf (worth a read from page 9)


Thank you for the excellent writeup of some extremely interesting work! Do you have any opinions on whether binary networks and/or differentiable circuits will play a large role in the future of AI? I've long had this hunch that we'll look back on current dense vector representations as an inferior way of encoding information.


Thank you, I'm glad you enjoyed it!

Well, I'm not an expert. I think that this research direction is very cool. I think that, at the limit, for some (but not all!) applications, we'll be training over the raw instructions available to the hardware, or perhaps even the hardware itself. Maybe something as in this short story[0]:

> A descendant of AutoML-Zero, “HQU” starts with raw GPU primitives like matrix multiplication, and it directly outputs binary blobs. These blobs are then executed in a wide family of simulated games, each randomized, and the HQU outer loop evolved to increase reward.

I also think that different applications will require different architectures and tools, much like how you don't write systems software in Lua, nor script game mods with Zsh. It's fun to speculate, but who knows.

[0]: https://gwern.net/fiction/clippy


Is it ok if I ask you for your age?


How do the ~300 gates you got compare to modern optimal implementations?

IIRC it's around 30-40?


Was this result surprising?


Yes and no. I wasn't expecting to be able to reproduce the work, so I'm just content that it works. I was very surprised by how much hyperparameter finagling I had to do to get the DLGN converging; the tiny relu network I trained at the beginning, in comparison, converged with dead-simple SGD in a third of the epochs.

The speedup was surprising in the sense that the bit-level parallelism fell out naturally: that 64× speedup alone was unexpected and pretty sweet. There's likely still a lot of speed left on the table. I just did the bare minimum to get the C code working: it's single-threaded, there's no vectorization, lots of register spilling, etc. Imagine the speedup you'd get running the circuit on e.g. an FPGA.
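
A toy sketch of what I mean by the bit-level parallelism falling out naturally (not the actual generated code):

    import random

    MASK = (1 << 64) - 1
    a = random.getrandbits(64)   # one input wire, packed for 64 independent cells
    b = random.getrandbits(64)   # another input wire, same 64 cells

    out = ~(a & b) & MASK        # a single NAND evaluates that gate for all 64 cells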

But no, it was not surprising in the sense that yeah, multiplying billions of floats is going to be much slower than a handful of parallel bitwise ops. Physics is physics, doesn't matter how good your optimizer is.


What percentage of ops were passthrough?

ps. superb writeup and project


Thank you! Good question. Here are the NN stats, before lowering to C:

    total gates        | 2303 | 100.0%
    -------------------+------+-------
    passthrough        | 2134 |  92.7%
    gates w/ no effect | 1476 |  64.1%
Note the rows aren't mutually exclusive.


You've made some mistakes with the Game of Life rules. You've missed out the overpopulation rule:

Any live cell with more than three live neighbours dies

Nit:

> I guess there’s a harsh third rule which is, “if the cell is dead, it stays dead”.

That phrasing is inaccurate: if a dead cell stayed dead, the first rule wouldn't work. I'm not sure that particular sentence adds much to the flow, honestly.


You're thinking about the cells as toggles on a stateful grid; TFA is thinking about them as pure functions that take in an input state and output a new state (with "off" being the default).

From that perspective, there's no point in "killing" a cell; it's simpler to only write out the 0 -> 1 and 1 -> 1 transition cases and leave all of the other cases as implicitly 0.
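
Something like this, as a sketch of the pure-function view:

    def next_state(alive: bool, live_neighbours: int) -> bool:
        # only the 0 -> 1 and 1 -> 1 cases are written down; everything else is dead
        return live_neighbours == 3 or (alive and live_neighbours == 2)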


I tested this out with a promotional illustration from Neon Genesis Evangelion. The model works quite well, but there are some temporal artifacts w.r.t. the animation of the hair as the head turns:

https://goto.isaac.sh/neon-anisora

Prompt: The giant head turns to face the two people sitting.

Oh, there is a docs page with more examples:

https://pwz4yo5eenw.feishu.cn/docx/XN9YdiOwCoqJuexLdCpcakSln...


link's broke


