What a fascinating intersection of technology and human psychology!
"One thing I noticed toward the end is that, even though the robot remained expressive, it started feeling less alive. Early on, its motions surprised me: I had to interpret them, infer intent. But as I internalized how it worked, the prediction error fadedExpressiveness is about communicating internal state. But perceived aliveness depends on something else: unpredictability, a certain opacity. This makes sense: living systems track a messy, high-dimensional world. Shoggoth Mini doesn’t.
This raises a question: do we actually want to build robots that feel alive? Or is there a threshold, somewhere past expressiveness, where the system becomes too agentic, too unpredictable to stay comfortable around humans?"
Speaking as a drummer: yes, it’s completely different. The movements of a drummer are part of a single coordinated and complementary whole. Carrying on two conversations at once would be more like playing two different songs simultaneously. I’ve never heard of anyone doing that.
That said, Bob Milne could actually reliably play multiple songs in his head at once - while in an MRI, he could report the exact point he was at in each song at any given moment - but that guy is basically an alien. More on Bob: https://radiolab.org/podcast/148670-4-track-mind/transcript.
Nobody is expecting you to be able to derive and write an automatic differentiation (AD) library from scratch, but it's always good to know the fundamentals [1].
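For a rough feel of those fundamentals, here is a toy reverse-mode AD sketch on scalars (my own simplification in the spirit of micrograd, not what [1] covers in full; real libraries generalize this to tensors and many more operations):

```python
# Minimal reverse-mode automatic differentiation on scalars (illustrative only).
import math

class Value:
    def __init__(self, data, parents=(), backward=lambda: None):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = backward

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = backward
        return out

    def backward(self):
        # Topologically order the graph, then propagate gradients in reverse.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Gradients of tanh(x*y + x) with respect to x and y at x=0.5, y=2.0
x, y = Value(0.5), Value(2.0)
z = (x * y + x).tanh()
z.backward()
print(x.grad, y.grad)
```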
Andriy Burkov has written an excellent trilogy of books on AI/LLMs, namely "The Hundred-Page Machine Learning Book", "Machine Learning Engineering", and most recently "The Hundred-Page Language Models Book" [2],[3],[4].
Having said that, providing useful AI/LLM solutions for intuitive and interactive learning environments, training portals, standards-documentation exploration, business and industry rules-and-regulations checking, etc., built on top of open-source, local-first data repositories, is probably the killer application that is truly useful for end users; see for example [5],[6].
[6] AI-driven chat system designed to support students in the Introduction to Computing course (ECE 120) at UIUC, offering assistance with course content, homework, or troubleshooting common problems. It serves as an educational aid integrated into the course’s learning environment:
Just curious, how "deep" have you gone into the theory? What resources have you used? How strong is your math background?
Unfortunately a lot of the theory does require some heavy mathematics, the type you won't see in a typical undergraduate degree even for more math heavy subjects like physics. Topics such as differential geometry, metric theory, set theory, abstract algebra, and high dimensional statistics. But I do promise that the theory helps and can build some very strong intuition. It is also extremely important that you have a deep understanding of what these mathematical operations are doing. It does look like this exercise book is trying to build that intuition, but I haven't read it in depth. I can say it is a good start, but only the very beginning of the theory journey. There is a long road ahead beyond this.
> how it makes me choose the correct number of neurons in a layer, how many layers,
Take a look at the Whitney embedding theorem. While this isn't a precise answer, it'll help you gain some intuition about the minimal number of parameters you need (and the VGG paper will help you understand width vs depth). In a transformer, the post-attention MLP layer scales the dimensions up 4x before coming back down, which allows for untangling any knots in the data. While 2x is the minimum, 4x creates a smoother landscape and so the problem can be solved more easily. Some of this is discussed in a paper (Schaeffer, Miranda, and Koyejo) that counters the famous Emergent Abilities paper by Wei et al. This should come up early in ML courses, when covering problems like XOR or the concentric circles. These problems are difficult because in their natural dimension you cannot draw a hyperplane discriminating the classes, but by increasing the dimensionality of the problem you can. This fact is usually mentioned in intro ML courses, but I'm not aware of one that goes into more detail, such as a discussion of the Whitney embedding theorem, which would let you generalize the concept.
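As a toy illustration of that dimensionality point (my own sketch, not from any particular course): lifting the concentric-circles data into one extra radial dimension makes a single separating hyperplane sufficient.

```python
# Concentric circles are not linearly separable in 2D, but adding one
# extra (radial) dimension makes a separating hyperplane trivial.
import numpy as np

rng = np.random.default_rng(0)
n = 500
r_inner = rng.uniform(0.0, 1.0, n)          # inner cluster radii
r_outer = rng.uniform(2.0, 3.0, n)          # outer ring radii
theta = rng.uniform(0, 2 * np.pi, 2 * n)
r = np.concatenate([r_inner, r_outer])
X2d = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
y = np.concatenate([np.zeros(n), np.ones(n)])

# Lift to 3D: add the squared radius as a third coordinate.
X3d = np.concatenate([X2d, (X2d ** 2).sum(axis=1, keepdims=True)], axis=1)

# In 3D the classes are separated by the plane z = 2.25 (i.e. radius 1.5).
pred = (X3d[:, 2] > 1.5 ** 2).astype(float)
print("accuracy with a single hyperplane in 3D:", (pred == y).mean())  # 1.0
```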
> the activation functions
There's a very short video I like that visualizes Gelu[0], even using the concentric circles! The channel has a lot of other visualizations that will really benefit your intuition. You may see where the differential geometry background can provide benefits. Understanding how to manipulate manifolds is critical to understanding what these networks are doing to the data. Unfortunately these visualizations will not benefit you once you scale beyond 3D as weird things happen in high dimensions, even as low as 10[1]. A lot of visual intuition goes out the window and this often leads people to either completely abandon it or make erroneous assumptions (no, your friend cannot visualize 4D objects[2,3] and that image you see of a tesseract is quite misleading).
The activation functions provide non-linearity to the networks, a key ingredient missing from the perceptron model. Remember that with the universal approximation theorem you can approximate any smooth, Lipschitz-continuous function over a closed and bounded domain. You can, in simple cases, relate this to Riemann summation, but you are using smooth "bump functions" instead of rectangles. I'm being fairly hand-wavy here on purpose because this is not precise, but there are relationships to be found here. This is a HN comment; I have to oversimplify. Also remember that a linear layer without an activation can only perform affine transformations. That is, after all, what a matrix multiplication is capable of (another oversimplification).
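A quick way to convince yourself of the affine point (again a toy sketch of mine): two stacked linear layers with no activation collapse exactly into one affine map, and a single nonlinearity breaks that collapse.

```python
# Two stacked linear (affine) layers with no activation collapse to a single affine map.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 32)), rng.normal(size=64)
W2, b2 = rng.normal(size=(10, 64)), rng.normal(size=10)
x = rng.normal(size=32)

two_layers = W2 @ (W1 @ x + b1) + b2

# The equivalent single affine layer:
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b
print(np.allclose(two_layers, one_layer))  # True

# With a nonlinearity in between, the collapse no longer holds:
relu = lambda z: np.maximum(z, 0)
print(np.allclose(W2 @ relu(W1 @ x + b1) + b2, one_layer))  # False (in general)
```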
The learning curve is quite steep, and there's a big jump from the common "it's just GMMs" or "it's just linear algebra" claims[4]. There is a lot of depth here, and unfortunately, due to the hype, there is a lot of material that calls itself "deep" or "advanced mathematics", but it is important to remember that these terms are extremely relative. What is deep to one person is shallow to another. But if your math doesn't go beyond calculus, you are going to struggle, and I am extremely empathetic to that. But again, I do promise that there is a lot of insight to be gained by digging into the mathematics. There is benefit to doing things the hard way. I won't try to convince you that it is easy or that there isn't a lot of noise surrounding the topic, because that'd be a lie. If it were easy, ML systems wouldn't be "black boxes"![5]
I would also encourage you to learn some metaphysics. Something like Ian Hacking's Representing and Intervening is a good start. There are limitations to what can be understood through experimentation alone, famously illustrated in Dyson's recounting of when Fermi rejected his paper[6]. There is a common misunderstanding of the saying "with 4 parameters I can fit an elephant and with 5 I can make it wiggle its trunk." [6] can help provide a better understanding of this, but we truly do need to understand the limitations of empirical studies. Science relies on the combination of empirical studies and theory; neither is any good without the other. This is because science is about creating causal models, so one must be quite careful and extremely nuanced when doing any form of evaluation. The subtle details can easily trick you.
[5] I actually dislike this term. It is better to say that they are opaque. A black box would imply that we have zero insights. But in reality we can see everything going on inside, it is just extremely difficult to interpret. We also do have some understanding, so the interpretation isn't impenetrable.
I tried... it started with the idea that log loss might not be the best option for training, and that maybe it should be a loss related to how wrong the predicted word was. Predicting "dog" instead of "cat" should be penalised less than predicting "running".
That turns out to be an ultrametric loss, and the derivative of an ultrametric loss is zero in a large region around any local minimum, so it can't be trained by gradient descent -- it has to be trained by search.
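A toy illustration of why that happens (my own sketch, not my actual setup): if the loss is the tree distance to the argmax prediction, it is piecewise constant in the logits, so the gradient is zero almost everywhere.

```python
# A loss based on tree distance to the argmax prediction is piecewise
# constant in the logits, so finite-difference gradients come out zero.
import numpy as np

vocab = ["cat", "dog", "running"]
# Hierarchy: {cat, dog} share a cluster; "running" sits elsewhere.
# Cluster-based tree distance (an ultrametric): 1 within a cluster, 2 across.
tree_dist = np.array([[0, 1, 2],
                      [1, 0, 2],
                      [2, 2, 0]])

def ultrametric_loss(logits, target_idx):
    return tree_dist[int(np.argmax(logits)), target_idx]

logits = np.array([0.3, 1.2, -0.5])   # model currently predicts "dog"
target = 0                            # true word is "cat" -> loss 1

eps = 1e-4
grad = np.array([
    (ultrametric_loss(logits + eps * e, target) -
     ultrametric_loss(logits - eps * e, target)) / (2 * eps)
    for e in np.eye(3)
])
print(grad)  # [0. 0. 0.] -- no gradient signal to follow
```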
I have an internal repo that does guided window attn. I figured out One Weird Trick to get the model to learn how to focus so that you can move a fixed window around instead of doing full attn. I also built NNMemory (though that appears to be an idea others have had now too [1]), and I have a completely bonkers mechanism for non-deterministic exit logic so that the model can spin until it thinks it has a good answer. I also built scale-free connections between layers to completely remove residual connections. Plus some crazy things on sacrificial training (adding parameters that are removed after training in order to boost training performance with no prod penalty). There are more crazy things I have built but they aren't out there in the wild, yet. Some of the things I have built are in my repo. [2] I personally think we can get 0.5b models to outperform 8b+ SOTA models out there today (even the reasoning models coming out now).
The basic transformer block has been good at kicking things off, but it is now holding us back. We need to move to recurrent architectures again and switch to fixed guided attn windows + 'think' only layers like NNMemory. Attn is distracting and we know this as humans because we often close our eyes when we think hard about a problem on the page in front of us.
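For reference, the plain fixed-window idea looks like the following; this is just the standard sliding-window attention mask, not the guided mechanism from my internal repo:

```python
# Standard fixed sliding-window attention mask: each query position may only
# attend to the previous `window` tokens, including itself.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    local = (i - j) < window
    return causal & local             # True where attention is allowed

print(sliding_window_mask(6, 3).astype(int))
# Each row has ones only in the last `window` columns up to the diagonal,
# so attention cost grows linearly with sequence length instead of quadratically.
```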
I've come up with a set of rules that describe our reactions to technologies:
1. Anything that is in the world when you’re born is normal and ordinary and is just a natural part of the way the world works.
2. Anything that's invented between when you’re fifteen and thirty-five is new and exciting and revolutionary and you can probably get a career in it.
3. Anything invented after you're thirty-five is against the natural order of things.
- Douglas Adams
There might be some papers or other guides out there, but their advice will be based on whatever tools happened to be available at the time they were written and on the particular types of translations the authors cared about. The technology is advancing so rapidly that you might be better off just experimenting with various LLMs and prompts for texts and language pairs you are interested in.
I started using LLMs for translation after GPT-4 came out in March 2023—not that long ago! At first, the biggest problem was the context window: it wasn’t possible to translate more than a couple of pages at a time. Also, prompt writing was in its infancy, and a lot of techniques that have since emerged were not yet widely known. Even now, I still do a lot of trial and error with my prompts, and I cannot say with confidence that my current prompting methods are the best.
But, for what it’s worth, here are some strategies I currently use when translating with LLMs:
- In the prompt, I explain where the source text came from, how the translation will be used, and how I want it to be translated. Below is a (fictional) example, prepared through some metaprompting experiments with Claude:
- I run the prompt and source text through several LLMs and glance at the results. If they are generally in the style I want, I start compiling my own translation based on them, choosing the sentences and paragraphs I like most from each. As I go along, I also make my own adjustments to the translation as I see fit.
- After I have finished compiling my draft based on the LLM versions, I check it paragraph by paragraph against the original Japanese (since I can read Japanese) to make sure that nothing is missing or mistranslated. I also continue polishing the English.
- When I am unable to think of a good English version for a particular sentence, I give the Japanese and English versions of the paragraph it is contained in to an LLM (usually, these days, Claude) and ask for ten suggestions for translations of the problematic sentence. Usually one or two of the suggestions work fine; if not, I ask for ten more. (Using an LLM as a sentence-level thesaurus on steroids is particularly wonderful.)
- I give the full original Japanese text and my polished version to one of the LLMs and ask it to compare them sentence by sentence and suggest corrections and improvements to the translation. (I have a separate prompt for this step.) I don’t adopt most of the LLM’s suggestions, but there are usually some that I agree would make the translation better. I update the translation accordingly. I then repeat this step with the updated translation and another LLM, starting a new chat each time. Often I cycle through ChatGPT --> Claude --> Gemini several times before I stop getting suggestions that I feel are worth adopting.
- I then put my final translation through a TTS engine—usually OpenAI’s—and listen to it read aloud. I often catch minor awkwardnesses that I would overlook if reading silently.
This particular workflow works for me because I am using LLMs to translate in the same language direction I did manually for many years. If I had to translate to or from a language I don’t know, I would add extra steps to have LLMs check and double-check the accuracy of the translation and the naturalness of the output.
I was asked recently by some academics I work with about how to use LLMs to translate documents related to their research into Japanese, a language they don’t know. It’s an interesting problem, and I am planning to spend some time thinking about it soon.
Please note that my translation process above is focused on quality, not on speed. If I needed to translate a large volume of text more quickly, I would write a program to do the translation, checking, and rechecking through API calls, accepting the fact that I would not be able to check and polish the translation manually as I do now.
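For anyone curious what that programmatic version might look like, here is a bare-bones sketch (assuming the OpenAI Python SDK; the model name and prompt texts are placeholders, not my actual prompts):

```python
# Bare-bones sketch of an automated translate-then-check loop over API calls.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set;
# "gpt-4o" and the prompts below are placeholders, not my actual prompts.
from openai import OpenAI

client = OpenAI()

def ask(system: str, user: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return response.choices[0].message.content

def translate_chunk(japanese: str) -> str:
    draft = ask("Translate the following Japanese into natural English.", japanese)
    # Second pass: have the model compare the draft against the source and revise.
    revised = ask(
        "Compare this Japanese source and English draft sentence by sentence. "
        "Fix omissions and mistranslations, then return only the corrected English.",
        f"SOURCE:\n{japanese}\n\nDRAFT:\n{draft}",
    )
    return revised

chunks = ["...", "..."]  # source text split into chunks that fit the context window
translation = "\n\n".join(translate_chunk(c) for c in chunks)
```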
If anyone here would like to brainstorm together about how to use LLMs for translation, please feel free to email me. My website, with my email address on the Contact page, is linked from my HN profile page.
* I recommend everyone with a prostate get their PSA tested by age 45, or earlier if it runs in the family. It's a simple blood test but sometimes you have to ask for it.
* Early detection is key. A biopsy can assess how aggressive it is (the Gleason score). They can even genetically profile the tumor to further gauge risk (the Decipher score).
* Most cases are slow growing and may never escape the prostate. These are usually treated by simply monitoring them with more PSA testing, imaging and the occasional biopsy. Years ago these cases used to be over-treated, but now there is high-quality data showing the safety of monitoring.
* If it hasn't left the prostate (stage 1 or 2) and they think you need treatment there are a few options. In lesser cases you can do focal therapy which is like zapping a small tumor. But it tends to be a multi-focal disease, so often you'll end up getting "definitive treatment": removing the prostate surgically or saturating it with radiation.
* Surgery typically uses a cool-ass remote-controlled robot called "the DaVinci" that operates through a set of small incisions.
* Radiation is done either via external beam -- you just lie there and get zapped -- or by inserting radioactive seeds directly into the prostate (brachytherapy).
* The main side effect of surgery is erectile dysfunction, because the nerves that control that run right along the prostate and nerves do not like getting manipulated. Recovery can take up to two years and you don't always get all your function back. I think that stat is something like 1/3rd get back to baseline, 1/3rd get there but with the help of meds (Viagra), and 1/3rd suffer permanent decline / need other interventions. There is also some incontinence but that typically resolves within weeks/months.
* Radiation also causes ED, as well as urinary bother and potentially some other issues. But these typically kick in later, possibly years later. There are also some short term side effects while undergoing treatment. It's a good option for older patients for sure but "younger" patients (<60 y.o.) have to consider the effects of radiation on healthy tissue decades out. You may also have to undergo a 6-18 month course of hormone therapy which I'll discuss next.
* If it escapes the prostate (stage 3 or 4) then there is an awesome new scan called a PSMA PET that can locate where it is using radiotracers. They might do focal treatment on hot spots, but the main course of treatment is hormone therapy.
* Hormone therapy is essentially chemical castration. It removes all the testosterone from your body, and this weakens prostate cancer cells. By all accounts it is not fun: kills libido, saps energy levels. But it's also not as rough as chemotherapy, so we're "lucky" prostate cells work that way. There has been a lot of development here; often when you hear there's a new prostate cancer drug it's for hormone therapy.
* We all die but you don't want to die from prostate cancer. It's typically drawn out and painful. I watched my dad go through it, and it's why I got tested early. Fortunately it is very treatable if caught early.
Very cool to see an article that discusses Crutchfield's Epsilon machine formalism. It's one of those rare theories that is conceptually powerful but also simple and concrete enough that it can be implemented in a couple hundred lines of code.
For those interested, [1] is a readable (if quirky) introduction to the theory. The paper discussed in this article seems to describe a way of "stacking" epsilon machines, so that you have a machine that describes the state transitions of a machine that describes a data set. I wonder if this gets around the main weakness of the e-machine formalism, namely that for a process with non-finite memory, there's no obvious next class of automata to try after finite state machines. In a sense FSMs are the only non-arbitrary model of computation; everything else basically boils down to augmenting a finite-state control with a gadget for storing data, like a stack (pushdown automata), a register (counter/register machines), a random-access tape (Turing machines), a random-access tape where you're only allowed a tape the size of the input (linear-bounded automata), etc. You can constrain that gadget in pretty much arbitrary ways, which makes it difficult to choose a computational model for a non-finite process.
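To make the "couple hundred lines" claim concrete, here is a much smaller toy sketch of the core move, grouping histories by their predictive distribution; it is a simplification in the spirit of CSSR, not the construction from the paper or from [1]:

```python
# Toy sketch of epsilon-machine reconstruction: group histories into candidate
# "causal states" when they predict (approximately) the same next symbol.
# A simplification of algorithms like CSSR, not a faithful implementation.
import random
from collections import defaultdict

def causal_states(sequence: str, history_len: int = 3, tol: float = 0.05):
    counts = defaultdict(lambda: [0, 0])            # history -> [#next=0, #next=1]
    for i in range(history_len, len(sequence)):
        counts[sequence[i - history_len:i]][int(sequence[i])] += 1

    states = []                                      # each state: [p(next=1), histories]
    for hist, (n0, n1) in counts.items():
        p1 = n1 / (n0 + n1)
        for state in states:
            if abs(state[0] - p1) < tol:             # same predictive distribution
                state[1].append(hist)
                break
        else:
            states.append([p1, [hist]])
    return states

# Sample from the "golden mean" process: a 1 is never followed by another 1.
random.seed(0)
bits, prev = [], 0
for _ in range(20000):
    prev = 0 if prev == 1 else random.randint(0, 1)
    bits.append(str(prev))

for p1, hists in causal_states("".join(bits)):
    print(f"P(next=1)~{p1:.2f}  histories={sorted(hists)}")
# Expect two groups (the two causal states): histories ending in 1 (next symbol
# is always 0) and histories ending in 0 (next symbol is a fair coin flip).
```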
All the graphics for these are made in Chalk (https://github.com/chalk-diagrams/chalk), a Python port of Haskell's Diagrams library. Honestly, I mostly make the puzzles as an excuse to hack on the graphics library, which I find pretty interesting.
"One thing I noticed toward the end is that, even though the robot remained expressive, it started feeling less alive. Early on, its motions surprised me: I had to interpret them, infer intent. But as I internalized how it worked, the prediction error faded Expressiveness is about communicating internal state. But perceived aliveness depends on something else: unpredictability, a certain opacity. This makes sense: living systems track a messy, high-dimensional world. Shoggoth Mini doesn’t.
This raises a question: do we actually want to build robots that feel alive? Or is there a threshold, somewhere past expressiveness, where the system becomes too agentic, too unpredictable to stay comfortable around humans?"