More

arjie · 2026-04-24T14:40:18 1777041618

I find that building a personal blocklist extension for myself lets me treat such threads as fertile grounds. I no longer get annoyed because I am pleased that I can quickly remove a lot of low quality commenters at once. Recommend writing one for yourself (trivial with LLM).

Original comment was clever and subsequent commenters were uninteresting to me. In this case, I only saw it because I’m on my phone which doesn’t have Chrome extensions. Turns out I’d already blocked them.

arjie · 2026-04-23T19:16:41 1776971801

The word is not co-opted. A harness is just supportive scaffolding to run something. A test harness is scaffolding to run tests against software, a fuzz harness is scaffolding to run a fuzzer against the software, and so on. I've seen it being used in this manner many times over the past 15 years. It's the device that wraps your software so you can run it repeatedly with modifications of parameters, source code, or test condition.

dataviz1000 · 2026-04-23T19:38:08 1776973088

> A harness is just supportive scaffolding to run something.

Thank you for the perfect explanation.

Last week in my confusion about the word because Anthropic was using test, eval, and harness in the same sentence so I thought Anthropic made a test harness, I used Google asking "in computer science what is a harness". It responded only discussing test harnesses which solidified my thinking that is what it is.

I wish Google had responded as clearly you did. In my defense, we don't know if we understand something unless we discuss it.

arjie · 2026-04-23T19:11:05 1776971465

Useful update. Would be useful to me to switch to a nightly / release cycle but I can see why they don't: they want to be able to move fast and it's not like I'm going to churn over these errors. I can only imagine that the benchmark runs are prohibitively expensive or slow or not using their standard harness because that would be a good smoke test on a weekly cadence. At the least, they'd know the trade-offs they're making.

Many of these things have bitten me too. Firing off a request that is slow because it's kicked out of cache and having zero cache hits (causes everything to be way more expensive) so it makes sense they would do this. I tried skipping tool calls and thinking as well and it made the agent much stupider. These all seem like natural things to try. Pity.

arjie · 2026-04-23T18:59:18 1776970758

Get the actual prompt and have Claude Code / Codex try it out via curl / python requests. The full prompt will yield debugging information. You have to set a few parameters to make sure you get the full gpt-5 performance. e.g. if your reasoning budget too low, you get gpt-4 grade performance.

IMHO you should just write your own harness so you have full visibility into it, but if you're just using vanilla OpenClaw you have the source code as well so should be straightforward.

pantulis · 2026-04-23T19:41:54 1776973314

> IMHO you should just write your own harness

Can you point to some online resources to achieve this? I'm not very sure where I'd begin with.

arjie · 2026-04-23T19:55:12 1776974112

Ah, I just started with the basic idea. They're super trivial. You want a loop, but the loop can't be infinite so you need to tell the agent to tell you when to stop and to backstop it you add a max_turns. Then to start with just pick a single API, easiest is OpenAI Responses API with OpenAI function calling syntax https://developers.openai.com/api/docs/guides/function-calli...

You will naturally find the need to add more tools. You'll start with read_file (and then one day you'll read large file and blow context and you'll modify this tool), update_file (can just be an explicit sed to start with), and write_file (fopen . write), and shell.

It's not hard, but if you want a quick start go download the source code for pi (it's minimal) and tell an existing agent harness to make a minimal copy you can read. As you build more with the agent you'll suddenly realize it's just normal engineering: you'll want to abstract completions APIs so you'll move that to a separate module, you'll want to support arbitrary runtime tools so you'll reimplement skills, you'll want to support subagents because you don't want to blow your main context, you'll see that prefixes are more useful than using a moving window because of caching, etc.

With a modern Claude Code or Codex harness you can have it walk through from the beginning onwards and you'll encounter all the problems yourself and see why harnesses have what they do. It's super easy to learn by doing because you have the best tool to show you if you're one of those who finds code easier to read that text about code.

wild_egg · 2026-04-23T19:51:36 1776973896

At the core, they're really very simple [1]. Run LLM API calls in a loop with some tools.

From there, you can get much fancier with any aspect of it that interests you. Here's one in Bash [2] that is fully extensible at runtime through dynamic discovery of plugins/hooks.

[1] https://ampcode.com/notes/how-to-build-an-agent

[2] https://github.com/wedow/harness

vidarh · 2026-04-23T21:46:00 1776980760

Here's a starting point in 93 lines of Ruby, but that one is already bigger than necessary:

https://radan.dev/articles/coding-agent-in-ruby

Really, of the tools that one implements, you only need the ability to run a shell command - all of the agents know full well how to use cat to read, and sed to edit.

(The main reason to implement more is that it can make it easier to implement optimizations and safeguards, e.g. limit the file reading tool to return a certain length instead of having the agent cat a MB of data into context, or force it to read a file before overwriting it)

stavros · 2026-04-23T21:53:52 1776981232

Just use Pi core, no need to reinvent the wheel.

jswny · 2026-04-23T20:02:53 1776974573

Codex is fully open source…

arjie · 2026-04-23T15:17:05 1776957425

Board games are great fun and also provide an excuse to hang out with your friends on a schedule. Some of my favourites are:

Power Grid: An ancient one. You compete to connect cities to your power network by buying resources on a market with a fixed replenishment cycle (so the book depletes as each player goes) and buying plants in auction.

Forbidden Stars: WH40k game. The interesting device in this game is that you commit to your actions ahead of time and others stack their actions on top of yours so yours will happen last but you can activate each map section available at your convenience. Combat with card draws and figurines.

Twilight Struggle: The US and the USSR struggle for control of the world. You play cards that represent various pivotal moments in history to give you influence in various parts of the world. You're allowed to coup and realign countries. Dice rolls are significant. An amusing self-confession is that I can't bring myself to play the USSRs. Nuclear Subs as a headline just makes me flush with pride https://twilightstrategy.com/2012/09/10/nuclear-subs/

I haven't played the latter two in recent times but ones I have played recently are:

Mahjong: An old classic. Trick taking with tiles. We most enjoy playing with the Chinese Official scoring rules https://web.archive.org/web/20250219225547/http://mahjong.wi...

But the Taiwanese style are easier to start with

Terraforming Mars: Tableau-building game (you have points based on the cards you've played) with an economy and map placement. I like the Venus and Colonies expansions. Best played with 3d printed parts to keep your nezos in place.

These are all great fun!

arjie · 2026-04-23T14:38:10 1776955090

I can't see why I would want this, but I do love Tailscale so I'm excited to see what new stuff he comes up with here.

arjie · 2026-04-23T04:57:21 1776920241

The Last Ringbearer is some more fan-fiction in the LOTR world that does this as well. I found it fairly entertaining, though I think LOTR as it stands is extraordinary, especially when told from the lens of not being the main story but a later side questy bit.

arjie · 2026-04-23T00:24:59 1776903899

There's a datacenter around the corner from where I live in San Francisco. More than a decade ago[0], I worked at a company that had hundreds of machines there. Recently I was looking to colocate a server and found that Hosting.com on 3rd street sold off datacenter operations and the buyer shut them down at that location. Sad. Hurricane Electric is still running in Fremont and it's only an hour away, but I would have preferred to have just walked next door. Ah well, such is life. I imagine the building is much more valuable as an empty tenant since it's a block away from the VCs at South Park.

I do wish, selfishly, that it was still a datacenter though. It would be sick to be able to walk down the street to my servers. I'm still procrastinating on readying my GPU servers because of the one hour of travel.

0: back when individuals didn't have petabytes or 1 TiB RAM machines or 1 GiB CPU cache machines

arjie · 2026-04-22T19:32:02 1776886322

I have never found any utility in that. After all, you can still just review the diffs and ask it for explanation for sections instead.

pavel_lishin · 2026-04-22T19:33:40 1776886420

> After all, you can still just review the diffs

anonu has explicitly said that they've wiped a database twice as a result of agents doing stuff. What sort of diff would help against an agent running commands, without your approval?

exe34 · 2026-04-22T20:21:51 1776889311

Hah I run my agent inside a docker with just the code. Anything clever it tries to do just goes nowhere.

arjie · 2026-04-23T00:04:06 1776902646

Agent does not have to run in your user context. It is easy mistake to make in yolo mode but after that it's easy to fix. e.g. this is what I use now so I can release agent from my machine and also constrain its access:

    $ main-app git:(main) kubectl get pods | grep agent | head -n 1 | sed -E 's/[a-z]+-agent(.*)/app-agent\1/'
    app-agent-656c6ff85d-p86t8                          1/1     Running     0             13d

Agent is fully capable of making PR etc. if you provide appropriate tooling. It wipes DB but DB is just separate ephemeral pod. One day perhaps it will find 0-day and break out, but so far it has not done it.

ModernMech · 2026-04-22T19:51:45 1776887505

> After all, you can still just review the diffs

The diff: +8000 -4000

arjie · 2026-04-23T00:20:33 1776903633

You can ask it to make the changes in appropriate PRs. SOTA model + harness can do it. I find it useful to separate refactors and implementations, just like with humans, but I admittedly rely heavily on multi-provider review.

arjie · 2026-04-22T02:04:17 1776823457

Besides that flagship vehicle, their other more standard cars are also pretty good. We just returned from Hong Kong, and the cars there were the same brands we saw in South America: Maxus et al. with some MGs. To be honest, they seemed very good. Unless something is secretly wrong with them regarding safety or reliability, the American and European car industries are in huge trouble.

A friend's dad just restored his ancient MG up here in California and it was funny to me to see that car and then go up to Hong Kong and see the modern incarnation of the same marque.

cpursley · 2026-04-22T10:21:26 1776853286

Regarding the MGs, I believe the design is still done in the UK, which explains the style. And from my understanding a lot of of the really good looking Chinese cars are actually designed by European design shops (Italian, Swedish etc). It seems like a pretty good strategy actually to let the Chinese handle the manufacturing while the Europeans handle the design and performance.