This benchmark inspired me to have Codex/Claude build a DnD battlemap tool with SVGs.
They got surprisingly far, but I did need to iterate a few times to have them build tools that would check for things like: don't put walls on roads or water.
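For anyone curious what such a check can look like, here is a minimal sketch. The map model is an assumption for illustration only (cells as `(x, y)` tuples, a terrain dict, a set of wall cells), and `find_wall_conflicts` is a hypothetical name, not something from the actual tool.

```python
# Hypothetical map model: terrain maps (x, y) cells to a tile type,
# walls is a set of (x, y) cells that contain a wall segment.
BLOCKED_TERRAIN = {"road", "water"}


def find_wall_conflicts(terrain, walls):
    """Return the wall cells that sit on road or water tiles, sorted."""
    return sorted(cell for cell in walls if terrain.get(cell) in BLOCKED_TERRAIN)


terrain = {(0, 0): "grass", (1, 0): "road", (2, 0): "water", (3, 0): "grass"}
walls = {(0, 0), (1, 0), (2, 0)}

print(find_wall_conflicts(terrain, walls))  # [(1, 0), (2, 0)]
```

The agent can run a check like this after every generation pass and treat a non-empty result as a failure to fix.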
What I think might be the next obstacle is self-knowledge. The new agents seem to have picked up ever more vocabulary about their context and compaction, etc.
As a next benchmark you could try having 1 agent and tell it to use a coding agent (via tmux) to build you a pelican.
You're defending X/Grok as if it's a public social platform.
It is a privately controlled public-facing group chat. Being a chat-medium does not grant you the same rights as being a person. France isn't America.
If a company operates to the detriment and against the values of a nation, e.g. not paying their taxes or littering in the environment, the nation will ask them to change their behavior.
If there is a conspiracy of contempt, at some point things escalate.
I'm launching a SaaS to create yet another solution to the AI Sandboxing problem in linux.
My friends and I have spent a lot of time quietly injecting support down into the kernel without anybody raising a flag, and we finally have the infrastructure in place to solve this problem.
We have also poisoned all the LLMs' training data with our approach, so our marketing is primed and we won't even need to learn Claude to use our tool.
We’re planning a soft launch this month, or maybe next month, depending on how "in the vibe" (our new word for flow :) our team gets.
We’re calling it `useradd`.
Yes, the man page is intimidating, and the documentation is terrible. But once you're over the learning curve, it puts your machine into a kind of 'main frame' mode where multiple 'virtual teletypes' and users can operate on the same machine.
DM me if you want a beta key.
---
Sorry for the snark, but I cringe at the monuments to complexity I see people building. At least this solution is relatively simple and free. Still, I don't really see what it buys me.
I get where this is coming from, and it's not a terrible solution, but VMs are still better in terms of security and isolation. Typical workstation systems are not designed to be secure from their own users, and frontier models are going to get scary good at cracking systems soon.
Fully sandboxed VMs are more secure, but not everyone is looking for the most secure option; they are looking for the option that works best for them. I want to be able to share my development environment with the agent. I have a project with thirty 1 GB SQLite databases and one 30 GB one. I back them up daily, and they can all be reconstructed from the code, but it takes a long time. When things change, I don't want to have to copy them into a separate VM, bloating my storage and using excess resources, and then have to reconcile the copies. I want to share the same environment with my agent so we can work side by side.
I would rather just have the agent not accidentally delete files outside of its working environment, but I am not worried about malicious prompt injection or someone stealing my code.
For me I see the LLM as a dumb but positive actor that is trying to do its best but sometimes makes mistakes, so I want to put training wheels on it while still allowing it to share my working space.
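A minimal sketch of that "training wheels" idea, assuming the agent's file operations can be routed through a wrapper: resolve every path and refuse anything that lands outside the working directory. `is_inside` and `guarded_delete` are hypothetical helper names. This only guards against accidents; it is not a security boundary against a malicious process.

```python
from pathlib import Path


def is_inside(workdir, target):
    """True if target resolves to workdir itself or to something beneath it."""
    root = Path(workdir).resolve()
    # Joining with an absolute target yields the target itself, and
    # resolve() collapses ".." segments and follows symlinks.
    candidate = (root / target).resolve()
    try:
        candidate.relative_to(root)
        return True
    except ValueError:
        return False


def guarded_delete(workdir, target):
    """Delete a file only when it lives inside workdir; refuse otherwise."""
    if not is_inside(workdir, target):
        raise PermissionError(f"refusing to touch {target!r} outside {workdir!r}")
    (Path(workdir).resolve() / target).resolve().unlink()
```

Because the check runs on the resolved path, a relative `../` escape or a symlink pointing out of the project is rejected as well.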
I have used a separate user, but lately I have been using rootless podman containers instead for this reason. But I know too little about container escapes. So I am thinking about a combination.
Would a podman container run by a separate user provide any benefit over the two by themselves?
I love using different users for separating services I run on the same box!
For development, I want to be able to access/run/modify/delete the files alongside the AI agent. This can be done if groups and group permissions are set correctly (and the agent correctly chmods everything...), but that feels more fiddly than just isolating it with bubblewrap, systemd, or whatever, and preserving the uid/gid.
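For reference, the group-permissions route looks roughly like this. It is a sketch under assumptions: the project directory already belongs to a group that both your account and the agent's account are members of, and `make_group_shared` is a hypothetical helper name.

```python
import os
import stat


def make_group_shared(path):
    """Add group rwx plus the setgid bit, so new entries inherit the directory's group."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IRWXG | stat.S_ISGID)
    return stat.S_IMODE(os.stat(path).st_mode)


def group_can_write(path):
    """True if the group write bit is set on path."""
    return bool(os.stat(path).st_mode & stat.S_IWGRP)
```

The fiddly part is everything this does not cover: both processes also need a umask like `002` so the files they create stay group-writable, and any tool that writes with restrictive modes breaks the arrangement again.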
Hey Senko, did you consider using ZFS or BTRFS snapshotting feature to simplify some of the things you need?
For GH auth tokens, you could also pull them outside the sandbox: have the agent push to a local clone exposed to the host, and have the host, with no agent involved, automatically push on inotify events inside the repo. E.g., the agent has access to your /agents/scratchpad/my-git-repo, and the sync to the actual git hosting service like GH (or Launchpad ;) happens with a simple script outside it.
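A rough sketch of that host-side script, with one swap stated plainly: it polls the refs instead of using inotify, since the Python standard library has no inotify binding. The remote name `origin`, the polling interval, and the function names are assumptions; the repo path is whatever you expose to the agent.

```python
import subprocess
import time


def ref_snapshot(repo):
    """Return `git for-each-ref` output, which changes whenever a branch moves."""
    result = subprocess.run(
        ["git", "-C", repo, "for-each-ref", "--format=%(refname) %(objectname)"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def push_command(repo, remote="origin"):
    """Command that pushes all branches of repo to the remote."""
    return ["git", "-C", repo, "push", remote, "--all"]


def relay(repo, remote="origin", interval=5.0, max_polls=None,
          snapshot=ref_snapshot, run=subprocess.run):
    """Poll repo and push to remote each time its refs change; return the push count."""
    last = snapshot(repo)
    pushes = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        polls += 1
        current = snapshot(repo)
        if current != last:
            run(push_command(repo, remote), check=True)
            last = current
            pushes += 1
    return pushes


# Runs on the host, outside the sandbox, where the credentials live:
# relay("/agents/scratchpad/my-git-repo")
```

The agent never sees the token; it can only move refs in the local repo, and the host decides what gets pushed.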
I use this amazingly niche and hipster approach of giving the agent its own account, which through inconceivably highly complex arcane tweaking and configurations can lock down what it can and can't do.
---
Can somebody for the love of god tell me why articles keep bringing up why this is so difficult?
I have antigravity in its own account and that has worked pretty well so far. I also use devcontainers for the cli agents and that has also worked out well. It's one click away in my normal dev flow (I was using this anyway before for python projects).
The solution to the security issue is using `useradd`.
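For anyone who has not done this before, a minimal sketch of the two commands involved. The account name `agent` and the CLI name `my-agent-cli` are placeholders, and the script only prints the commands rather than running them.

```python
import shlex


def create_user_cmd(name):
    """Command that creates a dedicated account with its own home directory."""
    return ["sudo", "useradd", "--create-home", "--shell", "/bin/bash", name]


def run_as_cmd(name, argv):
    """Command that runs argv as that account, with HOME set to its home (-H)."""
    return ["sudo", "-u", name, "-H", "--", *argv]


print(shlex.join(create_user_cmd("agent")))
# sudo useradd --create-home --shell /bin/bash agent
print(shlex.join(run_as_cmd("agent", ["my-agent-cli", "--help"])))
# sudo -u agent -H -- my-agent-cli --help
```

What the agent can then read or write comes down to ordinary file permissions on your own home and project directories.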
I would add subagents though. They allow for the pattern where the top agent directs/observes a subagent executing a step in a plan.
The top agent is better at directing a subagent, and its context stays clean of details that don't matter; otherwise they'd be in the same step in the plan.
There are lots of ways of doing subagents. It mostly depends on your workflow. That's why pi doesn't ship with anything built in. It's pretty simple to write an extension to do that.
The simple approach is great, chef's kiss, don't change a thing. Orchestration at the harness level tends not to be great anyhow, it's not built for the type of review that's needed.
I don't understand what these two have to do with anything? The db-use is almost trivial, and SQLite can be embedded. Why would we want wasted effort and configuration complexity on supporting Postgres?
With that kind of logic you wouldn't need headscale and would just ask your favorite LLM to write a similar tool for you with your own requirements and nothing else.
No, not really necessary to extrapolate the logic any further. You have deemed a very specific and focused task as "wasted effort." So the logic leads to putting in the effort you do not find "wasteful" and outsourcing the remainder to the LLM to do this very specific thing.
I'm very critical of all the schemes proposed but this is just a fundamental misconception on your part.
> If there was a legitimate drive to protect kids from the worst of the Internet
As with any disease, the impact heavily depends on virality.
The worst the internet has to offer to children is not the gore or porn for the few that look for it (usually individually).
The worst it does to children is the attention algorithm that captures practically everybody.
Exactly. The problem is no one wants to address that maybe some of these business models just need to go extinct.
Like maybe ad-supported infinite feeds can't be done in a socially responsible way and just need to be banned. If that takes down or substantially limits certain web service sizes... so be it.
While I agree with this, I also find that the "but think of the children" ironic retort also usually ignores the very real problems that technology can cause children (and society at large). In this issue in particular, if banning social media for children makes it less likely for adults to use it, I see it as pretty much a win-win.
Literally every society mandates tons of restrictions for children, because we understand that children aren't yet developed enough to be able to understand the full consequences of personal freedoms.
Should it also be the role of parents to prevent their children from being kidnapped by crime syndicates? Maybe we should also abolish schools because it should be the parents' role to educate their children.
This individualistic line of thinking is downright insane. It's preposterous. We live in a fucking society, no one can do anything on their own. For God's sake, parents shouldn't be expected to fight alone against MULTI-FUCKING TRILLION CORPORATIONS.
Fat load of help all that anti-regulation talk did when the current US Gov can just get all the data it wants from those megacorps.
Yeah let's also abolish laws preventing sale of tobacco and alcohol to children. This will surely lead to a prosperous nation.
If you look at the long-term trend in government intrusion into our personal lives, you'll see it's largely increasing, so if anything, the cause of any "collapse" would be the opposite of what you're purporting.
You can get pretty decent initial results if you explicitly tell them to first make a detailed description with exact coordinates and then feed the description back into them to build the SVG.
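A minimal sketch of that two-pass flow. Everything here is an assumption for illustration: `ask` stands in for whatever function sends a prompt to your model and returns its reply, and the prompt wording and 200x200 canvas are made up.

```python
def two_pass_svg(subject, ask):
    """Get a coordinate-level description first, then feed it back to get the SVG.

    ask is a hypothetical callable: prompt string in, model reply string out.
    """
    description = ask(
        f"Describe {subject} as a list of shapes with exact coordinates "
        "on a 200x200 canvas. Do not write any SVG yet."
    )
    svg = ask(
        "Write a single SVG document that draws exactly these shapes:\n"
        + description
    )
    return description, svg
```

Keeping the description around is useful on its own: it is much easier to review or hand-correct a list of coordinates than the finished SVG.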
Interesting take. I'm using btrfs (instead of ext4) with compression enabled (using zstd), so most of the files are compressed "transparently": they appear as normal files to applications, but on disk they are compressed, and the applications don't need to do the compressing/decompressing themselves.