I’ve idly wondered about this sort of thing quite a bit. The next step would seem to be taking a project’s implementation-dependent tests, converting them to an implementation-independent format and verifying them against the original project, then conducting the port.
Give a coding agent some software. Ask it to write tests that maximise code coverage (source coverage if you have source code; if not, binary coverage). Consider using concolic fuzzing. Then give another agent the generated test suite, and ask it to write an implementation that passes. Automated software cloning. I wonder what results you might get?
> Ask it to write tests that maximise code coverage
That is significantly harder to do than writing an implementation from tests, especially for codebases that previously didn't have any testing infrastructure.
Give a coding agent a codebase with no tests and tell it to write some, and it will; if you don’t tell it which framework to use, it will just pick one. There’s no denying you’ll get much better results if an experienced developer provides it with some prompting on how to test than if you just let it decide for itself.
If you’d actually tried this, and actually read the results, you’d know this does not work well. It might write a few decent tests, but get ready for an impressive number of tests and cases with no real coverage.
I did this literally 2 days ago and it churned for a while and spat out hundreds of tests! Great news, right? Well, no, they did stupid things like “create an instance of the class (new MyClass), now make sure it’s the right class type”. It also created multiple tests that created maps then asserted the values existed and matched… matched the maps it created in the test… without ever touching the underlying code it was supposed to be testing.
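For illustration only (hypothetical class and test names, not from my actual codebase), the useless pattern looked roughly like this:

    import unittest

    class MyClass:
        pass

    class TestMyClass(unittest.TestCase):
        def test_instance_is_right_type(self):
            # "Create an instance of the class, now make sure it's the right
            # class type" -- this can never fail and tests nothing.
            obj = MyClass()
            self.assertIsInstance(obj, MyClass)

        def test_map_values_match(self):
            # Builds a dict in the test and asserts against that same dict,
            # never touching the code under test.
            expected = {"a": 1, "b": 2}
            actual = dict(expected)
            self.assertEqual(actual, expected)

    if __name__ == "__main__":
        unittest.main()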
I’ve tested this on new codebases, old codebases, and vibe-coded codebases. The results vary slightly, and you absolutely can use LLMs to help with writing tests, no doubt, but “just throw an agent at it” does not work.
But did you actually give the agent access to a tool to measure code coverage?
If it can't measure whether it is succeeding in increasing code coverage, no wonder it doesn't do that great a job in increasing it.
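A minimal sketch of what that could look like: a script the test-writing agent can call to get a concrete coverage number back (assumes coverage.py and pytest are installed; the package name is a placeholder).

    import subprocess

    def measure_coverage(package: str = "mypackage") -> str:
        """Run the test suite under coverage.py and return the textual report."""
        # Run pytest under coverage measurement for the target package.
        subprocess.run(
            ["coverage", "run", f"--source={package}", "-m", "pytest", "-q"],
            check=False,
        )
        # The -m flag lists the missing (uncovered) line numbers per file,
        # which is exactly what the agent needs to target next.
        report = subprocess.run(
            ["coverage", "report", "-m"], capture_output=True, text=True
        )
        return report.stdout

    if __name__ == "__main__":
        print(measure_coverage())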
Also, it can help if you have a pair of agents (which could even be just two different instances of the same agent with different prompting) – one to write tests, and one to review them. The test-writing agent writes tests and submits them as a PR; the PR-reviewing agent reads the PR and provides feedback; the test-writing agent updates the tests in response to the feedback; iterate until the PR-reviewing agent is satisfied. This can produce much better tests than just an agent writing tests without any automated review process.
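Here's a rough sketch of that loop, with ask_agent as a stand-in for whatever agent API you're actually driving (so everything below is placeholder wiring, not any particular vendor's SDK):

    def ask_agent(role_prompt: str, message: str) -> str:
        # Placeholder: wire this up to your agent/model of choice.
        raise NotImplementedError

    def generate_reviewed_tests(code: str, max_rounds: int = 5) -> str:
        writer = "You write thorough unit tests for the given code."
        reviewer = "You review test suites critically. Reply LGTM only when satisfied."
        tests = ask_agent(writer, code)
        for _ in range(max_rounds):
            feedback = ask_agent(reviewer, tests)
            if "LGTM" in feedback:      # reviewer is satisfied, stop iterating
                return tests
            # Writer revises the tests in response to the review feedback.
            tests = ask_agent(
                writer,
                f"Revise these tests.\nFeedback: {feedback}\nTests:\n{tests}",
            )
        return tests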
This highlights something that I wish were more prevalent: path coverage. I'm not sure which testing suites handle path coverage, but I know XDebug for PHP could manage it back when I was doing PHP work. Simple line coverage doesn't tell you enough of the story, while path coverage should let you be sure you've tested all code paths of a unit. Mix that with input fuzzing and you should be able to develop comprehensive unit tests for critical units in your codebase. Yes, I'm aware that's just one part of a large puzzle.
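A tiny Python illustration of the gap (my own toy example): the two asserts below give 100% line coverage, and even 100% branch coverage, yet only two of the four possible paths through the function ever run.

    def classify(x: int, y: int) -> str:
        label = "big" if x > 10 else "small"    # branch A
        suffix = "pos" if y > 0 else "neg"      # branch B
        return f"{label}-{suffix}"

    # Every line and both sides of each branch are exercised...
    assert classify(20, 5) == "big-pos"
    assert classify(1, -3) == "small-neg"
    # ...but the "big-neg" and "small-pos" paths never execute,
    # which is exactly what path coverage would flag.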
I think I've asked this before on HN but is there a language-independent test format? There are multiple libraries (think date/time manipulation for a good example) where the tests should be the same across all languages, but every library has developed its own test suite.
Having a standard test input/output format would let test definitions be shared between libraries.
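As a rough sketch of the idea (this JSON format is made up, not an existing standard): test cases as plain input/expected-output data, which any language's harness could load and run against its own date library.

    import json
    from datetime import date, timedelta

    # Hypothetical shared test definitions; a Rust, Java, or JS harness
    # could consume the same file against its own implementation.
    SHARED_CASES = """
    [
      {"op": "add_days", "input": "2024-02-28", "days": 1, "expected": "2024-02-29"},
      {"op": "add_days", "input": "2023-02-28", "days": 1, "expected": "2023-03-01"}
    ]
    """

    for case in json.loads(SHARED_CASES):
        result = date.fromisoformat(case["input"]) + timedelta(days=case["days"])
        assert result.isoformat() == case["expected"], case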
I’ve got to imagine it would be very hard for a suite of end-to-end tests (probably most common: fixture file in, assert against an output fixture file) to nail all of the possible branches and paths. Like the example here, thousands of well-made tests are required.
I appreciate the even tempered question. I’ve been using mypy since its early days, and when pyright was added to vs code I was forced to reckon with their differences. For the most part I found mypy was able to infer more accurately and flexibly. At various times I had to turn pyright off entirely because of false positives. But perhaps someone else would say that I’m leaning on weaknesses of mypy; I think I’m pretty strict but who knows. And like yourself, mine is a rather dated opinion. It used to be that every mypy release was an event, where I’d have a bunch of new errors to fix, but that lessened over the years.
I suspect pyright has caught up a lot but I turned it off again rather recently.
For what it’s worth I did give up on cursor mostly because basedpyright was very counterproductive for me.
I will say that I’ve seen a lot more vehement trash talking about mypy and gushing about pyright than vice versa for quite a few years. It doesn’t quite add up in my mind.
I’ve added ecosystem regression checks to every Python type checker and typeshed via https://github.com/hauntsaninja/mypy_primer. This helped a tonne with preventing unintended or overly burdensome regressions in mypy, so glad to hear upgrades are less of an Event for you.
> I will say that I’ve seen a lot more vehement trash talking about mypy and gushing about pyright than vice versa for quite a few years. It doesn’t quite add up in my mind.
agreed! mypy's been good to us over the years.
The biggest problem we're looking to solve now is raw speed: type checking is by far the slowest part of our pre-commit stack, which is what got us interested in Ty.
I jumped through a bunch of hoops to get Claude Code to run as a dedicated user on macOS. This allowed me to set the group ownership and permissions of my work to control exactly what Claude can see. With a few one-liner bash scripts to recursively set permissions it worked quite well. Getting the OAuth token into that user's keychain was an utter pain though. Claude Code does a fancy authorization flow that puts the token into the current user's login keychain, and getting it into the other user's login keychain took a lot of futzing. Maybe there is a cleaner way that I missed.
When that token expired I didn't have the patience to go through it again. Using an API key looked like it would be easier.
My first programming "job" was a sort of summer internship when I was 14 for a family-owned company called Signature Systems (signature.net). They are still in business. Their product is an operating system called Comet, which, if I'm not mistaken, was originally a compatibility play bringing software from the previous era of 16-bit microcomputers onto DOS PCs, and then later into Windows. I may be misremembering some of the details but I think at one point a Comet system ran ticket sales at Madison Square Garden. My summer project was to build a demo using their new support for Windows GUI elements. The last time I spoke with the owners, they told me that they still had customers, including in the textiles industry where loom patterns had been coded in BASIC. I often think about it as an example of a legacy system, and smile at the idea of someone thinking they need to rewrite their plaid weave in TypeScript or Rust.
Separately, I have spent the last three years building a web app that replaced a heap of automation scripts for a ~50 person business. These were much more modern than what the OP describes but they had some of the same qualities. The scripts mostly generated Google Sheets and emails. The replacement is a Python web app using SQLite. Moving the company data into a proper database has been a very significant step for them. In some ways, the project feels a lot like custom business software that got built in the 90s.
One tidbit that I don't see mentioned here yet is that ATTACH requires a lock. I just went looking for the documentation about this and couldn't find it, especially for WAL mode (https://www.sqlite.org/lockingv3.html mentions the super-journal, but the WAL docs do not mention ATTACH at all).
I have a python web app that creates a DB connection per request (not ideal I know) and immediately attaches 3 auxiliary DBs. This is a low traffic site but we have a serious reliability problem when load increases: the ATTACH calls occasionally fail with "database is locked". I don't know if this is because the ATTACH fails immediately without respecting the normal 5 second database timeout or what. To be honest I haven't implemented connection pooling yet because I want to understand what exactly causes this problem.
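One thing worth trying (I can't promise it addresses this exact failure, but it's cheap): set busy_timeout explicitly on the connection before issuing ATTACH, and retry on "database is locked". Paths and schema names below are placeholders.

    import sqlite3
    import time

    def connect_with_attachments(main_db: str, aux_dbs: dict) -> sqlite3.Connection:
        conn = sqlite3.connect(main_db, timeout=5)
        conn.execute("PRAGMA busy_timeout = 5000")  # explicit, in milliseconds
        for name, path in aux_dbs.items():
            for attempt in range(3):                # crude retry on lock contention
                try:
                    # The filename can be a bound parameter; the schema name cannot.
                    conn.execute(f"ATTACH DATABASE ? AS {name}", (path,))
                    break
                except sqlite3.OperationalError as exc:
                    if "locked" not in str(exc) or attempt == 2:
                        raise
                    time.sleep(0.1 * (attempt + 1))
        return conn

    # conn = connect_with_attachments("main.db", {"aux1": "aux1.db", "aux2": "aux2.db"})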
> I have a python web app that creates a DB connection per request (not ideal I know)
FWIW, "one connection per request is bad" (for SQLite) is FUD, plain and simple. SQLite's own forum software creates one connection per request (it creates a whole forked process per request, for that matter) and we do not have any problems whatsoever with that approach.
Connection pools (with SQLite) are a solution looking for a problem, not a solution to a real problem.
There's nothing specific to read about it, just plenty of anecdotal evidence. People use connection pools because connecting to _remote_ databases is slow. SQLite _is not remote_. It's _in-process_ and _fast_. Any connection-pool _adds_ to the amount of work needed to get an SQLite instance going.
It's _conceivable_ that pooling _might_ speed it up _just a tad_ for databases with _very large schemas_ because parsing the schema (which is not done at open-time, but when the schema is first needed) can be "slow" (maybe even several whole milliseconds!).
I have never seen the word “partition” used in this way before. Hard to search for examples because unrelated computer graphics articles about surface partitioning dominate. I did find this:
Partitioning is the distribution of a solute, S, between two immiscible solvents (such as aqueous and organic phases). It is an equilibrium condition that is described by the following equation:
S(aq) ⇄ S(org)
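(The associated equilibrium constant, the partition or distribution coefficient, is standard textbook chemistry rather than part of the quoted text: K = [S]org / [S]aq, the ratio of the solute's concentration in the organic phase to its concentration in the aqueous phase.)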
Interesting to think that a surface can play a role comparable to a solvent. I wonder what a chemist would have to say about it.
I'm a materials scientist/chemist and the word partition made sense in this context. The VOC/solute is preferentially on surfaces vs floating in the air. This finding doesn't seem super surprising to me given the large surface area of all the stuff in a home.
In the UK a non-structural wall is called a partition wall -- they're usually plasterboard (I think that is called sheetrock in the USA) over wooden studs, whilst ordinary walls are plaster on brick/stone.
I wonder which partitions more VOCs/SOCs, partition walls or structural walls.
"Partition" is also used for separating a computer network into two or more disconnected networks; the P in the CAP theorem stands for "partition tolerance" (i.e. that a system can keep working in case its components end up in a partitioned network).
The trivial partition n = n also usually counts as a partition. This is useful if you want to be able to dualize partitions, and want n = 1 + 1 + ... + 1 to have a dual partition.
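A quick sketch of what dualizing (conjugating) a partition means, with the two examples above (my own illustration):

    def conjugate(partition: list) -> list:
        """Conjugate (dual) of an integer partition, given as a non-increasing list."""
        # The i-th part of the conjugate counts how many parts exceed i.
        return [sum(1 for p in partition if p > i) for i in range(max(partition))]

    n = 5
    assert conjugate([n]) == [1] * n    # dual of the trivial partition n = n
    assert conjugate([1] * n) == [n]    # dual of n = 1 + 1 + ... + 1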
Gypsum board is a considerably more specific, less generic, term than partition. My wooden house has some internal non-structural walls but none of them use gypsum boards (called plasterboard in British English).
Neither are they skimmed with plaster. They are instead faced with a very dense and flat hardboard.
Gypsum board is the term for a type of wall covering, which itself is part of a partition.
A partition is an interior wall assembly typically consisting of framing, (optional) insulation, and a wall covering (like gypsum board, but this could be anything: wood, shiplap, masonry, lath and plaster, etc.)
There are multiple levels of drywall finishing. A level 5 (highest grade) finish involves skim coating the entire gypsum wallboard with joint compound.
It’s not very common, but it is used in some commercial settings.
I think it would depend on what paint is used, although I would strongly suspect exposed porous surfaces like plaster, masonry, and drywall have a large reservoir capacity due to their surface area at the microscopic level.
In separation science a partitioning coefficient can be described for an undesirable contaminant, between a solid adsorbent having a certain degree of retention, versus a solvent where it is soluble to its own certain degree, under static equilibrium conditions.
IOW the smoke will have different affinity for different types of furniture, carpets, and window coverings, and when it comes in contact with these they soak it up like a sponge. Because the adsorbent materials are physically like a sponge more often than not, whether on a macro, micro, or molecular level.
The solvent is plain air, but the "solubility" of the raw smoke in air is not a factor because the smoke is not actually dissolved in the solvent (air) at this point, or ever really. The smoke consists of a lot of solid particles that have been forcefully dispersed into the air at uneven concentrations. The smoke itself is not a chemical contaminant that dissolves in the air; it's just dispersed in the air, not much differently than an unwanted chemical, at least for a good period of time.
But the solids will eventually settle if they are not purged beforehand. What you're left with after that is then chemical equilibrium.
In a confined enclosure, static equilibrium will eventually be reached between the amount of chemical contaminants dissolved in the air at that temperature and the amount adsorbed onto available surfaces, after which no more odor can be released from the furniture once the air is saturated. To really get rid of the smell you're going to have to replace the saturated air with fresh air, and one complete air exchange is not usually enough. Also, the more efficient the air exchange the better, and the fresher the better. If one person smoked one time, or you burned some popcorn and did not let out the smoke right away, that's not much contamination and it's not constant, but it's also not unusual to still smell it a week later when you first walk in from a fresh outdoor air environment. But just don't open the windows when something like a diesel truck is idling outside; new odor could then be coming in in greater quantities than the old odor can escape, one roomful at a time.
You may have grams of "odor" soaked into the carpet along with 100 grams of dirt & dust. But what if the chemical causing the odor only "evaporates" into the air a few milligrams at a time? The heavier the liquid, the slower the evaporation, and the resulting partitioning coefficient using air as a solvent is a very low number. And it's not too unintuitive to figure that things which are semi-solid like tars, or true solids like some pesticides, hardly evaporate at all but can really stink when there are only a few milligrams in the air.
Stuff like that is not going away without a solvent much stronger than air. A more concentrated solvent than a gaseous fluid can also make contact by the gram much faster than a gram of fresh air can eventually flow past the unwanted material to be removed.
Plain water may not be any better a solvent than air at dissolving cooking oils and tars, but you sure can get a lot more grams of it into contact with a surface or macro adsorbent quicker, compared to air as a gas.
Plain steam dissolves things so much better just from the added heat of the liquid turning it into a stronger solvent, plus so much of the water evaporates so fast at that temperature there is also a purging effect.
Then there's the carpet-cleaning liquids that can improve the partitioning coefficient of water so it will dissolve otherwise insoluble materials without nearly as much heat as steam. Like grams of detergent added to volumes of water to clean a certain area of carpet, or hundreds of grams of water-soluble organic solvent over the same area instead. Or both, simultaneously, or sequentially. Then when you do the math you see how much more effective sequentially is.
Now without doing any carpet cleaning, when you enhance the air exchange rate to do as good a job removing odors as that can accomplish, you are then trying to establish a dynamic equilibrium so odors are being purged outward at an enhanced rate due to increased fresh solvent (air) flow. Kind of like sequential carpet cleaning. One window fan blowing in and one blowing out at opposite ends of the structure can sometimes be more effective than all windows open whether or not using the same fans.
>I wonder what a chemist would have to say about it.
I wouldn't be surprised if people are still wondering :)
Edit: Hopefully they're wondering even more about a lot of things where they didn't know there were equations, actually ;)
Speaking only for myself, and in all sincerity: every year, there is some feature of the latest CPython version that makes a bigger difference to my work than faster execution would. This year I am looking forward to template strings, zstd, and deferred evaluation of annotations.
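For instance, with the deferred evaluation of annotations (my own small illustration, assuming the PEP 649 behaviour described for the new release), annotations that reference names not yet bound no longer blow up at definition time:

    # Under deferred evaluation, the annotations below are only resolved when
    # something actually introspects them, so no string quotes and no
    # `from __future__ import annotations` are needed.
    class Node:
        def link(self, other: Node) -> Node:
            self.next = other
            return other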
I’ll second this, and add that docstrings are becoming ever more useful as modern editors learn how to show them inline when I hover over a symbol. Starlette lacks docstrings entirely and it’s a real miss in my opinion.
I went from VS Code to Cursor, then got frustrated with Cursor breaking keybindings and other things, tried to go back to VS Code but missed the superior tab completion. Then I gave Zed a long hard try, but after over a month of daily usage I went back to Cursor again, just for the tab completion quality.
I don't use any of the chat or agent features, but for me Cursor's tab completion is a step forward in work efficiency that Zeta and Copilot were not. Sometimes it's subtle, and sometimes it is very obvious. Cursor seems to have sources of context that the others don't, like file names from the directory tree, and maybe even the relevant .pyi type annotations and docs for python modules. It also jumps to the next relevant problem site very effectively. It feels like the Cursor devs have done a ton of practical work that will be hard to match with anything other than a full-on competitive effort.
I want to see Zed succeed. I think it's very important that VS Code and its ultra-funded derivatives not dominate the modern editor landscape too thoroughly. Tab completion used to seem like a straightforward thing, but if the state of the art requires a very elaborate, whole-workspace-as-context environment to operate in, then I wonder if it's going to become a go big or go home kind of feature.
I can't help wonder what the actual internal API for this kind of thing is going to look like in the future. It used to be something like, what's the current token behind the cursor, and look in a big prefix tree of indexed words. Then maybe it got more elaborate with things like tree-sitter, like what's the incomplete parse tree up to this point. Then when editors started using AI, I stopped having any idea of what the actual inputs are. I'd love to hear about real implementation experience at any stage of this evolution.
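For the "big prefix tree of indexed words" era, the core really was about this simple (a toy sketch, obviously nothing like what any particular editor ships):

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.is_word = False

    class CompletionTrie:
        def __init__(self):
            self.root = TrieNode()

        def index(self, word: str) -> None:
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

        def complete(self, prefix: str) -> list:
            # Walk to the node for the prefix, then collect every word below it.
            node = self.root
            for ch in prefix:
                if ch not in node.children:
                    return []
                node = node.children[ch]
            results = []
            def walk(n, acc):
                if n.is_word:
                    results.append(prefix + acc)
                for ch, child in n.children.items():
                    walk(child, acc + ch)
            walk(node, "")
            return results

    trie = CompletionTrie()
    for word in ["request", "requests", "response", "result"]:
        trie.index(word)
    print(trie.complete("re"))   # every indexed word matching the prefix behind the cursor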
I think we don’t talk enough about tab completion model quality. Recently Copilot’s model got a lot better (probably trying to catch up to Cursor) but I feel like there’s still so much room here (and I assume Zed’s is worse from your description).
Smart context / big context is a really interesting question, I’m kind of surprised Google isn’t building here given how much effort they’ve put into big context (they have Jules and Gemini CLI but no tab completion UX).
On further thought I think one of the big 3 (OpenAI, Google, Anthropic) should partner (ideally not buy) with Zed to get a foothold.
For Copilot the quality of the tab complete is less of an issue for me than the fact that it is often very slow, or doesn't trigger at all when I would expect it to. I'll sit there feeling like an idiot for 10 seconds and then glance at the bottom bar to discover that it's not even doing inference, and have to randomly move the cursor, or delete and retype code until it finally works.
I have the opposite experience, tab completion by Copilot just got significantly worse for me recently (the last week or so), both for Rust and Python code.
I've been working on a better tab completion model that stays as an extension: https://ninetyfive.gg/
The main feature I really care about is low latency, which is my main gripe about Copilot. There's still a ways to go to match Cursor's quality but I'm chipping away at it!
I used Zed for about a year and a half exclusively, without using any AI features, and then switched to cursor to try AI features out. When Zed released its agent mode, I switched back to Zed.
I absolutely agree that Cursor’s tab completion is far superior to Zed’s. The difference is night and day. Cursor’s really is that good. But the Zed agent mode works very well for me and Zed is, IMHO, just so much better as an editor. I really hate having to use vscode or a vscode-based editor after using Zed so much (I used vscode exclusively before switching to Zed). And that’s enough for me to give up on the superior tab completion.
I hope Zed eventually improves theirs to a similar experience to Cursor’s, but regardless, I love Zed.
nah, I've been using zed exclusively for a couple of months and Zed's agent mode is still worse than Cursor's, but I do agree the quality gap is smaller than on tab completion
I'm in a very similar position, using Cursor just for their Tab model. My ideal choice would be Neovim, but I can't replicate the productivity I have with Cursor Tab.