
Say more? What kind of protocol do you mean, and what comes after browsers?


"turn my back on browsers" was perhaps hyperbole. I do want to make them irrelevant, but realistically I'm just targeting a subset of their domain. I'm still getting a handle on the idea, so pardon the lack of brevity. One day I'll be able to put it succinctly.

Working title: Semantic Paint.

---

The plan is to take what our web does poorly and do those things well. From that I have two main goals, the first of which is:

Permissionless Annotation -- I should be able to attach annotations to datasets (or subsets thereof) that I find in the wild without having write access to those datasets. Links are implemented as annotations (as are edits). Unlike our web, they are undirected and might link more than two things. Instead of a directed graph between documents we now have simplices which connect (sub)sequences of arbitrary bytes. These connections are typed (I'm calling these types "colors").

Have you ever played Mad Libs? It's a game with partial sentences, like: "____ had a great time ____ing the ______". Fun is had by filling in the words before you know the sentence and then laughing about how silly the result is. In semantic paint, colors are like that: they're tuples with an associated partial sentence, and each tuple element fills one blank. At any one moment, your client will be configured to display (or act on) one or more "colors". A color is a list of tuples in this form.

So you might have a 3-color:

______ (code) is malicious, writing it to stdin of ______ (executable code) with ______ (parameters) will write a non-malicious copy to stdout.

This color would be used for annotating malicious javascript with enough metadata to fix it automatically. It functions as a link between three items. If you have any one of them, you can find the other two.
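To make that concrete, here's a minimal sketch of how a color and a brushstroke might look as data (Python; all names here are hypothetical, not part of any spec):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Color:
        template: str   # partial sentence, one "____" per tuple element
        arity: int      # number of blanks

    @dataclass(frozen=True)
    class Brushstroke:
        color: Color
        values: tuple   # one anchor per blank (anchoring described below)

    fix_malicious = Color(
        template="____ (code) is malicious, writing it to stdin of ____ "
                 "(executable code) with ____ (parameters) will write a "
                 "non-malicious copy to stdout.",
        arity=3,
    )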

Here's the browser-killing part: at some point we stop annotating the malicious parts so that we can cleanse them, and instead we target the desirable parts so that we can make them more accessible. Embrace, enhance, extinguish.

Note that we're not talking about the filename or the server whence the malicious script came, we're talking about the data itself. Naming things is hard, so I want to see how far we can get without naming them at all. Instead, a user can just point at the thing without naming it, and apply paint--er, apply annotations--to the thing they're pointing at. It operates on fragments of data scraped from a screen, tee'd from a pipe, or OCR'd from a camera--not on files or other named abstractions.

The tuple values are pairs: a cryptographic hash, and a list of features that come out of a rolling hash (think rsync). The latter is used to re-anchor the tuple (brushstroke) even if the canvas is paginated differently or has other small differences. Fingers crossed: I can keep false positives down to a tolerable level.
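As a sketch of that two-part anchor: an exact SHA-256 plus a sparse set of Rabin-Karp rolling-hash features. The window size and selection mask below are arbitrary choices for illustration, not anything settled:

    import hashlib

    WINDOW, BASE, MOD, MASK = 48, 257, (1 << 61) - 1, 0xFF

    def anchor(data: bytes):
        exact = hashlib.sha256(data).hexdigest()
        features, h = [], 0
        power = pow(BASE, WINDOW - 1, MOD)
        for i, b in enumerate(data):
            if i >= WINDOW:                       # slide: drop outgoing byte
                h = (h - data[i - WINDOW] * power) % MOD
            h = (h * BASE + b) % MOD              # slide: add incoming byte
            if i >= WINDOW - 1 and (h & MASK) == 0:
                features.append(h)                # keep a sparse subset
        return exact, features

The features survive re-pagination because each one depends only on a 48-byte window, not on where the document starts or ends.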

For instance, if you copy some code from stackoverflow into your project, and later I annotate that code while browsing stackoverflow, you should then see my annotations on your code as viewed in your IDE (provided you have opted in to seeing my paint, and are running your IDE through a semantic paint client--software that sits between you and the IDE; the first draft is shaping up to resemble tmux).

One could imagine similar functionality on a piece of paper you found blowing in the wind. Point your camera at it, extract the features, query... maybe there are annotations on that text which will tell you more about its origins. If Fermat had had this tech, he wouldn't have complained about the margin being too small, he'd just have linked the proof with a brushstroke.

Imagine also people with allergies leaving annotations on menus at restaurants: "they say this doesn't have gluten, but it totally does," that sort of thing.

In this sense you can think of it as a sort of distributed search algorithm, where either a cryptographic hash of content, or this list of fuzzy-hash features, is the search query. You'd just sort of leave it running as a filter over whatever data you're working with. I kind of imagine it like augmented reality... for data.
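A toy version of that filter, assuming the anchor() sketch above and a pair of in-memory indexes (a real implementation would query peers rather than local dicts):

    from collections import Counter, defaultdict

    by_exact = {}                   # sha256 hex -> list of brushstrokes
    by_feature = defaultdict(list)  # rolling-hash feature -> brushstrokes

    def find_strokes(exact, features, min_overlap=3):
        if exact in by_exact:
            return by_exact[exact]   # exact content match
        votes = Counter(s for f in features for s in by_feature[f])
        # demand several independent feature hits to damp false positives
        return [s for s, n in votes.items() if n >= min_overlap]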

---

The second thing I want to do well is Partition Tolerance.

I want apps using this protocol to function without a persistent internet connection. They'll function slowly, but how up-to-date do you really need that blog post to be anyway? For most things, days or even weeks of latency are fine.

If you're in your car, stopped at a light, your device will be gossiping with others that are stopped at the same stoplight. Pedestrians in range may also end up participating. Delivery drivers put nodes on their vans, which silently gossip brushstrokes while the drivers deliver packages. Imagine a train full of people with gossiping devices... Sneakernet, on autopilot, even in a disaster or a protest.

Secure Scuttlebutt Protocol comes to mind here, but that's append-only. This is unordered. You just grab all of the brushstrokes you're interested in and provide strokes that your peers are interested in. Retention policies, and the algorithms for deciding what "interested in" means, are the domain of the app. Convergence will be hard to orchestrate, but that's no reason not to try (or maybe instead we diverge).
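One round of that unordered exchange might look like this (toy sketch; a peer is just a set of strokes plus the set of colors it cares about):

    from dataclasses import dataclass, field

    @dataclass
    class Peer:
        wants: set                          # colors this peer cares about
        strokes: set = field(default_factory=set)

    def gossip_round(a: Peer, b: Peer):
        a.strokes |= {s for s in b.strokes if s.color in a.wants}
        b.strokes |= {s for s in a.strokes if s.color in b.wants}

No log, no ordering, no head pointer to agree on: the merge is set union filtered by interest, which is commutative and idempotent, so repeated meetings can only move peers toward agreement.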

Peers come and go, but since everything is content addressed (cryptographically, or fuzzily), what really matters is whether those peers are interested in the same colors that you are. I know it sounds crazy ambitious, but if you don't have to protect the referential integrity of a globally consistent name, lots of problems go away.

The goal is to keep data nearest the things in the real world that it is relevant to. If you run across contradictory entries within a color, you can scrutinize by author (who do you trust more?) or by which peers gossiped it (which is more local?). I anticipate that handling trust explicitly like this (and focusing on data, not server names) will change the game re: misinformation.

One thing I like about this strategy is that you can synchronize this gossip (like cicadas synchronize their mating habits). I want to be ad-hoc wifi/bluetooth tolerant, but now imagine a node running in the cloud. Rather than leaving it on 24/7 so that you're ready to respond to a user at any moment, you can have your node on a 5 minute cycle: Sleep for 4:30, gossip for 0:30, repeat. That's paying ten cents instead of a dollar for server uptime. Yeah, users will have to tolerate data that's 5 minutes stale, but for most applications that's fine. If the data is relevant to them, it should make it to their node before they go to look for it.
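The duty cycle itself is trivial (sketch; gossip_once stands in for whatever exchange the node actually does):

    import time

    def run_node(gossip_once, cycle=300, active=30):
        while True:
            deadline = time.monotonic() + active
            while time.monotonic() < deadline:
                gossip_once()
            time.sleep(cycle - active)

The catch is coordination: the ten-cents-instead-of-a-dollar math only works if peers wake during the same 30-second window, which is why the cicada-style synchronization matters.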

Another benefit is that you can enlist your node to someone else's cause without having to talk with them first. Much like how IPFS lets you "pin" data published by somebody else so that that data doesn't go away if their node goes offline, you could instruct your node to notice square pegs and square holes and publish annotations about having fitted the peg in the hole. This means that if you don't want to pay somebody in money, you can instead pay them in operational support:

> I don't want to pay you $5 / mo. Instead I've been hosting way more than my share of your service on this stack of hard disks for a month. Let that be my payment.

I think there are some things that capitalism and zero-sum games are doing poorly, that cooperation and reciprocity could do well. That idea is not fundamental to the protocol, but it is fundamental to why I want to build the protocol.

----

Whew, I could go for pages and pages, but that's my best shot at a sketch. Thanks for asking, sorry it's not elevator-pitch-grade.


I like your madlib and paint metaphors. IMO that's the most important part of UI/UX: what metaphor people can use to apply their world models to an application that could technically work in infinite ways.

The file/folder metaphor has had fantastic success, but it leaves people constrained to hierarchical thinking when databases can actually represent arbitrary graphs, cycles and all. When prototyping my knowledge graph I wanted to build directly on existing filesystems, because readdir is fast and cached by the OS--a lot of the work is done already. But I was stymied by the fact that you can't hardlink a directory into multiple parent directories, because cycles aren't allowed, so I'm stuck with softlinks.

I also weighed the pros and cons of tuples (unnamed associations) and triples (subject verb object, the verb being the name of the association) and decided there's utility in having both without too much added complexity - tuples are just triples with NULL for a verb.

Let me ask you this - architecturally, I'm still weighing building on top of unix filesystems, with git for history/multi-user collab sync, VS sqlite and figuring out history/sync later. But take the abstract case of tables with columns: does it make more sense to have only two tables - one for associating hashes/uuids with strings/buffers, and a second table of "left id, right id" for links? Or is it worth the added complexity to create a new table for each type - a specified set of named attributes, each with a specified type - such that when you want to pull up all the metadata on an mp3 file you already know which table to read, instead of doing dozens of reads on the simpler model and then verifying the type ex post facto? Would love to know your thoughts on how to structure the graph on disk... (sketch of the simpler option below)
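To make the simpler of those two options concrete, the two-table model I have in mind is something like this (sqlite sketch; table and column names made up):

    import sqlite3

    db = sqlite3.connect("graph.db")
    db.executescript("""
        CREATE TABLE IF NOT EXISTS nodes (
            id   INTEGER PRIMARY KEY,
            hash BLOB UNIQUE,   -- content hash / uuid
            body BLOB           -- string or buffer
        );
        CREATE TABLE IF NOT EXISTS links (
            left  INTEGER NOT NULL REFERENCES nodes(id),
            right INTEGER NOT NULL REFERENCES nodes(id),
            verb  INTEGER REFERENCES nodes(id)   -- NULL = bare tuple
        );
    """)

versus one CREATE TABLE per type, each with named, typed columns (an mp3 table with title, artist, duration, and so on).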

fwiw the tagline for my project is "global media graph of associations, attributes, & annotations"


Thanks for the feedback. Working on this has been a lot of fun for me, but I'm also aware that it's too big to digest in a sentence or even a paragraph, so there are very few people I can talk about it with. From what you shared about yours it seems we're thinking in similar directions.

Aside:

I think it's a tragedy that BitTorrent is associated with piracy, because it's just a better way to move data around in general. To keep that from happening to me, I'm seeking a first application that is unlikely to ruffle feathers, so I've been taking a bioinformatics class. As I get familiar with workflows in the genomics/proteomics world, I'm finding that this:

> The file/folder metaphor has had fantastic success but it leaves people constrained to hierarchical thinking

is resoundingly true. We end up with filenames like "ncrassa.H3K9me3.ChIPseq.subtracted.merged.bed", which is not a name so much as a directed graph--a recipe for how this file came to be. Coming up with names for the resultant files at each stage in the graph, and keeping them straight, is a chore that could be dispensed with if what you were looking at was the graph we're all holding in our minds while we write these names.

Semantics Note:

> tuples are just triples with NULL for a verb.

The way I think of it, pairs are just triples with null for a verb. When I say tuples I mean things that can have arbitrarily many values. Anyhow, I think I know what you mean.

As for data representation, I don't really like triples in the subject, verb, object sense. It's very Semantic Web™; in particular, it's the vision of the semantic web which Doctorow argues has failed here: https://people.well.com/user/doctorow/metacrap.htm

I agree with his critiques, and I would add this one:

> People struggle enough to do their jobs as it is, if yours is a system that asks them to do something more in the name of useful metadata, you will be ignored.

If you have write access to important data, you have way too much responsibility to way too many people. We need to split that responsibility up so that different personae can handle different things. If I'm the one that wishes that this data had annotations, let me be the one to annotate it, and don't make me seek approval first.

A consequence of this is that you now need an "according to whom" field on all of your triples. Maybe you don't trust that particular author. Fine, quadruples then? Maybe, but when I actually try to get a complex problem to lay out nicely in this way I end up with a sort of predicate soup--which is how I ended up with my mad libs approach: arbitrary n-tuples. In some sense I'm passing the buck and making it the app developer's problem (which is, of course, also me, but wearing a different hat).

I'm not enamored of reddit these days, but when it was new it was quite innovative to just let any-old user create any-old subreddit. Most of them fizzled out into obscurity, but occasionally a community had its shit together, and those thrived. That's a dynamic that I want to lean into. If the data sucks, don't find a different protocol, find a different community of data curators using the same protocol.

So that relates to your question because unless you're going to be in-the-loop as arbitrator of everything, people are going to disagree, and there will be a certain amount of churn as they align themselves with different sides of that disagreement.

I have to imagine that at some point somebody is going to want to express something new. To borrow from my world, a mad lib:

> ___ and ___ played in superbowl ____ and ____ won (according to ____)

I'm pretty sure that this quintuple can be normalized down into some set of triples which sort of reference each other when the proposition needs to be reconstructed, but when you decompose it like that you lose the feeling that it's a single thing--a thing that can be trusted or not by a user--a thing that can be gossiped between users--a thing that can be purged once it's no longer needed. Having to reconstruct the composite while performing each of those operations may actually be more complex than just paying the up-front cost of storing richer tuples in the first place.
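Here's the tension in miniature (placeholders instead of real teams; the reified form is one conventional normalization, not the only one):

    # one atomic quintuple: gossip it, trust it, purge it as a unit
    stroke = ("team_a", "team_b", "game_n", "team_a", "author_x")

    # normalized into triples via an invented event node "e1"
    triples = [
        ("e1", "played_in",    "team_a"),
        ("e1", "played_in",    "team_b"),
        ("e1", "game",         "game_n"),
        ("e1", "winner",       "team_a"),
        ("e1", "according_to", "author_x"),
    ]

Every operation on the fact now has to gather five rows (and know that "e1" glues them together) before it can treat them as one thing.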

So I guess I have to ask: Is it definitely more complex to have a separate table for each type? Or does that complexity just bite you at different times?

I once had to test assertions about a rather large database over a rather slow connection. I was very fortunate that I could strip off the foreign keys and just flat-out ignore tables that were not related to my testing (which let me sync much less data). That's harder to do if the "real" things are expressed in terms of simpler things which now have to be teased apart.

My opinion is relevant to the "sqlite" side of your dilemma, I think. I'm personally a big fan of git and filesystems--they are among the most potent tools that I feel comfortable using. I also considered wrapping git to handle conflict resolution. But ultimately it's the road I didn't take, so I can't really weigh in there.

I avoided it for... weird reasons:

Generally, a git repo is run by somebody with authority. They accept or reject PRs in a rather top-down way. That feels like the "consistency" side of the CAP theorem, and I have chosen "partition tolerance" instead. It's an idea I got from Unison (https://www.unison-lang.org/): if you don't do globally unique names (such as the URL for a git repo), you don't get fights over where the authoritative name points. Instead, each fork is created with equal validity and it's up to the users to decide which one to install.
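The Unison idea in one line, as a sketch (glossing over the fact that Unison actually hashes normalized syntax trees, not raw bytes): a definition's "name" is just a hash of its content, so two forks can both exist without fighting over it.

    import hashlib

    def name_of(definition: bytes) -> str:
        return hashlib.sha256(definition).hexdigest()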

It's kind of like back in the day when BTC forked. The optics were that Segregated Witness won, and the losers of that fight went off and created BCH. But realistically it was just that there was first one protocol, and then there was a choice between two new protocols. The notion that one side of the conflict got to carry the old name is, well... not as content-addressed as I'd have liked.



