> Having LLMs book flights by interacting with the DOM is sort of like having them code a web app using assembly.
The DOM is merely inexpensive, but obviously the answer can't be solely in the DOM but in the visual representation layer because that's the final presentation to the user's face.
Also the DOM is already the subject of cat and mouse games, this will just add a new scale and urgency to the problem. Now people will be putting fake content into the DOM and hiding content in the visual layer.
I had the same thought that really an LLM should interact with a browser viewport and just leverage normal accessibility features like tabbing between form fields and links, etc.
Basically the LLM sees the viewport as a thumbnail image and goes “That looks like the central text, read that” and then some underlying skill implementation selects and returns the textual context from the viewport.
The DOM is merely inexpensive, but obviously the answer can't be solely in the DOM but in the visual representation layer because that's the final presentation to the user's face.
Also the DOM is already the subject of cat and mouse games, this will just add a new scale and urgency to the problem. Now people will be putting fake content into the DOM and hiding content in the visual layer.