It's hard to tell without benchmarks how useful this is going to be, given that Perplexity's Comet landed as a dud.
Most of the other agentic Chrome extensions so far have used a vision-based approach and sensitive debugger permissions, so it's unclear whether Anthropic just repackaged their CUA model into an extension.
We are working on a DOM/Text Only AI Web Agent, rtrvr.ai:
- an AI Web Agent that autonomously completes tasks, creates datasets from web research, and integrates any APIs/MCPs – with just prompting and in your browser!
We had this exact thought as well: you don't need a whole browser to implement agentic capabilities; you can build the whole thing within the limited permissions of a browser extension.
There are plenty of zero-day exploit patches that Google rolls out immediately, not to mention all the other features Google doesn't push to Chromium. I wouldn't trust a random open source project for my day-to-day browser.
Check out rtrvr.ai for a working implementation, we are an AI Web Agent browser extension that meets you where your workflows already are.
Brave Browser (70M+ users) has validated that a Chromium fork can be a viable path. And it can, in fact, provide better privacy and security.
A Chrome extension is not a bad idea either. Just saying that owning the underlying source code has some strong advantages in the long term (being able to use C++ for the a11y tree, DOM handling, etc. -- which is 20-40x faster than injecting JS from a Chrome extension).
I personally talked to another agentic-browser player in the space, fellou.ai, asking how they keep up with all the Chromium pushes, since you need a dedicated team to handle the merges. They flat out told me they are targeting tech enthusiasts who aren't that interested in the security of their browser.
As an ex-Google engineer, I know the immense engineering effort and infrastructure it takes to develop Chrome. It is very implausible that two people can handle all the work required to ship a secure browser on top of 15+ million lines of constantly changing C++.
A sandboxed browser extension is the natural form factor for these agentic capabilities.
Also an ex-Google engineer here :) Rtrvr looks like a great product too!
Definitely understand that keeping up with security patches is important. And this is an engineering challenge, not an implausible one -- Perplexity is 1/1000th the size of Google and they could still build a better product. So, "you can just do things".
We are still on day 1 of launch and will only get better from here. We won't be two people forever; we plan to hire, expand the team, and take on the engineering challenges.
I have no skin in the game, but there are people using Dia (The Browser Company), and Dia is closed source, so it would be nice to see those people jump to BrowserOS at least.
I personally would prefer it as an extension, but there are some limitations with extensions, as the author of BrowserOS noted. I just wish Google/Chromium would land those changes upstream, I guess.
The prevailing wisdom in the agentic AI space is that progress lies in building standardized servers and directories for tool discovery (like MCP). After extensive development, we believe this approach, while well-intentioned, is a cumbersome and inefficient distraction. It fundamentally misunderstands the bottleneck of today's LLMs.
The problem isn't a lack of tools; it's the painful, manual labor of setting up, configuring, and connecting to them.
Pre-defined MCP tool lists/directories are inferior for several first-principles reasons:
- Reinventing the Auth Wheel: MCP's key improvement was supposed to be that you package a bunch of tools together and solve the auth issue at the server level. But the user still has to configure and authenticate to the server with an API key or OAuth.
- Massive Context Pollution: Every tool you add eats into the context window and risks context drift.
- Brittleness and Maintenance: If an API on the server-side changes, the MCP server must be updated.
- The Awkward Discovery Dance: It's a clunky user experience to discover the right server and then manually configure it, defeating the purpose of seamless automation.
We propose a more elegant solution: Stop feeding agents tool lists. Let them build the tool they need, on the fly.
Our insight was simple: the browser is the authentication layer. Your logins, cookies, and active sessions are already there. An AI Web Agent can just reuse these credentials, find your API key, and construct a tool to use it. If you have an API key on your screen, you have an integration. It's that simple.
Our AI Web Agent can now look at a webpage, find an API key, and be prompted to generate the necessary JavaScript tool to call the desired endpoint at the moment it's needed.
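To make this concrete, here's a minimal sketch of the kind of tool an agent can emit on the fly (names like `makeApiTool` and the `fetchImpl` parameter are illustrative, not our actual internals; the injectable `fetchImpl` just makes the tool exercisable without a live endpoint):

```javascript
// Hypothetical on-the-fly tool: given an API key scraped from the page and a
// base URL taken from the endpoint docs, wrap the endpoint in a callable.
function makeApiTool({ baseUrl, apiKey, fetchImpl = fetch }) {
  return async function callEndpoint(path, { method = "GET", body } = {}) {
    const res = await fetchImpl(`${baseUrl}${path}`, {
      method,
      headers: {
        // Bearer auth as used by HubSpot-style APIs; other APIs may differ
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: body === undefined ? undefined : JSON.stringify(body),
    });
    if (!res.ok) throw new Error(`API error ${res.status}`);
    return res.json();
  };
}
```

Because the tool is generated the moment it's needed, nothing about it has to live in a directory or be pre-configured; it exists only for the duration of the workflow.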
This approach:
- Replaces the manual steps of finding an MCP server, setting it up, and authenticating to it with a single prompt to the agent.
- Keeps the context window clean and focused on the task at hand.
- Removes maintenance/debugging overhead. An API update? Just point the agent at the new docs and it generates an updated tool.
We wrote a blog post that goes deeper into this architectural take and shows a full demo of our agent creating a HubSpot tool from an API key on the page, then using it in the same multi-step workflow to load contacts from LinkedIn.
We think this is a more scalable and efficient path forward for agentic AI.
We directly leverage the user's own browser, so there are no cloud-browser hosting or proxying costs! We averaged only $0.10/task.
The whole idea of cloud browser agents is a stupid paradigm. The agents are not only 7x slower, but you also pay hosting and proxying costs for all that extra time!
Our own biggest cost is just LLM inference, so we can let our users bring their own API key and use our service for free!
As others have mentioned, these agentic capabilities are fully achievable within a Chrome Extension.
In fact, we built one, rtrvr.ai, that has better Web Agent performance than OpenAI's Operator with human assistance and is 7x faster than the leading competitor: https://www.rtrvr.ai/blog/web-bench-results
Your accessibility-tree requirement is a poor excuse; instead, you should build up an agent from a first-principles understanding of DOM interactions.
A browser is a SERIOUS security risk; you need a dedicated team just to pull in the latest security patches Google pushes to Chromium, or your users are sitting ducks for exploits and hacks...
Manus and GenSpark showed the importance of giving AI agents access to an array of tools that are themselves agents, such as a browser agent, CLI agent, or slides agent. Users found it super useful to just input some text and have the agent figure out a plan and orchestrate execution.
But even these approaches face limitations: after a certain number of steps, the agent starts to lose context, repeat steps, or just go completely off the rails.
At rtrvr.ai, we're building an AI Web Agent Chrome extension that orchestrates complex workflows across multiple browser tabs. We followed the Manus approach of setting up a planner agent that calls abstracted sub-agents to handle browser actions, generate Sheets with scraped data, or crawl through the pages of a website.
But we also hit this limit of the planner losing competence after 5 or so minutes.
After a lot of trial and error, we found a combination of three techniques that pushed our agent's independent execution time from ~5 minutes to over 30 minutes. I wanted to share them here to see what you all think.
We saw the key challenge for AI agents as efficiently encoding/discretizing the state-action space of an environment. Building on this insight, we set up:
Smarter Orchestration: Instead of a monolithic planning agent holding all the context, we moved to a hierarchical model. The high-level "orchestrator" agent manages the overall goal but delegates execution and context to specialized sub-agents. It passes only the necessary context to each sub-agent, preventing confusion, and the planning agent itself isn't flooded with the full context of every step.
Abstracted Planning: We reworked our planner to generate as abstract a goal as possible for each step and fully delegate to the specialized sub-agent. This required making the sub-agents more generalized, able to handle ambiguity and a wider range of actions. Minimizing the planning calls themselves seemed the most obvious way to get the agent to run longer.
Agentic Memory Management: To reduce context for the planner, we encode the outputs of each step as variables that the planner can assign as parameters to subsequent steps. So instead of hoping the planner remembers a piece of data from step 2 to reuse in step 7, it just references step2.sheetOutput. This removes the need to dump outputs into the planner's context, preventing context-window bloat and confusion.
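As a rough illustration of the variable-passing idea (the `resolveParams` name and the `stepN.field` reference convention are a simplified sketch, not our exact implementation):

```javascript
// Sub-agent outputs live in a memory map keyed by step; the planner only ever
// sees references like "step2.sheetOutput", never the underlying data.
function resolveParams(params, memory) {
  const resolved = {};
  for (const [key, value] of Object.entries(params)) {
    // A string matching "stepN.field" is treated as a reference into memory
    const ref = typeof value === "string" && value.match(/^(step\d+)\.(\w+)$/);
    resolved[key] = ref ? memory[ref[1]][ref[2]] : value;
  }
  return resolved;
}
```

So when the planner schedules step 7 with `{ rows: "step2.sheetOutput" }`, the executor resolves the reference just before invoking the sub-agent, and the planner's own context never carries the scraped data.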
This is what we found useful but I'm super curious to hear:
How are you all tackling long-horizon planning and context drift?
Are you using similar hierarchical planning or memory management techniques?
What's the longest you've seen an agent run reliably, and what was the key breakthrough?
You should for sure do this for common post-processing tasks. However, you usually won't know at design time all the types of post-processing users will want to do with tool-call output.