Curious about the prediction market mechanic; that's the part most people skip.
We've been running something similar with Platypi: 6 agents on a simulated trading desk (paper money on Alpaca), specialized roles, coordinating exclusively via email. No dashboard, no human intervention. The coordination patterns that emerged were unexpected: agents developed implicit trust hierarchies, one risk manager consistently blocked the others, and disagreements resolved faster than they would on any human team. It's live here: https://platypi.empla.io
The architecture question that keeps coming up for us: specialization vs. redundancy. Do you run multiple agents with overlapping domains so they can sanity-check each other, or do you enforce hard boundaries? We found hard specialization creates blind spots that are hard to catch in real time.
What's your failure mode when two agents reach contradictory conclusions and there's no tiebreaker?
Supervision is the unlock. The pattern that works best for us: every agent action goes through a lightweight policy check before execution. Not a second LLM call — that's too slow and too expensive. A set of deterministic rules that catch the obvious failure modes (wrong format, out-of-scope action, exceeding token budget). The LLM handles the creative reasoning, the supervisor handles the predictable constraints. Think of it as the same reason you don't let a junior dev push to production without CI/CD. The agent is the dev, the supervisor is the pipeline. This approach cut our agent error rate by roughly 60% without adding meaningful latency.
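A minimal sketch of that deterministic pre-execution check. The rule names, action shape, and limits here are illustrative assumptions, not the actual Platypi implementation:

```python
from dataclasses import dataclass

@dataclass
class Action:
    tool: str          # which tool the agent wants to invoke
    payload: str       # serialized arguments
    est_tokens: int    # estimated token cost of executing this action

# Deterministic constraints -- no second LLM call, just rules.
ALLOWED_TOOLS = {"send_email", "place_paper_order", "read_inbox"}
MAX_TOKENS_PER_ACTION = 4_000
MAX_PAYLOAD_BYTES = 16_384

def check_scope(a: Action):
    # Catches out-of-scope actions.
    if a.tool not in ALLOWED_TOOLS:
        return f"out-of-scope tool: {a.tool}"

def check_budget(a: Action):
    # Catches actions that would blow the token budget.
    if a.est_tokens > MAX_TOKENS_PER_ACTION:
        return f"token budget exceeded: {a.est_tokens}"

def check_format(a: Action):
    # Catches empty or oversized payloads (wrong-format class of failures).
    if not a.payload.strip() or len(a.payload.encode()) > MAX_PAYLOAD_BYTES:
        return "payload empty or too large"

RULES = [check_scope, check_budget, check_format]

def supervise(action: Action):
    """Run every rule; return (approved, list of violations)."""
    violations = [msg for rule in RULES if (msg := rule(action))]
    return (not violations, violations)
```

The supervisor is pure function evaluation, so it adds microseconds, not a model round-trip; that's the whole point of keeping it deterministic.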
The biggest gap between demo and production isn't the model or the framework. It's three boring things: deterministic fallbacks (what happens when the agent fails or hallucinates), observability (can you trace exactly why an agent took action X), and cost controls (token budgets per task, not per call). Most teams get burned by the same pattern: the demo works beautifully, then in production you realize you need to handle the 15% of cases where the agent confidently does the wrong thing. The teams shipping successfully treat agents like junior employees, not autonomous systems. Guardrails first, autonomy second.
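One way to make "budgets per task, not per call" concrete. This is a sketch; the class and fallback names are made up for illustration:

```python
class TaskBudgetExceeded(RuntimeError):
    pass

class TaskBudget:
    """Tracks cumulative token spend across every LLM call in one task,
    so no single task can silently burn through the whole allowance."""
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int):
        self.spent += tokens
        if self.spent > self.max_tokens:
            raise TaskBudgetExceeded(
                f"task spent {self.spent} of {self.max_tokens} tokens")

budget = TaskBudget(max_tokens=10_000)
budget.charge(4_000)   # first LLM call: fine
budget.charge(5_000)   # second call: fine, 9,000 total
try:
    budget.charge(2_000)  # would hit 11,000: trip the budget
except TaskBudgetExceeded:
    fallback = "escalate-to-human"  # deterministic fallback path
```

Per-call caps miss the common failure where an agent loops and makes fifty individually cheap calls; a per-task ceiling catches it.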
The agents that make money share one trait: they replace a specific, repeatable human workflow that someone is already paying for. Not "AI assistant that does everything" but "this agent processes inbound leads and routes them with 94% accuracy, replacing 3 hours of daily manual work." The ROI calculation is trivial when framed that way. Where teams get stuck is building horizontal platforms before validating a single vertical use case. Pick one workflow, measure the before/after, ship it. The economics only work when the scope is narrow enough to be measurable.
The memory insight here is underappreciated. Once you cross the line from stateless chatbots to stateful agents, memory becomes infrastructure — not a feature you bolt on later. We learned this the hard way: agents that can't reason over what happened yesterday make the same mistakes on loop. The real shift in 2026 isn't better models, it's better state management. Persistent memory, proper context windows that survive across sessions, and failure recovery that doesn't require re-prompting from scratch. That's where the actual engineering challenge lives now.
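"Memory as infrastructure" can be as unglamorous as a durable event log the agent queries before acting. A minimal sketch using SQLite (schema and method names are my own, not from the post):

```python
import sqlite3
import time

class AgentMemory:
    """Durable event memory that survives process restarts, so the
    agent can reason over what happened yesterday instead of
    repeating the same mistakes on loop."""
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events "
            "(ts REAL, kind TEXT, detail TEXT)")

    def record(self, kind: str, detail: str):
        self.db.execute("INSERT INTO events VALUES (?, ?, ?)",
                        (time.time(), kind, detail))
        self.db.commit()

    def recall(self, kind: str, limit: int = 10):
        # Most recent first: what the agent loads into context
        # at the start of a new session.
        rows = self.db.execute(
            "SELECT detail FROM events WHERE kind=? "
            "ORDER BY ts DESC LIMIT ?", (kind, limit))
        return [r[0] for r in rows]
```

Pass a file path instead of `":memory:"` and the memory outlives the process, which is the whole "survive across sessions" property.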
The standard security framing on agents borrows from API security, which gets the threat model wrong. API security assumes you can enumerate what a system can do; access controls work because the action space is static. Agents are different: the action space expands dynamically based on the tools available and the instructions given at runtime. The priority for NIST should be distinguishing authorization (who can invoke an agent) from action scope control (what any invocation can trigger). Those are different security primitives, and most current frameworks don't address the second one.
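The distinction between the two primitives is easy to show in code. A toy sketch (agent names, users, and tool sets are all hypothetical):

```python
# Primitive 1 -- authorization: who may invoke which agent.
# Static and enumerable, exactly like classic API access control.
INVOKERS = {
    "alice": {"research-agent"},
    "bob": {"research-agent", "trading-agent"},
}

# Primitive 2 -- action scope: what any single invocation may trigger.
# Must be checked per tool call at runtime, because the effective
# action space grows with whatever tools the agent is handed.
AGENT_SCOPE = {
    "research-agent": {"web_search", "summarize"},
    "trading-agent": {"web_search", "place_paper_order"},
}

def authorize(user: str, agent: str) -> bool:
    return agent in INVOKERS.get(user, set())

def in_scope(agent: str, tool: str) -> bool:
    return tool in AGENT_SCOPE.get(agent, set())
```

Most frameworks stop after `authorize`; the gap is that an authorized invocation of `research-agent` should still be unable to trigger `place_paper_order`, no matter what the prompt says.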
The framing misses the middle ground where most real value lives: narrow agents with a single well-defined scope running on top of existing workflows. The biggest returns we've seen come from agents that handle one specific, high-frequency, error-prone task, not autonomous systems orchestrating a dozen capabilities. Orchestration overhead in broad autonomous setups often erases the labor savings. Specificity is the variable most teams skip when scoping an agent project, and it's usually the difference between something that ships and something that stays a demo.
We're running a live event at platypi.empla.io — a simulated trading desk where 6 agents coordinate entirely via email with no human in the loop. No shared conversation thread, no central orchestrator. Bozen (supervisor) gets a morning briefing from each PM agent, they argue about positions over email, Mizumo executes. The interesting thing isn't the trading — it's that email as coordination protocol produces naturally auditable, replayable agent behavior. Paper money on Alpaca, but the coordination infrastructure is the point.
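The "auditable, replayable" property falls out of email being an append-only message log. A rough sketch of what replay looks like, assuming a flat log of messages (field names are illustrative, not the Platypi schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Email:
    ts: float       # send timestamp
    sender: str
    to: str
    subject: str
    body: str

def replay(log, until=None):
    """Rebuild each agent's inbox from the append-only email log.
    Because every coordination step is a message, replaying the log
    up to any timestamp reproduces exactly what each agent had seen
    at that moment -- no hidden shared state to reconstruct."""
    inboxes = {}
    for msg in sorted(log, key=lambda m: m.ts):
        if until is not None and msg.ts > until:
            break
        inboxes.setdefault(msg.to, []).append(msg)
    return inboxes
```

Contrast with a shared conversation thread or central orchestrator, where you'd have to snapshot internal state to get the same audit trail.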
I'm Raffaele, co-founder with Emanuele. Happy to answer questions on the business side.
A few things we've learned so far: the most surprising part isn't that the agents trade, it's how they coordinate. Last night at 3am, Andrej (our crypto PM) noticed a missing sell order and emailed Mizumo (our CFO) to fix it. Done in 12 minutes. No human saw it until morning.
The experiment started as a stress test for our core product EMPLA: AI employees that work through email for small businesses. Platypi is us pushing the same architecture to the extreme: what happens when you remove humans entirely?
Happy to go deep on architecture, agent design, or lessons learned.