Hacker News | binarylogic's comments

I agree with the framing. The goal isn't less data for its own sake. The goal is understanding your systems and being able to debug when things break.

But here's the thing: most teams aren't drowning in data because they're being thorough. They're drowning because no one knows what's valuable and what's not. Health checks firing every second aren't helping anyone debug anything. Debug logs left in production aren't insurance, they're noise.

The question isn't "can you do with less?" It's "do you even know what you have?" Most teams don't. They keep everything just in case, not because they made a deliberate choice, but because they can't answer the question.

Once you can answer it, you can make real tradeoffs. Keep the stuff that matters for debugging. Cut the stuff that doesn't.


The problem is that until I hit a specific bug, I don't know which logs might be useful. For every bug I've had to fix, 99% of the logs were useless, but I've had to fix many bugs over the years and each one needed a different set of logs. Sometimes I know in the code "this can't happen, but I'll log an error just in case" - when I see those in a bug report they are often a clue, but I often need a lot of info about things that happen normally all the time to figure out how my system got into that state.

"disk getting full" isn't useful unless you understand how/why it got full and that requires logging things that might or might matter to the problem.


There is a lot of crap that is, and always will be, useless when debugging a problem. But there is also a lot that you don't know whether you will need, at least not yet, not when you are defining what information you collect, and it may become essential when something in particular (usually unexpected) breaks. And then you won't have the past data you didn't collect.

You can take a path of discovery: can the data you collect explain how and why the system is running right now? Are there things that are just not relevant when things are normal but matter when they are not? Understanding the system, and all the moving parts, is a good guide for tuning what you collect, what you shouldn't, and what the missing pieces are. And keep cycling on that: your understanding and your system will keep changing.


Agree to an extent. There are absolutely unknown unknowns. But I think you'd be surprised how much data is obviously waste. Not the grey area, just pure garbage: health checks, debug logs left in production, redundant attributes.

That's why we break waste down into categories: https://docs.usetero.com/data-quality/categories/overview

But we don't stop there. You can go deeper with reasoning to root out the more nuanced waste. It's hard, but it's possible. That's where things get interesting.


Thank you! And you're right, it shouldn't cost that much. Financials are public for many of these vendors: 80%+ margins. The cost to value ratio has gotten way out of whack.

But even if storage were free, there's still a signal problem. Junk has a cost beyond the bill: infrastructure works harder, pipelines work harder, network egress adds up. And then there's noise. Engineers are inundated with it, which makes it harder to debug, understand their systems, and iterate on production. And if engineers struggle with noise and data quality, so does AI.

It's all related. Cheap storage is part of the solution, but understanding has to come first.


Thanks for the comment! Yes, I read that post. Great post. Feel free to reach out if you ever need help with Vector or have questions.


What you're describing is very real and it works to a degree. I've seen this same manual maintenance play out over and over for 10 years: cleaning dashboards, chasing engineers to align on schemas, running cost exercises. It never gets better, only worse.

It's nuts to me that after a decade of "innovation," observability still feels like a tax on engineers. Still a huge distraction. Still requires all this tedious maintenance. And I genuinely think it's rooted in vendor misalignment. The whole industry is incentivized to create more, not give you signal with less.

The post focuses on waste, but the other side of the coin is quality. Removing waste is part of that, but so is aligning on schemas, adhering to standards, catching mistakes before they ship. When data quality is high and stays high automatically, everything you're describing goes away.

That's the real goal.


Yeah, it's funny, I never went down the regex rabbit hole until this, but I was blown away by Hyperscan/Vectorscan. It truly changes the game. Traditional wisdom tells you regex is slow.

> I'm surprised it's only 40%.

Oh, it's worse. I'm being conservative in the post. That number represents "pure" waste without sampling. You can see how we classify it: https://docs.usetero.com/data-quality/logs/malformed-data. If you get comfortable with sampling the right way (entire transactions, not individual logs), that number gets a lot bigger. The beauty of categories is you can incrementally root out waste in a way you're comfortable with.
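
To make "entire transactions, not individual logs" concrete, here's a minimal sketch (illustrative Python, not Tero's actual mechanism): hash the trace ID so every log line in a transaction gets the same keep/drop decision.

    import hashlib

    SAMPLE_RATE = 0.10  # keep ~10% of transactions

    def keep(trace_id: str) -> bool:
        # Same trace ID -> same decision, so a transaction's logs
        # survive or are dropped together, never half-and-half.
        bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
        return bucket < SAMPLE_RATE * 10_000

    logs = [
        {"trace_id": "abc123", "msg": "GET /checkout 500"},
        {"trace_id": "abc123", "msg": "db connection reset"},
    ]
    kept = [log for log in logs if keep(log["trace_id"])]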

> compare logs from known good to known bad

I think you're describing anomaly detection. Diffing normal vs abnormal states to surface what's different. That's useful for incident investigation, but it's a different problem than waste identification. Waste isn't about good vs bad, it's about value: does this data help anyone debug anything, ever? A health check log isn't anomalous, it's just not worth keeping.

You're right that the dimensional analysis and pre-processing is where the real work is. That's exactly what Tero does. It compresses logs into semantic events, understands patterns, and maps meaning before any evaluation happens.
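
For a rough intuition of what compressing logs into patterns can look like, here's a toy sketch (masking variable tokens so structurally identical lines collapse into one template); the real pipeline is of course more sophisticated:

    import re

    def template(line: str) -> str:
        # Mask the variable parts (hex ids, numbers) so structurally
        # identical lines map to the same pattern.
        line = re.sub(r"\b[0-9a-f]{8,}\b", "<HEX>", line)
        line = re.sub(r"\d+", "<NUM>", line)
        return line

    template("user 4821 fetched order 99f3ab2c4d in 12ms")
    # -> 'user <NUM> fetched order <HEX> in <NUM>ms'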


> Traditional wisdom tells you regex is slow.

Because it's uncomfortably easy to create catastrophic backtracking.

But just logical-ORing many patterns together isn't one of the ways to do that, at least as far as I'm aware.
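
For anyone who hasn't hit it: the classic blow-up comes from nested quantifiers, not from alternation. A quick illustration with Python's backtracking engine (automaton-based engines like Hyperscan/Vectorscan don't have this failure mode):

    import re

    # Nested quantifiers: the engine tries exponentially many ways to
    # split the 'a's between the inner and outer groups before failing.
    re.match(r"^(a+)+$", "a" * 28 + "!")   # hangs for a very long time

    # Plain alternation of literals scales fine, even with many branches.
    re.match("foo|bar|baz|qux", "quux")    # fails fast, returns None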


> I think you're describing anomaly detection.

Well it's in the same neighborhood. Anomaly detection tends to favor finding unique things that only happened once. I'm interested in the highest volume stuff that only happens on the abnormal state side. But I'm not sure this has a good name.

> Waste isn't about good vs bad, it's about value: does this data help anyone debug anything, ever?

I get your point but: if sorting by the most strongly associated yields root causes (or at least, maximally interesting logs), then sorting in the opposite direction should yield the toxic waste we want to eliminate?
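
If it helps, here's a minimal sketch of that ranking (hypothetical template counts, smoothed bad/good ratio): sort descending for root-cause candidates, while the entries stuck near 1.0 are the ones that tell you nothing either way.

    from collections import Counter

    good = Counter({"GET /health 200": 90_000, "order created": 5_000})
    bad  = Counter({"GET /health 200": 90_000, "order created": 200,
                    "db connection reset": 40_000})

    def ratio(tpl):
        # >1: over-represented in the abnormal window; ~1: no signal either way.
        return (bad.get(tpl, 0) + 1) / (good.get(tpl, 0) + 1)

    for tpl in sorted(set(good) | set(bad), key=ratio, reverse=True):
        print(f"{ratio(tpl):10.2f}  {tpl}")
    # "db connection reset" floats to the top; the health check sits at ~1.0.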


Vectorscan is impressive. It makes a huge difference if you're looping through an eval of dozens (or more) regexps. I have a pending PR to fix it so it'll run as a wasm engine -- this is a good reminder to take that to completion.


But if you don't do anomaly detection, how can you possibly know which data is useful for anomaly detection? And thus, which data is valuable to keep?


100% accurate. It is very much political. I'd also add that the problem is perpetuated by a disconnection between engineers who produce the data and those who are responsible for paying for it. This is somewhat intentional and exploited by vendors.

Tero doesn't just tell you how much is waste. It breaks down exactly what's wrong, attributes it to each service, and makes it possible for teams to finally own their data quality (and cost).

One thing I'm hoping catches on: now that we can put a number on waste, it can become an SLO, just like any other metric teams are responsible for. Data quality becomes something that heals itself.
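
As a trivial sketch of what that could look like (hypothetical numbers and threshold, not a prescribed target):

    WASTE_SLO = 0.15  # e.g. no more than 15% of ingested bytes classified as waste

    def waste_slo_ok(ingested_bytes: int, waste_bytes: int) -> bool:
        # Treat the waste ratio like any other SLO: measure it per service,
        # alert the owning team when it's breached.
        return waste_bytes / max(ingested_bytes, 1) <= WASTE_SLO

    waste_slo_ok(ingested_bytes=10_000_000, waste_bytes=4_000_000)  # False -> breach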


I'd be shocked if you can accurately identify waste, since you are not intimately familiar with the product.

Sure, I've kicked over what I thought was waste, but was told it's not, or "it is, but deal with it, Ops".


You're right, it's not always binary. That's why we broke it down into categories:

https://docs.usetero.com/data-quality/logs/malformed-data

You'd be shocked how much obviously-safe waste (redundant attributes, health checks, debug logs left in production) accounts for before you even get to the nuanced stuff.

But think about this: if you had a service that was too expensive and you wanted to optimize the data, who would you ask? Probably the engineer who wrote the code, added the instrumentation, or whoever understands the service best. There's reasoning going on in their mind: failure scenarios, critical observability points, where the service sits in the dependency graph, what actually helps debug a 3am incident.

That reasoning can be captured. That's what I'm most excited about with Tero. Waste is just the most fundamental way to prove it. Each time someone tells us what's waste or not, the understanding gets stronger. Over time, Tero uses that same understanding to help engineers root cause, understand their systems, and more.


I would like to just have a storage engine that can be very aggressive at deduplicating stuff. If some data is redundant, why am I storing it twice?
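
For concreteness, the kind of thing I have in mind (a toy content-addressed sketch, not tied to any particular engine):

    import hashlib

    store: dict[str, bytes] = {}   # chunk hash -> chunk bytes (stored once)
    stream: list[str] = []         # the log stream as a sequence of chunk hashes

    def append(chunk: bytes) -> None:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # redundant chunks cost only a hash
        stream.append(digest)

    append(b'{"level":"info","msg":"GET /health 200"}')
    append(b'{"level":"info","msg":"GET /health 200"}')  # deduplicated
    len(store)   # 1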


That's already pretty common, but the goal isn't storing less data for its own sake.


> the goal isn't storing less data for its own sake.

Isn't it? I was under the impression that the problem is the cost of storing all this stuff.


Nope, you can't just look at the cost of storage and try to minimize it. There are a lot of other things that matter.


What I am asking is: what are the other concerns, other than literally the cost? I'm interested in this area, and I keep seeing everyone say that observability companies are overcharging their customers.


We're currently discussing the cost of _storage_, and you can bet the providers already are deduplicating it. You just don't get those savings - they get increased margins.

I'm not going to quote the article or other threads here to you about why reducing storage just for the sake of cost isn't the answer.


Well, that's a weirdly confrontational reply. But thanks


Thank you for the nice comment. I'm glad you enjoy Vector. I poured myself into that software for many years. I'm a bit bummed with its current trajectory, though. We hope to bring the next evolution with Tero. There were many problems with Vector that I wished I could have fixed but was unable to. I hope to do those things with Tero (more to come!)

And yes, Tero is fundamentally a control plane that hooks into your data plane (whatever that is for you: OTel Collector, Datadog Agent, Vector, etc). It can run on-prem, use your own approved AI, and stay completely within your network, completely private.


Appreciate the reply! Have you decided on a license yet?


Hey Peter, I absolutely remember you! Thanks for the nice comment.

And yes, data waste in this space is absurdly bad. I don't think people realize how bad it actually is. I estimate ~40% of the data (being conservative) is waste. But now we know - and knowing is half the battle :)


I spent a decade in observability. Built Vector, spent three years at Datadog. This is what I think is broken with observability and why.


And how are you solving the problem? The article does not say.

> I'm answering the question your observability vendor won't

There was no question answered here at all. It's basically a teaser designed to attract attention and stir debate. Respectfully, it's marketing, not problem solving. At least, not yet.


There's more information here: https://docs.usetero.com/introduction/how-tero-works (the link in the article is broken).

They determine which events/fields are not used and then add filters to your observability provider so you don't pay to ingest them.


What’s the differentiation vs., say, Cribl? Telemetry pipeline providers abound.


The question is answered in the post: ~40% on average, sometimes higher. That's a real number from real customer data.

But I'm an engineer at heart. I wanted this post to shed light on a real problem I've seen cause a lot of pain over a decade in this space, not to write a product walkthrough. The solution is very much real, though. There's deep, hard engineering going on: building semantic understanding of telemetry, classifying waste into verifiable categories, processing it at the edge. It's not simple, and I hope that comes through in the docs.

The docs get concrete if you want to peruse: https://docs.usetero.com/introduction/how-tero-works


I would contend that it is impossible to know a priori what is wasted telemetry and what isn’t, especially over long time horizons. And especially if you treat your logs as the foundational source of truth for answering critical business questions as well as operational ones.

And besides, the value isn’t in knowing that the waste rate is 40% (and your methodology isn’t sufficiently disclosed for anyone to evaluate its accuracy). The value is in knowing what is or will be wasted. It’s reminiscent of that old marketing complaint: “I know that half my advertising budget is wasted; I just don’t know which half.”

Storage is actually dirt cheap. The real problem, in my view, is not that customers are wasting storage, but that storage is being used inefficiently, that the storage formats aren’t always mechanically sympathetic and cloud-spend-efficient for the ways the data is read and analyzed, and that there’s still this culturally grounded, disparate (and artificial) treatment of application and infrastructure logs vs business records.


I'm curious about the deep details, but the link 404s.


My apologies, I fixed the link. So much for restructuring the docs the night before posting this.

You can read more here: https://docs.usetero.com/data-quality/overview

To loosely describe our approach: it's intentionally transparent. We start with obvious categories (health checks, debug logs, redundant attributes) that you can inspect and verify. No black box.

But underneath, Tero builds a semantic understanding of your data. Each category represents a progression in reasoning, from "this is obviously waste" to "this doesn't help anyone debug anything." You start simple, verify everything, and go deeper at your own pace.

