
I worked with an admin like that. We had a huge cluster, but he was greedy with the storage space for a service that was critical for the operation of the org.

And I get that this is a good mindset for not wasting space overall, but if a single backup fills 90% of your storage space in test use, that machine is not ready for production. And we are not talking about a lot of space here. The backup was maybe 30 Gb, the disk 40 Gb. He could have easily just allocated 100 Gb and called it a day; instead we had to go to him 3 times to scale it up in 10 Gb steps each time, including the stress of figuring out why things were failing (something that the admin should have seen on his monitoring system).

Admins are my heroes, but please, if you allocate disk space, just take the biggest expected backup and multiply it by π. And if you need to be stingy with storage for some reason, be stingy, but decide when and where to be stingy — and at least keep an eye on the monitoring and upsize the storage before it is too late.



Having been on both sides (admin and developer), developers are notoriously bad at estimating how much space they need. You can't give them carte blanche on storage because they'll waste it and consume as much as they're given without a thought to conserving it. And then when you put limits in, they'll whine and complain until they get what they want. Being an Artifactory service provider for a large IT dept gave me a direct view into how hard it is to manage storage for developers. And as a developer using Artifactory, I don't want to worry about storage, I just want my builds and CI pipelines to complete.


This reminds me of a time we were helping a dev team bring logging in house because they didn't like the features of their logs-as-a-service provider.

They set all applications to "debug" level logs in production and were generating multiple gigabytes of logs per hour.

They wanted 90 days retention, and the ability to do advanced searching through the live log data so they could debug in production (they didn't really use their dev or stage environments, or have a process for documenting and reproducing bugs).


90 days retention is only 2,160 hours. Even at 999 GB/hr that is only ~2160 TB of storage. So, if we stretch the definition of “multiple gigabytes”, that is maybe $100k in storage, which is around 3-6 developer-months. If we use a more reasonable definition like 10 GB/hr, then that is 20 TB, so maybe $1k in storage, which is around 1 developer-day.
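The arithmetic, spelled out (assuming roughly $50/TB for bulk hot storage; every figure here is ballpark):

    # back-of-envelope retention cost; the $/TB figure is an assumption, not a quote
    HOURS = 90 * 24              # 90 days of retention = 2,160 hours
    COST_PER_TB = 50             # assumed bulk hot-storage price in USD

    for rate_gb_per_hr in (999, 10):
        tb = rate_gb_per_hr * HOURS / 1000
        print(f"{rate_gb_per_hr} GB/hr -> {tb:,.0f} TB -> ${tb * COST_PER_TB:,.0f}")
    # 999 GB/hr comes out around 2,158 TB and ~$108k; 10 GB/hr around 22 TB and ~$1,080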

Seems pretty reasonable to me.


A few years ago I joined a company aggressively trying to reduce their AWS costs. My jaw hit the floor when I realized they were spending over a million a month in AWS fees. I couldn't understand how they got there with what they were actually doing. I feel like this comment perfectly demonstrates how that happens.


AWS also purposefully makes it easy to shoot yourself in the foot. Case in point that we were burned on recently:

- set up some service that talks to a s3 bucket

- set up that bucket in the same region/datacenter

- send a decent but not insane amount of traffic through there (several hundred Gb per day)

- assume that you won’t get billed any data transfer fees since you’re talking to a bucket in the same data center

- receive massive bill under “EC2-Other” line item for NAT data transfer fees

- realize that AWS routes all of that traffic through the NAT gateway by default, even though it just turns around and goes back into the same data center it came from, and bills exorbitant fees for it

- come to the conclusion that this is obviously a racket designed to extract money from unsuspecting people, because there is almost no situation where you would want that routing by default, and discover that hundreds to thousands of other people have been screwed in the exact same way for years (and there is a documented trail of it[1])

1: https://www.lastweekinaws.com/blog/the-aws-managed-nat-gatew...
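For anyone else bitten by this: the usual mitigation is a gateway VPC endpoint for S3, so the bucket traffic never touches the NAT gateway. A minimal boto3 sketch, with placeholder IDs and region:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")    # region is an example
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",                         # gateway endpoints for S3 are free
        VpcId="vpc-0123456789abcdef0",                     # placeholder VPC ID
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-0123456789abcdef0"],           # placeholder route table
    )

Once the route table has the endpoint, S3-bound traffic from those subnets bypasses the NAT gateway entirely.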


If you're up for it... https://github.com/AndrewGuenther/fck-nat

Even has HA mode.


Developers will do the simplest thing to solve the problem.

If the solutions are:

* Rewrite that part to add retention, or use better compression, or spend the next month deciding which data to keep and which can be removed early

* Wiggle a thing in panel/API giving it more space

The second will win every single time unless there is pushback, or unless it hits the 5% of developers who actually care about making good architecture and not just delivering tickets.


They're pricing hot storage at $50/TB (not per month). That is definitely not AWS or anything like it.

On a per-month basis, the grossly exaggerated number is in the single-digit thousands of dollars. The non-exaggerated number is down in the double digits.

$50/TB is a lowball if you want much of the data to be on SSDs, but taking an analysis server and stuffing in 20 TB of SSD (plus RAID, plus room for growth) is a very small cost compared to repeated debugging sessions. Especially because the SSDs only have to handle about 0.01 DWPD.


Cloud truly monetizes the tar pit.


... only 2PB? You might be using a different scale than some of us.


Their scale was money. Saying something is "only" a single digit number of developer months makes sense in this context.

And that was a number hundreds of times higher than what they were replying to, just to make a point.


Plus, logs have enormous compression potential since their entropy is so low. That's the property exploited by every logging-as-a-service out there.
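A toy illustration with fake but plausible log lines (ratios will vary with real data, and dedicated log stores do better still by exploiting the template structure):

    import random
    import zlib

    # generate repetitive, access-log-ish text
    raw = "".join(
        f"2024-05-01 GET /api/v1/items/{random.randint(1, 999)} 200 {random.randint(1, 80)}ms\n"
        for _ in range(100_000)
    ).encode()

    packed = zlib.compress(raw, level=9)
    print(f"{len(raw) / 1e6:.1f} MB -> {len(packed) / 1e6:.1f} MB "
          f"(~{len(raw) / len(packed):.0f}x)")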


Related to that, last year Uber's engineering blog mentioned very interesting results with their internal log service [1].

I wonder if there's anything as good in the open-source world. The closest thing I can think of is Clickhouse's "new" JSON type, which is backed by columnar storage with dynamic columns [2].

[1] https://www.uber.com/en-BR/blog/reducing-logging-cost-by-two... [2] https://clickhouse.com/docs/en/integrations/data-formats/jso...


https://messagetemplates.org/

The design described there is what Uber should be logging in the first place. Instead they are logging the fully resolved message and then compressing back into the templated form.

However, compressing back into the templated form is a good idea if you have third-party logs that you want to store where you cannot rewrite the logging to generate the correct form in the first place.
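In code terms it's the difference between pre-formatting the message and passing the template plus arguments, which Python's stdlib logging already supports (positional rather than named placeholders, but the same idea):

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("orders")
    user_id, amount = 42, 19.99            # example values

    # resolved up front: the constant "shape" of the message is lost
    log.info(f"user {user_id} paid {amount}")

    # template + args: the shape stays separate from the variables,
    # which is what template-based storage and compression want
    log.info("user %s paid %s", user_id, amount)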


Neat! The only downside of this approach is having to force the developers to use the library, which can work in some companies. On the other hand, other approaches discussed previously like Uber's don't require any change in the application code, which should make adoption way simpler.


In what world is 2160TB $100k?

Current single disk solutions are around $25/TB for HDDs and ~$100/TB for NVMe.

At a minimum you're looking at $54k just for raw capacity, assuming no backup, no chassis, no networking, and no redundancy.

More reasonable estimations would be in excess of $400/TB.


Sure, whatever, a factor of 10 here or there hardly matters. I literally misinterpreted “multiple gigabytes per hour” as 999 GB/hr, not a much more reasonable 10 GB/hr. I overestimated the data rate by a factor of 100 and the number still comes out “reasonable”, i.e. a cost that can be paid if the cost/benefit is there.

Unless you want to claim storage costs $5,000/TB, “multiple gigabytes per hour” (about 3 MB/s of I/O) with 90-day retention for a team's worth of logging is not stupid on its face. Not to say that it is an efficient or smart solution, but it is certainly not the “look at this insane request by developers” that the person I was originally responding to was making it out to be.

Personally, I would probably question the competence of the team if they had that sort of logging rate with manual logging statements, but I am merely pointing out that “multiple gigabytes per hour” for 90 days is not crazy on its face and a plausible business case could be made for it even with a relatively modest engineering team.


My recent discussions with multiple SAN vendors as well as quoting out cost to DIY storage has that number being far away from "reasonable". I do not claim storage is $5,000/TB but it is substantially higher than the $50/TB you're estimating.

It's difficult to estimate the log throughput in this scenario. Cisco on debug all can overload the device's CPU; systems like sssd can generate MB of logs for a single login.

All of this is really missing the core issue though. A 2PB system is nontrivial to procure, nontrivial to run, and if you want it to be of any use at all you're going to end up purchasing or implementing some kind of log aggregation system like Splunk. That incurs lifecycle costs like training and implementation, and then you get asked about retention and GDPR.... and in the process, lose sight of whether this thing you've made actually provides any business value.

IT is not an end in itself, and if these logs are unlikely to be used, the question is less about dollars-per-developer-hour and more about preventing IT scope creep and the accumulation of cruft that can mature into technical debt.


But you wouldn't use a SAN here. SAN pricing is far away from reasonable for this situation.

For the 20TB case, you can fit that on 1 to 4 drives. It's super cheap. Plus probably a backup hard drive but maybe you don't even need to back it up.

For the 2PB case, you probably want multiple search servers that have all the storage built in. There's definitely cost increases here, but I wouldn't focus too much on it, because that was more of a throwaway. Focus more on the 20TB version.

> That incurs lifecycle costs like training and implementation

Those don't relate much to the amount of storage.

> and then you get asked about retention and GDPR....

It's 90 days. Maybe you throw in a filter. It's not too difficult.

> if these logs are unlikely to be used

The devs are complaining about the search features, it sounds like the logs are being used.

> preventing IT scope creep and the accumulation of cruft that can mature into technical debt

Sure, that's reasonable. But that has nothing to do with the amount of storage.


> Current single disk solutions are around $25/TB for HDDs

More like $15/TB. $100K for 2 PB of storage with redundancy and backups is quite reasonable.


I'm showing Exos x20 20TBs for ~$500 new.

$300 is moving towards refurb / shucked prices.


> I'm showing Exos x20 20TBs for ~$500 new.

Where? For new prices I'm seeing $350 at amazon, $350 at B&H, $280 direct from newegg, $280 at serverpartsdeals.


> In what world is 2160TB $100k?

When you buy a SAN to present a bunch of disks as one thing to the rest of the machines.


…what? Without any other context on what they’re working on or the size of the company, an extra developer’s worth of cost is automatically reasonable?


For their use case it sounds like they wanted to index the heck out of it for near instant lookups and similar too. So probably need to double the data size (rough guess) to include the indexes. And it may need some dedicated server nodes just for processing/ingestion/indexing, etc.


The idea that anyone would find storing 20TB of plain text logs for a normal service reasonable is quite amusing.

Don't get me wrong, I understand that a single-digit kUSD/month is peanuts against developer productivity gains, but I still wouldn't be able to take a developer making that suggestion seriously. I would also seriously question internal processes, GDPR (or equivalent) compliance, and whether the system actually brings benefit or if it is just lazy "but what if" thinking.


With silliness like that, you can bet the "feature" they didn't like about their logging provider was the cost.


This doesn’t seem terrible if the business benefits justify the costs. There is a cost/benefit to this, presumably.


You have to fill out a load chart to fly a plane; they should have to fill out something like a storage chart to get a production allocation. What size are your objects? How many per unit of time and served entity? What is the lifetime of those objects? How is that lifetime managed?
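Even a tiny worksheet would do. Something along these lines, where every figure is a made-up placeholder the requester has to fill in:

    # hypothetical "storage chart" -- all inputs are placeholders
    avg_object_kb   = 250        # what size are your objects?
    objects_per_day = 40_000     # how many per unit of time?
    retention_days  = 90         # what is their lifetime?
    growth_factor   = 1.5        # expected growth over the review period
    headroom        = 2.0        # safety margin on top of the estimate

    needed_gb = (avg_object_kb * objects_per_day * retention_days
                 * growth_factor * headroom) / 1e6
    print(f"request roughly {needed_gb:,.0f} GB")    # ~2,700 GB with these inputs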


If you agree to add a few months of development time and reduce future velocity to make sure these limits are enforced, sure. Usually adding storage costs about as much as one developer's salary for, what, an hour? A day?


It's not for saving storage. It's for making sure it will actually not overflow.

End number doesn't matter, what matters is developers thinking about how long data should be stored and what data should be stored.

Not doing that analysis and overprovisioning 4x will just cause disaster in 2 years instead of 6 months.


You missed the part where I said they are "notoriously bad at estimating". We really do suck at estimating everything... storage, work estimates, etc. Why can't we just say "it'll be done when it's done and I'll use ALL the storage until I'm done"?


I mean, in my case it was literally a database file filled with (dummy) records for the people who are currently in our org. So that database size was the size of the project. He just didn't plan for the size of backups (backups were his job, not ours).


Allocated storage should come directly from the consuming team's budget. Divvy up the total storage cost and allocate in proportion to requested limits.


Sure, what are you going to bill me for a 30GB VM on a 100TB cluster? Whether I want 30GB or 100GB for an absolute central service for the whole org shouldn't matter. If we are talking about personal pet projects or user accounts — sure — but that wasn't my complaint here.


> You can't give them carte blanche to the storage because they'll waste it

So what? Just buy more. Storage is cheap.

It's hard to have a discussion here without understanding the scales involved. Is the problem that they're wasting 100 GB or 100 TB? And if the issue is truly that they're wasting 100 TB, then clamp down on it as part of cost reduction efforts. The truth in most organizations is you get rewarded for eliminating mountains of waste, but trying to prevent the waste in the first place brands you as someone difficult to work with who is standing in the way. Why not lean into that?


Be stingy in a smart way. I'd like a description of why you need the storage, an estimate of how much, and a projection of growth over the next 6-12 months, though the latter can wait a month or three for something new. Beyond a certain scale we'd need a PO or a project to write up the cost, too. And yes, we'll start bugging you again once it starts filling up to 80% or more, because we don't want your systems to fail :)

But this way, I directly get an idea of the increase in storage we will need over the next year in order to plan the next hardware expansion.


This is why one should generally be using network/cloud storage, with soft/hard limits.

As soon as the soft limit is hit, fire off an alert. Have the hard limit set at double or more.


On my MacBook, I like to keep a few giant blank files that I can delete in a disk space emergency.


Since APFS, “large” empty files can take up only the ~4K for the fs entry on disk. You've got to make sure it's not a sparse file, and doing so is a bit tricky; better to fill it with junk that doesn't compress well.
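Something like this works for the ballast file; os.urandom output won't compress or store sparsely, so the filesystem genuinely has to keep it (size and path are arbitrary):

    import os

    CHUNK = 1024 * 1024                    # write 1 MiB at a time
    with open(os.path.expanduser("~/ballast.bin"), "wb") as f:
        for _ in range(2 * 1024):          # 2 GiB of incompressible junk
            f.write(os.urandom(CHUNK))
    # delete ~/ballast.bin when the disk fills up and you need breathing room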


And my mother sets all her clocks ahead 10 minutes so she's never late.


It is not an entirely unreasonable idea. If a system runs out of disk space, an unexpectedly large number of operations will fail. Which can make recovery more problematic than you would assume. If you can immediately recover some disk space and have breathing room, it could make the difference in restoring service.


Keeping a bit of disk reserved for recovery is extremely common with copy-on-write filesystems like ZFS & BTRFS. Even deletion takes some extra space, so without a reservation it's effectively impossible to delete any files from a full disk.


Or, you know, have alerts on disk space like adults.


This is an alert that must be fixed immediately, with an escape hatch to fix it quickly in case you truly don’t have time to manage your actual files.

It is also trivial to set up, and does not require me to figure out how to set up an OS alert, or to trust that whatever alerting process is actually running. So it is an essentially foolproof alert that works the same on any OS.

What’s a good argument against it?


In my experience this makes the problem worse. People either compensate for it, or stop trusting clocks at all. Usually a mix of both of those resulting in even less punctuality.


How about a cron job that checks disk usage once per day and prints the top 3 culprits by file type if du exceeds 90%?
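A minimal sketch of that (scanning the home directory as an example; a real version would point at whatever actually grows on the box):

    #!/usr/bin/env python3
    # cron this daily: warn when the disk is >90% full and show the three
    # file extensions eating the most space under a directory of interest
    import os
    import shutil
    from collections import Counter

    usage = shutil.disk_usage("/")
    if usage.used / usage.total > 0.90:
        sizes = Counter()
        for root, _, files in os.walk(os.path.expanduser("~")):
            for name in files:
                ext = os.path.splitext(name)[1] or "(no extension)"
                try:
                    sizes[ext] += os.path.getsize(os.path.join(root, name))
                except OSError:
                    pass
        for ext, size in sizes.most_common(3):
            print(f"{ext}: {size / 1e9:.1f} GB")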


Can't you set up overprovisioning so the storage controller can do something useful with it while you don't need the space?


This was a VM with a Ceph cluster.


I agree with most of your points, but with system resources, sometimes it is simply that you can give them, but you can never take them back.


If cost accounting is done correctly, business units will give them back willingly.


"It is difficult to get a man to understand something, when his salary depends upon his not understanding it." – Upton Sinclair


I don't get it... If you have a good reason to use 2 TB, I'm happy to allocate it for you.

If you just say "I want 20 GB of storage", I'm not going to give it to you.

Storage is cheap in relation to other things. Just have a good reason to why you need it.


> If you have a good reason

So, you are not expecting that your co-workers have good reasons for what they are doing? Maybe the hiring bar at your place is too low then.

I prefer to work at places where my default assumption is that everybody around me is smart and responsible. Lifts lots of worries off my shoulders (and tends to benefit the stock price over time too and thereby my income).


My coworkers have called out gaps in my thinking thousands of times when I have explained perceived needs to them, that's one of the main value-adds one gets from working in a team.

If I wanted unquestioned control, I'd run my own shop. If I want the best product, then I hope that people question my assumptions.


We are not in disagreement here. Bouncing off ideas and thoughts is a good thing.

The way this was phrased was more from the angle "who knows what these guys were thinking; if they can't give me a good reason, no way they will get storage space as I don't trust that they make good decisions on their own".


Generally? No. Not because they are not smart, but because in a large company each individual has different goals and priorities - that's why we have e.g. SREs as dedicated roles - and it takes a bit of effort to find the intersection between all of these.

Let's say I work in DevOps and want to optimize cloud costs. In that case, I would challenge the size of everything, the use of higher-costs services, the number of regions, all that - but the team might want more regions and bigger resources to improve latency and performance, and use more high-cost services for developer experience, and ship features without having to think about utilization.

It's a tug of war, and only works when you have forces on both sides to balance out. Being too conservative might stall innovation or make things too slow to save a buck, not being conservative enough might drain funds or make things impossible to scale.


> It's a tug of war,

Yeah, any workplace in which the word "war" was used in the context of colleague interaction saw me leave within a few months.

I like to plan those things ahead of time with all stakeholders involved, then we work together instead of against each other.


I believe you are intentionally misunderstanding. The term "tug of war" is not used to indicate armed conflict or even a problem. It indicates balancing forces that you want to maintain - pull the rope too far to one side, and you end up in a suboptimal extreme.

Unless you work with clones of yourself, there will always be differences in opinions and priorities, and not every feature and bug fix can be a company-wide stakeholder meeting, and you certainly will not get any social points for trying to micro-manage other teams.


Of course there will be differences. That's why you sit down and plan things together, pulling in and coordinating with all _relevant_ stakeholders. Of course not the whole company.

But the attitude needs to be "let's put the requirements on the table and see what we can do" instead of "you don't get what you want unless you give me a good reason". The latter comes from an angle of distrust which I'm arguing against. The former comes from an angle of collaborative problem solving.

In a company in which I go to a team relevant to a project and like to engage in a discussion and am met with an attitude of "unless you give us a good reason we'll stop talking to you", the atmosphere is not one that will keep me personally for long. YMMV.

> I believe you are intentionally misunderstanding.

You are free to believe what you like. Opening a reply with such a sentence is pretty sad though. It does not foster a healthy atmosphere, nor does it match reality, I might add.


> Opening a reply with such a sentence is pretty sad though. It does not foster a healthy atmosphere, nor does it match reality, I might add.

Your response got hung up on a single word ("war") within a common phrase ("tug of war", a game). While it might have been accidental, such answers distract from the actual discussion (and tend to be used as distractions when no good answer is present).

> Of course there will be differences. That's why you sit down and plan things together, pulling in and coordinating with all _relevant_ stakeholders.

When you discuss new architectures or large projects, this is a given, but it covers only a small portion of company operation - the rest is organic day-to-day work, which slowly but surely distorts the initial assumptions. Slowly boiling the frog, so to speak. Think of one team making changes that affect request patterns, another team making something that is accidentally quadratic, and a third team suddenly asking for a large number of cloud resources to carry it all, which should absolutely be challenged.

And at the same time, teams are under different organization units with different budgets, schedules, leaderships and priorities - and most certainly don't care about daily scrum work of other teams.

> In a company in which I go to a team relevant to a project and like to engage in a discussion and am met with an attitude of "unless you give us a good reason we'll stop talking to you", the atmosphere is not one that will keep me personally for long. YMMV.

No one said "we'll stop talking to you", but "you get what can be justified". If you take offense at being challenged and would rather work somewhere else, you do you, but if you can't justify your request I'd argue that you are not doing your job properly in the first place.


If your smart colleagues can't write a sentence like "we need an extra 1 TB for the next 3 years of growth", they are not smart and you're not either...


Why wouldn't you assume that if I'm asking for it I have a good reason?

Are you going to rearchitect my system for me?


Accountability? Is that attitude any different from just asking for money and refusing to explain what it’s for?


Wouldn't you expect to have to provide some level of justification if you were, say, requesting a new development machine?


There is a difference between spending $2000+ on a new computer and $10, which is about what a terabyte costs. Probably just having the discussion itself would waste more resources than just giving the storage space.


The meeting to bring in the relevant stakeholders and discuss that reasoning literally costs more than just fucking buying some cloud space.


Had a boss that would swoop into “suspicious” meetings.

“There’s 10 people here whose time I bill out at $250/hr each, spending an hour discussing whether to buy a $1,000 software license? Why?”


"Because you don't give us access to the financials, so we have no frigging idea if we can afford it, Frank."

"Some of us don't like jumping into things without looking."

I wager your boss would not be amused with me.


Dealing with the gatekeeping often costs more in dev time than just approving. Especially when DevOps thinks they know better - thank goodness a tech director can step in and bust the impasse.


We tried. Devs, when given more space, just didn't bother to clean up old crap and the exact same thing happened, just with a few months of delay. But these days we generally just ask how much they need and bill the project for it, so that's generally also on them.

But, unlike Toyota, we do have disk space alerts.

Sometimes the problem is also entirely political: management needs to tell the client and charge them for more storage, and won't accept the change until that happens. Meanwhile the clock is ticking...


In my case the data was a dummy database with more dummy users than are at our org (maybe 50% more). So once this went to production it would likely get smaller.

The problem in this case was twofold:

- the admin had the job of implementing database backups. He didn't factor in backup size when allocating disk space. So this was entirely his own fault.

- the database does store certain transactions for a certain period, so this grew initially until it settled at a certain level. Because the margin of storage was slim, this caused the problem.


...did you communicate any of that?

Because most of our cases where that happens were either a lack of planning or a lack of communicating that plan. By far the most common one was "neither dev nor client knows the data volume over a longer period". Which is fine as long as that's also communicated, but that's often a problem too.

But I'm not denying, of course, that there are just shitty, incompetent ops departments; for another customer we were dealing with an ops department that had:

* backup storage (which was some remote FTPS server IIRC) provisioned so slowly the backup wouldn't copy within 24 hours, and the backup size was below a TB

* weeks-long delays with any resize.


Nitpick: "b" is for bits, "B" is for bytes. Please don't casually mislabel units.


He was just training the users to go the shadow-IT route with self-bought, uncontrolled storage, most probably in the form of personal USB drives and/or departmental consumer NAS devices.


Just wondering, why pi?


I've tested both π and g, and while they both work well, g results in far fewer disk full errors. I've heard c works even better, though I haven't tried it yet.


> I've tested both π and g, and while they both work well, g results in far fewer disk full errors. I've heard c works even better, though I haven't tried it yet.

Good to know. FWIW i should also be avoided. It's tempting to use, since most programs use it as a counter, so it /should/ standardize the log file sizes. But in practice it's very tricky to get a definitive disk space requirement with it.



