
I'm sorry but this was easily the most pretentious book I've read all year.


Do you disagree with any of the six points I posted from the book?


There were several good suggestions clearly expressed, but I agree the signal-to-noise ratio was not great--also true for a lot of books and articles you can still learn from.

What would you recommend as better books or articles on the topic of structuring serious conversations?


Why hasn't their stock plummeted like Google's?


> This section explains the leader election algorithm at a high level. It is by no means exhaustive and deliberately avoids any formal specification or proof. Readers looking for an exhaustive explanation should refer to the Raft paper, which acts as a strong inspiration for BlazingMQ’s leader election algorithm.

So their own homegrown leader election algorithm?

> BlazingMQ’s leader election and state machine replication differs from that of Raft in one way: in Rafts leader election, only the node having the most up-to-date log can become the leader. If a follower receives an election proposal from a node with stale view of the log, it will not support it. This ensures that the elected leader has up-to-date messages in the replicated stream, and simply needs to sync up any followers which are not up to date. A good thing about this choice is that messages always flow from leader to follower nodes.

> BlazingMQ’s elector implementation relaxes this requirement. Any node in the cluster can become a leader, irrespective of its position in the log. This adds additional complexity in that a new leader needs to synchronize its state with the followers and that a follower node may need to send messages to the new leader if the latter is not up to date. However, this deviation from Raft and the custom synchronization protocol comes in handy because it allows BlazingMQ to avoid flushing (fsync) every message to disk. Readers familiar with Apache Kafka internals will see similarities between the two systems here.

"a new leader needs to synchronize its state with the followers and that a follower node may need to send messages to the new leader if the latter is not up to date". I thought a hallmark of HA systems was fast failover? If I come to your house and knock on the door, but it takes you 10mins to get off the couch to open the door, it's perfectly acceptable for me to claim you were "unavailable". Pedants will argue the opposite.


"deliberately avoids any formal specification or proof"

how is this possibly a selling point?


FWIW they mention this at the bottom of their document

> Just like BlazingMQ’s other subsystems, its leader election implementation (and general replicated state machinery) is tested with unit and integration tests. In addition, we periodically run chaos testing on BlazingMQ using our Jepsen chaos testing suite, which we will be publishing soon as open source. We have also tested our implementation with a TLA+ specification for BlazingMQ’s elector state machine.


One of the authors here. Thanks for pointing that out. TLA+ spec can be found here -- https://github.com/bloomberg/blazingmq/tree/main/etc/tlaplus.


How is this "deliberately avoiding any formal specification or proof"?


I think they are referring to the specific section below the notice.


Correct


> "deliberately avoids any formal specification or proof"

> how is this possibly a selling point?

Context.

In an overview doc? The informal version is accessible to more audiences.

As the canonical design doc for the system? It's certainly not.


Apologies! I was only blindly responding to the parent comment...

FWIW the phrase seems to come from here: https://bloomberg.github.io/blazingmq/docs/architecture/elec...

The full sentence reads:

"This section explains the leader election algorithm at a high level. It is by no means exhaustive and deliberately avoids any formal specification or proof."

Which makes a lot more sense.

And as noted in a sibling comment there is actually a TLA+ spec for the leader election: https://github.com/bloomberg/blazingmq/tree/main/etc/tlaplus


HA aside, is this faster than Aeron?


Lots of haters on the part 1 segment of this article wondering "wtf does this VP even do?!"

> I have friends who are line managers at larger companies who take home more than I do in my current role

I wonder if this is the "liquid" portion of Emily's comp. If it's the "total" comp then that tells me a VP Eng at a Series D startup makes < $400k all in??


Startup equity is nothing more than a lottery ticket unless there is a robust secondary market for vested options (probably mostly applicable to unicorns).


I was one of them. This post is what I was looking for in part 1.


What's min-maxing?


It's a concept that comes from RPGs, or any sufficiently complex optimization problem where you have a limited number of total points to spend, so you "min"imize the least helpful stats and "max"imize the most helpful. For a fighter character, you'd obviously max out strength, as well as constitution and dexterity. You'd minimize intelligence, charisma, and wisdom - not because they're not helpful, but because you only have a limited number of points to spend. In the context of:

> The trick to effective min-maxing in social contexts is to make sure people can't tell you're min-maxing.

It means that, while in interview situations there is an expectation that you say "I did this", in social situations you might get more benefit from seeming more humble and appearing to withhold your accomplishments - perhaps doing it in a way where someone else fills in the gaps for you, or where it entices the other parties into doing a bit of digging on their own and finding some well-placed bios online that look like they weren't written by you / at your behest, which do your bragging for you.

This avoids the situation where someone will say: "I work on Copilot now, and I can say this guy is totally full of himself." as other people respond "huh. yeah. that makes sense."


It's a synonym for "optimizing" or "playing the game perfectly".


2023 slang for "optimizing"; comes from: https://en.m.wikipedia.org/wiki/Minimax


Not 2023, and not from the AI thing. It comes from RPGs, like what the sibling comment explained. It must be as old as D&D.


While it may have existed for a long time, it's gotten significantly more use recently in non-RPG settings. Quick illustration from HN's comment section[1]:

- From 2010 to 2019: 40 occurrences in HN's comments

- Compare that to over 70 for just the past two years.

Basically, it's as if you said "Woke" wasn't a contemporary word because it had existed in a niche for a long time.

[1]: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...


FWIW, while I don't know the origin of this term, I learned it first as a noun, used in the context of AI - the old-fashioned AI, as applied to game development, some 15 years ago. The verb form, however, is something I've only noticed people using in the last few months; it's possible this is due to some recent events that made the term more widely known / popular.


I have never seen min-max used as a noun and I'm not really sure what it'd even mean as a noun, but I've been using it as a verb for several decades.


It's an old-school algorithm family for determining optimal play in turn-based PvP games.
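
For a rough idea of what that family looks like, here is a minimal sketch of the plain minimax recursion (no alpha-beta pruning) on a made-up toy game; the game rules exist only so the example is self-contained.

    # Toy game: players alternate taking 1 or 2 stones; taking the last stone wins.
    # minimax() returns +1 if the maximizing player wins with perfect play, else -1.
    def minimax(stones, maximizing=True):
        if stones == 0:
            # The previous player took the last stone, so the player to move lost.
            return -1 if maximizing else 1
        outcomes = [minimax(stones - take, not maximizing)
                    for take in (1, 2) if take <= stones]
        return max(outcomes) if maximizing else min(outcomes)

    print(minimax(3))   # -1: with 3 stones, the player to move loses under perfect play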



These roles likely comprise less than 5% of all engineers at big tech companies so I would be pretty surprised if such a book exists.


nit: Cockroach was founded by a Xoogler but there's no public evidence that they were on the Spanner team at any point.


IIRC Kimball worked on the Bigtable team and was working on Spanner in the very early days.


Exactly Once = At least once + Idempotence
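
As a toy sketch of that equation (and nothing more): an at-least-once consumer made to look exactly-once by deduplicating on a message ID. The in-memory set and the apply() function are stand-ins; a real system would persist the seen IDs atomically with the side effect.

    # Sketch only: at-least-once delivery + dedup on a message ID.
    # In production the "seen" set must be durable and updated in the same
    # transaction as the side effect, or the problem just moves elsewhere.
    seen = set()

    def apply(msg):
        print("doing the work for", msg["id"])   # stand-in for real business logic

    def handle(msg):
        if msg["id"] in seen:
            return               # redelivered duplicate; already applied
        apply(msg)
        seen.add(msg["id"])

    handle({"id": "m-1"})
    handle({"id": "m-1"})        # second delivery of the same message is a no-op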


Which the author admits three quarters of the way through:

> The way we achieve exactly-once delivery in practice is by faking it. Either the messages themselves should be idempotent, meaning they can be applied more than once without adverse effects, or we remove the need for idempotency through deduplication.

Honestly I don't get why this is "faking it" though. It seems like the author's definition of "exactly once" is so purist as to essentially be a strawman. This is "exactly once" in practice.

Like are there other people claiming that this purist version of exactly-once does exist?


> Like are there other people claiming that this purist version of exactly-once does exist?

In my experience, the purist version of "exactly-once" exists as a vague, wishy-washy mental model in the brains of developers who have never thought hard about this stuff[0]. Like, once you sketch out why idempotency is important and how to do it, folks seem to pick up on it pretty quickly, but not everyone has trained their intuition to where they automatically notice these sorts of failure modes.

[0] I don't mean this as a slight against those developers--the issues that arise from distributed systems are both myriad and subtle, and if you've spent your time learning how to make beautiful web pages or cool video games or efficient embedded systems, it seems reasonable to not know anything about the accursed problems of hypothetical Byzantine Generals. Or maybe you're fresh out of a bootcamp or an undergraduate program and haven't yet been trained to expect computers to always and constantly fail in every possible way.


Because both of these "solutions" are not part of the delivery mechanism but part of your problem space. So the delivery system is not guaranteeing even a fake exactly-once delivery; it's your usage that makes it a fake exactly-once. What's more, both of these solutions are very hard in practice. Idempotency can only be applied in special circumstances when you can design for it. A "prepare an order" message, for example, can't be idempotent: it has side effects and will prepare a new order every time you receive the message. So you go the deduplication route by considering the OrderID, but if you have several workers that process these messages, how do you handle deduplication? If the first worker never acked the processing, do you deliver it to a new worker in the queue? How does the new worker know whether someone else is processing the same OrderID? A central database? You are only kicking the can down the road...
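
For what it's worth, one common (if partial) answer to the multi-worker question is to let the database arbitrate: each worker tries to claim the OrderID under a unique constraint, and only one INSERT can succeed. The table name below is invented, sqlite3 stands in for whatever central store you'd use, and the sketch glosses over the winning worker dying mid-processing, which is exactly the hard part being pointed at.

    # Sketch: claim an OrderID via a unique constraint so at most one worker wins.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_orders (order_id TEXT PRIMARY KEY)")

    def try_claim(order_id):
        try:
            with conn:   # commits the claim, or rolls it back on error
                conn.execute("INSERT INTO processed_orders VALUES (?)", (order_id,))
            return True
        except sqlite3.IntegrityError:
            return False    # another worker already claimed this order

    if try_claim("order-42"):
        pass                         # prepare the order here
    print(try_claim("order-42"))     # False: the duplicate delivery is dropped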


It can be very hard to get idempotency right.

It can get way harder when your initial design made incorrect assumptions about the delivery semantics you were using, so you didn't know you'd need it.

Edit for example:

Someone could have a low-latency problem that seems like it could be a fit for a streaming application. They could look at docs and see "ooh, with Flink I can do exactly-once writes to Kafka" in one place, and choose to use that. But if they don't dig deeply into what that means, they may miss the latency impacts of having to checkpoint every time to commit a set of writes to Kafka. And by the time they figure this out, managing both "low latency" and "exactly once" in the code they wrote might be a really hairy problem.


The distinction is how you design. You don't need idempotence with a mythical "exactly once" system. Conversely, when you're debugging a system built on top of "at least once", you need to keep that property in mind in case the bug you're tracking down is lost idempotence.


Because idempotence can be very hard to achieve. You usually can't just write the message ID to a DB and ignore messages with a matching ID because if you crash while processing then you need to start over again. But you can't just write it at the end because then all of your processing steps need to be idempotent (so why are you bothering to write the ID?).

I've seen very few systems that have general idempotency baked in. Often it ends up being specific to the application. In some cases you can have simple solutions, like reloading all of the state from an authoritative source after a crash. In some cases your messages result in simple idempotent operations such as "insert message with a unique ID" or "mark a message with a unique ID as read", but even then these are quite tied to the business logic.

Basically idempotency is a powerful tool to create a solution but it is no silver bullet. That is why it is important to understand the underlying problem.
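
To make the "write the ID at the start vs. at the end" tension concrete, here is a sketch of the narrow case where it does work out: the dedup record and the state change live in the same database and commit (or vanish) together. The tables and the balance example are invented for illustration; once side effects leave your database, this stops being enough.

    # Sketch: record the message ID and the state change in one transaction.
    # A crash before commit loses both (the message gets redelivered); a
    # redelivery after commit hits the primary key and is ignored.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE handled (msg_id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER)")
    db.execute("INSERT INTO balances VALUES ('alice', 100)")

    def process(msg_id, account, delta):
        try:
            with db:   # one atomic unit: dedup record + business state
                db.execute("INSERT INTO handled VALUES (?)", (msg_id,))
                db.execute("UPDATE balances SET amount = amount + ? WHERE account = ?",
                           (delta, account))
        except sqlite3.IntegrityError:
            pass       # already handled; safe to ack the redelivered message

    process("m-7", "alice", 25)
    process("m-7", "alice", 25)   # duplicate delivery: balance is still 125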


reading your comment, it dawned on me; there is a way to theoretically ensure exactly-once delivery.

1. buy plane ticket

2. bring box to recipient

3. plug in Ethernet & send message

keep an eye out for our IPO


That's at-most-once.


I think we need to keep the concepts separate because otherwise people get confused. You cannot guarantee that a message is received exactly once. Yes, it's not that hard, if you know this is an issue, to build a system where receiving the same message more than once won't cause a bad thing to happen. There are a few principled ways to do this, and some less principled ways that will still mostly work.

But that's not because you built a system that successfully delivers messages exactly once... you built a system that successfully processes messages exactly once, even if delivery occurs multiple times. The delivery still occurred multiple times. Even if your processing layer handled it, that may have other consequences worth understanding. Wrapping that up in a library may present a nice API for some programmer, but it doesn't solve the Byzantine Generals problem.

Whenever someone insists they can build Exactly Once with [mumble mumble mumble great tech here] I guarantee you there's a non-empty set of human readers coming away with the idea they can successfully create systems based on exactly-once delivery. After all, I built some code based on exactly-once delivery last night and it's working fine on my home ethernet even after I push billions of messages through it.

We're really better off pushing "There is no such thing as Exactly Once, and the way you deal with it is [idempotence/id tracking/whatever]", not "Yes there is such a thing as Exactly Once delivery (see fine print about how I'm redefining this term)". The former produces more accurate models in human brains about what is going on and is more likely to be understood as a set of engineering tradeoffs. The latter seems to produce a lot of confusion and people not understanding that their "Exactly Once" solution isn't a magic total solution to the problem, but is in fact a particular point on the engineering tradeoff spectrum. In particular, "exactly once" solutions can be the wrong choice for certain problems, like multiplayer game state updates, where it may be a lot more viable to think in terms of 1-or-0 delivery, some timestamping, and the ability to miss messages entirely and recover, rather than building an "exactly once" system.


> But that's not because you built a system that successfully delivers messages exactly once... you build a system that successfully processes messages exactly once, even if delivery occurs multiple times.

I think the difference might be partly semantic. If processing at the messaging level is idempotent + at least once, then message delivery to the application level is exactly once. People mostly only care about the application level not the lower levels where they might just build on a library or system that handles that logic for them.


I'd say it's entirely semantic. I'm very much arguing for where to draw the definition lines in the terms we use. It won't change the code one bit (give or take a few different names on things). I definitely think understanding carefully the issues involved in delivery, and understanding the various solutions to that problem, is the way to go, not to blur the questions of delivery and handling into one atomic element. They're not atomic.

Alternatively we could come up with names for all the other combinations of delivery mechanism and handling mechanism, but since you can easily see we hit an NxM-type problem on that, this may well help elucidate why I think it's a bad idea to try to combine the two into one term. It visibly impairs people's ability to think about this topic clearly.


Well, my argument for erasing that line is that you generally don't care about TCP packets or SSL handshakes and such, so why is this one property relevant if it can be punted to a lower layer just like those others?

I'll grant that it matters if you're trying to debug some problem and trying to find at what layer it failed, but it's basically the same process you use to debug all of those other layers too, so I'm not sure why this layer deserves special consideration.


AFAIK the point of exactly once delivery, in the context of message passing, is to abstract delivery concerns away from the application layer and into the messaging layer, so that the application can depend on the exactly-once semantics without having to write logic for it.

The problem with this is similar to the problems with two-phase commit in distributed databases: there are unavoidable failure cases. Most of the time it works just fine, but if you write your application to depend on this impossible feature, and it fails - which, given enough time, will certainly happen - then cleaning up the mess can be much more effort (and have much wider business implications) than simply dealing with the undesirable behaviour of reality in the first place.

Or to put it another way: exactly-once semantics can never be reliably abstracted away from the application, so if you need it, it needs to be part of your application.


This is called "Effectively Once".


(first heard it coined by Viktor Klang)


Theoretically true, and easy to say. But the hard part is actually implementing this in the context of business problems. What if you need to call external services that you don't control, and they don't provide idempotence? Like sending emails. Or worse: you send a message to a warehouse to deliver an item, and they deliver duplicates...


Yeah, the duplicate email thing is a classic problem, but I'm not sure it's one of "idempotence". This can happen in any (intended to be) transactional operation that creates a side effect.

Hit an error, roll back, the side effect can't be rolled back. Retry - the side effect happens again.

Wouldn't the general approach be to have unique message identifiers and queue side effects? Maybe I'm missing lots of subtleties.


Email is absolutely something that requires idempotence to avoid sending duplicates. Even if your code is perfect and you don't send emails until after you commit your transaction, the actual HTTP request to the email provider could fail in a way where the email is sent but your system can't tell.

Idempotency (either via a token in the request, or another API to check the result of the previous request) is required to prevent duplicates. And this requires the third party service to support idempotency; there's nothing you can do on your side to enable it if their service doesn't support it.
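
As a sketch of what the request-side token can look like: generate the key once, persist it, and reuse the same key on every retry so the provider can drop duplicates even when the original response was lost. The endpoint URL below is a placeholder and the "Idempotency-Key" header is only a common convention; whether and how a given provider supports it varies.

    # Sketch: retry-safe "send email" call with a client-generated idempotency key.
    import uuid
    import requests

    def send_email(to, subject, body, idem_key=None):
        key = idem_key or str(uuid.uuid4())   # generate once, reuse on every retry
        resp = requests.post(
            "https://mail.example.invalid/v1/send",   # placeholder endpoint
            json={"to": to, "subject": subject, "body": body},
            headers={"Idempotency-Key": key},         # convention, not universal
            timeout=10,
        )
        resp.raise_for_status()
        return key   # caller should persist the key (ideally before the call) for retries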


What if your system is the one actually sending the emails (i.e., you are the third party in this scenario)?


It's not "equal".

If you guarantee "exactly once", you design your systems differently than with "at least once with idempotence". A system designed for exactly once will be less complicated than a system designed for at least once + idempotence, which is why it is ideal but impossible.


Until the bill comes anyway. Having to provision extra bandwidth for useless dups, extra processing power for useless updates, etc.


So, the opposite of exactly once


With idempotence, you shift the problem from "deliver X exactly once" to "make it seem like X was delivered exactly once". In most systems, exactly-once is really "effectively exactly once".


That's my point. You are simply converting the problem to a new form, not actually solving it.

Hey here's a solution to the halting problem – always assume yes, and then figure out the edge cases. How do you do that? Well that's on you, I did my job.

In a distributed system that needs exactly-once delivery, implementing perfect idempotence is equally impossible.


Converting a problem to a new form that you know how to better solve, or at least hope is more tractable, is a time-honored mathematical and CS tradition.


Idempotency - famously complex. No one has ever successfully implemented it, great point.


If you don't think idempotency can be complex then you haven't really worked on serious distributed computing problems.


If you don't think your analogy is a miss then you haven't really read any serious literature.


It can be exactly once at the application level just not exactly once at the more fine-grained message level. The fact that it's not exactly once at that lower level doesn't really matter, the semantics at the application level is what we care about.


Exactly. In practice there are probably a bunch of other things happening over the wire that we also don't care about: handshakes and keepalives and negotiation and splitting and encryption and latency measuring and backpressure... It doesn't matter; in a variety of systems, at the application layer it is fine for the user to assume they will see a delivery to their code exactly once, and that's what the user cares about. A delivery doesn't mean some internal bytes travelled across the wire; it means your client code received a call.

That's why if you search for exactly-once delivery you'll see a bunch of products claiming to have it (e.g. Kafka).


Not exactly. If you have a business problem where you’re thinking “But I really, really need the effect of exactly-once; what can I do?”, GP’s post has the answer.


OP's idea should be

idempotence + at least once

idempotence isn't necessarily commutative.


No, if your datastore is online (the only way you're functioning anyway), store an idempotency key, vector clock, etc. with transactional semantics.

In active / active setups, there are other strategies such as partitioning and consensus.


Good thing most things in the real world are idempotent then!


This!


How do you reconcile this with the importance of crosstalk and contention as espoused by the USL? https://wso2.com/blog/research/scalability-modeling-using-un...


Not OP, but the HyperDex paper is interesting along these lines: https://www.cs.cornell.edu/people/egs/papers/hyperdex-sigcom...

The central thesis is that at some point, scaling tasks across multiple machines actually increases contention. Having every request hit every machine runs that one query really fast, since it's using the resources of a large number of machines, but running a lot of queries in parallel produces long queues and lots of network traffic.

In a way this is sort of what microservices do explicitly, but you can partition the data implicitly into hyperdimensional spaces so that queries only hit certain shards in the cluster. If there are shards that are particularly loaded up, you can increase the resources of those particular shards.

I think you could probably do the same thing in a lot of databases that use sharding, but the paper does a good job of outlining the issue and the tension between one fast request and good aggregate throughput. And this was 2012, which was really before NoSQL caught fire, and maybe before the maturity of some of those NoSQL systems around sharding, etc.
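
A toy version of the partitioning idea (HyperDex's actual hyperspace hashing maps several attributes into a multidimensional space, which this does not attempt): route each key to one shard by hashing, so a point lookup touches a single machine rather than fanning out to every node.

    # Toy single-attribute shard routing, just to show why a point query
    # doesn't have to touch every node. Shard count is arbitrary.
    import hashlib

    NUM_SHARDS = 8

    def shard_for(key: str) -> int:
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_SHARDS

    print(shard_for("user:123"))   # only this one shard has to serve the lookup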


The universal scalability law has this term κp(p-1), which suggests that interactions are O(p^2) for systems of p components, and those interactions happen at some relative frequency κ. Right?
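
For concreteness, this is the usual form of the USL with the contention (σ) and coherency (κ) terms spelled out; the parameter values below are made up purely to show the shape of the curve.

    # Universal Scalability Law: X(p) = lam*p / (1 + sigma*(p-1) + kappa*p*(p-1))
    # lam = single-node throughput, sigma = contention, kappa = coherency/crosstalk.
    # The kappa*p*(p-1) term is the O(p^2) interaction cost in question.
    def usl(p, lam=1000.0, sigma=0.05, kappa=0.001):
        return lam * p / (1 + sigma * (p - 1) + kappa * p * (p - 1))

    for p in (1, 8, 32, 64, 128):
        print(p, round(usl(p)))   # throughput peaks and then retrogrades as p grows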

So why might we expect coherency to cost O(p^2)? It sure does if the data that we're trying to keep coherent is everywhere, but that's not true in typical sharded systems. It also does if we're trying to do something like atomic commitment across all nodes, but again it's not clear that's what databases actually do. It also raises the question of how we calculate κ, and how the work we're trying to get done translates into a particular value of κ.

As a conceptual model, I quite like the USL. But it doesn't seem as universal as it claims, and I haven't read anything that helps with parameter selection.

So instead we can take a step back and pick another parameter (call it ⍺), which is a random variable whose distribution is the number of shards that a database needs to coordinate over to meet its consistency and isolation goals. Then, for N requests, total work done is proportional to E[⍺] * N. Why might we believe E[⍺] is O(N) too? It could be if we're trying to be serializable and most transactions go 'everywhere'. On the other hand, with key-value accesses or weak consistency E[⍺] could be O(1). With index structures, it could be O(log N). Or whatever.

Anyway, I'm not sure that makes sense, but it does seem like the USL makes some untested assumptions about the ways systems coordinate, which makes it less useful.

