An optimal algorithm for bounded random integers (github.com/apple)
244 points by smasher164 on Sept 2, 2021 | 41 comments


Note that this is equivalent to computing `upperbound * randbits(nbits + 64) >> (nbits + 64)` combined with an early-exit optimization, albeit one that becomes less effective as `upperbound` increases: if `upperbound = 2^nbits - 1`, the probability of using the early exit is basically zero. (As the author points out, Lemire's optimization may be used to improve this exit test).
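
Concretely, for the 64-bit word case, a rough Swift sketch of this view (my own code and names, not the stdlib's):

    // upperBound * 0.r0r1r2... in fixed point; `hi` is the integer part after
    // one word of fraction bits, `lo` is the leading fractional word.
    func boundedRandom<R: RandomNumberGenerator>(
        below upperBound: UInt64, using rng: inout R
    ) -> UInt64 {
        precondition(upperBound > 0)
        let (hi, lo) = upperBound.multipliedFullWidth(by: rng.next())
        // Early exit: all unread bits add strictly less than upperBound to
        // `lo`, so if that can't carry into `hi`, the answer is already final.
        if lo <= (0 &- upperBound) { return hi }
        // Otherwise fold in 64 more bits and stop unconditionally; the chance
        // that the carry is still undecided after this is about 2^-64.
        let correction = upperBound.multipliedFullWidth(by: rng.next()).high
        return lo.addingReportingOverflow(correction).overflow ? hi &+ 1 : hi
    }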

The early exit optimization, however, may leak information about the random output, since it tells someone with access to timing information whether one or two generator outputs were consumed. With standard rejection sampling, the entire sample is discarded if a bias would be produced, so the output is independent of the number of samples. Here, however, certain outputs cannot be produced by the early exit (for sufficiently large upper bounds), meaning that an attacker could narrow down the random number if the algorithm exited early.

I think that, to avoid this timing attack vector, the early exit would need to be eliminated - in which case the algorithm would perform no better (and usually worse) than Lemire's approach in terms of expected calls to the random number generator.


Right, this is a generator that's used in contexts where a timing side-channel is not a concern, but it's still a great distinction to point out.

More broadly, the Swift language--like almost every other language--doesn't have any notion of constant-time operation, so it's not clear what you would do with an RNG free of timing side channels if you had one; use it in a computation that can't make any guarantees about side channels?*

You're correct that the number of expected calls collapses to 2 (matching the worst case for Lemire) even in that case, but (for our purposes) it's a "better" 2 (at least with most random sources), since (a) it still never divides, (b) it is branch-free, so it never mispredicts, and (c) it's always 2, instead of 1 with probability 1/2, 2 with probability 1/4, 3 with probability 1/8, and so on.

* I am generally very (probably too) negative on “best effort” constant time operations in high-level languages, but having seen innumerable schemes undone by compiler optimizations, I’m jaded. If you need to eliminate side channels, use hardware designed to do so.

[All that said, I added a note to the PR to make this clear.]


Cryptographic operations come to mind. I could see someone reaching for this function to get a uniform value below some prime, for example, which might be reasonable if it were backed by a good CSPRNG for the underlying bits - and a good CSPRNG might well be slow enough to make this timing channel easily observable.

Some of the good modern crypto libraries make fairly strong guarantees about constant-time or at least data-independent operations. Granted, these may not be 100% side-channel-free thanks to processor bugs, but at least those are fairly time-intensive to attack. In any case, the concern isn’t the variable number of calls, but the fact that the number of calls could leak information about the results.

EDIT: Well, two of the places your PR has been linked to are RustCrypto (https://github.com/RustCrypto/utils/issues/453) and OpenSSL (https://github.com/openssl/openssl/issues/16500), two places where I definitely would not want to see a significant timing vulnerability. I wish that the folks posting these to crypto libraries would be more aware that this timing issue exists!

To concretely demonstrate the problem: suppose that the 64-bit upper bound is `2^64 * 3/4 = 13835058055282163712`. In this case, if the early exit is triggered, the result modulo 3 must be 0 or 2 (with 50% probability for each): numbers congruent to 1 mod 3 cannot come out of the early exit. By contrast, Lemire's method produces a uniform distribution of outputs at every iteration. Lemire's method has a smaller timing leak in that the division doesn't always happen - and since divisions are expensive, this could be significant, but it won't outweigh the cost of an unbuffered read of urandom, for example.
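
Here's a quick Swift sketch of that (my code, using the exact "no carry possible" test; the PR's test may differ in detail, but the effect is the same): tally early-exit outputs mod 3 and the residue-1 bucket stays empty.

    let upperBound: UInt64 = 3 << 62   // 13835058055282163712, i.e. 3/4 of 2^64
    var counts = [0, 0, 0]
    var earlyExits = 0
    var rng = SystemRandomNumberGenerator()
    while earlyExits < 1_000_000 {
        let r = rng.next()             // one 64-bit word from the generator
        let (hi, lo) = upperBound.multipliedFullWidth(by: r)
        // Take the early exit only when the low half can't possibly carry.
        guard lo <= (0 &- upperBound) else { continue }
        earlyExits += 1
        counts[Int(hi % 3)] += 1
    }
    print(counts)   // counts[1] stays 0; counts[0] and counts[2] split ~50/50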


Having worked on implementations of side-channel resistant crypto libraries, processor bugs mostly don’t scare me, but uArch and compiler optimizations very much do.


Hi, author here. Note this is not really a "new idea"; in a very reasonable sense it's simply the most obvious thing possible. I do think that it's likely the first mainstream "productized" implementation, however (though I would be interested to hear of other places that it's been tried).


You may be interested in this work going in the opposite direction, holding the bias constant and extracting the most ranges from a constant amount of randomness:

https://github.com/sipa/writeups/tree/main/uniform-range-ext...

This work is interesting because generating random bits via strong generators is typically much more expensive than extraction -- so it can be very useful in cases where many small random values are needed in a tight loop, such as for some kinds of hash tables or for generating permutations.
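
Not sipa's actual construction, but a toy Swift sketch of the general "extraction" idea (names are mine): reuse the leftover fraction bits for the next bounded draw, accepting a small bias instead of going back to the generator. The bias grows with the product of the bounds, so it only makes sense while that product stays far below 2^64.

    func extractAll(bounds: [UInt64], from word: UInt64) -> [UInt64] {
        var fraction = word
        return bounds.map { bound in
            let (value, rest) = bound.multipliedFullWidth(by: fraction)
            fraction = rest            // leftover bits feed the next draw
            return value
        }
    }

    // e.g. three values below 6 and one below 2 from a single generator call
    var rng = SystemRandomNumberGenerator()
    let draws = extractAll(bounds: [6, 6, 6, 2], from: rng.next())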


I'm a bit confused about this method.

So the standard argument against such a procedure is that if you generate N random bits, each of the 2^N bitstrings in {0,1}^N is equiprobable, and therefore no mapping from {0,1}^N to {0..U-1} can map an equal number of bitstrings to each output (unless U divides 2^N).

A priori the method seems to work around this by conditionally sampling more bits, so that your domain is not one of fixed-length bitstrings. But then there is this paragraph:

> More intriguing still, this algorithm can be made unconditional by removing the early out, so that every value computed requires word size + 64 bits from the stream, which breaks the loop-carried dependency for fast generators, unlocking vectorization and parallelization where it was previously impossible. This is an especially powerful advantage when paired with bitstream generators that allow skip-ahead such as counter-based generators or PCG.

But now we are mapping {0, 1}^N into {0..U-1} right? So this mapping ought not be uniform? In fact I'm not sure why the early-termination method even works, because for a fixed U we know the maximum depth of the generated-bitstring-tree, so we can just pad it with imaginary random bits to arrive at the contradiction that U does not divide 2^N.

I imagine I missed something, what is it?

EDIT: thinking about it more, if you sample finitely many bits then each possible output has a probability that is a dyadic rational (fractions of the form a/2^k) which does not include all integer reciprocals.


That's exactly right. It's not uniform, but the deviation from uniform is so small that there's no measurement you can perform that can observe the difference. If you need a formally uniform distribution for an information-theoretic proof or something similar, you wouldn't use this termination condition.


> no measurement you can perform that can observe the difference

That seems like a very particular phrasing, do you mean something specific by that?


It's in the OP:

> this new algorithm opens an intriguing possibility: we can compute just 64 extra bits, and have a probability of 1 - 2^{-64} of terminating. This is so close to certainty that we can simply stop unconditionally without introducing any measurable bias (detecting any difference would require about 2^128 samples, which is prohibitive).

Using the "just use 64 extra bits" approach, there is a 1 in 2^64 chance that the result you compute will be wrong, and correspondingly a 99.9999999999999999946% chance that the result you compute will be right. That is the amount of bias in the algorithm.


It's made clearer in the PR:

"detecting any difference [from uniform] would require about 2^128 samples, which is prohibitive"


I see -- thank you for clarifying. So for a formally uniform distribution, is rejection sampling your only option?


Not at all! You can simply compute additional product bits (you don’t need to go bit-by-bit) until the “can this possibly carry out” test (which is just a comparison) says “no”. The probability that you need more bits falls off exponentially, so this happens very quickly.

What you lose by doing this is that each value no longer consumes a bounded number of bits from the source, which is how it’s able to be “truly” uniform.
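
Sketched in Swift (just the shape of it, not the stdlib code; names are mine): keep folding in words until the carry into the integer part is decided, which makes the result exactly uniform at the cost of an unbounded, though geometrically unlikely, number of words consumed.

    func exactUniform<R: RandomNumberGenerator>(
        below upperBound: UInt64, using rng: inout R
    ) -> UInt64 {
        precondition(upperBound > 0)
        let (hi, firstLo) = upperBound.multipliedFullWidth(by: rng.next())
        var lo = firstLo
        // Undecided while the unread bits (which add strictly less than
        // upperBound at this position) could still carry into `hi`.
        while lo > (0 &- upperBound) {
            let (phi, plo) = upperBound.multipliedFullWidth(by: rng.next())
            let (sum, overflow) = lo.addingReportingOverflow(phi)
            if overflow { return hi &+ 1 }     // the carry is now certain
            if sum < UInt64.max { return hi }  // the carry is now impossible
            lo = plo                           // still exactly on the boundary
        }
        return hi
    }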


Apologies if you already knew this:

It feels as if this is still related to rejection sampling, in spite of not being quite that. If the algorithm samples some bits, and still hasn't made a decision, then it's as if all the previous samples have been "rejected". The difference seems to be that this algorithm has a memory of previously rejected samples, which causes it to be less likely to reject subsequent samples. "Adaptive rejection sampling" appears to be relevant: https://en.wikipedia.org/wiki/Rejection_sampling#Adaptive_re...


Yes, although the algorithm here just accepts a certain small amount of bias rather than rejecting. But the "corrected" version of this says that if I am rolling a 1dN, the first random byte will have N-1 different "rejection" criteria, while the next byte and every byte after will only have 1 "rejection" criterion.

Something that would be more interesting: if you had a random number primitive which could generate 1 with probability 1/2, 2 with probability 1/6, 3 with probability 1/12, and so on (p = 1/(N * (N+1))), that primitive would allow you to construct random irrational numbers uniformly sampled from [0, 1) as continued fractions. Since the rationals ℚ are a non-dense subset of ℝ, ignore them as measure-0.

Such a primitive would allow you to do a sort of rejection-sampling with only a finite number of rejections before you are guaranteed an answer, as a continued fraction for a rational has a finite expansion.


Q is a dense subset of R, but it's countable, so it has measure 0.


Nit: the rationals are dense in R.


It helps to work out an example, maybe.

So let's try to roll a 1d5. We generate a real number p uniformly in [0, 1). In binary (base-2) notation we should return

    1, if p < 0.0011 0011 0011 0011...
    2, if p < 0.0110 0110 0110 0110...
    3, if p < 0.1001 1001 1001 1001...
    4, if p < 0.1100 1100 1100 1100...
    5, otherwise
These repetitions are conveniently 4 bits long and could be written in hexadecimal, though of course they don't have to be (a 1d7 repeats after 3 bits, a 1d11 repeats after 10 bits).

Well, there's a problem. Our computers don't work with infinitely precise binary expansions like this; instead we generate a finite stream of bits. But suppose we generate that stream in "chunks". Let's say we generate one byte b at a time; then, for the first byte,

    if b  < 0x33, return 1.
    if b == 0x33, we'll need another byte b2 to decide.
        if b2 < 0x33, return 1.
        if b2 == 0x33, repeat with a new b2.
        if b2 > 0x33, return 2.
    if b  < 0x66, return 2.
    if b == 0x66, we'll need another byte b2 to decide.
        if b2 < 0x66, return 2.
        if b2 == 0x66, repeat with a new b2.
        if b2 > 0x66, return 3.
    if b  < 0x99, return 3.
    [ if b == 0x99, we'll need another byte ]
    if b  < 0xCC, return 4.
    [ if b == 0xCC, we'll need another byte ]
    otherwise, return 5.
You can kind of call this rejection sampling, but notice that the post-rejection loop has different bounds than the non-rejection loop: it only has a 1-in-256 chance of repeating, not the 1-in-64 chance of rejection sampling.
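
For what it's worth, here's the same walk-through transcribed into Swift (my code and names):

    // Roll a 1d5 by comparing a byte-at-a-time uniform p in [0, 1) against
    // the boundaries k/5, whose base-256 expansions are the constant bytes
    // 0x33, 0x66, 0x99, 0xCC repeating (because 256 = 1 mod 5).
    func rollD5<R: RandomNumberGenerator>(using rng: inout R) -> Int {
        let thresholds: [UInt8] = [0x33, 0x66, 0x99, 0xCC]
        let b = UInt8.random(in: 0 ... 255, using: &rng)
        for (k, t) in thresholds.enumerated() {
            if b < t { return k + 1 }
            if b == t {
                // Landed exactly on a boundary byte: every later byte of the
                // boundary k/5 is that same byte, so draw until one differs.
                while true {
                    let b2 = UInt8.random(in: 0 ... 255, using: &rng)
                    if b2 < t { return k + 1 }
                    if b2 > t { return k + 2 }
                }
            }
        }
        return 5
    }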

I want to say that the probability of observing something hinky using K samples of a 1dN die with 2^B bits is not as simple as sqrt(K) = 2^B but something more pessimistic like maybe K = 2^(B+1) / N^4 or so? So if you're generating histograms with 2^32 iterations of a million-sided die you actually don't have a huge safety margin here: 2^130 / 2^112 is only one in 2^18 or so, so one in 250,000 runs could maybe correctly detect the bias. But it's still presumably "good enough" for all that.


Thanks for this explanation, it made it click for me. The benefit of doing far less additional sampling in the rejection case is very intuitive, especially when the numbers are very large. It also seems this technique could be extended to other scenarios, such as 1/pi and other weird non-uniform bucket sizes.


You have to read that “more intriguing still” paragraph as being in the same scope as the preceding one, which says

> we can compute just 64 extra bits, and have a probability of 1 - 2⁻⁶⁴ of terminating. This is so close to certainty that we can simply stop unconditionally without introducing any measurable bias (detecting any difference would require about 2¹²⁸ samples, which is prohibitive)

So, it’s about a variation on the unbiased algorithm that is slightly biased, but gains some other desirable properties.


I did the work that Lemire links there.

It started with realizing that Lemire's rejection method becomes more likely to need a division/modulo as the range maxes out (the gate on division is fraction < range, so as range -> maxint the division case approaches 100%, if the fraction and range have the same precision). If the range is 32 bit but the fraction is 64, the division case is 2^32 times less likely, so I worked out Lemire's method for 32.64 fixed point.

That turned out to be similar to the infinite-precision case described here, as we only need 32 bits of fraction if it's far enough from the edges, i.e. nothing can carry into the integer part if we extend to 64.

Once I worked out when to extend to 64 bits in the rejection method, I realized that the infinite-precision case is similar, using a loop instead of a conditional. It's simpler since there's no rejection, just a decision to continue as long as it's possible to carry.
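
In rough Swift, the "32.64" rejection variant looks something like this (a sketch of the idea, not the linked code; names are mine):

    // A 32-bit range with a 64-bit fraction: the slow path that needs a
    // modulo only triggers when the low 64 bits of the product land below
    // the range, i.e. with probability at most 2^-32.
    func bounded32<R: RandomNumberGenerator>(
        below range: UInt32, using rng: inout R
    ) -> UInt32 {
        precondition(range > 0)
        let r = UInt64(range)
        var product = r.multipliedFullWidth(by: rng.next())
        if product.low < r {
            let threshold = (0 &- r) % r        // 2^64 mod range
            while product.low < threshold {     // true rejection: resample
                product = r.multipliedFullWidth(by: rng.next())
            }
        }
        return UInt32(product.high)
    }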


This is essentially arithmetic encoding to do base conversion from binary to n-ary. Once you have enough bits, you're almost certainly not on the boundary of a bucket. Funny how the same thing seems to have been reinvented...


I don’t think that’s fair. The article clearly knows about prior art, and ‘only’ claims:

“This PR introduces a novel algorithm that:

- never divides

- avoids rejection sampling entirely

- achieves a theoretically optimal bound on the amount of randomness consumed to generate a sample

- delivers actual performance improvements for most real cases”


I don't mean it as a negative!

Except perhaps I could have understood what they did faster if they had described it as arithmetic encoding.


That's exactly right (in fact, thinking about AC is what prompted me to implement it).


This PR should be submitted, as is, to a peer-reviewed journal and published, again as is. It would be great if the DOI resolved to the PR itself!

I don't mean this as a joke, I mean this as a way to contribute (slightly humorously, or perhaps even not) to upending an outdated, yet valuable, way of disseminating scientific research.


What scientific research values most is references and being able to read them.

What tech values least is the ability to access two-year-old blog posts.


My concern is that pull requests are mutable: the thread and code can be edited or deleted by the author or repo owner, or can even disappear entirely if the author gets banned by GitHub (like github.com/volth, whose account got shadowbanned after it changed from a real person to a bot account used by archive.is to scrape GitHub).


The author himself has stated upthread that this is not novel.


Papers are hardly outdated. But sure, having a reference implementation to go with it is good.


Lemire (cited in the article) is also the author of some work in fast integer division by a constant.


Lemire is not only cited in the article; he also responded to it on the GitHub page.


You mean the Division by Invariant Multiplication? It feels very related.


The insight here is more the application of fast integer division by a constant (to generating random numbers) than work "in" it.


Because of the branch name they used for the PR, I thought it was going to be a joke PR, but the mathematics is outside what I understand well enough to judge whether it's legit. From the way it reads, though, it sounds honest.

So assuming that it is legit, it sounds really intriguing especially because of the claim they make that it “breaks the loop-carried dependency for fast generators, unlocking vectorization and parallelization where it was previously impossible”.


If I understand correctly, the reason for that claim is that they know in advance how many bits they will need to essentially guarantee a good sample, so instead of having to perform one operation first to find out where the next one's bits start, they can just chunk the stream up and process it in parallel.
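
With the unconditional (slightly biased) variant, sample i depends only on words 2i and 2i+1 of the bit stream, so given a generator that can skip ahead, the samples can be computed independently. A hypothetical Swift illustration (names are mine):

    func sample(_ i: Int, below upperBound: UInt64, words: [UInt64]) -> UInt64 {
        // Each sample consumes exactly two 64-bit words at a fixed offset.
        let (hi, lo) = upperBound.multipliedFullWidth(by: words[2 * i])
        let correction = upperBound.multipliedFullWidth(by: words[2 * i + 1]).high
        return lo.addingReportingOverflow(correction).overflow ? hi &+ 1 : hi
    }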


The author is a vector & numerics expert at Apple.


It's legit.



Are there any performance benchmarks?


wow



