On the optimal bounds for integer division by constants
It is well-known that integer division is quite a heavy operation on modern CPUs - so slow, in fact, that it has become common wisdom to avoid it at all costs in performance-critical sections of a program. I do not know why division is particularly hard to optimize from the hardware perspective. I am just guessing: maybe (1) every general algorithm is essentially just a minor variation of the good old long division, and (2) long division is almost impossible to parallelize. But whatever, that is not the topic of this post.
Rather, this post is about some common optimization techniques for circumventing integer division. More precisely, the post can be roughly divided into two parts. The first part discusses the well-known Granlund-Montgomery-style multiply-and-shift technique and some associated issues. The second part is about my recent research on a related multiply-add-and-shift technique, for which I establish an optimal bound that is perhaps a novel result. It would be intriguing to see whether modern compilers can take advantage of this new bound.
Note that by integer division, we specifically mean the computation of the quotient and/or the remainder, rather than evaluating the result as a real number. More specifically, throughout this post, integer division will always mean taking the quotient unless specified otherwise. Also, I will confine myself to divisions of positive integers, even though divisions of negative integers hold practical significance as well. Finally, all assembly code provided herein is for the x86-64 architecture.
Turning an integer division into a multiply-and-shift
One of the most widely used techniques is converting a division into a multiplication when the divisor is a known constant (or remains mostly unchanged). The idea behind this approach is quite simple. For instance, dividing by $4$ is equivalent to multiplying by $0.25$, which can be further represented as multiplying by $25$ and then dividing by $100$, where dividing by $100$ is simply a matter of moving the decimal point to the left by two positions. Since we are only interested in taking the quotient, this means throwing away the last two digits.
In this particular instance, $4$ is a divisor of $100$, so we indeed have a fairly concise representation, but in general the divisor might not divide a power of $10$. Let us take $7$ as an example. In this case, we cannot write $\frac{1}{7}$ as $\frac{m}{10^k}$ for some positive integers $m,k$. However, we can still come up with a good approximation. Note that
$$\frac{1}{7}=0.142857142857\cdots,\tag{1}$$
so presumably something like $\frac{142858}{1000000}$ would be a good enough approximation of $\frac{1}{7}$. Taking that as our approximation, we may want to compute $n/7$ by multiplying $n$ by $142858$ and then throwing away the last $6$ digits. (Note that we are taking $142858$ instead of $142857$ because the latter already fails when $n=7$. In general, we must take the ceiling, not the floor nor half-up rounding.)
This indeed gives the right answer for all $n=1,\cdots,166668$, but it starts to produce a wrong answer at $n=166669$, where the correct answer is $23809$, whereas our method produces $23810$. And of course such a failure is expected. Given that we are using an approximation with nonzero error, it is inevitable that the error will eventually manifest as the dividend $n$ grows large enough. But the question at hand is: can we estimate how far it will go? Or, how do we choose a good enough approximation guaranteed to work correctly when there is a given limit on how big our $n$ can be?
Obviously, our intention is to implement this concept in a computer program, so the denominators of the approximations will be powers of $2$, not $10$. So, for a given positive integer $d$, our goal is to find a good enough approximation of $\frac{1}{d}$ of the form $\frac{m}{2^k}$. Perhaps one of the most widely known formal results in this realm is the following theorem by Granlund-Montgomery:
Theorem 1 (Granlund-Montgomery, 1994).
Suppose $m$, $d$, $k$ are nonnegative integers such that $d\neq 0$ and
$$2^{N+k}\leq md\leq 2^{N+k}+2^k.\tag{2}$$
Then $\lfloor n/d\rfloor=\lfloor mn/2^{N+k}\rfloor$ for every integer $n$ with $0\leq n<2^N$.
Here, $d$ is the given divisor and we are supposed to approximate $\frac{1}{d}$ by $\frac{m}{2^{N+k}}$. An assumption here is that we want to perform the division $n/d$ for all $n$ from $0$ to $2^N-1$, where $N$ is supposed to be the bit-width of the integer data type under consideration. In this setting, the theorem establishes a sufficient condition under which we can calculate the quotient of $n/d$ by first multiplying $n$ by $m$ and then right-shifting the result by $(N+k)$ bits. Note that we will need more than $N$ bits; since our dividend is of $N$ bits, the result of the multiplication $mn$ will need to be stored in $2N$ bits. Since we are shifting by $(N+k)$ bits, the lower half of the result is actually not needed, and we just take the upper half and shift it by $k$ bits.
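To make this concrete, here is what the recipe looks like for a 32-bit dividend (a minimal sketch of my own, not taken from a compiler; the constants are chosen for $d=17$ with $N=32$ and $k=4$, so that the condition of the theorem holds):

```cpp
#include <cstdint>

// floor(n / 17) for 32-bit n, computed as (n * m) >> (N + k) with
// N = 32, k = 4, and m = ceil(2^(32+4) / 17) = 4042322161; one can
// check 2^36 <= 17 * m = 2^36 + 1 <= 2^36 + 2^4, as Theorem 1 requires.
std::uint32_t div17(std::uint32_t n) {
    // The full product mn needs 2N = 64 bits; shifting by N + k = 36
    // bits amounts to taking the upper half and shifting k more bits.
    return static_cast<std::uint32_t>((std::uint64_t(n) * 4042322161u) >> 36);
}
```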
We will not talk about the proof of this theorem because a much more general theorem will be presented in the next section. But let us see how results like this are being applied in the wild. For example, consider the following C++ code:
std::uint64_t div(std::uint64_t n) noexcept {
return n / 17;
}
Modern compilers are well-aware of these Granlund-Montgomery-style division tricks. Indeed, my compiler (clang) adeptly harnessed such tactics, as demonstrated by its translation of the code above into the following lines of assembly instructions:
div(unsigned long):
mov rax, rdi
movabs rcx, -1085102592571150095
mul rcx
mov rax, rdx
shr rax, 4
ret
Note that we have $N=64$ and $d=17$ in this case, and the magic constant $-1085102592571150095$ you can see in the second line, interpreted as an unsigned value $17361641481138401521$, is the $m$ appearing in the theorem. You can also see in the fifth line that $k$ in this case is equal to $4$. What the assembly code is doing is as follows:
- Load the 64-bit input $n$ into the 64-bit `rax` register.
- Load the 64-bit constant $17361641481138401521$ into the 64-bit `rcx` register.
- Multiply them together; the result is of 128 bits. The lower half of this is stored back into `rax`, and the upper half is stored in the `rdx` register. But we do not care about the lower half, so throw it away and copy the upper half back into `rax`.
- Shift the result to the right by 4 bits. The result must be equal to $\lfloor n/17\rfloor$.
We can easily check that these $m$ and $k$ chosen by the compiler indeed satisfy the inequalities in the theorem:
$$295147905179352825856=2^{64+4}\leq md=17361641481138401521\cdot 17=295147905179352825857\leq 2^{64+4}+2^4=295147905179352825872.$$
We have not yet seen how to actually find $m$ and $k$ using a result like the theorem above, but before asking such a question, let us instead ask: "is this condition on $m$ and $k$ the best possible one?"
And the answer is: No.
Here is an example: take $N=32$ and $d=102807$. In this case, the smallest $k$ that allows an integer $m$ satisfying
$$2^{N+k}\leq md\leq 2^{N+k}+2^k\tag{3}$$
to exist is $k=17$, and in that case the unique $m$ satisfying the above is $5475793997$. This is kind of unfortunate, because the magic constant $m=5475793997$ is of $33$ bits, so the computation of $nm$ cannot be done inside $64$ bits. However, it turns out that we can take $k=16$ and $m=2737896999$, and still the equality $\lfloor\frac{n}{d}\rfloor=\lfloor\frac{nm}{2^{N+k}}\rfloor$ holds for all $n=0,\cdots,2^N-1$, although the above inequality is not satisfied in this case. Now, the new constant $2737896999$ is of $32$ bits, so we can do our computation inside $64$ bits. This might result in a massive difference in practice!
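In code, the difference is that the whole computation now stays within 64 bits (a small sketch using the constants just mentioned, i.e. $m=2737896999$ and a total shift of $N+k=48$ bits):

```cpp
#include <cstdint>

// floor(n / 102807) for every 32-bit n, using the 32-bit magic constant
// 2737896999 with a total shift of 48 bits; the product fits in 64 bits.
std::uint32_t div102807(std::uint32_t n) {
    return static_cast<std::uint32_t>((std::uint64_t(n) * 2737896999u) >> 48);
}
```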
It seems that even the most recent version of GCC (13.2) is still not aware of this, while clang knows that the above $m$ and $k$ work. (In the link provided, GCC is actually trying to compute $\lfloor\frac{nm}{2^{N+k}}\rfloor$ with the $33$-bit constant $m=5475793997$. A detailed explanation will be given in a later section.)
Then what is the best possible condition? I am not sure who was the first to find the optimal bound, but such a bound is at least written in the famous book Hacker's Delight by H. S. Warren Jr. Also, recently (in 2021), Lemire et al. showed the optimality of an equivalent bound. (EDIT: according to a Reddit user, there was a report written by H. S. Warren Jr. in 1992, even earlier than Granlund-Montgomery, which contained the optimal bound with the proof of optimality.)
I will not write down the optimal bounds obtained by these authors, because I will present a more general result proved by myself in the next section.
Here is one remark before getting into the next section. The aforementioned results on the optimal bound work for $n$ from the range $\{1,\cdots,n_{\max}\}$ where $n_{\max}$ is not necessarily of the form $2^N-1$. However, even recent compilers do not seem to leverage this fact. For example, let us look at the following code:
std::uint64_t div(std::uint64_t n) {
[[assume(n < 10000000000)]];
return n / 10;
}
Here we are relying on a new language feature added in C++23: `assume`. GCC generated the following lines of assembly:
div(unsigned long):
movabs rax, -3689348814741910323
mul rdi
mov rax, rdx
shr rax, 3
ret
My gripe with this is the generation of a superfluous `shr` instruction. GCC seems to think that $k$ must be at least $67$ (which is why it shifted by $3$ bits after throwing away the $64$-bit lower half), but actually $k=64$ is fine with the magic number $m=1844674407370955162$ thanks to the bound on $n$, in which case we do not need this additional shifting.
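For reference, here is what the shift-free version would look like in source form (a hedged sketch of my own of the codegen GCC could have produced; `unsigned __int128` is a GCC/clang extension standing in for the 128-bit `mul`):

```cpp
#include <cstdint>

// Per the discussion above: for n < 10'000'000'000, floor(n / 10) is
// exactly the upper 64 bits of n * 1844674407370955162, where
// 1844674407370955162 = ceil(2^64 / 10); no trailing shift is needed.
std::uint64_t div10_bounded(std::uint64_t n) {
    return static_cast<std::uint64_t>(
        (static_cast<unsigned __int128>(n) * 1844674407370955162u) >> 64);
}
```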
How about clang? Unlike GCC, clang currently does not seem to understand this new language feature `assume`. But it has an equivalent language extension `__builtin_assume`, so I tried with that:
std::uint64_t div(std::uint64_t n) {
__builtin_assume(n < 10000000000);
return n / 10;
}
div(unsigned long):
mov rax, rdi
movabs rcx, -3689348814741910323
mul rcx
mov rax, rdx
shr rax, 3
ret
And there is no big difference😥
Turning multiplication by a real number into a multiply-and-shift
Actually, during the development of Dragonbox, I was interested in the more general problem of multiplying a rational number by $n$ and then finding the integer part of the result. In other words, my problem was not just about division, but rather about a multiplication followed by a division. The presence of this multiplier certainly makes the situation a little trickier, but it is nevertheless possible to derive the optimal bound in a similar way, which leads to the following generalization of the results mentioned in the previous section. (Disclaimer: I am definitely not claiming to be the first who proved this, and I am sure an equivalent result can be found elsewhere, though I am not aware of any.)
Theorem 2 (From this paper).
Let $x$ be a real number and $n_{\max}$ a positive integer. Then for a real number $\xi$, we have the following.
- If $x=\frac{p}{q}$ is a rational number with $q\leq n_{\max}$, then we have $\lfloor nx\rfloor=\lfloor n\xi\rfloor$ for all $n=1,\cdots,n_{\max}$ if and only if $x\leq\xi<x+\frac{1}{vq}$ holds, where $v$ is the greatest integer such that $vp\equiv -1\pmod{q}$ and $v\leq n_{\max}$.
- If $x$ is either irrational or a rational number with the denominator strictly greater than $n_{\max}$, then we have $\lfloor nx\rfloor=\lfloor n\xi\rfloor$ for all $n=1,\cdots,n_{\max}$ if and only if $\frac{p_*}{q_*}\leq\xi<\frac{p^*}{q^*}$ holds, where $\frac{p_*}{q_*}$, $\frac{p^*}{q^*}$ are the best rational approximations of $x$ from below and above, respectively, with the largest denominators $q_*,q^*\leq n_{\max}$.
Note that $\lfloor nx\rfloor$ is supposed to be the one we actually want to compute, while $\xi$ is supposed to be the chosen approximation of $x$. For the special case when $x=\frac{1}{d}$, $n_{\max}=2^N-1$, and $\xi=\frac{m}{2^{N+k}}$, we obtain the setting of Granlund-Montgomery.
I said rational number at the beginning of this section, but actually our $x$ can be any real number. Note, however, that the result depends on whether $x$ is "effectively rational" or not over the domain $n=1,\cdots,n_{\max}$, i.e., whether or not there exists a multiplier $n$ which makes $nx$ an integer.
Since I was working on floating-point conversion problems when I derived this theorem, the more relevant case for me was the second one, that is, when $x$ is "effectively irrational", because the numerator and the denominator of the $x$ I was considering were some high powers of $2$ and $5$. But the first case is more relevant to the main theme of this post, i.e., integer division, so let us forget about jargon like best rational approximations for now. (Spoiler: it will show up again in the next section.)
So let us focus on the first case. First of all, note that if $p=1$, that is, when $x=\frac{1}{q}$, then $v$ has a simpler description: it is the last multiple of $q$ in the range $1,\cdots,n_{\max}+1$, minus one. If you care, you can check the aforementioned paper by Lemire et al. to see that their Theorem 1 exactly corresponds to the resulting bound. In fact, in this special case it is rather easy to see why the best bound should be something like that.
Indeed, note that having the equality
$$\lfloor nx\rfloor=\lfloor n\xi\rfloor\tag{4}$$
for all $n=1,\cdots,n_{\max}$ is equivalent to having the inequality
$$\frac{\lfloor nx\rfloor}{n}\leq\xi<\frac{\lfloor nx\rfloor+1}{n}\tag{5}$$
for all such $n$. Hence, it is enough to find the largest possible value of the left-hand side and the smallest possible value of the right-hand side. Since we are assuming that the denominator of $x$ is bounded by $n_{\max}$, obviously the maximum value of the left-hand side is just $x$. Thus, it is enough to find the minimum value of the right-hand side. Note that we can write
$$n=q\left\lfloor\frac{n}{q}\right\rfloor+r=q\lfloor nx\rfloor+r\tag{6}$$
where $r$ is the remainder of the division $n/q$. Rewriting $\lfloor nx\rfloor$ in terms of $n$ and $r$ using the above identity, we get
$$\frac{\lfloor nx\rfloor+1}{n}=\frac{(n-r)/q+1}{n}=\frac{n+(q-r)}{qn}=\frac{1}{q}+\frac{q-r}{qn}.\tag{7}$$
Therefore, the minimization of the left-hand side is equivalent to the minimization of $\frac{q-r}{n}$. Intuitively, it appears reasonable to believe that the minimizer $n$ must have the largest possible remainder $r=q-1$, because, for example, if $r$ were set to be $q-2$ instead, then the numerator would get doubled, necessitating a proportionally larger $n$ to achieve a diminished value of $\frac{q-r}{n}$. Also, among $n$'s with $r=q-1$, obviously the largest $n$ yields the smallest value of $\frac{q-r}{n}$, so it sounds plausible that the greatest $n$ with $r=q-1$ is the minimizer of $\frac{q-r}{n}$. Indeed, this is quite easy to prove: call such an $n$ as $v$, and suppose that there is an $n$ which is even better than $v$:
$$\frac{q-r}{n}\leq\frac{1}{v},\quad\text{i.e.,}\quad (q-r)v\leq n.\tag{8}$$
Now, since $v$ divided by $q$ has the remainder $q-1$, the left-hand side and the right-hand side have the same remainder when divided by $q$. Therefore, the difference between the two must be either zero or at least $q$. But since $v$ is the largest one with the remainder $q-1$, it must be at least $n_{\max}-q+1$, thus $n$ cannot be larger than $v$ by more than $q-1$. Thus, the only possibility is $n=v$.
When $x=\frac{p}{q}$ and $p\neq 1$, the remainder $r$ of $np/q$ depends pretty randomly on $n$, so it is somewhat harder to be intuitively convinced that the minimizer of $\frac{\lfloor nx\rfloor+1}{n}$ should still have $r=q-1$. But the same logic as above still works with a few tweaks in this case. The full proof can be found in the paper mentioned, or in one of my previous posts.
Some applications
Here I have collected some applications of the theorem presented above.
Finding the first error case
In the beginning of the previous section, I claimed that
$$\left\lfloor\frac{n}{7}\right\rfloor=\left\lfloor n\cdot\frac{142858}{1000000}\right\rfloor\tag{9}$$
holds for all $n=1,\cdots,166668$ but not for $n=166669$. Now we can see how I got this. Note that $166669\equiv 6\pmod{7}$, so the range $\{1,\cdots,166668\}$ and the range $\{1,\cdots,166669\}$ have different $v$'s: it is $166662$ for the former, while it is $166669$ for the latter. And this makes the difference, because the inequality
$$\frac{142858}{1000000}<\frac{1}{7}+\frac{1}{7v}\tag{10}$$
holds if and only if $v<166667$. Thus, $n=166669$ is the first counterexample.
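This prediction is easy to double-check by brute force (a quick verification sketch of my own, not part of the original derivation):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Find the first n with floor(n * 142858 / 1000000) != floor(n / 7).
    for (std::uint64_t n = 1;; ++n) {
        if (n * 142858 / 1000000 != n / 7) {
            // Expected to print 166669, matching the analysis above.
            std::printf("first failure at n = %llu\n",
                        static_cast<unsigned long long>(n));
            return 0;
        }
    }
}
```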
Coming up with a better magic number than Granlund-Montgomery
We can also see now why a better magic number worked in the example from the previous section. Let me repeat it here with the notation of Theorem 2: we have $x=1/102807$ and $n_{\max}=2^{32}-1$, so we have
$$\left\lfloor\frac{n}{102807}\right\rfloor=\left\lfloor\frac{nm}{2^k}\right\rfloor\tag{11}$$
for all $n=1,\cdots,2^{32}-1$ if and only if $\xi=\frac{m}{2^k}$ satisfies
$$\frac{1}{102807}\leq\frac{m}{2^k}<\frac{1}{102807}+\frac{1}{v\cdot 102807}.\tag{12}$$
In this case, $v$ is the largest integer in the range $1,\cdots,2^{32}-1$ which has the remainder $102806$ when divided by $102807$. It can be easily seen that $v=4294865231$, so the inequality above becomes
$$\frac{2^k}{102807}\leq m<\frac{41776\cdot 2^k}{4294865231}.\tag{13}$$
The smallest $k$ that allows an integer solution to the above inequality is $k=48$, in which case the unique solution is $m=2737896999$.
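This search is mechanical enough to automate. Below is a small helper of my own (the name `find_magic` is made up) that carries it out with exact integer arithmetic via the `unsigned __int128` extension of GCC/clang; here $k$ denotes the total shift, as in the inequality above:

```cpp
#include <cstdint>
#include <cstdio>

// For x = 1/d and nmax = 2^32 - 1, find the smallest total shift k and
// the corresponding m with 1/d <= m/2^k < (v+1)/(v*d), per Theorem 2.
void find_magic(std::uint64_t d) {
    const std::uint64_t nmax = 0xFFFFFFFFull;
    // v = largest integer <= nmax congruent to -1 modulo d.
    const std::uint64_t v = ((nmax + 1) / d) * d - 1;
    for (int k = 1; k < 96; ++k) {
        const unsigned __int128 pow2 = (unsigned __int128)1 << k;
        const unsigned __int128 m = (pow2 + d - 1) / d; // ceil(2^k / d)
        // Strict upper bound, cross-multiplied: m * v * d < 2^k * (v + 1).
        if (m * v * d < pow2 * (v + 1)) {
            std::printf("d = %llu: k = %d, m = %llu\n",
                        (unsigned long long)d, k, (unsigned long long)m);
            return;
        }
    }
}

int main() { find_magic(102807); } // prints: d = 102807: k = 48, m = 2737896999
```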
A textbook example for the case $p\neq 1$
Let me demonstrate how Theorem 2 can be used for the case $p\neq 1$. Suppose that we want to convert a temperature from Fahrenheit to Celsius. Obviously, representing such values only using integers is a funny idea, but let us pretend that we are completely serious (and hey, in 2023, Fahrenheit itself is a funny joke in the first place😂… except that I am currently living in an interesting country🤮). Or, if one desperately wants to make a really serious example, we can maybe think about doing the same thing with fixed-point fractional numbers. But whatever.
So the formula is: we first subtract $32$ and then multiply by $5/9$. For the sake of simplicity, let us arbitrarily assume that our Fahrenheit temperature ranges from $32\,^\circ\mathrm{F}$ to $580\,^\circ\mathrm{F}$, so in particular we do not run into negative numbers. After subtracting $32$, the range is from $0$ to $548$, so we take $n_{\max}=548$. With $x=\frac{5}{9}$, our $v$ is the largest integer such that $v\leq 548$ and $5v\equiv 8\pmod{9}$, or equivalently, $v\equiv 7\pmod{9}$. The largest multiple of $9$ in the range is $540$, so we see $v=547$. Hence, the inequality we are given is
$$\frac{5}{9}\leq\frac{m}{2^k}<\frac{5}{9}+\frac{1}{547\cdot 9}=\frac{304}{547}.\tag{14}$$
The smallest $k$ that allows an integer solution is $k=10$, and the unique solution for that case is $m=569$. Therefore, the final formula would be:
$$\left\lfloor\frac{5(n-32)}{9}\right\rfloor=\left\lfloor\frac{569(n-32)}{2^{10}}\right\rfloor.\tag{15}$$
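As a quick illustration, here is the resulting branch-free conversion routine (a sketch under the stated assumption $32\leq n\leq 580$):

```cpp
#include <cstdint>

// Fahrenheit to Celsius for 32 <= n <= 580, using (m, k) = (569, 10):
// floor(5 * (n - 32) / 9) == (569 * (n - 32)) >> 10 on this range.
std::uint32_t fahrenheit_to_celsius(std::uint32_t n) {
    return (569u * (n - 32u)) >> 10;
}
```

For example, $n=212$ gives $(569\cdot 180)\gg 10=\lfloor 102420/1024\rfloor=100$, as expected.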
Will the case $p\neq 1$ be potentially relevant for compiler-writers?
Might be, but there is a caveat: in general, the compiler is not allowed to optimize an expression `n * p / q` into `(n * m) >> k`, because `n * p` can overflow, and in that case dividing by `q` will give a weird answer. To be more specific, for unsigned integer types, the C/C++ standards say that any overflow should wrap around, so the expression `n * p / q` is not really supposed to compute $\lfloor\frac{np}{q}\rfloor$; rather, it is supposed to compute $\lfloor\frac{np\ \mathrm{mod}\ 2^N}{q}\rfloor$ where $N$ is the bit-width, even though it is quite likely that the one who wrote the code actually wanted the former. On the other hand, for signed integer types, (a signed equivalent of) Theorem 2 might be applicable, because signed overflows are undefined behavior. But presumably there is lots of code out there relying on the wrong assumption that signed overflows will wrap around, so maybe compiler-writers do not want to do this kind of optimization.
Nevertheless, there are situations where doing this is perfectly legal. For example, suppose `n`, `p`, `q` are all of type `std::uint32_t` and the code is written like `static_cast<std::uint32_t>((static_cast<std::uint64_t>(n) * p) / q)` to intentionally avoid this overflow issue. Then the compiler might recognize such a pattern and do this kind of optimization. Or more generally, with the new `assume` attribute (or some other equivalent compiler-specific mechanism), the user might provide assumptions that ensure no overflow.
It seems that currently neither clang nor GCC does this, so if they want to do so in the future, then Theorem 2 might be useful. But how much code can benefit from such an optimization? Will it really be worth implementing? I do not know, but maybe not.
When the magic number is too big
Even with the optimal bound, there exist situations where the smallest possible magic number does not fit into the word size. For example, consider the case $n_{\max}=2^{64}-1$ and $x=1/10961$. In this case, our $v$ is
$$v=\left\lfloor\frac{2^{64}-10961}{10961}\right\rfloor\cdot 10961+10960=18446744073709550681.\tag{16}$$
Hence, the inequality we need to inspect is
$$\frac{1}{10961}\leq\frac{m}{2^k}<\frac{1}{10961}+\frac{1}{10961\cdot 18446744073709550681}=\frac{1682943533775162}{18446744073709550681},$$
and the smallest $k$ allowing an integer solution is $78$, in which case the unique solution is $m=27573346857372255605$. And unfortunately, this is a $65$-bit number!
Let us see how to deal with this case by looking at what a compiler actually does. Consider the code:
std::uint64_t div(std::uint64_t n) {
return n / 10961;
}
My compiler (clang, again) generated the following assembly:
div(unsigned long):
movabs rcx, 9126602783662703989
mov rax, rdi
mul rcx
sub rdi, rdx
shr rdi
lea rax, [rdi + rdx]
shr rax, 13
ret
This is definitely a little bit more complicated than the happy case, but if we think carefully, we can realize that this is still just computing $\lfloor\frac{nm}{2^k}\rfloor$:
- The magic number $m'=9126602783662703989$ is precisely $m-2^{64}$.
- In the third line, we multiply this magic number with the input. Let us call the upper half of the result $u$ (which is stored in `rdx`), and the lower half $\ell$ (which is stored in `rax`; we do not care about the lower half anyway).
- We subtract the upper half $u$ from the input $n$ (the `sub` line), divide the result by $2$ (the `shr` line), add the result back to $u$ (the `lea` line), and then store the end result into `rax`. This looks a bit puzzling, but what it really does is nothing but compute $\lfloor(n+u)/2\rfloor=\lfloor nm/2^{65}\rfloor$. The reason why we first subtract $u$ from $n$ is to avoid overflow. On the other hand, the subtraction is totally fine as there can be no underflow, because $u=\lfloor nm'/2^{64}\rfloor$ is at most $n$, as $m'$ is less than $2^{64}$.
- Recall $k=78$, so we want to compute $\lfloor nm/2^{78}\rfloor$. Since we got $\lfloor nm/2^{65}\rfloor$ from the previous step, we just need to shift this further by $13$ bits.
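Written back in C++, the whole sequence reads as follows (a reconstruction of my own of what the assembly computes; `unsigned __int128` stands in for the 128-bit product of `mul`):

```cpp
#include <cstdint>

// floor(n / 10961) for all 64-bit n, with k = 78 and the 65-bit magic
// number m = 2^64 + 9126602783662703989, computed as in the assembly.
std::uint64_t div10961(std::uint64_t n) {
    constexpr std::uint64_t mp = 9126602783662703989u; // m' = m - 2^64
    // u = floor(n * m' / 2^64), the upper half of the multiplication.
    std::uint64_t u = static_cast<std::uint64_t>(
        (static_cast<unsigned __int128>(n) * mp) >> 64);
    // u + floor((n - u) / 2) = floor((n + u) / 2) = floor(n * m / 2^65);
    // subtracting u first avoids overflow of n + u. Then shift by
    // 78 - 65 = 13 more bits.
    return (u + ((n - u) >> 1)) >> 13;
}
```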
This is not so bad; we just have two more trivial instructions compared to the happy case. But the reason why the above works is largely that the magic number is just one bit larger than $64$ bits. Well, can it be even larger than that?
The answer is: No, fortunately, the magic number being just one bit larger than the word size is the worst case.
To see why, note that the size of the interval where $\xi=\frac{m}{2^k}$ can possibly live is precisely $1/vq$. Therefore, if $k$ is large enough so that $2^k\geq vq$, then the difference between the endpoints of the inequality
$$\frac{2^k}{q}\leq m<2^k\left(\frac{1}{q}+\frac{1}{vq}\right)\tag{17}$$
is at least $1$, so the inequality must admit an integer solution. Now, we are interested in the bit-width of $\lceil\frac{2^k}{q}\rceil$, which must be the smallest possible magic number if $k$ admits at least one solution. Since the smallest admissible $k$ is at most the smallest $k$ satisfying $2^k\geq vq$, we must have $vq>2^{k-1}$, so $\frac{2^k}{q}<2v$. Clearly, the right-hand side is at most one bit larger than the word size.
Actually, there is an alternative way of dealing with the case of a too-large magic number, which is to consider a slightly different formula: instead of just doing a multiplication and then a shift, perform an addition of another magic number between those two operations. Using the notation from Theorem 2, what this means is that we have some $\zeta$ satisfying the equality
$$\lfloor nx\rfloor=\lfloor n\xi+\zeta\rfloor\tag{18}$$
instead of $\lfloor nx\rfloor=\lfloor n\xi\rfloor$, where $\xi=\frac{m}{2^k}$ and $\zeta=\frac{s}{2^k}$, so that
$$\lfloor n\xi+\zeta\rfloor=\left\lfloor\frac{nm+s}{2^k}\right\rfloor\tag{19}$$
can be computed by performing a multiplication, an addition, and a shift.
The inclusion of this $\zeta$ might allow us to use a smaller magic number, thereby enabling its accommodation within a single word. The next section is dedicated to the condition for having the above equality.
Here is a small remark before we start the (extensive) discussion of the optimal bound for this. Note that this trick of including the $\zeta$ term is probably not so useful for $64$-bit divisions, because the addition is not really a trivial operation in this case: the result of the multiplication $mn$ spans two $64$-bit blocks, so we need an `adc` (add-with-carry) instruction or an equivalent, which is not particularly well-optimized on typical x86-64 CPUs. While I have not conducted a benchmark, I speculate that this approach probably results in worse performance. However, this trick might give better performance for the $32$-bit case than the method explained in this section, as every operation can be done inside $64$ bits. Interestingly, it seems modern compilers do not use this trick anyway. In the link provided, the compiler is trying to compute multiplication by $4999244749$ followed by a shift by $49$ bits. However, it turns out that by multiplying by $1249811187$, adding $1249811187$, and then shifting to the right by $47$ bits, we can do this computation completely within $64$ bits. I do not know whether the reason the compilers do not leverage this trick is that it does not perform better, or just that they did not bother to implement it.
Multiply-add-and-shift rather than multiply-shift
WARNING: The contents of this section are substantially more complicated and math-heavy than the previous sections.
As said in the last section, we will explore the condition for having
$$\lfloor nx\rfloor=\lfloor n\xi+\zeta\rfloor\tag{20}$$
for all $n=1,\cdots,n_{\max}$, where $x$, $\xi$ are real numbers and $\zeta$ is a nonnegative real number. We will derive the optimal bound, i.e., an "if and only if" condition. We remark that an optimal bound has been obtained in the paper by Lemire et al. mentioned above for the special case when $x=\frac{1}{q}$ for some $q\leq n_{\max}$ and $\xi=\zeta$. According to their paper, the proof of the optimality of their bound is almost identical to the case of having no $\zeta$, so they did not even bother to write down the proof. I provide a proof of this special case in a later subsection. The proof I wrote seems to rely on the heavy lifting done below, but it can be done without it pretty easily as well.
However, in the general case I am dealing with here, i.e., when the only restriction is $\zeta\geq 0$, the situation is quite a bit more complicated. Nevertheless, even in this generality, it is possible to give a very concrete description of how exactly the presence of $\zeta$ distorts the optimal bound.
Just like the case $\zeta=0$ (i.e., Theorem 2), having the equality for all $n=1,\cdots,n_{\max}$ is equivalent to having the inequality
$$\max_{n=1,\cdots,n_{\max}}\frac{\lfloor nx\rfloor-\zeta}{n}\leq\xi<\min_{n=1,\cdots,n_{\max}}\frac{\lfloor nx\rfloor+1-\zeta}{n},\tag{21}$$
so the question is how to evaluate the maximum and the minimum in the above. For a reason that will become apparent as we proceed, we will in fact try to find the largest maximizer of the left-hand side and the smallest minimizer of the right-hand side.
The lower bound
Since $\zeta\geq 0$ is supposed to be just a small constant, it sounds reasonable to believe that the maximizer of $\frac{\lfloor nx\rfloor}{n}$ is probably quite close to being the maximizer of $\frac{\lfloor nx\rfloor-\zeta}{n}$ as well. So let us start there: let $n_0$ be the largest among all maximizers of $\frac{\lfloor nx\rfloor}{n}$. Now, what should happen if $n_0$ were not the maximizer of $\frac{\lfloor nx\rfloor-\zeta}{n}$? Say, what can we say about $n$'s such that
$$\frac{\lfloor nx\rfloor-\zeta}{n}\geq\frac{\lfloor n_0x\rfloor-\zeta}{n_0}\tag{22}$$
holds? First of all, since this inequality implies
$$-\frac{\zeta}{n}\geq-\frac{\zeta}{n_0},\tag{23}$$
we must have $n\geq n_0$ unless $\zeta=0$, which is an uninteresting case anyway.
As a result, we can equivalently reformulate our optimization problem as follows: given $n=0,\cdots,n_{\max}-n_0$, find the largest maximizer of
$$\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}.\tag{24}$$
Now, we claim that
$$\lfloor(n_0+n)x\rfloor=\lfloor nx\rfloor+\lfloor n_0x\rfloor\tag{25}$$
holds for all such $n$. (Note that in general $\lfloor x+y\rfloor$ is equal to either $\lfloor x\rfloor+\lfloor y\rfloor$ or $\lfloor x\rfloor+\lfloor y\rfloor+1$.) This follows from the fact that $\frac{\lfloor n_0x\rfloor}{n_0}$ is a best rational approximation of $x$ from below not only in the weak sense, but also in the strong sense. Okay, at this point there is no way to get around this jargon anymore, so let us define it formally.
Definition 3 (Best rational approximations from below/above).
Let $x$ be a real number. We say a rational number $\frac{p}{q}$ (in its reduced form, which is always assumed if not specified otherwise) is a best rational approximation from below (above, resp.) if $\frac{p}{q}\leq x$ ($\frac{p}{q}\geq x$, resp.) and for any rational number $\frac{a}{b}$ with $\frac{p}{q}\leq\frac{a}{b}\leq x$ ($\frac{p}{q}\geq\frac{a}{b}\geq x$, resp.), we always have $q\leq b$.
In other words, $\frac{p}{q}$ is a best rational approximation of $x$ if any better rational approximation must have a larger denominator.
Remark. Note that the terminology best rational approximation is pretty standard in pure mathematics, but it usually disregards the direction of approximation, from below or above. For example, $\frac{1}{3}$ is a best rational approximation from below of $\frac{3}{7}$, but it is not a best rational approximation in the usual, non-directional sense, because $\frac{1}{2}$ is a better approximation with a strictly smaller denominator. But this concept of non-directional best rational approximation is quite irrelevant to our application, so any usage of the term best rational approximation in this post always means the directional ones, either from below or above.
The definition provided above is the one in the weak sense, although it is not explicitly written so. The corresponding one in the strong sense is given below:
Definition 4 (Best rational approximations from below/above in the strong sense).
Let $x$ be a real number. We say a rational number $\frac{p}{q}$ (again, in its reduced form) is a best rational approximation from below (above, resp.) in the strong sense if $\frac{p}{q}\leq x$ ($\frac{p}{q}\geq x$, resp.) and for any rational number $\frac{a}{b}$ with $qx-p\geq bx-a\geq 0$ ($p-qx\geq a-bx\geq 0$, resp.), we always have $q\leq b$.
As the name suggests, if $\frac{p}{q}$ is a best rational approximation from below (above, resp.) in the strong sense, then it is a best rational approximation from below (above, resp.) in the weak sense. To see why, suppose that $\frac{p}{q}$ is a best rational approximation from below of $x$ in the strong sense, and take any rational number $\frac{a}{b}$ with $\frac{a}{b}\leq x$ and $b\leq q$. Then it is enough to show that $\frac{p}{q}\geq\frac{a}{b}$ holds. (We are only considering the "from below" case; the "from above" case can be done in the same way.) This is indeed quite easy to show:
$$x-\frac{p}{q}=\frac{qx-p}{q}\leq\frac{bx-a}{q}\leq\frac{bx-a}{b}=x-\frac{a}{b},\tag{26}$$
thus we conclude $\frac{a}{b}\leq\frac{p}{q}$ as claimed, so $\frac{p}{q}$ is indeed a best rational approximation from below of $x$ in the weak sense.
Remarkably, using the theory of continued fractions, it can be shown that the converse is also true. Note that this fact is not at all trivial if one only looks at the definitions. You can find a proof of this fact in my paper on Dragonbox; see the remark after Algorithm C.13 (I shamelessly give my own writing as a reference because I do not know of any other).
Alright, but what is the point of all this nonsense?
First, as suggested before, our $\frac{\lfloor n_0x\rfloor}{n_0}$ is a best rational approximation from below of $x$. This is obvious from its definition: $n_0$ is the maximizer of $\frac{\lfloor nx\rfloor}{n}$ for $n=1,\cdots,n_{\max}$, which means that whenever we have
$$\frac{\lfloor n_0x\rfloor}{n_0}<\frac{a}{b}\leq x,\tag{27}$$
then since $a\leq\lfloor bx\rfloor$ holds (because otherwise the inequality $\frac{a}{b}\leq x$ fails to hold), we must have $b>n_{\max}$, so in particular $b>n_0$.
Therefore, by the aforementioned equivalence of the two concepts, we immediately know that $\frac{\lfloor n_0x\rfloor}{n_0}$ is a best rational approximation from below of $x$ in the strong sense. Note that this does not imply that $n_0$ is a minimizer of $nx-\lfloor nx\rfloor$ for $n=1,\cdots,n_{\max}$, because $\frac{\lfloor n_0x\rfloor}{n_0}$ is not necessarily in its reduced form (since we chose $n_0$ to be the largest maximizer). Still, if we denote its reduced form as $\frac{p}{q}$, then certainly $q$ is the minimizer of it (and $p$ must be equal to $\lfloor qx\rfloor$). Indeed, by the definition of best rational approximations from below in the strong sense, if $n$ is the smallest integer such that $n>q$ and
$$nx-\lfloor nx\rfloor<qx-p,\tag{28}$$
then $\frac{\lfloor nx\rfloor}{n}$ itself must be a best rational approximation from below in the strong sense, thus also in the weak sense. This in particular means $\frac{\lfloor nx\rfloor}{n}\geq\frac{p}{q}=\frac{\lfloor n_0x\rfloor}{n_0}$ holds since $n>q$, and in fact the equality cannot hold, because otherwise we would get
$$qx-p=qx-\frac{q}{n}\lfloor nx\rfloor=\frac{q}{n}\left(nx-\lfloor nx\rfloor\right)<\frac{q}{n}\left(qx-p\right),\tag{29}$$
which contradicts $n>q$. Therefore, by the definition of $n_0$, we must have $n>n_{\max}$.
Using this fact, we can now easily prove (25). Again let $\frac{p}{q}$ be the reduced form of $\frac{\lfloor n_0x\rfloor}{n_0}$; then for any $n=0,\cdots,n_{\max}-q$, we must have
$$(q+n)x-\lfloor(q+n)x\rfloor\geq qx-p,\tag{30}$$
so we get
$$\lfloor(q+n)x\rfloor\leq\lfloor qx\rfloor+nx<\lfloor qx\rfloor+\lfloor nx\rfloor+1,\tag{31}$$
and this rules out the possibility $\lfloor(q+n)x\rfloor=\lfloor qx\rfloor+\lfloor nx\rfloor+1$, thus we must have $\lfloor(q+n)x\rfloor=\lfloor qx\rfloor+\lfloor nx\rfloor$. Then, inductively, we get that for any positive integer $k$ such that $kq\leq n_{\max}$ and $n=0,\cdots,n_{\max}-kq$,
$$\lfloor(kq+n)x\rfloor=\lfloor qx\rfloor+\lfloor((k-1)q+n)x\rfloor=\cdots=\lfloor kqx\rfloor+\lfloor nx\rfloor.\tag{32}$$
Hence, we must have $\lfloor(n_0+n)x\rfloor=\lfloor n_0x\rfloor+\lfloor nx\rfloor$ in particular.
(Here is a more intuitive explanation. Note that among all $nx$'s, $qx$ is the one with the smallest fractional part. Then whenever $kq+n\leq n_{\max}$, the sum of the fractional part of $nx$ and those of $k$ copies of $qx$ should not "wrap around", i.e., it should be strictly less than $1$. Indeed, if we choose the smallest $k$ for which $(kq+n)x$ wraps around, then since $((k-1)q+n)x$ does not wrap around, its fractional part is strictly less than $1$, which means that the fractional part of $(kq+n)x$ is strictly less than that of $qx$, contradicting the minimality of the fractional part of $qx$. Hence, since the sum of all the fractional parts is still strictly less than $1$, the integer part of $(kq+n)x$ is just the sum of the integer parts of its summands.)
Next, using the claim, we can now characterize the $n=1,\cdots,n_{\max}-n_0$ which yield a larger value of
$$\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}\tag{33}$$
than the case $n=0$. Indeed, we have the inequality
$$\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}\geq\frac{\lfloor n_0x\rfloor-\zeta}{n_0}\tag{34}$$
if and only if
$$n_0\lfloor nx\rfloor-n_0\zeta\geq n\lfloor n_0x\rfloor-(n_0+n)\zeta,\tag{35}$$
if and only if
$$\frac{\lfloor nx\rfloor}{n}\geq\frac{\lfloor n_0x\rfloor-\zeta}{n_0},\tag{36}$$
or equivalently,
$$x-\frac{\lfloor nx\rfloor}{n}\leq\frac{\zeta}{n_0}+\left(x-\frac{\lfloor n_0x\rfloor}{n_0}\right).\tag{37}$$
Therefore, in a sense, $\frac{\lfloor nx\rfloor}{n}$ itself should be a good enough approximation of $x$, although it does not need to (and cannot) be a better approximation than $\frac{\lfloor n_0x\rfloor}{n_0}$.
At this point, we could just enumerate all rational approximations of $x$ from below satisfying the above bound and find the one that maximizes $\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}$. Indeed, the theory of continued fractions allows us to develop an efficient algorithm for doing such a task. (See Algorithm C.13 in my paper on Dragonbox for an algorithm doing a closely related task.) However, we can do better here.
First of all, note that if $\zeta$ is small enough, then (37) does not have any solution for $n=1,\cdots,n_{\max}-n_0$, which means $n_0$ is indeed the largest maximizer we are looking for and there is nothing further we need to do. To be more precise, if $\zeta$ is chosen so that $\frac{\lfloor n_0x\rfloor-\zeta}{n_0}$ is strictly greater than the next best rational approximation from below of $x$, then any solution to (37) must be a multiple of $q_*$, where $q_*$ is the denominator of $\frac{\lfloor n_0x\rfloor}{n_0}$ in its reduced form. But since $n_0$ is chosen to be the largest, it follows that there is no multiple of $q_*$ in the range $\{1,\cdots,n_{\max}-n_0\}$, so there is no solution to (37) in that range.
This conclusion is consistent with the intuition that $n_0$ should be close enough to the maximizer of $\frac{\lfloor nx\rfloor-\zeta}{n}$, at least when $\zeta$ is small. But what if $\zeta$ is not that small?
It is quite tempting to claim that the minimizer of the left-hand side of (37) is the maximizer of $\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}$, but that is not true in general. Nevertheless, we can start from there, just as we started from $n_0$ at the beginning.
For this reason, let $n_1$ be the largest maximizer of $\frac{\lfloor nx\rfloor}{n}$ for $n=1,\cdots,n_{\max}-n_0$. As pointed out earlier, if this $n_1$ does not satisfy (37), then there is nothing further to do, so suppose that the inequality is indeed satisfied with $n=n_1$.
Next, we claim that
$$\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}\leq\frac{\lfloor(n_0+n_1)x\rfloor-\zeta}{n_0+n_1}\tag{38}$$
holds for all $n=1,\cdots,n_1$.
Note that by (25), this is equivalent to
$$n_0\lfloor nx\rfloor+n_1(\lfloor n_0x\rfloor+\lfloor nx\rfloor)-(n_0+n_1)\zeta\leq n_0\lfloor n_1x\rfloor+n(\lfloor n_0x\rfloor+\lfloor n_1x\rfloor)-(n_0+n)\zeta,$$
and rearranging it gives
$$(n_1-n)(\zeta-\lfloor n_0x\rfloor)\geq(n_1\lfloor nx\rfloor-n\lfloor n_1x\rfloor)+n_0(\lfloor nx\rfloor-\lfloor n_1x\rfloor).\tag{39}$$
By adding and subtracting appropriate terms, we can rewrite this as
$$n_0(n_1-n)\left(\frac{\zeta}{n_0}+\left(x-\frac{\lfloor n_0x\rfloor}{n_0}\right)\right)\geq n_1(n_0+n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)-n(n_0+n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right).$$
Recall that we have already assumed
$$x-\frac{\lfloor n_1x\rfloor}{n_1}\leq\frac{\zeta}{n_0}+\left(x-\frac{\lfloor n_0x\rfloor}{n_0}\right),\tag{40}$$
so it is enough to show
$$n_0(n_1-n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)\geq n_1(n_0+n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)-n(n_0+n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right),$$
or equivalently,
$$n(n_0+n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right)\geq(n_1(n_0+n)-n_0(n_1-n))\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)=n(n_0+n_1)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right),$$
which trivially holds by the definition of $n_1$. Thus the claim is proved.
As a result, we can reformulate our optimization problem again in the following way: define $N_1:=n_0+n_1$; then we are to find the $n=0,\cdots,n_{\max}-N_1$ which maximizes
$$\frac{\lfloor(N_1+n)x\rfloor-\zeta}{N_1+n}.\tag{41}$$
As you can see, this now resembles quite a lot what we have just done. Then the natural next step is to see whether we have
$$\lfloor(N_1+n)x\rfloor=\lfloor N_1x\rfloor+\lfloor nx\rfloor,\tag{42}$$
which, by (25), is equivalent to
$$\lfloor(n_1+n)x\rfloor=\lfloor n_1x\rfloor+\lfloor nx\rfloor.\tag{43}$$
But we already know this: because $\frac{\lfloor n_1x\rfloor}{n_1}$ is a best rational approximation from below of $x$, the exact same proof of (25) applies.
Therefore, by the same procedure as before, we get that
$$\frac{\lfloor(N_1+n)x\rfloor-\zeta}{N_1+n}\geq\frac{\lfloor N_1x\rfloor-\zeta}{N_1}\tag{44}$$
holds for $n=1,\cdots,n_{\max}-N_1$ if and only if
$$\frac{\lfloor nx\rfloor}{n}\geq\frac{\lfloor N_1x\rfloor-\zeta}{N_1}.\tag{45}$$
Hence, we finally arrive at the following iterative algorithm for computing the largest maximizer of $\frac{\lfloor nx\rfloor-\zeta}{n}$:
Algorithm 5 (Computing the lower bound).
Input: $x\in\mathbb{R}$, $n_{\max}\in\mathbb{Z}_{>0}$, $\zeta\geq 0$.
Output: the largest maximizer of $\frac{\lfloor nx\rfloor-\zeta}{n}$ for $n=1,\cdots,n_{\max}$.
1. Find the largest $n=1,\cdots,n_{\max}$ that maximizes $\frac{\lfloor nx\rfloor}{n}$ and call it $n_0$.
2. If $n_0=n_{\max}$, then $n_0$ is the largest maximizer; return.
3. Otherwise, find the largest $n=1,\cdots,n_{\max}-n_0$ that maximizes $\frac{\lfloor nx\rfloor}{n}$ and call it $n_1$.
4. Inspect the inequality
$$\frac{\lfloor n_1x\rfloor}{n_1}\geq\frac{\lfloor n_0x\rfloor-\zeta}{n_0}.\tag{46}$$
5. If the inequality does not hold, then $n_0$ is the largest maximizer; return.
6. If the inequality does hold, then set $n_0\leftarrow n_0+n_1$ and go to Step 2.
Remark. In the above, we blackboxed the operation of finding the largest maximizer of $\frac{\lfloor nx\rfloor}{n}$. Actually, this is precisely how we obtain Theorem 2. If $x$ is a rational number whose denominator $q$ is at most $n_{\max}$, then obviously the largest maximizer is the largest multiple of $q$ bounded by $n_{\max}$. Otherwise, we just compute the best rational approximation from below of $x$ with the largest denominator $q_*\leq n_{\max}$, where the theory of continued fractions allows us to compute this very efficiently. (Especially, when $x$ is a rational number, this efficient algorithm is really just the extended Euclidean algorithm.) Once we get the best rational approximation $\frac{p_*}{q_*}$ (which must be in its reduced form), we find the largest multiple $kq_*$ of $q_*$ bounded by $n_{\max}$. Then since $\frac{p_*}{q_*}$ is the maximum of $\frac{\lfloor nx\rfloor}{n}$ for $n=1,\cdots,n_{\max}$, it follows that $\lfloor kq_*x\rfloor$ must be equal to $kp_*$ and $kq_*$ is the largest maximizer of $\frac{\lfloor nx\rfloor}{n}$.
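To make the iteration concrete, here is a naive reference implementation of Algorithm 5 for rational $x=p/q$ and $\zeta=s/2^k$ (my own sketch: the maximizer subroutine is brute force instead of the continued-fraction method described in the remark, so it is only meant for small $n_{\max}$):

```cpp
#include <cstdint>

// Brute force: largest maximizer of floor(n*p/q)/n over n = 1, ..., bound.
std::uint64_t largest_maximizer(std::uint64_t p, std::uint64_t q,
                                std::uint64_t bound) {
    std::uint64_t best = 1;
    for (std::uint64_t n = 2; n <= bound; ++n)
        // Compare floor(n*p/q)/n against floor(best*p/q)/best by
        // cross-multiplication; ">=" makes ties pick the larger n.
        if ((n * p / q) * best >= (best * p / q) * n) best = n;
    return best;
}

// Algorithm 5 with x = p/q and zeta = s/2^k: the largest maximizer of
// (floor(n*p/q) - zeta)/n over n = 1, ..., nmax. Assumes s < 2^k and
// p/q >= 1/nmax (so that floor(n0*x) >= 1 > zeta below).
std::uint64_t algorithm5(std::uint64_t p, std::uint64_t q, std::uint64_t nmax,
                         std::uint64_t s, int k) {
    std::uint64_t n0 = largest_maximizer(p, q, nmax);
    while (n0 != nmax) {
        std::uint64_t n1 = largest_maximizer(p, q, nmax - n0);
        // Inequality (46), cross-multiplied and scaled by 2^k:
        // n0 * floor(n1 x) * 2^k >= n1 * (2^k * floor(n0 x) - s).
        unsigned __int128 lhs = ((unsigned __int128)n0 * (n1 * p / q)) << k;
        unsigned __int128 rhs = (unsigned __int128)n1 *
                                (((unsigned __int128)(n0 * p / q) << k) - s);
        if (lhs < rhs) break;
        n0 += n1;
    }
    return n0;
}
```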
The upper bound
The computation of the upper bound, that is, finding the smallest minimizer of
$$\frac{\lfloor nx\rfloor+1-\zeta}{n}\tag{47}$$
for $n=1,\cdots,n_{\max}$, is largely the same as the case of the lower bound. In this case, we start with the smallest minimizer $n_0$ of $\frac{\lfloor nx\rfloor+1}{n}$. Then for any $n=1,\cdots,n_{\max}$ with
$$\frac{\lfloor nx\rfloor+1-\zeta}{n}\leq\frac{\lfloor n_0x\rfloor+1-\zeta}{n_0},\tag{48}$$
we must have
$$-\frac{\zeta}{n}\leq-\frac{\zeta}{n_0},\tag{49}$$
thus $n\leq n_0$ unless $\zeta=0$, which again is an uninteresting case.
Hence, our goal is to find the $n=0,\cdots,n_0-1$ minimizing
$$\frac{\lfloor(n_0-n)x\rfloor+1-\zeta}{n_0-n}.\tag{50}$$
Again, we claim that
$$\lfloor(n_0-n)x\rfloor=\lfloor n_0x\rfloor-\lfloor nx\rfloor\tag{51}$$
holds for all such $n$. We have two cases: (1) when $\frac{\lfloor n_0x\rfloor+1}{n_0}$ is a best rational approximation from above of $x$, or (2) when it is not. The second case can happen only when $x$ is a rational number whose denominator is at most $n_{\max}$, so that the best rational approximation of it is $x$ itself. However, for such a case, if we let $x=\frac{p}{q}$, then according to how we derived Theorem 2, the remainder of $n_0p/q$ must be the largest possible value, $q-1$. Hence, the quotient of $(n_0-n)p/q$ cannot be strictly smaller than the difference between the quotients of $n_0p/q$ and $np/q$, so the claim holds in this case.
On the other hand, if we suppose that $\frac{\lfloor n_0x\rfloor+1}{n_0}$ is a best rational approximation from above of $x$, then as we have seen in the case of the lower bound, it must be a best rational approximation from above in the strong sense. Then it is not hard to see that $n_0$ is the minimizer of $\lceil nx\rceil-nx=\lfloor nx\rfloor+1-nx$, or equivalently, the maximizer of $nx-\lfloor nx\rfloor$, because we chose $n_0$ to be the smallest minimizer. This shows
$$(n_0-n)x-\lfloor(n_0-n)x\rfloor\leq n_0x-\lfloor n_0x\rfloor,\tag{52}$$
thus
$$\lfloor(n_0-n)x\rfloor\geq\lfloor n_0x\rfloor-nx>\lfloor n_0x\rfloor-\lfloor nx\rfloor-1,\tag{53}$$
so we must have $\lfloor(n_0-n)x\rfloor=\lfloor n_0x\rfloor-\lfloor nx\rfloor$ as claimed.
Using the claim, we can again characterize the $n=1,\cdots,n_0-1$ which yield a smaller value of
$$\frac{\lfloor(n_0-n)x\rfloor+1-\zeta}{n_0-n}\tag{54}$$
than the case $n=0$. Indeed, we have the inequality
$$\frac{\lfloor(n_0-n)x\rfloor+1-\zeta}{n_0-n}\leq\frac{\lfloor n_0x\rfloor+1-\zeta}{n_0}\tag{55}$$
if and only if
$$-n_0\lfloor nx\rfloor+n_0(1-\zeta)\leq-n\lfloor n_0x\rfloor+(n_0-n)(1-\zeta),\tag{56}$$
if and only if
$$n_0\lfloor nx\rfloor\geq n\lfloor n_0x\rfloor+n(1-\zeta),\tag{57}$$
if and only if
$$\frac{\lfloor nx\rfloor}{n}\geq\frac{\lfloor n_0x\rfloor+1-\zeta}{n_0},\tag{58}$$
or equivalently,
$$x-\frac{\lfloor nx\rfloor}{n}\leq\frac{\zeta}{n_0}-\left(\frac{\lfloor n_0x\rfloor+1}{n_0}-x\right).\tag{59}$$
Next, let $n_1$ be the largest maximizer of $\frac{\lfloor nx\rfloor}{n}$ for $n=1,\cdots,n_0-1$. If (59) does not hold with $n=n_1$, then we know that $n_0$ is the one we are looking for, so suppose otherwise.
Next, we claim that
$$\frac{\lfloor(n_0-n)x\rfloor+1-\zeta}{n_0-n}\geq\frac{\lfloor(n_0-n_1)x\rfloor+1-\zeta}{n_0-n_1}\tag{60}$$
holds for all $n=1,\cdots,n_1$. By (51), we can rewrite the inequality above as
$$-n_0\lfloor nx\rfloor-n_1\lfloor n_0x\rfloor+n_1\lfloor nx\rfloor+(n_0-n_1)(1-\zeta)\geq-n_0\lfloor n_1x\rfloor-n\lfloor n_0x\rfloor+n\lfloor n_1x\rfloor+(n_0-n)(1-\zeta),$$
or
$$(n_1-n)(\lfloor n_0x\rfloor+1-\zeta)\leq(n_1\lfloor nx\rfloor-n\lfloor n_1x\rfloor)-n_0(\lfloor nx\rfloor-\lfloor n_1x\rfloor).\tag{61}$$
Then adding and subtracting appropriate terms gives
$$n_0(n_1-n)\left(\frac{\zeta}{n_0}-\left(\frac{\lfloor n_0x\rfloor+1}{n_0}-x\right)\right)\geq n_1(n_0-n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)-n(n_0-n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right).$$
Recall that we have already supposed
$$x-\frac{\lfloor n_1x\rfloor}{n_1}\leq\frac{\zeta}{n_0}-\left(\frac{\lfloor n_0x\rfloor+1}{n_0}-x\right),\tag{62}$$
so it is enough to show
$$n_0(n_1-n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)\geq n_1(n_0-n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)-n(n_0-n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right),$$
or equivalently,
$$n(n_0-n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right)\geq(n_1(n_0-n)-n_0(n_1-n))\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)=n(n_0-n_1)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right),$$
which trivially holds by the definition of $n_1$. Thus, the claim is proved.
As a result, we can reformulate our optimization problem in the following way: define $N_1=n_0-n_1$; then we are to find the $n=0,\cdots,N_1-1$ which minimizes
$$\frac{\lfloor(N_1-n)x\rfloor+1-\zeta}{N_1-n}.\tag{63}$$
Again, the next step is to show that
$$\lfloor(N_1-n)x\rfloor=\lfloor N_1x\rfloor-\lfloor nx\rfloor\tag{64}$$
holds for all such $n$, which by (51) is equivalent to
$$\lfloor(n_1+n)x\rfloor=\lfloor n_1x\rfloor+\lfloor nx\rfloor.\tag{65}$$
But we have already seen this in the case of the lower bound, because we know $n_1$ is a maximizer of $\frac{\lfloor nx\rfloor}{n}$ for $n=1,\cdots,n_0-1$. Hence, again we can show that
$$\frac{\lfloor(N_1-n)x\rfloor+1-\zeta}{N_1-n}\leq\frac{\lfloor N_1x\rfloor+1-\zeta}{N_1}\tag{66}$$
holds if and only if
$$\frac{\lfloor nx\rfloor}{n}\geq\frac{\lfloor N_1x\rfloor+1-\zeta}{N_1},$$
and repeating this procedure gives us the smallest minimizer of $\frac{\lfloor nx\rfloor+1-\zeta}{n}$.
Algorithm 6 (Computing the upper bound).
Input: $x\in\mathbb{R}$, $n_{\max}\in\mathbb{Z}_{>0}$, $\zeta\geq 0$.
Output: the smallest minimizer of $\frac{\lfloor nx\rfloor+1-\zeta}{n}$ for $n=1,\cdots,n_{\max}$.
1. Find the smallest $n=1,\cdots,n_{\max}$ that minimizes $\frac{\lfloor nx\rfloor+1}{n}$ and call it $n_0$.
2. If $n_0=1$, then $n_0$ is the smallest minimizer; return.
3. Find the largest $n=1,\cdots,n_0-1$ that maximizes $\frac{\lfloor nx\rfloor}{n}$ and call it $n_1$.
4. Inspect the inequality
$$\frac{\lfloor n_1x\rfloor}{n_1}\geq\frac{\lfloor n_0x\rfloor+1-\zeta}{n_0}.$$
5. If the inequality does not hold, then $n_0$ is the smallest minimizer; return.
6. If the inequality does hold, then set $n_0\leftarrow n_0-n_1$ and go to Step 2.
Remark. Similarly to the case of the lower bound, we blackboxed the operation of finding the smallest minimizer of $\frac{\lfloor nx\rfloor+1}{n}$, which again is precisely how we obtain Theorem 2. If $x=\frac{p}{q}$ is a rational number with $q\leq n_{\max}$, then the minimizer is unique and is the largest $v\leq n_{\max}$ such that $vp\equiv -1\pmod{q}$. Otherwise, we just compute the best rational approximation from above of $x$ with the largest denominator $q^*\leq n_{\max}$, where the theory of continued fractions allows us to compute this very efficiently. The resulting $\frac{p^*}{q^*}$ must be in its reduced form, so $q^*$ is the smallest minimizer of $\frac{\lfloor nx\rfloor+1}{n}$.
Finding feasible values of ξ and ζ
Note that our purpose in introducing $\zeta$ was to increase the gap between the lower and the upper bounds on $\xi=\frac{m}{2^k}$ when the required bit-width of the constant $m$ is too large. Thus, $\zeta$ is not something given; rather, it is something we want to figure out. In this sense, Algorithm 5 and Algorithm 6 might seem pretty useless, because they work only when the value of $\zeta$ is already given.

Nevertheless, the way those algorithms work for a fixed $\zeta$ is in fact quite special, in that it allows us to figure out a good $\zeta$ fitting our purpose of widening the gap between the two bounds. The important point here is that $\zeta$ only appears in deciding when to stop the iteration, and all the other details of the actual iteration steps do not depend on $\zeta$ at all.

Hence, our strategy is as follows. First, we assume $\zeta$ lives in the interval $[0,1)$ (otherwise $\lfloor nx\rfloor=\lfloor n\xi+\zeta\rfloor$ fails to hold for $n=0$). Then we can partition the interval $[0,1)$ into subintervals of the form $[\zeta_{\min},\zeta_{\max})$ on which the numbers of iterations that Algorithm 5 and Algorithm 6 take remain constant. Then we look at each of these subintervals one by one, from left to right, proceeding with the iterations of Algorithm 5 and Algorithm 6 whenever we move onto the next subinterval and the numbers of iterations for each of them change.
Our constraint on $\xi$ and $\zeta$ is that the computation of $mn+s$ should not overflow, so suppose that there is a given limit $N_{\max}$ on how large $mn+s$ can be. Now, take a subinterval $[\zeta_{\min},\zeta_{\max})$ as described above. If there is at least one feasible choice of $\xi$ for some $\zeta\in[\zeta_{\min},\zeta_{\max})$, then such a $\xi$ must lie in the interval
$$I:=\left(\frac{\lfloor n_{L,0}x\rfloor-\zeta_{\max}}{n_{L,0}},\;\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_{\min}}{n_{U,0}}\right),$$
where $n_{L,0}$ and $n_{U,0}$ are the supposed outputs of Algorithm 5 and Algorithm 6, respectively, which must stay constant as long as $\zeta\in[\zeta_{\min},\zeta_{\max})$.
Next, we will loop over all numbers of the form $\frac{m}{2^k}$ in $I$ and check one by one whether each is really a possible choice of $\xi$. The order of examination will be the lexicographic order on $(k,m)$, that is, from smaller $k$ to larger $k$, and for a given $k$, from smaller $m$ to larger $m$. To find the smallest $k$, let $\Delta$ be the size of the interval above, that is,
$$\Delta:=\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_{\min}}{n_{U,0}}-\frac{\lfloor n_{L,0}x\rfloor-\zeta_{\max}}{n_{L,0}}.$$
Define $k_0:=\lfloor\log_2\frac{1}{\Delta}\rfloor$; then we have
$$2^{k_0}\Delta\leq 1<2^{k_0+1}\Delta\leq 2,$$
and this means that there is at most one integer in the interval $2^{k_0}I$, while there is at least one integer in the interval $2^{k_0+1}I$. Therefore, we first see if there is one in $2^{k_0}I$, and if not, then we have a look at $2^{k_0+1}I$, where the existence is guaranteed. Also, note that if there were none in $2^{k_0}I$, then $2^{k_0+1}I$ should have exactly one integer, and that integer must be odd.
In either case, we can find the smallest possible integer $k$ such that
$$\xi=\frac{m}{2^k}$$
is in the interval $I$ for some $m\in\mathbb{Z}$. (When $2^{k_0}I$ does not contain an integer, then $k=k_0+1$, and if it contains an integer, then $k=k_0-b$ where $b$ is the highest exponent of $2$ such that $2^b$ divides the unique integer in $2^{k_0}I$.) We will start from there and successively increase the value of $k$. For the currently given $k$, we loop over all $m$ such that $\xi=\frac{m}{2^k}$ stays inside $I$. (By construction, there is exactly one possible $m$ for the smallest $k$ which we start from.) In order for this $\xi$ to be allowed by a choice of $\zeta$, according to Algorithm 5, $\zeta$ should be at least
$$\zeta_0:=\lfloor n_{L,0}x\rfloor-n_{L,0}\xi=\frac{2^k\lfloor n_{L,0}x\rfloor-n_{L,0}m}{2^k}.$$
Since we want our $\xi$ to satisfy the upper bound given by Algorithm 6 as well, we want to choose $\zeta$ to be as small as possible as long as $\zeta\geq\zeta_0$ is satisfied. Hence, we want to take $\zeta=\zeta_0$, but this is not always possible, because $\zeta_0\geq\zeta_{\min}$ is not always true. (The above $\zeta_0$ can even be negative, so in the actual algorithm we may truncate it to be nonnegative.)
If $\zeta_0\geq\zeta_{\min}$ is satisfied, then we may indeed take $\zeta=\zeta_0$. By construction, $\zeta_0<\zeta_{\max}$ is automatically satisfied, so we want to check two things: (1) the smallness constraint $mn_{\max}+s\leq N_{\max}$ with $s=2^k\zeta_0$, and (2) the true upper bound
$$\xi<\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_0}{n_{U,0}}$$
given by Algorithm 6. If both of the conditions are satisfied, then $k$, $m$, and $s=2^k\zeta_0$ are a valid solution. Otherwise, we conclude that $\xi$ does not yield an admissible choice of $(k,m,s)$.

If $\zeta_0<\zeta_{\min}$, then we have to increase the numerator of $\zeta_0$ to put it inside $[\zeta_{\min},\zeta_{\max})$, so let $\zeta$ be the one obtained by adding to the numerator of $\zeta_0$ the smallest positive integer which ensures $\zeta\geq\zeta_{\min}$. In this case, since we cannot easily estimate how large $\zeta$ is compared to $\zeta_0$, we have to check $\zeta<\zeta_{\max}$ in addition to the other two conditions.
If we fail to find any admissible choice of $(k,m,s)$ with our $\xi$, then we increase the numerator $m$ and see if that works out, until $\xi=\frac{m}{2^k}$ goes outside of the search interval $I$. After exhausting all $m$'s, we increase $k$ by one and repeat the same procedure. Note that even though all the $\xi=\frac{m}{2^k}$'s with even numerators were already considered when we were looking at a smaller $k$, we do not exclude them from the search, because in the case $\zeta_0<\zeta_{\min}$, having a larger denominator for $\zeta$ might allow the gap $\zeta-\zeta_0$ to be smaller.
This loop over $k$ should be stopped when we arrive at a point where the smallest integer $m$ in $2^kI$ already fails to satisfy
$$mn_{\max}+2^k\zeta_0\leq N_{\max}.$$
If we have exhausted all $k$'s satisfying the above, then we conclude that there is no $\zeta\in[\zeta_{\min},\zeta_{\max})$ yielding an admissible choice of $(k,m,s)$, so we should move on to the next subinterval.
After filling out some omitted details, we arrive at the following algorithm.
Algorithm 7 (Finding feasible values of ξ and ζ).
Input: $x\in\mathbb{R}$, $n_{\max}\in\mathbb{Z}_{>0}$, $N_{\max}\in\mathbb{Z}_{>0}$.
Output: $k$, $m$, and $s$, where $\xi=\frac{m}{2^k}$ and $\zeta=\frac{s}{2^k}$, such that we have $\lfloor nx\rfloor=\left\lfloor\frac{nm+s}{2^k}\right\rfloor$ for all $n=1,\cdots,n_{\max}$.
1. Find the largest $n=1,\cdots,n_{\max}$ that maximizes $\frac{\lfloor nx\rfloor}{n}$ and call it $n_{L,0}$.
2. Find the smallest $n=1,\cdots,n_{\max}$ that minimizes $\frac{\lfloor nx\rfloor+1}{n}$ and call it $n_{U,0}$.
3. Set $\zeta_{\max}\leftarrow 0$, $\zeta_{L,\max}\leftarrow 0$, $\zeta_{U,\max}\leftarrow 0$, $n_{L,1}\leftarrow 0$, $n_{U,1}\leftarrow 0$.
4. Check if $\zeta_{\max}=\zeta_{L,\max}$. If that is the case, then we have to update $\zeta_{L,\max}$. Set $n_{L,0}\leftarrow n_{L,0}+n_{L,1}$. If $n_{L,0}=n_{\max}$, then set $\zeta_{L,\max}\leftarrow 1$. Otherwise, find the largest $n=1,\cdots,n_{\max}-n_{L,0}$ that maximizes $\frac{\lfloor nx\rfloor}{n}$, assign it to $n_{L,1}$, and set $\zeta_{L,\max}\leftarrow\min\left(\frac{n_{L,1}\lfloor n_{L,0}x\rfloor-n_{L,0}\lfloor n_{L,1}x\rfloor}{n_{L,1}},\,1\right)$.
5. Check if $\zeta_{\max}=\zeta_{U,\max}$. If that is the case, then we have to update $\zeta_{U,\max}$. Set $n_{U,0}\leftarrow n_{U,0}-n_{U,1}$. If $n_{U,0}=1$, then set $\zeta_{U,\max}\leftarrow 1$. Otherwise, find the largest $n=1,\cdots,n_{U,0}-1$ that maximizes $\frac{\lfloor nx\rfloor}{n}$, assign it to $n_{U,1}$, and set $\zeta_{U,\max}\leftarrow\min\left(\frac{n_{U,1}(\lfloor n_{U,0}x\rfloor+1)-n_{U,0}\lfloor n_{U,1}x\rfloor}{n_{U,1}},\,1\right)$.
6. Set $\zeta_{\min}\leftarrow\zeta_{\max}$, $\zeta_{\max}\leftarrow\min(\zeta_{L,\max},\zeta_{U,\max})$, and $\Delta\leftarrow\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_{\min}}{n_{U,0}}-\frac{\lfloor n_{L,0}x\rfloor-\zeta_{\max}}{n_{L,0}}$. If $\Delta\leq 0$, then the search interval $I$ is empty; in this case, go to Step 16. Otherwise, set $k\leftarrow\lfloor\log_2\frac{1}{\Delta}\rfloor$ and proceed to the next step.
7. Compute $m\leftarrow\left\lfloor\frac{2^k(\lfloor n_{L,0}x\rfloor-\zeta_{\max})}{n_{L,0}}\right\rfloor+1$ and inspect the inequality $m<\frac{2^k(\lfloor n_{U,0}x\rfloor+1-\zeta_{\min})}{n_{U,0}}$. If it does not hold, then set $k\leftarrow k+1$ and recompute $m$ accordingly. Otherwise, set $k\leftarrow k-b$ where $b$ is the greatest integer such that $2^b$ divides $m$, and set $m\leftarrow m/2^b$.
8. Set $\zeta_0\leftarrow\max\left(\frac{2^k\lfloor n_{L,0}x\rfloor-n_{L,0}m}{2^k},\,0\right)$ and inspect the inequality $mn_{\max}+2^k\zeta_0\leq N_{\max}$. If it does not hold, then every candidate we can further find in the interval $[\zeta_{\min},\zeta_{\max})$ consists of too-large numbers, so we move on to the next subinterval: go to Step 16. Otherwise, proceed to the next step.
9. If $\zeta_0\geq\zeta_{\min}$, then $\zeta_0\in[\zeta_{\min},\zeta_{\max})$ and $mn_{\max}+2^k\zeta_0\leq N_{\max}$ hold, so it is enough to check the inequality $\frac{m}{2^k}<\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_0}{n_{U,0}}$. If this holds, then we have found an admissible choice of $(k,m,s)$, so return. Otherwise, we conclude that $\xi=\frac{m}{2^k}$ does not yield an admissible answer; in this case, go to Step 13.
10. If $\zeta_0<\zeta_{\min}$, then set $a\leftarrow\lceil 2^k(\zeta_{\min}-\zeta_0)\rceil$ and $\zeta\leftarrow\zeta_0+\frac{a}{2^k}$ so that $\zeta\geq\zeta_{\min}$ is satisfied. Check if $\zeta<\zeta_{\max}$ holds, and if that is not the case, go to Step 13.
11. Check if $mn_{\max}+2^k\zeta\leq N_{\max}$ holds. If that is not the case, go to Step 13.
12. Inspect the inequality $\frac{m}{2^k}<\frac{\lfloor n_{U,0}x\rfloor+1-\zeta}{n_{U,0}}$. If it does hold, then we have found an admissible choice of $(k,m,s)$, so return. Otherwise, proceed to the next step.
13. Set $m\leftarrow m+1$ and inspect the inequality $\frac{m}{2^k}<\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_{\min}}{n_{U,0}}$. If it does not hold, go to Step 15.
14. If it does hold, then set $\zeta_0\leftarrow\max\left(\frac{2^k\lfloor n_{L,0}x\rfloor-n_{L,0}m}{2^k},\,0\right)$ and inspect the inequality $mn_{\max}+2^k\zeta_0\leq N_{\max}$. If it does hold, go back to Step 9. Otherwise, proceed to the next step.
15. Set $k\leftarrow k+1$ and $m\leftarrow\left\lfloor\frac{2^k(\lfloor n_{L,0}x\rfloor-\zeta_{\max})}{n_{L,0}}\right\rfloor+1$, and go to Step 8.
16. We have failed to find any admissible choice of $(k,m,s)$ so far, so we have to move on to the next subinterval. If $\zeta_{\max}=1$, then we have already exhausted the whole interval, so return with FAIL. Otherwise, go back to Step 4.
Remarks.
Whenever we update $\zeta_{L,\max}$ and $\zeta_{U,\max}$, the updated values must always be strictly bigger than the previous values, unless they have already reached their highest possible value $1$. Let us see the case of $\zeta_{L,\max}$; the case of $\zeta_{U,\max}$ is similar. Suppose that initially we had $n_{L,0}=n_0$, $n_{L,1}=n_1$, and after recomputing $n_{L,1}$ we got $n_{L,1}=n_2$. Then the new and the old values of $\zeta_{L,\max}$ are
$$\frac{n_2\lfloor(n_0+n_1)x\rfloor-(n_0+n_1)\lfloor n_2x\rfloor}{n_2},\qquad\frac{n_1\lfloor n_0x\rfloor-n_0\lfloor n_1x\rfloor}{n_1},$$
respectively. Applying (25), it turns out that the first one minus the second one is equal to
$$(n_0+n_1)\left(\frac{\lfloor n_1x\rfloor}{n_1}-\frac{\lfloor n_2x\rfloor}{n_2}\right).$$
Now, recall the way $n_1$ is chosen: we first find the best rational approximation of $x$ from below in the range $\{1,\cdots,n_{\max}-n_0\}$, call its denominator $q_*$, and set $n_1$ to be the largest multiple of $q_*$ in that range. Since $n_1$ is the largest such multiple, it follows that $n_{\max}-n_0-n_1$ is strictly smaller than $q_*$. Therefore, the best rational approximation in the range $\{1,\cdots,n_{\max}-n_0-n_1\}$ is strictly worse than what $q_*$ gives. This shows that the difference above is strictly positive.
When there indeed exists at least one admissible choice of $(k,m,s)$ given $x$, $n_{\max}$, and $N_{\max}$, Algorithm 7 finds the triple $(k,m,s)$ with the smallest $k$, then $m$, and then $s$, within the first subinterval $[\zeta_{\min},\zeta_{\max})$ where a solution can be found. To make it always find the $(k,m,s)$ with the smallest $k$, $m$, and $s$ among all solutions, we can simply modify the algorithm a little so that it does not immediately stop when it finds a solution, but rather continues to the other subintervals and then compares any solutions found there with the previous best one.
An actual implementation of this algorithm can be found here. (Click here to test it live.)
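Independently of how a candidate triple is produced, its correctness is cheap to verify after the fact. Here is a minimal exhaustive checker for rational $x=p/q$ (my own sketch, only practical for moderate $n_{\max}$):

```cpp
#include <cstdint>

// Checks floor(n*p/q) == (n*m + s) >> k for all n = 1, ..., nmax.
// Exhaustive, so only suitable as a validation pass.
bool check_triple(std::uint64_t p, std::uint64_t q, std::uint64_t nmax,
                  int k, std::uint64_t m, std::uint64_t s) {
    for (std::uint64_t n = 1; n <= nmax; ++n) {
        unsigned __int128 approx = ((unsigned __int128)n * m + s) >> k;
        if (approx != (unsigned __int128)n * p / q) return false;
    }
    return true;
}

// Example (from the last section of this post): with x = 7/18 and
// nmax = 2^32 - 1, the triple (k, m, s) = (33, 3340530119, 477218588)
// should pass: check_triple(7, 18, 0xFFFFFFFF, 33, 3340530119, 477218588).
```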
Results by Lemire et al.
Consider the special case when $x=\frac{1}{q}$, $q\leq n_{\max}$, and $\xi=\zeta$. In this case, we have the following result:
Theorem 8 (Lemire et al., 2021).
For any positive integers $q$ and $n_{\max}$ with $q\leq n_{\max}$, we have
$$\left\lfloor\frac{n}{q}\right\rfloor=\lfloor(n+1)\xi\rfloor$$
for all $n=0,\cdots,n_{\max}$ if and only if
$$\left(1-\frac{1}{\lfloor n_{\max}/q\rfloor q+1}\right)\frac{1}{q}\leq\xi<\frac{1}{q}.$$
We can give an alternative proof of this fact using what we have developed so far. Essentially, this is due to the fact that Algorithm 5 finishes its iteration at the first step and does not proceed further.
Proof. ($\Rightarrow$) Take $n=q-1$; then we have $0=\lfloor q\xi\rfloor$, thus $\xi<\frac{1}{q}$ follows. On the other hand, take $n=\lfloor\frac{n_{\max}}{q}\rfloor q$; then we have
$$\left\lfloor\frac{n_{\max}}{q}\right\rfloor\leq\left(\left\lfloor\frac{n_{\max}}{q}\right\rfloor q+1\right)\xi,$$
so rearranging this gives the desired lower bound on $\xi$.

($\Leftarrow$) It is enough to show that $\xi$ satisfies
$$\max_{n=1,\cdots,n_{\max}}\frac{\lfloor n/q\rfloor-\xi}{n}\leq\xi<\min_{n=1,\cdots,n_{\max}}\frac{\lfloor n/q\rfloor+1-\xi}{n}.$$
Following Algorithm 5, let $n_0$ be the largest maximizer of $\frac{\lfloor n/q\rfloor}{n}$, i.e., $n_0=\lfloor\frac{n_{\max}}{q}\rfloor q$. Then we know from Algorithm 5 that $n_0$ is the largest maximizer of $\frac{\lfloor n/q\rfloor-\xi}{n}$ if and only if
$$\frac{\lfloor n/q\rfloor}{n}<\frac{\lfloor n_0/q\rfloor-\xi}{n_0}=\frac{1}{q}-\frac{\xi}{\lfloor n_{\max}/q\rfloor q}$$
holds for all $n=1,\cdots,n_{\max}-n_0$. Pick any such $n$ and let $r$ be the remainder of $n/q$; then we have
$$\frac{\lfloor n/q\rfloor}{n}=\frac{(n-r)/q}{n}=\frac{1}{q}-\frac{r}{nq}.$$
Hence, we claim that
$$\frac{\xi}{\lfloor n_{\max}/q\rfloor}<\frac{r}{n}$$
holds for all such $n$. This is clear, because $\xi<\frac{1}{q}$ while $n_{\max}-n_0$ is at most $q-1$, so the right-hand side is at least $\frac{1}{q-1}$. Therefore, $n_0$ is indeed the largest maximizer of $\frac{\lfloor n/q\rfloor-\xi}{n}$ for $n=1,\cdots,n_{\max}$, and the inequality
$$\frac{\lfloor n_0/q\rfloor-\xi}{n_0}\leq\xi$$
is equivalent to
$$\xi\geq\frac{\lfloor n_0/q\rfloor}{n_0+1}=\frac{1}{q}\left(1-\frac{1}{n_0+1}\right),$$
which we already know.

For the upper bound, note that it is enough to show that
$$\frac{1}{q}\leq\frac{\lfloor n/q\rfloor+1}{n+1}$$
holds for all $n=1,\cdots,n_{\max}$. We can rewrite this inequality as
$$\frac{n}{q}-\left\lfloor\frac{n}{q}\right\rfloor\leq\frac{q-1}{q},$$
which says nothing but that the remainder of $n$ divided by $q$ is at most $q-1$, which is trivially true. ◼
Note that the fact that the numerator of $x$ is $1$ is crucially used in this proof.
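Since Theorem 8 pins down the admissible range of $\xi$ completely, finding the smallest workable $k$ and $m$ is again mechanical. Here is a small helper of my own (the name `find_magic_lemire` is made up; exact arithmetic via the `unsigned __int128` extension, assuming $n_{\max}<2^{32}$ so the intermediate products fit):

```cpp
#include <cstdint>
#include <cstdio>

// Per Theorem 8, find the smallest k and corresponding m such that
// (1 - 1/u)/q <= m/2^k < 1/q with u = floor(nmax/q)*q + 1; then
// floor(n/q) == ((n + 1) * m) >> k holds for all n = 0, ..., nmax.
void find_magic_lemire(std::uint64_t q, std::uint64_t nmax) {
    const std::uint64_t u = (nmax / q) * q + 1;
    for (int k = 1; k < 96; ++k) {
        const unsigned __int128 pow2 = (unsigned __int128)1 << k;
        // Smallest m above the lower bound: ceil(2^k * (u - 1) / (u * q)).
        const unsigned __int128 den = (unsigned __int128)u * q;
        const unsigned __int128 m = (pow2 * (u - 1) + den - 1) / den;
        if (m * q < pow2) { // strict upper bound: m / 2^k < 1 / q
            std::printf("q = %llu: k = %d, m = %llu\n",
                        (unsigned long long)q, k, (unsigned long long)m);
            return;
        }
    }
}

// For instance, q = 7 and nmax = 2^32 - 1 yield k = 33, m = 1227133513,
// and (n + 1) * m stays below 2^64 for all 32-bit n.
int main() { find_magic_lemire(7, 0xFFFFFFFFull); }
```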
Lemire et al. also showed that, if $n_{\max}=2^N-1$ and $q$ is not a power of $2$, then whenever the best magic constant predicted by Theorem 2 does not fit into a word, the best magic constant predicted by the above theorem does fit into a word. In fact, those assumptions about being a power of $2$ or not can be removed, as shown below.
Theorem 9 (Improves Lemire et al.)
Let $x$ be a positive real number and $u,v$ be positive integers. Let $k$ be the smallest integer such that the set
$$\left[2^kx,\;2^kx\left(1+\frac{1}{v}\right)\right)\cap\mathbb{Z}$$
is not empty. If $2^kx\geq\max(u,v)$, then the set
$$\left[2^{k-1}x\left(1-\frac{1}{u}\right),\;2^{k-1}x\right)\cap\mathbb{Z}$$
must be nonempty as well.

Thus, if we let $x=\frac{1}{q}$, let $v$ be the $v$ in Theorem 2, and set $u:=\lfloor\frac{n_{\max}}{q}\rfloor q+1$, then whenever the best magic constant $m=\lceil\frac{2^k}{q}\rceil$ happens to be greater than or equal to $n_{\max}$, we always have
$$\left\lceil\frac{2^{k-1}}{q}\left(1-\frac{1}{u}\right)\right\rceil<\frac{2^{k-1}}{q}.$$
Recall that we have seen in a previous section that $\frac{2^k}{q}<2v$ always holds, so the right-hand side of the above is bounded by $v\leq n_{\max}$; so indeed this shows that the best magic constant predicted by Theorem 8 is bounded by $n_{\max}$ in this case.
Proof. By the definition of $k$, we must have
$$\left[2^{k-1}x,\;2^{k-1}x\left(1+\frac{1}{v}\right)\right)\cap\mathbb{Z}=\emptyset.$$
Let
$$\alpha:=\lceil 2^{k-1}x\rceil-2^{k-1}x\in[0,1);$$
then we must have
$$2^{k-1}x+\alpha\geq 2^{k-1}x\left(1+\frac{1}{v}\right),$$
thus
$$\alpha\geq\frac{2^{k-1}x}{v}.$$
Now, note that
$$2^{k-1}x\left(1-\frac{1}{u}\right)=\lceil 2^{k-1}x\rceil-\alpha-\frac{2^{k-1}x}{u},$$
so it is enough to show that
$$\alpha+\frac{2^{k-1}x}{u}\geq 1.$$
From the bound on $\alpha$ obtained above, we know
$$\alpha+\frac{2^{k-1}x}{u}\geq\frac{2^{k-1}x}{v}+\frac{2^{k-1}x}{u}\geq\frac{2^kx}{\max(u,v)},$$
and the right-hand side is bounded below by $1$ by the assumption, so we are done. ◼
Therefore, Theorem 2 and Theorem 8 are enough to cover all cases when $x=\frac{1}{q}$ with $q\leq n_{\max}$. Then does Algorithm 7 have any relevance in practice?
An example usage of Algorithm 7
I do not know; I mean, as I pointed out before, I am not sure how important it is in general to cover the case $x=\frac{p}{q}$ with $p\neq 1$. But anyway, when $p\neq 1$, there certainly are some cases where Algorithm 7 might be useful. Consider the following example:
std::uint32_t div(std::uint32_t n) {
return (std::uint64_t(n) * 7) / 18;
}
Here I provide three different ways of doing this computation:
div1(unsigned int):
mov ecx, edi
lea rax, [8*rcx]
sub rax, rcx
movabs rcx, 1024819115206086201
mul rcx
mov rax, rdx
ret
div2(unsigned int):
mov eax, edi
movabs rcx, 7173733806472429568
mul rcx
mov rax, rdx
ret
div3(unsigned int):
mov ecx, edi
mov eax, 3340530119
imul rax, rcx
add rax, 477218588
shr rax, 33
ret
The first version (`div1`) is what my compiler (clang) generated. It first computes `n * 7` by computing `n * 8 - n`. And then, it computes the division by 18 using the multiply-and-shift technique. In particular, it deliberately chose the shifting amount to be 64 so that it can skip the shifting. Clang is indeed quite clever in the sense that it actually leverages the fact that `n` is only a 32-bit integer, because the minimum amount of shifting required for the division by 18 is 68 if all 64-bit dividends are considered.
But it is not clever enough to realize that the two operations `* 7` and `/ 18` can be merged. The second version (`div2`) is what one can get by utilizing this fact and doing these two computations in one multiply-and-shift. Using Theorem 2, it can be shown that the minimum amount of shifting for doing this with all 32-bit dividends is 36, and in that case the corresponding magic number is 26724240953. Since the computation cannot be done in 64 bits anyway, I deliberately multiplied this magic number by $2^{28}$ and got 7173733806472429568 to make the shifting amount equal to 64, so that the actual shifting instruction is not needed. And the result is a clear win over `div1`, although `lea` and `sub` are just trivial instructions and the difference is not huge.
Now, if we do this with multiply-add-and-shift, we can indeed complete our computation inside 64 bits, which is the third version (`div3`). Using Algorithm 7, it turns out that $(k,m,s)=(33,3340530119,477218588)$ works for all 32-bit dividends, i.e., we have
$$\left\lfloor\frac{7n}{18}\right\rfloor=\left\lfloor\frac{3340530119\,n+477218588}{2^{33}}\right\rfloor$$
for all $n=0,\cdots,2^{32}-1$. The maximum possible value of the numerator $3340530119\cdot n+477218588$ is strictly less than $2^{64}$ in this case. Hence, we get `div3`.
It is hard to tell which one of `div2` and `div3` will be faster. Note that `div3` has one more instruction than `div2`, but it does not invoke the 128-bit multiplication (the `mul` instruction found in both `div1` and `div2`). On typical modern x86-64 CPUs, a 128-bit multiplication uses twice as many computational resources as the 64-bit one, so it might more easily become a bottleneck if there are many multiplications to be done. Also, as far as I know, Intel CPUs still do not provide SIMD instructions for 128-bit multiplication, so `div3` could be more aggressively vectorized, which may result in a massive speedup. Still, all of this is just speculation, and of course when it comes to performance there is nothing we can be sure about. Anyway, it is good to have multiple options we can try out.