On the optimal bounds for integer division by constants

71 minute read

Published:

It is well-known that integer division is quite a heavy operation on modern CPUs - so slow, in fact, that it has become common wisdom to avoid it at all costs in performance-critical sections of a program. I do not know exactly why division is particularly hard to optimize from the hardware perspective. My guess is that (1) every general algorithm is essentially just a minor variation of the good old long division, and (2) long division is almost impossible to parallelize. But whatever, that is not the topic of this post.

Rather, this post is about some common optimization techniques for circumventing integer division. To be more precise, the post can be roughly divided into two parts. The first part discusses the well-known Granlund-Montgomery style multiply-and-shift technique and some associated issues. The second part is about my recent research on a related multiply-add-and-shift technique, where I establish an optimal bound which is perhaps a novel result. It would be intriguing to see whether modern compilers can take advantage of this new bound.

Note that by integer division, we specifically mean the computation of the quotient and/or the remainder, rather than evaluating the result as a real number. More specifically, throughout this entire post, integer division will always mean taking the quotient unless specified otherwise. Also, I will confine myself to divisions of positive integers, even though divisions of negative integers hold practical significance as well. Finally, all assembly code provided herein is for the x86-64 architecture.

Turning an integer division into a multiply-and-shift

One of the most widely used techniques is converting a division into a multiplication when the divisor is a known constant (or remains mostly unchanged). The idea behind this approach is quite simple. For instance, dividing by $4$ is equivalent to multiplying by $0.25$, which can be further represented as multiplying by $25$ and then dividing by $100$, where dividing by $100$ is simply a matter of moving the decimal dot to the left by two positions. Since we are only interested in taking the quotient, this means throwing away the last two digits.

In this particular instance, $4$ is a divisor of $100$, so we indeed have a fairly concise such representation, but in general the divisor might not divide a power of $10$. Let us take $7$ as an example. In this case, we cannot write $\frac{1}{7}$ as $\frac{m}{10^k}$ for some positive integers $m,k$. However, we can still come up with a good approximation. Note that

$$\frac{1}{7} = 0.142857142857\cdots,\tag{1}$$

so presumably something like $\frac{142858}{1000000}$ would be a good enough approximation of $\frac{1}{7}$. Taking that as our approximation, we may want to compute $n/7$ by multiplying $142858$ to $n$ and then throwing away the last $6$ digits. (Note that we are taking $142858$ instead of $142857$ because the latter already fails when $n=7$. In general, we must take the ceiling, not the floor nor half-up rounding.)

This indeed gives the right answer for all $n=1,\cdots,166668$, but it starts to produce a wrong answer at $n=166669$, where the correct answer is $23809$, whereas our method produces $23810$. Of course such a failure is expected: since we are using an approximation with nonzero error, it is inevitable that the error will eventually manifest as the dividend $n$ grows large enough. But the question at hand is: can we estimate how far it will go? Or, how do we choose a good enough approximation guaranteed to work correctly when there is a given limit on how big our $n$ can be?
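This claim is easy to double-check with a small brute force. The following sketch (the function name is mine) just implements the multiply-then-drop-digits recipe described above:

```cpp
#include <cassert>
#include <cstdint>

// Approximate n/7 by multiplying by 142858 and then dropping the
// last 6 decimal digits, exactly as described above.
std::uint64_t div7_approx(std::uint64_t n) {
    return (n * 142858) / 1000000;
}
```

Looping over $n$ confirms that the first mismatch against the true quotient indeed occurs at $n = 166669$.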

Obviously, our intention is to implement this concept in a computer program, so the denominators of the approximations will be powers of $2$, not $10$. So, for a given positive integer $d$, our goal is to find a good enough approximation of $\frac{1}{d}$ of the form $\frac{m}{2^k}$. Perhaps one of the most widely known formal results in this realm is the following theorem by Granlund-Montgomery:

Theorem 1 (Granlund-Montgomery, 1994).

Suppose $m$, $d$, $k$ are nonnegative integers such that $d\neq 0$ and

$$2^{N+k} \leq md \leq 2^{N+k} + 2^k.\tag{2}$$

Then $\left\lfloor n/d\right\rfloor = \left\lfloor mn/2^{N+k}\right\rfloor$ for every integer $n$ with $0\leq n<2^N$.

Here, $d$ is the given divisor and we are supposed to approximate $\frac{1}{d}$ by $\frac{m}{2^{N+k}}$. An assumption here is that we want to perform the division $n/d$ for all $n$ from $0$ to $2^N-1$, where $N$ is supposed to be the bit-width of the integer data type under consideration. In this setting, the theorem establishes a sufficient condition under which we can calculate the quotient of $n/d$ by first multiplying $n$ by $m$ and subsequently right-shifting the result by $(N+k)$-bits. Note that we will need more than $N$-bits; since our dividend is of $N$-bits, the result of the multiplication $mn$ will need to be stored in $2N$-bits. Since we are shifting by $(N+k)$-bits, the lower half of the result is actually not needed, so we just take the upper half and shift it by $k$-bits.

We will not talk about the proof of this theorem because a much more general theorem will be presented in the next section. But let us see how results like this are being applied in the wild. For example, consider the following C++ code:

std::uint64_t div(std::uint64_t n) noexcept {
    return n / 17;
}

Modern compilers are well-aware of these Granlund-Montgomery-style division tricks. Indeed, my compiler (clang) adeptly harnessed such tactics, as demonstrated by its translation of the code above into the following lines of assembly instructions:

div(unsigned long):
        mov     rax, rdi
        movabs  rcx, -1085102592571150095
        mul     rcx
        mov     rax, rdx
        shr     rax, 4
        ret

(Check it out!)

Note that we have $N=64$ and $d=17$ in this case, and the magic constant $-1085102592571150095$ you can see in the second line, interpreted as the unsigned value $17361641481138401521$, is the $m$ appearing in the theorem. You can also see in the fifth line that $k$ in this case is equal to $4$. What the assembly code does is as follows:

  • Load the $64$-bit input $n$ into the $64$-bit rax register.
  • Load the $64$-bit constant $17361641481138401521$ into the $64$-bit rcx register.
  • Multiply them together; the result is of $128$-bits. The lower half of this is stored back into rax, and the upper half is stored in the rdx register. But we do not care about the lower half, so throw it away and copy the upper half back into rax.
  • Shift the result to the right by $4$-bits. The result must be equal to $\lfloor n/17\rfloor$.
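The steps above can be mirrored in C++. This is a sketch of what the emitted assembly computes (not code from the post), and it leans on the `__uint128_t` extension of GCC/clang, which is fine for the x86-64 setting assumed here:

```cpp
#include <cassert>
#include <cstdint>

// Divide by 17 via multiply-and-shift: take the upper 64 bits of the
// 128-bit product n * m, then shift right by k = 4 more bits.
std::uint64_t div17(std::uint64_t n) {
    constexpr std::uint64_t m = 17361641481138401521u;
    std::uint64_t upper = static_cast<std::uint64_t>(
        (static_cast<__uint128_t>(n) * m) >> 64);
    return upper >> 4;
}
```

By Theorem 1, this agrees with n / 17 for every 64-bit n.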

We can easily check that these $m$ and $k$ chosen by the compiler indeed satisfy the inequalities in the theorem:

$$295147905179352825856 = 2^{64+4} \leq md = 17361641481138401521\cdot 17 = 295147905179352825857 \leq 2^{64+4} + 2^4 = 295147905179352825872.$$

We have not yet seen how to actually find $m$ and $k$ using a result like the theorem above, but before asking such a question, let us instead ask: "is this condition on $m$ and $k$ the best possible one?"

And the answer is: No.

Here is an example: take $N=32$ and $d=102807$. In this case, the smallest $k$ that allows an integer $m$ satisfying

$$2^{N+k} \leq md \leq 2^{N+k} + 2^k\tag{3}$$

to exist is $k=17$, and in that case the unique $m$ satisfying the above is $5475793997$. This is kind of unfortunate, because the magic constant $m=5475793997$ is of $33$-bits, so the computation of $nm$ cannot be done inside $64$-bits. However, it turns out that we can take $k=16$ and $m=2737896999$, and still the equality $\left\lfloor\frac{n}{d}\right\rfloor = \left\lfloor\frac{nm}{2^{N+k}}\right\rfloor$ holds for all $n=0,\cdots,2^N-1$, although the above inequality is not satisfied in this case. Now, the new constant $2737896999$ is of $32$-bits, so we can do our computation inside $64$-bits. This might result in a massive difference in practice!

It seems that even the most recent version of GCC (13.2) is still not aware of this, while clang knows that the above $m$ and $k$ work. (In the link provided, GCC is actually trying to compute $\left\lfloor\frac{nm}{2^{N+k}}\right\rfloor$ with the $33$-bit constant $m=5475793997$. A detailed explanation will be given in a later section.)

Then what is the best possible condition? I am not sure who was the first to find the optimal bound, but such a bound is at least written in the famous book Hacker's Delight by H. S. Warren Jr. Also, recently (in 2021), Lemire et al. showed the optimality of an equivalent bound. (EDIT: according to a Reddit user, there was a report written by H. S. Warren Jr. in 1992, even earlier than Granlund-Montgomery, which contained the optimal bound with the proof of optimality.)

I will not write down the optimal bounds obtained by these authors, because I will present a more general result proved by myself in the next section.

Here is one remark before getting into the next section. The aforementioned results on the optimal bound work for $n$ from the range $\{1,\cdots,n_{\max}\}$ where $n_{\max}$ is not necessarily of the form $2^N-1$. However, even recent compilers do not seem to leverage this fact. For example, let us look at the following code:

std::uint64_t div(std::uint64_t n) {
    [[assume(n < 10000000000)]];
    return n / 10;
}

Here we are relying on a new language feature added in C++23: assume. GCC generated the following lines of assembly:

div(unsigned long):
        movabs  rax, -3689348814741910323
        mul     rdi
        mov     rax, rdx
        shr     rax, 3
        ret

(Check it out!)

My gripe with this is the generation of a superfluous shr instruction. GCC seems to think that $k$ must be at least $67$ (which is why it shifted by $3$-bits, after throwing away the $64$-bit lower half), but actually $k=64$ is fine with the magic number $m=1844674407370955162$ thanks to the bound on $n$, in which case we do not need this additional shifting.
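Here is a sketch of the code I would have liked the compiler to generate, assuming (as the hint promises) that the caller guarantees n < 10^10; `__uint128_t` is again the GCC/clang extension:

```cpp
#include <cassert>
#include <cstdint>

// With n < 10^10 guaranteed, k = 64 works with m = ceil(2^64 / 10), so
// the quotient is just the upper 64 bits of n * m; no trailing shift.
std::uint64_t div10_bounded(std::uint64_t n) {
    constexpr std::uint64_t m = 1844674407370955162u;
    return static_cast<std::uint64_t>((static_cast<__uint128_t>(n) * m) >> 64);
}
```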

How about clang? Unlike GCC, clang currently does not seem to understand this new language feature assume. But it has an equivalent language extension __builtin_assume, so I tried with that:

std::uint64_t div(std::uint64_t n) {
    __builtin_assume(n < 10000000000);
    return n / 10;
}
div(unsigned long):
        mov     rax, rdi
        movabs  rcx, -3689348814741910323
        mul     rcx
        mov     rax, rdx
        shr     rax, 3
        ret

(Check it out!)

And there is no big difference😥

Turning multiplication by a real number into a multiply-and-shift

Actually, during the development of Dragonbox, I was interested in the more general problem of multiplying a rational number by $n$ and then finding the integer part of the resulting rational number. In other words, my problem was not just about division, but rather about a multiplication followed by a division. The presence of the multiplier certainly makes the situation a little more tricky, but it is still possible to derive the optimal bound in a similar way, which leads to the following generalization of the results mentioned in the previous section. (Disclaimer: I am definitely not claiming to be the first to prove this, and I am sure an equivalent result can be found elsewhere, though I am not aware of any.)

Theorem 2 (From this paper).

Let $x$ be a real number and $n_{\max}$ a positive integer. Then for a real number $\xi$, we have the following.

  1. If $x=\frac{p}{q}$ is a rational number with $q\leq n_{\max}$, then we have $\lfloor nx\rfloor = \lfloor n\xi\rfloor$ for all $n=1,\cdots,n_{\max}$ if and only if $x \leq \xi < x + \frac{1}{vq}$ holds, where $v$ is the greatest integer such that $vp\equiv -1\ (\mathrm{mod}\ q)$ and $v\leq n_{\max}$.

  2. If $x$ is either irrational or a rational number with denominator strictly greater than $n_{\max}$, then we have $\lfloor nx\rfloor = \lfloor n\xi\rfloor$ for all $n=1,\cdots,n_{\max}$ if and only if $\frac{p_{*}}{q_{*}} \leq \xi < \frac{p^{*}}{q^{*}}$ holds, where $\frac{p_{*}}{q_{*}}$, $\frac{p^{*}}{q^{*}}$ are the best rational approximations of $x$ from below and above, respectively, with the largest denominators $q_{*},q^{*}\leq n_{\max}$.

Note that $\lfloor nx\rfloor$ is supposed to be the quantity we actually want to compute, while $\xi$ is supposed to be the chosen approximation of $x$. For the special case $x=\frac{1}{d}$, $n_{\max}=2^N-1$, and $\xi=\frac{m}{2^{N+k}}$, we obtain the setting of Granlund-Montgomery.

I said rational number in the beginning of this section, but actually our $x$ can be any real number. Note, however, that the result depends on whether $x$ is "effectively rational" or not over the domain $n=1,\cdots,n_{\max}$, i.e., whether or not there exists a multiplier $n$ which makes $nx$ into an integer.

Since I was working on floating-point conversion problems when I derived this theorem, the more relevant case for me was the second one, that is, when $x$ is "effectively irrational", because the numerator and the denominator of the $x$ I was considering were some high powers of $2$ and $5$. But the first case is more relevant to the main theme of this post, i.e., integer division, so let us forget about jargon like best rational approximations for now. (Spoiler: it will show up again in the next section.)

So let us focus on the first case. First of all, note that if $p=1$, that is, when $x=\frac{1}{q}$, then $v$ has a simpler description: it is the last multiple of $q$ in the range $1,\cdots,n_{\max}+1$, minus one. If you care, you can check the aforementioned paper by Lemire et al. to see that their Theorem 1 exactly corresponds to the resulting bound. In fact, in this special case it is rather easy to see why the best bound should be something like that.

Indeed, note that having the equality

$$\lfloor nx\rfloor = \lfloor n\xi\rfloor\tag{4}$$

for all $n=1,\cdots,n_{\max}$ is equivalent to having the inequality

$$\frac{\lfloor nx\rfloor}{n} \leq \xi < \frac{\lfloor nx\rfloor + 1}{n}\tag{5}$$

for all such $n$. Hence, it is enough to find the largest possible value of the left-hand side and the smallest possible value of the right-hand side. Since we are assuming that the denominator of $x$ is bounded by $n_{\max}$, obviously the maximum value of the left-hand side is just $x$. Thus, it is enough to find the minimum value of the right-hand side. Note that we can write

$$n = q\left\lfloor \frac{n}{q}\right\rfloor + r = q\lfloor nx\rfloor + r\tag{6}$$

where $r$ is the remainder of the division $n/q$. Rewriting $\lfloor nx\rfloor$ in terms of $n$ and $r$ using the above identity, we get

$$\frac{\lfloor nx\rfloor + 1}{n} = \frac{(n-r)/q + 1}{n} = \frac{n + (q-r)}{qn} = \frac{1}{q} + \frac{q-r}{qn}.\tag{7}$$

Therefore, the minimization of the right-hand side is equivalent to the minimization of $\frac{q-r}{qn}$. Intuitively, it appears reasonable to believe that the minimizer $n$ must have the largest possible remainder $r=q-1$: for example, if $r$ were set to be $q-2$ instead, then the numerator would get doubled, necessitating a proportionally larger $n$ to achieve a diminished value of $\frac{q-r}{qn}$. Also, among $n$'s with $r=q-1$, obviously the largest $n$ yields the smallest value of $\frac{q-r}{qn}$, so it sounds rational to say that probably the greatest $n$ with $r=q-1$ is the minimizer of $\frac{q-r}{qn}$. Indeed, this is quite easy to prove: call that greatest $n$ as $v$, and suppose that there is some $n$ which is at least as good as $v$:

$$\frac{q-r}{n} \leq \frac{1}{v},\quad\text{i.e.,}\quad (q-r)v \leq n.\tag{8}$$

Now, since $v$ divided by $q$ has the remainder $q-1$, the two sides of the second inequality have the same remainder when divided by $q$. Therefore, the difference between the two must be either zero or at least $q$. But since $v$ is the largest integer in the range with the remainder $q-1$, it must be at least $n_{\max}-q+1$, thus $n$ cannot be larger than $(q-r)v$ by more than $q-1$. Thus, the only possibility is $n=v$.

When $x=\frac{p}{q}$ and $p\neq 1$, the remainder $r$ of $np/q$ depends pretty randomly on $n$, so it is somewhat harder to be intuitively convinced that the minimizer of $\frac{\lfloor nx\rfloor+1}{n}$ should still have $r=q-1$. But the same logic as above still works with a little bit of tweaking in this case. The full proof can be found in the paper mentioned, or in one of my previous posts.

Some applications

Here I collected some applications of the presented theorem.

Finding the first error case

In the beginning of the previous section, I claimed that

$$\left\lfloor \frac{n}{7}\right\rfloor = \left\lfloor \frac{142858n}{1000000}\right\rfloor\tag{9}$$

holds for all $n=1,\cdots,166668$ but not for $n=166669$. Now we can see how I got this. Note that $166669\equiv 6\ (\mathrm{mod}\ 7)$, so the range $\{1,\cdots,166668\}$ and the range $\{1,\cdots,166669\}$ have different $v$'s: it is $166662$ for the former, while it is $166669$ for the latter. And this makes the difference, because the inequality

$$\frac{142858}{1000000} < \frac{1}{7} + \frac{1}{7v}\tag{10}$$

holds if and only if $v<166667$. Thus, $n=166669$ is the first counterexample.

Coming up with a better magic number than Granlund-Montgomery

We can also see now why a better magic number worked in the example from the previous section. Let me repeat it here with the notation of Theorem 2: we have $x=1/102807$ and $n_{\max}=2^{32}-1$, so we have

$$\left\lfloor \frac{n}{102807}\right\rfloor = \left\lfloor \frac{nm}{2^k}\right\rfloor\tag{11}$$

for all $n=1,\cdots,2^{32}-1$ if and only if $\xi=\frac{m}{2^k}$ satisfies

$$\frac{1}{102807} \leq \frac{m}{2^k} < \frac{1}{102807} + \frac{1}{v\cdot 102807}.\tag{12}$$

In this case, $v$ is the largest integer in the range $1,\cdots,2^{32}-1$ which has the remainder $102806$ when divided by $102807$. It can be easily seen that $v=4294865231$, so the inequality above becomes

$$\frac{2^k}{102807} \leq m < \frac{41776\cdot 2^k}{4294865231}.\tag{13}$$

The smallest $k$ that allows an integer solution to the above inequality is $k=48$, in which case the unique solution is $m=2737896999$.
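The search just described can be written down directly. The following sketch (the names and the restriction to 32-bit divisors are mine, so that everything fits into 128-bit arithmetic) computes $v$ and then tries increasing $k$, each time testing the smallest candidate $m=\lceil 2^k/d\rceil$ against the optimal bound:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// For x = 1/d, find the smallest k and the corresponding m with
// 1/d <= m/2^k < 1/d + 1/(v*d), where v is the largest n <= n_max
// with n % d == d - 1.
std::pair<int, std::uint64_t> magic_for_inverse(std::uint32_t d,
                                                std::uint32_t n_max) {
    std::uint64_t v = n_max - ((std::uint64_t(n_max) + 1) % d);
    for (int k = 0;; ++k) {
        __uint128_t pow2 = __uint128_t(1) << k;
        // The smallest candidate for this k is m = ceil(2^k / d).
        std::uint64_t m = std::uint64_t((pow2 + d - 1) / d);
        // Check m/2^k < (v+1)/(v*d), i.e. m*v*d < 2^k*(v+1).
        if (__uint128_t(m) * v * d < pow2 * (v + 1)) return {k, m};
    }
}
```

For d = 102807 and n_max = 2^32 - 1 this returns k = 48 and m = 2737896999, as computed above.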

A textbook example for the case $p\neq 1$

Let me demonstrate how Theorem 2 can be used for the case $p\neq 1$. Suppose that we want to convert a temperature from Fahrenheit to Celsius. Obviously, representing such values only using integers is a funny idea, but let us pretend that we are completely serious (and hey, in 2023, Fahrenheit itself is a funny joke in the first place😂… except that I am currently living in an interesting country🤮). Or, if one desperately wants a really serious example, we can maybe think about doing the same thing with fixed-point fractional numbers. But whatever.

So the formula is: first subtract $32$ and then multiply by $5/9$. For the sake of simplicity, let us arbitrarily assume that our Fahrenheit temperature ranges from $32^{\circ}\mathrm{F}$ to $580^{\circ}\mathrm{F}$, so in particular we do not run into negative numbers. After subtracting $32$, the range is from $0$ to $548$, so we take $n_{\max}=548$. With $x=\frac{5}{9}$, our $v$ is the largest integer such that $v\leq 548$ and $5v\equiv 8\ (\mathrm{mod}\ 9)$, or equivalently, $v\equiv 7\ (\mathrm{mod}\ 9)$. The largest multiple of $9$ in the range is $540$, so we see $v=547$. Hence, the inequality we are given is

$$\frac{5}{9} \leq \frac{m}{2^k} < \frac{5}{9} + \frac{1}{547\cdot 9} = \frac{304}{547}.\tag{14}$$

The smallest $k$ that allows an integer solution is $k=10$, and the unique solution in that case is $m=569$. Therefore, the final formula is:

$$\left\lfloor \frac{5(n-32)}{9}\right\rfloor = \left\lfloor \frac{569(n-32)}{2^{10}}\right\rfloor.\tag{15}$$
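A quick exhaustive check of this formula over the assumed input range (the function name is mine):

```cpp
#include <cassert>
#include <cstdint>

// Fahrenheit to Celsius on integers, for n in [32, 580]:
// floor(5(n - 32)/9) computed as (569 * (n - 32)) >> 10.
std::uint32_t f_to_c(std::uint32_t n) {
    return (569u * (n - 32u)) >> 10;
}
```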

Will the case $p\neq 1$ be potentially relevant for compiler-writers?

It might be, but there is a caveat: in general, the compiler is not allowed to optimize an expression n * p / q into (n * m) >> k, because n * p can overflow, in which case dividing by q will give a weird answer. To be more specific, for unsigned integer types, the C/C++ standards say that any overflow should wrap around, so the expression n * p / q is not really supposed to compute $\left\lfloor\frac{np}{q}\right\rfloor$; rather, it is supposed to compute $\left\lfloor\frac{np\ \mathrm{mod}\ 2^N}{q}\right\rfloor$ where $N$ is the bit-width, even though it is quite likely that the one who wrote the code actually wanted the former. On the other hand, for signed integer types, (a signed equivalent of) Theorem 2 might be applicable, because signed overflow is undefined behavior. But presumably there is a lot of code out there relying on the wrong assumption that signed overflows will wrap around, so maybe compiler-writers do not want to do this kind of optimization.

Nevertheless, there are situations where doing this is perfectly legal. For example, suppose n, p, q are all of type std::uint32_t and the code is written like static_cast<std::uint32_t>((static_cast<std::uint64_t>(n) * p) / q) to intentionally avoid this overflow issue. Then the compiler might recognize such a pattern and do this kind of optimization. Or more generally, with the new assume attribute (or some other equivalent compiler-specific mechanism), the user might provide assumptions that ensure no overflow.

It seems that currently neither clang nor GCC does this, so if they want to do so in the future, then Theorem 2 might be useful. But how much code can benefit from such an optimization? Will it really be worth implementing? I do not know, but maybe not.

When the magic number is too big

Even with the optimal bound, there exist situations where the smallest possible magic number does not fit into the word size. For example, consider the case $n_{\max}=2^{64}-1$ and $x=1/10961$. In this case, our $v$ is

$$v = \left\lfloor \frac{2^{64}-10961}{10961}\right\rfloor\cdot 10961 + 10960 = 18446744073709550681.\tag{16}$$

Hence, the inequality we need to inspect is

$$\frac{1}{10961} \leq \frac{m}{2^k} < \frac{1}{10961} + \frac{1}{10961\cdot 18446744073709550681} = \frac{1682943533775162}{18446744073709550681},$$

and the smallest $k$ allowing an integer solution is $78$, in which case the unique solution is $m=27573346857372255605$. And unfortunately, this is a $65$-bit number!

Let us see how to deal with this case by looking at what a compiler actually does. Consider the code:

std::uint64_t div(std::uint64_t n) {
    return n / 10961;
}

My compiler (clang, again) generated the following lines of assembly:

div(unsigned long):
        movabs  rcx, 9126602783662703989
        mov     rax, rdi
        mul     rcx
        sub     rdi, rdx
        shr     rdi
        lea     rax, [rdi + rdx]
        shr     rax, 13
        ret

(Check it out!)

This is definitely a little more complicated than the happy case, but if we think carefully, we can realize that this is still just computing $\left\lfloor\frac{nm}{2^k}\right\rfloor$:

  • The magic number $9126602783662703989$ is precisely $m-2^{64}$.
  • In the third line, we multiply this magic number with the input. Let us call the upper half of the result $u$ (which is stored in rdx); the lower half is stored in rax, but we do not care about it anyway.
  • We subtract the upper half $u$ from the input $n$ (the sub line), divide the result by $2$ (the shr line), add the result back to $u$ (the lea line), and then store the end result into rax. This looks a bit puzzling, but what it really does is nothing but to compute $\left\lfloor\frac{n+u}{2}\right\rfloor = \left\lfloor\frac{nm}{2^{65}}\right\rfloor$. The reason why we first subtract $u$ from $n$ is to avoid overflow. The subtraction itself is totally fine as there can be no underflow, because $u=\left\lfloor\frac{n(m-2^{64})}{2^{64}}\right\rfloor$ is at most $n$, as $m-2^{64}$ is less than $2^{64}$.
  • Recall $k=78$, so we want to compute $\left\lfloor\frac{nm}{2^{78}}\right\rfloor$. Since we got $\left\lfloor\frac{nm}{2^{65}}\right\rfloor$ from the previous step, we just need to shift this further by $13$-bits.
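The same dance, transcribed back into C++. This is a sketch of what the assembly does (the function name is mine), with `__uint128_t` standing in for the mul instruction:

```cpp
#include <cassert>
#include <cstdint>

// Divide by 10961 with the 65-bit magic number m = 2^64 + m',
// m' = 9126602783662703989, k = 78: compute u = high 64 bits of n * m',
// then floor((n + u)/2) without overflow, then shift by 78 - 65 = 13.
std::uint64_t div10961(std::uint64_t n) {
    constexpr std::uint64_t mp = 9126602783662703989u;
    std::uint64_t u = static_cast<std::uint64_t>(
        (static_cast<__uint128_t>(n) * mp) >> 64);
    return (u + ((n - u) >> 1)) >> 13;
}
```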

This is not so bad; we just have two more trivial instructions compared to the happy case. But the reason why the above works is largely that the magic number is just one bit larger than $64$-bits. Well, can it be even larger than that?

The answer is: No, fortunately, the magic number being just one bit larger than the word size is the worst case.

To see why, note that the length of the interval in which $\xi=\frac{m}{2^k}$ can possibly live is precisely $\frac{1}{vq}$. Therefore, if $k$ is large enough so that $2^k\geq vq$, then the difference between the endpoints of the inequality

$$\frac{2^k}{q} \leq m < 2^k\left(\frac{1}{q} + \frac{1}{vq}\right)\tag{17}$$

is at least $1$, so the inequality must admit an integer solution. Now, we are interested in the bit-width of $\left\lceil\frac{2^k}{q}\right\rceil$, which must be the smallest possible magic number whenever $k$ admits at least one solution. Since the smallest admissible $k$ is at most the smallest $k$ satisfying $2^k\geq vq$, for that $k$ we must have $vq>2^{k-1}$, so $\frac{2^k}{q}<2v$. Clearly, the right-hand side is at most one bit larger than the word size.
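This bound is easy to confirm empirically for a small word size. The sketch below (names mine) computes the smallest admissible magic number for every 16-bit divisor with $n_{\max}=2^{16}-1$ and checks that it always fits into $17$ bits:

```cpp
#include <cassert>
#include <cstdint>

// Smallest magic number for x = 1/d over 1..n_max, per the optimal bound:
// the first m = ceil(2^k / d) (k = 0, 1, ...) with m/2^k < (v+1)/(v*d).
std::uint64_t smallest_magic(std::uint32_t d, std::uint32_t n_max) {
    std::uint64_t v = n_max - ((std::uint64_t(n_max) + 1) % d);
    for (int k = 0;; ++k) {
        std::uint64_t pow2 = std::uint64_t(1) << k;
        std::uint64_t m = (pow2 + d - 1) / d; // ceil(2^k / d)
        if (__uint128_t(m) * v * d < __uint128_t(pow2) * (v + 1)) return m;
    }
}
```

As a sanity check, for $d=10$ this recovers the classic 16-bit magic number $52429$ (with shift $19$).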

Actually, there is an alternative way of dealing with the case of a too-large magic number, which is to consider a slightly different formula: instead of just doing a multiplication and then a shift, perform an addition of another magic number between those two operations. Using the notations from Theorem 2, what this means is that we have some $\zeta$ satisfying the equality

$$\lfloor nx\rfloor = \lfloor n\xi + \zeta\rfloor\tag{18}$$

instead of $\lfloor nx\rfloor = \lfloor n\xi\rfloor$, where $\xi=\frac{m}{2^k}$ and $\zeta=\frac{s}{2^k}$, so that

$$\lfloor n\xi + \zeta\rfloor = \left\lfloor \frac{nm+s}{2^k}\right\rfloor\tag{19}$$

can be computed by performing a multiplication, an addition, and a shift.

The inclusion of this $\zeta$ might allow us to use a smaller magic number, thereby enabling its accommodation within a single word. The next section is dedicated to the condition for having the above equality.

Here is a small remark before we start the (extensive) discussion of the optimal bound for this. Note that this trick of including the $\zeta$ term is probably not so useful for $64$-bit divisions, because addition is not really a trivial operation in this case. This is because the result of the multiplication $mn$ spans two $64$-bit blocks. Hence, we need an adc (add-with-carry) instruction or an equivalent, which is not particularly well-optimized on typical x86-64 CPUs. While I have not conducted a benchmark, I speculate that this approach probably results in worse performance. However, this trick might give better performance for the $32$-bit case than the method explained in this section, as every operation can be done inside $64$-bits. Interestingly, it seems modern compilers do not use this trick anyway. In the link provided, the compiler is trying to compute a multiplication by $4999244749$ followed by a shift by $49$-bits. However, it turns out that by multiplying by $1249811187$, adding $1249811187$, and then shifting to the right by $47$-bits, we can do this computation completely within $64$-bits. I do not know whether the reason the compilers do not leverage this trick is that it does not perform better, or just that they did not bother to implement it.
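Here is a sketch of the $32$-bit multiply-add-and-shift version just mentioned. The text does not restate the divisor of the linked example; from the constants it appears to be $d=112607$, so treat that as my assumption:

```cpp
#include <cassert>
#include <cstdint>

// n / 112607 computed entirely in 64 bits as (n*m + s) >> k with
// m = s = 1249811187 and k = 47. The worst-case intermediate value is
// (2^32 - 1) * 1249811187 + 1249811187 = 2^32 * 1249811187 < 2^64,
// so nothing overflows.
std::uint32_t div_mul_add_shift(std::uint32_t n) {
    constexpr std::uint64_t m = 1249811187u;
    return static_cast<std::uint32_t>((n * m + m) >> 47);
}
```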

Multiply-add-and-shift rather than multiply-shift

WARNING: The contents of this section are substantially more complicated and math-heavy than the previous sections.

As said in the last section, we will explore the condition for having

$$\lfloor nx\rfloor = \lfloor n\xi + \zeta\rfloor\tag{20}$$

for all $n=1,\cdots,n_{\max}$, where $x$, $\xi$ are real numbers and $\zeta$ is a nonnegative real number. We will derive the optimal bound, i.e., an "if and only if" condition. We remark that an optimal bound was already obtained in the paper by Lemire et al. mentioned above for the special case when $x=\frac{1}{q}$ for some $q\leq n_{\max}$ and $\xi=\zeta$. According to their paper, the proof of the optimality of their bound is almost identical to the case of having no $\zeta$, so they did not even bother to write down the proof. I provide a proof of this special case in a later subsection. The proof I wrote seems to rely on the heavy lifting done below, but it can be done without it pretty easily as well.

However, in the general case I am dealing with here, i.e., where the only restriction is $\zeta\geq 0$, the situation is quite a bit more complicated. Nevertheless, even in this generality, it is possible to give a very concrete description of how exactly the presence of $\zeta$ distorts the optimal bound.

Just like in the case $\zeta=0$ (i.e., Theorem 2), having the equality for all $n=1,\cdots,n_{\max}$ is equivalent to having the inequality

$$\max_{n=1,\cdots,n_{\max}} \frac{\lfloor nx\rfloor - \zeta}{n} \leq \xi < \min_{n=1,\cdots,n_{\max}} \frac{\lfloor nx\rfloor + 1 - \zeta}{n},\tag{21}$$

so the question is how to evaluate the maximum and the minimum above. For reasons that will become apparent as we proceed, we will in fact try to find the largest maximizer of the left-hand side and the smallest minimizer of the right-hand side.

The lower bound

Since $\zeta\geq 0$ is supposed to be just a small constant, it sounds reasonable to believe that the maximizer of $\frac{\lfloor nx\rfloor}{n}$ is probably quite close to being the maximizer of $\frac{\lfloor nx\rfloor-\zeta}{n}$ as well. So let us start there: let $n_0$ be the largest among all maximizers of $\frac{\lfloor nx\rfloor}{n}$. Now, what should happen if $n_0$ were not the maximizer of $\frac{\lfloor nx\rfloor-\zeta}{n}$? Say, what can we say about $n$'s such that

$$\frac{\lfloor nx\rfloor-\zeta}{n} \geq \frac{\lfloor n_0 x\rfloor-\zeta}{n_0}\tag{22}$$

holds? First of all, since this inequality (combined with the maximality of $n_0$) implies

$$\frac{\zeta}{n} \leq \frac{\zeta}{n_0},\tag{23}$$

we must have $n\geq n_0$ unless $\zeta=0$, which is an uninteresting case anyway.

As a result, we can equivalently reformulate our optimization problem as follows: given $n=0,\cdots,n_{\max}-n_0$, find the largest maximizer of

$$\frac{\lfloor (n_0+n)x\rfloor-\zeta}{n_0+n}.\tag{24}$$

Now, we claim that

$$\lfloor (n_0+n)x\rfloor = \lfloor nx\rfloor + \lfloor n_0 x\rfloor\tag{25}$$

holds for all such $n$. (Note that in general $\lfloor x+y\rfloor$ is equal to either $\lfloor x\rfloor+\lfloor y\rfloor$ or $\lfloor x\rfloor+\lfloor y\rfloor+1$.) This follows from the fact that $\frac{\lfloor n_0 x\rfloor}{n_0}$ is not only a best rational approximation of $x$ from below in the weak sense, but also in the strong sense. Okay, at this point there is no way to get around these jargons anymore, so let us define them formally.

Definition 3 (Best rational approximations from below/above).

Let $x$ be a real number. We say a rational number $\frac{p}{q}$ (in its reduced form, which is always assumed if not specified otherwise) is a best rational approximation from below (above, resp.) of $x$ if $\frac{p}{q}\leq x$ ($\frac{p}{q}\geq x$, resp.) and for any rational number $\frac{a}{b}$ with $\frac{p}{q}<\frac{a}{b}\leq x$ ($x\leq\frac{a}{b}<\frac{p}{q}$, resp.), we always have $q<b$.

In other words, $\frac{p}{q}$ is a best rational approximation of $x$ if any better rational approximation must have a larger denominator.

Remark. Note that the terminology best rational approximation is pretty standard in pure mathematics, but it usually disregards the direction of approximation, from below or above. For example, $\frac{1}{3}$ is a best rational approximation from below of $\frac{3}{7}$, but it is not a best rational approximation in the usual, non-directional sense, because $\frac{1}{2}$ is a better approximation with a strictly smaller denominator. But this concept of non-directional best rational approximation is quite irrelevant to our application, so any usage of the term best rational approximation in this post always means the directional ones, either from below or above.

The definition provided above is the one in the weak sense, although it is not explicitly written so. The corresponding one in the strong sense is given below:

Definition 4 (Best rational approximations from below/above in the strong sense).

Let $x$ be a real number. We say a rational number $\frac{p}{q}$ (again, in its reduced form) is a best rational approximation from below (above, resp.) of $x$ in the strong sense, if $\frac{p}{q}\leq x$ ($\frac{p}{q}\geq x$, resp.) and for any rational number $\frac{a}{b}$ with $qx-p\geq bx-a\geq 0$ ($p-qx\geq a-bx\geq 0$, resp.), we always have $q\leq b$.

As the name suggests, if $\frac{p}{q}$ is a best rational approximation from below (above, resp.) in the strong sense, then it is a best rational approximation from below (above, resp.) in the weak sense. To see why, suppose that $\frac{p}{q}$ is a best rational approximation from below of $x$ in the strong sense and take any rational number $\frac{a}{b}$ with $\frac{a}{b}\leq x$ and $b\leq q$. Then it is enough to show that $\frac{p}{q}\geq\frac{a}{b}$ holds. (We are only considering the "from below" case; the "from above" case can be done in the same way.) This is indeed quite easy to show:

$$x-\frac{p}{q} = \frac{qx-p}{q} \leq \frac{bx-a}{q} \leq \frac{bx-a}{b} = x-\frac{a}{b},\tag{26}$$

thus we conclude $\frac{a}{b}\leq\frac{p}{q}$ as claimed, so $\frac{p}{q}$ is indeed a best rational approximation from below of $x$ in the weak sense.

Remarkably, using the theory of continued fractions, it can be shown that the converse is also true. Note that this fact is not at all trivial if one only looks at the definitions. You can find a proof of this fact in my paper on Dragonbox; see the remark after Algorithm C.13. (I shamelessly give my own writing as a reference because I do not know of any other.)

Alright, but what is the point of all this nonsense?

First, as suggested before, our n0xn0n0xn0 is a best rational approximation from below of xx. This is obvious from its definition: n0n0 is the maximizer of nxnnxn for n=1,  ,nmaxn=1,  ,nmax, which means that whenever we have

$$\frac{\lfloor n_0x\rfloor}{n_0}<\frac{a}{b}\leq x,\tag{27}$$

then since $\frac{a}{b}\leq\frac{\lfloor bx\rfloor}{b}$ holds (because otherwise the inequality $\frac{a}{b}\leq x$ fails to hold), we must have $b>n_{\max}$, so in particular $b>n_0$.

Therefore, by the aforementioned equivalence of the two concepts, we immediately know that $\frac{\lfloor n_0x\rfloor}{n_0}$ is a best rational approximation from below of $x$ in the strong sense. Note that this does not imply that $n_0$ is a minimizer of $nx-\lfloor nx\rfloor$ for $n=1,\dots,n_{\max}$, because $\frac{\lfloor n_0x\rfloor}{n_0}$ is not necessarily in its reduced form (since we chose $n_0$ to be the largest maximizer). Still, if we denote its reduced form as $\frac{p}{q}$, then certainly $q$ is the minimizer of it (and $p$ must be equal to $\lfloor qx\rfloor$). Indeed, by the definition of best rational approximations from below in the strong sense, if $n$ is the smallest integer such that $n>q$ and

$$nx-\lfloor nx\rfloor<qx-p,\tag{28}$$

then $\frac{\lfloor nx\rfloor}{n}$ itself must be a best rational approximation from below in the strong sense, thus also in the weak sense. This in particular means that $\frac{\lfloor nx\rfloor}{n}\geq\frac{p}{q}=\frac{\lfloor n_0x\rfloor}{n_0}$ holds since $n>q$, and in fact the equality cannot hold because otherwise we get

$$qx-p=qx-\frac{q}{n}\lfloor nx\rfloor=\frac{q}{n}\left(nx-\lfloor nx\rfloor\right)<\frac{q}{n}\left(qx-p\right),\tag{29}$$

which contradicts $n>q$. Therefore, by the definition of $n_0$, we must have $n>n_{\max}$.

Using this fact, now we can easily prove $(25)$. Again let $\frac{p}{q}$ be the reduced form of $\frac{\lfloor n_0x\rfloor}{n_0}$; then for any $n=0,\dots,n_{\max}-q$, we must have

$$(q+n)x-\lfloor(q+n)x\rfloor\geq qx-p,\tag{30}$$

so we get

$$\lfloor(q+n)x\rfloor\leq\lfloor qx\rfloor+\lfloor nx\rfloor<\lfloor qx\rfloor+\lfloor nx\rfloor+1,\tag{31}$$

and this rules out the possibility $\lfloor(q+n)x\rfloor=\lfloor qx\rfloor+\lfloor nx\rfloor+1$, thus we must have $\lfloor(q+n)x\rfloor=\lfloor qx\rfloor+\lfloor nx\rfloor$. Then inductively, we get that for any positive integer $k$ such that $kq\leq n_{\max}$ and $n=0,\dots,n_{\max}-kq$,

$$\lfloor(kq+n)x\rfloor=\lfloor qx\rfloor+\lfloor((k-1)q+n)x\rfloor=\dots=k\lfloor qx\rfloor+\lfloor nx\rfloor.\tag{32}$$

Hence, since $n_0$ is a multiple of $q$, we must have $\lfloor(n_0+n)x\rfloor=\lfloor n_0x\rfloor+\lfloor nx\rfloor$ in particular.

(Here is a more intuitive explanation. Note that among all the $nx$'s, $qx$ is the one with the smallest fractional part. Then whenever $kq+n\leq n_{\max}$, the sum of the fractional part of $nx$ and that of $k$ copies of $qx$ should not "wrap around", i.e., it should be strictly less than $1$. Indeed, if we choose the smallest $k$ such that $(kq+n)x$ wraps around, then since $((k-1)q+n)x$ does not wrap around, the fractional part of $(kq+n)x$ equals the fractional part of $((k-1)q+n)x$ plus that of $qx$, minus $1$; as the fractional part of $((k-1)q+n)x$ is strictly less than $1$, the fractional part of $(kq+n)x$ is then strictly less than that of $qx$, contradicting the minimality of the fractional part of $qx$. Hence, since the sum of all the fractional parts is strictly less than $1$, the integer part of $(kq+n)x$ should be just the sum of the integer parts of its summands.)
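The additivity claim above is easy to sanity-check numerically. Below is a small brute-force check of my own (not part of the argument), using the sample rational $x=7/18$, the value that reappears in the example at the end of this post; the bound $n_{\max}=1000$ is arbitrary:

```cpp
#include <cassert>
#include <cstdint>

// Sanity check of the additivity claim for the sample rational x = 7/18.
// We brute-force n0, the largest maximizer of floor(n*x)/n over n = 1..n_max,
// and then verify floor((n0 + n)*x) = floor(n0*x) + floor(n*x) for all
// n = 0..n_max - n0. (x and n_max are arbitrary sample values.)
constexpr std::int64_t p = 7, q = 18, n_max = 1000;

constexpr std::int64_t floor_nx(std::int64_t n) { return n * p / q; }

std::int64_t largest_maximizer() {
    // floor(n*x)/n >= floor(best*x)/best, compared cross-multiplied;
    // ">=" keeps the LARGEST maximizer in case of ties.
    std::int64_t best = 1;
    for (std::int64_t n = 1; n <= n_max; ++n)
        if (floor_nx(n) * best >= floor_nx(best) * n) best = n;
    return best;
}

bool additivity_holds() {
    std::int64_t n0 = largest_maximizer();
    for (std::int64_t n = 0; n <= n_max - n0; ++n)
        if (floor_nx(n0 + n) != floor_nx(n0) + floor_nx(n)) return false;
    return true;
}
```

Here the largest maximizer is $n_0=990=55\cdot18$ (a multiple of the denominator, as expected), and additivity indeed holds throughout the range.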

Next, using the claim, we can now characterize those $n=1,\dots,n_{\max}-n_0$ which yield a larger value of

$$\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}\tag{33}$$

than the case $n=0$. Indeed, we have the inequality

$$\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}\geq\frac{\lfloor n_0x\rfloor-\zeta}{n_0}\tag{34}$$

if and only if

$$n_0\lfloor nx\rfloor-n_0\zeta\geq n\lfloor n_0x\rfloor-(n_0+n)\zeta,\tag{35}$$

if and only if

$$\frac{\lfloor nx\rfloor}{n}\geq\frac{\lfloor n_0x\rfloor-\zeta}{n_0},\tag{36}$$

or equivalently,

$$x-\frac{\lfloor nx\rfloor}{n}\leq\frac{\zeta}{n_0}+\left(x-\frac{\lfloor n_0x\rfloor}{n_0}\right).\tag{37}$$

Therefore, in a sense, $\frac{\lfloor nx\rfloor}{n}$ itself should be a good enough approximation of $x$, although it does not need to (and cannot) be a better approximation than $\frac{\lfloor n_0x\rfloor}{n_0}$.

At this point, we could just enumerate all rational approximations of $x$ from below satisfying the above bound and find the one that maximizes $\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}$. Indeed, the theory of continued fractions allows us to develop an efficient algorithm for doing such a task. (See Algorithm C.13 in my paper on Dragonbox for an algorithm doing a closely related task.) However, we can do better here.

First of all, note that if $\zeta$ is small enough, then $(37)$ does not have any solution for $n=1,\dots,n_{\max}-n_0$, which means that $n_0$ is indeed the largest maximizer we are looking for and there is nothing further we need to do. To be more precise, if $\zeta$ is chosen so that $\frac{\lfloor n_0x\rfloor-\zeta}{n_0}$ is strictly greater than the next best rational approximation from below of $x$, then any solution to $(37)$ must be a multiple of $q$, where $q$ is the denominator of $\frac{\lfloor n_0x\rfloor}{n_0}$ in its reduced form. But since $n_0$ is chosen to be the largest, it follows that there is no multiple of $q$ in the range $\{1,\dots,n_{\max}-n_0\}$, so there is no solution to $(37)$ in that range.

This conclusion is consistent with the intuition that $n_0$ should be close enough to the maximizer of $\frac{\lfloor nx\rfloor-\zeta}{n}$, at least when $\zeta$ is small. But what if $\zeta$ is not that small?

It is quite tempting to claim that the minimizer of the left-hand side of $(37)$ is the maximizer of $\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}$, but that is not true in general. Nevertheless, we can start from there, just like we started from $n_0$ at the beginning.

For this reason, let $n_1$ be the largest maximizer of $\frac{\lfloor nx\rfloor}{n}$ for $n=1,\dots,n_{\max}-n_0$. As pointed out earlier, if such $n_1$ does not satisfy $(37)$, then there is nothing further to do, so suppose that the inequality is indeed satisfied with $n=n_1$.

Next, we claim that

$$\frac{\lfloor(n_0+n)x\rfloor-\zeta}{n_0+n}\leq\frac{\lfloor(n_0+n_1)x\rfloor-\zeta}{n_0+n_1}\tag{38}$$

holds for all $n=1,\dots,n_1$.

Note that by $(25)$, this is equivalent to

$$n_0\lfloor nx\rfloor+n_1\left(\lfloor n_0x\rfloor+\lfloor nx\rfloor\right)-(n_0+n_1)\zeta\leq n_0\lfloor n_1x\rfloor+n\left(\lfloor n_0x\rfloor+\lfloor n_1x\rfloor\right)-(n_0+n)\zeta,$$

and rearranging it gives

$$(n_1-n)\left(\zeta-\lfloor n_0x\rfloor\right)\geq\left(n_1\lfloor nx\rfloor-n\lfloor n_1x\rfloor\right)+n_0\left(\lfloor nx\rfloor-\lfloor n_1x\rfloor\right).\tag{39}$$

By adding and subtracting appropriate terms, we can rewrite this as

$$n_0(n_1-n)\left(\frac{\zeta}{n_0}+\left(x-\frac{\lfloor n_0x\rfloor}{n_0}\right)\right)\geq n_1(n_0+n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)-n(n_0+n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right).$$

Recall that we already have assumed

$$x-\frac{\lfloor n_1x\rfloor}{n_1}\leq\frac{\zeta}{n_0}+\left(x-\frac{\lfloor n_0x\rfloor}{n_0}\right),\tag{40}$$

so it is enough to show

$$n_0(n_1-n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)\geq n_1(n_0+n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)-n(n_0+n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right),$$

or equivalently,

$$n(n_0+n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right)\geq\left(n_1(n_0+n)-n_0(n_1-n)\right)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)=n(n_0+n_1)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right),$$

which trivially holds by the definition of $n_1$. Thus the claim is proved.

As a result, we can reformulate our optimization problem again in the following way: define $N_1:=n_0+n_1$; then we are to find the $n=0,\dots,n_{\max}-N_1$ which maximizes

$$\frac{\lfloor(N_1+n)x\rfloor-\zeta}{N_1+n}.\tag{41}$$

As you can see, this now resembles quite a lot what we have just done. Then the natural next step is to see whether we have

$$\lfloor(N_1+n)x\rfloor=\lfloor N_1x\rfloor+\lfloor nx\rfloor,\tag{42}$$

which, by $(25)$, is equivalent to

$$\lfloor(n_1+n)x\rfloor=\lfloor n_1x\rfloor+\lfloor nx\rfloor.\tag{43}$$

But we already know this: because $\frac{\lfloor n_1x\rfloor}{n_1}$ is a best rational approximation from below of $x$, the exact same proof of $(25)$ applies.

Therefore, by the same procedure as before, we get that

$$\frac{\lfloor(N_1+n)x\rfloor-\zeta}{N_1+n}\geq\frac{\lfloor N_1x\rfloor-\zeta}{N_1}\tag{44}$$

holds for $n=1,\dots,n_{\max}-N_1$ if and only if

$$\frac{\lfloor nx\rfloor}{n}\geq\frac{\lfloor N_1x\rfloor-\zeta}{N_1}.\tag{45}$$

Hence, we finally arrive at the following iterative algorithm for computing the largest maximizer of $\frac{\lfloor nx\rfloor-\zeta}{n}$:

Algorithm 5 (Computing the lower bound).

Input: $x\in\mathbb{R}$, $n_{\max}\in\mathbb{Z}_{>0}$, $\zeta\geq0$.

Output: the largest maximizer of $\frac{\lfloor nx\rfloor-\zeta}{n}$ for $n=1,\dots,n_{\max}$.

  1. Find the largest $n=1,\dots,n_{\max}$ that maximizes $\frac{\lfloor nx\rfloor}{n}$ and call it $n_0$.
  2. If $n_0=n_{\max}$, then $n_0$ is the largest maximizer; return.
  3. Otherwise, find the largest $n=1,\dots,n_{\max}-n_0$ that maximizes $\frac{\lfloor nx\rfloor}{n}$ and call it $n_1$.
  4. Inspect the inequality $$\frac{\lfloor n_1x\rfloor}{n_1}\geq\frac{\lfloor n_0x\rfloor-\zeta}{n_0}.\tag{46}$$
  5. If the inequality does not hold, then $n_0$ is the largest maximizer; return.
  6. If the inequality does hold, then set $n_0\leftarrow n_0+n_1$ and go to Step 2.

Remark. In the above, we blackboxed the operation of finding the largest maximizer of $\frac{\lfloor nx\rfloor}{n}$. Actually, this is precisely how we obtain Theorem 2. If $x$ is a rational number whose denominator $q$ is at most $n_{\max}$, then obviously the largest maximizer is the largest multiple of $q$ bounded by $n_{\max}$. Otherwise, we just compute the best rational approximation from below of $x$ with the largest denominator $q\leq n_{\max}$, which the theory of continued fractions allows us to do very efficiently. (In particular, when $x$ is a rational number, this efficient algorithm is really just the extended Euclidean algorithm.) Once we get the best rational approximation $\frac{p}{q}$ (which must be in its reduced form), we find the largest multiple $kq$ of $q$ bounded by $n_{\max}$. Then since $\frac{p}{q}$ is the maximum of $\frac{\lfloor nx\rfloor}{n}$ for $n=1,\dots,n_{\max}$, it follows that $\lfloor kqx\rfloor$ must be equal to $kp$, and $kq$ is the largest maximizer of $\frac{\lfloor nx\rfloor}{n}$.
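For rational inputs, Algorithm 5 is easy to prototype. The sketch below is my own code: the `Rational` struct and the exhaustive $O(n_{\max})$ search in `largest_maximizer` are simplifications I chose for clarity (as the remark above notes, this step should really be done with continued fractions), and everything is kept in exact integer arithmetic by cross-multiplying inequality (46):

```cpp
#include <cassert>
#include <cstdint>

// Brute-force sketch of Algorithm 5 for rational x = num/den and rational
// zeta. Names and the O(n_max) maximizer search are my own choices.
struct Rational { std::int64_t num, den; };  // den > 0 assumed

std::int64_t floor_mul(Rational x, std::int64_t n) { return n * x.num / x.den; }

// Largest maximizer of floor(n*x)/n for n = 1..bound, by exhaustive search;
// ">=" keeps the largest index in case of ties.
std::int64_t largest_maximizer(Rational x, std::int64_t bound) {
    std::int64_t best = 1;
    for (std::int64_t n = 1; n <= bound; ++n)
        if (floor_mul(x, n) * best >= floor_mul(x, best) * n) best = n;
    return best;
}

// Algorithm 5: largest maximizer of (floor(n*x) - zeta)/n for n = 1..n_max.
std::int64_t algorithm5(Rational x, std::int64_t n_max, Rational zeta) {
    std::int64_t n0 = largest_maximizer(x, n_max);
    while (n0 != n_max) {
        std::int64_t n1 = largest_maximizer(x, n_max - n0);
        // Inequality (46): floor(n1*x)/n1 >= (floor(n0*x) - zeta)/n0,
        // cross-multiplied to stay in exact integer arithmetic.
        if (floor_mul(x, n1) * n0 * zeta.den <
            (floor_mul(x, n0) * zeta.den - zeta.num) * n1)
            break;
        n0 += n1;
    }
    return n0;
}
```

For example, with $x=5/7$ and $n_{\max}=10$, the result is $7$ for $\zeta=0$, but becomes $10$ once $\zeta\geq1/3$, at which point the iteration merges in $n_1=3$.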

The upper bound

The computation of the upper bound, that is, finding the smallest minimizer of

$$\frac{\lfloor nx\rfloor+1-\zeta}{n}\tag{47}$$

for $n=1,\dots,n_{\max}$, is largely the same as the case of the lower bound. In this case, we start with the smallest minimizer $n_0$ of $\frac{\lfloor nx\rfloor+1}{n}$. Then for any $n=1,\dots,n_{\max}$ with

$$\frac{\lfloor nx\rfloor+1-\zeta}{n}\leq\frac{\lfloor n_0x\rfloor+1-\zeta}{n_0},\tag{48}$$

we must have

$$\frac{\zeta}{n}\geq\frac{\zeta}{n_0},\tag{49}$$

thus $n\leq n_0$ unless $\zeta=0$, which again is an uninteresting case.

Hence, our goal is to find the $n=0,\dots,n_0-1$ minimizing

$$\frac{\lfloor(n_0-n)x\rfloor+1-\zeta}{n_0-n}.\tag{50}$$

Again, we claim that

$$\lfloor(n_0-n)x\rfloor=\lfloor n_0x\rfloor-\lfloor nx\rfloor\tag{51}$$

holds for all such $n$. We have two cases: (1) when $\frac{\lfloor n_0x\rfloor+1}{n_0}$ is a best rational approximation from above of $x$, or (2) when it is not. The second case can happen only when $x$ is a rational number whose denominator is at most $n_{\max}$, so that the best rational approximation from above of $x$ is $x$ itself. However, in that case, if we let $x=\frac{p}{q}$, then according to how we derived Theorem 2, the remainder of $n_0p/q$ must be the largest possible value, $q-1$. Hence, the quotient of $(n_0-n)p/q$ cannot be strictly smaller than the difference between the quotients of $n_0p/q$ and $np/q$, so the claim holds in this case.

On the other hand, if we suppose that $\frac{\lfloor n_0x\rfloor+1}{n_0}$ is a best rational approximation from above of $x$, then as we have seen in the case of the lower bound, it must be a best rational approximation from above in the strong sense. Then it is not hard to see that $n_0$ is the minimizer of $\lceil nx\rceil-nx=\lfloor nx\rfloor+1-nx$, or equivalently, the maximizer of $nx-\lfloor nx\rfloor$, because we chose $n_0$ to be the smallest minimizer. This shows

$$(n_0-n)x-\lfloor(n_0-n)x\rfloor\leq n_0x-\lfloor n_0x\rfloor,\tag{52}$$

thus

$$\lfloor(n_0-n)x\rfloor\geq\lfloor n_0x\rfloor-nx>\lfloor n_0x\rfloor-\lfloor nx\rfloor-1,\tag{53}$$

so we must have $\lfloor(n_0-n)x\rfloor=\lfloor n_0x\rfloor-\lfloor nx\rfloor$ as claimed.

Using the claim, we can again characterize those $n=1,\dots,n_0-1$ which yield a smaller value of

$$\frac{\lfloor(n_0-n)x\rfloor+1-\zeta}{n_0-n}\tag{54}$$

than the case $n=0$. Indeed, we have the inequality

$$\frac{\lfloor(n_0-n)x\rfloor+1-\zeta}{n_0-n}\leq\frac{\lfloor n_0x\rfloor+1-\zeta}{n_0}\tag{55}$$

if and only if

$$-n_0\lfloor nx\rfloor+n_0(1-\zeta)\leq-n\lfloor n_0x\rfloor+(n_0-n)(1-\zeta)\tag{56}$$

if and only if

$$n_0\lfloor nx\rfloor\geq n\lfloor n_0x\rfloor+n(1-\zeta)\tag{57}$$

if and only if

$$\frac{\lfloor nx\rfloor}{n}\geq\frac{\lfloor n_0x\rfloor+1-\zeta}{n_0},\tag{58}$$

or equivalently,

$$x-\frac{\lfloor nx\rfloor}{n}\leq\frac{\zeta}{n_0}-\left(\frac{\lfloor n_0x\rfloor+1}{n_0}-x\right).\tag{59}$$

Next, let $n_1$ be the largest maximizer of $\frac{\lfloor nx\rfloor}{n}$ for $n=1,\dots,n_0-1$. If $(59)$ does not hold with $n=n_1$, then we know that $n_0$ is the one we are looking for, so suppose otherwise.

Next, we claim that

$$\frac{\lfloor(n_0-n)x\rfloor+1-\zeta}{n_0-n}\geq\frac{\lfloor(n_0-n_1)x\rfloor+1-\zeta}{n_0-n_1}\tag{60}$$

holds for all $n=1,\dots,n_1$. By $(51)$, we can rewrite the inequality above as

$$-n_0\lfloor nx\rfloor-n_1\lfloor n_0x\rfloor+n_1\lfloor nx\rfloor+(n_0-n_1)(1-\zeta)\geq-n_0\lfloor n_1x\rfloor-n\lfloor n_0x\rfloor+n\lfloor n_1x\rfloor+(n_0-n)(1-\zeta),$$

or

$$(n_1-n)\left(\lfloor n_0x\rfloor+1-\zeta\right)\leq\left(n_1\lfloor nx\rfloor-n\lfloor n_1x\rfloor\right)-n_0\left(\lfloor nx\rfloor-\lfloor n_1x\rfloor\right).\tag{61}$$

Then adding and subtracting appropriate terms gives

$$n_0(n_1-n)\left(\frac{\zeta}{n_0}-\left(\frac{\lfloor n_0x\rfloor+1}{n_0}-x\right)\right)\geq n_1(n_0-n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)-n(n_0-n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right).$$

Recall that we already supposed

$$x-\frac{\lfloor n_1x\rfloor}{n_1}\leq\frac{\zeta}{n_0}-\left(\frac{\lfloor n_0x\rfloor+1}{n_0}-x\right),\tag{62}$$

so it is enough to show

$$n_0(n_1-n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)\geq n_1(n_0-n)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)-n(n_0-n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right),$$

or equivalently,

$$n(n_0-n_1)\left(x-\frac{\lfloor nx\rfloor}{n}\right)\geq\left(n_1(n_0-n)-n_0(n_1-n)\right)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right)=n(n_0-n_1)\left(x-\frac{\lfloor n_1x\rfloor}{n_1}\right),$$

which trivially holds by the definition of $n_1$. Thus, the claim is proved.

As a result, we can reformulate our optimization problem in the following way: define $N_1:=n_0-n_1$; then we are to find the $n=0,\dots,N_1-1$ which minimizes

$$\frac{\lfloor(N_1-n)x\rfloor+1-\zeta}{N_1-n}.\tag{63}$$

Again, the next step is to show that

$$\lfloor(N_1-n)x\rfloor=\lfloor N_1x\rfloor-\lfloor nx\rfloor\tag{64}$$

holds for all such $n$, which by $(51)$ is equivalent to

$$\lfloor(n_1+n)x\rfloor=\lfloor n_1x\rfloor+\lfloor nx\rfloor.\tag{65}$$

But we have already seen this in the case of the lower bound, because we know that $n_1$ is a maximizer of $\frac{\lfloor nx\rfloor}{n}$ for $n=1,\dots,n_0-1$. Hence, again we can show that

$$\frac{\lfloor(N_1-n)x\rfloor+1-\zeta}{N_1-n}\leq\frac{\lfloor N_1x\rfloor+1-\zeta}{N_1}\tag{66}$$

holds if and only if

$$\frac{\lfloor nx\rfloor}{n}\geq\frac{\lfloor N_1x\rfloor+1-\zeta}{N_1},$$

and repeating this procedure gives us the smallest minimizer of $\frac{\lfloor nx\rfloor+1-\zeta}{n}$.

Algorithm 6 (Computing the upper bound).

Input: $x\in\mathbb{R}$, $n_{\max}\in\mathbb{Z}_{>0}$, $\zeta\geq0$.

Output: the smallest minimizer of $\frac{\lfloor nx\rfloor+1-\zeta}{n}$ for $n=1,\dots,n_{\max}$.

  1. Find the smallest $n=1,\dots,n_{\max}$ that minimizes $\frac{\lfloor nx\rfloor+1}{n}$ and call it $n_0$.
  2. If $n_0=1$, then $n_0$ is the smallest minimizer; return.
  3. Find the largest $n=1,\dots,n_0-1$ that maximizes $\frac{\lfloor nx\rfloor}{n}$ and call it $n_1$.
  4. Inspect the inequality $$\frac{\lfloor n_1x\rfloor}{n_1}\geq\frac{\lfloor n_0x\rfloor+1-\zeta}{n_0}.$$
  5. If the inequality does not hold, then $n_0$ is the smallest minimizer; return.
  6. If the inequality does hold, then set $n_0\leftarrow n_0-n_1$ and go to Step 2.

Remark. Similarly to the case of the lower bound, we blackboxed the operation of finding the smallest minimizer of $\frac{\lfloor nx\rfloor+1}{n}$, which again is precisely how we obtain Theorem 2. If $x=\frac{p}{q}$ is a rational number with $q\leq n_{\max}$, then the minimizer is unique and is the largest $v\leq n_{\max}$ such that $vp\equiv-1\ (\mathrm{mod}\ q)$. Otherwise, we just compute the best rational approximation from above of $x$ with the largest denominator $q\leq n_{\max}$, which the theory of continued fractions allows us to do very efficiently. The resulting $\frac{p}{q}$ must be in its reduced form, so $q$ is the smallest minimizer of $\frac{\lfloor nx\rfloor+1}{n}$.
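The upper-bound counterpart can be prototyped the same way. Again, this is my own brute-force sketch (in a real implementation the helper searches would be continued-fraction computations), with the inequality checks cross-multiplied into exact integer arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// Brute-force sketch of Algorithm 6 for rational x = num/den and rational
// zeta; same hypothetical Rational setup as in the lower-bound sketch.
struct Rational { std::int64_t num, den; };  // den > 0 assumed

std::int64_t floor_mul(Rational x, std::int64_t n) { return n * x.num / x.den; }

// Smallest minimizer of (floor(n*x)+1)/n for n = 1..bound; strict "<" keeps
// the smallest index in case of ties.
std::int64_t smallest_minimizer(Rational x, std::int64_t bound) {
    std::int64_t best = 1;
    for (std::int64_t n = 2; n <= bound; ++n)
        if ((floor_mul(x, n) + 1) * best < (floor_mul(x, best) + 1) * n) best = n;
    return best;
}

// Largest maximizer of floor(n*x)/n for n = 1..bound.
std::int64_t largest_maximizer(Rational x, std::int64_t bound) {
    std::int64_t best = 1;
    for (std::int64_t n = 1; n <= bound; ++n)
        if (floor_mul(x, n) * best >= floor_mul(x, best) * n) best = n;
    return best;
}

// Algorithm 6: smallest minimizer of (floor(n*x)+1-zeta)/n for n = 1..n_max.
std::int64_t algorithm6(Rational x, std::int64_t n_max, Rational zeta) {
    std::int64_t n0 = smallest_minimizer(x, n_max);
    while (n0 != 1) {
        std::int64_t n1 = largest_maximizer(x, n0 - 1);
        // floor(n1*x)/n1 >= (floor(n0*x) + 1 - zeta)/n0, cross-multiplied.
        if (floor_mul(x, n1) * n0 * zeta.den <
            ((floor_mul(x, n0) + 1) * zeta.den - zeta.num) * n1)
            break;
        n0 -= n1;
    }
    return n0;
}
```

For $x=7/18$ and $n_{\max}=1000$, the smallest minimizer is $995$ when $\zeta=0$, and one iteration brings it down to $5$ once $\zeta\geq1/18$.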

Finding feasible values of $\xi$ and $\zeta$

Note that our purpose of introducing $\zeta$ was to increase the gap between the lower and the upper bounds on $\xi=\frac{m}{2^k}$ when the required bit-width of the constant $m$ is too large. Thus, $\zeta$ is not something given; rather, it is something we want to figure out. In this sense, Algorithm 5 and Algorithm 6 might seem pretty useless, because they work only when the value of $\zeta$ is already given.

Nevertheless, the way those algorithms work for a fixed $\zeta$ is in fact quite special, in that it allows us to figure out a good $\zeta$ fitting our purpose of widening the gap between the two bounds. The important point here is that $\zeta$ only appears in deciding when to stop the iteration, and all other details of the actual iteration steps do not depend on $\zeta$ at all.

Hence, our strategy is as follows. First, we assume $\zeta$ lives in the interval $[0,1)$ (otherwise $\lfloor nx\rfloor=\lfloor n\xi+\zeta\rfloor$ fails to hold for $n=0$). Then we can partition the interval $[0,1)$ into subintervals of the form $[\zeta_{\min},\zeta_{\max})$ on which the numbers of iterations that Algorithm 5 and Algorithm 6 take remain constant. Then we look at each of these subintervals one by one, from left to right, proceeding with the iterations of Algorithm 5 and Algorithm 6 whenever we move onto the next subinterval and the numbers of iterations for each of them change.

Our constraint on $\xi$ and $\zeta$ is that the computation of $mn+s$ should not overflow, so suppose that there is a given limit $N_{\max}$ on how large $mn+s$ can be. Now, take a subinterval $[\zeta_{\min},\zeta_{\max})$ as described above. If there is at least one feasible choice of $\xi$ for some $\zeta\in[\zeta_{\min},\zeta_{\max})$, then such a $\xi$ must lie in the interval

$$I:=\left(\frac{\lfloor n_{L,0}x\rfloor-\zeta_{\max}}{n_{L,0}},\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_{\min}}{n_{U,0}}\right),$$

where $n_{L,0}$ and $n_{U,0}$ are the supposed outputs of Algorithm 5 and Algorithm 6, respectively, which must stay constant as long as $\zeta\in[\zeta_{\min},\zeta_{\max})$.

Next, we will loop over all numbers of the form $\frac{m}{2^k}$ in $I$ and check one by one whether each is really a possible choice of $\xi$. The order of examination will be the lexicographic order on $(k,m)$; that is, from smaller $k$ to larger $k$, and for a given $k$, from smaller $m$ to larger $m$. To find the smallest $k$, let $\Delta$ be the size of the interval above, that is,

$$\Delta:=\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_{\min}}{n_{U,0}}-\frac{\lfloor n_{L,0}x\rfloor-\zeta_{\max}}{n_{L,0}}.$$

Define $k_0:=\left\lfloor\log_2\frac{1}{\Delta}\right\rfloor$; then we have

$$2^{k_0}\Delta\leq1<2^{k_0+1}\Delta\leq2,$$

and this means that there is at most one integer in the interval $2^{k_0}I$, while there is at least one integer in the interval $2^{k_0+1}I$. Therefore, we first see if there is one in $2^{k_0}I$, and if not, we have a look at $2^{k_0+1}I$, where the existence is guaranteed. Also, note that if there were none in $2^{k_0}I$, then $2^{k_0+1}I$ should have exactly one integer, and that integer must be odd.

In either case, we can find the smallest possible integer $k$ such that

$$\xi=\frac{m}{2^k}$$

is in the interval $I$ for some $m\in\mathbb{Z}$. (When $2^{k_0}I$ does not contain an integer, then $k=k_0+1$, and if it does contain one, then $k=k_0-b$ where $b$ is the highest exponent of $2$ such that $2^b$ divides the unique integer in $2^{k_0}I$.) We will start from there and successively increase the value of $k$. For the current $k$, we loop over all $m$ such that $\xi=\frac{m}{2^k}$ stays inside $I$. (By construction, there is exactly one possible $m$ for the smallest $k$ which we start from.) In order for this $\xi$ to be allowed by a choice of $\zeta$, according to Algorithm 5, $\zeta$ should be at least

$$\zeta_0:=\lfloor n_{L,0}x\rfloor-n_{L,0}\xi=\frac{2^k\lfloor n_{L,0}x\rfloor-n_{L,0}m}{2^k}.$$

Since we want our $\xi$ to satisfy the upper bound given by Algorithm 6 as well, we want to choose $\zeta$ to be as small as possible subject to $\zeta\geq\zeta_0$. Hence, we want to take $\zeta=\zeta_0$, but this is not always possible, because $\zeta_0\geq\zeta_{\min}$ is not always true. (The above $\zeta_0$ can even be negative, so in the actual algorithm we may truncate it to be nonnegative.)

If $\zeta_0\geq\zeta_{\min}$ is satisfied, then we may indeed take $\zeta=\zeta_0$. By construction, $\zeta_0<\zeta_{\max}$ is automatically satisfied, so we want to check two things: (1) the smallness constraint $mn_{\max}+s\leq N_{\max}$ with $s=2^k\zeta_0$, and (2) the true upper bound

$$\xi<\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_0}{n_{U,0}}$$

given by Algorithm 6. If both of the conditions are satisfied, then $k$, $m$, and $s=2^k\zeta_0$ form a valid solution. Otherwise, we conclude that $\xi$ does not yield an admissible choice of $(k,m,s)$.

If $\zeta_0<\zeta_{\min}$, then we have to increase the numerator of $\zeta_0$ to put it inside $[\zeta_{\min},\zeta_{\max})$, so let $\zeta$ be the number obtained by adding to the numerator of $\zeta_0$ the smallest positive integer that ensures $\zeta\geq\zeta_{\min}$. In this case, since we cannot easily estimate how large $\zeta$ is compared to $\zeta_0$, we have to check $\zeta<\zeta_{\max}$ in addition to the other two conditions.

If we fail to find any admissible choice of $(k,m,s)$ with our $\xi$, then we increase the numerator $m$ and see if that works out, until $\xi=\frac{m}{2^k}$ goes outside of the search interval $I$. After exhausting all $m$'s, we increase $k$ by one and repeat the same procedure. Note that even though all the $\xi=\frac{m}{2^k}$'s with even numerators were already considered when we were looking at a smaller $k$, we do not exclude them from the search, because in the case $\zeta_0<\zeta_{\min}$, having a larger denominator for $\zeta$ might allow the gap $\zeta-\zeta_0$ to be smaller.

This loop over $k$ should be stopped when we arrive at a point where the smallest integer $m$ in $2^kI$ already fails to satisfy

$$mn_{\max}+2^k\zeta_0\leq N_{\max}.$$

If we exhaust all $k$'s satisfying the above, then we conclude that there is no $\zeta\in[\zeta_{\min},\zeta_{\max})$ yielding an admissible choice of $(k,m,s)$, so we should move onto the next subinterval.

After filling in some omitted details, we arrive at the following algorithm.

Algorithm 7 (Finding feasible values of $\xi$ and $\zeta$).

Input: $x\in\mathbb{R}$, $n_{\max}\in\mathbb{Z}_{>0}$, $N_{\max}\in\mathbb{Z}_{>0}$.

Output: $k$, $m$, and $s$, where $\xi=\frac{m}{2^k}$ and $\zeta=\frac{s}{2^k}$, so that we have $\lfloor nx\rfloor=\left\lfloor\frac{nm+s}{2^k}\right\rfloor$ for all $n=1,\dots,n_{\max}$.

  1. Find the largest $n=1,\dots,n_{\max}$ that maximizes $\frac{\lfloor nx\rfloor}{n}$ and call it $n_{L,0}$.
  2. Find the smallest $n=1,\dots,n_{\max}$ that minimizes $\frac{\lfloor nx\rfloor+1}{n}$ and call it $n_{U,0}$.
  3. Set $\zeta_{\max}\leftarrow0$, $\zeta_{L,\max}\leftarrow0$, $\zeta_{U,\max}\leftarrow0$, $n_{L,1}\leftarrow0$, $n_{U,1}\leftarrow0$.
  4. Check if $\zeta_{\max}=\zeta_{L,\max}$. If that is the case, then we have to update $\zeta_{L,\max}$. Set $n_{L,0}\leftarrow n_{L,0}+n_{L,1}$. If $n_{L,0}=n_{\max}$, then set $\zeta_{L,\max}\leftarrow1$. Otherwise, find the largest $n=1,\dots,n_{\max}-n_{L,0}$ that maximizes $\frac{\lfloor nx\rfloor}{n}$, assign it to $n_{L,1}$, and set $\zeta_{L,\max}\leftarrow\min\left(\frac{n_{L,1}\lfloor n_{L,0}x\rfloor-n_{L,0}\lfloor n_{L,1}x\rfloor}{n_{L,1}},1\right)$.
  5. Check if $\zeta_{\max}=\zeta_{U,\max}$. If that is the case, then we have to update $\zeta_{U,\max}$. Set $n_{U,0}\leftarrow n_{U,0}-n_{U,1}$. If $n_{U,0}=1$, then set $\zeta_{U,\max}\leftarrow1$. Otherwise, find the largest $n=1,\dots,n_{U,0}-1$ that maximizes $\frac{\lfloor nx\rfloor}{n}$, assign it to $n_{U,1}$, and set $\zeta_{U,\max}\leftarrow\min\left(\frac{n_{U,1}\left(\lfloor n_{U,0}x\rfloor+1\right)-n_{U,0}\lfloor n_{U,1}x\rfloor}{n_{U,1}},1\right)$.
  6. Set $\zeta_{\min}\leftarrow\zeta_{\max}$, $\zeta_{\max}\leftarrow\min(\zeta_{L,\max},\zeta_{U,\max})$, and $\Delta\leftarrow\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_{\min}}{n_{U,0}}-\frac{\lfloor n_{L,0}x\rfloor-\zeta_{\max}}{n_{L,0}}$. If $\Delta\leq0$, then this means the search interval $I$ is empty; in this case, go to Step 16. Otherwise, set $k\leftarrow\left\lfloor\log_2\frac{1}{\Delta}\right\rfloor$ and proceed to the next step.
  7. Compute $m\leftarrow\left\lfloor\frac{2^k\left(\lfloor n_{L,0}x\rfloor-\zeta_{\max}\right)}{n_{L,0}}\right\rfloor+1$ and inspect the inequality $m<\frac{2^k\left(\lfloor n_{U,0}x\rfloor+1-\zeta_{\min}\right)}{n_{U,0}}$. If it does not hold, then set $k\leftarrow k+1$ and recompute $m$ accordingly. Otherwise, set $k\leftarrow k-b$ where $b$ is the greatest integer such that $2^b$ divides $m$, and set $m\leftarrow m/2^b$.
  8. Set $\zeta_0\leftarrow\max\left(\frac{2^k\lfloor n_{L,0}x\rfloor-n_{L,0}m}{2^k},0\right)$ and inspect the inequality $mn_{\max}+2^k\zeta_0\leq N_{\max}$. If it does not hold, then this means every candidate we could further find in the interval $[\zeta_{\min},\zeta_{\max})$ consists of too large numbers, so we move onto the next subinterval; in this case, go to Step 16. Otherwise, proceed to the next step.
  9. If $\zeta_0\geq\zeta_{\min}$, then $\zeta_0\in[\zeta_{\min},\zeta_{\max})$ and $mn_{\max}+2^k\zeta_0\leq N_{\max}$ hold, so it is enough to check the inequality $\frac{m}{2^k}<\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_0}{n_{U,0}}$. If this holds, then we have found an admissible choice of $(k,m,s)$, so return. Otherwise, we conclude that $\xi=\frac{m}{2^k}$ does not yield an admissible answer; in this case, go to Step 13.
  10. If $\zeta_0<\zeta_{\min}$, then set $a\leftarrow\left\lceil2^k(\zeta_{\min}-\zeta_0)\right\rceil$ and $\zeta\leftarrow\zeta_0+\frac{a}{2^k}$ so that $\zeta\geq\zeta_{\min}$ is satisfied. Check if $\zeta<\zeta_{\max}$ holds, and if that is not the case, then go to Step 13.
  11. Check if $mn_{\max}+2^k\zeta\leq N_{\max}$ holds. If that is not the case, then go to Step 13.
  12. Inspect the inequality $\frac{m}{2^k}<\frac{\lfloor n_{U,0}x\rfloor+1-\zeta}{n_{U,0}}$. If it does hold, then we have found an admissible choice of $(k,m,s)$, so return. Otherwise, proceed to the next step.
  13. Set $m\leftarrow m+1$ and inspect the inequality $\frac{m}{2^k}<\frac{\lfloor n_{U,0}x\rfloor+1-\zeta_{\min}}{n_{U,0}}$. If it does not hold, then go to Step 15.
  14. If it does hold, then set $\zeta_0\leftarrow\max\left(\frac{2^k\lfloor n_{L,0}x\rfloor-n_{L,0}m}{2^k},0\right)$ and inspect the inequality $mn_{\max}+2^k\zeta_0\leq N_{\max}$. If it does hold, then go back to Step 9. Otherwise, proceed to the next step.
  15. Set $k\leftarrow k+1$ and $m\leftarrow\left\lfloor\frac{2^k\left(\lfloor n_{L,0}x\rfloor-\zeta_{\max}\right)}{n_{L,0}}\right\rfloor+1$, and go to Step 8.
  16. We have failed to find any admissible choice of $(k,m,s)$ so far, so we have to move onto the next subinterval. If $\zeta_{\max}=1$, then we have already exhausted the whole interval, so return with FAIL. Otherwise, go back to Step 4.

Remarks.

  1. Whenever we update $\zeta_{L,\max}$ and $\zeta_{U,\max}$, the updated values must always be strictly bigger than the previous values unless they have already reached their highest possible value $1$. Let us see the case of $\zeta_{L,\max}$; the case of $\zeta_{U,\max}$ is similar. Suppose that initially we had $n_{L,0}=n_0$, $n_{L,1}=n_1$, and after recomputing $n_{L,1}$ we got $n_{L,1}=n_2$. Then the new and the old values of $\zeta_{L,\max}$ are respectively given as $$\frac{n_2\lfloor(n_0+n_1)x\rfloor-(n_0+n_1)\lfloor n_2x\rfloor}{n_2},\quad\frac{n_1\lfloor n_0x\rfloor-n_0\lfloor n_1x\rfloor}{n_1}.$$ Applying $(25)$, it turns out that the first one minus the second one is equal to $$(n_0+n_1)\left(\frac{\lfloor n_1x\rfloor}{n_1}-\frac{\lfloor n_2x\rfloor}{n_2}\right).$$ Now, recall the way $n_1$ is chosen: we first find the best rational approximation of $x$ from below with denominator in the range $\{1,\dots,n_{\max}-n_0\}$, call its denominator $q$, and set $n_1$ to be the largest multiple of $q$ in that range. Since $n_1$ is the largest such multiple, $n_{\max}-n_0-n_1$ must be strictly smaller than $q$. Therefore, the best rational approximation from below with denominator in the range $\{1,\dots,n_{\max}-n_0-n_1\}$ must be strictly worse than what $q$ gives, so $\frac{\lfloor n_1x\rfloor}{n_1}>\frac{\lfloor n_2x\rfloor}{n_2}$. This shows that the difference above is strictly positive.

  2. When there indeed exists at least one admissible choice of $(k,m,s)$ for given $x$, $n_{\max}$, and $N_{\max}$, Algorithm 7 finds the triple $(k,m,s)$ with the smallest $k$, then $m$, and then $s$, within the first subinterval $[\zeta_{\min},\zeta_{\max})$ where a solution can be found. To make it always find the $(k,m,s)$ with the smallest $k$, $m$, and $s$ among all solutions, we can simply modify the algorithm a little so that it does not immediately stop when it finds a solution, but rather continues to the other subintervals and then compares any solutions found there with the previous best one.

An actual implementation of this algorithm can be found here. (Click here to test it live.)

Results by Lemire et al.

Consider the special case where $x=\frac{1}{q}$ with $q\leq n_{\max}$, and $\xi=\zeta$. In this case, we have the following result:

Theorem 8 (Lemire et al., 2021).

For any positive integers $q$ and $n_{\max}$ with $q\leq n_{\max}$, we have

$$\left\lfloor\frac{n}{q}\right\rfloor=\lfloor(n+1)\xi\rfloor$$

for all $n=0,\dots,n_{\max}$ if and only if

$$\left(1-\frac{1}{\lfloor n_{\max}/q\rfloor q+1}\right)\frac{1}{q}\leq\xi<\frac{1}{q}.$$

We can give an alternative proof of this fact using what we have developed so far. Essentially, it boils down to the fact that Algorithm 5 finishes its iteration at the first step and does not proceed further.

Proof. ($\Rightarrow$) Take $n=q-1$; then we have $0=\lfloor q\xi\rfloor$, thus $\xi<\frac{1}{q}$ follows. On the other hand, take $n=\lfloor n_{\max}/q\rfloor q$; then we have

$$\left\lfloor\frac{n_{\max}}{q}\right\rfloor\leq\left(\left\lfloor\frac{n_{\max}}{q}\right\rfloor q+1\right)\xi,$$

so rearranging this gives the desired lower bound on $\xi$.

($\Leftarrow$) It is enough to show that $\xi$ satisfies

$$\max_{n=1,\dots,n_{\max}}\frac{\lfloor n/q\rfloor-\xi}{n}\leq\xi<\min_{n=1,\dots,n_{\max}}\frac{\lfloor n/q\rfloor+1-\xi}{n}.$$

Following Algorithm 5, let $n_0$ be the largest maximizer of $\frac{\lfloor n/q\rfloor}{n}$, i.e., $n_0=\lfloor n_{\max}/q\rfloor q$. Then we know from Algorithm 5 that $n_0$ is the largest maximizer of $\frac{\lfloor n/q\rfloor-\xi}{n}$ if and only if

$$\frac{\lfloor n/q\rfloor}{n}<\frac{\lfloor n_0/q\rfloor-\xi}{n_0}=\frac{1}{q}-\frac{\xi}{\lfloor n_{\max}/q\rfloor q}$$

holds for all $n=1,\dots,n_{\max}-n_0$. Pick any such $n$ and let $r$ be the remainder of $n/q$; then we have

$$\frac{\lfloor n/q\rfloor}{n}=\frac{(n-r)/q}{n}=\frac{1}{q}-\frac{r}{nq}.$$

Hence, we claim that

$$\frac{\xi}{\lfloor n_{\max}/q\rfloor}<\frac{r}{n}$$

holds for all such $n$. This is clear, because $\xi<\frac{1}{q}$ while $n\leq n_{\max}-n_0$ is at most $q-1$, so the right-hand side is at least $\frac{1}{q-1}$. Therefore, $n_0$ is indeed the largest maximizer of $\frac{\lfloor n/q\rfloor-\xi}{n}$ for $n=1,\dots,n_{\max}$, and the inequality

$$\frac{\lfloor n_0/q\rfloor-\xi}{n_0}\leq\xi$$

is equivalent to

$$\xi\geq\frac{\lfloor n_0/q\rfloor}{n_0+1}=\frac{1}{q}\left(1-\frac{1}{n_0+1}\right),$$

which we already know.

For the upper bound, note that it is enough to show that

$$\frac{1}{q}\leq\frac{\lfloor n/q\rfloor+1}{n+1}$$

holds for all $n=1,\dots,n_{\max}$. We can rewrite this inequality as

$$n-q\left\lfloor\frac{n}{q}\right\rfloor\leq q-1,$$

which means nothing but that the remainder of $n$ divided by $q$ is at most $q-1$, which is trivially true.

Note that the fact that the numerator of $x$ is $1$ is crucially used in this proof.
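For small parameters, Theorem 8 can also be confirmed by brute force. The check below is my own; restricting $\xi$ to the grid $m/2^{16}$ is a simplification (the theorem is stated for arbitrary real $\xi$, but testing grid points is enough to exercise both directions of the equivalence):

```cpp
#include <cassert>
#include <cstdint>

// Brute-force check of Theorem 8: for each xi = m/2^16, the defining identity
// floor(n/q) == floor((n+1)*xi) for all n = 0..n_max should hold exactly when
// xi lies in the theorem's predicted interval.
bool check_theorem8(std::int64_t q, std::int64_t n_max) {
    const std::int64_t den = 1 << 16;
    const std::int64_t v = (n_max / q) * q + 1;  // floor(n_max/q)*q + 1
    for (std::int64_t m = 0; m < den; ++m) {
        // Direct test of the identity for xi = m/den.
        bool works = true;
        for (std::int64_t n = 0; n <= n_max && works; ++n)
            works = (n / q == (n + 1) * m / den);
        // Theorem 8: (1 - 1/v)/q <= m/den < 1/q, cross-multiplied.
        bool predicted = (v - 1) * den <= m * v * q && m * q < den;
        if (works != predicted) return false;
    }
    return true;
}
```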

Lemire et al. also showed that, if $n_{\max}=2^N-1$ and $q$ is not a power of $2$, then whenever the best magic constant predicted by Theorem 2 does not fit into a word, the best magic constant predicted by the above theorem does fit into a word. In fact, those assumptions about being a power of $2$ or not can be removed, as shown below.

Theorem 9 (Improves Lemire et al.)

Let $x$ be a positive real number and let $u,v$ be positive integers. Let $k$ be the smallest integer such that the set

$$\left[2^kx,\,2^kx\left(1+\frac{1}{v}\right)\right)\cap\mathbb{Z}$$

is not empty. If $2^kx\geq\max(u,v)$, then the set

$$\left[2^{k-1}x\left(1-\frac{1}{u}\right),\,2^{k-1}x\right)\cap\mathbb{Z}$$

must be nonempty as well.

Thus, if we let $x=\frac{1}{q}$, take $v$ to be the $v$ in Theorem 2, and set $u:=\lfloor n_{\max}/q\rfloor q+1$, then whenever the best magic constant $m=\left\lceil\frac{2^k}{q}\right\rceil$ happens to be greater than or equal to $n_{\max}$, we always have some integer $m'$ with

$$\frac{2^{k-1}}{q}\left(1-\frac{1}{u}\right)\leq m'<\frac{2^{k-1}}{q}.$$

Recall that we have seen in a previous section that $\frac{2^k}{q}<2v$ always holds, so the right-hand side of the above is bounded by $v\leq n_{\max}$, so indeed this shows that the best magic constant predicted by Theorem 8 is bounded by $n_{\max}$ in this case.

Proof. By the definition of $k$, we must have

$$\left[2^{k-1}x,\,2^{k-1}x\left(1+\frac{1}{v}\right)\right)\cap\mathbb{Z}=\varnothing.$$

Let

$$\alpha:=\left\lceil2^{k-1}x\right\rceil-2^{k-1}x\in[0,1);$$

then we must have

$$2^{k-1}x+\alpha\geq2^{k-1}x\left(1+\frac{1}{v}\right),$$

thus

$$\alpha\geq\frac{2^{k-1}x}{v}.$$

Now, note that

$$2^{k-1}x\left(1-\frac{1}{u}\right)=\left\lceil2^{k-1}x\right\rceil-\alpha-\frac{2^{k-1}x}{u},$$

so it is enough to show that

$$\alpha+\frac{2^{k-1}x}{u}\geq1.$$

Note that from the inequality $\alpha\geq\frac{2^{k-1}x}{v}$ above, we know

$$\alpha+\frac{2^{k-1}x}{u}\geq\frac{2^{k-1}x}{v}+\frac{2^{k-1}x}{u}\geq\frac{2^kx}{\max(u,v)},$$

and the right-hand side is bounded below by $1$ by the assumption, so we are done.

Therefore, Theorem 2 and Theorem 8 are enough to cover all cases where $x=\frac{1}{q}$ with $q\leq n_{\max}$. Does Algorithm 7 then have any relevance in practice?

An example usage of Algorithm 7

I do not know. I mean, as I pointed out before, I am not sure how important it is in general to cover the case $x=\frac{p}{q}$ with $p\neq1$. But anyway, when $p\neq1$ there certainly are some cases where Algorithm 7 might be useful. Consider the following example:

std::uint32_t div(std::uint32_t n) {
    return (std::uint64_t(n) * 7) / 18;
}

Here I provide three different ways of doing this computation:

div1(unsigned int):
        mov     ecx, edi
        lea     rax, [8*rcx]
        sub     rax, rcx
        movabs  rcx, 1024819115206086201
        mul     rcx
        mov     rax, rdx
        ret
div2(unsigned int):
        mov     eax, edi
        movabs  rcx, 7173733806472429568
        mul     rcx
        mov     rax, rdx
        ret
div3(unsigned int):
        mov     ecx, edi
        mov     eax, 3340530119
        imul    rax, rcx
        add     rax, 477218588
        shr     rax, 33
        ret

(Check it out!)

The first version (div1) is what my compiler (clang) generated. It first computes n * 7 by computing n * 8 - n, and then it performs the division by 18 using the multiply-and-shift technique. In particular, it deliberately chose the shifting amount to be 64 so that it can skip the shift instruction entirely. Clang is indeed quite clever here, in the sense that it actually leverages the fact that n is only a 32-bit integer: the minimum amount of shifting required for division by 18 is 68 if all 64-bit dividends are considered.

But it is not clever enough to realize that the two operations * 7 and / 18 can be merged. The second version div2 is what one can get by utilizing this fact and doing the two computations in one multiply-and-shift. Using Theorem 2, it can be shown that the minimum amount of shifting for doing this with all 32-bit dividends is 36, and in that case the corresponding magic number is 26724240953. Since the computation cannot be done in 64-bit anyway, I deliberately multiplied this magic number by $2^{28}$, obtaining 7173733806472429568, to make the shifting amount equal to 64 so that the actual shift instruction is not needed. The result is a clear win over div1, although lea and sub are trivial instructions and the difference is not huge.
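As a quick sanity check on the numbers in this paragraph (my own snippet; unsigned __int128 is a GCC/Clang extension, which is also how the 128-bit product of the mul instruction is usually accessed from C++), one can confirm that the scaled magic constant matches and that taking the high 64 bits of the product reproduces $\lfloor 7n/18\rfloor$:

```cpp
#include <cassert>
#include <cstdint>

// div2: the magic constant for the merged "multiply by 7, divide by 18" with
// shift 36 is 26724240953; scaling it by 2^28 moves the total shift up to 64,
// so only the high half of the 128-bit product is needed.
std::uint64_t div2(std::uint32_t n) {
    const std::uint64_t magic = 26724240953ull << 28;  // 7173733806472429568
    return static_cast<std::uint64_t>(
        (static_cast<unsigned __int128>(n) * magic) >> 64);
}
```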

Now, if we do this with multiply-add-and-shift, we can indeed complete our computation inside 64 bits, which is the third version div3. Using Algorithm 7, it turns out that $(k,m,s)=(33,3340530119,477218588)$ works for all 32-bit dividends, i.e., we have

$$\left\lfloor\frac{7n}{18}\right\rfloor=\left\lfloor\frac{3340530119n+477218588}{2^{33}}\right\rfloor$$

for all $n=0,\dots,2^{32}-1$. The maximum possible value of the numerator $3340530119n+477218588$ is strictly less than $2^{64}$ in this case. Hence, we get div3.
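This identity can be spot-checked directly; the snippet below is my own quick test (an exhaustive loop over all $2^{32}$ inputs also works, it is just slow for a demo), with the stride chosen arbitrarily:

```cpp
#include <cassert>
#include <cstdint>

// Spot-check of the multiply-add-and-shift identity
//   floor(7*n/18) == (3340530119*n + 477218588) >> 33
// for 32-bit n: the extremes of the range plus a coarse prime stride.
// This is a sanity check, not a proof.
std::uint64_t div3(std::uint32_t n) {
    return (3340530119ull * n + 477218588ull) >> 33;
}

bool spot_check() {
    if (div3(0) != 0) return false;
    if (div3(0xffffffffu) != 7ull * 0xffffffffu / 18) return false;
    for (std::uint64_t n = 0; n <= 0xffffffffu; n += 32749)
        if (div3(static_cast<std::uint32_t>(n)) != 7 * n / 18) return false;
    return true;
}
```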

It is hard to tell which of div2 and div3 will be faster. Note that div3 has one more instruction than div2, but it does not invoke the 128-bit multiplication (the mul instruction found in both div1 and div2). In typical modern x86-64 CPUs, 128-bit multiplication uses twice as many computational resources as the 64-bit one, so it might more easily become a bottleneck if there are many multiplications to be done. Also, as far as I know, Intel CPUs still do not provide SIMD instructions for 128-bit multiplication, so div3 could be more aggressively vectorized, which may result in a massive speedup. Still, all of this is just speculation, and of course, when it comes to performance, there is nothing we can be sure about without measuring. Anyway, it is good to have multiple options to try out.
