How to quickly factor out a constant factor from integers

This post revisits the topic of integer division, building upon the discussion in the previous post. Specifically, I’ll delve into removing trailing zeros in the decimal representation of an input integer, or more broadly, factoring out the highest power of a given constant that divides the input. This exploration stems from the problem of converting floating-point numbers into strings, where certain contemporary algorithms, such as Schubfach and Dragonbox, may yield outputs containing trailing zeros.

Here is the precise statement of our problem:

Given positive integer constants $n_{\max}$ and $q \leq n_{\max}$, how to find the largest integer $k$ such that $q^{k}$ divides $n$, together with the corresponding $n/q^{k}$, for any $n = 1, \cdots, n_{\max}$?

As one might expect, this more or less boils down to an efficient divisibility test algorithm. However, merely testing for divisibility is not enough: we also need to actually divide the input by the divisor once we know the input is indeed divisible.

Naïve algorithm

Here is the most straightforward implementation I can think of:

std::size_t s = 0;
while (true) {
    auto const r = n % q;
    if (r == 0) {
        n /= q;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};

Of course, this works (as long as $n \neq 0$, which we assume throughout), but obviously our objective is to explore avenues for greater performance. Here, we are assuming that the divisor q is a given constant, so any sane modern compiler knows that the dreaded generic integer division is not necessary. Rather, it will replace the division with a multiply-and-shift, or some slight variation of it, as explained in the previous post. That is great, but note that we need to compute both the quotient and the remainder. As far as I know, there is no known algorithm capable of computing both the quotient and the remainder with only one multiplication, which means that the above code will perform two multiplications per iteration. However, if we consider cases where the input is not divisible by the divisor, we realize that we don’t actually require the precise values of the quotient or the remainder. Our sole concern is whether the remainder is zero, and only if that is the case do we want to know the quotient of the division. Therefore, it’s conceivable that we could accomplish this with just one multiplication per iteration, which presumably will improve the performance.

Actually, the classical paper by Granlund-Montgomery already presented such an algorithm, which is the topic of the next section.

Granlund-Montgomery modular inverse algorithm

First, let us assume for a while that $q$ is an odd number. Assume further that our input is of a $b$-bit unsigned integer type, e.g. of type std::uint32_t with $b = 32$. Hence, we are assuming that $n_{\max}$ is at most $2^{b} - 1$. Now, since $q$ is coprime to $2^{b}$, there uniquely exists the modular inverse of $q$ with respect to $2^{b}$. Let us call it $m$; in general, it can be found using the extended Euclidean algorithm. But how is it useful to us?

The key observation here is that, on the group $\mathbb{Z}/2^{b}$ of integer residue classes modulo $2^{b}$, multiplication by any integer coprime to $2^{b}$ is an automorphism, i.e., a bijective group homomorphism onto itself. In particular, multiplication by such an integer induces a bijection from the set $\{0, 1, \cdots, 2^{b} - 1\}$ onto itself.

Now, what does this bijection, defined in particular by the modular inverse $m$ (which must be coprime to $2^{b}$), do to multiples of $q$, i.e., $0, q, \cdots, \left\lfloor \frac{2^{b}-1}{q} \right\rfloor q$? Note that for any integer $a$,

$$(aq)m \equiv a(qm) \equiv a\ (\mathrm{mod}\ 2^{b}), \tag{1}$$

thus, the bijection $\{0, 1, \cdots, 2^{b}-1\} \to \{0, 1, \cdots, 2^{b}-1\}$ defined by $m$ maps $aq$ into $a$. Therefore, anything that gets mapped into $\left\{0, 1, \cdots, \left\lfloor \frac{2^{b}-1}{q} \right\rfloor\right\}$ must be a multiple of $q$, because the map is a bijection, and vice versa.

Furthermore, the image of $n$ under this bijection, in the case when the input $n$ was actually a multiple of $q$, is precisely $n/q$. Since multiplication modulo $2^{b}$ is what C/C++ abstract machines are supposed to do whenever they see a multiplication of unsigned integers, we find that the following code does the job we want, but with only one multiplication per iteration:

// m and threshold_value are precalculated from q.
// m is the modular inverse of q with respect to 2^b, and
// threshold_value is (0xff....ff / q) + 1.
std::size_t s = 0;
while (true) {
    auto const r = n * m;
    if (r < threshold_value) {
        n = r;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};
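
For concreteness, here is a minimal sketch (my own illustration, not code from the Granlund-Montgomery paper) of how these two constants can be computed at compile time for an odd divisor. Instead of the extended Euclidean algorithm, it uses Newton's iteration for 2-adic inverses, which doubles the number of correct low bits per step; the concrete numbers in the static_asserts are for q = 5 and b = 32:

#include <cstdint>

// Modular inverse of an odd number q with respect to 2^32, via Newton's iteration:
// if x * q == 1 (mod 2^k), then (x * (2 - x * q)) * q == 1 (mod 2^(2k)).
constexpr std::uint32_t modular_inverse_pow2_32(std::uint32_t q) {
    std::uint32_t x = q;  // correct mod 2^3 for any odd q
    x *= 2 - x * q;       // correct mod 2^6
    x *= 2 - x * q;       // correct mod 2^12
    x *= 2 - x * q;       // correct mod 2^24
    x *= 2 - x * q;       // correct mod 2^48, hence mod 2^32 here
    return x;
}

// Example: q = 5, b = 32.
constexpr std::uint32_t m = modular_inverse_pow2_32(5);
constexpr std::uint32_t threshold_value = 0xffffffffu / 5 + 1;
static_assert(m == 3435973837u);              // 5 * 3435973837 == 1 (mod 2^32)
static_assert(threshold_value == 858993460u);
static_assert(std::uint32_t(35u * m) == 7u);              // 35 is a multiple of 5 and maps to its quotient 7
static_assert(std::uint32_t(37u * m) >= threshold_value); // 37 is not a multiple of 5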

All is good so far, but recall that this works only when $q$ is odd. Unfortunately, our main application, trailing zero removal, requires $q = 10$, which is even.

Actually, the aforementioned paper by Granlund-Montgomery proposes a clever solution for the case of possibly non-odd divisors, based on bit-rotation. Let us write $q = 2^{t} q_{0}$ for an integer $t$ and a positive odd number $q_{0}$. Obviously, there is no modular inverse of $q$ with respect to $2^{b}$, but we can try the next best thing, that is, the modular inverse of $q_{0}$ with respect to $2^{b-t}$. Let us call it $m$.

Now, for a given $n = 0, 1, \cdots, n_{\max} \leq 2^{b} - 1$, let us consider the multiplication $nm$ modulo $2^{b}$ and then perform a bit-rotate-to-right on it by $t$ bits. Let us call the result $r$. Clearly, the highest $t$ bits of $r$ are all zero if and only if the lowest $t$ bits of $(nm\ \mathrm{mod}\ 2^{b})$ are all zero, and since $m$ must be an odd number, this is the case if and only if the lowest $t$ bits of $n$ are all zero, that is, $n$ is a multiple of $2^{t}$.

We claim that $n$ is a multiple of $q$ if and only if $r \leq \left\lfloor \frac{2^{b}-1}{q} \right\rfloor = \left\lfloor \frac{2^{b-t}-1}{q_{0}} \right\rfloor$ holds. By the argument from the previous paragraph, $r$ must be strictly larger than $\left\lfloor \frac{2^{b-t}-1}{q_{0}} \right\rfloor$ if $n$ is not a multiple of $2^{t}$, since $\left\lfloor \frac{2^{b-t}-1}{q_{0}} \right\rfloor$ is of at most $(b-t)$ bits while the highest $t$ bits of $r$ are not all zero in that case. Thus, we only need to care about the case when $n$ is a multiple of $2^{t}$. In that case, write $n = 2^{t} n_{0}$; then $r$ is in fact equal to

$$\frac{(nm\ \mathrm{mod}\ 2^{b})}{2^{t}} = \frac{nm - \lfloor nm/2^{b} \rfloor 2^{b}}{2^{t}} = n_{0}m - \left\lfloor \frac{n_{0}m}{2^{b-t}} \right\rfloor 2^{b-t} = (n_{0}m\ \mathrm{mod}\ 2^{b-t}), \tag{2}$$

since the lowest $t$ bits of $(nm\ \mathrm{mod}\ 2^{b})$ are all zero. By the discussion of the case when $q$ is odd, we know that $n_{0}$ is a multiple of $q_{0}$ if and only if $(n_{0}m\ \mathrm{mod}\ 2^{b-t})$ is at most $\left\lfloor \frac{2^{b-t}-1}{q_{0}} \right\rfloor$, thus finishing the proof of the claim.

Of course, when $n$ is really a multiple of $q$, the resulting $r$ is precisely $n/q$, for the same reason as in the case of odd $q$. Consequently, we obtain the following version which works for all $q$:

// t, m and threshold_value are precalculated from q.
// t is the number of trailing zero bits of q,
// m is the modular inverse of (q >> t) with respect to 2^(b-t), and
// threshold_value is (0xff....ff / q) + 1.
std::size_t s = 0;
while (true) {
    auto const r = std::rotr(n * m, t); // C++20 <bit>
    if (r < threshold_value) {
        n = r;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};
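
As a quick sanity check (my own numbers, verified by hand rather than taken from the paper), here are the constants for q = 10 and b = 32, so that t = 1 and q0 = 5:

#include <bit>
#include <cstdint>

// Constants for q = 10, b = 32: t = 1, q0 = 5.
// m is the modular inverse of 5 with respect to 2^31, since 5 * 1288490189 == 3 * 2^31 + 1.
constexpr int t = 1;
constexpr std::uint32_t m = 1288490189u;
constexpr std::uint32_t threshold_value = 0xffffffffu / 10 + 1; // 429496730

static_assert(std::rotr(std::uint32_t(30u * m), t) == 3u);              // 30 is a multiple of 10; the result is its quotient 3
static_assert(std::rotr(std::uint32_t(25u * m), t) >= threshold_value); // 25 is not a multiple of 10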

Lemire’s algorithm

In a paper published in 2019, Lemire et al. proposed an alternative way of checking divisibility which has some advantages over the Granlund-Montgomery algorithm. The theorem behind this algorithm was not optimal when first presented, but later they showed the optimal result in another paper. Here, I present a more general result which contains the result by Lemire et al. as a special case. This is one of the theorems I proved in my paper on Dragonbox (Theorem 4.6), which I copy below:

Theorem 1. Let $x, \xi$ be real numbers and $n_{\max}$ be a positive integer such that

$$\lfloor nx \rfloor = \lfloor n\xi \rfloor \tag{3}$$

holds for all $n = 1, \cdots, n_{\max}$. Then, for a positive real number $\eta$, we have the following.

  1. If $x = \frac{p}{q}$ is a rational number with $2 \leq q \leq n_{\max}$, then we have

    $$\left\{ n \in \{1, \cdots, n_{\max}\} \colon nx \in \mathbb{Z} \right\} = \left\{ n \in \{1, \cdots, n_{\max}\} \colon n\xi - \lfloor n\xi \rfloor < \eta \right\} \tag{4}$$

    if and only if

    $$\left\lfloor \frac{n_{\max}}{q} \right\rfloor q(\xi - x) < \eta \leq u(\xi - x) + \frac{1}{q} \tag{5}$$

    holds, where $u$ is the smallest positive integer such that $up \equiv 1\ (\mathrm{mod}\ q)$.

  2. If $x$ is either an irrational number or a rational number with the denominator strictly greater than $n_{\max}$, then we have

    $$\left\{ n \in \{1, \cdots, n_{\max}\} \colon nx \in \mathbb{Z} \right\} = \emptyset = \left\{ n \in \{1, \cdots, n_{\max}\} \colon n\xi - \lfloor n\xi \rfloor < \eta \right\} \tag{6}$$

    if and only if

    $$\eta \leq q\xi - p \tag{7}$$

    holds, where $\frac{p}{q}$ is the best rational approximation of $x$ from below with the largest denominator $q \leq n_{\max}$.

The theorem above shows a necessary and sufficient condition for the product $nx$ to be an integer in terms of a good enough approximation $\xi$ of $x$. The second item is irrelevant in this post, so we only focus on the first item. The point here is that we are going to let $x = \frac{1}{q}$ so that $nx$ is an integer if and only if $n$ is a multiple of $q$. In that case, it is shown in the paper (Remark 2 after Theorem 4.6) that we can always take $\xi = \eta$ as long as $\xi$ satisfies the condition

$$\left\lfloor \frac{n}{q} \right\rfloor = \lfloor n\xi \rfloor \tag{8}$$

for all $n = 1, \cdots, n_{\max}$, so that $n$ is a multiple of $q$ if and only if

$$n\xi - \lfloor n\xi \rfloor < \xi. \tag{9}$$

Also, another theorem from the paper (Theorem 4.2) shows that a necessary and sufficient condition for (8) to hold for all $n = 1, \cdots, n_{\max}$ is that

$$\frac{1}{q} \leq \xi < \frac{1}{q} + \frac{1}{vq} \tag{10}$$

holds, where $v$ is the greatest integer with $v \equiv -1\ (\mathrm{mod}\ q)$ and $v \leq n_{\max}$, i.e.,

$$v = \left\lfloor \frac{n_{\max}+1}{q} \right\rfloor q - 1. \tag{11}$$

In practice, we take $\xi = \frac{m}{2^{b}}$ for some positive integers $m, b$ so that

$$n\xi - \lfloor n\xi \rfloor < \xi \tag{12}$$

holds if and only if

$$(nm\ \mathrm{mod}\ 2^{b}) < m. \tag{13}$$

Hence, the above is a necessary and sufficient condition for n to be divisible by q, as long as the inequality

$$\frac{1}{q} \leq \frac{m}{2^{b}} < \frac{1}{q} + \frac{1}{vq} = \frac{\lfloor (n_{\max}+1)/q \rfloor}{\lfloor (n_{\max}+1)/q \rfloor q - 1} \tag{14}$$

holds. Furthermore, since $\lfloor nx \rfloor = \lfloor n\xi \rfloor$ holds, the quotient $\left\lfloor \frac{n}{q} \right\rfloor$ can also be computed as

$$\lfloor n\xi \rfloor = \left\lfloor \frac{nm}{2^{b}} \right\rfloor. \tag{15}$$

In other words, we multiply $n$ by $m$; then the lowest $b$ bits can be used for inspecting divisibility, while the remaining upper bits constitute the quotient.

Note that this is always the case, regardless of whether n is divisible by q or not, so this algorithm actually does more than what we are asking for. Furthermore, there is only one magic number, m, while Granlund-Montgomery requires two. This may have some positive impact on the code size, instruction decoding overhead, register usage overhead, etc.

However, one should also note that the cost of these “extra features” is that we must perform a “widening multiplication”, that is, if $n_{\max}$ and $m$ are of $b$ bits, then we need all $2b$ bits of the product of $n$ and $m$. It is also worth mentioning that the magic number $m$ might actually require more than $b$ bits. For details, please refer to the previous post, or this paper by Lemire et al.

In any case, the resulting algorithm would be like the following:

// m is precalculated from q and nmax; see (14).
std::size_t s = 0;
while (true) {
    auto const r = widening_mul(n, m);
    if (low_bits(r) < m) {
        n = high_bits(r);
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};
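
The helpers widening_mul, low_bits and high_bits are left unspecified above; the following is one possible way (a sketch of my own, not the actual benchmark code) to realize them for $b = 32$ using the full 64-bit product, together with one valid choice of the magic number for $q = 10$, namely $m = 429496730$, which satisfies (14) for $n_{\max} = 99999999$, i.e. for inputs of at most 8 decimal digits:

#include <cstdint>

// Helpers for b = 32, realized with a 64-bit widening multiplication.
constexpr std::uint64_t widening_mul(std::uint32_t x, std::uint32_t y) {
    return std::uint64_t(x) * y;
}
constexpr std::uint32_t low_bits(std::uint64_t x) { return std::uint32_t(x); }
constexpr std::uint32_t high_bits(std::uint64_t x) { return std::uint32_t(x >> 32); }

// Magic number for q = 10 with nmax = 99999999:
// it satisfies 1/10 <= m / 2^32 < 10^7 / (10^8 - 1), which is inequality (14).
constexpr std::uint32_t m = 429496730u;

static_assert(low_bits(widening_mul(30u, m)) < m);     // 30 is a multiple of 10,
static_assert(high_bits(widening_mul(30u, m)) == 3u);  // and the upper half is its quotient 3
static_assert(low_bits(widening_mul(25u, m)) >= m);    // 25 is not a multiple of 10,
static_assert(high_bits(widening_mul(25u, m)) == 2u);  // but we still get the true quotient floor(25/10) = 2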

Generalized modular inverse algorithm

Staring at Theorem 1, one can ask: if all we care about is divisibility, not the quotient, then do we really need to take $x = \frac{1}{q}$? That is, as long as $p$ is any integer coprime to $q$, asking if $\frac{n}{q}$ is an integer is exactly the same question as asking if $\frac{np}{q}$ is an integer. In fact, this observation leads to a derivation of the modular inverse divisibility check algorithm by Granlund-Montgomery explained above for the case of odd divisors, and the Dragonbox paper already did this (Remark 3 after Theorem 4.6). A few days ago, I realized that the same argument actually applies to the case of general divisors as well, which leads to yet another divisibility check algorithm, explained below, which I believe is probably novel.

Suppose that $p$ is a positive integer coprime to $q$; then, as pointed out, $n$ is divisible by $q$ if and only if $nx$ is an integer with $x := \frac{p}{q}$. Therefore, by Theorem 1 (together with Theorem 4.2 from the Dragonbox paper), for any $\xi, \eta$ satisfying

$$\frac{p}{q} \leq \xi < \frac{p}{q} + \frac{1}{vq} \tag{16}$$

and

$$\left\lfloor \frac{n_{\max}}{q} \right\rfloor q\left(\xi - \frac{p}{q}\right) < \eta \leq u\left(\xi - \frac{p}{q}\right) + \frac{1}{q}, \tag{17}$$

a given $n = 1, \cdots, n_{\max}$ is divisible by $q$ if and only if

$$n\xi - \lfloor n\xi \rfloor < \eta, \tag{18}$$

where $v$ is the greatest integer satisfying $vp \equiv -1\ (\mathrm{mod}\ q)$ and $v \leq n_{\max}$, and $u$ is the smallest positive integer satisfying $up \equiv 1\ (\mathrm{mod}\ q)$.

Again, since our goal is to make the evaluation of the inequality (18) as easy as possible, we may want to take $\xi = \frac{m}{2^{b}}$ and $\eta = \frac{s}{2^{b}}$ as before, so that (18) becomes

$$(nm\ \mathrm{mod}\ 2^{b}) < s. \tag{19}$$

Although that is what we will do eventually, let us consider the slightly more general case $\xi = \frac{m}{N}$ and $\eta = \frac{s}{N}$ for an arbitrary positive integer $N$, not necessarily of the form $2^{b}$, for the sake of clarity about what is going on. Of course, in this case (18) is equivalent to

$$(nm\ \mathrm{mod}\ N) < s. \tag{20}$$

With this setting, we can also rewrite (16) and (17) as

$$0 \leq qm - Np < \frac{N}{v} \tag{21}$$

and

$$\left\lfloor \frac{n_{\max}}{q} \right\rfloor (qm - Np) < s \leq \frac{u}{q}(qm - Np) + \frac{N}{q}, \tag{22}$$

respectively. Note that both sides of the above inequality contain the factor $qm - Np$: the left-hand side multiplies it by $\left\lfloor \frac{n_{\max}}{q} \right\rfloor$ and the right-hand side multiplies it by $\frac{u}{q}$. Since $u$ is at most $q - 1$, usually $\left\lfloor \frac{n_{\max}}{q} \right\rfloor$ is much larger than $\frac{u}{q}$, so it is usually the case that the inequality can hold only when the factor $qm - Np$ is small enough. So it makes sense to actually minimize it. Note that we are to take $N = 2^{b}$ where $b$ is a parameter determined by the application, while $m$ and $p$ are something we can choose. In such a situation, the smallest possible nonnegative value of $qm - Np$ is exactly $g := \gcd(q, N)$, the greatest common divisor of $q$ and $N$. Recall that a general solution to the equation $qm - Np = g$ is given as

$$m = m_{0} + \frac{Nk}{g}, \qquad p = p_{0} + \frac{qk}{g} \tag{23}$$

where $m_{0}$ is the modular inverse of $\frac{q}{g}$ with respect to $\frac{N}{g}$, $p_{0}$ is the unique integer satisfying $qm_{0} - Np_{0} = g$, and $k$ is any integer.

Now, we cannot just take any $k \in \mathbb{Z}$, because this whole argument breaks down if $p$ and $q$ are not coprime. In particular, $p_{0}$ is not guaranteed to be coprime to $q$. For example, take $N = 32$ and $q = 30$; then we get $g = 2$, $m_{0} = 15$ and $p_{0} = 14$, so that $qm_{0} - Np_{0} = 2$, but $\gcd(p_{0}, q) = 2$. Nevertheless, it is always possible to find some $k$ such that $p = p_{0} + \frac{qk}{g}$ is coprime to $q$. The proof of this fact written below is due to Seung uk Jang:

Proposition 2. Let $a, b$ be any integers and $g := \gcd(a, b)$. Then there exist integers $x, y$ such that $ax - by = g$ and $\gcd(a, y) = 1$. Specifically, such $x, y$ can be found as

$$x = x_{0} + \frac{bk}{g}, \qquad y = y_{0} + \frac{ak}{g}, \tag{24}$$

where $x_{0}$ is the modular inverse of $\frac{a}{g}$ with respect to $\frac{b}{g}$, $y_{0} = \frac{ax_{0} - g}{b}$, and $k$ is the smallest nonnegative integer satisfying

$$\left(\frac{a}{g}\right)k \equiv 1 - y_{0}\ \left(\mathrm{mod}\ \frac{g}{\gcd(a/g,\, g)}\right). \tag{25}$$

Proof. Take $x, y, k$ as given above. Note that $\frac{a}{g}$ is coprime to $\frac{g}{\gcd(a/g,\, g)}$, so such $k$ uniquely exists. Then we have

$$y \equiv y_{0} + (1 - y_{0}) \equiv 1\ \left(\mathrm{mod}\ \frac{g}{\gcd(a/g,\, g)}\right), \tag{26}$$

so $y$ is coprime to $\frac{g}{\gcd(a/g,\, g)}$. On the other hand, from

$$\left(\frac{a}{g}\right)x_{0} - \left(\frac{b}{g}\right)y_{0} = 1, \tag{27}$$

we know that $y_{0}$ is coprime to $\frac{a}{g}$, thus $y$ is coprime to $\frac{a}{g}$ as well, since $y \equiv y_{0}\ (\mathrm{mod}\ \frac{a}{g})$. This means that $y$ is coprime to $a$. Indeed, let $r$ be any prime factor of $a$; then $r$ should divide either $\frac{a}{g}$ or $g$. If $r$ divides $\frac{a}{g}$, then it cannot divide $y$ since $y$ is coprime to $\frac{a}{g}$. Also, if $r$ divides $g$ but not $\frac{a}{g}$, then it divides $\frac{g}{\gcd(a/g,\, g)}$, which is again coprime to $y$.

For instance, applying this proposition to our example $N = 32$, $q = 30$ yields

$$m = 15 + 16k, \qquad p = 14 + 15k \tag{28}$$

with $k$ being the smallest nonnegative integer satisfying

$$15k \equiv 1 - 14\ (\mathrm{mod}\ 2), \tag{29}$$

which is $k = 1$. Hence, we take $m = 31$ and $p = 29$; indeed, $qm - Np = 30 \cdot 31 - 32 \cdot 29 = 2 = g$ and $\gcd(29, 30) = 1$.

In fact, for the specific case of $N = 2^{b}$ and $q \leq N$, we can always take either $k = 0$ or $1$. Indeed, by the assumption, $g$ is a power of $2$ and $\frac{q}{g}$ is an odd number. Then, one of $p_{0}$ and $p_{0} + \frac{q}{g}$ is an odd number, so one of them is coprime to $g$. On the other hand, since

$$\left(\frac{q}{g}\right)m_{0} - \left(\frac{N}{g}\right)p_{0} = 1, \tag{30}$$

we know $p_{0}$ is coprime to $\frac{q}{g}$, so both $p_{0}$ and $p_{0} + \frac{q}{g}$ are coprime to $\frac{q}{g}$. Therefore, the odd one between $p_{0}$ and $p_{0} + \frac{q}{g}$ must be coprime to $q$.

In any case, let us assume from now on that we have chosen $m$ and $p$ in a way that $\gcd(p, q) = 1$ and $qm - Np = g$. Then (21) and (22) can be rewritten as

$$v < \frac{N}{g} \tag{31}$$

and

$$\left\lfloor \frac{n_{\max}}{q} \right\rfloor g < s \leq \frac{ug + N}{q}, \tag{32}$$

respectively. Next, we claim that $\frac{ug + N}{q}$ is an integer. Indeed, we have

$$(ug + N)p \equiv g + Np = qm \equiv 0\ (\mathrm{mod}\ q), \tag{33}$$

and since $p$ is coprime to $q$, $ug + N$ must be a multiple of $q$. Hence, the most sensible choice of $s$ in this case is $s = \frac{ug + N}{q}$, and let us assume that from now on. Then, the left-hand inequality of (32) can be rewritten as

$$\left\lfloor \frac{n_{\max}}{q} \right\rfloor q - u < \frac{N}{g}. \tag{34}$$

Actually, this inequality follows from $v < \frac{N}{g}$, so it is redundant, because $v$ can be written in terms of $u$ as

$$v = \left\lfloor \frac{n_{\max} + u}{q} \right\rfloor q - u. \tag{35}$$

Indeed, the right-hand side of (35) is at least $\left\lfloor \frac{n_{\max}}{q} \right\rfloor q - u$. Plugging (35) into $v < \frac{N}{g}$, we obtain that the only condition we need to ensure is

$$\left\lfloor \frac{n_{\max} + u}{q} \right\rfloor < \frac{(N/g) + u}{q}, \tag{36}$$

or equivalently,

$$n_{\max} \leq \left\lfloor \frac{(N/g) + u}{q} \right\rfloor q + q - 1 - u. \tag{37}$$

Therefore, as long as the above is true, we can check divisibility of $n = 1, \cdots, n_{\max}$ by inspecting the inequality

$$(nm\ \mathrm{mod}\ N) < \frac{ug + N}{q}. \tag{38}$$

Furthermore, if $n$ turned out to be a multiple of $q$, then we can compute $\frac{n}{q}$ from $(nm\ \mathrm{mod}\ N)$ as in the classical Granlund-Montgomery case. More precisely, assume that $n = aq$ for some integer $a \geq 0$; then

$$(nm\ \mathrm{mod}\ N) = aqm - \left\lfloor \frac{aqm}{N} \right\rfloor N = aqm - \left(\left\lfloor \frac{a(qm - Np)}{N} \right\rfloor + ap\right)N = a(qm - Np) - \left\lfloor \frac{ag}{N} \right\rfloor N = ag - \left\lfloor \frac{ag}{N} \right\rfloor N.$$

We show that $ag < N$ always holds as long as $1 < q < N$, $aq \leq n_{\max}$, and $n_{\max}$ satisfies (37). Indeed, $ag < N$ is equivalent to $aq < \frac{qN}{g}$, so it is enough to show that the right-hand side of (37) is strictly less than $\frac{qN}{g}$. Note that

$$\left\lfloor \frac{(N/g) + u}{q} \right\rfloor q + q - 1 - u \leq \frac{(N/g) + u}{q} \cdot q + q - 1 - u = \frac{N}{g} + q - 1,$$

and

$$\frac{qN}{g} - \left(\frac{N}{g} + q - 1\right) = \left(\frac{N}{g} - 1\right)(q - 1) > 0 \tag{39}$$

by the condition $1 < q < N$. Therefore, in this case we always have

$$(nm\ \mathrm{mod}\ N) = ag, \tag{40}$$

thus $a$ can be recovered by computing $\frac{(nm\ \mathrm{mod}\ N)}{g}$. Note that in particular if $N = 2^{b}$, division by $g$ is just a bit-shift.

Compared to the method proposed by Granlund-Montgomery, bit-rotation is never needed, but at the expense of requiring an additional shift for computing $\frac{n}{q}$. Note that this shifting is only needed if $n$ turned out to be a multiple of $q$. The divisibility check alone can be done with one multiplication and one comparison.

So far it sounds like this new method is better than the classical Granlund-Montgomery algorithm based on bit-rotation, but note that the maximum possible value of $n_{\max}$ (i.e., the right-hand side of (37)) is roughly of size $\frac{N}{g}$, so in particular, if $N = 2^{b}$ and $q = 2^{t} q_{0}$ for some odd integer $q_{0}$, then $n_{\max}$ can be of at most about $(b - t)$ bits. Depending on the specific parameters, it is possible to improve the right-hand side of (37) a little bit by choosing a different $p$ (since $u$ is determined by $p$), but this cannot have any substantial impact on how large $n_{\max}$ can be. Also, at this moment I do not know of an elegant way of choosing $p$ that maximizes the bound on $n_{\max}$.

In summary, the algorithm works as follows, assuming $1 < q < N$ and $N = 2^{b}$.

  1. Write $q = 2^{t} q_{0}$ for an odd integer $q_{0}$.

  2. Let $m_{0}$ be the modular inverse of $q_{0}$ with respect to $2^{b-t}$, and let $p_{0} := \frac{q_{0}m_{0} - 1}{2^{b-t}}$.

  3. If $p_{0}$ is odd, let $p := p_{0}$; otherwise, let $p := p_{0} + q_{0}$.

  4. Let $m := \frac{2^{b-t}p + 1}{q_{0}}$.

  5. Let $u$ be the modular inverse of $p$ with respect to $q$.

  6. Then for any $n = 0, 1, \cdots, \left\lfloor \frac{2^{b-t} + u}{q} \right\rfloor q + q - 1 - u$, $n$ is a multiple of $q$ if and only if

    $$(nm\ \mathrm{mod}\ 2^{b}) < \frac{2^{b-t} + u}{q_{0}}. \tag{41}$$
  7. In case the above inequality holds, we also have

    $$\frac{n}{q} = \frac{(nm\ \mathrm{mod}\ 2^{b})}{2^{t}}. \tag{42}$$
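
These parameters can be precomputed, for example, as in the following sketch (my own illustration with made-up identifier names, not code from the post's library); it assumes $b = 32$ and a reasonably small $q$, and again uses Newton's iteration for the modular inverse:

#include <cstdint>

// Precomputation of the parameters of the generalized modular inverse algorithm
// for b = 32 and a constant divisor q with 1 < q < 2^32 (assumed small enough that
// the intermediate quantities below do not overflow, e.g. q <= 2^16).
struct div_info32 {
    int t;                          // number of trailing zero bits of q
    std::uint32_t m;                // the multiplier
    std::uint32_t threshold_value;  // (2^(b-t) + u) / q0
};

constexpr div_info32 compute_div_info32(std::uint32_t q) {
    // Step 1: q = 2^t * q0 with q0 odd.
    int t = 0;
    std::uint32_t q0 = q;
    while (q0 % 2 == 0) { q0 /= 2; ++t; }

    // Step 2: m0 = modular inverse of q0 with respect to 2^(b-t), obtained by computing
    // the inverse with respect to 2^32 (Newton's iteration) and truncating to (b-t) bits.
    std::uint32_t m0 = q0;
    m0 *= 2 - m0 * q0;
    m0 *= 2 - m0 * q0;
    m0 *= 2 - m0 * q0;
    m0 *= 2 - m0 * q0;
    std::uint64_t const pow2 = std::uint64_t(1) << (32 - t); // 2^(b-t)
    m0 = std::uint32_t(m0 & (pow2 - 1));

    // Steps 2-4: p0, p and m.
    std::uint32_t const p0 = std::uint32_t((std::uint64_t(q0) * m0 - 1) / pow2);
    std::uint32_t const p = (p0 % 2 != 0) ? p0 : p0 + q0;
    std::uint32_t const m = std::uint32_t((pow2 * p + 1) / q0);

    // Step 5: u = modular inverse of p with respect to q (brute force, for exposition only).
    std::uint32_t u = 1;
    while (std::uint64_t(u) * p % q != 1) { ++u; }

    // Step 6: the threshold (2^(b-t) + u) / q0.
    return {t, m, std::uint32_t((pow2 + u) / q0)};
}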

The corresponding code for factoring out the highest power of q would look like the following:

// m, threshold_value and t are precalculated from q.
std::size_t s = 0;
while (true) {
    auto const r = n * m;
    if (r < threshold_value) {
        n = r >> t;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};

One should keep in mind that the above only works provided $n$ is at most $\left\lfloor \frac{2^{b-t} + u}{q} \right\rfloor q + q - 1 - u$, which is roughly equal to $2^{b-t}$.
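
For instance, for $q = 10$ and $b = 32$ the recipe above gives $t = 1$, $q_{0} = 5$, $m = 1288490189$, $p = 3$, $u = 7$ and a threshold value of $\frac{2^{31} + 7}{5} = 429496731$, and the bound on the input becomes $\left\lfloor \frac{2^{31} + 7}{10} \right\rfloor \cdot 10 + 10 - 1 - 7 = 2147483652$, which is just slightly above $2^{31}$. A couple of quick sanity checks (my own, verified by hand):

#include <cstdint>

// Parameters for q = 10, b = 32, following the recipe above.
constexpr int t = 1;
constexpr std::uint32_t m = 1288490189u;
constexpr std::uint32_t threshold_value = 429496731u;

static_assert(std::uint32_t(30u * m) < threshold_value &&
              (std::uint32_t(30u * m) >> t) == 3u);       // 30 is a multiple of 10; quotient 3
static_assert(std::uint32_t(25u * m) >= threshold_value); // 25 is not a multiple of 10

(In this particular case $m$ happens to coincide with the constant used in the bit-rotation version above, since $p = 3 < q_{0}$, though that is not always the case.)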

Benchmark and conclusion

Determining the most efficient approach inherently depends on a multitude of factors. To gain insight into the relative performances of various algorithms, I conducted a benchmark. Given that my primary application involves removing trailing zeros, I set the divisor q to 10 for this benchmark. Additionally, considering the requirements of Dragonbox, where trailing zero removal may only be necessary for numbers up to 8 digits for IEEE-754 binary32 and up to 16 digits for IEEE-754 binary64, I incorporated these assumptions in the benchmark to determine the optimal parameters for each algorithm.

Here is the data I collected on my laptop (Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, Windows 10):

  • 32-bit benchmark for numbers with at most 8 digits.
Algorithm | Average time consumed per sample
Null (baseline) | 1.4035 ns
Naïve | 12.7084 ns
Granlund-Montgomery | 11.8153 ns
Lemire | 12.2671 ns
Generalized Granlund-Montgomery | 11.2075 ns
Naïve 2-1 | 8.92781 ns
Granlund-Montgomery 2-1 | 7.85643 ns
Lemire 2-1 | 7.60924 ns
Generalized Granlund-Montgomery 2-1 | 7.85875 ns
Naïve branchless | 3.30768 ns
Granlund-Montgomery branchless | 2.52126 ns
Lemire branchless | 2.71366 ns
Generalized Granlund-Montgomery branchless | 2.51748 ns
  • 64-bit benchmark for numbers with at most 16 digits.
Algorithm | Average time consumed per sample
Null (baseline) | 1.68744 ns
Naïve | 16.5861 ns
Granlund-Montgomery | 14.1657 ns
Lemire | 14.3427 ns
Generalized Granlund-Montgomery | 15.0626 ns
Naïve 2-1 | 13.2377 ns
Granlund-Montgomery 2-1 | 11.3316 ns
Lemire 2-1 | 11.6016 ns
Generalized Granlund-Montgomery 2-1 | 11.8173 ns
Naïve 8-2-1 | 12.5984 ns
Granlund-Montgomery 8-2-1 | 11.0704 ns
Lemire 8-2-1 | 13.3804 ns
Generalized Granlund-Montgomery 8-2-1 | 11.1482 ns
Naïve branchless | 5.68382 ns
Granlund-Montgomery branchless | 4.0157 ns
Lemire branchless | 4.92971 ns
Generalized Granlund-Montgomery branchless | 4.64833 ns

(The code is available here.)

And here is some detailed information on how the benchmark was done:

  • Samples were generated randomly using the following procedure (a code sketch is given after this list):
    • Uniformly randomly generate the total number of digits, ranging from 1 to the specified maximum number of digits.
    • Given the total number of digits, uniformly randomly generate the number of trailing zeros, ranging from 0 to the total number of digits minus 1.
    • Uniformly randomly generate an unsigned integer with the given total number of digits and number of trailing zeros.
  • Generalized Granlund-Montgomery refers to the generalized modular inverse algorithm explained in the last section.
  • Algorithms without any suffix iteratively remove trailing zeros one by one, as demonstrated in the code snippets in the previous sections.
  • Algorithms suffixed with “2-1” initially attempt to iteratively remove two consecutive trailing zeros at once (by running the loop with q=100), and then remove one more zero if necessary.
  • Algorithms suffixed with “8-2-1” first check if the input contains at least eight trailing zeros (using the corresponding divisibility check algorithm with $q = 10^{8}$), and if that is the case, then remove eight zeros and invoke the 32-bit “2-1” variants of themselves. If there are fewer than eight trailing zeros, then they proceed like their “2-1” variants.
  • Algorithms suffixed with “branchless” do branchless binary search, as suggested by reddit users r/pigeon768 and r/TheoreticalDumbass. (See this reddit post.)
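
Here is a rough sketch (my reconstruction from the description above, not the actual benchmark code) of the sample generation procedure, for the 32-bit case with at most 8 digits:

#include <cstdint>
#include <random>

// Generates one random sample: uniform total number of digits, then a uniform
// number of trailing zeros, then a uniform integer having exactly that many
// digits and exactly that many trailing zeros.
std::uint32_t generate_sample(std::mt19937& rg, int max_digits /* e.g. 8 */) {
    int const total_digits = std::uniform_int_distribution<int>{1, max_digits}(rg);
    int const trailing_zeros = std::uniform_int_distribution<int>{0, total_digits - 1}(rg);

    // The significant part has (total_digits - trailing_zeros) digits and must not
    // itself end with a zero; generate it by rejection.
    int const significant_digits = total_digits - trailing_zeros;
    std::uint32_t pow10 = 1;
    for (int i = 1; i < significant_digits; ++i) pow10 *= 10;
    std::uniform_int_distribution<std::uint32_t> dist{pow10, 10 * pow10 - 1};

    std::uint32_t n = dist(rg);
    while (significant_digits > 1 && n % 10 == 0) { n = dist(rg); }

    // Append the trailing zeros.
    for (int i = 0; i < trailing_zeros; ++i) n *= 10;
    return n;
}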

It seems there is not so much difference between the three non-naïve algorithms overall, and even the naïve one is not so bad. There was notable fluctuation between repeated runs and the top performer varied from run to run, but all three consistently outperformed the naïve approach, and I could observe a few consistent patterns.

Firstly, Lemire’s algorithm seems to suffer for large divisors, for instance $q = 10^{8}$ or maybe even $q = 10^{4}$. This is probably because for large divisors the number of bits needed for a correct divisibility test often exceeds the word size. This means that comparing the low bits of the result of the multiplication with the threshold value is not a simple single comparison in reality.

For instance, the “Lemire 8-2-1” and “Lemire branchless” algorithms from the benchmark use 80 bits for checking divisibility by $q = 10^{8}$ for inputs of up to 16 decimal digits. This means that, given that the input is passed as a 64-bit unsigned integer, we perform a widening 128-bit multiplication with a 64-bit magic number m (whose value is 12089258196146292 in this case), and we check two things to decide divisibility: that the lowest 16 bits from the upper half of the 128-bit multiplication result are all zero, and that the lower half of this 128-bit multiplication result is strictly less than m. Actually, the minimum required number of bits for a correct divisibility test is 78, assuming inputs are limited to 16 decimal digits, and I opted for 2 more bits to facilitate the extraction of 16 bits from the upper 64 bits rather than 14 bits.

(Note also that needing more bits than the word size means that, even if all we care about is divisibility without any need to determine the quotient, Lemire’s algorithm may still need a widening multiplication!)

Secondly, it seems that on x86-64 the classical Granlund-Montgomery algorithm is just better than the proposed generalization of it, especially for the branchless case. Note that ror and shr have identical performance (reference), so for the branchless case, having to bit-shift only when the input is divisible turns out to be a pessimization: we need to evaluate the bit-shift regardless of the divisibility, and as a result it just ends up requiring an additional mov compared to the unconditional bit-rotation of the classical Granlund-Montgomery algorithm. Even for the branchful case, it seems the compiler still tries to evaluate the bit-shift regardless of the divisibility and as a consequence generates one more mov.

My conclusion is that the classical Granlund-Montgomery algorithm is probably the best choice for trailing zero removal, at least on x86-64. Yet, the proposed generalized modular inverse algorithm may be a better choice on machines without a ror instruction, or machines with a smaller word size. Lemire’s algorithm does not seem to offer any big advantage over the other two on x86-64, and I expect that the cost of the widening multiplication may overwhelm the advantage of having only one magic number on machines with a smaller word size.

On the other hand, Lemire’s algorithm is still very useful if I have to know the quotient regardless of divisibility. There indeed is such an occasion in Dragonbox, where I am already leveraging Lemire’s algorithm for this purpose.

Special thanks to reddit users r/pigeon768 and r/TheoreticalDumbass who proposed the branchless idea!
