How to quickly factor out a constant factor from integers
This post revisits the topic of integer division, building upon the discussion in the previous post. Specifically, I’ll delve into removing trailing zeros in the decimal representation of an input integer, or more broadly, factoring out the highest power of a given constant that divides the input. This exploration stems from the problem of converting floating-point numbers into strings, where certain contemporary algorithms, such as Schubfach and Dragonbox, may yield outputs containing trailing zeros.
Here is the precise statement of our problem:
Given positive integer constants $q$ and $n_{\max}$, how do we find the largest integer $s$ such that $q^{s}$ divides $n$, together with the corresponding $n / q^{s}$, for any $n = 1, \cdots, n_{\max}$?
As one can expect, this more or less boils down to an efficient divisibility test algorithm. However, merely testing for divisibility is not enough, and we need to actually divide the input by the divisor when we know the input is indeed divisible.
Naïve algorithm
Here is the most straightforward implementation I can think of:
```cpp
// Factor out all powers of q from n; returns the reduced n together with
// the exponent s of the largest power of q dividing the original n.
std::size_t s = 0;
while (true) {
    auto const r = n % q;
    if (r == 0) {
        n /= q;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};
```
Of course, this works (as long as $q \neq 1$, which we do assume throughout), but obviously our objective is to explore avenues for greater performance. Here, we are assuming that the divisor $q$ is a given constant, so any sane modern compiler knows that the dreaded generic integer division is not necessary. Rather, it would replace the division with a multiply-and-shift, or some slight variation of it, as explained in the previous post. That is great, but note that we need to compute both the quotient and the remainder. As far as I know, there is no known algorithm capable of computing both quotient and remainder with only one multiplication, which means that the above code will perform two multiplications per iteration. However, if we consider cases where the input is not divisible by the divisor, we realize that we don't actually require the precise values of the quotient or the remainder. Our sole concern is whether the remainder is zero, and only if that is the case do we want to know the quotient of the division. Therefore, it's conceivable that we could accomplish this with just one multiplication per iteration, which presumably will improve the performance.
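To make the cost concrete, here is a minimal sketch (the function names are mine, for illustration) of the multiply-and-shift lowering for $q = 10$ on $32$-bit inputs; the magic constant `0xCCCCCCCD` with shift $35$ is the standard choice for unsigned division by $10$, and obtaining the remainder costs a second multiplication on top of the one for the quotient:

```cpp
#include <cstdint>

// Quotient by 10 via multiply-and-shift: floor(n * 0xCCCCCCCD / 2^35).
std::uint32_t quotient_by_10(std::uint32_t n) {
    return static_cast<std::uint32_t>((std::uint64_t(n) * 0xCCCCCCCD) >> 35);
}

// Remainder by 10 needs a second multiplication: n - 10 * (n / 10).
std::uint32_t remainder_by_10(std::uint32_t n) {
    return n - quotient_by_10(n) * 10;
}
```

Getting rid of this second multiplication for the divisibility test is exactly what the algorithms below achieve.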
Actually, the classical paper by Granlund-Montgomery already presented such an algorithm, which is the topic of the next section.
Granlund-Montgomery modular inverse algorithm
First, let us assume for a while that $q$ is an odd number. Assume further that our input $n$ is of $b$-bit unsigned integer type, e.g. of type std::uint32_t with $b = 32$. Hence, we are assuming that $n_{\max}$ is at most $2^{b} - 1$. Now, since $q$ is coprime to $2^{b}$, there uniquely exists the modular inverse of $q$ with respect to $2^{b}$. Let us call it $m$, which in general can be found using the extended Euclid's algorithm. But how is it useful to us?
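As a side note, for a power-of-$2$ modulus the inverse can also be computed with a few multiplications instead of the extended Euclid's algorithm. The following is a minimal sketch (the function name is mine) based on the classical Newton-Hensel iteration, which doubles the number of correct bits per step:

```cpp
#include <cstdint>

// Computes the modular inverse of an odd q with respect to 2^32.
// If inv is an inverse of q modulo 2^j, then inv * (2 - q * inv) is an
// inverse of q modulo 2^(2j); q itself is an inverse of q modulo 2^3.
constexpr std::uint32_t modular_inverse_pow2(std::uint32_t q) {
    std::uint32_t inv = q;   // correct modulo 2^3
    inv *= 2 - q * inv;      // correct modulo 2^6
    inv *= 2 - q * inv;      // correct modulo 2^12
    inv *= 2 - q * inv;      // correct modulo 2^24
    inv *= 2 - q * inv;      // correct modulo 2^48, hence modulo 2^32
    return inv;
}

static_assert(modular_inverse_pow2(5) * 5 == 1);  // 5^(-1) = 3435973837
```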
The key observation here is that, on the group of integer residue classes modulo $2^{b}$, the multiplication by any integer coprime to $2^{b}$ is an automorphism, i.e., a bijective group homomorphism onto itself. In particular, multiplication by such an integer induces a bijection from the set $\{0, 1, \cdots, 2^{b} - 1\}$ onto itself.
Now, what does this bijection, defined in particular by the modular inverse $m$ (which must be coprime to $2^{b}$), do to multiples of $q$, i.e., $0, q, 2q, \cdots, \left\lfloor \frac{2^{b}-1}{q} \right\rfloor q$? Note that for any integer $a$,

$$ (aq) \cdot m \equiv a \ (\mathrm{mod}\ 2^{b}), $$

thus, the bijection defined by $m$ maps $\left\{0, q, 2q, \cdots, \left\lfloor \frac{2^{b}-1}{q} \right\rfloor q\right\}$ into $\left\{0, 1, \cdots, \left\lfloor \frac{2^{b}-1}{q} \right\rfloor\right\}$. Therefore, anything that gets mapped into $\left\{0, 1, \cdots, \left\lfloor \frac{2^{b}-1}{q} \right\rfloor\right\}$ must be a multiple of $q$, because the map is a bijection, and vice versa.
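For a tiny concrete example, take $b = 4$ and $q = 5$: the modular inverse is $m = 13$ since $5 \cdot 13 = 65 \equiv 1 \ (\mathrm{mod}\ 16)$, and $\lfloor (2^{4} - 1)/5 \rfloor = 3$. The multiples of $5$ indeed map as $0 \mapsto 0$, $5 \mapsto 65 \bmod 16 = 1$, $10 \mapsto 130 \bmod 16 = 2$, and $15 \mapsto 195 \bmod 16 = 3$, filling up $\{0, 1, 2, 3\}$ exactly, while a non-multiple such as $7$ maps to $7 \cdot 13 \bmod 16 = 11$, which falls outside of it. (Note that in this example the image of each multiple $aq$ is exactly $a = \frac{aq}{5}$.)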
Furthermore, the image of this bijection, in the case when the input $n$ was actually a multiple of $q$, is precisely $\frac{n}{q}$. Since multiplication modulo $2^{b}$ is what C/C++ abstract machines are supposed to do whenever they see an integer multiplication of unsigned $b$-bit operands, we find that the following code does the job we want, but with only one multiplication per iteration:
```cpp
// m and threshold_value are precalculated from q.
// m is the modular inverse of q with respect to 2^b, and
// threshold_value is (0xff....ff / q) + 1.
std::size_t s = 0;
while (true) {
    auto const r = n * m;
    if (r < threshold_value) {
        n = r;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};
```
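Here is a concrete, compilable instantiation of the snippet above for $b = 32$ and $q = 5$; the constants are mine (computed as described above), not from any benchmark code: $m = 3435973837$ is the inverse of $5$ modulo $2^{32}$, and the threshold is $\lfloor (2^{32} - 1)/5 \rfloor + 1 = 858993460$.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Factors out all powers of 5 from n; returns {reduced n, exponent}.
constexpr std::pair<std::uint32_t, std::size_t>
remove_factors_of_5(std::uint32_t n) {
    constexpr std::uint32_t m = 3435973837u;              // 5^(-1) mod 2^32
    constexpr std::uint32_t threshold_value = 0xffffffffu / 5 + 1;
    std::size_t s = 0;
    while (true) {
        auto const r = n * m;
        if (r < threshold_value) {
            n = r;
            s += 1;
        }
        else {
            break;
        }
    }
    return {n, s};
}

// 400 = 5^2 * 16.
static_assert(remove_factors_of_5(400) ==
              std::pair<std::uint32_t, std::size_t>{16, 2});
```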
All is good so far, but recall that this works only when $q$ is odd. Unfortunately, our main application, the trailing zero removal, requires $q$ to be $10$, which is even.
Actually, the aforementioned paper by Granlund-Montgomery proposes a clever solution for the case of possibly non-odd divisors, based on bit-rotation. Let us write $q = 2^{t}q_{0}$ for an integer $t$ and a positive odd number $q_{0}$. Obviously, there is no modular inverse of $q$ with respect to $2^{b}$, but we can try the next best thing, that is, the modular inverse of $q_{0}$ with respect to $2^{b}$. Let us call it $m$.
Now, for given $n$, let us consider the multiplication $nm$ modulo $2^{b}$ and then perform bit-rotate-to-right on it by $t$-bits. Let us call the result $r$. Clearly, the highest $t$-bits of $r$ are all-zero if and only if the lowest $t$-bits of $nm \bmod 2^{b}$ are all-zero, and since $m$ must be an odd number, this is the case if and only if the lowest $t$-bits of $n$ are all-zero, that is, $n$ is a multiple of $2^{t}$.
We claim that $n$ is a multiple of $q$ if and only if $r < \left\lfloor \frac{2^{b}-1}{q} \right\rfloor + 1$ holds. By the argument from the previous paragraph, $r$ must be strictly larger than $\left\lfloor \frac{2^{b}-1}{q} \right\rfloor$ if $n$ is not a multiple of $2^{t}$, since $\left\lfloor \frac{2^{b}-1}{q} \right\rfloor$ is at most of $(b-t)$-bits. Thus, we only need to care about the case when $n$ is a multiple of $2^{t}$. In that case, write $n = 2^{t}n_{0}$; then $r$ is in fact equal to

$$ n_{0}m \bmod 2^{b-t}, $$

since the lowest $t$-bits of $nm \bmod 2^{b}$ are all zero. By the discussion of the case when $q$ is odd, we know that $n_{0}$ is a multiple of $q_{0}$ if and only if $n_{0}m \bmod 2^{b-t}$ is at most $\left\lfloor \frac{2^{b-t}-1}{q_{0}} \right\rfloor = \left\lfloor \frac{2^{b}-1}{q} \right\rfloor$, thus finishing the proof of the claim.
Of course, when $n$ was really a multiple of $q$, the resulting $r$ is precisely $\frac{n}{q}$, by the same reason as in the case of odd $q$. Consequently, we obtain the following version which works for all $q$:
```cpp
// t, m and threshold_value are precalculated from q.
// t is the number of trailing zero bits of q,
// m is the modular inverse of (q >> t) with respect to 2^(b-t), and
// threshold_value is (0xff....ff / q) + 1.
std::size_t s = 0;
while (true) {
    auto const r = std::rotr(n * m, t); // C++20 <bit>
    if (r < threshold_value) {
        n = r;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};
```
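And here is the trailing zero removal case instantiated concretely, with $b = 32$ and $q = 10$, so $t = 1$ and $q_{0} = 5$ (again the constants are computed by me for illustration): any odd $m$ congruent to $5^{-1}$ modulo $2^{31}$ works, e.g. $m = 3435973837$, and the threshold is $\lfloor (2^{32} - 1)/10 \rfloor + 1 = 429496730$.

```cpp
#include <bit>      // std::rotr (C++20)
#include <cstddef>
#include <cstdint>
#include <utility>

// Removes all decimal trailing zeros of n; returns {reduced n, # of zeros}.
constexpr std::pair<std::uint32_t, std::size_t>
remove_trailing_zeros_rotr(std::uint32_t n) {
    constexpr std::uint32_t m = 3435973837u;  // odd, = 5^(-1) mod 2^31
    constexpr std::uint32_t threshold_value = 0xffffffffu / 10 + 1;
    std::size_t s = 0;
    while (true) {
        auto const r = std::rotr(n * m, 1);
        if (r < threshold_value) {
            n = r;
            s += 1;
        }
        else {
            break;
        }
    }
    return {n, s};
}

static_assert(remove_trailing_zeros_rotr(17000) ==
              std::pair<std::uint32_t, std::size_t>{17, 3});
```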
Lemire’s algorithm
In a paper published in 2019, Lemire et al. proposed an alternative way of checking divisibility which has some advantages over the Granlund-Montgomery algorithm. The theorem behind this algorithm was not optimal when first presented, but later they showed the optimal result in another paper. Here, I present a more general result which contains the result by Lemire et al. as a special case. This is one of the theorems I proved in my paper on Dragonbox (Theorem 4.6), which I copied below:
**Theorem 1** Let $x, \xi$ be real numbers and $n_{\max}$ be a positive integer such that

$$ \lfloor nx \rfloor = \lfloor n\xi \rfloor $$

holds for all $n = 1, \cdots, n_{\max}$. Then, for a positive real number $\eta$, we have the followings.

1. If $x$ is a rational number with $x = \frac{p}{q}$ for some coprime positive integers $p, q$ with $q \leq n_{\max}$, then we have

   $$ nx \in \mathbb{Z} \quad \Longleftrightarrow \quad n\xi - \lfloor n\xi \rfloor < \eta \quad \text{for all } n = 1, \cdots, n_{\max} $$

   if and only if

   $$ \left\lfloor \frac{n_{\max}}{q} \right\rfloor q \left( \xi - x \right) < \eta \leq u\xi - \lfloor u\xi \rfloor $$

   holds, where $u$ is the smallest positive integer such that $up \equiv 1 \ (\mathrm{mod}\ q)$.

2. If $x$ is either an irrational number or a rational number with the denominator strictly greater than $n_{\max}$, then we have

   $$ nx \in \mathbb{Z} \quad \Longleftrightarrow \quad n\xi - \lfloor n\xi \rfloor < \eta \quad \text{for all } n = 1, \cdots, n_{\max} $$

   if and only if

   $$ \eta \leq q_{*}\xi - \lfloor q_{*}\xi \rfloor $$

   holds, where $\frac{p_{*}}{q_{*}}$ is the best rational approximation of $x$ from below with the largest denominator $q_{*} \leq n_{\max}$.
The theorem above shows a necessary and sufficient condition for the product $nx$ to be an integer in terms of a good enough approximation $\xi$ of $x$. The second item is irrelevant in this post, so we only focus on the first item. The point here is that we are going to let $x = \frac{1}{q}$ so that $nx$ is an integer if and only if $n$ is a multiple of $q$. In that case, it is shown in the paper (Remark 2 after Theorem 4.6) that we can always take $\eta = \xi$ as long as $\xi$ satisfies the condition

$$ \lfloor n\xi \rfloor = \left\lfloor \frac{n}{q} \right\rfloor $$

for all $n = 1, \cdots, n_{\max}$, so that $n$ is a multiple of $q$ if and only if

$$ n\xi - \lfloor n\xi \rfloor < \xi. $$
Also, another theorem from the paper (Theorem 4.2) shows that a necessary and sufficient condition for having the above for all $n = 1, \cdots, n_{\max}$ is that

$$ \frac{1}{q} \leq \xi < \frac{1}{q} + \frac{1}{vq} $$

holds, where $v$ is the greatest integer with $v \leq n_{\max}$ and $v \equiv -1 \ (\mathrm{mod}\ q)$, i.e.,

$$ v = \left\lfloor \frac{n_{\max} + 1}{q} \right\rfloor q - 1. $$
In practice, we take $\xi = \frac{m}{2^{k}}$ for some positive integers $m, k$ so that

$$ n\xi - \lfloor n\xi \rfloor < \xi $$

holds if and only if

$$ (nm \bmod 2^{k}) < m. $$
Hence, the above is a necessary and sufficient condition for $n$ to be divisible by $q$, as long as the inequality

$$ \frac{1}{q} \leq \frac{m}{2^{k}} < \frac{1}{q} + \frac{1}{vq} $$

holds. Furthermore, since $\lfloor n\xi \rfloor = \left\lfloor \frac{n}{q} \right\rfloor$ holds, the quotient can be also computed as

$$ \left\lfloor \frac{n}{q} \right\rfloor = \left\lfloor \frac{nm}{2^{k}} \right\rfloor. $$
In other words, we multiply $m$ to $n$; then the lowest $k$-bits can be used for inspecting divisibility, while the remaining upper bits constitute the quotient.
Note that this is always the case, regardless of whether $n$ is divisible by $q$ or not, so this algorithm actually does more than what we are asking for. Furthermore, there is only one magic number, $m$, while Granlund-Montgomery requires two. This may have some positive impact on the code size, instruction decoding overhead, register usage overhead, etc.
However, one should also note that the cost of these “extra features” is that we must perform a “widening multiplication”, that is, if $n$ and $m$ are of $b$-bits, then we need all of the $2b$-bits of the multiplication of $n$ and $m$. It is also worth mentioning that the magic number $m$ might actually require more than $b$-bits. For details, please refer to the previous post, or this paper by Lemire et al.
In any case, the resulting algorithm would be like the following:
```cpp
// m is precalculated from q so that 1/q <= m/2^k < 1/q + 1/(vq) holds,
// with k taken to be the bit-width b of n; widening_mul computes the full
// 2b-bit product of its b-bit arguments, and low_bits/high_bits extract
// the lower/upper b-bit halves of that product.
std::size_t s = 0;
while (true) {
    auto const r = widening_mul(n, m);
    if (low_bits(r) < m) {
        n = high_bits(r);
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};
```
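As a concrete illustration with constants computed by me: take $b = k = 32$ and $q = 10$, so the magic number is $m = \lceil 2^{32}/10 \rceil = 429496730$, and one can check that $\frac{1}{10} \leq \frac{m}{2^{32}} < \frac{1}{10} + \frac{1}{10v}$ holds for the $v$ corresponding to $n_{\max} = 10^{8} - 1$, i.e., for inputs of up to $8$ decimal digits:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Removes all decimal trailing zeros of n, assuming n < 10^8;
// returns {reduced n, # of zeros}.
constexpr std::pair<std::uint32_t, std::size_t>
remove_trailing_zeros_lemire(std::uint32_t n) {
    constexpr std::uint32_t m = 429496730u;  // ceil(2^32 / 10)
    std::size_t s = 0;
    while (true) {
        // Widening 32 x 32 -> 64-bit multiplication.
        auto const r = std::uint64_t(n) * m;
        // The lower 32-bits decide divisibility...
        if (static_cast<std::uint32_t>(r) < m) {
            // ...and the upper 32-bits already hold the quotient n / 10.
            n = static_cast<std::uint32_t>(r >> 32);
            s += 1;
        }
        else {
            break;
        }
    }
    return {n, s};
}

static_assert(remove_trailing_zeros_lemire(17000) ==
              std::pair<std::uint32_t, std::size_t>{17, 3});
```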
Generalized modular inverse algorithm
Staring at Theorem 1, one can ask: if all we care about is divisibility, not the quotient, then do we really need to take $x = \frac{1}{q}$? That is, as long as $a$ is any integer coprime to $q$, asking if $\frac{na}{q}$ is an integer is exactly the same question as asking if $\frac{n}{q}$ is an integer. In fact, this observation leads to a derivation of the modular inverse divisibility check algorithm by Granlund-Montgomery explained above for the case of odd divisors, and the Dragonbox paper already did this (Remark 3 after Theorem 4.6). A few days ago, I realized that the same argument actually applies to the case of general divisors as well, which leads to yet another divisibility check algorithm explained below, which I think probably is novel.
Suppose that $a$ is a positive integer coprime to $q$; then as pointed out, $n$ is divisible by $q$ if and only if $nx$ is an integer, where $x = \frac{a}{q}$. Therefore, by Theorem 1 (together with Theorem 4.2 from the Dragonbox paper), for any $\xi$ satisfying

$$ \frac{a}{q} \leq \xi < \frac{a}{q} + \frac{1}{vq} \tag{1} $$

and

$$ \left\lfloor \frac{n_{\max}}{q} \right\rfloor q \left( \xi - \frac{a}{q} \right) < \eta \leq u\xi - \lfloor u\xi \rfloor, \tag{2} $$

a given $n = 1, \cdots, n_{\max}$ is divisible by $q$ if and only if

$$ n\xi - \lfloor n\xi \rfloor < \eta \tag{3} $$

holds, where $v$ is the greatest integer satisfying $v \leq n_{\max}$ and $va \equiv -1 \ (\mathrm{mod}\ q)$, and $u$ is the smallest positive integer satisfying $ua \equiv 1 \ (\mathrm{mod}\ q)$.
Again, since our goal is to make the evaluation of the inequality $(3)$ as easy as possible, we may want to take $\xi = \frac{m}{2^{k}}$ and $\eta = \frac{\ell}{2^{k}}$ as before, so that $(3)$ becomes

$$ (nm \bmod 2^{k}) < \ell. $$

Although that is what we will do eventually, let us consider a little bit more general case, $\xi = \frac{m}{N}$ and $\eta = \frac{\ell}{N}$ for any positive integer $N$, not necessarily of the form $2^{k}$, for the sake of clarity of what is going on. Of course, in this case $(3)$ is equivalent to

$$ (nm \bmod N) < \ell. $$
With this setting, we can also rewrite $(1)$ and $(2)$ as

$$ 0 \leq mq - aN < \frac{N}{v} \tag{1'} $$

and

$$ \left\lfloor \frac{n_{\max}}{q} \right\rfloor (mq - aN) < \ell \leq \frac{N + u(mq - aN)}{q}, \tag{2'} $$

respectively. Note that both sides of the inequality $(2')$ have the factor $mq - aN$: the left-hand side multiplies it by $\left\lfloor \frac{n_{\max}}{q} \right\rfloor$ and the right-hand side by $\frac{u}{q}$. Since $u$ is at most $q$, usually $\left\lfloor \frac{n_{\max}}{q} \right\rfloor$ is much larger than $\frac{u}{q}$, so it is usually the case that the inequality can hold only when the factor $mq - aN$ is small enough. So, it makes sense to actually minimize it. Note that we are to take $N = 2^{k}$, and $q$ is a factor defined by the application, while $m$ and $a$ are something we can choose. In such a situation, the smallest possible positive value of $mq - aN$ is exactly $g := \gcd(q, N)$, the greatest common divisor of $q$ and $N$. Recall that a general solution for the equation $mq - aN = g$ is given as

$$ m = m_{0} + j \cdot \frac{N}{g}, \qquad a = a_{0} + j \cdot \frac{q}{g}, $$

where $m_{0}$ is the modular inverse of $\frac{q}{g}$ with respect to $\frac{N}{g}$, $a_{0}$ is the unique integer satisfying $m_{0}q - a_{0}N = g$, and $j$ is any integer.
Now, we cannot just take any $j$, because this whole argument breaks down if $a$ and $q$ are not coprime. In particular, $a_{0}$ is not guaranteed to be coprime to $q$. For example, take $q = 12$ and $N = 16$; then we get $g = 4$, $m_{0} = 3$, and $a_{0} = 2$, so that $m_{0}q - a_{0}N = 36 - 32 = 4 = g$, but $\gcd(a_{0}, q) = 2 \neq 1$. Nevertheless, it is always possible to find some $j$ such that $a$ is coprime to $q$. The proof I wrote below of this fact is due to Seung uk Jang:
**Proposition 2** Let $q, N$ be any positive integers and $g = \gcd(q, N)$. Then there exist integers $m, a$ such that $mq - aN = g$ and $\gcd(a, q) = 1$. Specifically, such $m, a$ can be found as

$$ m = m_{0} + j \cdot \frac{N}{g}, \qquad a = a_{0} + j \cdot \frac{q}{g}, $$

where $m_{0}$ is the modular inverse of $\frac{q}{g}$ with respect to $\frac{N}{g}$, $a_{0} = \frac{m_{0}q - g}{N}$, and $j$ is the smallest nonnegative integer satisfying

$$ a_{0} + j \cdot \frac{q}{g} \equiv 1 \ (\mathrm{mod}\ g_{1}), $$

where $g_{1}$ is the largest divisor of $g$ coprime to $\frac{q}{g}$.
**Proof.** Take $j$ as given above. Note that $\frac{q}{g}$ is coprime to $g_{1}$, so such $j$ uniquely exists modulo $g_{1}$. Then we have

$$ a = a_{0} + j \cdot \frac{q}{g} \equiv 1 \ (\mathrm{mod}\ g_{1}), $$

so $a$ is coprime to $g_{1}$. On the other hand, from

$$ m_{0} \cdot \frac{q}{g} - a_{0} \cdot \frac{N}{g} = 1, $$

we know that $a_{0}$ is coprime to $\frac{q}{g}$, thus $a$ is coprime to $\frac{q}{g}$. This means that $a$ is coprime to $q$. Indeed, let $p$ be any prime factor of $q$; then $p$ should divide either $\frac{q}{g}$ or $g$. If $p$ divides $\frac{q}{g}$, then it cannot divide $a$ since $a$ is coprime to $\frac{q}{g}$. Also, if $p$ divides $g$ but not $\frac{q}{g}$, then it divides $g_{1}$, which is again coprime to $a$. $\blacksquare$
For instance, applying this proposition to our example $q = 12$, $N = 16$ yields

$$ m = 3 + 4j, \qquad a = 2 + 3j, $$

with $j$ being the smallest nonnegative integer satisfying

$$ 2 + 3j \equiv 1 \ (\mathrm{mod}\ 4), $$

which is $j = 1$. Hence, we take $m = 7$ and $a = 5$.
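The recipe can be checked mechanically; here is a minimal, purely illustrative sanity check of the worked example above:

```cpp
#include <numeric>  // std::gcd

constexpr bool check_proposition_2_example() {
    constexpr int q = 12, N = 16;
    constexpr int g = std::gcd(q, N);     // g = 4, so q/g = 3 and N/g = 4
    constexpr int m0 = 3;                 // modular inverse of 3 with respect to 4
    constexpr int a0 = (m0 * q - g) / N;  // a0 = 2, not coprime to q
    constexpr int j = 1;                  // smallest j with a0 + 3j = 1 (mod 4)
    constexpr int m = m0 + j * (N / g);   // m = 7
    constexpr int a = a0 + j * (q / g);   // a = 5
    return m * q - a * N == g && std::gcd(a, q) == 1;
}

static_assert(check_proposition_2_example());
```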
In fact, for the specific case of $N = 2^{k}$ and $q = 2^{t}q_{0}$ with $q_{0}$ odd, we can always take either $j = 0$ or $j = 1$. Indeed, by the assumption, $g = 2^{t}$ is a power of $2$ and $\frac{q}{g} = q_{0}$ is an odd number. Then, one of $a_{0}$ and $a_{0} + q_{0}$ is an odd number, so one of them is coprime to $g$. On the other hand, since

$$ m_{0}q_{0} - a_{0} \cdot 2^{k-t} = 1, $$

we know $a_{0}$ is coprime to $q_{0}$, so both $a_{0}$ and $a_{0} + q_{0}$ are coprime to $q_{0}$. Therefore, the odd one between $a_{0}$ and $a_{0} + q_{0}$ must be coprime to $q$.
In any case, let us assume from now on that we have chosen $m$ and $a$ in a way that $mq - aN = g$ and $\gcd(a, q) = 1$. Then $(1')$ and $(2')$ can be rewritten as

$$ vg < N \tag{1''} $$

and

$$ \left\lfloor \frac{n_{\max}}{q} \right\rfloor g < \ell \leq \frac{N + ug}{q}, \tag{2''} $$

respectively. Next, we claim that $\frac{N + ug}{q}$ is an integer. Indeed, we have

$$ N + ug = N + u(mq - aN) = umq + (1 - ua)N, $$

and since $ua \equiv 1 \ (\mathrm{mod}\ q)$, $(1 - ua)N$ must be a multiple of $q$. Hence, the most sensible choice of $\ell$ in this case is $\ell = \frac{N + ug}{q}$, and let us assume that from now on. Then, the left-hand side of the inequality $(2'')$ above can be rewritten as

$$ \left( \left\lfloor \frac{n_{\max}}{q} \right\rfloor q - u \right) g < N. $$

Actually, this inequality follows from $(1'')$, so it is redundant, because $v$ can be written in terms of $u$ as

$$ v = \begin{cases} \left\lfloor \frac{n_{\max}}{q} \right\rfloor q + (q - u) & \text{if } \left\lfloor \frac{n_{\max}}{q} \right\rfloor q + (q - u) \leq n_{\max}, \\ \left\lfloor \frac{n_{\max}}{q} \right\rfloor q - u & \text{otherwise}, \end{cases} $$

which in either case is at least $\left\lfloor \frac{n_{\max}}{q} \right\rfloor q - u$. Plugging in the above equation into $(1'')$, we obtain that the only condition we need to ensure is

$$ \left( \left\lfloor \frac{n_{\max}}{q} \right\rfloor q + (q - u) \right) g < N, $$

or equivalently,

$$ \left\lfloor \frac{n_{\max}}{q} \right\rfloor q + (q - u) < \frac{N}{g}. $$

Therefore, as long as the above is true, we can check divisibility of $n$ by $q$ by inspecting the inequality

$$ (nm \bmod N) < \frac{N + ug}{q}. $$
Furthermore, if $n$ turned out to be a multiple of $q$, then we can compute $\frac{n}{q}$ from $nm \bmod N$ as in the classical Granlund-Montgomery case. More precisely, assume that $n = wq$ for some integer $w$; then

$$ nm \bmod N = wqm \bmod N = w(aN + g) \bmod N = wg \bmod N. $$

We show that $wg < N$ always holds as long as $(1'')$, $(2'')$ hold and $n$ satisfies $n \leq n_{\max}$. Indeed, $n = wq \leq n_{\max}$ is equivalent to $w \leq \left\lfloor \frac{n_{\max}}{q} \right\rfloor$, so it is enough to show that $\left\lfloor \frac{n_{\max}}{q} \right\rfloor g$ is strictly less than $N$. Note that

$$ \left\lfloor \frac{n_{\max}}{q} \right\rfloor \leq v $$

(which holds since $u \leq q - 1$ and $n_{\max} \geq q$) and

$$ vg < N $$

by the condition $(1'')$. Therefore, in this case we always have

$$ nm \bmod N = wg, $$

thus $\frac{n}{q} = w$ can be recovered by computing $\frac{nm \bmod N}{g}$. Note that in particular if $N = 2^{k}$ and $g = 2^{t}$, division by $g$ is just a bit-shift.
Compared to the method proposed by Granlund-Montgomery, bit-rotation is never needed, but at the expense of requiring an additional shift for computing $\frac{n}{q}$. Note that this shifting is only needed if $n$ turned out to be a multiple of $q$; the divisibility check alone can be done with one multiplication and one comparison.
So far it sounds like this new method is better than the classical Granlund-Montgomery algorithm based on bit-rotation, but note that the maximum possible value of $n_{\max}$ (i.e., the one allowed by the right-hand side of the condition above) is roughly of size $\frac{N}{g}$, so in particular, if $N = 2^{b}$ and $q = 2^{t}q_{0}$ for some odd integer $q_{0}$, then $n_{\max}$ should be of at most about $(b-t)$-bits. Depending on the specific parameters, it is possible to improve the right-hand side of the condition a little bit by choosing a different $a$ (since $u$ is determined from $a$), but this cannot have any substantial impact on how large $n_{\max}$ can be. Also, at this moment I do not know of an elegant way of choosing $a$ that maximizes the bound on $n_{\max}$.
In summary, the algorithm works as follows, assuming $N = 2^{b}$ and that $n_{\max}$ satisfies the constraint derived above.

- Write $q = 2^{t}q_{0}$ for an odd integer $q_{0}$.
- Let $m_{0}$ be the modular inverse of $q_{0}$ with respect to $2^{b-t}$, and let $a_{0} = \frac{m_{0}q - 2^{t}}{2^{b}}$.
- If $a_{0}$ is odd, let $a = a_{0}$; otherwise, let $a = a_{0} + q_{0}$.
- Let $m = \frac{a \cdot 2^{b} + 2^{t}}{q}$.
- Let $u$ be the modular inverse of $a$ with respect to $q$.
- Then for any $n = 1, \cdots, n_{\max}$, $n$ is a multiple of $q$ if and only if
  $$ (nm \bmod 2^{b}) < \frac{2^{b} + u \cdot 2^{t}}{q}. $$
- In case the above inequality holds, we also have
  $$ \frac{n}{q} = \frac{nm \bmod 2^{b}}{2^{t}}. $$
The corresponding code for factoring out the highest power of $q$ would look like the following:
```cpp
// m, threshold_value and t are precalculated from q as described above;
// in particular, threshold_value is (2^b + u * 2^t) / q.
std::size_t s = 0;
while (true) {
    auto const r = n * m;
    if (r < threshold_value) {
        n = r >> t;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};
```
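For the trailing zero removal case with $b = 32$ and $q = 10$, the recipe gives the following constants (computed by me, for illustration): $q_{0} = 5$, $t = 1$, $m_{0} = 5^{-1} \bmod 2^{31} = 1288490189$, $a_{0} = \frac{m_{0} \cdot 10 - 2}{2^{32}} = 3$, which is already odd, so $a = 3$, $m = 1288490189$, $u = 3^{-1} \bmod 10 = 7$, and the threshold is $\frac{2^{32} + 7 \cdot 2}{10} = 429496731$; the constraint on $n_{\max}$ is satisfied for all $n < 2^{31}$:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Removes all decimal trailing zeros of n, assuming n < 2^31;
// returns {reduced n, # of zeros}.
constexpr std::pair<std::uint32_t, std::size_t>
remove_trailing_zeros_generalized(std::uint32_t n) {
    constexpr std::uint32_t m = 1288490189u;                // 5^(-1) mod 2^31
    constexpr std::uint32_t threshold_value = 429496731u;   // (2^32 + 14) / 10
    constexpr int t = 1;
    std::size_t s = 0;
    while (true) {
        auto const r = n * m;
        if (r < threshold_value) {
            n = r >> t;  // a plain shift instead of a rotation
            s += 1;
        }
        else {
            break;
        }
    }
    return {n, s};
}

static_assert(remove_trailing_zeros_generalized(17000) ==
              std::pair<std::uint32_t, std::size_t>{17, 3});
```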
One should keep in mind that the above only works provided that $n_{\max}$ is at most the largest value satisfying $\left\lfloor \frac{n_{\max}}{q} \right\rfloor q + (q - u) < 2^{b-t}$, which is roughly equal to $2^{b-t}$.
Benchmark and conclusion
Determining the most efficient approach inherently depends on a multitude of factors. To gain insight into the relative performance of the various algorithms, I conducted a benchmark. Given that my primary application involves removing trailing zeros, I set the divisor to $q = 10$ for this benchmark. Additionally, considering the requirements of Dragonbox, where trailing zero removal may only be necessary for numbers of up to $8$ digits for IEEE-754 binary32 and up to $16$ digits for IEEE-754 binary64, I incorporated these assumptions into the benchmark to determine the optimal parameters for each algorithm.
Here is the data I collected on my laptop (Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, Windows 10):
- 32-bit benchmark for numbers with at most 8 digits.

| Algorithm | Average time consumed per sample |
|---|---|
| Null (baseline) | 1.4035ns |
| Naïve | 12.7084ns |
| Granlund-Montgomery | 11.8153ns |
| Lemire | 12.2671ns |
| Generalized Granlund-Montgomery | 11.2075ns |
| Naïve 2-1 | 8.92781ns |
| Granlund-Montgomery 2-1 | 7.85643ns |
| Lemire 2-1 | 7.60924ns |
| Generalized Granlund-Montgomery 2-1 | 7.85875ns |
| Naïve branchless | 3.30768ns |
| Granlund-Montgomery branchless | 2.52126ns |
| Lemire branchless | 2.71366ns |
| Generalized Granlund-Montgomery branchless | 2.51748ns |
- 64-bit benchmark for numbers with at most 16 digits.

| Algorithm | Average time consumed per sample |
|---|---|
| Null (baseline) | 1.68744ns |
| Naïve | 16.5861ns |
| Granlund-Montgomery | 14.1657ns |
| Lemire | 14.3427ns |
| Generalized Granlund-Montgomery | 15.0626ns |
| Naïve 2-1 | 13.2377ns |
| Granlund-Montgomery 2-1 | 11.3316ns |
| Lemire 2-1 | 11.6016ns |
| Generalized Granlund-Montgomery 2-1 | 11.8173ns |
| Naïve 8-2-1 | 12.5984ns |
| Granlund-Montgomery 8-2-1 | 11.0704ns |
| Lemire 8-2-1 | 13.3804ns |
| Generalized Granlund-Montgomery 8-2-1 | 11.1482ns |
| Naïve branchless | 5.68382ns |
| Granlund-Montgomery branchless | 4.0157ns |
| Lemire branchless | 4.92971ns |
| Generalized Granlund-Montgomery branchless | 4.64833ns |
(The code is available here.)
And here is some detailed information on how the benchmark was done:
- Samples were generated randomly using the following procedure:
  - Uniformly randomly generate the total number of digits, ranging from 1 to the specified maximum number of digits.
  - Given the total number of digits, uniformly randomly generate the number of trailing zeros, ranging from 0 to the total number of digits minus 1.
  - Uniformly randomly generate an unsigned integer with the given total number of digits and number of trailing zeros.
- Generalized Granlund-Montgomery refers to the generalized modular inverse algorithm explained in the last section.
- Algorithms without any suffix iteratively remove trailing zeros one by one, as demonstrated in the code snippets in previous sections.
- Algorithms suffixed with “2-1” initially attempt to iteratively remove two consecutive trailing zeros at once (by running the loop with $q = 100$), and then remove one more zero if necessary; a sketch is given after this list.
- Algorithms suffixed with “8-2-1” first check if the input contains at least eight trailing zeros (using the corresponding divisibility check algorithm with $q = 10^{8}$), and if that is the case, they remove eight zeros and invoke the 32-bit “2-1” variants of themselves. If there are fewer than eight trailing zeros, then they proceed like their “2-1” variants.
- Algorithms suffixed with “branchless” do branchless binary search, as suggested by reddit users r/pigeon768 and r/TheoreticalDumbass. (See this reddit post.)
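For reference, here is a minimal sketch of the “2-1” idea on $32$-bit inputs, based on the classical rotation-based Granlund-Montgomery check (the constants are computed by me: $3264175145 = 25^{-1} \bmod 2^{32}$ and $3435973837 = 5^{-1} \bmod 2^{32}$); it is an illustration of the strategy, not the actual benchmark code:

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>
#include <utility>

constexpr std::pair<std::uint32_t, std::size_t>
remove_trailing_zeros_21(std::uint32_t n) {
    std::size_t s = 0;
    // Try to remove two zeros at a time (q = 100, t = 2).
    while (true) {
        auto const r = std::rotr(n * 3264175145u, 2);
        if (r < 0xffffffffu / 100 + 1) {
            n = r;
            s += 2;
        }
        else {
            break;
        }
    }
    // At most one zero may remain (q = 10, t = 1).
    auto const r = std::rotr(n * 3435973837u, 1);
    if (r < 0xffffffffu / 10 + 1) {
        n = r;
        s += 1;
    }
    return {n, s};
}

static_assert(remove_trailing_zeros_21(17000) ==
              std::pair<std::uint32_t, std::size_t>{17, 3});

// 2500 = 25 * 100: two zeros removed in one iteration, none afterwards.
static_assert(remove_trailing_zeros_21(2500) ==
              std::pair<std::uint32_t, std::size_t>{25, 2});
```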
It seems there is not so much difference between the three algorithms overall, and even the naïve one is not so bad. There were notable fluctuations between repeated runs and the top performer varied from run to run, but all three consistently outperformed the naïve approach, and I could observe certain patterns.
Firstly, Lemire's algorithm seems to suffer for large divisors, for instance $q = 10^{8}$, or maybe even $q = 100$. This is probably because for large divisors the number of bits needed for a correct divisibility test often exceeds the word size. This means that comparing the low bits of the result of the multiplication with the threshold value is not a simple single comparison in reality.
For instance, the “Lemire 8-2-1” and “Lemire branchless” algorithms from the benchmark use $90$-bits for checking divisibility by $10^{8}$ for inputs of up to $16$ decimal digits. This means that, given that the input is passed as a $64$-bit unsigned integer, we perform a widening $128$-bit multiplication with a $64$-bit magic number (whose value is $\lceil 2^{90}/10^{8} \rceil = 12379400392853802749$ in this case), and we check two things to decide divisibility: that the lowest $26$-bits from the upper half of the $128$-bit multiplication result are all zero, and that the lower half of this $128$-bit multiplication result is strictly less than the magic number. Actually, the minimum required number of bits for a correct divisibility test is $78$ assuming inputs are limited to $16$ decimal digits, and I opted for more bits to make the magic number as precise as possible while still fitting into $64$-bits.
(Note also that needing more bits than the word size means that, even if all we care about is divisibility without any need to determine the quotient, Lemire's algorithm may still need a widening multiplication!)
Secondly, it seems that on x86-64 the classical Granlund-Montgomery algorithm is just better than the proposed generalization of it, especially for the branchless case. Note that `ror` and `shr` have identical performance (reference), so for the branchless case, having to bit-shift only when the input is divisible turns out to be a pessimization: we need to evaluate the bit-shift regardless of the divisibility, and as a result it just ends up requiring an additional `mov` compared to the unconditional bit-rotation of the classical Granlund-Montgomery algorithm. Even for the branchful case, it seems the compiler still tries to evaluate the bit-shift regardless of the divisibility and as a consequence generates one more `mov`.
My conclusion is that the classical Granlund-Montgomery algorithm is probably the best for trailing zero removal, at least on x86-64. Yet, the proposed generalized modular inverse algorithm may be a better choice on machines without a `ror` instruction, or machines with a smaller word size. Lemire's algorithm does not seem to offer any big advantage over the other two on x86-64, and I expect that the cost of widening multiplication may overwhelm the advantage of having only one magic number on machines with a smaller word size.
On the other hand, Lemire’s algorithm is still very useful if I have to know the quotient regardless of divisibility. There indeed is such an occasion in Dragonbox, where I am already leveraging Lemire’s algorithm for this purpose.
Special thanks to reddit users r/pigeon768 and r/TheoreticalDumbass who proposed the branchless idea!