Continued fractions and their application into fast computation of ⌊nx⌋⌊nx⌋
Published:
When I was working on Dragonbox and Grisu-Exact (which are float-to-string conversion algorithms with some nice properties) I had to come up with a fast method for computing things like ⌊nlog102⌋⌊nlog102⌋ or ⌊nlog210⌋⌊nlog210⌋, or more generally ⌊nx⌋⌊nx⌋ for some integer nn and a fixed positive real number xx. Actually, the sign of xx isn’t extremely important, but let us just assume x>0x>0 for simplicity.
At that time I just took a fairly straightforward approach, which is to compute the multiplication of nn with a truncated binary expansion of xx. That is, select an appropriate nonnegative integer kk and then compute ⌊nm2k⌋⌊nm2k⌋ where mm is 2k2k multiplied by the binary expansion of xx up to kk th fractional digits, which is a nonnegative integer. Hence, the whole computation can be done with one integer multiplication followed by one arithmetic shift, which is indeed quite fast.
Here is an example. Consider x=log102x=log102. According to Wolfram Alpha, it’s hexadecimal expansion is 0x0.4d104d427de7fbcc47c4acd605be48...
. If we choose, for example, k=20k=20, then m=m=0x4d104
=315652=315652. Hence, we try to compute ⌊nx⌋⌊nx⌋ by computing
instead, or in C code:
(315652 * n) >> 20
With a manual verification, it can be seen that this computation is valid for all |n|≤1650|n|≤1650. However, when n=1651n=1651, the above formula gives 496496 while the correct value is 497497, and when n=−1651n=−1651, the above gives −497−497 while the correct value is −498−498. Obviously, one would expect that the range of valid inputs will increase if we use a larger value for kk (thus using more digits of xx). That is certainly true in general, but it will at the same time make the computation more prone to overflow.
Note that in order to make this computation truly fast, we want to keep things inside a machine word size. For example, we may want the result of multiplication nmnm to fit inside 32-bits. Actually, even on 64-bit platforms, 32-bit arithmetic is often faster, so we still want things to be in 32-bits. For example, on platforms like x86_64, (if I recall correctly) there is no instruction for integer multiplication of a register and a 64-bit immediate constant, so if the constant mm is of 64-bits, we need to first load it into a register. If mm fits inside 32-bits, we don’t need to do that because there is a multiplication instruction for 32-bit constants and this results in a faster code.
(Please don’t say “you need a benchmark for such a claim”. I mean, yeah, that’s true, but firstly, once upon a time I witnessed a noticeable performance difference between these two so even though I’ve thrown out that code into the trash bin long time ago I still believe in this claim to some extent, and secondly and more importantly, difference of 32-bit vs 64-bit is not the point of this post.)
Hence, if for example we choose k=30k=30, then m=323228496m=323228496, so we must have |n|≤⌊(231−1)/m⌋=6|n|≤⌊(231−1)/m⌋=6 in order to ensure nmnm to fit in 32-bits.
(In fact, at least on x86_64, it seems that there is no difference in performance of multiplying a 64-bit register and a 32-bit constant to produce a 64-bit number, versus multiplying a 32-bit register and a 32-bit constant to produce a 32-bit number, so we might only want to ensure mm to be in 32-bits and mnmn to be in 64-bits, but let us just assume that we want nmnm to be also in 32-bits for simplicity. What matters here is that we want to keep things inside a fixed number of bits.)
As a result, these competing requirements will admit a sweet spot that gives the maximum range of nn. In this specific case, the sweet spot turns out to be k=21k=21 and m=631305m=631305, which allows nn up to |n|≤2620|n|≤2620.
So far, everything is nice, we obtained a nice formula for computing ⌊nx⌋⌊nx⌋ and we have found a range of nn that makes the formula valid. Except that we have zero theoretical estimate on how the size of the range of nn, kk, and xx are related. I just claimed that the range can be verified manually, which is indeed what I did when working on Dragonbox and Grisu-Exact.
To be precise, I did have some estimate at the time of developing Grisu-Exact, which I briefly explained here (it has little to do with this post so readers don’t need to look at it really), but this estimate is too rough and gives a poor bound on the range of valid nn’s.
Quite recently, I realized that the concept of so called continued fractions is a very useful tool for this kind of stuffs. In a nutshell, the reason why this concept is so useful is that it gives us a way to enumerate all the best rational approximations of any given real number, and from there we can derive a very detailed analysis of the mentioned method of computing ⌊nx⌋⌊nx⌋.
Continued fractions
A continued fraction is a way of writing a number xx in a form
x=a0+b1a1+b2a2+b3a3+⋯,x=a0+b1a1+b2a2+b3a3+⋯,(2)either as a finite or infinite sequence. We will only consider the case of simple continued fractions, which means that all of b1,b2, ⋯ b1,b2, ⋯ are 1. From now on, by a continued fraction we will always refer to a simple continued fraction. Also, a0a0 is assumed to be an integer while all other aiai’s are assumed to be positive integers. We denote the continued fraction with the coefficients a0,a1,a2, ⋯ a0,a1,a2, ⋯ as [a0;a1,a2, ⋯ ][a0;a1,a2, ⋯ ].
With these conventions, there is a fairly simple algorithm for obtaining a continued fraction expansion of xx. First, let a0:=⌊x⌋a0:=⌊x⌋. Then x−a0x−a0 is always in the interval [0,1)[0,1). If it is 00 that means xx is an integer so stop there. Otherwise, compute the reciprocal x1:=1x−a0∈(1,∞)x1:=1x−a0∈(1,∞). Then let a1:=⌊x1⌋a1:=⌊x1⌋, then again x1−a1x1−a1 lies in [0,1)[0,1). If it is zero, stop, and if not, compute the reciprocal x2:=1x1−a1x2:=1x1−a1 and continue.
Here is an example from Wikipedia; consider x=41593x=41593. As 415=4⋅93+43415=4⋅93+43, we obtain a0=4a0=4 and x1=9343x1=9343. Then from 93=2⋅43+793=2⋅43+7 we get a1=2a1=2 and x2=437x2=437, and similarly a2=6a2=6, x3=7x3=7. Since x3x3 is an integer, we get a3=7a3=7 and the process terminates. As a result, we obtain the continued fraction expansion
41593=4+12+16+17.41593=4+12+16+17.(3)It can be easily shown that this procedure terminates if and only if xx is a rational number. In fact, when x=pqx=pq is a rational number (whenever we say pqpq is a rational number, we implicitly assumes that it is in its reduced form; that is, qq is a positive integer and pp is an integer coprime to qq), then the coefficients aiai’s are nothing but those appearing in the Euclid algorithm of computing GCD of pp and qq (which is a priori assumed to be 11 of course).
When xx is rational and x=[a0;a1, ⋯ ,ai]x=[a0;a1, ⋯ ,ai] is the continued fraction expansion of xx obtained from the above algorithm, then either xx is an integer (so i=0i=0) or ai>1ai>1. Then [a0;a1, ⋯ ,ai−1,ai−1,1][a0;a1, ⋯ ,ai−1,ai−1,1] is another continued fraction expansion of xx. Then it can be shown that these two are the only continued fraction expansions for xx. For example, we have the following alternative continued fraction expansion
41593=4+12+16+16+1141593=4+12+16+16+11(4)of 4159341593, and these two expansions are the only continued fraction expansions of 4159341593.
When xx is an irrational number, we can run the same algorithm to get an infinite sequence, and the obtained continued fraction expansion of xx is the unique one. Here are some list of examples from Wikipedia:
- √19=[4;2,1,3,1,2,8,2,1,3,1,2,8, ⋯ ]√19=[4;2,1,3,1,2,8,2,1,3,1,2,8, ⋯ ]
- e=[2;1,2,1,1,4,1,1,6,1,1,8, ⋯ ]e=[2;1,2,1,1,4,1,1,6,1,1,8, ⋯ ]
- π=[3;7,15,1,292,1,1,1,2,1,3,1, ⋯ ]π=[3;7,15,1,292,1,1,1,2,1,3,1, ⋯ ]
Also, according to Wolfram Alpha, the continued fraction expansion of log102log102 is
log102=[0;3,3,9,2,2,4,6,2,1,1,3,1,18, ⋯ ].log102=[0;3,3,9,2,2,4,6,2,1,1,3,1,18, ⋯ ].(5)Given a continued fraction [a0;a1, ⋯ ][a0;a1, ⋯ ], the rational number [a0;a1, ⋯ ,ai][a0;a1, ⋯ ,ai] obtained by truncating the sequence at the ii th position is called the ii th convergent of the continued fraction. As it is a rational number we can write it uniquely as
[a0;a1, ⋯ai]=piqi[a0;a1, ⋯ai]=piqi(6)for a positive integer qiqi and an integer pipi coprime to qiqi. Then one can derive a recurrence relation
{pi=pi−2+aipi−1,qi=qi−2+aiqi−1{pi=pi−2+aipi−1,qi=qi−2+aiqi−1(7)with the initial conditions (p−2,q−2)=(0,1)(p−2,q−2)=(0,1) and (p−1,q−1)=(1,0)(p−1,q−1)=(1,0).
For example, for log102=[0;3,3,9,2,2,4,6,2,1,1,3,1,18, ⋯ ]log102=[0;3,3,9,2,2,4,6,2,1,1,3,1,18, ⋯ ], using the recurrence relation we can compute several initial convergents:
p0q0=01,p1q1=1+3⋅00+3⋅1=13,p2q2=0+3⋅11+3⋅3=310,p3q3=1+9⋅33+9⋅10=2893,p4q4=3+2⋅2810+2⋅93=59196,p5q5=28+2⋅5993+2⋅196=146485,p6q6=59+4⋅146196+4⋅485=6432136,p7q7=146+6⋅643485+6⋅2136=400413301,p8q8=643+2⋅40042136+2⋅13301=865128738,p0q0=01,p1q1=1+3⋅00+3⋅1=13,p2q2=0+3⋅11+3⋅3=310,p3q3=1+9⋅33+9⋅10=2893,p4q4=3+2⋅2810+2⋅93=59196,p5q5=28+2⋅5993+2⋅196=146485,p6q6=59+4⋅146196+4⋅485=6432136,p7q7=146+6⋅643485+6⋅2136=400413301,p8q8=643+2⋅40042136+2⋅13301=865128738,(8)and so on.
Note that the sequence of convergents above is converging to log102log102 rapidly. Indeed, the error |p2q2−x|∣∣p2q2−x∣∣ of the second convergent is about 1.03×10−31.03×10−3, and |p4q4−x|∣∣p4q4−x∣∣ is about 9.59×10−69.59×10−6. For the sixth convergent, the error |p6q6−x|∣∣p6q6−x∣∣ is about 3.31×10−83.31×10−8. This is way better than other types of rational approximations, let’s say by truncated decimal expansions of log102log102, because in that case the denominator must grow approximately as large as 108108 in order to achieve the error of order 10−810−8, but that order of error was achievable by p6q6p6q6 whose denominator is only 21362136.
Best rational approximations
A cool thing about continued fractions is that, in fact convergents are ones of the best possible rational approximations in the following sense. Given a real number xx, a rational number pqpq is called a best rational approximation of xx, if for every rational number abab with b≤qb≤q, we always have
|x−pq|≤|x−ab|.∣∣∣x−pq∣∣∣≤∣∣x−ab∣∣.(9)So, this means that there is no better approximation of xx by rational numbers of denominator no more than qq.
In fact, whenever pqpq is a best rational approximation of xx with q>1q>1, the equality in the inequality above is only achieved when ab=pqab=pq. Note that q=1q=1 corresponds to the uninteresting case of approximating xx with integers, and obviously in this case there is the unique best approximation of xx for all real numbers xx except only when x=n+12x=n+12 for some integer nn, and for those exceptional cases there are precisely two best approximations, namely nn and n+1n+1.
As hinted, given a continued fraction expansion of xx, any convergent piqipiqi is a best rational approximation of xx, except possibly for i=0i=0 (which is also a best rational approximation if and only if a1>1a1>1). It must be noted that, however, not every best rational approximation of xx is obtained as a convergent of its continued fraction expansion.
There is a nice description of all best rational approximations of xx in terms of convergents, but in this post that is kind of irrelevant. Rather, what matters for us is a method of enumerating all best rational approximations from below and from above. (In fact, the aforementioned description of all best rational approximations can be derived from there.)
Formally, we say a rational number pqpq is a best rational approximation from below of xx if pq≤xpq≤x and for any rational number abab with ab≤xab≤x and b≤qb≤q, we have
ab≤pq≤x.ab≤pq≤x.(10)Similarly, we say a rational number pqpq is a best rational approximation from above of xx if pq≥xpq≥x and for any rational number abab with ab≥xab≥x and b≤qb≤q, we have
ab≥pq≥x.ab≥pq≥x.(11)To describe how those best rational approximations from below/above look like, let x=[a0;a1, ⋯ ]x=[a0;a1, ⋯ ] be a continued fraction expansion of a real number xx and (piqi)i(piqi)i be the corresponding sequence of convergents. When xx is a non-integer rational number so that the expansion terminates after a nonzero number of steps, then we always assume that the expansion is of the second form, that is, the last coefficient is assumed to be 11.
It can be shown that the sequence (p2iq2i)i(p2iq2i)i of even convergents strictly increases to xx from below, while the sequence (p2i+1q2i+1)i(p2i+1q2i+1)i of odd convergents strictly decreases to xx from above. In other words, the convergence of convergents happens in a “zig-zag” manner, alternating between below and above of xx.
As the approximation errors of the even convergents decrease to zero, any sufficiently good rational approximation pq≤xpq≤x from below must lie in between some p2iq2ip2iq2i and p2i+2q2i+2p2i+2q2i+2. Similarly, any sufficiently good rational approximation pq≥xpq≥x from above must lie in between some p2i+1q2i+1p2i+1q2i+1 and p2i−1q2i−1p2i−1q2i−1.
Based on these facts, one can show the following: a rational number is a best rational approximation from below of xx if and only if it can be written as
p2i+sp2i+1q2i+sq2i+1p2i+sp2i+1q2i+sq2i+1(12)for some integer 0≤s≤a2i+20≤s≤a2i+2, and it is a best approximation from above of xx if and only if it can be written as
p2i−1+sp2iq2i−1+sq2ip2i−1+sp2iq2i−1+sq2i(13)for some integer 0≤s≤a2i+10≤s≤a2i+1.
In general, a rational number of the form
pi−1+spiqi−1+sqipi−1+spiqi−1+sqi(14)for an integer 0≤s≤ai+10≤s≤ai+1 is called a semiconvergent. So in other words, semiconvergents are precisely the best rational approximations from below/above of xx.
It can be shown that the sequence
(p2i+sp2i+1q2i+sq2i+1)a2i+2s=0(p2i+sp2i+1q2i+sq2i+1)a2i+2s=0(15)strictly monotonically increases from p2iq2ip2iq2i to p2i+2q2i+2p2i+2q2i+2, and similarly the sequence
(p2i−1+sp2iq2i−1+sq2i)a2i+1s=0(p2i−1+sp2iq2i−1+sq2i)a2i+1s=0(16)strictly monotonically decreases from p2i−1q2i−1p2i−1q2i−1 to p2i+1q2i+1p2i+1q2i+1. Therefore, as ss increases, we get successively better approximations of xx.
For the case of x=log102x=log102, here are the lists of all best rational approximations from below:
0<14<27<310<31103<59196<205681<3511166<4971651<6432136< ⋯ <log102,0<14<27<310<31103<59196<205681<3511166<4971651<6432136< ⋯ <log102,(17)and from above:
1>12>13>413>723>1033>1343>1653>1963>2273>2583>2893> ⋯ >log1021>12>13>413>723>1033>1343>1653>1963>2273>2583>2893> ⋯ >log102(18)with convergents highlighted in bold.
Clearly, if pqpq is a best rational approximation of xx from below, then we must have p=⌊qx⌋p=⌊qx⌋. Indeed, if pp is strictly less than ⌊qx⌋⌊qx⌋, then p+1qp+1q must be a strictly better approximation of xx which is still below xx, and if pp is strictly greater than ⌊qx⌋⌊qx⌋, then pqpq must be strictly greater than xx.
Similarly, if pqpq is a best rational approximation of xx from above, then we must have p=⌈qx⌉p=⌈qx⌉.
Application into computation of ⌊nx⌋⌊nx⌋
Alright, now let’s get back to where we started: when do we have the equality
⌊nm2k⌋=⌊nx⌋⌊nm2k⌋=⌊nx⌋(∗)for all |n|≤nmax|n|≤nmax? First, note that this equality can be equivalently written as the inequality
⌊nx⌋≤nm2k<⌊nx⌋+1.⌊nx⌋≤nm2k<⌊nx⌋+1.(19)The case n>0n>0
We will first consider the case n>0n>0. In this case, the inequality above can be rewritten as
⌊nx⌋n≤m2k<⌊nx⌋+1n.⌊nx⌋n≤m2k<⌊nx⌋+1n.(20)Obviously, thus (∗)(∗) holds for all n=1, ⋯ ,nmaxn=1, ⋯ ,nmax if and only if
maxn=1, ⋯ ,nmax⌊nx⌋n≤m2k<minn=1, ⋯ ,nmax⌊nx⌋+1n.maxn=1, ⋯ ,nmax⌊nx⌋n≤m2k<minn=1, ⋯ ,nmax⌊nx⌋+1n.(∗∗)Now, note that ⌊nx⌋n⌊nx⌋n and ⌊nx⌋+1n⌊nx⌋+1n are rational approximations of xx, where the former is smaller than or equal to xx and the latter is strictly greater than xx.
Therefore, the left-hand side of (∗∗)(∗∗) is nothing but p∗q∗p∗q∗, where p∗q∗p∗q∗ is the best rational approximation from below of xx with largest q∗≤nmaxq∗≤nmax.
Similarly, the right-hand side of (∗∗)(∗∗) is nothing but p∗q∗p∗q∗, where p∗q∗p∗q∗ is the best rational approximation from above of xx with the largest q∗≤nmaxq∗≤nmax. Except when x=p∗q∗x=p∗q∗ (which means that xx is rational and its denominator is at most nmaxnmax), in which case the situation is a bit dirtier and is analyzed as follows.
Note that the case x=p∗q∗x=p∗q∗ is the case considered in the classical paper by Granlund-Montgomery (Theorem 4.2), but here we will do a sharper analysis that gives us a condition that is not only sufficient but also necessary to have (∗)(∗) for all n=1, ⋯ ,nmaxn=1, ⋯ ,nmax. Since the case we are considering here is when xx is rational and its denominator is at most nmaxnmax, which means p∗q∗=p∗q∗=xp∗q∗=p∗q∗=x, so let us drop those stars from our notation and just write x=pqx=pq for brevity. Then we want to find the minimizer of
⌊np/q⌋+1n⌊np/q⌋+1n(21)for n=1, ⋯ ,nmaxn=1, ⋯ ,nmax. Let rr be the remainder of npnp divided by qq, then the above can be written as
(np−r)/q+1n=pq+q−rqn,(np−r)/q+1n=pq+q−rqn,(22)so the task is to find the minimizer of q−rnq−rn. One can expect that the minimum will be achieved when r=q−1r=q−1, so let uu be the largest integer such that u≤nmaxu≤nmax and up≡−1 (modq)up≡−1 (modq). We claim that uu is an optimizer of q−rnq−rn. Indeed, by definition uu, we must have nmax<u+qnmax<u+q. Note that for any nn,
q−rn=1u⋅(q−r)un,q−rn=1u⋅(q−r)un,(23)so it suffices to show that nn must be at most (q−r)u(q−r)u, so suppose n>(q−r)un>(q−r)u on the contrary. Since up≡−1 (modq)up≡−1 (modq), we have (q−r)up≡r≡np (modq)(q−r)up≡r≡np (modq). As we have np>(q−r)upnp>(q−r)up, we must have
np=(q−r)up+eqnp=(q−r)up+eq(24)for some e≥1e≥1. However, since npnp and (q−r)up(q−r)up are both multiples of pp, eqeq must be a multiple of pp as well, but since pp and qq are coprime, it follows that ee is a multiple of pp. Therefore,
n=(q−r)u+eqp≥(q−r)u+q≥u+q,n=(q−r)u+eqp≥(q−r)u+q≥u+q,(25)but this contradicts to nmax<u+qnmax<u+q, thus proving the claim.
As a result, we obtain
minn=1, ⋯ ,nmax⌊nx⌋+1n=x+1quminn=1, ⋯ ,nmax⌊nx⌋+1n=x+1qu(26)where uu is the largest integer such that u≤nmaxu≤nmax and up≡−1 (modq)up≡−1 (modq).
In summary, we get the following conclusions:
If xx is irrational or rational with denominator strictly greater than nmaxnmax, then (∗)(∗) holds for all n=1, ⋯ ,nmaxn=1, ⋯ ,nmax if and only if
2kp∗q∗≤m<2kp∗q∗.2kp∗q∗≤m<2kp∗q∗.(27)If x=pqx=pq is rational with q≤nmaxq≤nmax, then (∗)(∗) holds for all n=1, ⋯ ,nmaxn=1, ⋯ ,nmax if and only if
2kpq≤m<2kpq+2kqu2kpq≤m<2kpq+2kqu(28)where uu is the largest integer such that u≤nmaxu≤nmax and up≡−1 (modq)up≡−1 (modq).
The case n<0n<0
Next, let us consider the case n<0n<0. In this case, (∗)(∗) is equivalent to
⌈|n|x⌉−1|n|=⌊nx⌋+1n<m2k≤⌊nx⌋n=⌈|n|x⌉|n|.Similarly to the case n>0, the minimum of the right-hand side is precisely p∗q∗, where again p∗q∗ is the best rational approximation from above of x with the largest q∗≤nmax.
The maximum of the left-hand side is, as one can expect, a bit more involved. It is precisely p∗q∗ (with the same definition above) except when x=p∗q∗, in which case we do some more analysis.
As in the case n>0, assume x=pq with q≤nmax, and let us find the maximizer of
⌈np/q⌉−1nfor n=1, ⋯ ,nmax. This time, let r be the unique integer such that 0≤r≤q−1 and
np=⌈npq⌉q−r,then we can rewrite our objective function as
(np+r)/q−1n=pq−q−rqn.Therefore, a maximizer of the above is a minimizer of q−rqn, which is, as we obtained in the previous case, the largest integer u such that u≤nmax and up≡−1 (modq).
Hence, we get the following conclusions:
If x is irrational or rational with denominator strictly greater than nmax, then (∗) holds for all n=−1, ⋯ ,−nmax if and only if
2kp∗q∗<m≤2kp∗q∗.If x=pq is rational with q≤nmax, then (∗) holds for all n=−1, ⋯ ,−nmax if and only if
2kpq−2kqu<m≤2kpqwhere u is the largest integer such that u≤nmax and up≡−1 (modq).
Conclusion and an example
Putting two cases together, we get the following conclusion: if x is irrational or rational with denominator strictly greater than nmax, then (∗) holds for all |n|≤nmax if and only if
2kp∗q∗<m<2kp∗q∗.For the case x=pq with q≤nmax, we cannot really conclude anything useful if we consider positive n’s and negative n’s altogether, because the inequality we get is
2kpq≤m≤2kpq,which admits a solution if and only if 2kpq is an integer, which can happen only when q is a power of 2. But for that case the problem is already trivial.
The number x in main motivating examples are irrational numbers, so (◻) is indeed a useful conclusion. Now, let us apply it to the case x=log102 to see what we can get out of it. First, we need to see how p∗q∗ and p∗q∗ are determined for given nmax. This can be done in a systematic way, as shown below:
Find the last convergent piqi whose denominator qi is at most nmax.
If i is odd, we conclude
p∗q∗=piqiandp∗q∗=pi−1+spiqi−1+sqiwhere s is the largest integer with qi−1+sqi≤nmax.
If i is even, we conclude
p∗q∗=piqiandp∗q∗=pi−1+spiqi−1+sqiwhere s is the largest integer with qi−1+sqi≤nmax.
So, for example, if nmax=1000, then i=5 with piqi=146485 and pi−1qi−1=59196 so the maximum s with qi−1+sqi≤nmax is s=1, concluding
p∗q∗=146485andp∗q∗=59+1⋅146196+1⋅485=205681.Then, the minimum k that allows the inequality
2kp∗q∗<m<2kp∗q∗to have a solution is k=18, and in that case m=78913 is the unique solution. (One can verify that 78913 is precisely ⌊218log102⌋.)
In fact, note that p∗q∗ and p∗q∗ stay the same for any 681≤nmax≤1165; here, 1165 is one less of 196+2⋅485=1166, which is the moment when p∗q∗ changes into the next semiconvergent
59+2⋅146196+2⋅485=3511166.Therefore, with the choice k=18 and m=78913, we should have
⌊nx⌋=⌊78913⋅n218⌋for all |n|≤1165. In fact, even with p∗q∗=3511166 we still have (◻), so the above should hold until the next moment when p∗q∗ changes, which is when nmax=196+3⋅485=1651. If nmax=1651, then
p∗q∗=59+3⋅146196+3⋅485=4971651,and in this case now the left-hand side of (◻) is violated. Thus,
⌊nx⌋=⌊78913⋅n218⌋holds precisely up to |n|≤1650 and the first counterexample is n=±1651.
In a similar manner, one can see that p∗q∗=4971651 holds up to nmax=196+4⋅485−1=2135, and the minimum k allowing (◻) to have a solution is k=20, with m=315653 as the unique solution. (One can also verify that 315653 is precisely ⌈220log102⌉, so this time it is not the truncated binary expansion of log102, rather that plus 1.) Hence,
⌊nx⌋=⌊315653⋅n220⌋must hold at least up to |n|≤2135. When nmax=2136, p∗q∗ changes into 6432136, but (◻) is still true with the same choice of k and m, so the above formula must be valid at least for |n|≤2620; here 2621 is the moment when p∗q∗ changes from 146485 into 7892621. If nmax=2621 so p∗q∗ changes into 7892621, then the right-hand side of the inequality (◻) is now violated, so |n|≤2620 is indeed the optimal range and n=±2621 is the first counterexample.
In general, the transition can only occur at the denominators of semiconvergents. With this method, we can figure out what is the minimum k that allows the computation up to the chosen transition point and what m should be used for that choice of minimum k. We want m to be as small as possible, so m=⌈2kp∗q∗⌉ is the best choice. This m will be probably either floor or ceil of 2kx, but we cannot determine which one is better by simply looking at the binary expansion of x. This indicates the flaw of the naive method I tried before.
Another application: minmax Euclid algorithm
At the very heart of Ryū, Ryū-printf, Grisu-Exact and Dragonbox (which are all float-to-string conversion algorithms), there is so-called minmax Euclid algorithm.
Minmax Euclid algorithm is invented by Ulf Adams who is the author of Ryū and Ryū-printf mentioned above, and it appears on his paper on Ryū. It is at the core of showing the sufficiency of the number of bits of each item in the precomputed lookup table. What it does is this: given positive integers a,b and N, compute the minimum and maximum of agmodb where g is any integer ranging from 1 to N. It sounds simple, but naive exhaustive search doesn’t scale if a,b,N are very large numbers. Indeed, in its application into float-to-string conversion algorithms, a,b are extremely large integers and N can be as large as 254. For that wide range of g, it is simply impossible to run an exhaustive search.
A rough idea of minmax Euclid algorithm is that the coefficients appearing in the Euclid algorithm for computing GCD of a and b tell us what the minimum and the maximum are. If you try to write down and see how agmodb varies as g increases, you will find out why this is the case. (Precise mathematical proof can be cumbersome, though.) Formalizing this idea will lead to the minmax Euclid algorithm.
In fact, the exact minmax Euclid algorithm described in the Ryū paper is not entirely correct because it can produce wrong outputs for some inputs. Thus, I had to give some more thought on it after I realized the flaw in 2018 when I was trying to apply the algorithm to Grisu-Exact. Eventually, I came up with an improved and corrected version of it with a hopefully correct proof, which I wrote on the paper on Grisu-Exact (Figure 3 and Theorem 4.3).
But more than 2 years after writing the long-winded and messy proof of the algorithm, I finally realized that in fact the algorithm is just an easy application of continued fractions. Well, I did not know anything about continued fractions back then, unfortunately!
The story is as follows. Obtaining the minimum and the maximum of agmodb is equivalent to obtaining the minimum and the maximum of
agmodbb=agb−⌊agb⌋=g(ab−⌊ag/b⌋g)=1−g(⌊ag/b⌋+1g−ab),which reminds us of approximating ab by ⌊ag/b⌋g or by ⌊ag/b⌋+1g, except that what we are minimizing is not the error itself, rather the error multiplied by g. Hence, we define the following concepts.
We say a rational number pq is a best rational approximation from below in the strong sense of x if pq≤x and for any rational number ab with ab≤x and b≤q, we have
|qx−p|≤|bx−a|.Similarly, we say a rational number pq is a best rational approximation from above in the strong sense of x if pq≥x and for any rational number ab with ab≥x and b≤q, we have
|qx−p|≤|bx−a|.Let us call best rational approximations from below/above defined before as “in the weak sense” to distinguish from the new definitions.
The reason why these are called “in the strong sense” is obvious; if we have
|qx−p|≤|bx−a|,then dividing both sides by q gives
|x−pq|≤bq|x−ab|,so with b≤q the right-hand side is at most |x−ab|, so this implies that pq is a better approximation than ab.
It is a well-known fact that, if we remove the directional conditions ab≤x and ab≥x, so asking for best rational approximations (in both directions) in the strong sense, then these are precisely the convergents. However, it turns out that with the directional conditions, the best rational approximations from below/above in the weak sense or in the strong sense are just same. Hence, the best rational approximations from below/above in the strong sense are also just precisely the semiconvergents. This fact is probably well-known as well, but I do not know of any reference at this point other than my own proof of it (which I do not write here). I guess this fact on semiconvergents is probably one of the necessary pieces for proving the other fact that convergents are precisely the best rational approximations in the strong sense, but I haven’t checked any proof of it so I do not know.
Anyway, because of this, the problem of finding the minimum and the maximum of agmodb reduces to the problem of finding the semiconvergents that are below and above ab with the largest denominators among all semiconvergents of denominators at most N. This is essentially what the improved minmax Euclid algorithm is doing as described in my paper. I will not discuss further details here.
Computation of ⌊nx−y⌋
When developing Dragonbox, I also had to come up with a method of computing
⌊nlog102−log1043⌋.So I did the same thing, I approximated both log102 and log1043 by their respective truncated binary expansions, and computed something like
⌊nm−s2k⌋,where m,s are both positive integers, and manually found out the range of n making the above formula valid.
More generally, we can think of computing ⌊nx−y⌋ for fixed positive real numbers x,y. Again, signs of x,y are not terribly important, but let us just assume x,y>0 to make our life a bit easier. Also, let us further assume 0<y<1 because taking the integer part of y out of the floor function is not a big deal. So, how far we can go with a similar analysis in this case? Right now, it seems to me that this problem is far subtler than the case y=0 we have done before, but anyway here I am going to give a brief overview of what I got.
The rough idea is as follows. If we approximate x by a rational number pq sufficiently well, then each nx must be very close to npq. Note that
⌊nx−y⌋={⌊nx⌋if nx−⌊nx⌋≥y,⌊nx⌋−1if nx−⌊nx⌋<y,hence, what matters here is whether the fractional part of nx is above or below y. Note that the fractional part of nx must be approximately equal to npmodqq. Thus, find the unique integer 0≤u<q−1 such that uq≤y<u+1q, then probably we can compare the fractional part of nx with y by just comparing (npmodq) with u. This is indeed the case if y is sufficiently far from both uq and u+1q, specifically, more than the error of npq from nx.
So, in some sense the situation is better if x and y are “far apart” in the sense that the denominators showing up in approximating rationals of x and y are very different, and the situation is worse if there are a lot of common divisors between those denominators. Maybe someone better than me at number theory can formalize this into a precise language?
Anyway, with a high probability the distance from y to uq and u+1q will be of O(1q), but it is well-known that the distance from x to pq can be made to be of O(1q2), which will allow the equality
⌊nx−y⌋=⌊np−uq⌋to hold for O(q)-many n’s. After getting this equality, the next step is to convert it into
⌊np−uq⌋=⌊nm−s2k⌋using a Granlund-Montgomery style analysis we did for ⌊nx⌋ with rational x.
Unfortunately, the bound I got by applying this analysis to the case x=log102 and y=log1043 was not that great, particularly because y is too close to u+1q for the choice pq=6432136, which is otherwise a very efficient and effective approximation of x. Well, but I might be overlooking some things at this point, so probably I have to give some more tries on this later.
(Edit on 02-10-2022) I included a better analysis of this in my new paper on Dragonbox, Section 4.4. But still I consider it incomplete, because it seems that the actual range of n is much wider than what the analysis method described in the paper expects.
Acknowledgements
Thanks Seung uk Jang for teaching me many things about continued fractions (including useful references) and other discussions relevant to this post.
Leave a Comment