Continued fractions and their application to fast computation of $\lfloor nx\rfloor$

When I was working on Dragonbox and Grisu-Exact (which are float-to-string conversion algorithms with some nice properties), I had to come up with a fast method for computing things like $\lfloor n\log_{10}2\rfloor$ or $\lfloor n\log_{2}10\rfloor$, or more generally $\lfloor nx\rfloor$ for some integer $n$ and a fixed positive real number $x$. Actually, the sign of $x$ isn’t extremely important, but let us just assume $x>0$ for simplicity.

At that time I just took a fairly straightforward approach, which is to compute the multiplication of $n$ with a truncated binary expansion of $x$. That is, select an appropriate nonnegative integer $k$ and then compute $\left\lfloor\frac{nm}{2^{k}}\right\rfloor$, where $m$ is $2^{k}$ multiplied by the binary expansion of $x$ up to the $k$th fractional digit, which is a nonnegative integer. Hence, the whole computation can be done with one integer multiplication followed by one arithmetic shift, which is indeed quite fast.

Here is an example. Consider $x=\log_{10}2$. According to Wolfram Alpha, its hexadecimal expansion is 0x0.4d104d427de7fbcc47c4acd605be48.... If we choose, for example, $k=20$, then $m =$ 0x4d104 $= 315652$. Hence, we try to compute $\lfloor nx\rfloor$ by computing

$$\left\lfloor\frac{315652\cdot n}{2^{20}}\right\rfloor \tag{1}$$

instead, or in C code:

  (315652 * n) >> 20

With a manual verification, it can be seen that this computation is valid for all $|n|\le 1650$. However, when $n=1651$, the above formula gives $496$ while the correct value is $497$, and when $n=-1651$, the above gives $-497$ while the correct value is $-498$. Obviously, one would expect that the range of valid inputs will increase if we use a larger value for $k$ (thus using more digits of $x$). That is certainly true in general, but it will at the same time make the computation more prone to overflow.
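
Such a manual verification is easy to script. Here is a minimal sketch (not the actual verification code from Dragonbox or Grisu-Exact) that scans for the first $|n|$ where the shift-based formula with $k=20$, $m=315652$ disagrees with $\lfloor n\log_{10}2\rfloor$ computed in long double arithmetic; it should report the mismatch at $n=1651$ described above.

    #include <cmath>
    #include <cstdint>
    #include <iostream>

    int main() {
        constexpr std::int64_t m = 315652;  // 0x4d104
        constexpr int k = 20;
        long double const x = std::log10(2.0L);
        for (std::int64_t n = 1; n <= 100000; ++n) {
            for (std::int64_t nn : {n, -n}) {
                // Arithmetic right shift of a negative value is floor division by 2^k
                // on typical implementations (and is guaranteed since C++20).
                std::int64_t approx = (m * nn) >> k;
                std::int64_t exact = (std::int64_t)std::floor(nn * x);
                if (approx != exact) {
                    std::cout << "first mismatch at n = " << nn << ": got " << approx
                              << ", expected " << exact << '\n';
                    return 0;
                }
            }
        }
    }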

Note that in order to make this computation truly fast, we want to keep things inside a machine word size. For example, we may want the result of the multiplication $nm$ to fit inside 32 bits. Actually, even on 64-bit platforms, 32-bit arithmetic is often faster, so we still want things to be in 32 bits. For example, on platforms like x86_64, (if I recall correctly) there is no instruction for integer multiplication of a register and a 64-bit immediate constant, so if the constant $m$ is of 64 bits, we need to first load it into a register. If $m$ fits inside 32 bits, we don’t need to do that because there is a multiplication instruction taking 32-bit immediate constants, and this results in faster code.

(Please don’t say “you need a benchmark for such a claim”. I mean, yeah, that’s true, but firstly, once upon a time I witnessed a noticeable performance difference between these two, so even though I threw that code into the trash bin a long time ago I still believe in this claim to some extent, and secondly and more importantly, the difference of 32-bit vs 64-bit is not the point of this post.)

Hence, if for example we choose $k=30$, then $m=323228496$, so we must have $|n|\le\lfloor(2^{31}-1)/m\rfloor=6$ in order to ensure that $nm$ fits in 32 bits.

(In fact, at least on x86_64, it seems that there is no difference in performance between multiplying a 64-bit register and a 32-bit constant to produce a 64-bit number, and multiplying a 32-bit register and a 32-bit constant to produce a 32-bit number, so we might only want to ensure that $m$ is in 32 bits and $mn$ is in 64 bits, but let us just assume that we want $nm$ to be also in 32 bits for simplicity. What matters here is that we want to keep things inside a fixed number of bits.)

As a result, these competing requirements will admit a sweet spot that gives the maximum range of $n$. In this specific case, the sweet spot turns out to be $k=21$ and $m=631306$, which allows $n$ up to $|n|\le 2620$.

So far, everything is nice: we obtained a nice formula for computing $\lfloor nx\rfloor$, and we have found a range of $n$ that makes the formula valid. Except that we have zero theoretical estimate of how the range of $n$, $k$, and $x$ are related. I just claimed that the range can be verified manually, which is indeed what I did when working on Dragonbox and Grisu-Exact.

To be precise, I did have some estimate at the time of developing Grisu-Exact, which I briefly explained here (it has little to do with this post, so readers don’t need to look at it really), but this estimate is too rough and gives a poor bound on the range of valid $n$’s.

Quite recently, I realized that the concept of so-called continued fractions is a very useful tool for this kind of stuff. In a nutshell, the reason why this concept is so useful is that it gives us a way to enumerate all the best rational approximations of any given real number, and from there we can derive a very detailed analysis of the mentioned method of computing $\lfloor nx\rfloor$.

Continued fractions

A continued fraction is a way of writing a number $x$ in a form

$$x = a_0 + \cfrac{b_1}{a_1 + \cfrac{b_2}{a_2 + \cfrac{b_3}{a_3 + \cdots}}}, \tag{2}$$

either as a finite or infinite sequence. We will only consider the case of simple continued fractions, which means that all of $b_1, b_2, \cdots$ are $1$. From now on, by a continued fraction we will always refer to a simple continued fraction. Also, $a_0$ is assumed to be an integer while all other $a_i$’s are assumed to be positive integers. We denote the continued fraction with the coefficients $a_0, a_1, a_2, \cdots$ as $[a_0; a_1, a_2, \cdots]$.

With these conventions, there is a fairly simple algorithm for obtaining a continued fraction expansion of $x$. First, let $a_0 := \lfloor x\rfloor$. Then $x - a_0$ is always in the interval $[0,1)$. If it is $0$, that means $x$ is an integer, so stop there. Otherwise, compute the reciprocal $x_1 := \frac{1}{x - a_0} \in (1,\infty)$. Then let $a_1 := \lfloor x_1\rfloor$; then again $x_1 - a_1$ lies in $[0,1)$. If it is zero, stop, and if not, compute the reciprocal $x_2 := \frac{1}{x_1 - a_1}$ and continue.

Here is an example from Wikipedia; consider $x = \frac{415}{93}$. As $415 = 4\cdot 93 + 43$, we obtain $a_0 = 4$ and $x_1 = \frac{93}{43}$. Then from $93 = 2\cdot 43 + 7$ we get $a_1 = 2$ and $x_2 = \frac{43}{7}$, and similarly $a_2 = 6$, $x_3 = 7$. Since $x_3$ is an integer, we get $a_3 = 7$ and the process terminates. As a result, we obtain the continued fraction expansion

$$\frac{415}{93} = 4 + \cfrac{1}{2 + \cfrac{1}{6 + \cfrac{1}{7}}}. \tag{3}$$

It can be easily shown that this procedure terminates if and only if $x$ is a rational number. In fact, when $x = \frac{p}{q}$ is a rational number (whenever we say $\frac{p}{q}$ is a rational number, we implicitly assume that it is in its reduced form; that is, $q$ is a positive integer and $p$ is an integer coprime to $q$), then the coefficients $a_i$’s are nothing but those appearing in the Euclid algorithm of computing the GCD of $p$ and $q$ (which is a priori assumed to be $1$, of course).
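
Since the coefficients are exactly the quotients produced by the Euclid algorithm, computing them for a rational input takes only a few lines. Below is a minimal sketch (the function name is mine, not from any library); it prints 4 2 6 7 for the example above.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Continued fraction coefficients [a0; a1, a2, ...] of p/q, assuming q > 0.
    // These are precisely the quotients of the Euclid algorithm applied to (p, q).
    std::vector<std::int64_t> continued_fraction(std::int64_t p, std::int64_t q) {
        std::vector<std::int64_t> coeffs;
        while (q != 0) {
            std::int64_t a = p / q;
            // Use floor division so that only the first coefficient can be negative.
            if (p % q != 0 && (p < 0) != (q < 0)) --a;
            coeffs.push_back(a);
            std::int64_t r = p - a * q;
            p = q; q = r;
        }
        return coeffs;
    }

    int main() {
        for (auto a : continued_fraction(415, 93)) std::cout << a << ' ';
        std::cout << '\n';
    }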

When $x$ is rational and $x = [a_0; a_1, \cdots, a_i]$ is the continued fraction expansion of $x$ obtained from the above algorithm, then either $x$ is an integer (so $i=0$) or $a_i > 1$. Then $[a_0; a_1, \cdots, a_{i-1}, a_i - 1, 1]$ is another continued fraction expansion of $x$, and it can be shown that these two are the only continued fraction expansions of $x$. For example, we have the following alternative continued fraction expansion

$$\frac{415}{93} = 4 + \cfrac{1}{2 + \cfrac{1}{6 + \cfrac{1}{6 + \cfrac{1}{1}}}} \tag{4}$$

of $\frac{415}{93}$, and these two expansions are the only continued fraction expansions of $\frac{415}{93}$.

When $x$ is an irrational number, we can run the same algorithm to get an infinite sequence, and the obtained continued fraction expansion of $x$ is the unique one. Here is a list of examples from Wikipedia:

  • $\sqrt{19} = [4; 2, 1, 3, 1, 2, 8, 2, 1, 3, 1, 2, 8, \cdots]$
  • $e = [2; 1, 2, 1, 1, 4, 1, 1, 6, 1, 1, 8, \cdots]$
  • $\pi = [3; 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, \cdots]$

Also, according to Wolfram Alpha, the continued fraction expansion of $\log_{10}2$ is

$$\log_{10}2 = [0; 3, 3, 9, 2, 2, 4, 6, 2, 1, 1, 3, 1, 18, \cdots]. \tag{5}$$

Given a continued fraction $[a_0; a_1, \cdots]$, the rational number $[a_0; a_1, \cdots, a_i]$ obtained by truncating the sequence at the $i$th position is called the $i$th convergent of the continued fraction. As it is a rational number, we can write it uniquely as

$$[a_0; a_1, \cdots, a_i] = \frac{p_i}{q_i} \tag{6}$$

for a positive integer $q_i$ and an integer $p_i$ coprime to $q_i$. Then one can derive a recurrence relation

$$\begin{cases} p_i = p_{i-2} + a_i p_{i-1}, \\ q_i = q_{i-2} + a_i q_{i-1} \end{cases} \tag{7}$$

with the initial conditions $(p_{-2}, q_{-2}) = (0,1)$ and $(p_{-1}, q_{-1}) = (1,0)$.

For example, for $\log_{10}2 = [0; 3, 3, 9, 2, 2, 4, 6, 2, 1, 1, 3, 1, 18, \cdots]$, using the recurrence relation we can compute several initial convergents:

$$\begin{aligned}
\frac{p_0}{q_0} &= \frac{0}{1}, &
\frac{p_1}{q_1} &= \frac{1 + 3\cdot 0}{0 + 3\cdot 1} = \frac{1}{3}, &
\frac{p_2}{q_2} &= \frac{0 + 3\cdot 1}{1 + 3\cdot 3} = \frac{3}{10}, \\
\frac{p_3}{q_3} &= \frac{1 + 9\cdot 3}{3 + 9\cdot 10} = \frac{28}{93}, &
\frac{p_4}{q_4} &= \frac{3 + 2\cdot 28}{10 + 2\cdot 93} = \frac{59}{196}, &
\frac{p_5}{q_5} &= \frac{28 + 2\cdot 59}{93 + 2\cdot 196} = \frac{146}{485}, \\
\frac{p_6}{q_6} &= \frac{59 + 4\cdot 146}{196 + 4\cdot 485} = \frac{643}{2136}, &
\frac{p_7}{q_7} &= \frac{146 + 6\cdot 643}{485 + 6\cdot 2136} = \frac{4004}{13301}, &
\frac{p_8}{q_8} &= \frac{643 + 2\cdot 4004}{2136 + 2\cdot 13301} = \frac{8651}{28738},
\end{aligned} \tag{8}$$

and so on.
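
The recurrence is straightforward to turn into code. Here is a small sketch that reproduces the table above from the hard-coded, truncated coefficients in (5):

    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        // Coefficients of log10(2) = [0; 3, 3, 9, 2, 2, 4, 6, 2, ...], truncated.
        std::vector<std::int64_t> a = {0, 3, 3, 9, 2, 2, 4, 6, 2};
        // Initial conditions (p_{-2}, q_{-2}) = (0, 1) and (p_{-1}, q_{-1}) = (1, 0).
        std::int64_t p_prev2 = 0, q_prev2 = 1, p_prev1 = 1, q_prev1 = 0;
        for (std::size_t i = 0; i < a.size(); ++i) {
            std::int64_t p = p_prev2 + a[i] * p_prev1;
            std::int64_t q = q_prev2 + a[i] * q_prev1;
            std::cout << "p_" << i << "/q_" << i << " = " << p << "/" << q << '\n';
            p_prev2 = p_prev1; q_prev2 = q_prev1;
            p_prev1 = p;       q_prev1 = q;
        }
    }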

Note that the sequence of convergents above converges to $\log_{10}2$ rapidly. Indeed, the error $\left|\frac{p_{2}}{q_{2}} - x\right|$ of the second convergent is about $1.03\times 10^{-3}$, and $\left|\frac{p_{4}}{q_{4}} - x\right|$ is about $9.59\times 10^{-6}$. For the sixth convergent, the error $\left|\frac{p_{6}}{q_{6}} - x\right|$ is about $3.31\times 10^{-8}$. This is way better than other types of rational approximations, say by truncated decimal expansions of $\log_{10}2$, because in that case the denominator must grow approximately as large as $10^{8}$ in order to achieve an error of order $10^{-8}$, while that order of error is achieved by $\frac{p_{6}}{q_{6}}$, whose denominator is only $2136$.

Best rational approximations

A cool thing about continued fractions is that convergents are in fact among the best possible rational approximations, in the following sense. Given a real number $x$, a rational number $\frac{p}{q}$ is called a best rational approximation of $x$ if, for every rational number $\frac{a}{b}$ with $b\le q$, we always have

$$\left|x - \frac{p}{q}\right| \le \left|x - \frac{a}{b}\right|. \tag{9}$$

So, this means that there is no better approximation of $x$ by rational numbers of denominator no more than $q$.

In fact, whenever $\frac{p}{q}$ is a best rational approximation of $x$ with $q>1$, the equality in the inequality above is achieved only when $\frac{a}{b} = \frac{p}{q}$. Note that $q=1$ corresponds to the uninteresting case of approximating $x$ with integers, and obviously in this case there is a unique best approximation of $x$ for all real numbers $x$ except only when $x = n + \frac{1}{2}$ for some integer $n$, and for those exceptional cases there are precisely two best approximations, namely $n$ and $n+1$.

As hinted, given a continued fraction expansion of $x$, any convergent $\frac{p_i}{q_i}$ is a best rational approximation of $x$, except possibly for $i=0$ (which is also a best rational approximation if and only if $a_1 > 1$). It must be noted, however, that not every best rational approximation of $x$ is obtained as a convergent of its continued fraction expansion.

There is a nice description of all best rational approximations of $x$ in terms of convergents, but in this post that is kind of irrelevant. Rather, what matters for us is a method of enumerating all best rational approximations from below and from above. (In fact, the aforementioned description of all best rational approximations can be derived from there.)

Formally, we say a rational number $\frac{p}{q}$ is a best rational approximation from below of $x$ if $\frac{p}{q}\le x$ and, for any rational number $\frac{a}{b}$ with $\frac{a}{b}\le x$ and $b\le q$, we have

$$\frac{a}{b} \le \frac{p}{q} \le x. \tag{10}$$

Similarly, we say a rational number $\frac{p}{q}$ is a best rational approximation from above of $x$ if $\frac{p}{q}\ge x$ and, for any rational number $\frac{a}{b}$ with $\frac{a}{b}\ge x$ and $b\le q$, we have

$$\frac{a}{b} \ge \frac{p}{q} \ge x. \tag{11}$$

To describe what those best rational approximations from below/above look like, let $x = [a_0; a_1, \cdots]$ be a continued fraction expansion of a real number $x$ and let $\left(\frac{p_i}{q_i}\right)_i$ be the corresponding sequence of convergents. When $x$ is a non-integer rational number, so that the expansion terminates after a nonzero number of steps, we always assume that the expansion is of the second form, that is, the last coefficient is assumed to be $1$.

It can be shown that the sequence $\left(\frac{p_{2i}}{q_{2i}}\right)_i$ of even convergents strictly increases to $x$ from below, while the sequence $\left(\frac{p_{2i+1}}{q_{2i+1}}\right)_i$ of odd convergents strictly decreases to $x$ from above. In other words, the convergence of convergents happens in a “zig-zag” manner, alternating between below and above of $x$.

As the approximation errors of the even convergents decrease to zero, any sufficiently good rational approximation $\frac{p}{q}\le x$ from below must lie in between some $\frac{p_{2i}}{q_{2i}}$ and $\frac{p_{2i+2}}{q_{2i+2}}$. Similarly, any sufficiently good rational approximation $\frac{p}{q}\ge x$ from above must lie in between some $\frac{p_{2i+1}}{q_{2i+1}}$ and $\frac{p_{2i-1}}{q_{2i-1}}$.

Based on these facts, one can show the following: a rational number is a best rational approximation from below of $x$ if and only if it can be written as

$$\frac{p_{2i} + s\,p_{2i+1}}{q_{2i} + s\,q_{2i+1}} \tag{12}$$

for some integer $0\le s\le a_{2i+2}$, and it is a best approximation from above of $x$ if and only if it can be written as

$$\frac{p_{2i-1} + s\,p_{2i}}{q_{2i-1} + s\,q_{2i}} \tag{13}$$

for some integer $0\le s\le a_{2i+1}$.

In general, a rational number of the form

$$\frac{p_{i-1} + s\,p_{i}}{q_{i-1} + s\,q_{i}} \tag{14}$$

for an integer $0\le s\le a_{i+1}$ is called a semiconvergent. So in other words, semiconvergents are precisely the best rational approximations from below/above of $x$.

It can be shown that the sequence

$$\left(\frac{p_{2i} + s\,p_{2i+1}}{q_{2i} + s\,q_{2i+1}}\right)_{s=0}^{a_{2i+2}} \tag{15}$$

strictly monotonically increases from $\frac{p_{2i}}{q_{2i}}$ to $\frac{p_{2i+2}}{q_{2i+2}}$, and similarly the sequence

$$\left(\frac{p_{2i-1} + s\,p_{2i}}{q_{2i-1} + s\,q_{2i}}\right)_{s=0}^{a_{2i+1}} \tag{16}$$

strictly monotonically decreases from $\frac{p_{2i-1}}{q_{2i-1}}$ to $\frac{p_{2i+1}}{q_{2i+1}}$. Therefore, as $s$ increases, we get successively better approximations of $x$.

For the case of $x = \log_{10}2$, here are the lists of all best rational approximations from below:

$$\mathbf{0} < \frac{1}{4} < \frac{2}{7} < \mathbf{\frac{3}{10}} < \frac{31}{103} < \mathbf{\frac{59}{196}} < \frac{205}{681} < \frac{351}{1166} < \frac{497}{1651} < \mathbf{\frac{643}{2136}} < \cdots < \log_{10}2, \tag{17}$$

and from above:

$$1 > \frac{1}{2} > \mathbf{\frac{1}{3}} > \frac{4}{13} > \frac{7}{23} > \frac{10}{33} > \frac{13}{43} > \frac{16}{53} > \frac{19}{63} > \frac{22}{73} > \frac{25}{83} > \mathbf{\frac{28}{93}} > \cdots > \log_{10}2 \tag{18}$$

with convergents highlighted in bold.
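
To make this concrete, here is a short sketch that enumerates the best rational approximations from below and from above (i.e. the semiconvergents) of a number given by its continued fraction coefficients, up to a prescribed bound on the denominator; with the truncated coefficients of $\log_{10}2$ from (5) and the bound $2136$, it reproduces the from-below list (17) and the beginning of the from-above list (18).

    #include <cstdint>
    #include <iostream>
    #include <utility>
    #include <vector>

    int main() {
        // Continued fraction coefficients of log10(2), truncated (see (5)).
        std::vector<std::int64_t> a = {0, 3, 3, 9, 2, 2, 4, 6, 2};
        std::int64_t const max_den = 2136;
        std::vector<std::pair<std::int64_t, std::int64_t>> below, above;
        // (p_{-1}, q_{-1}) = (1, 0) counts as "above"; (p_0, q_0) = (a_0, 1) is "below".
        std::int64_t p_prev = 1, q_prev = 0, p_cur = a[0], q_cur = 1;
        below.push_back({p_cur, q_cur});
        bool done = false;
        for (std::size_t i = 1; i < a.size() && !done; ++i) {
            // Semiconvergents (p_prev + s*p_cur) / (q_prev + s*q_cur), s = 1, ..., a[i],
            // approach x from the same side as p_prev/q_prev does.
            for (std::int64_t s = 1; s <= a[i]; ++s) {
                std::int64_t p = p_prev + s * p_cur, q = q_prev + s * q_cur;
                if (q > max_den) { done = true; break; }
                (i % 2 == 1 ? above : below).push_back({p, q});
            }
            p_prev = std::exchange(p_cur, p_prev + a[i] * p_cur);
            q_prev = std::exchange(q_cur, q_prev + a[i] * q_cur);
        }
        std::cout << "from below:";
        for (auto [p, q] : below) std::cout << ' ' << p << '/' << q;
        std::cout << "\nfrom above:";
        for (auto [p, q] : above) std::cout << ' ' << p << '/' << q;
        std::cout << '\n';
    }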

Clearly, if $\frac{p}{q}$ is a best rational approximation of $x$ from below, then we must have $p = \lfloor qx\rfloor$. Indeed, if $p$ is strictly less than $\lfloor qx\rfloor$, then $\frac{p+1}{q}$ must be a strictly better approximation of $x$ which is still below $x$, and if $p$ is strictly greater than $\lfloor qx\rfloor$, then $\frac{p}{q}$ must be strictly greater than $x$.

Similarly, if $\frac{p}{q}$ is a best rational approximation of $x$ from above, then we must have $p = \lceil qx\rceil$.

Application to the computation of $\lfloor nx\rfloor$

Alright, now let’s get back to where we started: when do we have the equality

$$\left\lfloor\frac{nm}{2^{k}}\right\rfloor = \lfloor nx\rfloor \tag{$*$}$$

for all $|n|\le n_{\max}$? First, note that this equality can be equivalently written as the inequality

$$\lfloor nx\rfloor \le \frac{nm}{2^{k}} < \lfloor nx\rfloor + 1. \tag{19}$$

The case $n>0$

We will first consider the case $n>0$. In this case, the inequality above can be rewritten as

$$\frac{\lfloor nx\rfloor}{n} \le \frac{m}{2^{k}} < \frac{\lfloor nx\rfloor + 1}{n}. \tag{20}$$

Obviously, then, $(*)$ holds for all $n = 1, \cdots, n_{\max}$ if and only if

$$\max_{n=1,\cdots,n_{\max}} \frac{\lfloor nx\rfloor}{n} \le \frac{m}{2^{k}} < \min_{n=1,\cdots,n_{\max}} \frac{\lfloor nx\rfloor + 1}{n}. \tag{$**$}$$

Now, note that $\frac{\lfloor nx\rfloor}{n}$ and $\frac{\lfloor nx\rfloor + 1}{n}$ are rational approximations of $x$, where the former is smaller than or equal to $x$ and the latter is strictly greater than $x$.

Therefore, the left-hand side of $(**)$ is nothing but $\frac{p_{*}}{q_{*}}$, where $\frac{p_{*}}{q_{*}}$ is the best rational approximation from below of $x$ with the largest $q_{*}\le n_{\max}$.

Similarly, the right-hand side of $(**)$ is nothing but $\frac{p^{*}}{q^{*}}$, where $\frac{p^{*}}{q^{*}}$ is the best rational approximation from above of $x$ with the largest $q^{*}\le n_{\max}$. Except when $x = \frac{p^{*}}{q^{*}}$ (which means that $x$ is rational and its denominator is at most $n_{\max}$), in which case the situation is a bit dirtier and is analyzed as follows.

Note that the case $x = \frac{p}{q}$ is the case considered in the classical paper by Granlund–Montgomery (Theorem 4.2), but here we will do a sharper analysis that gives us a condition that is not only sufficient but also necessary to have $(*)$ for all $n = 1, \cdots, n_{\max}$. Since the case we are considering here is when $x$ is rational and its denominator is at most $n_{\max}$, which means $\frac{p^{*}}{q^{*}} = x$, let us drop those stars from our notation and just write $x = \frac{p}{q}$ for brevity. Then we want to find the minimizer of

$$\frac{\lfloor np/q\rfloor + 1}{n} \tag{21}$$

for $n = 1, \cdots, n_{\max}$. Let $r$ be the remainder of $np$ divided by $q$; then the above can be written as

$$\frac{(np - r)/q + 1}{n} = \frac{p}{q} + \frac{q - r}{qn}, \tag{22}$$

so the task is to find the minimizer of $\frac{q-r}{n}$. One can expect that the minimum will be achieved when $r = q-1$, so let $u$ be the largest integer such that $u\le n_{\max}$ and $up\equiv -1 \pmod{q}$. We claim that $u$ is a minimizer of $\frac{q-r}{n}$. Indeed, by the definition of $u$, we must have $n_{\max} < u + q$. Note that for any $n$,

$$\frac{q-r}{n} = \frac{1}{u}\cdot\frac{(q-r)u}{n}, \tag{23}$$

so it suffices to show that $n$ must be at most $(q-r)u$, so suppose $n > (q-r)u$ on the contrary. Since $up\equiv -1 \pmod{q}$, we have $(q-r)up\equiv r\equiv np \pmod{q}$. As we have $np > (q-r)up$, we must have

$$np = (q-r)up + eq \tag{24}$$

for some $e\ge 1$. However, since $np$ and $(q-r)up$ are both multiples of $p$, $eq$ must be a multiple of $p$ as well, but since $p$ and $q$ are coprime, it follows that $e$ is a multiple of $p$. Therefore,

$$n = (q-r)u + \frac{eq}{p} \ge (q-r)u + q \ge u + q, \tag{25}$$

but this contradicts $n_{\max} < u + q$, thus proving the claim.

As a result, we obtain

$$\min_{n=1,\cdots,n_{\max}} \frac{\lfloor nx\rfloor + 1}{n} = x + \frac{1}{qu} \tag{26}$$

where $u$ is the largest integer such that $u\le n_{\max}$ and $up\equiv -1 \pmod{q}$.
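
This characterization of $u$ is easy to sanity-check numerically. Here is a small sketch (the values of $p$, $q$, $n_{\max}$ are arbitrary toy choices of mine, and it assumes $q\le n_{\max}$ so that such a $u$ exists) that finds $u$ by scanning downward and compares the claimed minimum $\frac{p}{q} + \frac{1}{qu} = \frac{pu+1}{qu}$ against a brute-force minimization, using exact cross-multiplication to compare fractions:

    #include <cstdint>
    #include <iostream>

    int main() {
        // Toy example x = p/q with gcd(p, q) = 1 and q <= n_max.
        std::int64_t const p = 28, q = 93, n_max = 1000;
        // Largest u <= n_max with u*p == -1 (mod q); exists because n_max >= q.
        std::int64_t u = n_max;
        while ((u * p) % q != q - 1) --u;
        // Brute-force minimizer of (floor(n*p/q) + 1) / n over n = 1, ..., n_max.
        std::int64_t best_num = p / q + 1, best_den = 1;
        for (std::int64_t n = 1; n <= n_max; ++n) {
            std::int64_t num = n * p / q + 1, den = n;
            if (num * best_den < best_num * den) { best_num = num; best_den = den; }
        }
        std::cout << "brute force: " << best_num << "/" << best_den
                  << ", claimed: " << p * u + 1 << "/" << q * u << '\n';
        std::cout << (best_num * (q * u) == (p * u + 1) * best_den ? "match" : "MISMATCH") << '\n';
    }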

In summary, we get the following conclusions:

  • If $x$ is irrational or rational with denominator strictly greater than $n_{\max}$, then $(*)$ holds for all $n = 1, \cdots, n_{\max}$ if and only if

    $$\frac{2^{k}p_{*}}{q_{*}} \le m < \frac{2^{k}p^{*}}{q^{*}}. \tag{27}$$
  • If $x = \frac{p}{q}$ is rational with $q\le n_{\max}$, then $(*)$ holds for all $n = 1, \cdots, n_{\max}$ if and only if

    $$\frac{2^{k}p}{q} \le m < \frac{2^{k}p}{q} + \frac{2^{k}}{qu}, \tag{28}$$

    where $u$ is the largest integer such that $u\le n_{\max}$ and $up\equiv -1 \pmod{q}$.

The case $n<0$

Next, let us consider the case $n<0$. In this case, $(*)$ is equivalent to

$$\frac{\lceil |n|x\rceil - 1}{|n|} = \frac{\lfloor nx\rfloor + 1}{n} < \frac{m}{2^{k}} \le \frac{\lfloor nx\rfloor}{n} = \frac{\lceil |n|x\rceil}{|n|}.$$

Similarly to the case $n>0$, the minimum of the right-hand side is precisely $\frac{p^{*}}{q^{*}}$, where again $\frac{p^{*}}{q^{*}}$ is the best rational approximation from above of $x$ with the largest $q^{*}\le n_{\max}$.

The maximum of the left-hand side is, as one can expect, a bit more involved. It is precisely $\frac{p_{*}}{q_{*}}$ (with the same definition as above) except when $x = \frac{p_{*}}{q_{*}}$, in which case we do some more analysis.

As in the case $n>0$, assume $x = \frac{p}{q}$ with $q\le n_{\max}$, and let us find the maximizer of

$$\frac{\lceil np/q\rceil - 1}{n}$$

for $n = 1, \cdots, n_{\max}$. This time, let $r$ be the unique integer such that $0\le r\le q-1$ and

$$np = \left\lceil\frac{np}{q}\right\rceil q - r,$$

then we can rewrite our objective function as

$$\frac{(np + r)/q - 1}{n} = \frac{p}{q} - \frac{q - r}{qn}.$$

Therefore, a maximizer of the above is a minimizer of $\frac{q-r}{qn}$, which is, by the same argument as in the previous case, the largest integer $u$ such that $u\le n_{\max}$ and $up\equiv 1 \pmod{q}$.

Hence, we get the following conclusions:

  • If $x$ is irrational or rational with denominator strictly greater than $n_{\max}$, then $(*)$ holds for all $n = -1, \cdots, -n_{\max}$ if and only if

    $$\frac{2^{k}p_{*}}{q_{*}} < m \le \frac{2^{k}p^{*}}{q^{*}}.$$
  • If $x = \frac{p}{q}$ is rational with $q\le n_{\max}$, then $(*)$ holds for all $n = -1, \cdots, -n_{\max}$ if and only if

    $$\frac{2^{k}p}{q} - \frac{2^{k}}{qu} < m \le \frac{2^{k}p}{q},$$

    where $u$ is the largest integer such that $u\le n_{\max}$ and $up\equiv 1 \pmod{q}$.

Conclusion and an example

Putting the two cases together, we get the following conclusion: if $x$ is irrational or rational with denominator strictly greater than $n_{\max}$, then $(*)$ holds for all $|n|\le n_{\max}$ if and only if

$$\frac{2^{k}p_{*}}{q_{*}} < m < \frac{2^{k}p^{*}}{q^{*}}.$$

For the case $x = \frac{p}{q}$ with $q\le n_{\max}$, we cannot really conclude anything useful if we consider positive $n$’s and negative $n$’s altogether, because the inequality we get is

$$\frac{2^{k}p}{q} \le m \le \frac{2^{k}p}{q},$$

which admits a solution if and only if $\frac{2^{k}p}{q}$ is an integer, which can happen only when $q$ is a power of $2$. But for that case the problem is already trivial.

The number $x$ in our main motivating examples is irrational, so the first conclusion above is indeed a useful one. Now, let us apply it to the case $x = \log_{10}2$ to see what we can get out of it. First, we need to see how $\frac{p_{*}}{q_{*}}$ and $\frac{p^{*}}{q^{*}}$ are determined for a given $n_{\max}$. This can be done in a systematic way, as shown below:

  • Find the last convergent $\frac{p_{i}}{q_{i}}$ whose denominator $q_{i}$ is at most $n_{\max}$.

  • If $i$ is odd, we conclude

    $$\frac{p^{*}}{q^{*}} = \frac{p_{i}}{q_{i}} \quad\text{and}\quad \frac{p_{*}}{q_{*}} = \frac{p_{i-1} + s\,p_{i}}{q_{i-1} + s\,q_{i}},$$

    where $s$ is the largest integer with $q_{i-1} + sq_{i}\le n_{\max}$.

  • If $i$ is even, we conclude

    $$\frac{p_{*}}{q_{*}} = \frac{p_{i}}{q_{i}} \quad\text{and}\quad \frac{p^{*}}{q^{*}} = \frac{p_{i-1} + s\,p_{i}}{q_{i-1} + s\,q_{i}},$$

    where $s$ is the largest integer with $q_{i-1} + sq_{i}\le n_{\max}$.

So, for example, if $n_{\max} = 1000$, then $i=5$ with $\frac{p_{i}}{q_{i}} = \frac{146}{485}$ and $\frac{p_{i-1}}{q_{i-1}} = \frac{59}{196}$, so the maximum $s$ with $q_{i-1} + sq_{i}\le n_{\max}$ is $s=1$, concluding

$$\frac{p^{*}}{q^{*}} = \frac{146}{485} \quad\text{and}\quad \frac{p_{*}}{q_{*}} = \frac{59 + 1\cdot 146}{196 + 1\cdot 485} = \frac{205}{681}.$$

Then, the minimum k that allows the inequality

$$\frac{2^{k}p_{*}}{q_{*}} < m < \frac{2^{k}p^{*}}{q^{*}}$$

to have a solution is $k=18$, and in that case $m=78913$ is the unique solution. (One can verify that $78913$ is precisely $\left\lfloor 2^{18}\log_{10}2\right\rfloor$.)
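
This whole procedure is mechanical enough to automate. The following sketch (my own helper, not code from Dragonbox) finds $\frac{p_{*}}{q_{*}}$ and $\frac{p^{*}}{q^{*}}$ for a given $n_{\max}$ from the hard-coded coefficients in (5), and then the minimum $k$ together with the corresponding smallest admissible $m$; for $n_{\max} = 1000$ it should report $\frac{205}{681}$, $\frac{146}{485}$, $k=18$ and $m=78913$, in agreement with the above.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        // Continued fraction coefficients of log10(2), truncated (see (5)).
        std::vector<std::int64_t> a = {0, 3, 3, 9, 2, 2, 4, 6, 2, 1, 1, 3, 1, 18};
        std::int64_t const n_max = 1000;

        // Last convergent p_i/q_i with q_i <= n_max, together with the previous one.
        std::int64_t p_prev = 1, q_prev = 0, p_cur = a[0], q_cur = 1;
        std::size_t i = 0;
        while (i + 1 < a.size() && q_prev + a[i + 1] * q_cur <= n_max) {
            std::int64_t p_new = p_prev + a[i + 1] * p_cur;
            std::int64_t q_new = q_prev + a[i + 1] * q_cur;
            p_prev = p_cur; q_prev = q_cur; p_cur = p_new; q_cur = q_new;
            ++i;
        }
        // Largest semiconvergent with denominator q_{i-1} + s*q_i <= n_max.
        std::int64_t s = (n_max - q_prev) / q_cur;
        std::int64_t p_semi = p_prev + s * p_cur, q_semi = q_prev + s * q_cur;
        // Odd convergents lie above x, even ones below.
        std::int64_t p_lo, q_lo, p_hi, q_hi;
        if (i % 2 == 1) { p_hi = p_cur; q_hi = q_cur; p_lo = p_semi; q_lo = q_semi; }
        else            { p_lo = p_cur; q_lo = q_cur; p_hi = p_semi; q_hi = q_semi; }
        std::cout << "from below: " << p_lo << "/" << q_lo
                  << ", from above: " << p_hi << "/" << q_hi << '\n';

        // Smallest k admitting an integer m with 2^k p_lo/q_lo < m < 2^k p_hi/q_hi.
        for (int k = 1; k <= 40; ++k) {
            std::int64_t m = (std::int64_t(1) << k) * p_lo / q_lo + 1;  // just above the lower bound
            if (m * q_hi < (std::int64_t(1) << k) * p_hi) {             // strictly below the upper bound?
                std::cout << "minimum k = " << k << ", m = " << m << '\n';
                break;
            }
        }
    }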

In fact, note that $\frac{p_{*}}{q_{*}}$ and $\frac{p^{*}}{q^{*}}$ stay the same for any $681\le n_{\max}\le 1165$; here, $1165$ is one less than $196 + 2\cdot 485 = 1166$, which is the moment when $\frac{p_{*}}{q_{*}}$ changes into the next semiconvergent

$$\frac{59 + 2\cdot 146}{196 + 2\cdot 485} = \frac{351}{1166}.$$

Therefore, with the choice $k=18$ and $m=78913$, we should have

$$\lfloor nx\rfloor = \left\lfloor\frac{78913\,n}{2^{18}}\right\rfloor$$

for all $|n|\le 1165$. In fact, even with $\frac{p_{*}}{q_{*}} = \frac{351}{1166}$ we still have the required inequality, so the above should hold until the next moment when $\frac{p_{*}}{q_{*}}$ changes, which is when $n_{\max} = 196 + 3\cdot 485 = 1651$. If $n_{\max} = 1651$, then

$$\frac{p_{*}}{q_{*}} = \frac{59 + 3\cdot 146}{196 + 3\cdot 485} = \frac{497}{1651},$$

and in this case the left-hand side of the inequality is now violated. Thus,

$$\lfloor nx\rfloor = \left\lfloor\frac{78913\,n}{2^{18}}\right\rfloor$$

holds precisely up to $|n|\le 1650$ and the first counterexample is $n=\pm 1651$.

In a similar manner, one can see that $\frac{p_{*}}{q_{*}} = \frac{497}{1651}$ holds up to $n_{\max} = 196 + 4\cdot 485 - 1 = 2135$, and the minimum $k$ allowing the inequality to have a solution is $k=20$, with $m=315653$ as the unique solution. (One can also verify that $315653$ is precisely $\left\lceil 2^{20}\log_{10}2\right\rceil$, so this time it is not the truncated binary expansion of $\log_{10}2$, rather that plus $1$.) Hence,

$$\lfloor nx\rfloor = \left\lfloor\frac{315653\,n}{2^{20}}\right\rfloor$$

must hold at least up to $|n|\le 2135$. When $n_{\max} = 2136$, $\frac{p_{*}}{q_{*}}$ changes into $\frac{643}{2136}$, but the inequality is still true with the same choice of $k$ and $m$, so the above formula must be valid at least for $|n|\le 2620$; here $2621$ is the moment when $\frac{p^{*}}{q^{*}}$ changes from $\frac{146}{485}$ into $\frac{789}{2621}$. If $n_{\max} = 2621$, so that $\frac{p^{*}}{q^{*}}$ changes into $\frac{789}{2621}$, then the right-hand side of the inequality is now violated, so $|n|\le 2620$ is indeed the optimal range and $n=\pm 2621$ is the first counterexample.

In general, the transition can only occur at the denominators of semiconvergents. With this method, we can figure out what is the minimum $k$ that allows the computation up to the chosen transition point and what $m$ should be used for that choice of minimum $k$. We want $m$ to be as small as possible (so that $nm$ stays within the word size for as many $n$ as possible), and the smallest choice allowed by the inequality is $m = \left\lfloor\frac{2^{k}p_{*}}{q_{*}}\right\rfloor + 1$. This $m$ will probably be either the floor or the ceiling of $2^{k}x$, but we cannot determine which one is better by simply looking at the binary expansion of $x$. This indicates the flaw of the naive method I tried before.

Another application: minmax Euclid algorithm

At the very heart of Ryū, Ryū-printf, Grisu-Exact and Dragonbox (which are all float-to-string conversion algorithms), there is the so-called minmax Euclid algorithm.

The minmax Euclid algorithm was invented by Ulf Adams, who is the author of Ryū and Ryū-printf mentioned above, and it appears in his paper on Ryū. It is at the core of showing the sufficiency of the number of bits of each item in the precomputed lookup table. What it does is this: given positive integers $a, b$ and $N$, compute the minimum and the maximum of $ag\ \mathrm{mod}\ b$ where $g$ is any integer ranging from $1$ to $N$. It sounds simple, but a naive exhaustive search doesn’t scale if $a, b, N$ are very large numbers. Indeed, in its application to float-to-string conversion algorithms, $a, b$ are extremely large integers and $N$ can be as large as $2^{54}$. For that wide range of $g$, it is simply impossible to run an exhaustive search.
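
Stated as code, the task is the following brute force, which is fine for toy inputs but hopeless at the sizes above (the numbers below are arbitrary small values, not ones from any actual conversion algorithm):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>

    int main() {
        std::uint64_t const a = 93, b = 415, N = 20;   // toy sizes only
        std::uint64_t mn = b, mx = 0;
        for (std::uint64_t g = 1; g <= N; ++g) {
            std::uint64_t r = a * g % b;
            mn = std::min(mn, r);
            mx = std::max(mx, r);
        }
        std::cout << "min = " << mn << ", max = " << mx << '\n';   // min = 7, max = 379
    }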

A rough idea of the minmax Euclid algorithm is that the coefficients appearing in the Euclid algorithm for computing the GCD of $a$ and $b$ tell us what the minimum and the maximum are. If you try to write things down and see how $ag\ \mathrm{mod}\ b$ varies as $g$ increases, you will find out why this is the case. (A precise mathematical proof can be cumbersome, though.) Formalizing this idea will lead to the minmax Euclid algorithm.

In fact, the exact minmax Euclid algorithm described in the Ryū paper is not entirely correct, because it can produce wrong outputs for some inputs. Thus, I had to give some more thought to it after I realized the flaw in 2018, when I was trying to apply the algorithm to Grisu-Exact. Eventually, I came up with an improved and corrected version of it with a hopefully correct proof, which I wrote in the paper on Grisu-Exact (Figure 3 and Theorem 4.3).

But more than 2 years after writing the long-winded and messy proof of the algorithm, I finally realized that in fact the algorithm is just an easy application of continued fractions. Well, I did not know anything about continued fractions back then, unfortunately!

The story is as follows. Obtaining the minimum and the maximum of $ag\ \mathrm{mod}\ b$ is equivalent to obtaining the minimum and the maximum of

$$\frac{ag\ \mathrm{mod}\ b}{b} = \frac{ag}{b} - \left\lfloor\frac{ag}{b}\right\rfloor = g\left(\frac{a}{b} - \frac{\lfloor ag/b\rfloor}{g}\right) = 1 - g\left(\frac{\lfloor ag/b\rfloor + 1}{g} - \frac{a}{b}\right),$$

which reminds us of approximating $\frac{a}{b}$ by $\frac{\lfloor ag/b\rfloor}{g}$ or by $\frac{\lfloor ag/b\rfloor + 1}{g}$, except that what we are minimizing is not the error itself, rather the error multiplied by $g$. Hence, we define the following concepts.

We say a rational number $\frac{p}{q}$ is a best rational approximation from below in the strong sense of $x$ if $\frac{p}{q}\le x$ and for any rational number $\frac{a}{b}$ with $\frac{a}{b}\le x$ and $b\le q$, we have

$$|qx - p| \le |bx - a|.$$

Similarly, we say a rational number $\frac{p}{q}$ is a best rational approximation from above in the strong sense of $x$ if $\frac{p}{q}\ge x$ and for any rational number $\frac{a}{b}$ with $\frac{a}{b}\ge x$ and $b\le q$, we have

$$|qx - p| \le |bx - a|.$$

Let us call the best rational approximations from below/above defined before best rational approximations “in the weak sense”, to distinguish them from the new definitions.

The reason why these are called “in the strong sense” is obvious; if we have

$$|qx - p| \le |bx - a|,$$

then dividing both sides by $q$ gives

$$\left|x - \frac{p}{q}\right| \le \frac{b}{q}\left|x - \frac{a}{b}\right|,$$

so with $b\le q$ the right-hand side is at most $\left|x - \frac{a}{b}\right|$, so this implies that $\frac{p}{q}$ is a better approximation than $\frac{a}{b}$.

It is a well-known fact that, if we remove the directional conditions $\frac{a}{b}\le x$ and $\frac{a}{b}\ge x$, so asking for the best rational approximations (in both directions) in the strong sense, then these are precisely the convergents. However, it turns out that with the directional conditions, the best rational approximations from below/above in the weak sense and in the strong sense are just the same. Hence, the best rational approximations from below/above in the strong sense are also just precisely the semiconvergents. This fact is probably well-known as well, but I do not know of any reference at this point other than my own proof of it (which I do not write here). I guess this fact on semiconvergents is probably one of the necessary pieces for proving the other fact that convergents are precisely the best rational approximations in the strong sense, but I haven’t checked any proof of it, so I do not know.

Anyway, because of this, the problem of finding the minimum and the maximum of $ag\ \mathrm{mod}\ b$ reduces to the problem of finding the semiconvergents below and above $\frac{a}{b}$ with the largest denominators among all semiconvergents of denominators at most $N$. This is essentially what the improved minmax Euclid algorithm described in my paper is doing. I will not discuss further details here.
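
Nevertheless, here is a sketch of that reduction in code. It is not the algorithm as written in the Ryū or Grisu-Exact papers, just my own compact illustration: it walks the continued fraction of $\frac{a}{b}$, keeps the below/above semiconvergents with the largest denominators not exceeding $N$, and reads off the minimum and the maximum of $ag\ \mathrm{mod}\ b$ from them. It assumes $0 < a < b$, $\gcd(a,b) = 1$ and $1\le N < b$ (so that $ag\ \mathrm{mod}\ b$ never vanishes), and the main function cross-checks it against the brute force on the toy example above.

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <utility>

    // min/max of (a*g mod b) over g = 1, ..., N via semiconvergents of a/b.
    // Assumes 0 < a < b, gcd(a, b) == 1 and 1 <= N < b.
    std::pair<std::uint64_t, std::uint64_t>
    minmax_euclid(std::uint64_t a, std::uint64_t b, std::uint64_t N) {
        // Convergents: (p_prev/q_prev, p_cur/q_cur) start as (1/0, 0/1).
        std::uint64_t p_prev = 1, q_prev = 0, p_cur = 0, q_cur = 1;
        // Best approximations from below/above of a/b with denominator <= N.
        std::uint64_t p_lo = 0, q_lo = 1, p_hi = 1, q_hi = 1;
        std::uint64_t num = b, den = a;     // continued fraction of a/b: x_1 = b/a, ...
        bool prev_is_above = true;          // 1/0 counts as lying above a/b
        while (den != 0) {
            std::uint64_t coeff = num / den;
            std::uint64_t p_new = p_prev + coeff * p_cur;
            std::uint64_t q_new = q_prev + coeff * q_cur;
            if (q_new > N) {
                // Only a partial step fits: take the largest admissible semiconvergent.
                std::uint64_t s = (N - q_prev) / q_cur;
                if (s >= 1) {
                    if (prev_is_above) { p_hi = p_prev + s * p_cur; q_hi = q_prev + s * q_cur; }
                    else               { p_lo = p_prev + s * p_cur; q_lo = q_prev + s * q_cur; }
                }
                break;
            }
            if (prev_is_above) { p_hi = p_new; q_hi = q_new; }
            else               { p_lo = p_new; q_lo = q_new; }
            std::uint64_t rem = num % den;
            num = den; den = rem;
            p_prev = p_cur; q_prev = q_cur; p_cur = p_new; q_cur = q_new;
            prev_is_above = !prev_is_above;
        }
        // a*g mod b is minimized at g = q_lo and maximized at g = q_hi.
        return { a * q_lo - b * p_lo, b - (b * p_hi - a * q_hi) };
    }

    int main() {
        std::uint64_t const a = 93, b = 415, N = 20;
        std::uint64_t mn = b, mx = 0;
        for (std::uint64_t g = 1; g <= N; ++g) {
            mn = std::min(mn, a * g % b);
            mx = std::max(mx, a * g % b);
        }
        auto [mn2, mx2] = minmax_euclid(a, b, N);
        std::cout << mn << ' ' << mx << " vs " << mn2 << ' ' << mx2 << '\n';
    }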

Computation of $\lfloor nx - y\rfloor$

When developing Dragonbox, I also had to come up with a method of computing

$$\left\lfloor n\log_{10}2 - \log_{10}\frac{4}{3}\right\rfloor.$$

So I did the same thing: I approximated both $\log_{10}2$ and $\log_{10}\frac{4}{3}$ by their respective truncated binary expansions, and computed something like

$$\left\lfloor\frac{nm - s}{2^{k}}\right\rfloor,$$

where $m, s$ are both positive integers, and manually found out the range of $n$ making the above formula valid.

More generally, we can think of computing $\lfloor nx - y\rfloor$ for fixed positive real numbers $x, y$. Again, the signs of $x, y$ are not terribly important, but let us just assume $x, y > 0$ to make our life a bit easier. Also, let us further assume $0 < y < 1$, because taking the integer part of $y$ out of the floor function is not a big deal. So, how far can we go with a similar analysis in this case? Right now, it seems to me that this problem is far subtler than the case $y = 0$ we have done before, but anyway, here I am going to give a brief overview of what I got.

The rough idea is as follows. If we approximate $x$ by a rational number $\frac{p}{q}$ sufficiently well, then each $nx$ must be very close to $\frac{np}{q}$. Note that

$$\lfloor nx - y\rfloor = \begin{cases} \lfloor nx\rfloor & \text{if } nx - \lfloor nx\rfloor \ge y, \\ \lfloor nx\rfloor - 1 & \text{if } nx - \lfloor nx\rfloor < y, \end{cases}$$

hence, what matters here is whether the fractional part of $nx$ is above or below $y$. Note that the fractional part of $nx$ must be approximately equal to $\frac{np\ \mathrm{mod}\ q}{q}$. Thus, find the unique integer $0\le u\le q-1$ such that $\frac{u}{q}\le y<\frac{u+1}{q}$; then probably we can compare the fractional part of $nx$ with $y$ by just comparing $(np\ \mathrm{mod}\ q)$ with $u$. This is indeed the case if $y$ is sufficiently far from both $\frac{u}{q}$ and $\frac{u+1}{q}$, specifically, farther than the error of $\frac{np}{q}$ from $nx$.

So, in some sense the situation is better if $x$ and $y$ are “far apart”, in the sense that the denominators showing up in the approximating rationals of $x$ and those of $y$ are very different, and the situation is worse if there are a lot of common divisors between those denominators. Maybe someone better than me at number theory can formalize this into a precise statement?

Anyway, with high probability the distance from $y$ to $\frac{u}{q}$ and $\frac{u+1}{q}$ will be of $O\!\left(\frac{1}{q}\right)$, but it is well-known that the distance from $x$ to $\frac{p}{q}$ can be made to be of $O\!\left(\frac{1}{q^{2}}\right)$, which will allow the equality

$$\lfloor nx - y\rfloor = \left\lfloor\frac{np - u}{q}\right\rfloor$$

to hold for $O(q)$-many $n$’s. After getting this equality, the next step is to convert it into

$$\left\lfloor\frac{np - u}{q}\right\rfloor = \left\lfloor\frac{nm - s}{2^{k}}\right\rfloor$$

using a Granlund–Montgomery style analysis like the one we did for $\lfloor nx\rfloor$ with rational $x$.
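
As a quick illustration of the first step only (an experiment, not the analysis itself), one can compare $\left\lfloor\frac{np - u}{q}\right\rfloor$ against a direct floating-point evaluation of $\lfloor nx - y\rfloor$ for $x = \log_{10}2$, $y = \log_{10}\frac{4}{3}$ and, say, $\frac{p}{q} = \frac{643}{2136}$, scanning $n$ until the first disagreement:

    #include <cmath>
    #include <cstdint>
    #include <iostream>

    int main() {
        long double const x = std::log10(2.0L);
        long double const y = std::log10(4.0L / 3.0L);
        std::int64_t const p = 643, q = 2136;
        std::int64_t const u = (std::int64_t)std::floor(q * y);   // u/q <= y < (u+1)/q
        for (std::int64_t n = 1; n <= 100000; ++n) {
            for (std::int64_t nn : {n, -n}) {
                // Reference value; long double is accurate enough at these magnitudes,
                // but this is still a floating-point check, not a proof.
                std::int64_t exact = (std::int64_t)std::floor(nn * x - y);
                // floor((nn*p - u) / q), with a floor division that works for nn < 0.
                std::int64_t t = nn * p - u;
                std::int64_t approx = t >= 0 ? t / q : -((-t + q - 1) / q);
                if (exact != approx) {
                    std::cout << "first mismatch at n = " << nn << '\n';
                    return 0;
                }
            }
        }
        std::cout << "no mismatch for |n| <= 100000\n";
    }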

Unfortunately, the bound I got by applying this analysis to the case $x = \log_{10}2$ and $y = \log_{10}\frac{4}{3}$ was not that great, particularly because $y$ is too close to $\frac{u+1}{q}$ for the choice $\frac{p}{q} = \frac{643}{2136}$, which is otherwise a very efficient and effective approximation of $x$. Well, I might be overlooking some things at this point, so probably I have to give some more tries on this later.

(Edit on 02-10-2022) I included a better analysis of this in my new paper on Dragonbox, Section 4.4. But I still consider it incomplete, because it seems that the actual range of $n$ is much wider than what the analysis method described in the paper expects.

Acknowledgements

Thanks to Seung uk Jang for teaching me many things about continued fractions (including useful references) and for other discussions relevant to this post.
