Continued fractions and their application to the fast computation of $\lfloor nx\rfloor$
When I was working on Dragonbox and Grisu-Exact (which are float-to-string conversion algorithms with some nice properties) I had to come up with a fast method for computing things like $\lfloor n\log_{10}2\rfloor$ or $\lfloor n\log_{2}10\rfloor$, or more generally $\lfloor nx\rfloor$ for some integer $n$ and a fixed positive real number $x$. Actually, the sign of $x$ isn't extremely important, but let us just assume $x>0$ for simplicity.
At that time I just took a fairly straightforward approach, which is to compute the multiplication of $n$ with a truncated binary expansion of $x$. That is, select an appropriate nonnegative integer $k$ and then compute $\lfloor nm/2^{k}\rfloor$, where $m=\lfloor 2^{k}x\rfloor$ is $2^{k}$ multiplied by the binary expansion of $x$ up to the $k$-th fractional digit, which is a nonnegative integer. Hence, the whole computation can be done with one integer multiplication followed by one arithmetic shift, which is indeed quite fast.
Here is an example. Consider $x=\log_{10}2$. According to Wolfram Alpha, its hexadecimal expansion is
0x0.4d104d427de7fbcc47c4acd605be48....
If we choose, for example, $k=20$, then $m=\lfloor 2^{20}\log_{10}2\rfloor=\mathtt{0x4d104}=315652$. Hence, we try to compute $\lfloor n\log_{10}2\rfloor$ by computing
$$\left\lfloor\frac{315652\cdot n}{2^{20}}\right\rfloor$$
instead, or in C code:
(315652 * n) >> 20
With a manual verification, it can be seen that this computation is valid for all $|n|\leq 1650$. However, when $n=1651$, the above formula gives $496$ while the correct value is $497$, and when $n=-1651$, the above gives $-497$ while the correct value is $-498$. Obviously, one would expect that the range of valid inputs will increase if we use a larger value for $k$ (thus using more digits of $x$). That is certainly true in general, but it will at the same time make the computation more prone to overflow.
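In fact, the manual verification can be scripted; here is a rough sketch of such a checker (it trusts `double` arithmetic for the reference values, which is accurate enough in this range):

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* floor(v / 2^k), rounding toward negative infinity for both signs */
static int64_t floor_shift(int64_t v, int k) {
    return v >= 0 ? v >> k : -((-v + (((int64_t)1 << k) - 1)) >> k);
}

int main(void) {
    const double x = log10(2.0);
    for (int64_t n = 1; n <= 100000; ++n) {
        if (floor_shift(315652 * n, 20) != (int64_t)floor((double)n * x) ||
            floor_shift(-315652 * n, 20) != (int64_t)floor((double)-n * x)) {
            printf("first failure at |n| = %lld\n", (long long)n);
            return 0;
        }
    }
    puts("no failure up to 100000");
    return 0;
}
```

Running it prints `first failure at |n| = 1651`.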
Note that in order to make this computation truly fast, we want to keep things inside a machine word size. For example, we may want the result of the multiplication, $nm$, to fit inside 32-bits. Actually, even on 64-bit platforms, 32-bit arithmetic is often faster, so we still want things to be in 32-bits. For example, on platforms like x86_64, (if I recall correctly) there is no instruction for integer multiplication of a register and a 64-bit immediate constant, so if the constant is of 64-bits, we need to first load it into a register. If $m$ fits inside 32-bits, we don't need to do that, because there is a multiplication instruction for 32-bit constants, and this results in faster code.
(Please don’t say “you need a benchmark for such a claim”. I mean, yeah, that’s true, but firstly, once upon a time I witnessed a noticeable performance difference between these two, so even though I threw that code into the trash bin a long time ago, I still believe in this claim to some extent; and secondly, and more importantly, the difference between 32-bit and 64-bit is not the point of this post.)
Hence, if for example we choose $k=20$, then $m=315652$, so we must have $|n|\leq 6803$ in order to ensure that $nm$ fits in 32-bits.
(In fact, at least on x86_64, it seems that there is no difference in the performance of multiplying a 64-bit register by a 32-bit constant to produce a 64-bit number, versus multiplying a 32-bit register by a 32-bit constant to produce a 32-bit number, so we might only want to ensure $m$ to be in 32-bits and $nm$ to be in 64-bits. But let us just assume that we want $nm$ to be also in 32-bits for simplicity. What matters here is that we want to keep things inside a fixed number of bits.)
As a result, these competing requirements will admit a sweet spot that gives the maximum range of $n$. In this specific case, the sweet spot turns out to be $k=21$ and $m=631305$, which allows $|n|$ up to $2135$.
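To see the competition between the two requirements concretely, one can simply try every $k$ and, for each, find where the formula first breaks or overflows 32-bits. The following sketch does exactly that (the scanned range of $k$ is arbitrary):

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const double x = log10(2.0);
    for (int k = 16; k <= 24; ++k) {
        const int64_t m = (int64_t)(x * (double)((int64_t)1 << k)); /* truncation */
        const int64_t overflow_limit = (((int64_t)1 << 31) - 1) / m;
        int64_t n = 1;
        for (; n <= overflow_limit; ++n) {
            int64_t pos = (n * m) >> k;                              /* floor for n > 0  */
            int64_t neg = -((n * m + (((int64_t)1 << k) - 1)) >> k); /* floor for -n < 0 */
            if (pos != (int64_t)floor((double)n * x) ||
                neg != (int64_t)floor((double)-n * x)) break;
        }
        printf("k = %2d: m = %8lld, valid for |n| <= %lld\n",
               k, (long long)m, (long long)(n - 1));
    }
    return 0;
}
```

The $k=21$ row is the winner, as claimed above.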
So far, everything is nice: we obtained a nice formula for computing $\lfloor n\log_{10}2\rfloor$, and we have found a range of $n$ that makes the formula valid. Except that we have zero theoretical estimate of how the size of the range of valid $n$, $k$, and $m$ are related. I just claimed that the range can be verified manually, which is indeed what I did when working on Dragonbox and Grisu-Exact.
To be precise, I did have some estimate at the time of developing Grisu-Exact, which I briefly explained here (it has little to do with this post, so readers don't really need to look at it), but this estimate is too rough and gives a poor bound on the range of valid $n$'s.
Quite recently, I realized that the concept of so-called continued fractions is a very useful tool for this kind of stuff. In a nutshell, the reason why this concept is so useful is that it gives us a way to enumerate all the best rational approximations of any given real number, and from there we can derive a very detailed analysis of the mentioned method of computing $\lfloor nx\rfloor$.
Continued fractions
A continued fraction is a way of writing a number $x$ in a form
$$x=a_{0}+\cfrac{b_{1}}{a_{1}+\cfrac{b_{2}}{a_{2}+\cfrac{b_{3}}{a_{3}+\cdots}}},$$
either as a finite or infinite sequence. We will only consider the case of simple continued fractions, which means that all of $b_{1},b_{2},\cdots$ are $1$. From now on, by a continued fraction we will always refer to a simple continued fraction. Also, $a_{0}$ is assumed to be an integer, while all other $a_{i}$'s are assumed to be positive integers. We denote the continued fraction with the coefficients $a_{0},a_{1},a_{2},\cdots$ as $[a_{0};a_{1},a_{2},\cdots]$.
With these conventions, there is a fairly simple algorithm for obtaining a continued fraction expansion of a given $x$. First, let $a_{0}=\lfloor x\rfloor$. Then $x-a_{0}$ is always in the interval $[0,1)$. If it is $0$, that means $x$ is an integer, so stop there. Otherwise, compute the reciprocal $x_{1}=\frac{1}{x-a_{0}}$. Then let $a_{1}=\lfloor x_{1}\rfloor$; then again $x_{1}-a_{1}$ lies in $[0,1)$. If it is zero, stop, and if not, compute the reciprocal $x_{2}=\frac{1}{x_{1}-a_{1}}$ and continue.
Here is an example from Wikipedia; consider $x=\frac{415}{93}$. As $\lfloor x\rfloor=4$, we obtain $a_{0}=4$ and $x_{1}=\frac{93}{43}$. Then from $\lfloor x_{1}\rfloor=2$ we get $a_{1}=2$ and $x_{2}=\frac{43}{7}$, and similarly $a_{2}=6$, $x_{3}=7$. Since $x_{3}$ is an integer, we get $a_{3}=7$ and the process terminates. As a result, we obtain the continued fraction expansion
$$\frac{415}{93}=4+\cfrac{1}{2+\cfrac{1}{6+\cfrac{1}{7}}}=[4;2,6,7].$$
It can be easily shown that this procedure terminates if and only if $x$ is a rational number. In fact, when $x=\frac{p}{q}$ is a rational number (whenever we say $\frac{p}{q}$ is a rational number, we implicitly assume that it is in its reduced form; that is, $q$ is a positive integer and $p$ is an integer coprime to $q$), the coefficients $a_{i}$'s are nothing but the quotients appearing in the Euclid algorithm for computing the GCD of $p$ and $q$ (which is a priori assumed to be $1$, of course).
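In code, this is literally the Euclid algorithm with the quotients recorded (a quick sketch, hard-coding $\frac{415}{93}$):

```c
#include <stdio.h>

int main(void) {
    int p = 415, q = 93;
    printf("[%d", p / q); /* a_0: the integer part */
    const char *sep = "; ";
    int r = p % q;
    while (r != 0) {      /* remaining a_i: quotients of the Euclid algorithm */
        p = q;
        q = r;
        printf("%s%d", sep, p / q);
        sep = ", ";
        r = p % q;
    }
    puts("]");
    return 0;
}
```

which prints `[4; 2, 6, 7]`.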
When $x$ is rational and $[a_{0};a_{1},\cdots,a_{N}]$ is the continued fraction expansion of $x$ obtained from the above algorithm, then either $x$ is an integer (so $N=0$) or $a_{N}\geq 2$. Then $[a_{0};a_{1},\cdots,a_{N}-1,1]$ is another continued fraction expansion of $x$. It can be shown that these two are the only continued fraction expansions for $x$. For example, we have the following alternative continued fraction expansion
$$\frac{415}{93}=[4;2,6,6,1]$$
of $\frac{415}{93}$, and these two expansions are the only continued fraction expansions of $\frac{415}{93}$.
When $x$ is an irrational number, we can run the same algorithm to get an infinite sequence, and the obtained continued fraction expansion of $x$ is the unique one. Here is a list of examples from Wikipedia:
$$\sqrt{2}=[1;2,2,2,2,\cdots],\quad e=[2;1,2,1,1,4,1,1,6,1,1,8,\cdots],\quad \pi=[3;7,15,1,292,1,1,1,2,1,3,1,\cdots].$$
Also, according to Wolfram Alpha, the continued fraction expansion of $\log_{10}2$ is
$$\log_{10}2=[0;3,3,9,2,2,4,6,2,1,3,\cdots].$$
Given a continued fraction $[a_{0};a_{1},a_{2},\cdots]$, the rational number $c_{i}=[a_{0};a_{1},\cdots,a_{i}]$ obtained by truncating the sequence at the $i$-th position is called the $i$-th convergent of the continued fraction. As it is a rational number, we can write it uniquely as
$$c_{i}=\frac{p_{i}}{q_{i}}$$
for a positive integer $q_{i}$ and an integer $p_{i}$ coprime to $q_{i}$. Then one can derive the recurrence relation
$$p_{i}=a_{i}p_{i-1}+p_{i-2},\qquad q_{i}=a_{i}q_{i-1}+q_{i-2}$$
with the initial conditions $(p_{-1},q_{-1})=(1,0)$ and $(p_{0},q_{0})=(a_{0},1)$.
For example, for $x=\log_{10}2$, using the recurrence relation we can compute several initial convergents:
$$c_{0}=\frac{0}{1},\quad c_{1}=\frac{1}{3},\quad c_{2}=\frac{3}{10},\quad c_{3}=\frac{28}{93},\quad c_{4}=\frac{59}{196},\quad c_{5}=\frac{146}{485},\quad c_{6}=\frac{643}{2136},\quad c_{7}=\frac{4004}{13301},$$
and so on.
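The recurrence is straightforward to program; here is a sketch that regenerates the list above from the coefficients of $\log_{10}2$ given earlier:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* coefficients of log10(2) = [0; 3, 3, 9, 2, 2, 4, 6, 2, 1, ...] */
    const int64_t a[] = {0, 3, 3, 9, 2, 2, 4, 6, 2, 1};
    int64_t p_prev = 1, q_prev = 0; /* p_{-1}, q_{-1} */
    int64_t p = a[0], q = 1;        /* p_0, q_0 */
    printf("c_0 = %lld/%lld\n", (long long)p, (long long)q);
    for (int i = 1; i < (int)(sizeof a / sizeof a[0]); ++i) {
        const int64_t p_next = a[i] * p + p_prev;
        const int64_t q_next = a[i] * q + q_prev;
        p_prev = p; q_prev = q;
        p = p_next; q = q_next;
        printf("c_%d = %lld/%lld\n", i, (long long)p, (long long)q);
    }
    return 0;
}
```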
Note that the sequence of convergents above is converging to $\log_{10}2$ rapidly. Indeed, the error of the second convergent $c_{2}=\frac{3}{10}$ is about $1.03\times 10^{-3}$, and that of $c_{3}=\frac{28}{93}$ is about $4.53\times 10^{-5}$. For the sixth convergent $c_{6}=\frac{643}{2136}$, the error is about $3.31\times 10^{-8}$. This is way better than other types of rational approximations, let's say by truncated decimal expansions of $\log_{10}2$, because in that case the denominator must grow approximately as large as $10^{8}$ in order to achieve an error of order $10^{-8}$, but that order of error was achievable by $c_{6}$, whose denominator is only $2136$.
Best rational approximations
A cool thing about continued fractions is that the convergents are in fact among the best possible rational approximations, in the following sense. Given a real number $x$, a rational number $\frac{p}{q}$ is called a best rational approximation of $x$ if for every rational number $\frac{p'}{q'}$ with $q'\leq q$, we always have
$$\left|x-\frac{p}{q}\right|\leq\left|x-\frac{p'}{q'}\right|.$$
So, this means that there is no better approximation of $x$ by rational numbers of denominator no more than $q$.
In fact, whenever $\frac{p}{q}$ is a best rational approximation of $x$ with $q>1$, the equality in the inequality above is only achieved when $\frac{p'}{q'}=\frac{p}{q}$. Note that $q=1$ corresponds to the uninteresting case of approximating $x$ with integers, and obviously in this case there is the unique best approximation (the nearest integer) for all real numbers $x$, except only when $x=n+\frac{1}{2}$ for some integer $n$; for those exceptional cases there are precisely two best approximations, namely $n$ and $n+1$.
As hinted, given a continued fraction expansion $[a_{0};a_{1},a_{2},\cdots]$ of $x$, any convergent $\frac{p_{i}}{q_{i}}$ is a best rational approximation of $x$, except possibly for $c_{0}=a_{0}$ (which is also a best rational approximation if and only if $a_{1}\geq 2$). It must be noted that, however, not every best rational approximation of $x$ is obtained as a convergent of its continued fraction expansion.
There is a nice description of all best rational approximations of $x$ in terms of convergents, but in this post that is kind of irrelevant. Rather, what matters for us is a method of enumerating all best rational approximations from below and from above. (In fact, the aforementioned description of all best rational approximations can be derived from there.)
Formally, we say a rational number $\frac{p}{q}$ is a best rational approximation from below of $x$ if $\frac{p}{q}\leq x$ and for any rational number $\frac{p'}{q'}$ with $\frac{p'}{q'}\leq x$ and $q'\leq q$, we have
$$\frac{p'}{q'}\leq\frac{p}{q}.$$
Similarly, we say a rational number $\frac{p}{q}$ is a best rational approximation from above of $x$ if $\frac{p}{q}\geq x$ and for any rational number $\frac{p'}{q'}$ with $\frac{p'}{q'}\geq x$ and $q'\leq q$, we have
$$\frac{p'}{q'}\geq\frac{p}{q}.$$
To describe how those best rational approximations from below/above look like, let $[a_{0};a_{1},a_{2},\cdots]$ be a continued fraction expansion of a real number $x$ and let $\left(c_{i}=\frac{p_{i}}{q_{i}}\right)_{i}$ be the corresponding sequence of convergents. When $x$ is a non-integer rational number, so that the expansion terminates after a nonzero number of steps, we always assume that the expansion is of the second form; that is, the last coefficient is assumed to be $1$.
It can be shown that the sequence $(c_{2i})_{i}$ of even convergents strictly increases to $x$ from below, while the sequence $(c_{2i+1})_{i}$ of odd convergents strictly decreases to $x$ from above. In other words, the convergence of convergents happens in a "zig-zag" manner, alternating between below and above of $x$.
As the approximation errors of the even convergents decrease to zero, any sufficiently good rational approximation from below must lie in between some $c_{2i}$ and $c_{2i+2}$. Similarly, any sufficiently good rational approximation from above must lie in between some $c_{2i+3}$ and $c_{2i+1}$.
Based on these facts, one can show the following: a rational number is a best rational approximation from below of $x$ if and only if it can be written as
$$\frac{p_{2i}+sp_{2i+1}}{q_{2i}+sq_{2i+1}}$$
for some integers $i\geq 0$ and $0\leq s\leq a_{2i+2}$, and it is a best rational approximation from above of $x$ if and only if it can be written as
$$\frac{p_{2i-1}+sp_{2i}}{q_{2i-1}+sq_{2i}}$$
for some integers $i\geq 0$ and $0\leq s\leq a_{2i+1}$ (here we use the conventions $(p_{-1},q_{-1})=(1,0)$, and the degenerate case $i=0$, $s=0$ is excluded).
In general, a rational number of the form
$$\frac{p_{i}+sp_{i+1}}{q_{i}+sq_{i+1}}$$
for an integer $0\leq s\leq a_{i+2}$ is called a semiconvergent. So in other words, the semiconvergents are precisely the best rational approximations from below/above of $x$.
It can be shown that the sequence
$$\left(\frac{p_{2i}+sp_{2i+1}}{q_{2i}+sq_{2i+1}}\right)_{s=0}^{a_{2i+2}}$$
strictly monotonically increases from $c_{2i}$ to $c_{2i+2}$, and similarly the sequence
$$\left(\frac{p_{2i-1}+sp_{2i}}{q_{2i-1}+sq_{2i}}\right)_{s=0}^{a_{2i+1}}$$
strictly monotonically decreases from $c_{2i-1}$ to $c_{2i+1}$. Therefore, as $s$ increases, we get successively better approximations of $x$.
For the case of $x=\log_{10}2$, here is the list of all best rational approximations from below (with denominators up to $28738$):
$$\mathbf{\frac{0}{1}},\ \frac{1}{4},\ \frac{2}{7},\ \mathbf{\frac{3}{10}},\ \frac{31}{103},\ \mathbf{\frac{59}{196}},\ \frac{205}{681},\ \frac{351}{1166},\ \frac{497}{1651},\ \mathbf{\frac{643}{2136}},\ \frac{4647}{15437},\ \mathbf{\frac{8651}{28738}},\ \cdots$$
and from above:
$$\frac{1}{1},\ \frac{1}{2},\ \mathbf{\frac{1}{3}},\ \frac{4}{13},\ \frac{7}{23},\ \frac{10}{33},\ \frac{13}{43},\ \frac{16}{53},\ \frac{19}{63},\ \frac{22}{73},\ \frac{25}{83},\ \mathbf{\frac{28}{93}},\ \frac{87}{289},\ \mathbf{\frac{146}{485}},\ \frac{789}{2621},\ \frac{1432}{4757},\ \frac{2075}{6893},\ \frac{2718}{9029},\ \frac{3361}{11165},\ \mathbf{\frac{4004}{13301}},\ \cdots$$
with convergents highlighted in bold.
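These lists can be generated mechanically; e.g., for the approximations from below (a sketch; the list from above is obtained analogously by starting from the odd-indexed convergents, with the convention $(p_{-1},q_{-1})=(1,0)$):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const int64_t a[] = {0, 3, 3, 9, 2, 2, 4, 6, 2};
    enum { N = sizeof a / sizeof a[0] };
    int64_t p[N], q[N];
    p[0] = a[0]; q[0] = 1;
    p[1] = a[1] * p[0] + 1; /* p_{-1} = 1 */
    q[1] = a[1] * q[0];     /* q_{-1} = 0 */
    for (int i = 2; i < N; ++i) {
        p[i] = a[i] * p[i - 1] + p[i - 2];
        q[i] = a[i] * q[i - 1] + q[i - 2];
    }
    /* best approximations from below: (p_{2i} + s p_{2i+1}) / (q_{2i} + s q_{2i+1}) */
    for (int i = 0; i + 2 < N; i += 2) {
        for (int64_t s = (i == 0 ? 0 : 1); s <= a[i + 2]; ++s) {
            printf("%lld/%lld%s\n",
                   (long long)(p[i] + s * p[i + 1]),
                   (long long)(q[i] + s * q[i + 1]),
                   (s == a[i + 2] || s == 0) ? " (convergent)" : "");
        }
    }
    return 0;
}
```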
Clearly, if $\frac{p}{q}$ is a best rational approximation of $x$ from below, then we must have $p=\lfloor qx\rfloor$. Indeed, if $p$ is strictly less than $\lfloor qx\rfloor$, then $\frac{\lfloor qx\rfloor}{q}$ must be a strictly better approximation of $x$ which is still not above $x$, and if $p$ is strictly greater than $\lfloor qx\rfloor$, then $\frac{p}{q}$ must be strictly greater than $x$.
Similarly, if $\frac{p}{q}$ is a best rational approximation of $x$ from above, then we must have $p=\lceil qx\rceil$.
Application to computation of $\lfloor nx\rfloor$
Alright, now let’s get back to where we started: when do we have the equality
$$\left\lfloor\frac{nm}{2^{k}}\right\rfloor=\lfloor nx\rfloor$$
for all $|n|\leq n_{\max}$? First, note that this equality can be equivalently written as the inequality
$$\lfloor nx\rfloor\leq\frac{nm}{2^{k}}<\lfloor nx\rfloor+1.$$
The case $n>0$
We will first consider the case $n>0$. In this case, the inequality above can be rewritten as
$$\frac{\lfloor nx\rfloor}{n}\leq\frac{m}{2^{k}}<\frac{\lfloor nx\rfloor+1}{n}.$$
Obviously, thus the desired equality holds for all $0<n\leq n_{\max}$ if and only if
$$\max_{0<n\leq n_{\max}}\frac{\lfloor nx\rfloor}{n}\leq\frac{m}{2^{k}}<\min_{0<n\leq n_{\max}}\frac{\lfloor nx\rfloor+1}{n}.$$
Now, note that $\frac{\lfloor nx\rfloor}{n}$ and $\frac{\lfloor nx\rfloor+1}{n}$ are rational approximations of $x$, where the former is smaller than or equal to $x$ and the latter is strictly greater than $x$.
Therefore, the left-hand side of the inequality above is nothing but $\frac{p_{*}}{q_{*}}$, where $\frac{p_{*}}{q_{*}}$ is the best rational approximation from below of $x$ with the largest denominator $q_{*}\leq n_{\max}$.
Similarly, the right-hand side of the inequality is nothing but $\frac{p^{*}}{q^{*}}$, where $\frac{p^{*}}{q^{*}}$ is the best rational approximation from above of $x$ with the largest denominator $q^{*}\leq n_{\max}$. Except when $\frac{p^{*}}{q^{*}}=x$ (which means that $x$ is rational and its denominator is at most $n_{\max}$), in which case the situation is a bit dirtier and is analyzed as follows.
Note that this case is the one considered in the classical paper by Granlund–Montgomery (Theorem 4.2), but here we will do a sharper analysis that gives us a condition that is not only sufficient but also necessary to have the desired equality for all $0<n\leq n_{\max}$. Since the case we are considering here is when $x$ is rational and its denominator is at most $n_{\max}$, which means $\frac{p_{*}}{q_{*}}=\frac{p^{*}}{q^{*}}=x$, let us drop those stars from our notation and just write $x=\frac{p}{q}$ for brevity. Then we want to find the minimizer of
$$\frac{\lfloor nx\rfloor+1}{n}$$
for $0<n\leq n_{\max}$.
for . Let be the remainder of divided by , then the above can be written as
so the task is to find the minimizer of $\frac{q-r}{n}$. One can expect that the minimum will be achieved when $r=q-1$, so let $n_{*}$ be the largest integer such that $n_{*}\leq n_{\max}$ and $n_{*}p\equiv -1\ (\mathrm{mod}\ q)$. We claim that $n_{*}$ is an optimizer of $\frac{q-r}{n}$. Indeed, by the definition of $n_{*}$, we must have $n_{*}+q>n_{\max}$. Note that for any $0<n\leq n_{\max}$ with $r=np\ \mathrm{mod}\ q$,
$$\frac{q-r}{n}\geq\frac{1}{n_{*}}\quad\text{if and only if}\quad n_{*}(q-r)\geq n,$$
so it suffices to show that $n-n_{*}(q-r)$ must be at most $0$, so suppose on the contrary that it is positive. Since $n_{*}p\equiv-1\ (\mathrm{mod}\ q)$, we have $n_{*}(q-r)p\equiv r\ (\mathrm{mod}\ q)$. As we have $np\equiv r\ (\mathrm{mod}\ q)$, we must have
$$\left(n-n_{*}(q-r)\right)p=cq$$
for some positive integer $c$. However, since $p$ and $q$ are coprime, it follows that $n-n_{*}(q-r)$ is a multiple of $q$. Therefore,
$$n\geq n_{*}(q-r)+q\geq n_{*}+q>n_{\max},$$
but this contradicts $n\leq n_{\max}$, thus proving the claim.
As a result, we obtain
$$\min_{0<n\leq n_{\max}}\frac{\lfloor nx\rfloor+1}{n}=\frac{p}{q}+\frac{1}{n_{*}q},$$
where $n_{*}$ is the largest integer such that $n_{*}\leq n_{\max}$ and $n_{*}p\equiv-1\ (\mathrm{mod}\ q)$.
In summary, we get the following conclusions:
If $x$ is irrational or rational with denominator strictly greater than $n_{\max}$, then $\left\lfloor\frac{nm}{2^{k}}\right\rfloor=\lfloor nx\rfloor$ holds for all $0<n\leq n_{\max}$ if and only if
$$\frac{p_{*}}{q_{*}}\leq\frac{m}{2^{k}}<\frac{p^{*}}{q^{*}}.$$
If $x=\frac{p}{q}$ is rational with $q\leq n_{\max}$, then $\left\lfloor\frac{nm}{2^{k}}\right\rfloor=\lfloor nx\rfloor$ holds for all $0<n\leq n_{\max}$ if and only if
$$\frac{p}{q}\leq\frac{m}{2^{k}}<\frac{p}{q}+\frac{1}{n_{*}q},$$
where $n_{*}$ is the largest integer such that $n_{*}\leq n_{\max}$ and $n_{*}p\equiv-1\ (\mathrm{mod}\ q)$.
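The number $n_{*}$ can be computed directly with the extended Euclid algorithm. Here is a sketch with the toy values $x=\frac{2}{5}$ and $n_{\max}=5$ (hypothetical inputs, just for illustration):

```c
#include <stdio.h>

/* p^{-1} mod q via the extended Euclid algorithm (assumes gcd(p, q) = 1, q > 1) */
static long mod_inverse(long p, long q) {
    long r0 = q, r1 = p % q, s0 = 0, s1 = 1; /* invariant: r_i == s_i * p (mod q) */
    while (r1 != 0) {
        const long t = r0 / r1;
        const long r2 = r0 - t * r1, s2 = s0 - t * s1;
        r0 = r1; r1 = r2; s0 = s1; s1 = s2;
    }
    return (s0 % q + q) % q;
}

int main(void) {
    const long p = 2, q = 5, nmax = 5; /* requires q <= nmax */
    long n0 = (q - mod_inverse(p, q)) % q; /* smallest n > 0 with n*p == -1 (mod q) */
    if (n0 == 0) n0 = q;
    const long nstar = n0 + (nmax - n0) / q * q; /* shift up by multiples of q */
    printf("n_* = %ld\n", nstar);
    return 0;
}
```

With these values it prints $n_{*}=2$, and indeed the minimum of $\frac{\lfloor nx\rfloor+1}{n}$ over $0<n\leq 5$ is $\frac{p}{q}+\frac{1}{n_{*}q}=\frac{2}{5}+\frac{1}{10}=\frac{1}{2}$, attained at $n=2$ (and also at $n=4$).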
The case $n<0$
Next, let us consider the case $n<0$. In this case, the equality $\left\lfloor\frac{nm}{2^{k}}\right\rfloor=\lfloor nx\rfloor$ is equivalent to
$$\frac{\lfloor nx\rfloor+1}{n}<\frac{m}{2^{k}}\leq\frac{\lfloor nx\rfloor}{n}$$
(note that dividing by the negative number $n$ flips the inequalities).
Similarly to the case $n>0$, the minimum of the right-hand side over $-n_{\max}\leq n<0$ is precisely $\frac{p^{*}}{q^{*}}$, where again $\frac{p^{*}}{q^{*}}$ is the best rational approximation from above of $x$ with the largest denominator $q^{*}\leq n_{\max}$ (when $x$ is rational with denominator at most $n_{\max}$, this should be interpreted as $x$ itself).
The maximum of the left-hand side is, as one can expect, a bit more involved. It is precisely $\frac{p_{*}}{q_{*}}$ (with the same definition as above), except when $\frac{p_{*}}{q_{*}}=x$, in which case we do some more analysis.
As in the case $n>0$, assume $x=\frac{p}{q}$ with $q\leq n_{\max}$, and let us find the maximizer of
$$\frac{\lfloor nx\rfloor+1}{n}$$
for $-n_{\max}\leq n<0$. This time, let $r$ be the unique integer such that $0\leq r<q$ and $np\equiv r\ (\mathrm{mod}\ q)$; then we can rewrite our objective function as
$$\frac{\lfloor np/q\rfloor+1}{n}=\frac{p}{q}+\frac{q-r}{nq}=\frac{p}{q}-\frac{q-r}{|n|q}.$$
Therefore, a maximizer of the above is a minimizer of $\frac{q-r}{|n|}$. Since $q-r$ is precisely the remainder of $|n|p$ divided by $q$, the same argument as in the previous case shows that a minimizer is the largest integer $n^{*}$ such that $n^{*}\leq n_{\max}$ and $n^{*}p\equiv 1\ (\mathrm{mod}\ q)$.
Hence, we get the following conclusions:
If $x$ is irrational or rational with denominator strictly greater than $n_{\max}$, then $\left\lfloor\frac{nm}{2^{k}}\right\rfloor=\lfloor nx\rfloor$ holds for all $-n_{\max}\leq n<0$ if and only if
$$\frac{p_{*}}{q_{*}}<\frac{m}{2^{k}}\leq\frac{p^{*}}{q^{*}}.$$
If $x=\frac{p}{q}$ is rational with $q\leq n_{\max}$, then $\left\lfloor\frac{nm}{2^{k}}\right\rfloor=\lfloor nx\rfloor$ holds for all $-n_{\max}\leq n<0$ if and only if
$$\frac{p}{q}-\frac{1}{n^{*}q}<\frac{m}{2^{k}}\leq\frac{p}{q},$$
where $n^{*}$ is the largest integer such that $n^{*}\leq n_{\max}$ and $n^{*}p\equiv 1\ (\mathrm{mod}\ q)$.
Conclusion and an example
Putting the two cases together, we get the following conclusion: if $x$ is irrational or rational with denominator strictly greater than $n_{\max}$, then $\left\lfloor\frac{nm}{2^{k}}\right\rfloor=\lfloor nx\rfloor$ holds for all $|n|\leq n_{\max}$ if and only if
$$\frac{p_{*}}{q_{*}}<\frac{m}{2^{k}}<\frac{p^{*}}{q^{*}}.$$
For the case $x=\frac{p}{q}$ with $q\leq n_{\max}$, we cannot really conclude anything useful if we consider positive $n$'s and negative $n$'s altogether, because the inequality we get is
$$\frac{p}{q}\leq\frac{m}{2^{k}}\leq\frac{p}{q},$$
which admits a solution if and only if $\frac{2^{k}p}{q}$ is an integer, which can happen only when $q$ is a power of $2$. But for that case the problem is already trivial.
The numbers $x$ in our main motivating examples are irrational, so the conclusion above is indeed a useful one. Now, let us apply it to the case $x=\log_{10}2$ to see what we can get out of it. First, we need to see how $\frac{p_{*}}{q_{*}}$ and $\frac{p^{*}}{q^{*}}$ are determined for a given $n_{\max}$. This can be done in a systematic way, as shown below (a code sketch implementing the recipe follows the list):
Find the last convergent $\frac{p_{i}}{q_{i}}$ whose denominator is at most $n_{\max}$; that is, let $i$ be the largest index with $q_{i}\leq n_{\max}$.
If $i$ is odd, we conclude
$$\frac{p^{*}}{q^{*}}=\frac{p_{i}}{q_{i}},\qquad \frac{p_{*}}{q_{*}}=\frac{p_{i-1}+sp_{i}}{q_{i-1}+sq_{i}},$$
where $s$ is the largest integer with $q_{i-1}+sq_{i}\leq n_{\max}$.
If $i$ is even, we conclude
$$\frac{p_{*}}{q_{*}}=\frac{p_{i}}{q_{i}},\qquad \frac{p^{*}}{q^{*}}=\frac{p_{i-1}+sp_{i}}{q_{i-1}+sq_{i}},$$
where $s$ is the largest integer with $q_{i-1}+sq_{i}\leq n_{\max}$.
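Here is a sketch implementing this recipe for $x=\log_{10}2$, including the search for the minimum admissible $k$; it reproduces the numbers of the example that follows:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* continued fraction coefficients of x = log10(2) */
    const int64_t a[] = {0, 3, 3, 9, 2, 2, 4, 6, 2, 1, 3};
    enum { N = sizeof a / sizeof a[0] };
    const int64_t nmax = 1000;
    int64_t p[N], q[N];
    p[0] = a[0]; q[0] = 1;
    p[1] = a[1] * p[0] + 1; q[1] = a[1] * q[0];
    for (int j = 2; j < N; ++j) {
        p[j] = a[j] * p[j - 1] + p[j - 2];
        q[j] = a[j] * q[j - 1] + q[j - 2];
    }
    int i = 1;
    while (i + 1 < N && q[i + 1] <= nmax) ++i;  /* last convergent with q_i <= nmax */
    const int64_t s = (nmax - q[i - 1]) / q[i]; /* largest s with q_{i-1} + s q_i <= nmax */
    const int64_t sp = p[i - 1] + s * p[i], sq = q[i - 1] + s * q[i]; /* semiconvergent */
    /* odd convergents approximate from above, even ones from below */
    const int64_t pl = i % 2 ? sp : p[i], ql = i % 2 ? sq : q[i]; /* p_* / q_* */
    const int64_t pu = i % 2 ? p[i] : sp, qu = i % 2 ? q[i] : sq; /* p^* / q^* */
    printf("p_*/q_* = %lld/%lld, p^*/q^* = %lld/%lld\n",
           (long long)pl, (long long)ql, (long long)pu, (long long)qu);
    for (int k = 1; k <= 40; ++k) {
        const int64_t m = (pl << k) / ql + 1; /* smallest m with m/2^k > p_*/q_* */
        if (m * qu < (pu << k)) {             /* is m/2^k < p^*/q^* as well? */
            printf("minimum k = %d with m = %lld\n", k, (long long)m);
            break;
        }
    }
    return 0;
}
```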
So, for example, if $n_{\max}=1000$, then the last convergent with denominator at most $1000$ is $c_{5}=\frac{146}{485}$. Since $i=5$ is odd, we get $\frac{p^{*}}{q^{*}}=\frac{146}{485}$, and the maximum $s$ with $196+485s\leq 1000$ is $s=1$, concluding
$$\frac{p_{*}}{q_{*}}=\frac{59+146}{196+485}=\frac{205}{681}.$$
Then, the minimum $k$ that allows the inequality
$$\frac{205}{681}<\frac{m}{2^{k}}<\frac{146}{485}$$
to have a solution is $k=18$, and in that case $m=78913$ is the unique solution. (One can verify that $78913$ is precisely $\lfloor 2^{18}\log_{10}2\rfloor$, and that $\frac{78913}{2^{18}}=\frac{315652}{2^{20}}$, so this is exactly the formula from the introduction.)
In fact, note that $\frac{p_{*}}{q_{*}}$ and $\frac{p^{*}}{q^{*}}$ stay the same for any $n_{\max}$ in the range $[681,1165]$; here, $1165$ is one less of $1166=196+2\cdot 485$, which is the moment when $\frac{p_{*}}{q_{*}}$ changes into the next semiconvergent
$$\frac{351}{1166}=\frac{59+2\cdot 146}{196+2\cdot 485}.$$
Therefore, with the choice $k=18$ and $m=78913$, we should have
$$\left\lfloor\frac{78913\cdot n}{2^{18}}\right\rfloor=\lfloor n\log_{10}2\rfloor$$
for all $|n|\leq 1165$. In fact, even with $n_{\max}=1166$ we still have $\frac{351}{1166}<\frac{78913}{2^{18}}<\frac{146}{485}$, so the above should hold until the next moment when $\frac{p_{*}}{q_{*}}$ changes, which is when $n_{\max}=1651$. If $n_{\max}=1651$, then
$$\frac{p_{*}}{q_{*}}=\frac{59+3\cdot 146}{196+3\cdot 485}=\frac{497}{1651}>\frac{78913}{2^{18}},$$
and in this case now the left-hand side of the inequality is violated. Thus,
$$\left\lfloor\frac{78913\cdot n}{2^{18}}\right\rfloor=\lfloor n\log_{10}2\rfloor$$
holds precisely up to $|n|\leq 1650$, and the first counterexample is $n=\pm 1651$, consistent with what we saw in the introduction.
In a similar manner, one can see what happens if we target the transition point itself, i.e., $n_{\max}=1651$, so that $\frac{p_{*}}{q_{*}}=\frac{497}{1651}$ and $\frac{p^{*}}{q^{*}}=\frac{146}{485}$. The minimum $k$ allowing the inequality
$$\frac{497}{1651}<\frac{m}{2^{k}}<\frac{146}{485}$$
to have a solution is $k=20$, with $m=315653$ as the unique solution. (One can also verify that $315653$ is precisely $\lfloor 2^{20}\log_{10}2\rfloor+1$, so this time it is not the truncated binary expansion of $\log_{10}2$, rather that plus $1$.) Hence,
$$\left\lfloor\frac{315653\cdot n}{2^{20}}\right\rfloor=\lfloor n\log_{10}2\rfloor$$
must hold at least up to $|n|\leq 1651$. When $n_{\max}=2136$, $\frac{p_{*}}{q_{*}}$ changes into $\frac{643}{2136}$, but $\frac{643}{2136}<\frac{315653}{2^{20}}$ is still true with the same choice of $k$ and $m$, so the above formula must be valid at least for $|n|\leq 2620$; here, $2621=485+2136$ is the moment when $\frac{p^{*}}{q^{*}}$ changes from $\frac{146}{485}$ into $\frac{789}{2621}$. If $n_{\max}=2621$, so that $\frac{p^{*}}{q^{*}}$ changes into $\frac{789}{2621}$, then the right-hand side of the inequality is now violated (one can check $\frac{789}{2621}<\frac{315653}{2^{20}}$), so $|n|\leq 2620$ is indeed the optimal range and $n=\pm 2621$ is the first counterexample.
In general, the transition of $\frac{p_{*}}{q_{*}}$ and $\frac{p^{*}}{q^{*}}$ can only occur at the denominators of semiconvergents. With this method, we can figure out the minimum $k$ that allows the computation to be valid up to the chosen transition point, and what $m$ should be used for that choice of minimum $k$. We want $k$ to be as small as possible, so the minimum such $k$ is the best choice. The resulting $m$ will probably be either the floor or the ceiling of $2^{k}x$, but we cannot determine which one is better by simply looking at the binary expansion of $x$. This indicates the flaw of the naive method I tried before.
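To summarize the section in code, here are the two validated formulas for $\lfloor n\log_{10}2\rfloor$ (assuming `>>` on a negative `int64_t` is an arithmetic shift, which is implementation-defined in C but holds on all mainstream compilers; the second constant pair is, as far as I know, what Dragonbox actually uses):

```c
#include <stdint.h>
#include <stdio.h>

static int64_t floor_log10_pow2_v1(int64_t n) { /* valid for |n| <= 1650 */
    return (n * 78913) >> 18;  /* 78913 = floor(2^18 log10(2)) */
}

static int64_t floor_log10_pow2_v2(int64_t n) { /* valid for |n| <= 2620 */
    return (n * 315653) >> 20; /* 315653 = floor(2^20 log10(2)) + 1 */
}

int main(void) {
    printf("%lld %lld\n", (long long)floor_log10_pow2_v1(100),
                          (long long)floor_log10_pow2_v2(-100)); /* 30 -31 */
    return 0;
}
```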
Another application: minmax Euclid algorithm
At the very heart of Ryū, Ryū-printf, Grisu-Exact, and Dragonbox (which are all float-to-string conversion algorithms), there is the so-called minmax Euclid algorithm.
The minmax Euclid algorithm was invented by Ulf Adams, the author of Ryū and Ryū-printf mentioned above, and it appears in his paper on Ryū. It is at the core of showing the sufficiency of the number of bits of each item in the precomputed lookup table. What it does is this: given positive integers $a$ and $b$, compute the minimum and maximum of $na\ \mathrm{mod}\ b$, where $n$ is any integer ranging from $1$ to some bound $N$. It sounds simple, but a naive exhaustive search doesn't scale if $a$, $b$, and $N$ are very large numbers. Indeed, in its application to float-to-string conversion algorithms, $a$ and $b$ are extremely large integers, and $N$ is on the order of the number of representable significands (around $2^{54}$ for IEEE-754 double-precision). For that wide range of $n$, it is simply impossible to run an exhaustive search.
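For concreteness, the naive computation that does not scale is just the following (with toy inputs):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* toy inputs; in the real applications a, b, N are far too large for this loop */
    const int64_t a = 4004, b = 13301, N = 1000;
    int64_t min = b, max = 0;
    for (int64_t n = 1; n <= N; ++n) {
        const int64_t r = n * a % b;
        if (r < min) min = r;
        if (r > max) max = r;
    }
    printf("min = %lld, max = %lld\n", (long long)min, (long long)max);
    return 0;
}
```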
A rough idea of the minmax Euclid algorithm is that the coefficients appearing in the Euclid algorithm for computing the GCD of $a$ and $b$ tell us what the minimum and the maximum are. If you try to write down how $na\ \mathrm{mod}\ b$ varies as $n$ increases, you will find out why this is the case. (A precise mathematical proof can be cumbersome, though.) Formalizing this idea leads to the minmax Euclid algorithm.
In fact, the exact minmax Euclid algorithm described in the Ryū paper is not entirely correct, because it can produce wrong outputs for some inputs. Thus, I had to give some more thought to it after I realized the flaw in 2018, when I was trying to apply the algorithm to Grisu-Exact. Eventually, I came up with an improved and corrected version of it with a hopefully correct proof, which I wrote in the paper on Grisu-Exact (Figure 3 and Theorem 4.3).
But more than 2 years after writing the long-winded and messy proof of the algorithm, I finally realized that the algorithm is in fact just an easy application of continued fractions. Well, I did not know anything about continued fractions back then, unfortunately!
The story is as follows. Obtaining the minimum and the maximum of $na\ \mathrm{mod}\ b$ is equivalent to obtaining the minimum and the maximum of
$$na-b\left\lfloor\frac{na}{b}\right\rfloor=nb\left(\frac{a}{b}-\frac{\lfloor na/b\rfloor}{n}\right),$$
which reminds us of approximating $\frac{a}{b}$ by $\frac{\lfloor na/b\rfloor}{n}$ or by $\frac{\lceil na/b\rceil}{n}$, except that what we are minimizing is not the error itself, rather the error multiplied by $n$ (and the constant $b$). Hence, we define the following concepts.
We say a rational number $\frac{p}{q}$ is a best rational approximation from below in the strong sense of $x$ if $\frac{p}{q}\leq x$ and for any rational number $\frac{p'}{q'}$ with $\frac{p'}{q'}\leq x$ and $q'\leq q$, we have
$$qx-p\leq q'x-p'.$$
Similarly, we say a rational number $\frac{p}{q}$ is a best rational approximation from above in the strong sense of $x$ if $\frac{p}{q}\geq x$ and for any rational number $\frac{p'}{q'}$ with $\frac{p'}{q'}\geq x$ and $q'\leq q$, we have
$$p-qx\leq p'-q'x.$$
Let us call the best rational approximations from below/above defined before best rational approximations "in the weak sense", to distinguish them from the new definitions.
The reason why these are called "in the strong sense" is obvious; if we have
$$|qx-p|\leq|q'x-p'|,$$
then dividing both sides by $q$ gives
$$\left|x-\frac{p}{q}\right|\leq\frac{|q'x-p'|}{q},$$
and with $q'\leq q$ the right-hand side is at most $\frac{|q'x-p'|}{q'}=\left|x-\frac{p'}{q'}\right|$, so this implies that $\frac{p}{q}$ is a better approximation than $\frac{p'}{q'}$ also in the weak sense.
It is a well-known fact that, if we remove the directional conditions $\frac{p}{q}\leq x$ and $\frac{p}{q}\geq x$, thus asking for the best rational approximations (in both directions) in the strong sense, then these are precisely the convergents. However, it turns out that with the directional conditions, the best rational approximations from below/above in the weak sense and in the strong sense are just the same. Hence, the best rational approximations from below/above in the strong sense are also just precisely the semiconvergents. This fact is probably well-known as well, but I do not know of any reference at this point other than my own proof of it (which I do not write here). I guess this fact on semiconvergents is probably one of the necessary pieces for proving the other fact that convergents are precisely the best rational approximations in the strong sense, but I haven't checked any proof of it, so I do not know.
Anyway, because of this, the problem of finding the minimum and the maximum of $na\ \mathrm{mod}\ b$ reduces to the problem of finding the semiconvergents of $\frac{a}{b}$ lying below and above $\frac{a}{b}$ with the largest denominators among all semiconvergents of denominators at most $N$. This is essentially what the improved minmax Euclid algorithm described in my paper is doing. I will not discuss further details here.
Computation of $\lfloor nx+y\rfloor$
When developing Dragonbox, I also had to come up with a method of computing things like
$$\left\lfloor n\log_{10}2-\log_{10}\frac{4}{3}\right\rfloor.$$
So I did the same thing: I approximated both $\log_{10}2$ and $\log_{10}\frac{4}{3}$ by their respective truncated binary expansions, and computed something like
$$\left\lfloor\frac{nm-s}{2^{k}}\right\rfloor,$$
where $m$, $s$ are both positive integers, and manually found out the range of $n$ making the above formula valid.
More generally, we can think of computing $\lfloor nx+y\rfloor$ for fixed positive real numbers $x$, $y$. Again, the signs of $x$, $y$ are not terribly important, but let us just assume $x,y>0$ to make our life a bit easier. Also, let us further assume $y<1$, because taking the integer part of $y$ out of the floor function is not a big deal. So, how far can we go with a similar analysis in this case? Right now, it seems to me that this problem is far subtler than the case we have done before, but anyway here I am going to give a brief overview of what I got.
The rough idea is as follows. If we approximate $x$ by a rational number $\frac{p}{q}$ sufficiently well, then each $nx$ must be very close to $\frac{np}{q}$. Note that
$$\lfloor nx+y\rfloor=\lfloor nx\rfloor+\begin{cases}1&\text{if }nx-\lfloor nx\rfloor\geq 1-y,\\0&\text{otherwise,}\end{cases}$$
hence, what matters here is whether the fractional part of $nx$ is above or below $1-y$. Note that the fractional part of $nx$ must be approximately equal to $\frac{np\ \mathrm{mod}\ q}{q}$. Thus, find the unique integer $u$ such that
$$\frac{u-1}{q}<1-y\leq\frac{u}{q};$$
then probably we can compare the fractional part of $nx$ with $1-y$ by just comparing $np\ \mathrm{mod}\ q$ with $u$. This is indeed the case if $1-y$ is sufficiently far from both $\frac{u-1}{q}$ and $\frac{u}{q}$, specifically, by more than the error of $\frac{np}{q}$ from $nx$.
So, in some sense the situation is better if $x$ and $y$ are "far apart", in the sense that the denominators showing up in the approximating rationals of $x$ and of $y$ are very different, and the situation is worse if there are a lot of common divisors between those denominators. Maybe someone better than me at number theory can formalize this into a precise language?
Anyway, with a high probability the distance from $1-y$ to $\frac{u-1}{q}$ and $\frac{u}{q}$ will be of order $\frac{1}{q}$, but it is well-known that the distance from $\frac{p}{q}$ to $x$ can be made to be of order $\frac{1}{q^{2}}$, which will allow the equality
$$\lfloor nx+y\rfloor=\left\lfloor\frac{np+q-u}{q}\right\rfloor$$
to hold for $\Theta(q)$-many $n$'s. After getting this equality, the next step is to convert it into
$$\lfloor nx+y\rfloor=\left\lfloor\frac{nm+s}{2^{k}}\right\rfloor$$
using a Granlund–Montgomery style analysis like the one we did for $\lfloor nx\rfloor$ with rational $x$.
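The shape of the final computation is nevertheless easy to play with empirically. The following sketch picks $k=21$ and naively truncated constants for $x=\log_{10}2$ and $y=-\log_{10}\frac{4}{3}$ (an arbitrary choice, just for illustration; it only tests positive $n$) and reports the first failure:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const double x = log10(2.0), y = -log10(4.0 / 3.0);
    const int k = 21;
    const int64_t m = (int64_t)floor(ldexp(x, k)); /* ~ 2^21 x */
    const int64_t s = (int64_t)floor(ldexp(y, k)); /* ~ 2^21 y (negative) */
    for (int64_t n = 1; n <= 1000000; ++n) {
        const int64_t v = n * m + s;
        const int64_t approx =
            v >= 0 ? v >> k : -((-v + (((int64_t)1 << k) - 1)) >> k);
        if (approx != (int64_t)floor((double)n * x + y)) {
            printf("first failure at n = %lld\n", (long long)n);
            return 0;
        }
    }
    puts("no failure up to 1000000");
    return 0;
}
```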
Unfortunately, the bound I got by applying this analysis to the case $x=\log_{10}2$ and $y=-\log_{10}\frac{4}{3}$ was not that great, particularly because $1-y$ is too close to one of the boundary points $\frac{u-1}{q}$, $\frac{u}{q}$ for the choice of $\frac{p}{q}$ that is otherwise a very efficient and effective approximation of $\log_{10}2$. Well, but I might be overlooking some things at this point, so probably I have to give some more tries on this later.
(Edit on 02-10-2022) I included a better analysis of this in my new paper on Dragonbox, Section 4.4. But I still consider it incomplete, because it seems that the actual range of valid $n$ is much wider than what the analysis method described in the paper predicts.
Acknowledgements
Thanks to Seung uk Jang for teaching me many things about continued fractions (including useful references) and for other discussions relevant to this post.