Junekey Jeon

Circle is the only shape with the smallest maximum arc-chord ratio

2024-10-10T00:00:00-07:00

Given a nontrivial rectifiable closed plane curve $\gamma\colon\mathbb{R}/\ell\mathbb{Z}\to\mathbb{R}^{2}$ of length $\ell$ parameterized by the arclength, the arc-chord constant of $\gamma$ is defined as

\[c_{\mathrm{AC}}(\gamma) \mathrel{\unicode{x2254}} \sup_{s\in\mathbb{R}/\ell\mathbb{Z},\ 0<t\leq\frac{\ell}{2}} \frac{t}{\vert\gamma(s+t)-\gamma(s)\vert}.\]

That is, it is the maximum ratio between the arc length and the chord length among all of its segments of length at most $\frac{\ell}{2}$. Note that the curve needs to be simple (i.e., have no self-intersection) in order for its arc-chord constant to be finite.

Obviously, the arc is never shorter than the chord, so $c_{\mathrm{AC}}(\gamma)$ is bounded below by $1$. But if you think about it, then you can immediately see that any closed curve you can imagine has an arc-chord constant strictly away from $1$, so it is natural to ask what its minimum value is.

Intuitively, it sounds plausible to believe that circles are the only closed curves with the smallest arc-chord constant, which is $\frac{\pi}{2}$. This post is about a proof of this fact. The original argument is due to Seok Hyeong Lee.
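As a quick numerical sanity check of this claim (not part of the proof), here is a minimal Python sketch that approximates the arc-chord constant of a closed curve by sampling it as a polygon. The function names `arc_chord_constant` and `sample_ellipse`, the sample count, and the choice of an ellipse as a comparison curve are all my own assumptions for illustration.

```python
import math

def arc_chord_constant(points):
    """Approximate the arc-chord constant of a closed polygon.

    `points` is a list of distinct (x, y) vertices; arc length is measured
    along the polygon and the shorter of the two arcs is always used.
    """
    n = len(points)
    edges = [math.dist(points[i], points[(i + 1) % n]) for i in range(n)]
    total = sum(edges)
    # cum[i] is the arc length from vertex 0 to vertex i.
    cum = [0.0]
    for e in edges[:-1]:
        cum.append(cum[-1] + e)
    best = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            forward = cum[j] - cum[i]
            arc = min(forward, total - forward)  # segment of length <= total / 2
            chord = math.dist(points[i], points[j])
            best = max(best, arc / chord)
    return best

def sample_ellipse(a, b, n=400):
    """Sample n points on the ellipse with semi-axes a and b (a = b gives a circle)."""
    return [(a * math.cos(2 * math.pi * k / n), b * math.sin(2 * math.pi * k / n))
            for k in range(n)]

circle_constant = arc_chord_constant(sample_ellipse(1.0, 1.0))
ellipse_constant = arc_chord_constant(sample_ellipse(2.0, 1.0))
print(circle_constant, math.pi / 2)  # the two values should nearly agree
print(ellipse_constant)              # noticeably larger than pi / 2
```

For the circle the sampled value should be very close to $\frac{\pi}{2}$, while the ellipse gives a visibly larger value, because the two endpoints of its minor axis are joined by half of the perimeter.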

Suppose that $c_{\mathrm{AC}}(\gamma) \leq \frac{\pi}{2}$. We will show that $\gamma$ must be a circle, which further shows $c_{\mathrm{AC}}(\gamma) = \frac{\pi}{2}$. Think of a new curve $f\colon\mathbb{R}/\ell\mathbb{Z}\to\mathbb{R}^{2}$ given as $f(s) \mathrel{\unicode{x2254}} \gamma(s) - \gamma(s + \ell/2)$. By the assumption $c_{\mathrm{AC}}(\gamma) \leq \frac{\pi}{2}$, for any $s\in\mathbb{R}/\ell\mathbb{Z}$, $f(s)$ cannot lie inside the open disk of radius $\ell/\pi$ centered at the origin.

On the other hand, clearly the total length of $f$ cannot exceed $2\ell$ by its definition. Since $f(s+\ell/2) = -f(s)$ holds for all $s$, the segment $f\vert_{[0,\ell/2]}$ of $f$ must have the length at least $\ell$ because $f$ cannot pass through the disk of radius $\ell/\pi$.

Although this sounds intuitively very clear, I could not find a proof simpler than the one given below. Consider a new curve $g\colon s\mapsto \frac{\ell}{\pi\vert f(s)\vert}f(s)$, i.e., the projection of $f$ onto the circle of radius $\ell/\pi$. Since $f$ is Lipschitz, Rademacher’s theorem shows that $f$ is almost everywhere differentiable and $f$ is equal to the Lebesgue integral of its derivative. The same is true for $g$ since $f$ is strictly away from the origin. Note that

\[g'(s) = \frac{\ell}{\pi\vert f(s)\vert}f'(s) - \frac{\ell(f'(s)\cdot f(s))}{\pi\vert f(s)\vert^{3}}f(s),\]

\[\begin{aligned} \vert g'(s)\vert^{2} &= \frac{\ell^{2}}{\pi^{2}\vert f(s)\vert^{6}} \left( \vert f(s)\vert^{4}\vert f'(s)\vert^{2} + (f'(s)\cdot f(s))^{2}\vert f(s)\vert^{2} - 2(f'(s)\cdot f(s))^{2}\vert f(s)\vert^{2} \right) \\ &= \frac{\ell^{2}(f'(s)^{\perp}\cdot f(s))^{2}}{\pi^{2}\vert f(s)\vert^{4}} \end{aligned}\]

where $\cdot^{\perp}$ denotes the $90^{\circ}$ counterclockwise rotation. Hence,

\[\vert g'(s) \vert = \frac{\ell}{\pi\vert f(s)\vert} \cdot \frac{\vert f'(s)^{\perp}\cdot f(s)\vert}{\vert f(s)\vert} \leq |f'(s)| \leq 2\]

holds for almost every $s$, where the first inequality uses $\vert f(s)\vert \geq \frac{\ell}{\pi}$ together with the Cauchy-Schwarz inequality. In particular, the length of the segment $g\vert_{[0,\ell/2]}$ is at most $\ell$. On the other hand, since $g(0) = -g(\ell/2)$ and any curve on the circle of radius $\ell/\pi$ connecting two antipodal points has length at least $\ell$ (half of the circumference), it follows that the length of $g\vert_{[0,\ell/2]}$ is precisely $\ell$. Hence, the above inequalities are actually equalities for almost every $s$, and repeating this once more with $g\vert_{[\ell/2,\ell]}$ shows that

$f$ is a constant-speed parameterization with the speed $2$ (which means that $\vert f'\vert = 2$ almost everywhere), and
$\vert f\vert = \frac{\ell}{\pi}$ holds almost everywhere.
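The pointwise identity $\vert g'(s)\vert^{2} = \frac{\ell^{2}(f'(s)^{\perp}\cdot f(s))^{2}}{\pi^{2}\vert f(s)\vert^{4}}$ used in the argument above is easy to get wrong by a sign or an exponent, so here is a small Python spot check on random vectors. This is only a sketch: the helper names, the random sampling ranges, and the value of `ell` are assumptions made for illustration, with `f` and `fp` standing for $f(s)$ and $f'(s)$ at a single point.

```python
import math
import random

def g_prime(f, fp, ell):
    """Derivative of g = ell * f / (pi * |f|), given f and f' at one point."""
    norm = math.hypot(f[0], f[1])
    dot = fp[0] * f[0] + fp[1] * f[1]
    radial = ell / (math.pi * norm)
    correction = ell * dot / (math.pi * norm ** 3)
    return (radial * fp[0] - correction * f[0],
            radial * fp[1] - correction * f[1])

def claimed_square(f, fp, ell):
    """Right-hand side ell^2 (f'^perp . f)^2 / (pi^2 |f|^4)."""
    norm = math.hypot(f[0], f[1])
    perp = (-fp[1], fp[0])  # 90 degree counterclockwise rotation of f'
    cross = perp[0] * f[0] + perp[1] * f[1]
    return ell ** 2 * cross ** 2 / (math.pi ** 2 * norm ** 4)

random.seed(0)
ell = 3.0
max_rel_error = 0.0
for _ in range(1000):
    f = (random.uniform(1.0, 2.0), random.uniform(1.0, 2.0))    # kept away from 0
    fp = (random.uniform(-2.0, 2.0), random.uniform(-2.0, 2.0))
    gp = g_prime(f, fp, ell)
    lhs = gp[0] ** 2 + gp[1] ** 2
    rhs = claimed_square(f, fp, ell)
    max_rel_error = max(max_rel_error, abs(lhs - rhs) / (1.0 + abs(rhs)))
print(max_rel_error)  # should be at the level of floating-point rounding
```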

Now, by definition of $f$, we know

\[f'(s) = \gamma'(s) - \gamma'\left(s + \frac{\ell}{2}\right),\]

and since both $\gamma'(s)$ and $\gamma'(s+\ell/2)$ have norm $1$, $\vert f'(s)\vert = 2$ is possible only when $\gamma'(s+\ell/2) = -\gamma'(s)$. In other words,

\[\left(\frac{\gamma(s) + \gamma\left(s+\frac{\ell}{2}\right)}{2}\right)' = 0\]

holds almost everywhere. Since $\gamma$ is Lipschitz (hence absolutely continuous), this implies that $c \mathrel{\unicode{x2254}}\frac{\gamma(s) + \gamma\left(s+\frac{\ell}{2}\right)}{2}$ is a constant. Note that this shows that

\[\left\vert \gamma(s) - c \right\vert = \left\vert \frac{\gamma(s) - \gamma(s+\ell/2)}{2} \right\vert = \frac{1}{2}\vert f(s)\vert = \frac{\ell}{2\pi}\]

holds, thus $\gamma$ is a curve contained in the circle of radius $\frac{\ell}{2\pi}$ centered at $c$. By considering the universal covering of the circle, it can be shown that the only such closed curves with finite arc-chord constant (and hence without self-intersection) are the regular ones, i.e., the ones making exactly one complete turn without backtracking, in either the counterclockwise or the clockwise direction. Hence, the proof is done.

A generalization of the Lax-Milgram Theorem

2024-09-04T00:00:00-07:00

This is a short note on a generalization of the Lax-Milgram theorem, which is a common tool for proving existence of weak solutions to PDE.

Introduction

Let $\Omega\subseteq\mathbb{R}^{n}$ be a bounded and smooth enough domain and $f\in L^{2}(\Omega)$. Consider the PDE

\[\begin{cases} \begin{aligned} -\Delta u &= f & \textrm{on}\quad & \Omega \\ u &= 0 & \textrm{on}\quad & \partial\Omega. \end{aligned} \end{cases}\]

Suppose that $u\in \mathcal{C}^{2}(\overline{\Omega})$ is a solution to this equation. Then for each $\phi\in\mathcal{C}_{c}^{\infty}(\Omega)$, we have

\[\begin{equation*} \int_{\Omega}f\phi = -\int_{\Omega} \phi\Delta u = -\int_{\Omega}\nabla\cdot(\phi\nabla u) + \int_{\Omega}\nabla\phi\cdot\nabla u = \int_{\Omega}\nabla\phi\cdot \nabla u, \end{equation*}\]

where the last equality follows from the divergence theorem and the fact that $\phi\vert_{\partial\Omega} = 0$. Since having

\[\begin{equation*} \int_{\Omega}f\phi = -\int_{\Omega}\phi\Delta u \end{equation*}\]

for all $\phi\in\mathcal{C}_{c}^{\infty}(\Omega)$ is equivalent to $-\Delta u = f$ (which follows from the Lebesgue differentiation theorem), we conclude that $u$ is a solution if and only if

\[\begin{equation*} \int_{\Omega}f\phi = \int_{\Omega}\nabla\phi\cdot\nabla u \end{equation*}\]

holds for all $\phi\in\mathcal{C}_{c}^{\infty}(\Omega)$. Since both sides define continuous linear functionals on the Hilbert space $H_{0}^{1}(\Omega)$ (the norm-closure of $\mathcal{C}_{c}^{\infty}(\Omega)$ inside the Sobolev space $H^{1}(\Omega)$), we have converted the problem of solving the PDE into the problem of finding an element $u\in H_{0}^{1}(\Omega)$ such that the continuous linear functional $\phi\mapsto\int_{\Omega}\nabla\phi\cdot\nabla u$ coincides with $\phi\mapsto\int_{\Omega}f\phi$. Thus, basically what we are asking here is the surjectivity of the linear map

\[\begin{align*} L\colon H_{0}^{1}(\Omega) &\to H_{0}^{1}(\Omega)^{*} \\ u &\mapsto \left(\phi\mapsto \int_{\Omega}\nabla\phi\cdot\nabla u\right). \end{align*}\]

This is called a weak formulation of the PDE. An element $u\in H_{0}^{1}(\Omega)$ satisfying $Lu = \left(\phi\mapsto\int_{\Omega}f\phi\right)$ is then called a weak solution of the PDE. Once we find a weak solution, other machinery can be applied to show uniqueness and regularity of the weak solution we found.
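To make the weak formulation a bit more concrete, here is a minimal Python sketch of what it looks like after discretization in one dimension. Everything specific here is an assumption chosen for illustration: the domain $\Omega = (0,1)$, the right-hand side $f = 1$, and the use of piecewise-linear "hat" functions on a uniform mesh as test functions $\phi$. Requiring $\int_{\Omega}\nabla\phi\cdot\nabla u = \int_{\Omega}f\phi$ only for those hat functions gives a tridiagonal linear system, which the sketch solves by forward elimination and back substitution.

```python
def solve_weak_poisson_1d(n):
    """Galerkin sketch for -u'' = 1 on (0, 1) with u(0) = u(1) = 0.

    Uses n interior hat functions on a uniform mesh. The matrix entries are
    the integrals of phi_i' * phi_j', and the load entries are the integrals
    of 1 * phi_i.
    """
    h = 1.0 / (n + 1)
    diag = [2.0 / h] * n   # integral of phi_i' * phi_i'
    off = -1.0 / h         # integral of phi_i' * phi_{i+1}'
    load = [h] * n         # integral of f * phi_i with f = 1
    # Forward elimination for the tridiagonal system.
    for i in range(1, n):
        m = off / diag[i - 1]
        diag[i] -= m * off
        load[i] -= m * load[i - 1]
    # Back substitution.
    u = [0.0] * n
    u[n - 1] = load[n - 1] / diag[n - 1]
    for i in range(n - 2, -1, -1):
        u[i] = (load[i] - off * u[i + 1]) / diag[i]
    nodes = [(i + 1) * h for i in range(n)]
    return nodes, u

nodes, u = solve_weak_poisson_1d(50)
# For this particular problem the exact solution is x * (1 - x) / 2.
max_error = max(abs(ui - x * (1 - x) / 2) for x, ui in zip(nodes, u))
print(max_error)
```

In this special case the discrete solution matches the exact solution at the mesh nodes up to rounding, which is a known feature of this one-dimensional setting rather than something to expect in general.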

In summary, we can reformulate our PDE into the following functional analysis question: given a bilinear form $B\colon F\times E\to \mathbb{K}$ where $\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$ is the scalar field, when is the induced map

\[\begin{aligned} L\colon E &\to F^{*} \\ u &\mapsto \left(v\mapsto B(v,u)\right) \end{aligned}\]

surjective?

The Lax-Milgram theorem is an answer to this question: $L$ is surjective when $B$ is coercive. I will define the term coercive bilinear form later, but here let me point out that this concept is somewhat quantitatively defined. I wanted to see what is “the most purely qualitative” characterization of the condition that ensures surjectivity, and got a somewhat simple answer.

Duality pairings and Mackey-Arens theorem

A lesson I learned from functional analysis is that every attempt to directly generalize the duality theory of Banach spaces into something more general will necessarily suffer, even quite a lot. This is not because the concept of duality is broken beyond Banach spaces, but rather because it is already broken at the Banach space level. More precisely, the notion of “the correct dual space” is not always the usual one, the space of all continuous linear functionals with the operator norm. Rather, the correct dual space, either as a mere vector space or as a topological vector space, very much depends on the context, and there is no one-size-fits-all answer.

This is why it is a good idea to temporarily forget about the concept of continuous dual, and start with a so-called duality pairing given on an arbitrary pair $(E,F)$ of spaces, which a priori has nothing to do with any kind of topology we can give on those spaces.

Definition 1 (Duality pairing).

Let $E,F$ be vector spaces over $\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$. A bilinear form $\left\langle\,\cdot\,,\,\cdot\,\right\rangle\colon F\times E\to \mathbb{K}$ is called a duality pairing on $(E,F)$ if:

for any $v\in F$, $\left\langle v,u\right\rangle = 0$ for all $u\in E$ implies $v=0$, and

for any $u\in E$, $\left\langle v,u\right\rangle = 0$ for all $v\in F$ implies $u=0$.

For a vector space $E$, let $E^{\star}$ denote the algebraic dual of $E$, that is, the vector space of all linear functionals on $E$ with no further constraints. Then the conditions given in the definition above simply mean that the induced maps

\[\begin{aligned} \iota\colon F &\to E^{\star} \\ v &\mapsto \left(u\mapsto\left\langle v,u\right\rangle\right) \end{aligned}\]

and

\[\begin{aligned} \iota\colon E &\to F^{\star} \\ u &\mapsto \left(v\mapsto\left\langle v,u\right\rangle\right), \end{aligned}\]

both denoted as $\iota$, are injective, that is, are linear embeddings.

Since we obviously do not want to get our hands dirty with dull pure algebra😗, let us put some topologies on our spaces and see what happens. Specifically, a natural question to ask is, when is $F$ precisely the continuous dual space of $E$? That is, which topologies on $E$ make $F$ the continuous dual space of $E$? The Mackey-Arens theorem answers this question.

Given a topological vector space $(E,\mathscr{T})$, let $(E,\mathscr{T})'$ denote the continuous dual space of $(E,\mathscr{T})$, that is, the linear space of all linear functionals on $E$ that are continuous with respect to the topology $\mathscr{T}$. If $\mathscr{T}$ is obvious, we may omit it and just write $E'$.

Definition 2 (Dual topologies).

Let $E,F$ be vector spaces over $\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$ and $\left\langle\,\cdot\,,\,\cdot\,\right\rangle\colon F\times E\to \mathbb{K}$ a duality pairing. Then a linear topology $\mathscr{T}$ on $E$ is said to be a dual topology (with respect to the pairing $\left\langle\,\cdot\,,\,\cdot\,\right\rangle$) if $\iota[F] = (E,\mathscr{T})'$. Dual topologies on $F$ with respect to the pairing are similarly defined.

Theorem 3 (Mackey-Arens).

Let $E,F$ be vector spaces over $\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$ and $\left\langle\,\cdot\,,\,\cdot\,\right\rangle\colon F\times E\to \mathbb{K}$ a duality pairing. Then:

$\iota[F]\subseteq (E,\mathscr{T})'$ if and only if $\mathscr{T}\supseteq\sigma(E,F)$, and

$\iota[F]\supseteq (E,\mathscr{T})'$ if and only if $\mathscr{T}\subseteq\tau(E,F)$.

In particular, $\mathscr{T}$ is a dual topology if and only if
\[\begin{equation*} \sigma(E,F) \subseteq \mathscr{T} \subseteq \tau(E,F). \end{equation*}\]

The definitions of the topologies $\sigma(E,F)$ and $\tau(E,F)$ are given below.

This theorem is basically a consequence of the bipolar theorem (which in turn is a consequence of the Hahn-Banach theorem) and the Banach-Alaoglu theorem (which in turn is a consequence of the Arzelà–Ascoli theorem). A detailed proof of this theorem is not given in this post.

Definition 4 (Weak topology).

Let $E,F$ be vector spaces over $\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$ and $\left\langle\,\cdot\,,\,\cdot\,\right\rangle\colon F\times E\to \mathbb{K}$ a duality pairing. Then the weak topology on $E$ (with respect to the pairing $\left\langle\,\cdot\,,\,\cdot\,\right\rangle$) is the initial topology generated by $\iota[F]$, and is denoted as $\sigma(E,F)$. Equivalently, $\sigma(E,F)$ is the topology generated by the collection of all seminorms on $E$ of the form
\[\begin{aligned} E&\to [0,\infty) \\ u&\mapsto\vert \left\langle v,u\right\rangle\vert \end{aligned}\]
for any $v\in F$.

Definition 5 (Mackey topology).

Let $E,F$ be vector spaces over $\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$ and $\left\langle\,\cdot\,,\,\cdot\,\right\rangle\colon F\times E\to \mathbb{K}$ a duality pairing. Then the Mackey topology on $E$ (with respect to the pairing $\left\langle\,\cdot\,,\,\cdot\,\right\rangle$) is the topology generated by the collection of all seminorms on $E$ of the form
\[\begin{aligned} E&\to [0,\infty) \\ u&\mapsto \sup_{v\in K}\vert \left\langle v,u\right\rangle\vert \end{aligned}\]
for any absolutely convex $\sigma(F,E)$-compact subsets $K$ of $F$, and is denoted as $\tau(E,F)$.

For each $u\in E$, $v\mapsto\left\langle v,u\right\rangle$ is a $\sigma(F,E)$-continuous linear functional on $F$, thus $\sup_{v\in K}\vert\left\langle v,u\right\rangle\vert$ is finite (and the supremum is achieved) for any $\sigma(F,E)$-compact subset $K$ of $F$. Hence,

\[\begin{aligned} E&\to [0,\infty) \\ u&\mapsto \sup_{v\in K}\vert \left\langle v,u\right\rangle\vert \end{aligned}\]

is indeed a seminorm for such $K$.

When $E$ is a normed space, the Mackey topology $\tau(E,E')$ is nothing but the norm topology. Indeed, recall that the norm topology on $E$ is the subspace topology inherited from the norm topology on $E''$, which is the topology of uniform convergence on norm-bounded subsets of $E'$. By the Banach-Alaoglu theorem, any norm-bounded subset of $E'$ is contained in an absolutely convex weak-$*$ compact subset of $E'$, and weak-$*$ topology is nothing but the weak topology $\sigma(E',E)$. Therefore, the norm topology on $E$ is coarser than or equal to $\tau(E,E')$ (i.e., every norm-open set is $\tau(E,E')$-open as well). On the other hand, the uniform boundedness principle shows that any $\sigma(E',E)$-bounded subsets (thus $\sigma(E',E)$-compact subsets in particular) of $E'$ are norm-bounded, thus the Mackey topology is coarser than or equal to the norm topology, concluding the equivalence between the two.

It is worth noting that, on the other hand, $\tau(E',E)$ is not the operator norm topology in general. Since the operator norm topology coincides with the Mackey topology $\tau(E',E'')$, we conclude that $\tau(E',E)$ coincides with the operator norm topology if $E$ is reflexive. Of course, the Mackey-Arens theorem shows that the converse is also true.

A generalization of the Lax-Milgram theorem

The theorem we want to prove is the following:

Theorem 6 (Lax-Milgram).

Let $E,F$ be locally convex spaces over $\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$ and $B\colon F\times E\to \mathbb{K}$ a bilinear form. Define
\[\begin{aligned} L\colon E&\to F^{\star} \\ u&\mapsto \left(v\mapsto B(v,u)\right) \end{aligned}\]
and
\[\begin{aligned} R\colon F&\to E^{\star} \\ v&\mapsto \left(u\mapsto B(v,u)\right), \end{aligned}\]
and suppose that $R[F] \subseteq E'$. Then the following are equivalent:

$L[E] \supseteq F'$.

$R$ is injective and $R^{-1}\colon \left(R[F],\mathscr{T}\right)\to \left(F,\sigma(F,F')\right)$ is continuous for all dual topologies $\mathscr{T}$ on $E'$ with respect to the canonical pairing between $E'$ and $E$.

$R$ is injective and $R^{-1}\colon \left(R[F],\mathscr{T}\right)\to \left(F,\sigma(F,F')\right)$ is continuous for some dual topology $\mathscr{T}$ on $E'$ with respect to the canonical pairing between $E'$ and $E$.

Note that the only role that the topologies on $E,F$ play is in defining the dual spaces $E',F'$: it does not matter which specific dual topologies we choose.

Proof. $(1\Rightarrow 2)$ Suppose $v\in F\setminus\{0\}$. By the Hahn-Banach theorem, we can find $v^{*}\in F'$ such that $v^{*}(v) \neq 0$. Then by the assumption, there exists $u\in E$ such that $Lu = v^{*}$. Then
\[\begin{equation*} 0\neq v^{*}(v) = (Lu)(v) = B(v,u) = (Rv)(u), \end{equation*}\]
thus $Rv\neq 0$. This shows that $R$ is injective.

For continuity of $R^{-1}$, it is enough to show that $R^{-1}$ is continuous when $\mathscr{T}=\sigma(E',E)$ because any dual topology should contain $\sigma(E',E)$ by the Mackey-Arens theorem. Let $\left(v_{\alpha}\right)_{\alpha\in D}$ be a net in $F$ such that $\left(Rv_{\alpha}\right)_{\alpha\in D}$ is convergent to $Rv$ with respect to $\sigma(E',E)$ for some $v\in F$. We want to show that $\left(v_{\alpha}\right)_{\alpha\in D}$ is convergent to $v$ with respect to $\sigma(F,F')$, which means nothing but that $\left(v^{*}(v_{\alpha})\right)_{\alpha\in D}$ converges to $v^{*}(v)$ for all $v^{*}\in F'$. Given $v^{*}\in F'$, by the assumption there exists $u\in E$ such that $Lu=v^{*}$. Then $v^{*}(v_{\alpha}) = (Lu)(v_{\alpha}) = B(v_{\alpha},u) = (Rv_{\alpha})(u)$, so $\left(v^{*}(v_{\alpha})\right)_{\alpha\in D}$ converges to $(Rv)(u) = v^{*}(v)$ as desired.

$(2\Rightarrow 3)$ Trivial.

$(3\Rightarrow 1)$ Take an arbitrary $v^{*}\in F'$. Consider a linear functional
\[\begin{aligned} \lambda\colon R[F] &\to \mathbb{K} \\ Rv &\mapsto v^{*}(v). \end{aligned}\]
By the assumption, $\lambda$ is continuous with respect to $\mathscr{T}$. Hence, by the Hahn-Banach theorem, there exists a linear extension $\tilde{\lambda}\colon E'\to \mathbb{K}$ of $\lambda$ which is continuous with respect to $\mathscr{T}$. Since $\mathscr{T}$ is a dual topology, there exists $u\in E$ such that $\iota(u) = \tilde{\lambda}$, that is, $u^{*}(u) = \tilde{\lambda}(u^{*})$ holds for all $u^{*}\in E'$. Then, for any $v\in F$,
\[\begin{equation*} (Lu)(v) = B(v,u) = (Rv)(u) = \tilde{\lambda}(Rv) = \lambda(Rv) = v^{*}(v), \end{equation*}\]
concluding $Lu = v^{*}$. Therefore, $L[E]$ contains $F'$.$\quad\blacksquare$

In applications, a usual assumption is that $E$ is a reflexive Banach space. By definition, the usual operator norm topology on $E'$ is a dual topology with respect to the pairing between $E'$ and $E$, thus in that case usually we want to verify the continuity of $R^{-1}\colon R[F]\to F$ with respect to the norm topology on $R[F]\subseteq E'$. As noted earlier, in this situation the norm topology coincides with the Mackey topology $\tau(E',E)$ since $E$ is reflexive.

When $E$ is not reflexive, continuity of $R^{-1}$ with respect to the norm topology is not enough, because the norm topology on $E'$ is in general finer than $\tau(E',E)$ (i.e., every $\tau(E',E)$-open set is norm-open, but not necessarily vice versa). Hence, we have to work with $\tau(E',E)$ instead. Or it could be any other dual topology, but in principle $\tau(E',E)$ is the finest one so the continuity must be easiest to show when we endow $E'$ with $\tau(E',E)$. To work with $\tau(E',E)$, we should first characterize absolutely convex $\sigma(E,E')$-compact sets, in other words, absolutely convex weakly compact subsets of $E$. For instance, when $E = L^{1}(\mu)$ for some measure $\mu$, we may need to use the Dunford-Pettis theorem.

It may seem that showing continuity of $R^{-1}\colon R[F]\to F$ when the codomain $F$ is endowed with the weak topology $\sigma(F,F')$ is easier than when $F$ is endowed with a finer topology, for instance a norm topology. However, this is in fact not the case when both $E$ and $F$ are normed spaces and $R\colon F\to E'$ is not only well-defined but also continuous. When $E,F$ are normed spaces, it is often the case that the given bilinear form $B$ is also jointly continuous, thus this continuity condition on $R$ is easily guaranteed.

Now, suppose that $E,F$ are normed spaces and $R\colon F\to E'$ is continuous. We claim that in this case, if $R$ is injective and

\[\begin{equation*} R^{-1}\colon (R[F],\|\cdot\|_{E'})\to (F,\sigma(F,F')) \end{equation*}\]

is continuous, then

\[\begin{equation*} R^{-1}\colon (R[F],\|\cdot\|_{E'})\to (F,\|\cdot\|_{F}) \end{equation*}\]

is continuous as well. Note that the continuity of the first map means nothing but that the topology on $F$ induced by the norm $v\mapsto \|Rv\|_{E'}$ is finer than or equal to $\sigma(F,F')$. On the other hand, continuity of $R\colon F\to E'$ means that the topology induced by this new norm is coarser than or equal to the original norm topology on $F$. Hence, by the Mackey-Arens theorem, this new norm must induce exactly the same dual space $F'$, but this means that the norm topology induced by it is nothing but the Mackey topology $\tau(F,F')$, which is nothing but the original norm topology on $F$. Hence, $R^{-1}$ should be continuous when the codomain is endowed with the norm topology as well.

Recovering the classical case

When $E$ is a reflexive Banach space, a usual way of verifying the continuity condition required in Theorem 6 is to show that the bilinear form $B$ is coercive:

Definition 8 (Coercive bilinear forms).

Let $E,F$ be normed spaces over $\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$. Then a bilinear form $B\colon F\times E\to\mathbb{K}$ is said to be coercive if there exists a constant $c>0$ such that
\[\begin{equation*} \inf_{v\in F\setminus\{0\}}\sup_{u\in E\setminus\{0\}} \frac{|B(v,u)|}{\|u\|_{E}\|v\|_{F}} \geq c. \end{equation*}\]

Note that coercivity is really nothing but the norm-continuity of the map $R^{-1}\colon R[F]\to F$. Indeed, recall the map $R\colon v\mapsto \left(u\mapsto B(v,u)\right)$ from $F$ to $E^{\star}$, then

\[\begin{equation*} \|Rv\|_{E'} = \sup_{u\in E\setminus\{0\}}\frac{|(Rv)(u)|}{\|u\|_{E}} = \sup_{u\in E\setminus\{0\}}\frac{|B(v,u)|}{\|u\|_{E}}, \end{equation*}\]

so $B$ being coercive is equivalent to the existence of a constant $c>0$ such that

\[\begin{equation*} \|v\|_{F}\leq \frac{1}{c}\|Rv\|_{E'} \quad\textrm{for all $v\in F$}. \end{equation*}\]
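To get a feel for this constant, here is a small Python sketch in a finite-dimensional toy case, which is my own illustration and not part of the general theory above. It assumes $E = F = \mathbb{R}^{2}$ with Euclidean norms and $B(v,u) = v^{\mathsf{T}}Au$ for a hypothetical matrix `A`. In that setting $Rv$ is represented by $A^{\mathsf{T}}v$, so the best constant $c$ is the minimum of $\|A^{\mathsf{T}}v\|$ over unit vectors $v$, which is the smallest singular value of $A$.

```python
import math

# Hypothetical matrix defining B(v, u) = v^T A u on R^2 x R^2.
A = [[2.0, 1.0],
     [0.0, 1.0]]

def norm_Rv(v):
    """Euclidean norm of A^T v, i.e. the sup over u != 0 of |B(v, u)| / |u|."""
    w0 = A[0][0] * v[0] + A[1][0] * v[1]
    w1 = A[0][1] * v[0] + A[1][1] * v[1]
    return math.hypot(w0, w1)

def sampled_inf_sup(samples=3600):
    """Approximate the inf over unit vectors v by sampling the unit circle."""
    angles = (2 * math.pi * k / samples for k in range(samples))
    return min(norm_Rv((math.cos(t), math.sin(t))) for t in angles)

def smallest_singular_value():
    """Closed form for a 2x2 matrix via the trace and determinant of A^T A."""
    trace = sum(a * a for row in A for a in row)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return math.sqrt((trace - math.sqrt(trace * trace - 4 * det * det)) / 2)

c_sampled = sampled_inf_sup()
c_exact = smallest_singular_value()
print(c_sampled, c_exact)  # the two values should nearly agree
```

A strictly positive value corresponds to an invertible `A`, in which case every functional $v\mapsto v^{\mathsf{T}}w$ is of the form $v\mapsto B(v,u)$ with $u = A^{-1}w$.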

Hence, we obtain the following corollary, which is a more common way the Lax-Milgram theorem is stated:

Corollary 9 (Lions-Lax-Milgram).

Let $E$ be a reflexive Banach space over $\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$, $F$ a normed space over $\mathbb{K}$, and $B\colon F\times E\to \mathbb{K}$ a bilinear form. Define $L\colon E\to F^{\star}$ and $R\colon F\to E^{\star}$ as in Theorem 6 and suppose that $R[F]\subseteq E'$ and $R\colon F\to E'$ is continuous. Then $L[E] \supseteq F'$ holds if and only if $B$ is coercive.

As an example application, let us go back to the PDE we saw in the Introduction:

\[\begin{cases} \begin{aligned} -\Delta u &= f & \textrm{on}\quad & \Omega \\ u &= 0 & \textrm{on}\quad & \partial\Omega. \end{aligned} \end{cases}\]

In this case, the bilinear form we are looking at is

\[\begin{aligned} B\colon H_{0}^{1}(\Omega)\times H_{0}^{1}(\Omega) &\to \mathbb{K} \\ (\phi,u) &\mapsto \int_{\Omega}\nabla\phi \cdot \nabla u. \end{aligned}\]

Since $B(\overline{\phi},\phi) = \|\nabla\phi\|_{L^{2}(\Omega)}^{2}$, the coercivity of this bilinear form follows immediately from the Poincaré inequality. Therefore, we can apply the corollary above and conclude that there exists $u\in H_{0}^{1}(\Omega)$ such that

\[\begin{equation*} \int_{\Omega}\nabla\phi \cdot \nabla u = \lambda(\phi) \end{equation*}\]

holds for all $\phi\in H_{0}^{1}(\Omega)$, for any given $\lambda\in H_{0}^{1}(\Omega)^{*}$. This space $H_{0}^{1}(\Omega)^{*}$ in particular contains all integrations against functions in $L^{2}(\Omega)$, but it contains many more elements.

A non-Banach application

Let’s have some fun with a case involving non-Banach locally convex spaces. Let us consider the bilinear form

\[\begin{aligned} B\colon \mathcal{C}_{c}^{\infty}(\Omega)\times \mathcal{C}_{c}^{\infty}(\Omega)' &\to \mathbb{K} \\ (\phi,u) &\mapsto -\left\langle \Delta u,\phi\right\rangle \mathrel{\unicode{x2254}} -\left\langle u,\Delta\phi\right\rangle, \end{aligned}\]

where $\mathcal{C}_{c}^{\infty}(\Omega)$ is endowed with the usual LF-topology. Here, we are interested in the surjectivity of the induced map

\[\begin{aligned} L\colon \mathcal{C}_{c}^{\infty}(\Omega)' &\to \mathcal{C}_{c}^{\infty}(\Omega)' \\ u &\mapsto -\Delta u, \end{aligned}\]

or, put differently, we are asking the question: can we always invert the Laplacian applied to an arbitrary distribution?

Let us try to apply Theorem 6, so we consider the other induced map

\[\begin{aligned} R\colon \mathcal{C}_{c}^{\infty}(\Omega)&\to \left(\mathcal{C}_{c}^{\infty}(\Omega)'\right)^{\star} \\ \phi &\mapsto \left(u\mapsto -\left\langle u,\Delta\phi\right\rangle \right). \end{aligned}\]

First, we are interested in checking the condition

\[\begin{equation*} R\left[\mathcal{C}_{c}^{\infty}(\Omega)\right] \subseteq \mathcal{C}_{c}^{\infty}(\Omega)''. \end{equation*}\]

Here, we need to be a bit careful about which topology we should put on the space $\mathcal{C}_{c}^{\infty}(\Omega)'$ because there are multiple possible choices. However, as noted earlier, the only relevance of this choice is in determining the desired dual space of $\mathcal{C}_{c}^{\infty}(\Omega)'$, and there is really only one choice of the space which we want to be the dual space of $\mathcal{C}_{c}^{\infty}(\Omega)'$: the predual $\mathcal{C}_{c}^{\infty}(\Omega)$. Hence, let us just choose any dual topology for the pairing between the two, for instance the weak-$*$ topology. Then, asking if

\[\begin{equation*} R\left[\mathcal{C}_{c}^{\infty}(\Omega)\right] \subseteq \mathcal{C}_{c}^{\infty}(\Omega)'' \cong \mathcal{C}_{c}^{\infty}(\Omega) \end{equation*}\]

holds really just means asking if $-\Delta\phi$ belongs to $\mathcal{C}_{c}^{\infty}(\Omega)$ for any $\phi\in\mathcal{C}_{c}^{\infty}(\Omega)$, which is trivially the case.

The next question is whether or not $R$ is injective. This one is indeed easy to answer. Suppose $R\phi = 0$, then we must have $(R\phi)(\tilde{\phi}) = 0$ where $\tilde{\phi}$ is the distribution defined as

\[\begin{equation*} \tilde{\phi}\colon \psi \mapsto \int_{\Omega}\overline{\phi}\psi. \end{equation*}\]

Hence,

\[\begin{equation*} 0 = -\left\langle\tilde{\phi},\Delta\phi\right\rangle = -\int_{\Omega}\overline{\phi}\Delta\phi = -\int_{\Omega}\left(\nabla\cdot (\overline{\phi}\nabla\phi) - |\nabla\phi|^{2}\right) = \int_{\Omega}|\nabla\phi|^{2} \end{equation*}\]

by the divergence theorem, so $\nabla\phi = 0$. Since $\phi$ has compact support, we get $\phi = 0$.

Next, we want to know the continuity of the map

\[\begin{equation*} R^{-1}\colon R\left[\mathcal{C}_{c}^{\infty}(\Omega)\right] \to\left(\mathcal{C}_{c}^{\infty}(\Omega), \sigma\left(\mathcal{C}_{c}^{\infty}(\Omega),\mathcal{C}_{c}^{\infty}(\Omega)'\right)\right), \end{equation*}\]

which is where the “real work” should be done. Let us actually show that $R^{-1}$ is continuous even when the codomain is endowed with the usual LF-topology rather than the weak topology. That is, we want to see that the map

\[\begin{aligned} R\colon \mathcal{C}_{c}^{\infty}(\Omega)&\to \mathcal{C}_{c}^{\infty}(\Omega)\\ \phi &\mapsto -\Delta\phi \end{aligned}\]

is open onto its image. Recall that $\mathcal{C}_{c}^{\infty}(\Omega)$ is the locally convex direct limit of spaces $\mathcal{C}_{0}^{\infty}(\operatorname{int}K)$ for all compact subsets $K$ in $\Omega$, where $\operatorname{int}K$ is the interior of $K$ and $\mathcal{C}_{0}^{\infty}(\operatorname{int}K)$ is the Fréchet space consisting of smooth functions on $\operatorname{int}K$ whose every derivative belongs to the Banach space $\mathcal{C}_{0}(\operatorname{int}K)$. Then a basic open neighborhood of $0\in\mathcal{C}_{c}^{\infty}(\Omega)$ is the absolutely convex hull of the union of open neighborhoods of $0\in\mathcal{C}_{0}^{\infty}(\operatorname{int}K)$. Hence, it turns out, it is enough to show that each

\[\begin{aligned} R_{K}\colon \mathcal{C}_{0}^{\infty}(\operatorname{int}K)&\to \mathcal{C}_{0}^{\infty}(\operatorname{int}K)\\ \phi &\mapsto -\Delta\phi \end{aligned}\]

is an open map onto its image.

Assuming that $K$ has sufficiently smooth boundary (which we can), this is again a consequence of the Poincaré inequality. Given $\phi\in\mathcal{C}_{0}^{\infty}(\operatorname{int}K)$, note that

\[\begin{equation*} \int_{\operatorname{int}K}|\nabla\phi|^{2} = -\int_{\operatorname{int}K}\overline{\phi}\Delta\phi \leq \|\phi\|_{L^{2}(\operatorname{int}K)}\|\Delta\phi\|_{L^{2}(\operatorname{int}K)}. \end{equation*}\]

By the Poincaré inequality, there exists a constant $C_{K}\in(0,\infty)$ such that

\[\begin{equation*} \|\phi\|_{L^{2}(\operatorname{int}K)}^{2} \leq C_{K}\|\nabla\phi\|_{L^{2}(\operatorname{int}K)}^{2} \leq C_{K}\|\phi\|_{L^{2}(\operatorname{int}K)}\|\Delta\phi\|_{L^{2}(\operatorname{int}K)}, \end{equation*}\]

thus we get

\[\begin{equation*} \|\phi\|_{L^{2}(\operatorname{int}K)} \leq C_{K}\|\Delta\phi\|_{L^{2}(\operatorname{int}K)} \leq \tilde{C}_{K}\|\Delta\phi\|_{\mathcal{C}_{0}(\operatorname{int}K)} \end{equation*}\]

for some different constant $\tilde{C}_{K}$. Now, applying this inequality to derivatives of $\phi$ and then applying the Sobolev inequality, we get the conclusion that $\Delta\phi_{\alpha} \to 0$ implies $\phi_{\alpha}\to 0$ in $\mathcal{C}_{0}^{\infty}(\operatorname{int}K)$ for any net $\left(\phi_{\alpha}\right)_{\alpha\in D}$, which is the desired conclusion.

Therefore, we can now apply Theorem 6 to conclude that the map

\[\begin{aligned} L\colon \mathcal{C}_{c}^{\infty}(\Omega)' &\to \mathcal{C}_{c}^{\infty}(\Omega)' \\ u &\mapsto -\Delta u \end{aligned}\]

is surjective.

The Fourier transform of the Heaviside step function

2024-08-27T00:00:00-07:00

Back in 2010, as an electrical engineering major student I was taking an introductory course on signals and systems taught by my formal advisor. The course was mainly about four different kinds of Fourier transforms (Discrete-Time Fourier Series a.k.a. Discrete Fourier Transform, Discrete-Time Fourier Transform, Continuous-Time Fourier Series, and Continuous-Time Fourier Transform) and some additional topics like the famous sampling theorem.

The course was decent, the topics taught were interesting and I learned a lot. However, since it was not a course for math majors, many of the arguments made for developing the theory were not very rigorous, and one of the things that especially bothered me was how the Fourier transform of the Heaviside step function is obtained.

Here is how the argument given in the class goes. Given a function $f\colon\mathbb{R}\to\mathbb{R}$, its Fourier transform is the function $\hat{f}\colon\mathbb{R}\to\mathbb{C}$ given as

\[\hat{f}\colon\xi\mapsto \int_{\mathbb{R}}f(x)e^{-2\pi ix\xi}\,dx.\]

Now consider the Heaviside step function

\[H\colon x\mapsto \begin{cases}0 & \textrm{if $x < 0$,} \\ 1 & \textrm{otherwise.}\end{cases}\]

We want to compute the integral

\[\int_{\mathbb{R}}H(x)e^{-2\pi ix\xi}\,dx = \int_{0}^{\infty}e^{-2\pi ix\xi}\,dx,\]

but of course the integral does not converge. To work around this issue, we first consider the signum function

\[\mathrm{sgn}\colon x\mapsto \begin{cases} -1 & \textrm{if $x < 0$,} \\ 1 & \textrm{otherwise.}\end{cases}\]

Since $H = \frac{1}{2}(\mathrm{sgn}+1)$, we know $\hat{H} = \frac{1}{2}\widehat{\mathrm{sgn}} + \frac{1}{2}\delta$ where $\delta$ is the Dirac delta function. But how to compute $\widehat{\mathrm{sgn}}$? Doesn’t the integral still diverge? The trick is to first multiply $\mathrm{sgn}$ by the exponential decay $x\mapsto\exp(-\lambda\vert x\vert)$ for some $\lambda>0$, compute the Fourier transform of the resulting function, and then send the decay factor $\lambda$ to $0$. Let us see how it works out.

First, given $\lambda>0$, we compute:

\[\begin{align*} \int_{\mathbb{R}}\mathrm{sgn}(x)e^{-\lambda|x| - 2\pi i\xi x}\,dx &= \int_{-\infty}^{0}-e^{(\lambda - 2\pi i\xi)x}\,dx + \int_{0}^{\infty}e^{-(\lambda + 2\pi i\xi)x}\,dx \\ &= -\frac{1}{\lambda - 2\pi i\xi} + \frac{1}{\lambda + 2\pi i \xi} \\ &= -\frac{4\pi i\xi}{\lambda^{2} + 4\pi^{2}\xi^{2}}. \end{align*}\]

Send $\lambda\to 0^{+}$, then we get:

\[\widehat{\mathrm{sgn}}(\xi) = -\frac{4\pi i\xi}{4\pi^{2}\xi^{2}} = \frac{1}{\pi i\xi}.\]

As a result, we conclude:

\[\hat{H}(\xi) = \frac{1}{2\pi i\xi} + \frac{1}{2}\delta(\xi).\]

Although the end result turns out to be correct, anyone encountering this argument for the first time will ask this question: why do we need to first work with $\mathrm{sgn}$? Why not multiply $H$ by the exponential decay directly?

The thing is, that gives an incorrect answer: given $\lambda>0$, we have

\[\int_{\mathbb{R}}H(x)e^{-\lambda|x|-2\pi i\xi x}\,dx = \int_{0}^{\infty}e^{-(\lambda+2\pi i\xi)x}\,dx = \frac{1}{\lambda + 2\pi i\xi},\]

thus sending $\lambda\to 0^{+}$ gives

\[\hat{H}(\xi) = \frac{1}{2\pi i\xi}.\]

Eh… what happened?

The professor told us that he honestly does not really buy this exponential decay argument, and unfortunately he has no better explanation, so for the moment we should just believe that $\hat{H}(\xi) = \frac{1}{2\pi i\xi} + \frac{1}{2}\delta(\xi)$ is the right one, and we will see it is indeed the right one as we learn through and relate more things together.

Nevertheless, I desperately wanted to know why the argument seems to work for $\mathrm{sgn}$ but not for $H$, and I knew that the distribution theory is the right thing to look at. However, at that time my mathematical skill was not mature enough to truly appreciate the theory, and although I kind of figured out a way to make the argument rigorous, it was much later when I finally got a full explanation of what is really happening. (If I recall correctly, it was around 2020, 10 years after I first saw this!) This post is about that explanation.

Justification of the exponential decay trick

First of all, in fact the exponential decay trick is valid regardless of the function we are looking at. This is indeed a simple consequence of the dominated convergence theorem: let $f\colon\mathbb{R}\to\mathbb{R}$ be a locally integrable function and for each $\lambda>0$, define

\[f_{\lambda}\colon x\mapsto e^{-\lambda|x|}f(x).\]

Then $f_{\lambda} \to f$ as $\lambda\to 0^{+}$ in the distributional sense; that is, for any given test function $\phi\in\mathcal{C}_{c}^{\infty}(\mathbb{R})$, we have

\[\left\langle f_{\lambda},\phi\right\rangle \mathrel{\unicode{x2254}} \int_{\mathbb{R}}f_{\lambda}(x)\phi(x)\,dx \to \left\langle f,\phi\right\rangle\]

as $\lambda\to 0^{+}$. As pointed out, this follows immediately from the dominated convergence theorem. By the same reason, if $f\colon\mathbb{R}\to\mathbb{R}$ is a locally integrable function that defines a tempered distribution and $\phi\in\mathcal{S}(\mathbb{R})$ is a Schwartz function, then we have

\[\left\langle f_{\lambda},\phi\right\rangle \to \left\langle f,\phi\right\rangle\]

as $\lambda\to 0^{+}$. (That $f$ defines a tempered distribution simply means that $f\phi$ is integrable for every $\phi\in\mathcal{S}(\mathbb{R})$, so the dominated convergence theorem still applies.)

In particular, we do have both $\mathrm{sgn}_{\lambda}\to\mathrm{sgn}$ and $H_{\lambda}\to H$ as $\lambda\to 0^{+}$.

Next, we see that the Fourier transform $\hat{\cdot}\colon\mathcal{S}'(\mathbb{R})\to\mathcal{S}'(\mathbb{R})$ is continuous with respect to the weak-$*$ topology. Because of how it is defined, this is in fact trivial: for a given net $\left(u_{\alpha}\right)_{\alpha\in D}$ in $\mathcal{S}'(\mathbb{R})$ convergent to some $u\in\mathcal{S}'(\mathbb{R})$ with respect to the weak-$*$ topology, we have

\[\lim_{\alpha\in D}\left\langle \hat{u}_{\alpha}, \phi\right\rangle = \lim_{\alpha\in D}\left\langle u_{\alpha}, \hat{\phi}\right\rangle = \left\langle u,\hat{\phi}\right\rangle = \left\langle \hat{u},\phi\right\rangle\]

for each $\phi\in\mathcal{S}(\mathbb{R})$.

Therefore, we have both $\widehat{\mathrm{sgn}}_{\lambda} \to \widehat{\mathrm{sgn}}$ and $\hat{H}_{\lambda}\to \hat{H}$ as $\lambda\to 0^{+}$.

But isn’t this a contradiction, as we have already seen that $\hat{H}_{\lambda}$ does not converge to $\frac{1}{2}\widehat{\mathrm{sgn}} + \frac{1}{2}\delta$? Actually, it does converge to it. The error committed by the informal, naïve argument given at the beginning of the post is not in the idea of using the exponential decay, rather it is in the way it is executed.

The limit of $\hat{H}_{\lambda}$’s

Let us now directly compute the correct distributional limit of the $\hat{H}_{\lambda}$’s. Fix $\lambda>0$ and $\phi\in\mathcal{S}(\mathbb{R})$, then by definition,

\[\left\langle \hat{H}_{\lambda},\phi\right\rangle = \int_{0}^{\infty}\int_{\mathbb{R}}e^{-(\lambda + 2\pi i\xi)x}\phi(\xi)\,d\xi\,dx.\]

Note that the integrand is (absolutely) integrable, so we can apply Fubini’s theorem to get

\[\left\langle \hat{H}_{\lambda},\phi\right\rangle = \int_{\mathbb{R}}\phi(\xi)\int_{0}^{\infty}e^{-(\lambda + 2\pi i\xi)x}\,dx\,d\xi = \int_{\mathbb{R}}\frac{\phi(\xi)}{\lambda + 2\pi i\xi}\,d\xi.\]

In other words, the formula $\hat{H}_{\lambda}(\xi) = \frac{1}{\lambda + 2\pi i\xi}$ is correct. However, the interesting part is on taking the limit $\lambda\to 0^{+}$: since there is no single integrable function that uniformly bounds the integrands for all small enough $\lambda>0$, we cannot blindly take the pointwise limit, and this is where the naïve argument got it wrong.

The key observation we will leverage here is the cancellation between contributions from the positive and the negative regions of $\xi$. Let us write our integral as

\[\begin{align*} \int_{\mathbb{R}}\frac{\phi(\xi)}{\lambda + 2\pi i\xi}\,d\xi &= \int_{0}^{\infty}\frac{\phi(\xi)}{\lambda + 2\pi i\xi} + \frac{\phi(-\xi)}{\lambda - 2\pi i\xi}\,d\xi \\ &= \int_{0}^{\infty}\frac{\lambda(\phi(\xi) + \phi(-\xi)) - 2\pi i\xi(\phi(\xi) - \phi(-\xi))} {\lambda^{2} + 4\pi^{2}\xi^{2}} \,d\xi. \end{align*}\]

Now, we separately compute

\[I_{1} \mathrel{\unicode{x2254}} \int_{0}^{\infty}\frac{\lambda(\phi(\xi) + \phi(-\xi))} {\lambda^{2} + 4\pi^{2}\xi^{2}} \,d\xi,\quad\quad I_{2} \mathrel{\unicode{x2254}} \int_{0}^{\infty}\frac{-2\pi i\xi(\phi(\xi) - \phi(-\xi))} {\lambda^{2} + 4\pi^{2}\xi^{2}} \,d\xi.\]

For $I_{1}$, we further divide it into the sum

\[I_{1} = \int_{0}^{\infty}\frac{\lambda(\phi(\xi) + \phi(-\xi) - 2\phi(0))} {\lambda^{2} + 4\pi^{2}\xi^{2}} + \frac{2\lambda\phi(0)} {\lambda^{2} + 4\pi^{2}\xi^{2}}\,d\xi.\]

For each $\xi\in[0,\infty)$, Taylor’s theorem applied to both $\phi(\xi)$ and $\phi(-\xi)$ shows that $\phi(\xi) + \phi(-\xi) -2\phi(0) = \frac{\xi^{2}}{2}\left(\phi''(\bar{\xi}_{+}) + \phi''(\bar{\xi}_{-})\right)$ for some $\bar{\xi}_{+}\in[0,\xi]$ and $\bar{\xi}_{-}\in[-\xi,0]$, thus we have

\[|\phi(\xi) + \phi(-\xi) - 2\phi(0)| \leq \min\left(4\left\|\phi\right\|_{\mathcal{C}^{0}}, \xi^{2}\left\|\phi''\right\|_{\mathcal{C}^{0}} \right)\]

where $\left\|\,\cdot\,\right\|_{\mathcal{C}^{0}}$ is the uniform norm.

Therefore, by splitting the range of integration into $\xi\leq 1$ and $\xi>1$, we can see that the integrand of

\[\int_{0}^{\infty}\frac{\lambda(\phi(\xi) + \phi(-\xi) - 2\phi(0))} {\lambda^{2} + 4\pi^{2}\xi^{2}}\,d\xi\]

is bounded by an integrable function that is independent of $\lambda\in(0,1]$: for $\xi\leq 1$, (the absolute value of) the integrand is bounded by $\frac{\lambda\|\phi''\|_{\mathcal{C}^{0}}}{4\pi^{2}} \leq \frac{\|\phi''\|_{\mathcal{C}^{0}}}{4\pi^{2}}$, and for $\xi>1$, it is bounded by $\frac{\lambda\|\phi\|_{\mathcal{C}^{0}}}{\pi^{2}\xi^{2}} \leq \frac{\|\phi\|_{\mathcal{C}^{0}}}{\pi^{2}\xi^{2}}$. Since the integrand also converges to $0$ pointwise as $\lambda\to 0^{+}$, we can apply the dominated convergence theorem to see

\[\int_{0}^{\infty}\frac{\lambda(\phi(\xi) + \phi(-\xi) - 2\phi(0))} {\lambda^{2} + 4\pi^{2}\xi^{2}}\,d\xi \to 0 \quad\textrm{as}\quad \lambda\to 0^{+}.\]

For the second term of $I_{1}$, as usual we apply the change of variables $\xi = \frac{\lambda}{2\pi}\tan\theta$, then we get

\[\int_{0}^{\infty}\frac{2\lambda\phi(0)} {\lambda^{2} + 4\pi^{2}\xi^{2}}\,d\xi = \int_{0}^{\pi/2}\frac{2\lambda\phi(0)}{\lambda^{2}\sec^{2}\theta} \cdot \frac{\lambda}{2\pi}\sec^{2}\theta\,d\theta = \frac{\phi(0)}{\pi}\int_{0}^{\pi/2}d\theta = \frac{\phi(0)}{2}.\]

Therefore, we obtain $I_{1}\to \frac{1}{2}\phi(0) = \frac{1}{2}\left\langle \delta, \phi\right\rangle$ as $\lambda\to 0^{+}$.

Now, for $I_{2}$, since we want to see if we can replace the integrand by what we obtain when $\lambda = 0$, let us split $I_{2}$ in this way:

\[\begin{align*} I_{2} &= \int_{0}^{\infty}\left(\phi(\xi) - \phi(-\xi)\right) \left( \frac{-2\pi i\xi}{\lambda^{2} + 4\pi^{2}\xi^{2}} - \frac{1}{2\pi i\xi} + \frac{1}{2\pi i\xi} \right)\,d\xi \\ &= \int_{0}^{\infty}\frac{\phi(\xi) - \phi(-\xi)}{2\pi i\xi} - \frac{\lambda^{2}(\phi(\xi) - \phi(-\xi))} {2\pi i\xi(\lambda^{2} + 4\pi^{2}\xi^{2})}\,d\xi. \end{align*}\]

Similarly to the case of $I_{1}$, using the bound

\[|\phi(\xi) - \phi(-\xi)| \leq \min\left(2\left\|\phi\right\|_{\mathcal{C}^{0}}, 2\xi\left\|\phi'\right\|_{\mathcal{C}^{0}} \right)\]

that follows from Taylor’s theorem and then splitting the range of integration into $\xi\leq 1$ and $\xi>1$, the dominated convergence theorem yields

\[\int_{0}^{\infty}\frac{\lambda^{2}(\phi(\xi) - \phi(-\xi))} {2\pi i\xi(\lambda^{2} + 4\pi^{2}\xi^{2})}\,d\xi \to 0 \quad\textrm{as}\quad \lambda\to 0^{+}.\]

Consequently, we get

\[\left\langle \hat{H}_{\lambda},\phi\right\rangle \to \frac{1}{2}\left\langle \delta, \phi\right\rangle + \int_{0}^{\infty}\frac{\phi(\xi) - \phi(-\xi)}{2\pi i\xi}\,d\xi \quad\textrm{as}\quad \lambda\to 0^{+}.\]

Note that the integral $\int_{0}^{\infty}\frac{\phi(\xi) - \phi(-\xi)}{2\pi i\xi}\,d\xi$ converges absolutely because of the bound

\[|\phi(\xi) - \phi(-\xi)| \leq 2\xi\left\|\phi'\right\|_{\mathcal{C}^{0}},\]

so we know

\[\int_{0}^{\infty}\frac{\phi(\xi) - \phi(-\xi)}{2\pi i\xi}\,d\xi = \lim_{\epsilon\to 0^{+}} \int_{\epsilon}^{\infty}\frac{\phi(\xi) - \phi(-\xi)}{2\pi i\xi}\,d\xi.\]

Then since each term of the integrand of the right-hand side is integrable on $[\epsilon,\infty)$ for a fixed $\epsilon>0$, we have

\[\int_{\epsilon}^{\infty}\frac{\phi(\xi) - \phi(-\xi)}{2\pi i\xi}\,d\xi = \int_{\epsilon}^{\infty}\frac{\phi(\xi)}{2\pi i\xi}\,d\xi + \int_{\epsilon}^{\infty}\frac{\phi(-\xi)}{2\pi i(-\xi)}\,d\xi = \int_{|\xi|\geq\epsilon}\frac{\phi(\xi)}{2\pi i\xi}\,d\xi,\]

thus

\[\int_{0}^{\infty}\frac{\phi(\xi) - \phi(-\xi)}{2\pi i\xi}\,d\xi = \lim_{\epsilon\to 0^{+}} \int_{|\xi|\geq \epsilon}\frac{\phi(\xi)}{2\pi i\xi}\,d\xi.\]

The distribution

\[\phi \mapsto \lim_{\epsilon\to 0^{+}} \int_{|\xi|\geq \epsilon}\frac{\phi(\xi)}{2\pi i\xi}\,d\xi\]

is known as the principal value integral of $\frac{1}{2\pi i\xi}$, and is denoted as $\mathrm{p.v.}\left(\xi\mapsto \frac{1}{2\pi i\xi}\right)$.

Putting all pieces together, we now obtain:

\[\hat{H} = \mathrm{p.v.}\left(\xi\mapsto \frac{1}{2\pi i\xi}\right) + \frac{1}{2}\delta,\]

which verifies that

\[\hat{H}(\xi) = \frac{1}{2\pi i\xi} + \frac{1}{2}\delta(\xi)\]

is more or less the correct formula.
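As a quick sanity check of this formula (the Gaussian test function below is chosen here purely for illustration), take $\phi(\xi) = e^{-\pi\xi^{2}}$, which satisfies $\hat{\phi} = \phi$ under the convention used in this post. Computing directly from the definition gives

\[\left\langle \hat{H},\phi\right\rangle = \left\langle H,\hat{\phi}\right\rangle = \int_{0}^{\infty}e^{-\pi x^{2}}\,dx = \frac{1}{2}.\]

On the other hand, since $\phi$ is even, the principal value term vanishes, so

\[\left\langle \mathrm{p.v.}\left(\xi\mapsto \frac{1}{2\pi i\xi}\right),\phi\right\rangle + \frac{1}{2}\left\langle \delta,\phi\right\rangle = 0 + \frac{1}{2}\phi(0) = \frac{1}{2}.\]

The two computations agree, whereas the formula without the $\frac{1}{2}\delta$ term would give $0$ for this $\phi$.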

I will omit the details, but the same procedure shows that

\[\widehat{\mathrm{sgn}} = \mathrm{p.v.}\left(\xi\mapsto\frac{1}{\pi i\xi}\right)\]

as expected, and is of course consistent with $H = \frac{1}{2}\left(\mathrm{sgn} + 1\right)$.

How to quickly factor out a constant factor from integers

2024-04-23T00:00:00-07:00

This post revisits the topic of integer division, building upon the discussion in the previous post. Specifically, I’ll delve into removing trailing zeros in the decimal representation of an input integer, or more broadly, factoring out the highest power of a given constant that divides the input. This exploration stems from the problem of converting floating-point numbers into strings, where certain contemporary algorithms, such as Schubfach and Dragonbox, may yield outputs containing trailing zeros.

Here is the precise statement of our problem:

Given positive integer constants $n_{\max}$ and $q\leq n_{\max}$, how to find the largest integer $k$ such that $q^{k}$ divides $n$, together with the corresponding $\frac{n}{q^{k}}$, for any $n=1,\ \cdots\ ,n_{\max}$?

As one can expect, this more or less boils down to an efficient divisibility test algorithm. However, merely testing for divisibility is not enough, and we need to actually divide the input by the divisor when we know the input is indeed divisible.

Naïve algorithm

Here is the most straightforward implementation I can think of:

std::size_t s = 0;
while (true) {
    auto const r = n % q;
    if (r == 0) {
        n /= q;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};

Of course, this works (as long as $n\neq 0$, which we do assume throughout), but obviously our objective is to explore avenues for greater performance. Here, we are assuming that the divisor $q$ is a given constant, so any sane modern compiler knows that the dreaded generic integer division is not necessary. Rather, it would replace the division with a multiply-and-shift, or some slight variation of it, as explained in the previous post. That is great, but note that we need to compute both the quotient and the remainder. As far as I know, there is no known algorithm capable of computing both quotient and remainder with only one multiplication, which means that the above code will perform two multiplications per iteration. However, if we consider cases where the input is not divisible by the divisor, we realize that we don’t actually require the precise values of the quotient or the remainder. Our sole concern is whether the remainder is zero, and only if that is the case, we do want to know the quotient of the division. Therefore, it’s conceivable that we could accomplish this with just one multiplication per iteration, which presumably will improve the performance.
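For concreteness, here is a minimal self-contained sketch of the naïve loop, specialized to $q = 10$ and 32-bit inputs. The names `factored` and `remove_trailing_zeros_naive` are made up for this illustration and do not come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// n == quotient * 10^exponent, where 10 does not divide quotient.
struct factored {
    std::uint32_t quotient;
    std::size_t exponent;
};

// Hypothetical helper: the naive loop with the constant divisor q = 10.
// Both `n % 10` and `n / 10` are divisions by a compile-time constant,
// so a compiler is free to lower them to multiply-and-shift sequences.
inline factored remove_trailing_zeros_naive(std::uint32_t n) {
    assert(n != 0);  // the loop would never terminate for n == 0
    std::size_t s = 0;
    while (n % 10 == 0) {
        n /= 10;
        s += 1;
    }
    return {n, s};
}
```

For instance, `remove_trailing_zeros_naive(1230000)` yields the pair `{123, 4}`.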

Actually, the classical paper by Granlund-Montgomery already presented such an algorithm, which is the topic of the next section.

Granlund-Montgomery modular inverse algorithm

First, let us assume for now that $q$ is an odd number. Assume further that our input is of $b$-bit unsigned integer type, e.g. of type std::uint32_t, with $b=32$. Hence, we are assuming that $n_{\max}$ is at most $2^{b}-1$. Now, since $q$ is coprime to $2^{b}$, there uniquely exists the modular inverse of $q$ with respect to $2^{b}$. Let us call it $m$, which in general can be found using the extended Euclidean algorithm. But how is it useful to us?
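As an aside, when the modulus is a power of $2$, the inverse can also be computed by Newton's iteration, which is a different technique from the extended Euclidean algorithm mentioned above. The following is only a sketch under the assumption $b = 32$; the function name `inverse_mod_2_32` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: inverse of an odd q modulo 2^32 by Newton (Hensel) iteration.
// This is a swapped-in technique, not the extended Euclidean algorithm.
constexpr std::uint32_t inverse_mod_2_32(std::uint32_t q) {
    assert(q % 2 == 1);   // only odd numbers are invertible modulo 2^32
    std::uint32_t m = q;  // q * q == 1 (mod 8), so m is correct to 3 bits
    for (int i = 0; i < 4; ++i) {
        // If q * m == 1 (mod 2^k), then q * m * (2 - q * m) == 1 (mod 2^(2k)),
        // so each step doubles the number of correct bits: 3, 6, 12, 24, 48.
        m *= 2u - q * m;
    }
    return m;
}

static_assert(inverse_mod_2_32(5) == 0xCCCCCCCDu, "5 * 0xCCCCCCCD == 2^34 + 1");
```

The arithmetic relies on unsigned wrap-around, so all products are implicitly reduced modulo $2^{32}$.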

The key observation here is that, on the group $\mathbb{Z}/2^{b}$ of integer residue classes modulo $2^{b}$, the multiplication by any integer coprime to $2^{b}$ is an automorphism, i.e., a bijective group homomorphism onto itself. In particular, multiplication by such an integer induces a bijection from the set $\left\{0,1,\ \cdots\ ,2^{b}-1\right\}$ onto itself.

Now, what does this bijection, defined in particular by the modular inverse $m$ (which must be coprime to $2^{b}$), do to multiples of $q$, i.e., $0, q,\ \cdots\ ,\left\lfloor\frac{2^{b} - 1}{q}\right\rfloor q$? Note that for any integer $a$,

\[(aq)m \equiv a(qm) \equiv a\ (\operatorname{mod}\ 2^{b}),\]

thus, the bijection $\left\{0,1,\ \cdots\ ,2^{b}-1\right\}\to\left\{0,1,\ \cdots\ ,2^{b}-1\right\}$ defined by $m$ maps $aq$ into $a$. Therefore, anything that gets mapped into $\left\{0,1,\ \cdots\ ,\left\lfloor\frac{2^{b}-1}{q}\right\rfloor\right\}$ must be a multiple of $q$, because the map is a bijection, and vice versa.
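As a small worked illustration (the parameters $b=4$ and $q=3$ are chosen here only for the example), the modular inverse of $3$ with respect to $2^{4}=16$ is $m=11$, since $3\cdot 11 = 33 \equiv 1\ (\operatorname{mod}\ 16)$. Multiplication by $11$ permutes $\left\{0,1,\ \cdots\ ,15\right\}$ as follows:

\[\begin{array}{c|cccccccccccccccc} n & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\ \hline (11n\ \operatorname{mod}\ 16) & 0 & 11 & 6 & 1 & 12 & 7 & 2 & 13 & 8 & 3 & 14 & 9 & 4 & 15 & 10 & 5 \end{array}\]

The multiples $0,3,6,9,12,15$ of $3$ are mapped to $0,1,2,3,4,5$, that is, to their quotients by $3$, and these are exactly the outputs not exceeding $\left\lfloor\frac{15}{3}\right\rfloor = 5$. Every other input is mapped to something strictly larger than $5$.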

Furthermore, the image of this bijection, in the case when the input $n$ was actually a multiple of $q$, is precisely $n/q$. Since multiplication modulo $2^{b}$ is what C/C++ abstract machines are supposed to do whenever they see an unsigned integer multiplication, we find that the following code does the job we want, but with only one multiplication per iteration:

// m and threshold_value are precalculated from q.
// m is the modular inverse of q, and
// threshold_value is (0xff....ff / q) + 1.
std::size_t s = 0;
while (true) {
    auto const r = n * m;
    if (r < threshold_value) {
        n = r;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};

All is good so far, but recall that this works only when $q$ is odd. Unfortunately, our main application, the trailing zero removal, requires $q$ to be $10$, which is even.

Actually, the aforementioned paper by Granlund-Montgomery proposes a clever solution for the case of possibly non-odd divisors, based on bit-rotation. Let us write $q = 2^{t}q_{0}$ for a nonnegative integer $t$ and a positive odd number $q_{0}$. Obviously, there is no modular inverse of $q$ with respect to $2^{b}$ when $t>0$, but we can try the next best thing, that is, the modular inverse of $q_{0}$ with respect to $2^{b-t}$. Let us call it $m$.

Now, for given $n=0,1,\ \cdots\ ,n_{\max} \leq 2^{b}-1$, let us consider the multiplication $nm$ modulo $2^{b}$ and then perform bit-rotate-to-right on it by $t$-bits. Let us call the result $r$. Clearly, the highest $t$-bits of $r$ are all zero if and only if the lowest $t$-bits of $(nm\ \operatorname{mod}\ 2^{b})$ are all zero, and since $m$ must be an odd number, this is the case if and only if the lowest $t$-bits of $n$ are all zero, that is, $n$ is a multiple of $2^{t}$.

We claim that $n$ is a multiple of $q$ if and only if $r \leq \left\lfloor\frac{2^{b}-1}{q}\right\rfloor = \left\lfloor\frac{2^{b-t}-1}{q_{0}}\right\rfloor$ holds. By the argument from the previous paragraph, $r$ must be strictly larger than $\left\lfloor\frac{2^{b-t}-1}{q_{0}}\right\rfloor$ if $n$ is not a multiple of $2^{t}$, since $\left\lfloor\frac{2^{b-t}-1}{q_{0}}\right\rfloor$ is at most of $(b-t)$-bits. Thus, we only need to care about the case when $n$ is a multiple of $2^{t}$. In that case, write $n=2^{t}n_{0}$, then $r$ is in fact equal to

\[\frac{(nm\ \operatorname{mod}\ 2^{b})}{2^{t}} = \frac{nm - \lfloor nm/2^{b}\rfloor 2^{b}}{2^{t}} = n_{0}m - \left\lfloor \frac{n_{0}m}{2^{b-t}}\right\rfloor 2^{b-t} = (n_{0}m\ \operatorname{mod}\ 2^{b-t}),\]

since the lowest $t$-bits of $(nm\ \operatorname{mod}\ 2^{b})$ are all zero. By the discussion of the case when $q$ is odd, we know that $n_{0}$ is a multiple of $q_{0}$ if and only if $(n_{0}m\ \operatorname{mod}\ 2^{b-t})$ is at most $\left\lfloor\frac{2^{b-t}-1}{q_{0}}\right\rfloor$, thus finishing the proof of the claim.

Of course, when $n$ was really a multiple of $q$, the resulting $r$ is precisely $n/q$, by the same reason as in the case of odd $q$. Consequently, we obtain the following version which works for all $q$:

// t, m and threshold_value are precalculated from q.
// t is the number of trailing zero bits of q,
// m is the modular inverse of (q >> t) with respect to 2^(b-t), and
// threshold_value is (0xff....ff / q) + 1.
std::size_t s = 0;
while (true) {
    auto const r = std::rotr(n * m, t); // C++20 
    if (r < threshold_value) {
        n = r;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};
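To make the snippet above concrete, here is a self-contained sketch for $q = 10 = 2^{1}\cdot 5$ and $b = 32$. The names `factored`, `rotate_right` and `remove_trailing_zeros_gm` are hypothetical; `rotate_right` is a hand-written stand-in for `std::rotr` so that the sketch does not depend on C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct factored {
    std::uint32_t quotient;
    std::size_t exponent;
};

// Rotate right by t bits for 0 < t < 32 (same result as C++20 std::rotr).
constexpr std::uint32_t rotate_right(std::uint32_t x, int t) {
    return (x >> t) | (x << (32 - t));
}

// Hypothetical helper specialized to q = 10 and 32-bit inputs.
inline factored remove_trailing_zeros_gm(std::uint32_t n) {
    constexpr int t = 1;                      // trailing zero bits of q = 10
    constexpr std::uint32_t m = 0x4CCCCCCDu;  // inverse of 5 modulo 2^31
    constexpr std::uint32_t threshold_value = 0xFFFFFFFFu / 10 + 1;
    static_assert(((5ull * m) & 0x7FFFFFFFull) == 1, "m must invert 5 modulo 2^31");

    assert(n != 0);
    std::size_t s = 0;
    while (true) {
        std::uint32_t const r = rotate_right(n * m, t);
        if (r >= threshold_value) break;  // n is not a multiple of 10
        n = r;                            // here r == n / 10
        s += 1;
    }
    return {n, s};
}
```

For example, `remove_trailing_zeros_gm(4000000000u)` returns `{4, 9}`; the loop stops at $n = 4$ because the rotated product is then exactly equal to `threshold_value`.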

Lemire’s algorithm

In a paper published in 2019, Lemire et al. proposed an alternative way of checking divisibility which has some advantages over the Granlund-Montgomery algorithm. The theorem behind this algorithm was not optimal when first presented, but later they showed the optimal result in another paper. Here, I present a more general result which contains the result by Lemire et al. as a special case. This is one of the theorems I proved in my paper on Dragonbox (Theorem 4.6), which I copied below:

Theorem 1 Let $x,\xi$ be real numbers and $n_{\max}$ be a positive integer such that
\[\left\lfloor nx \right\rfloor = \left\lfloor n\xi \right\rfloor\]
holds for all $n=1,\ \cdots\ ,n_{\max}$. Then, for a positive real number $\eta$, we have the following.

If $x=\frac{p}{q}$ is a rational number with $2\leq q\leq n_{\max}$, then we have
\[\left\{n\in \{1,\ \cdots\ ,n_{\max}\}\colon nx\in \mathbb{Z}\right\} = \left\{n\in \{1,\ \cdots\ ,n_{\max}\}\colon n\xi - \left\lfloor n\xi\right\rfloor < \eta\right\}\]
if and only if
\[\left\lfloor \frac{n_{\max}}{q} \right\rfloor q(\xi - x) < \eta \leq u(\xi - x) + \frac{1}{q}\]
holds, where $u$ is the smallest positive integer such that $up\equiv 1\ (\operatorname{mod}\ q)$.

If $x$ is either an irrational number or a rational number with the denominator strictly greater than $n_{\max}$, then we have
\[\left\{n\in \{1,\ \cdots\ ,n_{\max}\} \colon nx\in \mathbb{Z}\right\} = \emptyset = \left\{n\in \{1,\ \cdots\ ,n_{\max}\} \colon n\xi - \left\lfloor n\xi\right\rfloor < \eta\right\}\]
if and only if
\[\eta \leq q_{*}\xi - p_{*}\]
holds, where $\frac{p_{*}}{q_{*}}$ is the best rational approximation of $x$ from below with the largest denominator $q_{*}\leq n_{\max}$.

The theorem above shows a necessary and sufficient condition for the product $nx$ to be an integer in terms of a good enough approximation $\xi$ of $x$. The second item is irrelevant in this post, so we only focus on the first item. The point here is that we are going to let $x = \frac{1}{q}$ so that $nx$ is an integer if and only if $n$ is a multiple of $q$. In that case, it is shown in the paper (Remark 2 after Theorem 4.6) that we can always take $\xi = \eta$ as long as $\xi$ satisfies the condition

\[\label{eq:precondition} \left\lfloor \frac{n}{q} \right\rfloor = \left\lfloor n\xi \right\rfloor\]

for all $n=1,\ \cdots\ ,n_{\max}$, so that $n$ is a multiple of $q$ if and only if

\[n\xi - \left\lfloor n\xi \right\rfloor < \xi.\]

Also, another theorem from the paper (Theorem 4.2) shows that a necessary and sufficient condition for having the above is that

\[\frac{1}{q} \leq \xi < \frac{1}{q} + \frac{1}{vq}\]

holds, where $v$ is the greatest integer with $v\equiv -1\ (\operatorname{mod}\ q)$ and $v\leq n_{\max}$, i.e.,

\[v=\left\lfloor \frac{n_{\max} + 1}{q} \right\rfloor q - 1.\]

In practice, we take $\xi = \frac{m}{2^{b}}$ for some positive integers $m,b$ so that

\[n\xi - \lfloor n\xi\rfloor < \xi\]

holds if and only if

\[(nm\ \operatorname{mod}\ 2^{b}) < m.\]

Hence, the above is a necessary and sufficient condition for $n$ to be divisible by $q$, as long as the inequality

\[\frac{1}{q} \leq \frac{m}{2^{b}} < \frac{1}{q} + \frac{1}{vq} = \frac{\left\lfloor (n_{\max}+1)/q\right\rfloor} {\left\lfloor (n_{\max}+1)/q\right\rfloor q - 1}\]

holds. Furthermore, since $\left\lfloor nx\right\rfloor = \left\lfloor n\xi \right\rfloor$ holds, the quotient $\left\lfloor \frac{n}{q}\right\rfloor$ can be also computed as

\[\lfloor n\xi\rfloor = \left\lfloor \frac{nm}{2^{b}} \right\rfloor.\]

In other words, we multiply $n$ by $m$, then the lowest $b$-bits can be used for inspecting divisibility, while the remaining upper bits constitute the quotient.

Note that this is always the case, regardless of whether $n$ is divisible by $q$ or not, so this algorithm actually does more than what we are asking for. Furthermore, there is only one magic number, $m$, while Granlund-Montgomery requires two. This may have some positive impact on the code size, instruction decoding overhead, register usage overhead, etc.

However, one should also note that the cost of these “extra features” is that we must perform a “widening multiplication”, that is, if $n_{\max}$ and $m$ are of $b$-bits, then we need all of $2b$-bits of the multiplication of $n$ and $m$. It is also worth mentioning that the magic number $m$ might actually require more than $b$-bits. For details, please refer to the previous post, or this paper by Lemire et al.

In any case, the resulting algorithm would be like the following:

// m is precalculated from q.
std::size_t s = 0;
while (true) {
    auto const r = widening_mul(n, m);
    if (low_bits(r) < m) {
        n = high_bits(r);
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};
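In the snippet above, `widening_mul`, `low_bits` and `high_bits` are placeholders. One possible way to realize them for $q = 10$ and $b = 32$ is a 64-bit multiplication, as in the sketch below (the names `factored` and `remove_trailing_zeros_lemire` are hypothetical). With the 32-bit magic number $m = \left\lceil 2^{32}/10\right\rceil$ we have $10m - 2^{32} = 4$, so the precondition on $\frac{m}{2^{b}}$ reduces to $4v < 2^{32}$; for that reason this sketch assumes $n\leq 2^{30}$ rather than the full 32-bit range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct factored {
    std::uint32_t quotient;
    std::size_t exponent;
};

// Hypothetical helper for q = 10 with m = ceil(2^32 / 10).
// Assumption of this sketch: 0 < n <= 2^30, see the lead-in for the reason.
inline factored remove_trailing_zeros_lemire(std::uint32_t n) {
    constexpr std::uint32_t n_max = std::uint32_t(1) << 30;
    constexpr std::uint32_t m = 429496730u;
    static_assert(10ull * m - (1ull << 32) == 4, "m must be ceil(2^32 / 10)");

    assert(n != 0 && n <= n_max);
    std::size_t s = 0;
    while (true) {
        std::uint64_t const r = std::uint64_t(n) * m;     // widening multiplication
        auto const low = static_cast<std::uint32_t>(r);   // (n * m) mod 2^32
        if (low >= m) break;                              // n is not a multiple of 10
        n = static_cast<std::uint32_t>(r >> 32);          // floor(n * m / 2^32) == n / 10
        s += 1;
    }
    return {n, s};
}
```

Covering all 32-bit inputs would need a wider magic number, which is the situation alluded to above when saying that $m$ might require more than $b$-bits.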

Generalized modular inverse algorithm

Staring at Theorem 1, one can ask: if all we care about is divisibility, not the quotient, then do we really need to take $x = \frac{1}{q}$? That is, as long as $p$ is any integer coprime to $q$, asking if $n/q$ is an integer is exactly the same question as asking if $\frac{np}{q}$ is an integer. In fact, this observation leads to a derivation of the modular inverse divisibility check algorithm by Granlund-Montgomery explained above for the case of odd divisor, and the Dragonbox paper already did this (Remark 3 after Theorem 4.6). A few days ago, I realized the same argument actually applies to the case of general divisor as well, which leads to yet another divisibility check algorithm explained below, which I think probably is novel.

Suppose that $p$ is a positive integer coprime to $q$, then as pointed out, $n$ is divisible by $q$ if and only if $nx$ is an integer with $x\mathrel{\unicode{x2254}}\frac{p}{q}$. Therefore, by Theorem 1 (together with Theorem 4.2 from the Dragonbox paper), for any $\xi,\eta$ satisfying

\[\label{eq:condition for xi} \frac{p}{q} \leq \xi < \frac{p}{q} + \frac{1}{vq}\]

and

\[\label{eq:condition for eta} \left\lfloor \frac{n_{\max}}{q} \right\rfloor q\left(\xi - \frac{p}{q}\right) < \eta \leq u\left(\xi - \frac{p}{q}\right) + \frac{1}{q},\]

a given $n=1,\ \cdots\ ,n_{\max}$ is divisible by $q$ if and only if

\[\label{eq:integer check condition} n\xi - \left\lfloor n\xi \right\rfloor < \eta,\]

where $v$ is the greatest integer satisfying $vp\equiv -1\ (\operatorname{mod}\ q)$ and $v\leq n_{\max}$, and $u$ is the smallest positive integer satisfying $up\equiv 1\ (\operatorname{mod}\ q)$.

Again, since our goal is to make the evaluation of the inequality $\eqref{eq:integer check condition}$ as easy as possible, we may want to take $\xi=\frac{m}{2^{b}}$ and $\eta=\frac{s}{2^{b}}$ as before, so that $\eqref{eq:integer check condition}$ becomes

\[(nm\ \operatorname{mod}\ 2^{b}) < s.\]

Although that is what we will do eventually, let us consider a little bit more general case that $\xi=\frac{m}{N}$ and $\eta=\frac{s}{N}$ for any positive integer $N$, not necessarily of the form $2^{b}$, for the sake of ~~pedantry~~ clarity of what is going on. Of course, in this case $\eqref{eq:integer check condition}$ is equivalent to

\[(nm\ \operatorname{mod}\ N) < s.\]

With this setting, we can also rewrite $\eqref{eq:condition for xi}$ and $\eqref{eq:condition for eta}$ as

\[\label{eq:condition for xi modified} 0 \leq qm - Np < \frac{N}{v}\]

and

\[\label{eq:condition for eta modified} \left\lfloor \frac{n_{\max}}{q} \right\rfloor \left(qm - Np\right) < s \leq \frac{u}{q}\left(qm - Np\right) + \frac{N}{q},\]

respectively. Note that both sides of the above inequality have the factor $qm - Np$: the left-hand side multiplies it by $\left\lfloor\frac{n_{\max}}{q}\right\rfloor$ and the right-hand side multiplies it by $\frac{u}{q}$. Since $u$ is at most $q-1$, usually $\left\lfloor\frac{n_{\max}}{q}\right\rfloor$ is much larger than $\frac{u}{q}$, so it is usually the case that the inequality can hold only when the factor $qm - Np$ is small enough. So, it makes sense to actually minimize it. Note that we are to take $N = 2^{b}$ and $b$ is a parameter defined by the application, while $m$ and $p$ are something we can choose. In such a situation, the smallest possible positive value of $qm - Np$ is exactly $g\mathrel{\unicode{x2254}} \operatorname{gcd}(q,N)$, the greatest common divisor of $q$ and $N$. Recall that a general solution for the equation $qm - Np = g$ is given as

\[m = m_{0} + \frac{Nk}{g},\quad p = p_{0} + \frac{qk}{g}\]

where $m_{0}$ is the modular inverse of $\frac{q}{g}$ with respect to $\frac{N}{g}$, $p_{0}$ is the unique integer satisfying $qm_{0} - Np_{0} = g$, and $k$ is any integer.

Now, we cannot just take any $k\in\mathbb{Z}$, because this whole argument breaks down if $p$ and $q$ are not coprime. In particular, $p_{0}$ is not guaranteed to be coprime to $q$. For example, take $N=32$ and $q=30$, then we get $g=2$, $m_{0}=15$ and $p_{0}=14$ so that $qm_{0} - Np_{0} = 2$, but $\operatorname{gcd}(p_{0},q) = 2$. Nevertheless, it is always possible to find some $k$ such that $p = p_{0} + \frac{qk}{g}$ is coprime to $q$. The proof I wrote below of this fact is due to Seung uk Jang:

Proposition 2 Let $a,b$ be any integers and $g\mathrel{\unicode{x2254}}\operatorname{gcd}(a,b)$. Then there exist integers $x,y$ such that $ax - by = g$ and $\operatorname{gcd}(a,y) = 1$. Specifically, such $x,y$ can be found as
\[x = x_{0} + \frac{bk}{g},\quad y = y_{0} + \frac{ak}{g},\]
where $x_{0}$ is the modular inverse of $\frac{a}{g}$ with respect to $\frac{b}{g}$, $y_{0} = \frac{ax_{0} - g}{b}$, and $k$ is the smallest nonnegative integer satisfying
\[\left(\frac{a}{g}\right)k \equiv 1 - y_{0} \ \left(\operatorname{mod}\ \frac{g}{\operatorname{gcd}(a/g,g)}\right).\]

Proof. Take $x,y,k$ as given above. Note that $\frac{a}{g}$ is coprime to $g/\operatorname{gcd}(a/g,g)$, so such $k$ uniquely exists. Then we have
\[y\equiv y_{0} + (1-y_{0}) \equiv 1 \ \left(\operatorname{mod}\ \frac{g}{\operatorname{gcd}(a/g,g)}\right),\]
so $y$ is coprime to $g/\operatorname{gcd}(a/g,g)$. On the other hand, from
\[\left(\frac{a}{g}\right)x_{0} - \left(\frac{b}{g}\right)y_{0} = 1,\]
we know that $y_{0}$ is coprime to $\frac{a}{g}$, thus $y$ is coprime to $\frac{a}{g}$. This means that $y$ is coprime to $a$. Indeed, let $p$ be any prime factor of $a$, then $p$ should divide either $\frac{a}{g}$ or $g$. If $p$ divides $\frac{a}{g}$, then it cannot divide $y$ since $y$ is coprime to $\frac{a}{g}$. Also, if $p$ divides $g$ but not $\frac{a}{g}$, then it divides $g/\operatorname{gcd}(a/g,g)$ which is again coprime to $y$. $\quad\blacksquare$

For instance, applying this proposition to our example $N=32$, $q=30$ yields

\[m = 15 + 16k,\quad p = 14 + 15k\]

with $k$ being the smallest nonnegative integer satisfying

\[15k \equiv 1 - 14\ (\operatorname{mod}\ 2),\]

which is $k=1$. Hence, we take $m = 31$ and $p = 29$.

In fact, for the specific case of $N=2^{b}$ and $q\leq N$, we can always take either $k=0$ or $1$. Indeed, by the assumption, $g$ is a power of $2$ and $q/g$ is an odd number. Then, one of $p_{0}$ and $p_{0} + \frac{q}{g}$ is an odd number, so one of them is coprime to $g$. On the other hand, since

\[\left(\frac{q}{g}\right)m_{0} - \left(\frac{N}{g}\right)p_{0} = 1,\]

we know $p_{0}$ is coprime to $\frac{q}{g}$, so both $p_{0}$ and $p_{0} + \frac{q}{g}$ are coprime to $\frac{q}{g}$. Therefore, the odd one between $p_{0}$ and $p_{0}+\frac{q}{g}$ must be coprime to $q$.

In any case, let us assume from now on that we have chosen $m$ and $p$ in a way that $\operatorname{gcd}(p,q) = 1$ and $qm - Np = g$. Then $\eqref{eq:condition for xi modified}$ and $\eqref{eq:condition for eta modified}$ can be rewritten as

\[v < \frac{N}{g}\]

and

\[\left\lfloor \frac{n_{\max}}{q} \right\rfloor g < s \leq \frac{ug + N}{q},\]

respectively. Next, we claim that $\frac{ug+N}{q}$ is an integer. Indeed, we have

\[(ug+N)p \equiv g + Np \equiv qm \equiv 0 \ (\operatorname{mod}\ q),\]

and since $p$ is coprime to $q$, $ug+N$ must be a multiple of $q$. Hence, the most sensible choice of $s$ in this case is $s = \frac{ug+N}{q}$, and let us assume that from now on. Then, the left inequality above can be rewritten as

\[\left\lfloor\frac{n_{\max}}{q}\right\rfloor q - u < \frac{N}{g}.\]

Actually, this inequality follows from $v < \frac{N}{g}$, so is redundant, because $v$ can be written in terms of $u$ as

\[v = \left\lfloor \frac{n_{\max} + u}{q}\right\rfloor q - u.\]

Plugging in the above equation into $v < \frac{N}{g}$, we get

\[\left\lfloor\frac{n_{\max} + u}{q}\right\rfloor < \frac{(N/g) + u}{q},\]

or equivalently,

\[\label{eq:nmax condition} n_{\max} \leq \left\lfloor \frac{(N/g)+u}{q}\right\rfloor q + q - 1 - u.\]

Therefore, as long as the above is true, we can check divisibility of $n=1,\ \cdots\ ,n_{\max}$ by inspecting the inequality

\[(nm\ \operatorname{mod}\ N) < \frac{ug+N}{q}.\]

Furthermore, if $n$ turned out to be a multiple of $q$, then we can compute $n/q$ from $(nm\ \operatorname{mod}\ N)$ as in the classical Granlund-Montgomery case. More precisely, assume that $n=aq$ for some integer $a\geq 0$, then

\[\begin{align*} (nm\ \operatorname{mod}\ N) &= aqm - \left\lfloor \frac{aqm}{N}\right\rfloor N = aqm - \left(\left\lfloor \frac{a(qm - Np)}{N}\right\rfloor + ap\right)N \\ &= a(qm - Np) - \left\lfloor \frac{ag}{N}\right\rfloor N = ag - \left\lfloor \frac{ag}{N}\right\rfloor N. \end{align*}\]

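We show that $ag < N$ always holds as long as $1 < q < N$ and $\eqref{eq:nmax condition}$ holds. Indeed, the right-hand side of $\eqref{eq:nmax condition}$ satisfies

\[\begin{align*} \left\lfloor \frac{(N/g)+u}{q} \right\rfloor q + q-1-u &\leq \frac{(N/g)+u}{q} \cdot q + q - 1 - u \\ &= \frac{N}{g} + q - 1, \end{align*}\]

and

\[\frac{qN}{g} - \left(\frac{N}{g} + q - 1\right) = \left(\frac{N}{g} - 1\right)\left(q - 1\right) > 0\]

by the condition $1 < q < N$, so $aq = n \leq n_{\max} < \frac{qN}{g}$ and hence $ag < N$. Therefore, we get

\[(nm\ \operatorname{mod}\ N) = ag,\]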

thus $a$ can be recovered by computing $\frac{(nm\ \operatorname{mod}\ N)}{g}$. Note that in particular if $N=2^{b}$, division by $g$ is just a bit-shift.

Compared to the method proposed by Granlund-Montgomery, bit-rotation is never needed, but at the expense of requiring an additional shift for computing $n/q$. Note that this shift is only needed if $n$ turns out to be a multiple of $q$. The divisibility check alone can be done with one multiplication and one comparison.

So far it sounds like this new method is better than the classical Granlund-Montgomery algorithm based on bit-rotation, but note that the maximum possible value of $n_{\max}$ (i.e., the right-hand side of $\eqref{eq:nmax condition}$) is roughly of size $N/g$, so in particular, if $N=2^{b}$ and $q = 2^{t}q_{0}$ for some odd integer $q_{0}$, then $n_{\max}$ should be of at most about $(b-t)$-bits. Depending on the specific parameters, it is possible to improve the right-hand side of $\eqref{eq:nmax condition}$ a little bit by choosing a different $p$ (since $u$ is determined from $p$), but this cannot have any substantial impact on how large $n_{\max}$ can be. Also, at this moment I do not know of an elegant way of choosing $p$ that maximizes the bound on $n_{\max}$.

In summary, the algorithm works as follows, assuming $N = 2^{b}$ and $1 < q < 2^{b}$:

Write $q = 2^{t}q_{0}$ for an odd integer $q_{0}$.
Let $m_{0}$ be the modular inverse of $q_{0}$ with respect to $2^{b-t}$, and let $p_{0}\mathrel{\unicode{x2254}} (q_{0}m_{0}-1)/2^{b-t}$.
If $p_{0}$ is odd, let $p\mathrel{\unicode{x2254}} p_{0}$, otherwise, let $p\mathrel{\unicode{x2254}} p_{0} + q_{0}$.
Let $m\mathrel{\unicode{x2254}} (2^{b-t}p + 1)/q_{0}$.
Let $u$ be the modular inverse of $p$ with respect to $q$.
Then for any $n=0,1,\ \cdots\ ,\left\lfloor (2^{b-t}+u)/q\right\rfloor q + q - 1 - u$, $n$ is a multiple of $q$ if and only if
\[(nm\ \operatorname{mod}\ 2^{b}) < \frac{2^{b-t}+u}{q_{0}}.\]
In case the above inequality holds, we also have
\[\frac{n}{q} = \frac{(nm\ \operatorname{mod}\ 2^{b})}{2^{t}}.\]

The corresponding code for factoring out the highest power of $q$ would look like the following:

// m, threshold_value and t are precalculated from q.
std::size_t s = 0;
while (true) {
    auto const r = n * m;
    if (r < threshold_value) {
        n = r >> t;
        s += 1;
    }
    else {
        break;
    }
}
return {n, s};

One should keep in mind that the above only works provided $n$ is at most $\left\lfloor (2^{b-t}+u)/q\right\rfloor q + q - 1 - u$, which is roughly equal to $2^{b-t}$.
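To make the steps above concrete, here is a minimal sketch for $q=10$ and $b=32$. The constants are worked out by hand from the summary ($t=1$, $q_{0}=5$, $p=3$, $u=7$), and the names `magic`, `shift`, `n_max` and `remove_trailing_zeros` are only illustrative, not taken from any existing library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Parameters for q = 10 and b = 32, derived by hand from the steps above:
//   t = 1, q0 = 5, m0 = 5^{-1} mod 2^31 = 1288490189,
//   p0 = (5 * m0 - 1) / 2^31 = 3 (odd, so p = 3),
//   m = (2^31 * 3 + 1) / 5 = 1288490189, u = 3^{-1} mod 10 = 7.
constexpr std::uint32_t shift = 1;                     // t
constexpr std::uint32_t magic = 1288490189u;           // m
constexpr std::uint32_t threshold_value = 429496731u;  // (2^31 + 7) / 5
constexpr std::uint32_t n_max = 2147483652u;  // floor((2^31 + 7) / 10) * 10 + 10 - 1 - 7

// Sanity checks: qm - Np = g = 2^t, and q0 * threshold = 2^(b-t) + u.
static_assert(std::uint64_t{10} * magic - (std::uint64_t{3} << 32) == 2, "qm - Np must be 2^t");
static_assert(std::uint64_t{5} * threshold_value == (std::uint64_t{1} << 31) + 7, "threshold");

// Hypothetical helper: returns {n with all factors of 10 removed, number of removed factors}.
std::pair<std::uint32_t, std::size_t> remove_trailing_zeros(std::uint32_t n) {
    assert(n != 0 && n <= n_max);  // n = 0 would never leave the loop
    std::size_t s = 0;
    while (true) {
        std::uint32_t const r = n * magic;  // unsigned arithmetic wraps modulo 2^32
        if (r < threshold_value) {
            n = r >> shift;  // n / 10
            s += 1;
        } else {
            break;
        }
    }
    return {n, s};
}
```

Note that $n=7=u$ produces exactly the threshold value, so the strictness of the comparison matters.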

Benchmark and conclusion

Determining the most efficient approach inherently depends on a multitude of factors. To gain insight into the relative performances of various algorithms, I conducted a benchmark. Given that my primary application involves removing trailing zeros, I set the divisor $q$ to $10$ for this benchmark. Additionally, considering the requirements of Dragonbox, where trailing zero removal may only be necessary for numbers up to $8$ digits for IEEE-754 binary32 and up to $16$ digits for IEEE-754 binary64, I incorporated these assumptions in the benchmark to determine the optimal parameters for each algorithm.

Here is the data I collected on my laptop (Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, Windows 10):

32-bit benchmark for numbers with at most 8 digits.

Algorithms	Average time consumed per sample
Null (baseline)	1.4035ns
Naïve	12.7084ns
Granlund-Montgomery	11.8153ns
Lemire	12.2671ns
Generalized Granlund-Montgomery	11.2075ns
Naïve 2-1	8.92781ns
Granlund-Montgomery 2-1	7.85643ns
Lemire 2-1	7.60924ns
Generalized Granlund-Montgomery 2-1	7.85875ns
Naïve branchless	3.30768ns
Granlund-Montgomery branchless	2.52126ns
Lemire branchless	2.71366ns
Generalized Granlund-Montgomery branchless	2.51748ns

64-bit benchmark for numbers with at most 16 digits.

Algorithms	Average time consumed per sample
Null (baseline)	1.68744ns
Naïve	16.5861ns
Granlund-Montgomery	14.1657ns
Lemire	14.3427ns
Generalized Granlund-Montgomery	15.0626ns
Naïve 2-1	13.2377ns
Granlund-Montgomery 2-1	11.3316ns
Lemire 2-1	11.6016ns
Generalized Granlund-Montgomery 2-1	11.8173ns
Naïve 8-2-1	12.5984ns
Granlund-Montgomery 8-2-1	11.0704ns
Lemire 8-2-1	13.3804ns
Generalized Granlund-Montgomery 8-2-1	11.1482ns
Naïve branchless	5.68382ns
Granlund-Montgomery branchless	4.0157ns
Lemire branchless	4.92971ns
Generalized Granlund-Montgomery branchless	4.64833ns

(The code is available here.)

And here are some details on how the benchmark was done:

Samples were generated randomly using the following procedure:
- Uniformly randomly generate the total number of digits, ranging from 1 to the specified maximum number of digits.
- Given the total number of digits, uniformly randomly generate the number of trailing zeros, ranging from 0 to the total number of digits minus 1.
- Uniformly randomly generate an unsigned integer with given total number of digits and the number of trailing zeros.
Generalized Granlund-Montgomery refers to the generalized modular inverse algorithm explained in the last section.
Algorithms without any suffix iteratively remove trailing zeros one by one, as demonstrated by the code snippets in the previous sections.
Algorithms suffixed with “2-1” initially attempt to iteratively remove two consecutive trailing zeros at once (by running the loop with $q=100$), and then remove one more zero if necessary.
Algorithms suffixed with “8-2-1” first check if the input contains at least eight trailing zeros (using the corresponding divisibility check algorithm with $q=10^{8}$), and if that is the case, then remove eight zeros and invoke the 32-bit “2-1” variants of themselves. If there are fewer than eight trailing zeros, then they proceed like their “2-1” variants.
Algorithms suffixed with “branchless” do branchless binary search, as suggested by reddit users r/pigeon768 and r/TheoreticalDumbass. (See this reddit post.)
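The sampling procedure described in the first item above could be sketched roughly as follows. This is not the actual benchmark code. The names `pow10_u64` and `generate_sample` are made up for illustration, and I am assuming that "the number of trailing zeros" means exactly that many zeros, so the last significant digit is forced to be nonzero.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// 10^e for 0 <= e <= 19.
std::uint64_t pow10_u64(int e) {
    assert(0 <= e && e <= 19);
    std::uint64_t r = 1;
    for (int i = 0; i < e; ++i) r *= 10;
    return r;
}

// Hypothetical sample generator following the three steps above.
// Assumes 1 <= max_digits <= 19 so that every sample fits in 64 bits.
std::uint64_t generate_sample(std::mt19937_64& rng, int max_digits) {
    assert(1 <= max_digits && max_digits <= 19);
    // Step 1: total number of digits, uniform in [1, max_digits].
    int const digits = std::uniform_int_distribution<int>{1, max_digits}(rng);
    // Step 2: number of trailing zeros, uniform in [0, digits - 1].
    int const zeros = std::uniform_int_distribution<int>{0, digits - 1}(rng);
    // Step 3: uniform (digits - zeros)-digit head whose last digit is nonzero,
    // obtained by rejection sampling, followed by the trailing zeros.
    int const significant = digits - zeros;
    std::uniform_int_distribution<std::uint64_t> head_dist{
        pow10_u64(significant - 1), pow10_u64(significant) - 1};
    std::uint64_t head;
    do {
        head = head_dist(rng);
    } while (head % 10 == 0);
    return head * pow10_u64(zeros);
}
```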

It seems there is not much difference between the three algorithms overall, and even the naïve one is not so bad. There was notable fluctuation across repeated runs and the top performer varied from run to run, but all three consistently outperformed the naïve approach, and I could observe a certain pattern.

Firstly, Lemire’s algorithm seems to suffer for large divisors, for instance $q=10^{8}$ or maybe even $q=10^{4}$. This is probably because for large divisors the number of bits needed for a correct divisibility test often exceeds the word size. This means that comparing the low bits of the result of multiplication with the threshold value is not a simple single comparison in reality.

For instance, “Lemire 8-2-1” and “Lemire branchless” algorithms from the benchmark use $80$ bits for checking divisibility by $q=10^{8}$ for inputs of up to $16$ decimal digits. This means that, given the input is passed as a $64$-bit unsigned integer, we perform widening $128$-bit multiplication with a $64$-bit magic number $m$ (whose value is $12089258196146292$ in this case), and we check two things to decide divisibility: that the lowest $16$ bits from the upper half of the $128$-bit multiplication result are all zero, and that the lower half of this $128$-bit multiplication result is strictly less than $m$. Actually, the minimum required number of bits for a correct divisibility test is $78$ assuming inputs are limited to $16$ decimal digits, and I opted for $2$ more bits to facilitate the extraction of $16$ bits from the upper $64$ bits rather than $14$ bits.

(Note also that having to have more bits than the word size means that, even if all we care about is divisibility without any need to determine the quotient, Lemire’s algorithm may still need widening multiplication!)

Secondly, it seems on x86-64 the classical Granlund-Montgomery algorithm is just better than the proposed generalization of it, especially for the branchless case. Note that ror and shr have identical performance (reference), so for the branchless case, having to bit-shift only when the input is divisible turns out to be a pessimization, because we need to evaluate the bit-shift regardless of the divisibility, and as a result it just ends up requiring an additional mov compared to the unconditional bit-rotation of classical Granlund-Montgomery algorithm. Even for the branchful case, it seems the compiler still tries to evaluate the bit-shift regardless of the divisibility and as a consequence generates one more mov.

My conclusion is that the classical Granlund-Montgomery is probably the best for trailing zero removal, at least on x86-64. Yet, the proposed generalized modular inverse algorithm may be a better choice on machines without a ror instruction, or machines with a smaller word size. Lemire’s algorithm does not seem to offer any big advantage over the other two on x86-64, and I expect that the cost of widening multiplication may overwhelm the advantage of having only one magic number on machines with a smaller word size.

On the other hand, Lemire’s algorithm is still very useful if I have to know the quotient regardless of divisibility. There indeed is such an occasion in Dragonbox, where I am already leveraging Lemire’s algorithm for this purpose.

Special thanks to reddit users r/pigeon768 and r/TheoreticalDumbass who proposed the branchless idea!

On the optimal bounds for integer division by constants

2023-08-10T00:00:00-07:00

It is well known that integer division is quite a heavy operation on modern CPUs - so slow, in fact, that it has even become common wisdom to avoid doing it at ALL cost in performance-critical sections of a program. I do not know why division is particularly hard to optimize from the hardware perspective. I am just guessing: maybe (1) every general algorithm is essentially just a minor variation of good old long division, (2) which is almost impossible to parallelize. But whatever, that is not the topic of this post.

Rather, this post is about some of the common optimization techniques for circumventing integer division. To be more precise, the post can be roughly divided into two parts. The first part discusses the well-known Granlund-Montgomery style multiply-and-shift technique and some associated issues. The second part is about my recent research on a related multiply-add-and-shift technique. In that part, I establish an optimal bound, which perhaps is a novel result. It would be intriguing to see whether modern compilers can take advantage of this new bound.

Note that by integer division, we specifically mean the computation of the quotient and/or the remainder, rather than evaluating the result as a real number. More specifically, throughout this entire post, integer division will always mean taking the quotient unless specified otherwise. Also, I will confine myself to divisions of positive integers, even though divisions of negative integers hold practical significance as well. Finally, all assembly code provided herein is for the x86-64 architecture.

Turning an integer division into a multiply-and-shift

One of the most widely used techniques is converting division into a multiplication when the divisor is a known constant (or remains mostly unchanged). The idea behind this approach is quite simple. For instance, dividing by $4$ is equivalent to multiplying by $0.25$, which can be further represented as multiplying by $25$ and then dividing by $100$, where dividing by $100$ is simply a matter of moving the decimal point to the left by two positions. Since we are only interested in taking the quotient, this means throwing away the last two digits.

In this particular instance, $4$ is a divisor of $100$ so we indeed have a fairly concise representation of this kind, but in general the divisor might not divide a power of $10$. Let us take $7$ as an example. In this case, we cannot write $\frac{1}{7}$ as $\frac{m}{10^{k}}$ for some positive integers $m,k$. However, we can still come up with a good approximation. Note that

\[\frac{1}{7} = 0.142857142857\cdots,\]

so presumably something like $\frac{142858}{1000000}$ would be a good enough approximation of $\frac{1}{7}$. Taking that as our approximation, we may want to compute $n/7$ by multiplying $n$ by $142858$ and then throwing away the last $6$ digits. (Note that we are taking $142858$ instead of $142857$ because the latter already fails when $n=7$. In general, we must take the ceiling, not the floor or half-up rounding.)

This indeed gives the right answer for all $n=1,\ \cdots\ ,166668$, but it starts to produce a wrong answer at $n=166669$, where the correct answer is $23809$, whereas our method produces $23810$. And of course such a failure is expected. Given that we are using an approximation with nonzero error, it is inevitable that the error will eventually manifest as the dividend $n$ grows large enough. But, the question at hand is, can we estimate how far it will go? Or, how to choose a good enough approximation guaranteed to work correctly when there is a given limit on how big our $n$ can be?
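The claim about where the approximation first breaks is easy to confirm by brute force. Below is a small sketch doing so. The function name `first_failure` is just for illustration, and it assumes that $n\cdot m$ does not overflow $64$ bits, which is clearly the case for the numbers considered here.

```cpp
#include <cassert>
#include <cstdint>

// Returns the smallest n in [1, limit] for which floor(n * m / 10^6)
// differs from floor(n / d), or 0 if there is no such n.
// Assumes limit * m fits in 64 bits.
std::uint64_t first_failure(std::uint64_t d, std::uint64_t m, std::uint64_t limit) {
    assert(d != 0);
    for (std::uint64_t n = 1; n <= limit; ++n) {
        if (n * m / 1000000 != n / d) {
            return n;
        }
    }
    return 0;
}
```

Calling `first_failure(7, 142858, 1000000)` returns $166669$, while `first_failure(7, 142857, 1000000)` already returns $7$, which is why the ceiling is needed.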

Obviously, our intention is to implement this concept in a computer program, so the denominators of the approximations will be powers of $2$, not $10$. So, for a given positive integer $d$, our goal is to find a good enough approximation of $\frac{1}{d}$ of the form $\frac{m}{2^{k}}$. Perhaps one of the most widely known formal results in this realm is the following theorem by Granlund-Montgomery:

Theorem 1 (Granlund-Montgomery, 1994).

Suppose $m$, $d$, $k$ are nonnegative integers such that $d\neq 0$ and
\[2^{N+k} \leq md \leq 2^{N+k}+2^{k}.\]
Then $\left\lfloor n/d \right\rfloor = \left\lfloor mn/2^{N+k} \right\rfloor$ for every integer $n$ with $0\leq n< 2^{N}$.

Here, $d$ is the given divisor and we are supposed to approximate $\frac{1}{d}$ by $\frac{m}{2^{N+k}}$. An assumption here is that we want to perform the division $n/d$ for all $n$ from $0$ to $2^{N}-1$, where $N$ is supposed to be the bit-width of the integer data type under consideration. In this setting, the mentioned theorem establishes a sufficient condition under which we can calculate the quotient of $n/d$ by first multiplying $n$ by $m$ and subsequently right-shifting the result by $(N+k)$-bits. Note that we will need more than $N$-bits; since our dividend is of $N$-bits, the result of the multiplication $mn$ will need to be stored in $2N$-bits. Since we are shifting by $(N+k)$-bits, the lower half of the result is actually not needed, and we just take the upper half and shift it by $k$-bits.

We will not talk about the proof of this theorem because a much more general theorem will be presented in the next section. But let us see how results like this are being applied in the wild. For example, consider the following C++ code:

std::uint64_t div(std::uint64_t n) noexcept {
    return n / 17;
}

Modern compilers are well-aware of these Granlund-Montgomery-style division tricks. Indeed, my compiler (clang) adeptly harnessed such tactics, as demonstrated by its translation of the code above into the following lines of assembly instructions:

div(unsigned long):
        mov     rax, rdi
        movabs  rcx, -1085102592571150095
        mul     rcx
        mov     rax, rdx
        shr     rax, 4
        ret

(Check it out!)

Note that we have $N=64$ and $d=17$ in this case, and the magic constant $-1085102592571150095$ you can see in the second line, interpreted as an unsigned value $17361641481138401521$, is the $m$ appearing in the theorem. You can also see in the fifth line that $k$ in this case is equal to $4$. What the assembly code is doing is as follows:

Load the $64$-bit input $n$ into the $64$-bit rax register.
Load the $64$-bit constant $17361641481138401521$ into the $64$-bit rcx register.
Multiply them together, then the result is of $128$-bits. The lower half of this is stored back into rax, and the upper half is stored in the rdx register. But we do not care about the lower half, so throw it away and copy the upper half back into rax.
Shift the result to the right by $4$-bits. The result must be equal to $n/17$.

We can easily check that these $m$ and $k$ chosen by the compiler indeed satisfy the inequalities in the theorem:

\[\begin{aligned} 295147905179352825856 = 2^{64+4} &\leq md \\ &= 17361641481138401521\cdot 17 \\ &= 295147905179352825857 \\ &\leq 2^{64+4} + 2^{4} \\ &= 295147905179352825872. \end{aligned}\]
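In C++ terms, the instruction sequence above corresponds to something like the following sketch. The helper names `mul_high`, `multiply_shift` and `div17` are mine, and `unsigned __int128` is a GCC/clang extension rather than standard C++, so treat this as an illustration under that assumption.

```cpp
#include <cassert>
#include <cstdint>

// Upper 64 bits of the 128-bit product a * b (what mul leaves in rdx).
// unsigned __int128 is a GCC/clang extension, not standard C++.
std::uint64_t mul_high(std::uint64_t a, std::uint64_t b) {
    return static_cast<std::uint64_t>((static_cast<unsigned __int128>(a) * b) >> 64);
}

// floor(n * m / 2^(64 + k)): the multiply-and-shift of Theorem 1 with N = 64.
std::uint64_t multiply_shift(std::uint64_t n, std::uint64_t m, unsigned k) {
    assert(k < 64);  // shifting a 64-bit value by 64 or more is undefined
    return mul_high(n, m) >> k;
}

// n / 17 with the magic constant and shift amount chosen by the compiler.
std::uint64_t div17(std::uint64_t n) {
    return multiply_shift(n, 17361641481138401521ull, 4);
}
```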

We have not yet seen how to actually find $m$ and $k$ using a result like the theorem above, but before asking such a question, let us instead ask: “is this condition on $m$ and $k$ the best possible one?”

And the answer is: No.

Here is an example: take $N=32$ and $d=102807$. In this case, the smallest $k$ that allows an integer $m$ that satisfies

\[2^{N+k} \leq md \leq 2^{N+k} + 2^{k}\]

to exist is $k=17$, and in that case the unique $m$ satisfying the above is $5475793997$. This is kind of unfortunate, because the magic constant $m=5475793997$ is of $33$-bits, so the computation of $nm$ cannot be done inside $64$-bits. However, it turns out that we can take $k=16$ and $m=2737896999$ and still the equality $\left\lfloor \frac{n}{d}\right\rfloor = \left\lfloor \frac{nm}{2^{N+k}}\right\rfloor$ holds for all $n=0,\ \cdots\ ,2^{N}-1$, although the above inequality is not satisfied in this case. Now, the new constant $2737896999$ is of $32$-bits, so we can do our computation inside $64$-bits. This might result in a massive difference in practice!
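Here is a small sketch of what the $32$-bit constant buys us: the whole computation stays inside $64$-bit arithmetic. The names `div102807` and `check_range` are only illustrative. The second `static_assert` records that $md = 2^{48} + 2^{16} + 1$, so the upper bound in Theorem 1 is missed by exactly one.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t divisor = 102807;
constexpr std::uint64_t magic = 2737896999;  // m, fits in 32 bits
constexpr unsigned total_shift = 48;         // N + k with N = 32 and k = 16

static_assert(magic < (std::uint64_t{1} << 32), "magic must fit in 32 bits");
// md exceeds the Granlund-Montgomery upper bound 2^48 + 2^16 by exactly one.
static_assert(magic * divisor == (std::uint64_t{1} << 48) + (std::uint64_t{1} << 16) + 1,
              "md = 2^48 + 2^16 + 1");

// floor(n / 102807) using only a 64-bit multiplication and a shift.
std::uint32_t div102807(std::uint32_t n) {
    return static_cast<std::uint32_t>((std::uint64_t{n} * magic) >> total_shift);
}

// Compares against the built-in division on the inclusive range [first, last].
void check_range(std::uint32_t first, std::uint32_t last) {
    for (std::uint64_t n = first; n <= last; ++n) {
        assert(div102807(static_cast<std::uint32_t>(n)) == n / divisor);
    }
}
```

Running `check_range(0, 4294967295u)` walks through every $32$-bit input; a shorter range near the top is enough for a quick spot check.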

It seems that even the most recent version of GCC (13.2) is still not aware of this, while clang knows that the above $m$ and $k$ work. (In the link provided, GCC is actually trying to compute $\left\lfloor \frac{nm}{2^{N+k}} \right\rfloor$ with the $33$-bit constant $m = 5475793997$. Detailed explanation will be given in a later section.)

Then what is the best possible condition? I am not sure who was the first to find the optimal bound, but at least it seems that such a bound is written in the famous book Hacker’s Delight by H. S. Warren Jr. Also, recently (in 2021), Lemire et al. showed the optimality of an equivalent bound. (EDIT: according to a Reddit user, there was a report written by H. S. Warren Jr in 1992, even earlier than Granlund-Montgomery, which contained the optimal bound with the proof of optimality.)

I will not write down the optimal bounds obtained by these authors, because I will present a more general result proved by myself in the next section.

Here is one remark before getting into the next section. The aforementioned results on the optimal bound work for $n$ from the range $\left\{1,\ \cdots\ ,n_{\max}\right\}$ where $n_{\max}$ is not necessarily of the form $2^{N}-1$. However, even recent compilers do not seem to leverage this fact. For example, let us look at the following code:

std::uint64_t div(std::uint64_t n) {
    [[assume(n < 10000000000)]];
    return n / 10;
}

Here we are relying on a new language feature added in C++23: assume. GCC generated the following lines of assembly:

div(unsigned long):
        movabs  rax, -3689348814741910323
        mul     rdi
        mov     rax, rdx
        shr     rax, 3
        ret

(Check it out!)

My gripe with this is the generation of a superfluous shr instruction. GCC seems to think that $k$ must be at least $67$ (which is why it shifted by $3$-bits, after throwing away the $64$-bit lower half), but actually $k=64$ is fine with the magic number $m=1844674407370955162$ thanks to the bound on $n$, in which case we do not need this additional shifting.
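To illustrate what the compiler could have done, here is a sketch of the shift-free version. The function name `div10_bounded` is made up, and `unsigned __int128` is again a GCC/clang extension. The constant is $\left\lceil 2^{64}/10 \right\rceil$, and taking only the upper half of the product is correct under the stated bound on $n$.

```cpp
#include <cassert>
#include <cstdint>

// floor(n / 10) for n < 10^10, using k = 64: only the upper half of the
// 128-bit product is taken, with no further shift.
// unsigned __int128 is a GCC/clang extension, not standard C++.
std::uint64_t div10_bounded(std::uint64_t n) {
    assert(n < 10000000000ull);  // the same assumption as in the snippet above
    constexpr std::uint64_t m = 1844674407370955162ull;  // ceil(2^64 / 10)
    return static_cast<std::uint64_t>((static_cast<unsigned __int128>(n) * m) >> 64);
}
```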

How about clang? Unlike GCC, clang currently does not seem to understand this new language feature assume. But it has an equivalent language extension __builtin_assume, so I tried with that:

std::uint64_t div(std::uint64_t n) {
    __builtin_assume(n < 10000000000);
    return n / 10;
}

div(unsigned long):
        mov     rax, rdi
        movabs  rcx, -3689348814741910323
        mul     rcx
        mov     rax, rdx
        shr     rax, 3
        ret

(Check it out!)

And there is no big difference😥

Turning multiplication by a real number into a multiply-and-shift

Actually, during the development of Dragonbox, I was interested in a more general problem of multiplying $n$ by a rational number and then finding out the integer part of the resulting rational number. In other words, my problem was not just about division, rather about multiplication followed by a division. The presence of a multiplier certainly makes the situation a little bit more tricky, but it is anyway possible to derive the optimal bound in a similar way, which leads to the following generalization of the results mentioned in the previous section. (Disclaimer: I am definitely not claiming to be the first who proved this, and I am sure an equivalent result could be found elsewhere, though I am not aware of any.)

Theorem 2 (From this paper).

Let $x$ be a real number and $n_{\max}$ a positive integer. Then for a real number $\xi$, we have the following.

If $x=\frac{p}{q}$ is a rational number with $q\leq n_{\max}$, then we have $\left\lfloor nx \right\rfloor = \left\lfloor n\xi \right\rfloor$ for all $n=1,\ \cdots\ ,n_{\max}$ if and only if $x \leq \xi < x + \frac{1}{vq}$ holds, where $v$ is the greatest integer such that $vp\equiv -1\ (\mathrm{mod}\ q)$ and $v\leq n_{\max}$.

If $x$ is either irrational or a rational number with the denominator strictly greater than $n_{\max}$, then we have $\left\lfloor nx \right\rfloor = \left\lfloor n\xi \right\rfloor$ for all $n=1,\ \cdots\ ,n_{\max}$ if and only if $\frac{p_{*}}{q_{*}} \leq \xi < \frac{p^{*}}{q^{*}}$ holds, where $\frac{p_{*}}{q_{*}}$, $\frac{p^{*}}{q^{*}}$ are the best rational approximations of $x$ from below and above, respectively, with the largest denominators $q_{*},q^{*}\leq n_{\max}$.

Note that $\left\lfloor nx \right\rfloor$ is supposed to be the one we actually want to compute, while $\xi$ is supposed to be the chosen approximation of $x$. For the special case when $x = \frac{1}{d}$, $n_{\max} = 2^{N}-1$, and $\xi = \frac{m}{2^{N+k}}$, we obtain the setting of Granlund-Montgomery.

So I said rational number in the beginning of this section, but actually our $x$ can be any real number. Note, however, that the result depends on whether $x$ is “effectively rational” or not over the domain $n=1,\ \cdots\ ,n_{\max}$, i.e., whether or not there exists a multiplier $n$ which makes $nx$ into an integer.

Since I was working on floating-point conversion problems when I derived this theorem, for me the more relevant case was the second case, that is, when $x$ is “effectively irrational”, because the numerator and the denominator of $x$ I was considering were some high powers of $2$ and $5$. But the first case is more relevant to the main theme of this post, i.e., integer division, so let us forget about jargon like best rational approximations and such. (Spoiler: they will show up again in the next section.)

So let us focus on the first case. First of all, note that if $p=1$, that is, when $x = \frac{1}{q}$, then $v$ has a simpler description: it is the last multiple of $q$ in the range $1,\ \cdots\ ,n_{\max}+1$ minus one. If you care, you can check the aforementioned paper by Lemire et al. to see that their Theorem 1 exactly corresponds to the resulting bound. In fact, in this special case it is rather easy to see why the best bound should be something like that.

Indeed, note that having the equality

\[\left\lfloor nx \right\rfloor = \left\lfloor n\xi \right\rfloor\]

for all $n=1,\ \cdots\ ,n_{\max}$ is equivalent to having the inequality

\[\frac{\left\lfloor nx\right\rfloor}{n} \leq \xi < \frac{\left\lfloor nx\right\rfloor + 1}{n}\]

for all such $n$. Hence, it is enough to find the largest possible value of the left-hand side and the smallest possible value of the right-hand side. Since we are assuming that the denominator of $x$ is bounded by $n_{\max}$, obviously the maximum value of the left-hand side is just $x$. Thus, it is enough to find the minimum value of the right-hand side. Note that we can write

\[n = q\left\lfloor \frac{n}{q} \right\rfloor + r = q\left\lfloor nx\right\rfloor + r\]

where $r$ is the remainder of the division $n/q$. Replacing $\left\lfloor nx\right\rfloor$ by $n$ and $r$ using the above identity, we get

\[\frac{\left\lfloor nx\right\rfloor + 1}{n} = \frac{(n-r)/q + 1}{n} = \frac{n + (q-r)}{qn} = \frac{1}{q} + \frac{q-r}{qn}.\]

Therefore, the minimization of the left-hand side is equivalent to the minimization of $\frac{q-r}{n}$. Intuitively, it appears reasonable to believe that the minimizer $n$ must have the largest possible remainder $r=q-1$, because for example if $r$ were set to be $q-2$ instead, then the numerator gets doubled, necessitating a proportionally larger $n$ to achieve a diminished value of $\frac{q-r}{n}$. Also, among $n$’s with $r=q-1$, obviously the largest $n$ yields the smallest value of $\frac{q-r}{n}$, so it sounds reasonable to say that probably the greatest $n$ with $r=q-1$ is the minimizer of $\frac{q-r}{n}$. Indeed, this is quite easy to prove: suppose we denote such $n$ by $v$, and suppose that there is $n$ which is at least as good as $v$:

\[\frac{q-r}{n} \leq \frac{1}{v}, \quad (q-r)v \leq n.\]

Now, since $v$ divided by $q$ has the remainder $q-1$, the left-hand side and the right-hand side of the second inequality have the same remainder when divided by $q$. Therefore, the difference between the two must be either zero or at least $q$. But since $v$ is the largest one with the remainder $q-1$, it must be at least $n_{\max} - q + 1$, thus $n$ cannot be larger than $v$ by more than $q - 1$. Thus, the only possibility is $n=v$.

When $x=\frac{p}{q}$ and $p\neq 1$, the remainder $r$ of $np/q$ depends pretty randomly on $n$, so it is somewhat harder to be intuitively convinced that the minimizer of $\frac{\left\lfloor nx\right\rfloor + 1}{n}$ should still have $r = q - 1$. But the same logic as above still works in this case with just a few tweaks. The full proof can be found in the paper mentioned, or one of my previous posts.

Some applications

Here I collected some applications of the presented theorem.

Finding the first error case

In the beginning of the previous section, I claimed that

\[\left\lfloor \frac{n}{7} \right\rfloor = \left\lfloor \frac{n\cdot 142858}{1000000} \right\rfloor\]

holds for all $n=1,\ \cdots\ ,166668$ but not for $n=166669$. Now we can see how I got this. Note that $166669\equiv 6\ (\mathrm{mod}\ 7)$, so the range $\left\{1,\ \cdots\ ,166668\right\}$ and the range $\left\{1,\ \cdots\ ,166669\right\}$ have different $v$’s: it is $166662$ for the former, while it is $166669$ for the latter. And this makes the difference, because the inequality

\[\frac{142858}{1000000} < \frac{1}{7} + \frac{1}{7v}\]

holds if and only if $v < 166667$. Thus, $n=166669$ is the first counterexample.

Coming up with a better magic number than Granlund-Montgomery

We can also see now why a better magic number worked in the example from the previous section. Let me repeat it here with the notation of Theorem 2: we have $x=1/102807$ and $n_{\max}=2^{32}-1$, so we have

\[\left\lfloor \frac{n}{102807} \right\rfloor = \left\lfloor \frac{nm}{2^{k}} \right\rfloor\]

for all $n=1,\ \cdots\ ,2^{32}-1$ if and only if $\xi=\frac{m}{2^{k}}$ satisfies

\[\frac{1}{102807} \leq \frac{m}{2^{k}} < \frac{1}{102807} + \frac{1}{v\cdot 102807}.\]

In this case, $v$ is the largest integer in the range $1,\ \cdots\ ,2^{32}-1$ which has the remainder $102806$ when divided by $102807$. Then it can be easily seen that $v=4294865231$, so the inequality above becomes

\[\frac{2^{k}}{102807} \leq m < \frac{41776\cdot 2^{k}}{4294865231}.\]

The smallest $k$ that allows an integer solution to the above inequality is $k=48$, in which case the unique solution is $m=2737896999$.

A textbook example for the case $p\neq 1$

Let me demonstrate how Theorem 2 can be used for the case $p\neq 1$. Suppose that we want to convert the temperature from Fahrenheit to Celsius. Obviously, representing such values only using integers is a funny idea, but let us pretend that we are completely serious (and hey, in 2023, Fahrenheit itself is a funny joke from the first place😂… except that I am currently living in an interesting country🤮). Or, if one desperately wants to make a really serious example, we can maybe think about doing the same thing with fixed-point fractional numbers. But whatever.

So the formula is, we first subtract $32$ and then multiply by $5/9$. For the sake of simplicity, let us arbitrarily assume that our Fahrenheit temperature is ranging from $32^{\circ}\mathrm{F}$ to $580^{\circ}\mathrm{F}$ so in particular we do not run into negative numbers. After subtracting $32$, the range is from $0$ to $548$, so we take $n_{\max}=548$. With $x = \frac{5}{9}$, our $v$ is the largest integer such that $v\leq 548$ and $5v\equiv 8\ (\mathrm{mod}\ 9)$, or equivalently, $v\equiv 7\ (\mathrm{mod}\ 9)$. The largest multiple of $9$ in the range is $540$, so we see $v=547$. Hence, the inequality we are given is

\[\frac{5}{9} \leq \frac{m}{2^{k}} < \frac{5}{9} + \frac{1}{547\cdot 9} = \frac{304}{547}.\]

The smallest $k$ that allows an integer solution is $k=10$, and the unique solution for that case is $m=569$. Therefore, the final formula would be:

\[\left\lfloor \frac{5(n-32)}{9} \right\rfloor = \left\lfloor \frac{569(n-32)}{2^{10}} \right\rfloor.\]
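The search for $v$ and for the smallest $k$ used in the last two examples can be mechanized. Below is a rough sketch of the first case of Theorem 2. The names `Magic`, `largest_v` and `find_magic` are hypothetical, `unsigned __int128` is a GCC/clang extension, and the code assumes that $p$ and $q$ are coprime, that $k<64$ suffices, and that all intermediate products fit in $128$ bits, which is fine for the small examples here but not in general.

```cpp
#include <cassert>
#include <cstdint>

using u128 = unsigned __int128;  // GCC/clang extension, not standard C++

struct Magic {
    std::uint64_t m;  // magic multiplier
    unsigned k;       // shift amount: floor(n * p / q) == (n * m) >> k
};

// Largest v <= n_max with v * p = -1 (mod q); returns 0 if there is none.
std::uint64_t largest_v(std::uint64_t p, std::uint64_t q, std::uint64_t n_max) {
    for (std::uint64_t v = n_max, steps = 0; v >= 1 && steps < q; --v, ++steps) {
        if ((u128(v) * p + 1) % q == 0) return v;
    }
    return 0;
}

// Smallest k (and the smallest m for that k) with p/q <= m/2^k < p/q + 1/(vq).
// Returns {0, 0} if no k < 64 works.
Magic find_magic(std::uint64_t p, std::uint64_t q, std::uint64_t n_max) {
    assert(q >= 1 && q <= n_max);
    std::uint64_t const v = largest_v(p, q, n_max);
    assert(v != 0);  // fails when p and q are not coprime
    for (unsigned k = 0; k < 64; ++k) {
        u128 const pow2 = u128(1) << k;
        u128 const m = (pow2 * p + q - 1) / q;  // ceil(p * 2^k / q)
        // m / 2^k < (p * v + 1) / (v * q)  <=>  m * v * q < 2^k * (p * v + 1)
        if (m * v * q < pow2 * (u128(p) * v + 1)) {
            return {static_cast<std::uint64_t>(m), k};
        }
    }
    return {0, 0};
}
```

With this, `find_magic(5, 9, 548)` gives $m=569$ and $k=10$, and `find_magic(1, 102807, 4294967295)` gives $m=2737896999$ and $k=48$, matching the two examples above.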

Will the case $p\neq 1$ be potentially relevant for compiler-writers?

Might be, but there is a caveat: in general, the compiler is not allowed to optimize an expression n * p / q into (n * m) >> k, because n * p can overflow and in that case dividing by q will give a weird answer. To be more specific, for unsigned integer types, C/C++ standards say that any overflows should wrap around, so the expression n * p / q is not really supposed to compute $\left\lfloor \frac{np}{q}\right\rfloor$, rather it is supposed to compute $\left\lfloor \frac{(np\ \mathrm{mod}\ 2^{N})}{q}\right\rfloor$ where $N$ is the bit-width, even though it is quite likely that the one who wrote the code actually wanted the former. On the other hand, for signed integer types, (a signed-equivalent of) Theorem 2 might be applicable, because signed overflows are specifically defined to be undefined. But presumably there is a lot of code out there relying on the wrong assumption that signed overflows will wrap around, so maybe compiler-writers do not want to do this kind of optimization.

Nevertheless, there are situations where doing this is perfectly legal. For example, suppose n, p, q are all of type std::uint32_t and the code is written like static_cast<std::uint32_t>((static_cast<std::uint64_t>(n) * p) / q) to intentionally avoid this overflow issue. Then the compiler might recognize such a pattern and do this kind of optimization. Or more generally, with the new assume attribute (or some other equivalent compiler-specific mechanisms), the user might give some assumptions that ensure no overflow.

It seems that currently both clang and GCC do not do this, so if they want to do so in the future, then Theorem 2 might be useful. But how much code can benefit from such an optimization? Will it be really worth implementing? I do not know, but maybe not.

When the magic number is too big

Even with the optimal bound, there exist situations where the smallest possible magic number does not fit into the word size. For example, consider the case $n_{\max} = 2^{64}-1$ and $x = 1/10961$. In this case, our $v$ is

\[v = \left\lfloor \frac{2^{64} - 10961}{10961} \right\rfloor \cdot 10961 + 10960 = 18446744073709550681.\]

Hence, the inequality we need to inspect is

\[\begin{aligned} \frac{1}{10961} \leq \frac{m}{2^{k}} & < \frac{1}{10961} + \frac{1}{10961\cdot 18446744073709550681} \\ &= \frac{1682943533775162}{18446744073709550681}, \end{aligned}\]

and the smallest $k$ allowing an integer solution is $78$, in which case the unique solution is $m = 27573346857372255605$. And unfortunately, this is a $65$-bit number!

Let us see how to deal with this case by looking at what a compiler actually does. Consider the code:

std::uint64_t div(std::uint64_t n) {
    return n / 10961;
}

My compiler (clang, again) generated the following assembly:

div(unsigned long):
        movabs  rcx, 9126602783662703989
        mov     rax, rdi
        mul     rcx
        sub     rdi, rdx
        shr     rdi
        lea     rax, [rdi + rdx]
        shr     rax, 13
        ret

(Check it out!)

This is definitely a little bit more complicated than the happy case, but if we think carefully, we can realize that this is still just computing $\left\lfloor \frac{nm}{2^{k}}\right \rfloor$:

The magic number $m’=9126602783662703989$ is precisely $m - 2^{64}$.
In the third line, we multiply this magic number with the input. Let us call the upper half of the result as $u$ (which is stored in rdx), and the lower half as $\ell$ (which is stored in rax, and we do not care about the lower half anyway).
We subtract the upper half $u$ from the input $n$ (the sub line), divide the result by 2 (the shr line), add the result back to $u$ (the lea line), and then store the end result into rax. Now this looks a bit puzzling, but what it really does is nothing but to compute $\left\lfloor (n + u)/2 \right\rfloor = \left\lfloor nm / 2^{65}\right\rfloor$. The reason why we first subtract $u$ from $n$ is to avoid overflow. On the other hand, the subtraction is totally fine as there can be no underflow, because $u = \left\lfloor nm’/2^{64}\right\rfloor$ is at most $n$, as $m’$ is less than $2^{64}$.
Recall $k = 78$, so we want to compute $\left\lfloor nm/2^{78} \right\rfloor$. Since we got $\left\lfloor nm / 2^{65}\right\rfloor$ from the previous step, we just need to shift this further by $13$-bits.
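
The same sequence of steps can be written in C++ as the following sketch. The function name div10961 is made up, and unsigned __int128 (a GCC/Clang extension) is assumed to be available for obtaining the upper half of the product; this is not necessarily how the compiler represents it internally.

```cpp
#include <cstdint>

// Computes n / 10961 with the 65-bit magic number m = 2^64 + m_low and k = 78.
std::uint64_t div10961(std::uint64_t n) {
    constexpr std::uint64_t m_low = 9126602783662703989ull;  // m - 2^64

    // u = upper 64 bits of n * m_low (what ends up in rdx after mul).
    auto u = static_cast<std::uint64_t>(
        (static_cast<unsigned __int128>(n) * m_low) >> 64);

    // floor((n + u) / 2) computed as floor((n - u) / 2) + u to avoid overflow;
    // n - u cannot underflow because u <= n.
    std::uint64_t t = ((n - u) >> 1) + u;

    // The total shift is 64 + 1 + 13 = 78.
    return t >> 13;
}
```

Comparing div10961(n) against n / 10961 at values such as $n = v$ and $n = v + 1$ is a quick way to exercise the critical inputs.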

This is not so bad; we just have two more trivial instructions compared to the happy case. But the reason why the above works is largely that the magic number is just one bit larger than $64$ bits. Well, can it be even larger than that?

The answer is: No, fortunately, the magic number being just one bit larger than the word size is the worst case.

To see why, note that the size of the interval where $\xi=\frac{m}{2^{k}}$ can possibly live is precisely $1/vq$. Therefore, if $k$ is large enough so that $2^{k}\geq vq$, then the difference between the endpoints of the inequality

\[\frac{2^{k}}{q} \leq m < 2^{k}\left(\frac{1}{q} + \frac{1}{vq}\right)\]

is at least $1$, so the inequality must admit an integer solution. Now, we are interested in the bit-width of $\left\lceil\frac{2^{k}}{q}\right\rceil$, which is the smallest possible magic number whenever $k$ admits at least one solution. Let $k_{0}$ be the smallest $k$ satisfying $2^{k}\geq vq$, so that $vq>2^{k_{0}-1}$. Since the smallest admissible $k$ is at most $k_{0}$, for that $k$ we get $\frac{2^{k}}{q} \leq \frac{2^{k_{0}}}{q} < 2v$. Since $v\leq n_{\max}$, the right-hand side is at most one bit wider than the word size.
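
As a sanity check, this bound can be traced on the example above with $q = 10961$ and $v = 18446744073709550681$ (the decimal values are rounded):

\[2^{77} \approx 1.51\times 10^{23} \;<\; vq \approx 2.02\times 10^{23} \;\leq\; 2^{78} \approx 3.02\times 10^{23}.\]

So the smallest $k$ with $2^{k}\geq vq$ is $78$, and $\frac{2^{78}}{q} < 2v < 2^{65}$, which is consistent with the magic number found above being a $65$-bit number.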

Actually, there is an alternative way of dealing with the case of a too large magic number, which is to consider a slightly different formula: instead of just doing a multiplication and then a shift, perform an addition by another magic number between those two operations. Using the notation from Theorem 2, what this means is that we have some $\zeta$ satisfying the equality

\[\left\lfloor nx \right\rfloor = \left\lfloor n\xi + \zeta \right\rfloor\]

instead of $\left\lfloor nx \right\rfloor = \left\lfloor n\xi \right\rfloor$, where $\xi=\frac{m}{2^{k}}$ and $\zeta=\frac{s}{2^{k}}$ so that

\[\left\lfloor n\xi + \zeta \right\rfloor = \left\lfloor \frac{nm + s}{2^{k}}\right\rfloor\]

can be computed by performing a multiplication, an addition, and a shift.
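
Whether a given triple $(m, s, k)$ works can always be checked by brute force for small ranges. The following is a minimal sketch for rational $x = p/q$; the function name is made up, and all intermediate values are assumed to fit into $64$ bits.

```cpp
#include <cstdint>

// Returns true if floor(n * p / q) == (n * m + s) >> k for every n = 1, ..., n_max.
// Assumes n_max * p and n_max * m + s do not overflow 64 bits.
bool multiply_add_shift_works(std::uint64_t p, std::uint64_t q,
                              std::uint64_t m, std::uint64_t s, unsigned k,
                              std::uint64_t n_max) {
    for (std::uint64_t n = 1; n <= n_max; ++n) {
        if (n * p / q != ((n * m + s) >> k)) return false;
    }
    return true;
}
```

For instance, with $x = 1/3$ and $n_{\max} = 255$, the pair $m = 85$, $k = 8$ fails without an addend (already at $n = 3$), but works once $s = 85$ is added, while $m = 171$, $k = 9$ works with $s = 0$.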

The inclusion of this $\zeta$ might allow us to use a smaller magic number, thereby enabling its accommodation within a single word. The next section is dedicated to the condition for having the above equality.

Here is a small remark before we start the (extensive) discussion of the optimal bound for this. Note that this trick of including the $\zeta$ term is probably not so useful for $64$-bit divisions, because addition is not really a trivial operation in this case: the result of the multiplication $mn$ spans two $64$-bit blocks, so we need an adc (add-with-carry) instruction or an equivalent, which is not particularly well-optimized in typical x86-64 CPUs. While I have not conducted a benchmark, I speculate that this approach probably results in worse performance. However, this trick might give better performance for the $32$-bit case than the method explained in this section, as every operation can be done within $64$ bits. Interestingly, it seems modern compilers do not use this trick anyway. In the link provided, the compiler computes a multiplication by $4999244749$ followed by a shift by $49$ bits. However, it turns out that by multiplying by $1249811187$, adding $1249811187$, and then shifting to the right by $47$ bits, we can do this computation completely within $64$ bits. I do not know whether compilers do not leverage this trick because it does not perform better, or just because nobody bothered to implement it.
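
The multiply-add-shift variant with these constants can be sketched as follows. The divisor $112607$ is not stated in the text; it is inferred from the constants (since $4999244749 = \left\lceil 2^{49}/112607 \right\rceil$ and $1249811187 = \left\lfloor 2^{47}/112607 \right\rfloor$), so treat it and the function name as assumptions.

```cpp
#include <cstdint>

// Hypothetical example: n / 112607 for 32-bit n using multiply-add-shift.
// The largest intermediate value is 2^32 * 1249811187 < 2^63, so everything
// stays within 64 bits.
std::uint32_t div112607(std::uint32_t n) {
    constexpr std::uint64_t m = 1249811187;  // used as both multiplier and addend
    return static_cast<std::uint32_t>((n * m + m) >> 47);
}
```

The interesting inputs are the multiples of the divisor and their neighbors, since those are where an off-by-one error would show up first.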

Multiply-add-and-shift rather than multiply-shift

WARNING: The contents of this section are substantially more complicated and math-heavy than previous sections.

As said in the last section, we will explore the condition for having

\[\left\lfloor nx \right\rfloor = \left\lfloor n\xi + \zeta \right\rfloor\]

for all $n=1,\ \cdots\ ,n_{\max}$, where $x$, $\xi$ are real numbers and $\zeta$ is a nonnegative real number. We will derive the optimal bound, i.e., an “if and only if” condition. We remark that an optimal bound has been obtained in the paper by Lemire et al. mentioned above for the special case when $x=\frac{1}{q}$ for some $q\leq n_{\max}$ and $\xi=\zeta$. According to their paper, the proof of the optimality of their bound is almost identical to the case of having no $\zeta$, so they did not even bother to write down the proof. I provide a proof of this special case in a later subsection. The proof I wrote seems to rely on the heavy lifting done below, but it can be done without it pretty easily as well.

However, in the general case I am dealing with here, i.e., when the only restriction is $\zeta\geq 0$, the situation is considerably more complicated. Nevertheless, even in this generality, it is possible to give a very concrete description of how exactly the presence of $\zeta$ distorts the optimal bound.

Just like the case $\zeta=0$ (i.e., Theorem 2), having the equality for all $n=1,\ \cdots\ ,n_{\max}$ is equivalent to having the inequality

\[\max_{n=1,\ \cdots\ ,n_{\max}}\frac{\left\lfloor nx\right\rfloor - \zeta}{n} \leq \xi <\min_{n=1,\ \cdots\ ,n_{\max}}\frac{\left\lfloor nx\right\rfloor + 1 - \zeta}{n},\]

so the question is how to evaluate the maximum and the minimum in the above. For a reason that will become apparent as we proceed, we will in fact try to find the largest maximizer of the left-hand side and the smallest minimizer of the right-hand side.

The lower bound

Since $\zeta\geq 0$ is supposed to be just a small constant, it sounds reasonable to believe that the maximizer of $\frac{\left\lfloor nx\right\rfloor}{n}$ is probably quite close to being the maximizer of $\frac{\left\lfloor nx\right\rfloor - \zeta}{n}$ as well. So let us start there: let $n_{0}$ be the largest among all maximizers of $\frac{\left\lfloor nx\right\rfloor}{n}$. Now, what should happen if $n_{0}$ were not the maximizer of $\frac{\left\lfloor nx\right\rfloor - \zeta}{n}$? Say, what can we say about $n$’s such that

\[\frac{\left\lfloor nx\right\rfloor - \zeta}{n} \geq \frac{\left\lfloor n_{0}x\right\rfloor - \zeta}{n_{0}}\]

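holds? First of all, since this inequality implies (recall that $\frac{\left\lfloor nx\right\rfloor}{n}\leq\frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}$ by the choice of $n_{0}$)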

\[-\frac{\zeta}{n} \geq -\frac{\zeta}{n_{0}},\]

we must have $n\geq n_{0}$ unless $\zeta=0$, which is an uninteresting case anyway.

As a result, we can equivalently reformulate our optimization problem as follows: given $n=0,\ \cdots\ ,n_{\max}-n_{0}$, find the largest maximizer of

\[\frac{\left\lfloor (n_{0}+n)x \right\rfloor - \zeta}{n_{0}+n}.\]

Now, we claim that

\[\label{eq:floor splits; lower bound} \left\lfloor (n_{0}+n)x \right\rfloor = \left\lfloor nx \right\rfloor + \left\lfloor n_{0}x \right\rfloor\]

holds for all such $n$. (Note that in general $\left\lfloor x+y\right\rfloor$ is equal to either $\left\lfloor x\right\rfloor + \left\lfloor y\right\rfloor$ or $\left\lfloor x\right\rfloor + \left\lfloor y\right\rfloor + 1$.) This follows from the fact that $\frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}$ is not only a best rational approximation from below in the weak sense, but also in the strong sense. Okay, at this point there is no way to get around this jargon anymore, so let us define it formally.

Definition 3 (Best rational approximations from below/above).

Let $x$ be a real number. We say a rational number $\frac{p}{q}$ (in its reduced form, which is always assumed if not specified otherwise) is a best rational approximation from below (above, resp.) if $\frac{p}{q}\leq x$ ($\frac{p}{q}\geq x$, resp.) and for any rational number $\frac{a}{b}$ with $\frac{p}{q}\leq\frac{a}{b}\leq x$ ($\frac{p}{q}\geq\frac{a}{b}\geq x$, resp.), we always have $q\leq b$.

In other words, $\frac{p}{q}$ is a best rational approximation of $x$ if any better rational approximation must have a larger denominator.

Remark. Note that the terminology best rational approximation is pretty standard in pure mathematics, but it usually disregards the direction of approximation, from below or above. For example, $\frac{1}{3}$ is a best rational approximation from below of $\frac{3}{7}$, but it is not a best rational approximation in the usual, non-directional sense, because $\frac{1}{2}$ is a better approximation with a strictly smaller denominator. But this concept of non-directional best rational approximation is quite irrelevant to our application, so any usage of the term best rational approximation in this post always means the directional ones, either from below or above.

The definition provided above is the one in the weak sense, although it is not explicitly written so. The corresponding one in the strong sense is given below:

Definition 4 (Best rational approximations from below/above in the strong sense).

Let $x$ be a real number. We say a rational number $\frac{p}{q}$ (again, in its reduced form) is a best rational approximation from below (above, resp.) in the strong sense, if $\frac{p}{q}\leq x$ ($\frac{p}{q}\geq x$, resp.) and for any rational number $\frac{a}{b}$ with $qx - p\geq bx - a\geq 0$ ($p - qx\geq a - bx \geq 0$, resp.), we always have $q\leq b$.

As the name suggests, if $\frac{p}{q}$ is a best rational approximation from below (above, resp.) in the strong sense, then it is a best rational approximation from below (above, resp.) in the weak sense. To see why, suppose that $\frac{p}{q}$ is a best rational approximation from below of $x$ in the strong sense and take any rational number $\frac{a}{b}$ with $\frac{a}{b}\leq x$ and $b\leq q$. Then it is enough to show that $\frac{p}{q}\geq\frac{a}{b}$ holds. (We are only considering the “from below” case, and the “from above” case can be done in the same way.) This is indeed quite easy to show:

\[x - \frac{p}{q} = \frac{qx - p}{q} \leq \frac{bx - a}{q} \leq \frac{bx - a}{b} = x - \frac{a}{b},\]

thus we conclude $\frac{a}{b}\leq\frac{p}{q}$ as claimed, so $\frac{p}{q}$ is indeed a best rational approximation from below of $x$ in the weak sense.

Remarkably, using the theory of continued fractions, it can be shown that the converse is also true. Note that this fact is not at all obvious just from the definitions. You can find a proof of this fact in my paper on Dragonbox; see the remark after Algorithm C.13 (I shamelessly give my own writing as a reference because I do not know of any other).

Alright, but what is the point of all this nonsense?

First, as suggested before, our $\frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}$ is a best rational approximation from below of $x$. This is obvious from its definition: $n_{0}$ is the maximizer of $\frac{\left\lfloor nx\right\rfloor}{n}$ for $n=1,\ \cdots\ ,n_{\max}$, which means that whenever we have

\[\frac{\left\lfloor n_{0}x \right\rfloor}{n_{0}} < \frac{a}{b} \leq x,\]

then since $a \leq \left\lfloor bx\right\rfloor$ holds (because otherwise the inequality $\frac{a}{b}\leq x$ fails to hold), we must have $b>n_{\max}$, so in particular $b>n_{0}$.

Therefore, by the aforementioned equivalence of the two concepts, we immediately know that $\frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}$ is a best rational approximation from below of $x$ in the strong sense. Note that this does not imply that $n_{0}$ is a minimizer of $nx - \left\lfloor nx\right\rfloor$ for $n=1,\ \cdots\ ,n_{\max}$ because $\frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}$ is not necessarily in its reduced form (since we chose $n_{0}$ to be the largest maximizer). Still, if we denote its reduced form as $\frac{p}{q}$, then certainly $q$ is the minimizer of it (and $p$ must be equal to $\left\lfloor qx\right\rfloor$). Indeed, by the definition of best rational approximations from below in the strong sense, if $n$ is the smallest integer such that $n>q$ and

\[nx - \left\lfloor nx\right\rfloor < qx - p,\]

then $\frac{\left\lfloor nx\right\rfloor}{n}$ itself must be a best rational approximation from below in the strong sense, thus also in the weak sense. This in particular means $\frac{\left\lfloor nx\right\rfloor}{n}\geq\frac{p}{q} = \frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}$ holds since $n>q$, and in fact the equality cannot hold because otherwise we get

\[qx - p = qx - \frac{q}{n}\left\lfloor nx\right\rfloor = \frac{q}{n}\left(nx - \left\lfloor nx\right\rfloor\right) < \frac{q}{n}(qx - p)\]

which contradicts $n>q$. Therefore, by the definition of $n_{0}$, we must have $n>n_{\max}$.

Using this fact, now we can easily prove $\eqref{eq:floor splits; lower bound}$. Again let $\frac{p}{q}$ be the reduced form of $\frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}$, then for any $n = 0,\ \cdots\ ,n_{\max} - q$, we must have

\[(q+n)x - \left\lfloor (q+n)x\right\rfloor \geq qx - p,\]

so we get

\[\left\lfloor (q+n)x\right\rfloor \leq \left\lfloor qx\right\rfloor+ nx < \left\lfloor qx\right\rfloor + \left\lfloor nx\right\rfloor + 1,\]

and this rules out the possibility $\left\lfloor (q+n)x\right\rfloor = \left\lfloor qx\right\rfloor + \left\lfloor nx\right\rfloor + 1$, thus we must have $\left\lfloor (q+n)x\right\rfloor = \left\lfloor qx\right\rfloor + \left\lfloor nx\right\rfloor$. Then inductively, we get that for any positive integer $k$ such that $kq\leq n_{\max}$ and $n=0,\ \cdots\ ,n_{\max} - kq$,

\[\left\lfloor (kq+n)x\right\rfloor = \left\lfloor qx\right\rfloor + \left\lfloor ((k-1)q + n)x\right\rfloor =\ \cdots\ = \left\lfloor kqx\right\rfloor + \left\lfloor nx\right\rfloor.\]

Hence, we must have $\left\lfloor (n_{0}+n)x\right\rfloor = \left\lfloor n_{0}x\right\rfloor + \left\lfloor nx\right\rfloor$ in particular.

(Here is a more intuitive explanation. Note that among all $nx$’s, $qx$ is the one with the smallest fractional part. Then whenever $kq + n\leq n_{\max}$, the sum of the fractional parts of $nx$ and that of $k$ copies of $qx$ should not “wrap around”, i.e., it should be strictly less than $1$, because if we choose the smallest $k$ such that $(kq + n)x$ wraps around, then since $((k-1)q + n)x$ does not wrap around, its fractional part is strictly less than $1$, which means that the fractional part of $(kq + n)x$ is strictly less than that of $qx$, contradicting the minimality of the fractional part of $qx$. Hence, since the sum of all fractional parts is still strictly less than $1$, the integer part of $(kq + n)x$ should be just the sum of the integer parts of its summands.)
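
For rational $x = p/q$, the claim can be checked numerically with exact integer arithmetic. The sketch below uses brute force instead of continued fractions, and the function names are made up for illustration; inputs are assumed small enough that the products do not overflow.

```cpp
#include <cstdint>

// Largest n in [1, n_max] maximizing floor(n*p/q) / n.
// Fractions are compared exactly by cross-multiplication.
std::uint64_t largest_maximizer(std::uint64_t p, std::uint64_t q, std::uint64_t n_max) {
    std::uint64_t best = 1;
    for (std::uint64_t n = 2; n <= n_max; ++n) {
        // floor(n x)/n >= floor(best x)/best; ">=" keeps the largest maximizer.
        if ((n * p / q) * best >= (best * p / q) * n) best = n;
    }
    return best;
}

// Checks floor((n0 + n) x) == floor(n0 x) + floor(n x) for n = 0, ..., n_max - n0.
bool floor_splits(std::uint64_t p, std::uint64_t q, std::uint64_t n_max) {
    std::uint64_t n0 = largest_maximizer(p, q, n_max);
    for (std::uint64_t n = 0; n0 + n <= n_max; ++n) {
        if ((n0 + n) * p / q != n0 * p / q + n * p / q) return false;
    }
    return true;
}
```

For example, with $x = \frac{5}{13}$ and $n_{\max} = 10$ we get $n_{0} = 8$, and the identity holds for $n = 0, 1, 2$.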

Next, using the claim, now we can characterize $n=1,\ \cdots\ ,n_{\max} - n_{0}$ which yields a larger value for

\[\frac{\left\lfloor (n_{0}+n)x \right\rfloor - \zeta}{n_{0}+n}\]

than the case $n=0$. Indeed, we have the inequality

\[\frac{\left\lfloor (n_{0}+n)x \right\rfloor - \zeta}{n_{0}+n} \geq \frac{\left\lfloor n_{0}x \right\rfloor - \zeta}{n_{0}}\]

if and only if

\[n_{0}\left\lfloor nx\right\rfloor - n_{0}\zeta \geq n\left\lfloor n_{0}x\right\rfloor - (n_{0}+n)\zeta,\]

if and only if

\[\frac{\left\lfloor nx\right\rfloor}{n} \geq \frac{\left\lfloor n_{0}x\right\rfloor - \zeta}{n_{0}},\]

or equivalently,

\[\label{eq:lower bound iteration criterion} x - \frac{\left\lfloor nx\right\rfloor}{n} \leq \frac{\zeta}{n_{0}} + \left(x - \frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}\right).\]

Therefore, in a sense, $\frac{\left\lfloor nx\right\rfloor}{n}$ itself should be a good enough approximation of $x$, although it does not need to (and cannot) be a better approximation than $\frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}$.

At this point, we could just enumerate all rational approximations of $x$ from below satisfying the above bound and find out the one that maximizes $\frac{\left\lfloor (n_{0}+n)x \right\rfloor - \zeta}{n_{0}+n}$. Indeed, the theory of continued fractions allows us to develop an efficient algorithm for doing such a task. (See Algorithm C.13 in my paper on Dragonbox for an algorithm doing a closely related task.) However, we can do better here.

First of all, note that if $\zeta$ is small enough, then $\eqref{eq:lower bound iteration criterion}$ does not have any solution for $n=1,\ \cdots\ ,n_{\max} - n_{0}$, which means $n_{0}$ is indeed the largest maximizer we are looking for and there is nothing further we need to do. To be more precise, if $\zeta$ is chosen so that $\frac{\left\lfloor n_{0}x\right\rfloor -\zeta}{n_{0}}$ is strictly greater than the next best rational approximation from below of $x$, then any solution to $\eqref{eq:lower bound iteration criterion}$ must be a multiple of $q_{*}$ where $q_{*}$ is the denominator of $\frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}$ in its reduced form. But since $n_{0}$ is chosen to be the largest, it follows that there is no multiple of $q_{*}$ in the range $\{1,\ \cdots\ ,n_{\max} - n_{0}\}$, so there is no solution to $\eqref{eq:lower bound iteration criterion}$ in that range.

This conclusion is consistent with the intuition that $n_{0}$ should be close enough to the maximizer of $\frac{\left\lfloor nx\right\rfloor -\zeta}{n}$ at least if $\zeta$ is small. But what if $\zeta$ is not that small?

It is quite tempting to claim that the minimizer of the left-hand side of $\eqref{eq:lower bound iteration criterion}$ is the maximizer of $\frac{\left\lfloor (n_{0}+n)x \right\rfloor - \zeta}{n_{0}+n}$, but that is not true in general. Nevertheless, we can start from there, just as we started from $n_{0}$ at the beginning.

For this reason, let $n_{1}$ be the largest maximizer of $\frac{\left\lfloor nx\right\rfloor}{n}$ for $n=1,\ \cdots\ ,n_{\max} - n_{0}$. As pointed out earlier, if such $n_{1}$ does not satisfy $\eqref{eq:lower bound iteration criterion}$, then there is nothing further to do, so suppose that the inequality is indeed satisfied with $n=n_{1}$.

Next, we claim that

\[\frac{\left\lfloor (n_{0}+n)x \right\rfloor - \zeta}{n_{0}+n} \leq \frac{\left\lfloor (n_{0}+n_{1})x \right\rfloor - \zeta}{n_{0}+n_{1}}\]

holds for all $n=1,\ \cdots\ ,n_{1}$.

Note that by $\eqref{eq:floor splits; lower bound}$, this is equivalent to

\[\begin{aligned} &n_{0}\left\lfloor nx\right\rfloor + n_{1}\left(\left\lfloor n_{0}x\right\rfloor + \left\lfloor nx\right\rfloor\right) - (n_{0}+n_{1})\zeta \\ &\quad\quad\quad \leq n_{0}\left\lfloor n_{1}x\right\rfloor + n\left(\left\lfloor n_{0}x\right\rfloor + \left\lfloor n_{1}x\right\rfloor\right) - (n_{0}+n)\zeta, \end{aligned}\]

and rearranging it gives

\[(n_{1}-n)(\zeta - \left\lfloor n_{0}x\right\rfloor) \geq \left(n_{1}\left\lfloor nx\right\rfloor - n\left\lfloor n_{1}x\right\rfloor\right) + n_{0}\left( \left\lfloor nx\right\rfloor - \left\lfloor n_{1}x\right\rfloor \right).\]

By adding and subtracting appropriate terms, we can rewrite this as

\[\begin{aligned} &n_{0}(n_{1}-n)\left(\frac{\zeta}{n_{0}} + \left(x - \frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}\right)\right) \\ &\quad\quad\quad\geq n_{1}(n_{0}+n)\left( x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \right) - n(n_{0}+n_{1})\left( x - \frac{\left\lfloor nx\right\rfloor}{n} \right). \end{aligned}\]

Recall that we already have assumed

\[x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \leq \frac{\zeta}{n_{0}} + \left(x - \frac{\left\lfloor n_{0}x\right\rfloor}{n_{0}}\right),\]

so it is enough to show

\[\begin{aligned} &n_{0}(n_{1} - n)\left( x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \right) \\ &\quad\quad\quad\geq n_{1}(n_{0}+n)\left( x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \right) - n(n_{0}+n_{1})\left( x - \frac{\left\lfloor nx\right\rfloor}{n} \right), \end{aligned}\]

or equivalently,

\[\begin{aligned} n(n_{0}+n_{1})\left(x - \frac{\left\lfloor nx\right\rfloor}{n}\right) &\geq \left(n_{1}(n_{0}+n) - n_{0}(n_{1} - n)\right) \left(x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}}\right) \\ &= n(n_{0}+n_{1}) \left(x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}}\right), \end{aligned}\]

which trivially holds by the definition of $n_{1}$. Thus the claim is proved.

As a result, we can reformulate our optimization problem again in the following way: define $N_{1}:=n_{0}+n_{1}$, then we are to find $n=0,\ \cdots\ ,n_{\max} - N_{1}$ which maximizes

\[\frac{\left\lfloor (N_{1}+n)x\right\rfloor - \zeta}{N_{1}+n}.\]

As you can see, this now closely resembles what we have just done. Then the natural next step is to see whether we have

\[\left\lfloor (N_{1}+n)x\right\rfloor = \left\lfloor N_{1}x\right\rfloor + \left\lfloor nx\right\rfloor,\]

which, by $\eqref{eq:floor splits; lower bound}$, is equivalent to

\[\left\lfloor (n_{1}+n)x\right\rfloor = \left\lfloor n_{1}x\right\rfloor + \left\lfloor nx\right\rfloor.\]

But we already know this: because $\frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}}$ is a best rational approximation from below of $x$, the exact same proof of $\eqref{eq:floor splits; lower bound}$ applies.

Therefore, by the same procedure as we did, we get that

\[\frac{\left\lfloor (N_{1}+n)x \right\rfloor - \zeta}{N_{1}+n} \geq \frac{\left\lfloor N_{1}x \right\rfloor - \zeta}{N_{1}}\]

holds for $n=1,\ \cdots\ ,n_{\max} - N_{1}$ if and only if

\[\frac{\left\lfloor nx\right\rfloor}{n} \geq \frac{\left\lfloor N_{1}x\right\rfloor - \zeta}{N_{1}}.\]

Hence, we finally arrive at the following iterative algorithm for computing the maximizer of $\frac{\left\lfloor nx\right\rfloor - \zeta}{n}$:

Algorithm 5 (Computing the lower bound).

Input: $x\in\mathbb{R}$, $n_{\max}\in\mathbb{Z}_{>0}$, $\zeta\geq 0$.

Output: the largest maximizer of $\frac{\left\lfloor nx\right\rfloor - \zeta}{n}$ for $n=1,\ \cdots\ ,n_{\max}$.

1. Find the largest $n=1,\ \cdots\ ,n_{\max}$ that maximizes $\frac{\left\lfloor nx\right\rfloor}{n}$ and call it $n_{0}$.

2. If $n_{0} = n_{\max}$, then $n_{0}$ is the largest maximizer; return.

3. Otherwise, find the largest $n=1,\ \cdots\ ,n_{\max} - n_{0}$ that maximizes $\frac{\left\lfloor nx\right\rfloor}{n}$ and call it $n_{1}$.

4. Inspect the inequality \[ \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \geq \frac{\left\lfloor n_{0}x\right\rfloor - \zeta}{n_{0}}. \]

5. If the inequality does not hold, then $n_{0}$ is the largest maximizer; return.

6. If the inequality does hold, then set $n_{0}\leftarrow n_{0} + n_{1}$ and go to Step 2.
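
The steps above can be sketched in C++ for rational inputs $x = p/q$ and $\zeta = a/b$. This is only an illustration under assumptions: the inner search is a brute-force loop rather than the continued-fraction method, the function names are made up, and all products are assumed to fit into $64$-bit signed integers.

```cpp
#include <cstdint>

using i64 = std::int64_t;

// Largest n in [1, n_max] maximizing floor(n*p/q) / n (brute force).
i64 largest_maximizer(i64 p, i64 q, i64 n_max) {
    i64 best = 1;
    for (i64 n = 2; n <= n_max; ++n)
        if ((n * p / q) * best >= (best * p / q) * n) best = n;
    return best;
}

// Largest maximizer of (floor(n x) - zeta) / n over n = 1, ..., n_max,
// for x = p/q and zeta = a/b with p, a >= 0 and q, b > 0.
i64 lower_bound_maximizer(i64 p, i64 q, i64 n_max, i64 a, i64 b) {
    i64 n0 = largest_maximizer(p, q, n_max);
    while (n0 != n_max) {
        i64 n1 = largest_maximizer(p, q, n_max - n0);
        i64 f0 = n0 * p / q;  // floor(n0 x)
        i64 f1 = n1 * p / q;  // floor(n1 x)
        // Stop unless floor(n1 x)/n1 >= (floor(n0 x) - zeta)/n0,
        // written with denominators cleared.
        if (f1 * n0 * b < (f0 * b - a) * n1) break;
        n0 += n1;
    }
    return n0;
}
```

For example, with $x = \frac{3}{7}$ and $n_{\max} = 10$, the result is $7$ for $\zeta = \frac{1}{2}$ and $10$ for $\zeta = \frac{3}{4}$, which can be confirmed by evaluating $\frac{\left\lfloor nx\right\rfloor - \zeta}{n}$ directly for $n = 1,\ \cdots\ ,10$.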

Remark. In the above, we blackboxed the operation of finding the largest maximizer of $\frac{\left\lfloor nx\right\rfloor}{n}$. Actually, this is precisely how we obtain Theorem 2. If $x$ is a rational number whose denominator $q$ is at most $n_{\max}$, then obviously the largest maximizer is the largest multiple of $q$ bounded by $n_{\max}$. Otherwise, we just compute the best rational approximation from below of $x$ with the largest denominator $q_{*}\leq n_{\max}$, where the theory of continued fractions allows us to compute this very efficiently. (In particular, when $x$ is a rational number, this efficient algorithm is really just the extended Euclidean algorithm.) Once we get the best rational approximation $\frac{p_{*}}{q_{*}}$ (which must be in its reduced form), we find the largest multiple $kq_{*}$ of $q_{*}$ bounded by $n_{\max}$. Then since $\frac{p_{*}}{q_{*}}$ is the maximum of $\frac{\left\lfloor nx\right\rfloor}{n}$ for $n=1,\ \cdots\ ,n_{\max}$, it follows that $\left\lfloor kq_{*}x\right\rfloor$ must be equal to $kp_{*}$ and $kq_{*}$ is the largest maximizer of $\frac{\left\lfloor nx\right\rfloor}{n}$.

The upper bound

The computation of the upper bound, that is, finding the smallest minimizer of

\[\frac{\left\lfloor nx\right\rfloor + 1 - \zeta}{n}\]

for $n=1,\ \cdots\ ,n_{\max}$, is largely the same as in the case of the lower bound. In this case, we start with the smallest minimizer $n_{0}$ of $\frac{\left\lfloor nx\right\rfloor + 1}{n}$. Then for any $n=1,\ \cdots\ ,n_{\max}$ with

\[\frac{\left\lfloor nx\right\rfloor + 1-\zeta}{n} \leq \frac{\left\lfloor n_{0}x\right\rfloor + 1 - \zeta}{n_{0}},\]

we must have

\[-\frac{\zeta}{n} \leq -\frac{\zeta}{n_{0}},\]

thus $n\leq n_{0}$ unless $\zeta=0$, which again is an uninteresting case.

Hence, our goal is to find $n=0,\ \cdots\ ,n_{0}-1$ minimizing

\[\frac{\left\lfloor (n_{0}-n)x\right\rfloor + 1 - \zeta}{n_{0} - n}.\]

Again, we claim that

\[\label{eq:floor splits; upper bound} \left\lfloor (n_{0}-n)x\right\rfloor = \left\lfloor n_{0}x\right\rfloor - \left\lfloor nx\right\rfloor\]

holds for all such $n$. We have two cases: (1) when $\frac{\left\lfloor n_{0}x\right\rfloor + 1}{n_{0}}$ is a best rational approximation from above of $x$, or (2) when it is not. The second case can happen only when $x$ is a rational number whose denominator is at most $n_{\max}$, so that the best rational approximation of it is $x$ itself. However, for such a case, if we let $x=\frac{p}{q}$, then according to how we derived Theorem 2, the remainder of $n_{0}p/q$ must be the largest possible value, $q-1$. Hence, the quotient of $(n_{0} - n)p/q$ cannot be strictly smaller than the difference between the quotients of $n_{0}p/q$ and $np/q$, so the claim holds in this case.

On the other hand, if we suppose that $\frac{\left\lfloor n_{0}x\right\rfloor + 1}{n_{0}}$ is a best rational approximation from above of $x$, then as we have seen in the case of lower bound, it must be a best rational approximation from above in the strong sense. Then it is not hard to see that $n_{0}$ is the minimizer of $\left\lceil nx\right\rceil - nx = \left\lfloor nx\right\rfloor + 1 - nx$, or equivalently, the maximizer of $nx - \left\lfloor nx\right\rfloor$, because we chose $n_{0}$ to be the smallest minimizer. This shows

\[(n_{0}-n)x - \left\lfloor(n_{0}-n)x\right\rfloor \leq n_{0}x - \left\lfloor n_{0}x\right\rfloor,\]

thus

\[\left\lfloor(n_{0}-n)x\right\rfloor \geq \left\lfloor n_{0}x\right\rfloor - nx > \left\lfloor n_{0}x\right\rfloor - \left\lfloor nx\right\rfloor - 1,\]

so we must have $\left\lfloor(n_{0}-n)x\right\rfloor = \left\lfloor n_{0}x\right\rfloor - \left\lfloor nx\right\rfloor$ as claimed.

Using the claim, we can again characterize $n=1,\ \cdots\ ,n_{0}-1$ which yields a smaller value of

\[\frac{\left\lfloor (n_{0}-n)x \right\rfloor + 1 - \zeta}{n_{0}-n}\]

than the case $n=0$. Indeed, we have the inequality

\[\frac{\left\lfloor (n_{0}-n)x \right\rfloor + 1 - \zeta}{n_{0}-n} \leq \frac{\left\lfloor n_{0}x \right\rfloor + 1 - \zeta}{n_{0}}\]

if and only if

\[-n_{0}\left\lfloor nx\right\rfloor + n_{0}(1-\zeta) \leq -n\left\lfloor n_{0}x\right\rfloor + (n_{0}-n)(1-\zeta)\]

if and only if

\[n_{0}\left\lfloor nx\right\rfloor \geq n\left\lfloor n_{0}x\right\rfloor + n(1-\zeta)\]

if and only if

\[\frac{\left\lfloor nx\right\rfloor}{n} \geq \frac{\left\lfloor n_{0}x\right\rfloor + 1 - \zeta}{n_{0}},\]

or equivalently,

\[\label{eq:upper bound iteration criterion} x - \frac{\left\lfloor nx\right\rfloor}{n} \leq \frac{\zeta}{n_{0}} - \left(\frac{\left\lfloor n_{0}x\right\rfloor + 1}{n_{0}} - x\right).\]

Next, let $n_{1}$ be the largest maximizer of $\frac{\left\lfloor nx\right\rfloor}{n}$ for $n=1,\ \cdots\ ,n_{0}-1$. If $\eqref{eq:upper bound iteration criterion}$ does not hold with $n=n_{1}$, then we know that $n_{0}$ is the one we are looking for, so suppose otherwise.

Next, we claim that

\[\frac{\left\lfloor (n_{0}-n)x\right\rfloor + 1 - \zeta}{n_{0}-n} \geq \frac{\left\lfloor (n_{0}-n_{1})x\right\rfloor + 1 - \zeta}{n_{0}-n_{1}}\]

holds for all $n=1,\ \cdots\ ,n_{1}$. By $\eqref{eq:floor splits; upper bound}$, we can rewrite the inequality above as

\[\begin{aligned} &-n_{0}\left\lfloor nx\right\rfloor - n_{1}\left\lfloor n_{0}x\right\rfloor + n_{1}\left\lfloor nx\right\rfloor + (n_{0}-n_{1})(1-\zeta) \\ &\quad\quad\quad \geq -n_{0}\left\lfloor n_{1}x\right\rfloor - n\left\lfloor n_{0}x\right\rfloor + n\left\lfloor n_{1}x\right\rfloor + (n_{0}-n)(1-\zeta), \end{aligned}\]

and rearranging it gives

\[(n_{1}-n)\left(\left\lfloor n_{0}x\right\rfloor + 1 - \zeta\right) \leq \left( n_{1}\left\lfloor nx\right\rfloor - n\left\lfloor n_{1}x\right\rfloor \right) - n_{0}\left( \left\lfloor nx\right\rfloor - \left\lfloor n_{1}x\right\rfloor \right).\]

Then adding and subtracting appropriate terms gives

\[\begin{aligned} &n_{0}(n_{1}-n)\left( \frac{\zeta}{n_{0}} - \left(\frac{\left\lfloor n_{0}x\right\rfloor + 1}{n_{0}} - x\right) \right) \\ &\quad\quad\quad\geq n_{1}(n_{0}-n)\left( x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \right) - n(n_{0}-n_{1})\left( x - \frac{\left\lfloor nx\right\rfloor}{n} \right). \end{aligned}\]

Recall that we already supposed

\[x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \leq \frac{\zeta}{n_{0}} - \left(\frac{\left\lfloor n_{0}x\right\rfloor + 1}{n_{0}} - x\right),\]

so it is enough to show

\[\begin{aligned} &n_{0}(n_{1}-n)\left( x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \right) \\ &\quad\quad\quad\geq n_{1}(n_{0}-n)\left( x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \right) - n(n_{0}-n_{1})\left( x - \frac{\left\lfloor nx\right\rfloor}{n} \right), \end{aligned}\]

or equivalently,

\[\begin{aligned} n(n_{0}-n_{1})\left(x - \frac{\left\lfloor nx\right\rfloor}{n}\right) &\geq \left(n_{1}(n_{0} - n) - n_{0}(n_{1}-n)\right) \left(x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}}\right) \\ &= n(n_{0} - n_{1}) \left(x - \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}}\right), \end{aligned}\]

which trivially holds by the definition of $n_{1}$. Thus, the claim is proved.

As a result, we can reformulate our optimization problem in the following way: define $N_{1}=n_{0}-n_{1}$, then we are to find $n=0,\ \cdots\ ,N_{1}-1$ which minimizes

\[\frac{\left\lfloor (N_{1}-n)x\right\rfloor + 1 - \zeta}{N_{1}-n}.\]

Again, the next step is to show that

\[\left\lfloor (N_{1}-n)x\right\rfloor = \left\lfloor N_{1}x\right\rfloor - \left\lfloor nx\right\rfloor\]

holds for all such $n$, which by $\eqref{eq:floor splits; upper bound}$, is equivalent to

\[\left\lfloor (n_{1}+n)x\right\rfloor = \left\lfloor n_{1}x\right\rfloor + \left\lfloor nx\right\rfloor.\]

But we already have seen this in the case of lower bound, because we know $n_{1}$ is a maximizer of $\frac{\left\lfloor nx\right\rfloor}{n}$ for $n = 1,\ \cdots\ ,n_{0} - 1$. Hence, again we can show that

\[\frac{\left\lfloor (N_{1}-n)x\right\rfloor + 1 - \zeta}{N_{1}-n} \leq \frac{\left\lfloor N_{1}x\right\rfloor + 1 - \zeta}{N_{1}}\]

holds if and only if

\[\frac{\left\lfloor nx\right\rfloor}{n} \geq \frac{\left\lfloor N_{1}x\right\rfloor + 1 - \zeta}{N_{1}},\]

and repeating this procedure gives us the smallest minimizer of $\frac{\left\lfloor nx\right\rfloor + 1 - \zeta}{n}$.

Algorithm 6 (Computing the upper bound).

Input: $x\in\mathbb{R}$, $n_{\max}\in\mathbb{Z}_{>0}$, $\zeta\geq 0$.

Output: the smallest minimizer of $\frac{\left\lfloor nx\right\rfloor + 1 - \zeta}{n}$ for $n=1,\ \cdots\ ,n_{\max}$.

1. Find the smallest $n=1,\ \cdots\ ,n_{\max}$ that minimizes $\frac{\left\lfloor nx\right\rfloor + 1}{n}$ and call it $n_{0}$.

2. If $n_{0} = 1$, then $n_{0}$ is the smallest minimizer; return.

3. Find the largest $n=1,\ \cdots\ ,n_{0} - 1$ that maximizes $\frac{\left\lfloor nx\right\rfloor}{n}$ and call it $n_{1}$.

4. Inspect the inequality \[ \frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \geq \frac{\left\lfloor n_{0}x\right\rfloor + 1 - \zeta}{n_{0}}. \]

5. If the inequality does not hold, then $n_{0}$ is the smallest minimizer; return.

6. If the inequality does hold, then set $n_{0}\leftarrow n_{0} - n_{1}$ and go to Step 2.

Remark. Similarly to the case of the lower bound, we blackboxed the operation of finding the smallest minimizer of $\frac{\left\lfloor nx\right\rfloor + 1}{n}$, which again is precisely how we obtain Theorem 2. If $x=\frac{p}{q}$ is a rational number with $q\leq n_{\max}$, then the minimizer is unique and is the largest $v\leq n_{\max}$ such that $vp\equiv -1\ (\mathrm{mod}\ q)$. Otherwise, we just compute the best rational approximation from above of $x$ with the largest denominator $q_{*}\leq n_{\max}$, where the theory of continued fractions allows us to compute this very efficiently. The resulting $\frac{p_{*}}{q_{*}}$ must be in its reduced form, so $q_{*}$ is the smallest minimizer of $\frac{\left\lfloor nx\right\rfloor + 1}{n}$.
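For small inputs, Algorithm 6 can be tried out directly by replacing the blackboxed steps with plain loops. The following is only a minimal sketch under several assumptions: $x=\frac{p}{q}$ is a positive rational, $\zeta$ is passed as a fraction `zeta_num / zeta_den`, every product fits in 64 bits, and all function names are made up for this illustration. The brute-force helpers stand in for the continued-fraction computation mentioned in the remark.

```cpp
#include <cassert>
#include <cstdint>

using i64 = std::int64_t;

// floor(n * p / q) for positive arguments.
i64 floor_nx(i64 n, i64 p, i64 q) { return (n * p) / q; }

// Brute-force stand-in for the blackboxed step:
// the smallest n in [1, n_max] minimizing (floor(nx) + 1) / n.
i64 smallest_minimizer_upper(i64 p, i64 q, i64 n_max) {
    i64 best = 1;
    for (i64 n = 2; n <= n_max; ++n) {
        // Strict "<" (after cross-multiplication) keeps the smallest n on ties.
        if ((floor_nx(n, p, q) + 1) * best < (floor_nx(best, p, q) + 1) * n) best = n;
    }
    return best;
}

// The largest n in [1, n_max] maximizing floor(nx) / n.
i64 largest_maximizer_lower(i64 p, i64 q, i64 n_max) {
    i64 best = 1;
    for (i64 n = 2; n <= n_max; ++n) {
        // ">=" so that ties go to the larger n.
        if (floor_nx(n, p, q) * best >= floor_nx(best, p, q) * n) best = n;
    }
    return best;
}

// Sketch of Algorithm 6 for x = p/q and zeta = zeta_num/zeta_den.
i64 algorithm6(i64 p, i64 q, i64 n_max, i64 zeta_num, i64 zeta_den) {
    assert(p > 0 && q > 0 && n_max >= 1 && zeta_num >= 0 && zeta_den > 0);
    i64 n0 = smallest_minimizer_upper(p, q, n_max);
    while (n0 != 1) {
        i64 n1 = largest_maximizer_lower(p, q, n0 - 1);
        // floor(n1 x)/n1 >= (floor(n0 x) + 1 - zeta)/n0,
        // with both sides multiplied by n0 * n1 * zeta_den.
        i64 lhs = floor_nx(n1, p, q) * n0 * zeta_den;
        i64 rhs = ((floor_nx(n0, p, q) + 1) * zeta_den - zeta_num) * n1;
        if (lhs < rhs) break;  // inequality fails: n0 is the answer
        n0 -= n1;
    }
    return n0;
}
```

For instance, with $x=\frac{7}{18}$ and $n_{\max}=20$, `algorithm6(7, 18, 20, 0, 1)` gives $5$ and `algorithm6(7, 18, 20, 1, 2)` gives $2$, which agrees with evaluating $\frac{\left\lfloor nx\right\rfloor + 1 - \zeta}{n}$ for every $n$ by hand.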

Finding feasible values of $\xi$ and $\zeta$

Note that our purpose of introducing $\zeta$ was to increase the gap between the lower and the upper bounds of $\xi=\frac{m}{2^{k}}$ when the required bit-width of the constant $m$ is too large. Thus, $\zeta$ is not something given, but rather something we want to figure out. In this sense, Algorithm 5 and Algorithm 6 might seem pretty useless because they work only when the value of $\zeta$ is already given.

Nevertheless, the way those algorithms work for a fixed $\zeta$ is in fact quite special in that it allows us to figure out a good $\zeta$ fitting into our purpose of widening the gap between two bounds. The important point here is that, $\zeta$ only appears in deciding when to stop the iteration, and all other details of the actual iteration steps do not depend on $\zeta$ at all.

Hence, our strategy is as follows. First, we assume $\zeta$ lives in the interval $[0,1)$ (otherwise $\left\lfloor nx\right\rfloor = \left\lfloor n\xi+\zeta\right\rfloor$ fails to hold for $n=0$). Then we can partition the interval $[0,1)$ into subintervals of the form $[\zeta_{\min},\zeta_{\max})$ where the numbers of iterations that Algorithm 5 and Algorithm 6 will take remain constant. Then we look at each of these subintervals one by one, from left to right, advancing the iterations of Algorithm 5 and Algorithm 6 whenever we move on to the next subinterval and the number of iterations of either of them changes.
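To see where the endpoints of these subintervals come from, the stopping test of Algorithm 6 can be solved for $\zeta$. For the current $n_{0}$ and $n_{1}$, the iteration continues exactly when

\[\frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} \geq \frac{\left\lfloor n_{0}x\right\rfloor + 1 - \zeta}{n_{0}} \quad\Longleftrightarrow\quad \zeta \geq \frac{n_{1}\left(\left\lfloor n_{0}x\right\rfloor + 1\right) - n_{0}\left\lfloor n_{1}x\right\rfloor}{n_{1}},\]

so the right-hand side is the value of $\zeta$ at which the number of iterations changes. As a small illustration, with numbers chosen only for this example, take $x=\frac{7}{18}$ and $n_{\max}=20$. Algorithm 6 starts with $n_{0}=5$ and $n_{1}=3$, so the first threshold is

\[\frac{3\cdot(1+1) - 5\cdot 1}{3} = \frac{1}{3}.\]

After the update $n_{0}\leftarrow 5-3=2$ we get $n_{1}=1$, and the next threshold is $\frac{1\cdot(0+1) - 2\cdot 0}{1} = 1$. Hence, as far as Algorithm 6 is concerned, $[0,1)$ splits into $\left[0,\frac{1}{3}\right)$, where the output is $5$, and $\left[\frac{1}{3},1\right)$, where the output is $2$.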

Our constraint on $\xi$ and $\zeta$ is that the computation of $mn+s$ should not overflow, so suppose that there is a given limit $N_{\max}$ on how large $mn+s$ can be. Now, take a subinterval $[\zeta_{\min},\zeta_{\max})$ as described above. If there is at least one feasible choice of $\xi$ for some $\zeta\in[\zeta_{\min},\zeta_{\max})$, then such $\xi$ must lie in the interval

\[I:= \left(\frac{\left\lfloor n_{L,0}x\right\rfloor - \zeta_{\max}}{n_{L,0}}, \frac{\left\lfloor n_{U,0}x\right \rfloor + 1 - \zeta_{\min}}{n_{U,0}}\right),\]

where $n_{L,0}$ and $n_{U,0}$ are the supposed output of Algorithm 5 and Algorithm 6, respectively, which must stay constant as long as $\zeta\in[\zeta_{\min},\zeta_{\max})$.

Next, we will take a loop over all numbers of the form $\frac{m}{2^{k}}$ in $I$ and check one by one if it is really a possible choice of $\xi$. The order of examination will be the lexicographic order on $(k,m)$, that is, from smaller $k$ to larger $k$, and for given $k$, from smaller $m$ to larger $m$. To find the smallest $k$, let $\Delta$ be the size of the interval above, that is,

\[\Delta := \frac{\left\lfloor n_{U,0}x\right \rfloor + 1 - \zeta_{\min}}{n_{U,0}} - \frac{\left\lfloor n_{L,0}x\right\rfloor - \zeta_{\max}}{n_{L,0}}.\]

Define $k_{0}:=\left\lfloor\log_{2}\frac{1}{\Delta}\right\rfloor$, then we have

\[2^{k_{0}}\Delta \leq 1 < 2^{k_{0}+1}\Delta \leq 2,\]

and this means that there is at most one integer in the interval $2^{k_{0}}I$ while there is at least one integer in the interval $2^{k_{0}+1}I$. Therefore, we first see if there is one in $2^{k_{0}}I$, and if not, then have a look at $2^{k_{0}+1}I$ where the existence is guaranteed. Also, note that if there were none in $2^{k_{0}}I$, then $2^{k_{0}+1}I$ should have exactly one integer and that integer must be odd.
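The outcome of this search can be cross-checked with a much more naive procedure: simply try $k=0,1,2,\ \cdots$ until $2^{k}I$ contains an integer. The sketch below does that for an open interval with rational endpoints. It is not the $k_{0}$-based shortcut described above, only a brute-force reference, and it assumes $0\leq a<b$, $k\geq 0$, and that all products fit in 64 bits; the names `Dyadic` and `smallest_dyadic_in` are invented for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// The dyadic rational m / 2^k with the smallest k >= 0 that lies in an interval.
struct Dyadic {
    int k;
    std::int64_t m;
};

// Searches the open interval (a_num/a_den, b_num/b_den) by increasing k.
Dyadic smallest_dyadic_in(std::int64_t a_num, std::int64_t a_den,
                          std::int64_t b_num, std::int64_t b_den) {
    assert(a_num >= 0 && a_den > 0 && b_den > 0);
    assert(a_num * b_den < b_num * a_den);  // a < b
    for (int k = 0;; ++k) {
        const std::int64_t scale = std::int64_t(1) << k;
        // Smallest integer strictly greater than 2^k * a.
        const std::int64_t m = (scale * a_num) / a_den + 1;
        // Accept if m < 2^k * b, i.e., m / 2^k is inside the open interval.
        if (m * b_den < scale * b_num) return {k, m};
    }
}
```

For example, for $I=\left(\frac{1}{3},\frac{1}{2}\right)$ we have $\Delta=\frac{1}{6}$ and $k_{0}=2$; the interval $2^{2}I=\left(\frac{4}{3},2\right)$ contains no integer, and `smallest_dyadic_in(1, 3, 1, 2)` returns $k=3$, $m=3$, an odd numerator as predicted.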

In either case, we can find the smallest possible integer $k$ such that

\[\xi = \frac{m}{2^{k}}\]

is in the interval $I$ for some $m\in\mathbb{Z}$. (When $2^{k_{0}}I$ does not contain an integer, then $k = k_{0}+1$, and if it contains an integer, then $k = k_{0}-b$ where $b$ is the highest exponent of $2$ such that $2^{b}$ divides the unique integer in $2^{k_{0}}I$.) We will start from there and successively increase the value of $k$. For currently given $k$, we take a loop over all $m$ such that $\xi = \frac{m}{2^{k}}$ stays inside $I$. (By the construction, there is exactly one possible $m$ for the smallest $k$ which we start from.) In order for this $\xi$ to be allowed by a choice of $\zeta$, according to Algorithm 5, $\zeta$ should be at least

\[\zeta_{0} := \left\lfloor n_{L,0}x \right\rfloor - n_{L,0}\xi = \frac{2^{k}\left\lfloor n_{L,0}x\right\rfloor - n_{L,0}m}{2^{k}}.\]

Since we want our $\xi$ to satisfy the upper bound given by Algorithm 6 as well, we want to choose $\zeta$ to be as small as possible as long as $\zeta\geq\zeta_{0}$ is satisfied. Hence, we want to take $\zeta=\zeta_{0}$, but this is not always possible because $\zeta_{0}\geq\zeta_{\min}$ is not always true. (The above $\zeta_{0}$ can even be negative, so in the actual algorithm we may truncate it to be nonnegative.)

If $\zeta_{0}\geq\zeta_{\min}$ is satisfied, then we may indeed take $\zeta=\zeta_{0}$. By the construction, $\zeta_{0}<\zeta_{\max}$ is automatically satisfied, so we want to check two things: (1) the smallness constraint $mn_{\max} + s \leq N_{\max}$ with $s=2^{k}\zeta_{0}$, and (2) the true upper bound

\[\xi < \frac{\left\lfloor n_{U,0}x\right \rfloor + 1 - \zeta_{0}}{n_{U,0}}\]

given by Algorithm 6. If both of the conditions are satisfied, then $k$, $m$, and $s=2^{k}\zeta_{0}$ are valid solutions. Otherwise, we conclude that $\xi$ does not yield an admissible choice of $(k,m,s)$.
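The two checks for the case $\zeta_{0}\geq\zeta_{\min}$ are pure integer arithmetic once everything is multiplied by $2^{k}$. Here is a minimal sketch for rational $x=\frac{p}{q}$. The function name, the struct, and the parameter layout are all hypothetical; it does not test $\zeta_{0}\geq\zeta_{\min}$ itself, and it assumes the inputs are small enough that every intermediate product fits in 64 bits.

```cpp
#include <cstdint>

using u64 = std::uint64_t;

struct CandidateCheck {
    bool admissible;
    u64 s;  // s = 2^k * zeta_0
};

// Checks one candidate xi = m / 2^k against conditions (1) and (2).
CandidateCheck check_candidate(u64 p, u64 q, u64 n_L0, u64 n_U0,
                               u64 n_max, u64 N_max, unsigned k, u64 m) {
    const u64 floor_L = (n_L0 * p) / q;  // floor(n_L0 * x)
    const u64 floor_U = (n_U0 * p) / q;  // floor(n_U0 * x)

    // s = max(2^k * floor(n_L0 x) - n_L0 * m, 0), i.e., zeta_0 truncated at 0.
    const u64 scaled = floor_L << k;
    const u64 product = n_L0 * m;
    const u64 s = scaled > product ? scaled - product : 0;

    // (1) m * n_max + s <= N_max, rearranged so that it cannot overflow.
    const bool fits = s <= N_max && m <= (N_max - s) / n_max;

    // (2) m / 2^k < (floor(n_U0 x) + 1 - zeta_0) / n_U0,
    //     multiplied through by 2^k * n_U0.
    const bool below_upper = m * n_U0 + s < ((floor_U + 1) << k);

    return {fits && below_upper, s};
}
```

As a hand-worked usage example, take $x=\frac{7}{18}$ and $n_{\max}=2^{32}-1$, for which $n_{L,0}=2^{32}-4$ (the largest multiple of $18$) and $n_{U,0}=2^{32}-17$ (the largest $v$ with $7v\equiv -1\ (\mathrm{mod}\ 18)$). With $N_{\max}=2^{64}-1$, $k=33$, and $m=3340530119$, the function reports an admissible candidate with $s=477218588$.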

If $\zeta_{0}<\zeta_{\min}$, then we have to increase the numerator of $\zeta_{0}$ to put it inside $[\zeta_{\min},\zeta_{\max})$, so let $\zeta$ be the one obtained by adding the smallest positive integer to the numerator of $\zeta_{0}$ which ensures $\zeta\geq\zeta_{\min}$. In this case, since we cannot easily estimate how large $\zeta$ is compared to $\zeta_{0}$, we have to check $\zeta<\zeta_{\max}$ in addition to the other two conditions.

If we failed to find any admissible choice of $(k,m,s)$ with our $\xi$, then we increase the numerator $m$ and see if that works out until $\xi=\frac{m}{2^{k}}$ goes outside of the search interval $I$. After exhausting all $m$’s, now we increase $k$ by one and repeat the same procedure. Note that even though all of $\xi=\frac{m}{2^{k}}$’s with even numerators are already considered when we were looking at a smaller $k$, we do not exclude them from the search because for the case $\zeta_{0}<\zeta_{\min}$, having a larger denominator of $\zeta$ might allow the gap $\zeta - \zeta_{0}$ to be smaller.

This loop over $k$ should be stopped when we arrive at a point where the smallest integer $m$ in $2^{k}I$ already fails to satisfy

\[mn_{\max} + 2^{k}\zeta_{0} \leq N_{\max}.\]

If we exhausted all $k$’s satisfying the above, then now we conclude that there is no $\zeta\in[\zeta_{\min},\zeta_{\max})$ yielding an admissible choice of $(k,m,s)$, so we should move onto the next subinterval.

After filling out some omitted details, we arrive at the following algorithm.

Algorithm 7 (Finding feasible values of $\xi$ and $\zeta$).

Input: $x\in\mathbb{R}$, $n_{\max}\in\mathbb{Z}_{>0}$, $N_{\max}\in\mathbb{Z}_{>0}$.

Output: $k$, $m$, and $s$, where $\xi = \frac{m}{2^{k}}$ and $\zeta = \frac{s}{2^{k}}$, so that we have $\left\lfloor nx\right\rfloor = \left\lfloor \frac{nm + s}{2^{k}}\right\rfloor$ for all $n=1,\ \cdots\ ,n_{\max}$.

1. Find the largest $n=1,\ \cdots\ ,n_{\max}$ that maximizes $\frac{\left\lfloor nx\right\rfloor}{n}$ and call it $n_{L,0}$.

2. Find the smallest $n=1,\ \cdots\ ,n_{\max}$ that minimizes $\frac{\left\lfloor nx\right\rfloor + 1}{n}$ and call it $n_{U,0}$.

3. Set $\zeta_{\max} \leftarrow 0$, $\zeta_{L,\max} \leftarrow 0$, $\zeta_{U,\max} \leftarrow 0$, $n_{L,1} \leftarrow 0$, $n_{U,1} \leftarrow 0$.

4. Check if $\zeta_{\max} = \zeta_{L,\max}$. If that is the case, then we have to update $\zeta_{L,\max}$. Set $n_{L,0}\leftarrow n_{L,0}+n_{L,1}$. If $n_{L,0}=n_{\max}$, then set $\zeta_{L,\max}\leftarrow 1$. Otherwise, find the largest $n=1,\ \cdots\ ,n_{\max}-n_{L,0}$ that maximizes $\frac{\left\lfloor nx\right\rfloor}{n}$, assign it to $n_{L,1}$, and set \[ \zeta_{L,\max}\leftarrow \min\left(\frac{n_{L,1}\left\lfloor n_{L,0}x\right\rfloor - n_{L,0}\left\lfloor n_{L,1}x\right\rfloor}{n_{L,1}}, 1\right). \]

5. Check if $\zeta_{\max} = \zeta_{U,\max}$. If that is the case, then we have to update $\zeta_{U,\max}$. Set $n_{U,0}\leftarrow n_{U,0}-n_{U,1}$. If $n_{U,0}=1$, then set $\zeta_{U,\max}\leftarrow 1$. Otherwise, find the largest $n=1,\ \cdots\ ,n_{U,0}-1$ that maximizes $\frac{\left\lfloor nx\right\rfloor}{n}$, assign it to $n_{U,1}$, and set \[ \zeta_{U,\max}\leftarrow \min\left(\frac{n_{U,1}\left(\left\lfloor n_{U,0}x\right\rfloor + 1\right) - n_{U,0}\left\lfloor n_{U,1}x\right\rfloor}{n_{U,1}}, 1\right). \]

6. Set $\zeta_{\min}\leftarrow \zeta_{\max}$, $\zeta_{\max} \leftarrow \min\left(\zeta_{L,\max},\zeta_{U,\max}\right)$, and \[ \Delta \leftarrow \frac{\left\lfloor n_{U,0}x\right \rfloor + 1 - \zeta_{\min}}{n_{U,0}} - \frac{\left\lfloor n_{L,0}x\right\rfloor - \zeta_{\max}}{n_{L,0}}. \] If $\Delta\leq 0$, then this means the search interval $I$ is empty. In this case, go to Step 16. Otherwise, set $k\leftarrow \left\lfloor\log_{2}\frac{1}{\Delta}\right\rfloor$ and proceed to the next step.

7. Compute \[ m \leftarrow \left\lfloor \frac{2^{k}\left(\left\lfloor n_{L,0}x\right\rfloor - \zeta_{\max}\right)} {n_{L,0}} \right\rfloor + 1 \] and inspect the inequality \[ m < \frac{2^{k}\left( \left\lfloor n_{U,0}x\right \rfloor + 1 - \zeta_{\min} \right)}{n_{U,0}}. \] If it does not hold, then set $k\leftarrow k + 1$ and recompute $m$ accordingly. Otherwise, set $k\leftarrow k - b$ where $b$ is the greatest integer such that $2^{b}$ divides $m$, and set $m\leftarrow m/2^{b}$.

8. Set \[ \zeta_{0} \leftarrow \max\left( \frac{2^{k}\left\lfloor n_{L,0}x\right\rfloor - n_{L,0}m}{2^{k}}, 0\right) \] and inspect the inequality $mn_{\max} + 2^{k}\zeta_{0}\leq N_{\max}$. If it does not hold, then every candidate we can further find in the interval $[\zeta_{\min},\zeta_{\max})$ consists of numbers that are too large, so we move on to the next subinterval. In this case, go to Step 16. Otherwise, proceed to the next step.

9. If $\zeta_{0}\geq\zeta_{\min}$, then $\zeta_{0}\in[\zeta_{\min},\zeta_{\max})$ and $mn_{\max} + 2^{k}\zeta_{0} \leq N_{\max}$ hold, so it is enough to check the inequality \[ \frac{m}{2^{k}}<\frac{\left\lfloor n_{U,0}x\right \rfloor + 1 - \zeta_{0}} {n_{U,0}}. \] If this holds, then we have found an admissible choice of $(k,m,s)$, so return. Otherwise, we conclude that $\xi=\frac{m}{2^{k}}$ does not yield an admissible answer. In this case, go to Step 13.

10. If $\zeta_{0}<\zeta_{\min}$, then set $a\leftarrow \left\lceil 2^{k}(\zeta_{\min} - \zeta_{0})\right\rceil$ and $\zeta\leftarrow \zeta_{0} + \frac{a}{2^{k}}$ so that $\zeta\geq\zeta_{\min}$ is satisfied. Check if $\zeta<\zeta_{\max}$ holds, and if that is not the case, then go to Step 13.

11. Check if $mn_{\max} + 2^{k}\zeta \leq N_{\max}$ holds. If that is not the case, then go to Step 13.

12. Inspect the inequality \[ \frac{m}{2^{k}}<\frac{\left\lfloor n_{U,0}x\right \rfloor + 1 - \zeta}{n_{U,0}}. \] If it does hold, then we have found an admissible choice of $(k,m,s)$, so return. Otherwise, proceed to the next step.

13. Set $m\leftarrow m+1$ and inspect the inequality \[ \frac{m}{2^{k}} < \frac{\left\lfloor n_{U,0}x\right\rfloor + 1 - \zeta_{\min}} {n_{U,0}}. \] If it does not hold, then go to Step 15.

14. If it does hold, then set \[ \zeta_{0} \leftarrow \max\left( \frac{2^{k}\left\lfloor n_{L,0}x\right\rfloor - n_{L,0}m}{2^{k}}, 0\right) \] and inspect the inequality $mn_{\max} + 2^{k}\zeta_{0}\leq N_{\max}$. If it does hold, then go back to Step 9. Otherwise, proceed to the next step.

15. Set $k\leftarrow k+1$ and \[ m \leftarrow \left\lfloor \frac{2^{k}\left(\left\lfloor n_{L,0}x\right\rfloor - \zeta_{\max}\right)} {n_{L,0}} \right\rfloor + 1, \] and go to Step 8.

16. We have failed to find any admissible choice of $(k,m,s)$ so far, so we have to move on to the next subinterval. If $\zeta_{\max}=1$, then we already have exhausted the whole interval, so return with FAIL. Otherwise, go back to Step 4.

Remarks.

Whenever we update $\zeta_{L,\max}$ and $\zeta_{U,\max}$, the updated values must always be strictly bigger than the previous values unless they already have reached their highest value $1$. Let us see the case of $\zeta_{L,\max}$; the case of $\zeta_{U,\max}$ is similar. Suppose that initially we had $n_{L,0} = n_{0}$, $n_{L,1} = n_{1}$, and after recomputing $n_{L,1}$ we got $n_{L,1} = n_{2}$. Then the new value of $\zeta_{L,\max}$ and the old value of it are given as \[\label{eq:gap between successive zeta max} \frac{n_{2}\left\lfloor(n_{0}+n_{1})x\right\rfloor - (n_{0}+n_{1})\left\lfloor n_{2}x\right\rfloor}{n_{2}},\quad \frac{n_{1}\left\lfloor n_{0}x\right\rfloor - n_{0}\left\lfloor n_{1}x\right\rfloor}{n_{1}},\] respectively. Applying $\eqref{eq:floor splits; lower bound}$, it turns out that the first one minus the second one is equal to \[ (n_{0}+n_{1})\left(\frac{\left\lfloor n_{1}x\right\rfloor}{n_{1}} - \frac{\left\lfloor n_{2}x\right\rfloor}{n_{2}} \right).\] Now, recall the way $n_{1}$ is chosen: we first find the best rational approximation of $x$ from below in the range $\{1,\ \cdots\ ,n_{\max}-n_{0}\}$, call its denominator $q_{*}$, and set $n_{1}$ to be the largest multiple of $q_{*}$. Since $n_{1}$ is the largest multiple, it follows that $n_{\max} - n_{0} - n_{1}$ should be strictly smaller than $q_{*}$. Therefore, the best rational approximation in the range $\{1,\ \cdots\ ,n_{\max}-n_{0}-n_{1}\}$ should be strictly worse than what $q_{*}$ gives. This shows that the difference of the two values in $\eqref{eq:gap between successive zeta max}$ is strictly positive.
When there indeed exists at least one admissible choice of $(k,m,s)$ given $x$, $n_{\max}$, and $N_{\max}$, then Algorithm 7 finds the triple $(k,m,s)$ with the smallest $k$, and then $m$, and then $s$, within the first subinterval $[\zeta_{\min},\zeta_{\max})$ where a solution can be found. To make it always find $(k,m,s)$ with the smallest $k$, $m$, and $s$ among all solutions, we can simply modify the algorithm a little bit so that it does not immediately stop when it finds a solution, but rather continues to the other subintervals and compares any solutions found there with the previous best one.

An actual implementation of this algorithm can be found here. (Click here to test it live.)
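For tiny inputs, the output specification of Algorithm 7 can also be checked against an exhaustive search. The sketch below is not the implementation referred to above; it is a brute-force reference under the assumptions that $x=\frac{p}{q}$ is rational, $\zeta\in[0,1)$ (so $0\leq s<2^{k}$), and all numbers stay far below 64 bits. The names `Triple`, `works_for_all_n`, `brute_force_search`, and the cutoff `k_limit` are invented for this illustration.

```cpp
#include <cstdint>

using u64 = std::uint64_t;

struct Triple {
    unsigned k;
    u64 m;
    u64 s;
};

// Does floor(n p / q) == floor((n m + s) / 2^k) hold for every n = 1..n_max?
bool works_for_all_n(u64 p, u64 q, u64 n_max, const Triple& t) {
    for (u64 n = 1; n <= n_max; ++n) {
        if ((n * p) / q != ((n * t.m + t.s) >> t.k)) return false;
    }
    return true;
}

// Tries every (k, m, s) in lexicographic order with k <= k_limit,
// 0 <= s < 2^k, and m * n_max + s <= N_max. Only sensible for tiny inputs.
bool brute_force_search(u64 p, u64 q, u64 n_max, u64 N_max,
                        unsigned k_limit, Triple& result) {
    for (unsigned k = 0; k <= k_limit; ++k) {
        for (u64 m = 0; m * n_max <= N_max; ++m) {
            for (u64 s = 0; s < (u64(1) << k) && m * n_max + s <= N_max; ++s) {
                const Triple candidate{k, m, s};
                if (works_for_all_n(p, q, n_max, candidate)) {
                    result = candidate;
                    return true;
                }
            }
        }
    }
    return false;
}
```

For $x=\frac{7}{18}$, $n_{\max}=20$, and $N_{\max}=2^{16}-1$, one admissible triple is $(k,m,s)=(6,25,0)$, since $\frac{7}{18}\leq\frac{25}{64}<\frac{2}{5}$; the search therefore succeeds with some $k\leq 6$.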

Results by Lemire et al.

Consider the special case when $x=\frac{1}{q}$, $q\leq n_{\max}$, and $\xi=\zeta$. In this case, we have the following result:

Theorem 8 (Lemire et al., 2021).

For any positive integers $q$ and $n_{\max}$ with $q\leq n_{\max}$, we have
\[\left\lfloor\frac{n}{q}\right\rfloor = \left\lfloor (n+1)\xi\right\rfloor\]
for all $n=0,\ \cdots\ ,n_{\max}$ if and only if
\[\left(1 - \frac{1}{\left\lfloor n_{\max}/q\right\rfloor q+1}\right)\frac{1}{q} \leq \xi < \frac{1}{q}.\]
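Before going through the proof, the statement is easy to probe numerically. The following sketch compares both sides of the equivalence for every $\xi=\frac{a}{d}$ on a grid with a fixed denominator $d$. The function names and the choice of a rational grid are assumptions made for this illustration, and the inputs are assumed small enough that no product overflows 64 bits.

```cpp
#include <cstdint>

using u64 = std::uint64_t;

// Does floor(n / q) == floor((n + 1) * xi) hold for all n = 0..n_max, xi = a / d?
bool identity_holds(u64 q, u64 n_max, u64 a, u64 d) {
    for (u64 n = 0; n <= n_max; ++n) {
        if (n / q != ((n + 1) * a) / d) return false;
    }
    return true;
}

// Is xi = a / d inside the interval given in Theorem 8?
bool in_theorem8_interval(u64 q, u64 n_max, u64 a, u64 d) {
    const u64 u = (n_max / q) * q + 1;
    // (1 - 1/u) / q <= a / d   <=>   (u - 1) * d <= a * u * q
    const bool lower = (u - 1) * d <= a * u * q;
    // a / d < 1 / q            <=>   a * q < d
    const bool upper = a * q < d;
    return lower && upper;
}

// True if both predicates agree for every a = 0..d.
bool theorem8_agrees(u64 q, u64 n_max, u64 d) {
    for (u64 a = 0; a <= d; ++a) {
        if (identity_holds(q, n_max, a, d) != in_theorem8_interval(q, n_max, a, d)) {
            return false;
        }
    }
    return true;
}
```

For example, with $q=3$ and $n_{\max}=10$ the interval is $\left[\frac{3}{10},\frac{1}{3}\right)$, so $\xi=\frac{300}{1000}$ satisfies the identity while $\xi=\frac{299}{1000}$ already fails at $n=9$.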

We can give an alternative proof of this fact using what we have developed so far. Essentially, this is due to the fact that Algorithm 5 finishes its iteration at the first step and does not proceed further.

Proof. $(\Rightarrow)$ Take $n=q-1$, then we have $0 = \left\lfloor q\xi\right\rfloor$, thus $\xi<\frac{1}{q}$ follows. On the other hand, take $n = \left\lfloor\frac{n_{\max}}{q}\right\rfloor q$, then we have
\[\left\lfloor\frac{n_{\max}}{q}\right\rfloor \leq \left(\left\lfloor\frac{n_{\max}}{q}\right\rfloor q + 1\right)\xi,\]
so rearranging this gives the desired lower bound on $\xi$.

$(\Leftarrow)$ It is enough to show that $\xi$ satisfies
\[\max_{n=1,\ \cdots\ ,n_{\max}} \frac{\left\lfloor n/q\right\rfloor - \xi}{n} \leq \xi < \min_{n=1,\ \cdots\ ,n_{\max}} \frac{\left\lfloor n/q\right\rfloor + 1 - \xi}{n}.\]
Following Algorithm 5, let $n_{0}$ be the largest maximizer of $\frac{\left\lfloor n/q\right\rfloor}{n}$, i.e., $n_{0} = \left\lfloor\frac{n_{\max}}{q}\right\rfloor q$. Then we know from Algorithm 5 that $n_{0}$ is the largest maximizer of $\frac{\left\lfloor n/q\right\rfloor - \xi}{n}$ if and only if
\[\frac{\left\lfloor n/q\right\rfloor}{n} < \frac{\left\lfloor n_{0}/q\right\rfloor - \xi}{n_{0}} = \frac{1}{q} - \frac{\xi}{\left\lfloor n_{\max}/q\right\rfloor q}\]
holds for all $n=1,\ \cdots\ ,n_{\max}-n_{0}$. Pick any such $n$ and let $r$ be the remainder of $n/q$, then we have
\[\frac{\left\lfloor n/q\right\rfloor}{n} = \frac{(n-r)/q}{n} = \frac{1}{q} - \frac{r}{nq}.\]
Hence, we claim that
\[\frac{\xi}{\left\lfloor n_{\max}/q\right\rfloor} < \frac{r}{n}\]
holds for all such $n$. This is clear, because $\xi<\frac{1}{q}$ while $n_{\max} - n_{0}$ is at most $q-1$, so the right-hand side is at least $\frac{1}{q-1}$. Therefore, $n_{0}$ is indeed the largest maximizer of $\frac{\left\lfloor n/q\right\rfloor - \xi}{n}$ for $n=1,\ \cdots\ ,n_{\max}$, and the inequality
\[\frac{\left\lfloor n_{0}/q\right\rfloor - \xi}{n_{0}} \leq \xi\]
is equivalent to
\[\xi \geq \frac{\left\lfloor n_{0}/q\right\rfloor}{n_{0}+1} = \frac{1}{q}\left(1 - \frac{1}{n_{0}+1}\right),\]
which we already know.

For the upper bound, note that it is enough to show that
\[\frac{1}{q} \leq \frac{\left\lfloor n/q\right\rfloor + 1}{n+1}\]
holds for all $n=1,\ \cdots\ ,n_{\max}$. We can rewrite this inequality as
\[\frac{n}{q} - \left\lfloor\frac{n}{q}\right\rfloor \leq \frac{q-1}{q},\]
which means nothing but that the remainder of $n$ divided by $q$ is at most $q-1$, which is trivially true. $\quad\blacksquare$

Note the fact that the numerator of $x$ is $1$ is crucially used in this proof.

Lemire et al. also showed that, if $n_{\max} = 2^{N} - 1$ and $q$ is not a power of $2$, then whenever the best magic constant predicted by Theorem 2 does not fit into a word, the best magic constant predicted by the above theorem should fit into a word. In fact, those assumptions about being power of $2$ or not power of $2$ and such can be removed, as shown below.

Theorem 9 (Improves Lemire et al.)

Let $x$ be a positive real number and $u,v$ be positive integers. Let $k$ be the smallest integer such that the set
\[\left[2^{k}x, 2^{k}x\left(1 + \frac{1}{v}\right)\right) \cap \mathbb{Z}\]
is not empty. If $2^{k}x \geq \max\left(u,v\right)$, then the set
\[\left[2^{k-1}x\left(1-\frac{1}{u}\right), 2^{k-1}x\right) \cap \mathbb{Z}\]
must be nonempty as well.

Thus, if we let $x=\frac{1}{q}$, $v$ to be the $v$ in Theorem 2 and $u:= \left\lfloor \frac{n_{\max}}{q}\right\rfloor q+1$, then whenever the best magic constant $m=\left\lceil \frac{2^{k}}{q} \right\rceil$ happens to be greater than or equal to $n_{\max}$, we always have

\[\left\lceil \frac{2^{k-1}}{q}\left(1 - \frac{1}{u}\right)\right\rceil < \frac{2^{k-1}}{q}.\]

Recall that we have seen in a previous section that $\frac{2^{k}}{q} < 2v$ always holds, so the right-hand side of the above is bounded by $v\leq n_{\max}$, so indeed this shows that the best magic constant predicted by Theorem 8 is bounded by $n_{\max}$ in this case.

Proof. By definition of $k$, we must have
\[\left[2^{k-1}x, 2^{k-1}x\left(1 + \frac{1}{v}\right)\right) \cap \mathbb{Z} = \emptyset.\]
Let
\[\alpha := \left\lceil 2^{k-1}x \right\rceil - 2^{k-1}x \in [0,1),\]
then we must have
\[2^{k-1}x + \alpha \geq 2^{k-1}x\left(1 + \frac{1}{v}\right),\]
thus
\[\label{eq:bound in the proof of the complementary result} \alpha \geq \frac{2^{k-1}x}{v}.\]
Now, note that
\[\begin{aligned} 2^{k-1}x\left(1-\frac{1}{u}\right) &= \left\lceil 2^{k-1}x \right\rceil - \alpha - \frac{2^{k-1}x}{u}, \end{aligned}\]
so it is enough to show that
\[\alpha + \frac{2^{k-1}x}{u} \geq 1.\]
Note that from $\eqref{eq:bound in the proof of the complementary result}$, we know
\[\alpha + \frac{2^{k-1}x}{u} \geq \frac{2^{k-1}x}{v} + \frac{2^{k-1}x}{u} \geq \frac{2^{k}x}{\max\left(u,v\right)},\]
and the right-hand side is bounded below by $1$ by the assumption, so we are done. $\quad\blacksquare$
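As a sanity check of Theorem 9 in the case $x=\frac{1}{q}$, one can clear the denominators and test the statement on a small grid of $(q,u,v)$. This is only an illustrative sketch: the function names are made up, the search for $k$ starts at $k=0$ (which is enough for $x=\frac{1}{q}$ with $q\geq 1$, because no $k<0$ can produce an integer in the first set), and the inputs are assumed small enough to avoid overflow.

```cpp
#include <cstdint>

using u64 = std::uint64_t;

u64 ceil_div(u64 a, u64 b) { return (a + b - 1) / b; }

// Smallest k >= 0 such that [2^k / q, (2^k / q)(1 + 1/v)) contains an integer.
unsigned smallest_k(u64 q, u64 v) {
    for (unsigned k = 0;; ++k) {
        const u64 pow2 = u64(1) << k;
        // ceil(2^k / q) < 2^k (v + 1) / (q v), cleared of denominators.
        if (ceil_div(pow2, q) * q * v < pow2 * (v + 1)) return k;
    }
}

// Returns false only if (q, u, v) is a counterexample to Theorem 9 with x = 1/q.
bool theorem9_holds(u64 q, u64 u, u64 v) {
    const unsigned k = smallest_k(q, v);
    const u64 pow2 = u64(1) << k;
    const u64 larger = u > v ? u : v;
    // Skip when the hypothesis 2^k x >= max(u, v) fails (k == 0 only if q == 1).
    if (k == 0 || pow2 < q * larger) return true;
    const u64 half = pow2 >> 1;
    // Is there an integer in [2^(k-1) (u - 1) / (q u), 2^(k-1) / q)?
    return ceil_div(half * (u - 1), q * u) * q < half;
}

// Counts counterexamples over 2 <= q <= limit and 1 <= u, v <= limit.
unsigned theorem9_violations(u64 limit) {
    unsigned violations = 0;
    for (u64 q = 2; q <= limit; ++q)
        for (u64 u = 1; u <= limit; ++u)
            for (u64 v = 1; v <= limit; ++v)
                if (!theorem9_holds(q, u, v)) ++violations;
    return violations;
}
```

For instance, for $q=7$ and $u=v=4$ we get $k=5$ with $\frac{32}{7}\geq 4$, and the second set $\left[\frac{16}{7}\cdot\frac{3}{4},\frac{16}{7}\right)$ indeed contains the integer $2$.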

Therefore, Theorem 2 and Theorem 8 are enough to cover all cases when $x=\frac{1}{q}$ with $q\leq n_{\max}$. Then does Algorithm 7 have any relevance in practice?

An example usage of Algorithm 7

I do not know; I mean, as I pointed out before, I am not sure how important it is in general to cover the case $x=\frac{p}{q}$ with $p\neq 1$. But anyway, when $p\neq 1$, there certainly are some cases where Algorithm 7 might be useful. Consider the following example:

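#include <cstdint>

// Computes floor(7n / 18); the product is widened to 64 bits so that it cannot overflow.
std::uint32_t div(std::uint32_t n) {
    return (std::uint64_t(n) * 7) / 18;
}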

Here I provide three different ways of doing this computation:

div1(unsigned int):
        mov     ecx, edi
        lea     rax, [8*rcx]
        sub     rax, rcx
        movabs  rcx, 1024819115206086201
        mul     rcx
        mov     rax, rdx
        ret

div2(unsigned int):
        mov     eax, edi
        movabs  rcx, 7173733806472429568
        mul     rcx
        mov     rax, rdx
        ret

div3(unsigned int):
        mov     ecx, edi
        mov     eax, 3340530119
        imul    rax, rcx
        add     rax, 477218588
        shr     rax, 33
        ret

(Check it out!)

The first version (div1) is what my compiler (clang) generated. It first computes n * 7 by computing n * 8 - n. And then, it computes the division by $18$ using the multiply-and-shift technique. In particular, it deliberately chose the shifting amount to be $64$ so that it can skip the shifting. Clang is indeed quite clever in the sense that it actually leverages the fact that n is only a $32$-bit integer, because the minimum amount of shifting required for the division by $18$ is $68$ if all $64$-bit dividends are considered.

But it is not clever enough to realize that the two operations * 7 and / 18 can be merged. The second version div2 is what one can get by utilizing this fact and doing these two computations in one multiply-and-shift. Using Theorem 2, it can be shown that the minimum amount of shifting for doing this with all $32$-bit dividends is $36$, and in that case the corresponding magic number is $26724240953$. Since the computation cannot be done in $64$-bit anyway, I deliberately multiplied this magic number by $2^{28}$ and got $7173733806472429568$ to make the shifting amount equal to $64$, so that the actual shifting instruction is not needed. And the result is a clear win over div1, although lea and sub are just trivial instructions and the difference is not huge.

Now, if we do this with multiply-add-and-shift, we can indeed complete our computation inside $64$-bit, which is the third version div3. Using Algorithm 7, it turns out $(k,m,s) = (33, 3340530119, 477218588)$ works for all $32$-bit dividends, i.e., we have

\[\left\lfloor \frac{7n}{18}\right\rfloor = \left\lfloor \frac{3340530119\cdot n + 477218588}{2^{33}}\right\rfloor\]

for all $n=0,\ \cdots\ ,2^{32}-1$. The maximum possible value of the numerator $3340530119\cdot n + 477218588$ is strictly less than $2^{64}$ in this case. Hence, we get div3.
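In C++ terms, div3 corresponds to something like the following. This is a sketch rather than the exact source behind the assembly above; the helper `div3_matches`, which compares against the plain division on a strided subset of inputs plus the top of the range, is a name made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Reference: floor(7n / 18) computed with a 64-bit division.
std::uint32_t div_reference(std::uint32_t n) {
    return static_cast<std::uint32_t>((std::uint64_t(n) * 7) / 18);
}

// Multiply-add-and-shift with (k, m, s) = (33, 3340530119, 477218588).
// The intermediate value stays below 2^64 for every 32-bit n.
std::uint32_t div3(std::uint32_t n) {
    return static_cast<std::uint32_t>(
        (std::uint64_t(n) * 3340530119u + 477218588u) >> 33);
}

// Compares div3 with the reference on every `stride`-th input and on the
// last 64 inputs of the 32-bit range. stride == 1 gives an exhaustive check.
bool div3_matches(std::uint32_t stride) {
    assert(stride > 0);
    for (std::uint64_t n = 0; n <= 0xFFFFFFFFull; n += stride) {
        const std::uint32_t v = static_cast<std::uint32_t>(n);
        if (div3(v) != div_reference(v)) return false;
    }
    for (std::uint32_t d = 0; d < 64; ++d) {
        const std::uint32_t v = 0xFFFFFFFFu - d;
        if (div3(v) != div_reference(v)) return false;
    }
    return true;
}
```

The tightest input is $n=2^{32}-4$, the largest multiple of $18$ in range: there the numerator $3340530119\cdot n + 477218588$ is exactly $2^{33}\cdot\frac{7n}{18}$, so the lower bound holds with equality.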

It is hard to tell which one between div2 and div3 will be faster. Note that div3 has one more instruction than div2, but it does not invoke the $128$-bit multiplication (the mul instruction found in both div1 and div2). In typical modern x86-64 CPUs, $128$-bit multiplication uses twice as many computational resources as the $64$-bit one, so it might more easily become a bottleneck if there are many multiplications to be done. Also, as far as I know, Intel CPUs still do not provide SIMD instructions for $128$-bit multiplication, so div3 could be more aggressively vectorized, which may result in a massive speedup. Still, all of these are just speculations, and of course when it comes to performance there is nothing we can be sure about. Anyway, it is good to have multiple options we can try out.

Fixed-precision formatting of floating-point numbers

2022-12-28T00:00:00-08:00

TL;DR

This post is about an algorithm I developed for formatting IEEE-754 binary floating-point numbers in the fixed-precision form, like printf.
The algorithm is an alternative to Ryū-printf (not to be confused with Ryū), and is based on a similar trick that allows the computation to be done completely inside a bounded-precision integer arithmetic of fairly small maximum precision.
But it is not just a variant of Ryū-printf, because the core formula that powers it is quite different from the corresponding one of Ryū-printf.
The precomputed cache table it needs is much smaller than that of Ryū-printf; more importantly, there is a fairly flexible and systematic way of choosing the amount of trade-off between the performance and the table size.
And still, the performance is comparable to Ryū-printf.
The algorithm really in a nutshell:
1. Precompute and store sufficiently many bits of sufficiently many powers of $5$.
2. Profit.
If you are a “shut up and show me the code” type of person, please go down here, then you will find a benchmark graph and the link to the source code.
I would not consider this work to be “Done” yet, as there are lots of things that could be improved further. Especially the implementation right now is far from complete, is a complete mess, the exact antithesis of the DRY principle, and furthermore I haven’t done a good enough amount of formal testing.

Introduction

When formatting binary floating-point numbers into decimal strings, there are two common approaches: the shortest form with the roundtrip guarantee, and the fixed-precision form. The main topic of this post is the latter, but let me first briefly explain the former in order to emphasize how it differs from the latter.

As the name “shortest form” suggests, the methods falling into this category find the shortest decimal representation of the input binary floating-point number that will still be interpreted as the input number by any correct parser. In order to understand what this means, we first have to know what we mean by “a correct parser”. Note that with a limited number of bits, we can never encode all real numbers into floating-point data without loss. Thus, whenever we want to convert a real number into a floating-point data type, we should round the given number to one of the numbers that can be exactly represented in the chosen data type. A correct parser thus means an algorithm which consistently performs this conversion according to given rounding rules, without error. In practice, for a given instance $w$ of a floating-point data type, there is thus an interval of real numbers that, when given to a correct parser, are converted to $w$. The job of the aforementioned shortest form formatting methods is then to look into that interval and figure out the number with the shortest decimal description.

In contrast, the methods in the second category do not perform such a fancy search. What they do is just literally evaluate the exact value of the given binary floating-point number into decimal, and then cut the digits after the prescribed precision with rounding.

For example, the exact value of the binary floating-point number 0x1.3e9e4e4c2f344p+199 (written in hexfloats) is $999999999999999949387135297074018866963645011013410073083904$. Interpreting it as an instance of IEEE-754 binary64 format (which is what the data type double usually means) with the default rounding rules, the methods from the first category will output something like $1\times 10^{60}$. However, if configured to print only $17$ decimal digits, methods from the second category will output $9.9999999999999995 \times 10^{59}$, under the default rounding rules.
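The fixed-precision side of this example can be reproduced with `snprintf`. The sketch below assumes a C library whose `printf` family prints the correctly rounded digits of the exact binary value and uses a two-digit exponent (glibc behaves this way; other libraries may differ), and the helper names are made up for this illustration. The value is built with `std::ldexp` from its integer significand so that no hexfloat literal is needed.

```cpp
#include <cmath>
#include <cstdio>
#include <string>

// The double 0x1.3e9e4e4c2f344p+199: a 53-bit significand times 2^(199 - 52).
double example_value() {
    return std::ldexp(static_cast<double>(0x13e9e4e4c2f344ULL), 199 - 52);
}

// Scientific form with `digits` significant decimal digits.
std::string format_scientific(double x, int digits) {
    char buffer[512];
    std::snprintf(buffer, sizeof(buffer), "%.*e", digits - 1, x);
    return buffer;
}

// All integer digits of x with no fractional part; exact for this value
// because it is an integer.
std::string format_integer_digits(double x) {
    char buffer[512];
    std::snprintf(buffer, sizeof(buffer), "%.0f", x);
    return buffer;
}
```

Under those assumptions, `format_scientific(example_value(), 17)` yields `9.9999999999999995e+59`, and `format_integer_digits(example_value())` yields the $60$-digit exact value quoted above.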

Note that, here by shortest we do not mean the smallest number of characters in the final string; rather, this means the smallest number of decimal significand digits. The reason for mentioning this difference is because there is another, completely orthogonal classification of floating-point formatting methods: formatting into the fixed-point form, the scientific (aka the floating-point) form, or the general form. The fixed-point form means the usual $123.456$ or alike, while the scientific form means something like $1.234 \times 10^{17}$. In the former, we always put the decimal point “$.$” at its absolute position (between the $1$’s digit and the $0.1$’s digit), while in the latter, we always put the decimal point right after the first non-zero digit, and then append something like “$\times 10^{17}$” to indicate the distance of that position of the decimal point from its correct absolute position. The general form, on the other hand, chooses the better one between the fixed-point form and the scientific form according to certain criteria, which are usually some incarnation of the idea: “prefer the fixed-point form until it becomes too long”.

Note that the exact number of characters is highly dependent on many details of formatting, especially whether it is in the fixed-point form or the scientific form. However, that distinction has nothing to do with what specific digits will be in the final string. The one that determines the actual numbers that will appear in the final string is the choice between the shortest roundtrip versus the fixed-precision. For this reason, details like the exact number of characters in the final string are less interesting, while the aforementioned choice between two categories is something more fundamental.

Note.

The distinction between the terms “fixed-precision” and “fixed-point” is not quite standard. Indeed, in their literal sense, they just sound like the same thing, which is probably the reason for the confusion around them. Therefore, it is worth re-emphasizing that these two mean completely different and orthogonal things in this post. I hope there are some standard terms that can distinguish them unambiguously, but I have not found any existing ones, so I am currently just using “fixed-precision” vs “fixed-point”. Please let me know if there are standard terms or if you have a better suggestion.

Historically, fixed-precision methods came first. For example, functions like printf and its friends in C’s standard library only offer fixed-precision formatting. However, arguably the shortest roundtrip is a more natural approach, and it is becoming more common recently. I really think there are not so many situations where fixed-precision is absolutely desired, especially with absurdly large precision. Like, it simply makes no sense to print $100$ decimal digits of an instance of double, because it far exceeds the natural precision of the data type and it never contributes to better precision. In fact, I would even say that such an excessive number of digits promotes a false sense of precision, so should be avoided in general. If someone is purely curious about the exact decimal digits of the exact number stored in the memory, then well, that could be a valid reason for doing something like printf("%.100e", x), but I cannot think of any other reason.

A typical scenario when fixed-precision formatting is desired, rather than the shortest roundtrip, is when your output window is too small for typical outputs from the latter, or when you need the aesthetic appeal of nice alignment. Indeed, fixed-precision formatting is the natural choice for such a case. One might insist on using the shortest roundtrip formatting for such a case and then cut the results at a certain precision, but that would be considered double-rounding, which should be actively avoided. This is a very common and valid usage of fixed-precision formatting, but note that we never need things like printf("%.100e", x) in this case.
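To make the double-rounding hazard concrete, here is a small Python sketch. Python is used only because its `repr` happens to produce the shortest roundtrip output; the value $0.15$ and the one-decimal-place target are my own choices for illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN

x = 0.15  # the nearest double is slightly below 0.15

# Fixed-precision formatting rounds the exact stored value once.
direct = format(x, ".1f")

# Shortest roundtrip first, then a second rounding applied to that string.
shortest = repr(x)
twice = str(Decimal(shortest).quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN))

print(direct, shortest, twice)  # 0.1 0.15 0.2
```

The second rounding sees an exact tie that the stored value never had, so it lands on the wrong side.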

A funny thing about fixed-precision formatting is that it is a hard problem, in my opinion a much harder problem than shortest roundtrip formatting, despite its simpler look. The reason is precisely that the precision requested by the user can be arbitrarily large. This lengthy blog post would not have been needed at all if we only had to care about the small digits case. But if dealing with the large digits case requires too much effort while it is not something anyone really needs, then do we really need to care about it?

The thing is, someone still needs to implement it even though nobody really needs it in practice because, in my limited understanding, it is required by someone else for whatever reason, regardless of whether it is due to a sane engineering rationale or not. Seriously! Like, the C/C++ standards say it’s allowed to do printf("%.100e", x), so C/C++ standard library implementers have no choice but to make it work correctly. (To be honest, I am not a language lawyer and not 100% sure if this claim is completely correct. Apparently, it seems the only restrictions on the precision field in the format string are that (1) negative numbers are ignored, and (2) the number should be representable as an int, which allows even crazier things like printf("%.2000000000e", x). However, I am not completely sure if implementations are required to output the perfectly right answer all the time, or it is okay to just do their best.)

Actually, there is another quite compelling reason why it would be valuable for someone to think about this problem: a fast fixed-precision formatting algorithm can be used as a subroutine of a correct parser. For example, let us say that we want to convert the decimal number $1.70141193601674033557522515689509748736000000……000001 \times 10^{38}$ into IEEE-754 binary32 encoded binary number (which is what the data type float usually means). In this case, we cannot decide whether the number is closer to 0x1.p+127 or to 0x1.000002p+127 until we fully read the input. Note that doing this conversion with the straightforward multiplication of the power of $10$ with the significand requires handling of numbers with arbitrarily large precision. Or, an alternative strategy would be to generate the decimal digits of the midpoint between 0x1.p+127 and 0x1.000002p+127 using a fixed-precision formatting algorithm, and compare those digits with the input digits. This turns out to be a quite successful strategy, as demonstrated here.
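The midpoint comparison can be sketched in a few lines of Python. This is only an illustration of the idea: Python's exact integers stand in for a fixed-precision formatting algorithm, and the helper name `nearest_binary32` and its input convention (significand digits of a number of the form $d.ddd\ldots\times 10^{38}$, decimal point removed) are assumptions of mine.

```python
from fractions import Fraction

LO = float.fromhex("0x1p+127")
HI = float.fromhex("0x1.000002p+127")

# Exact midpoint of the two neighbors; here it happens to be an integer.
MID = (Fraction(LO) + Fraction(HI)) / 2
assert MID.denominator == 1
MID_DIGITS = str(MID.numerator)  # 39 digits, i.e. d.ddd... x 10^38

def nearest_binary32(digits: str) -> float:
    """Pick the neighbor nearest to the decimal whose digits are given.

    The input is assumed to lie between LO and HI and to be scaled as
    d.ddd... x 10^38, so comparing digit strings of equal length suffices.
    """
    width = max(len(digits), len(MID_DIGITS))
    a = digits.ljust(width, "0")
    b = MID_DIGITS.ljust(width, "0")
    if a < b:
        return LO
    if a > b:
        return HI
    return LO  # exact tie: round to even, and LO has the even significand

tail = "0" * 60 + "1"
print(MID_DIGITS)
print(nearest_binary32(MID_DIGITS + tail).hex())  # just above the midpoint
print(nearest_binary32(MID_DIGITS).hex())         # exactly the midpoint
```

Note that the decision flips only at the very last input digit, which is why the parser cannot stop reading early.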

Actually, there is yet another reason for me to think about this problem: it is interesting enough to catch my attention😃 This post is about my attempt at solving this problem so far.

I want to mention before getting into the main point that this work is a continuation from my previous work on Dragonbox, which is an algorithm for spitting out the shortest roundtrip output. As I pointed out, the possibility of excessive precision, which is a quite niche edge case as I claimed above, makes this work much more difficult than Dragonbox. As is often the case in many engineering problems, it is the niche edge case that makes the thing 100x more complicated :)

So what is the goal precisely?

Let’s summarize what exactly the problem we want to solve is:

For a given binary floating-point number $w$ and a positive integer $d$, how to print $d$ many decimal digits of $w$ with rounding, counting from the first nonzero digit?

First, I want to mention that I have stated the number of digits is counted from the first nonzero digit. Sometimes the users may have a different idea about the number of digits; for example it might be the number of digits after the decimal point. For instance, if our number is $0.00175$, then it has $3$ decimal digits counting from the first nonzero digit (which is $1$), but it has $5$ decimal digits after the decimal point. However, this difference will not significantly affect the core of the algorithm, so I will stick to what I have stated.

Note that, of course, this is not a very interesting problem if all we want to do is just printing the digits. However, typically this is done with so-called “big integer” arithmetic. Big integer arithmetic is notoriously slow, and also typically involves heap allocation, which is an absolute no-no in certain domains. In our case, it is probably easier to at least avoid heap allocations since the numbers we need to deal with are not horrendously large. But even without heaps, big integer arithmetic is still very heavy; the typical performance of division is especially quite dreadful. With this in mind, here is a revised version of our goal:

For a given binary floating-point number $w$ and a positive integer $d$, how to quickly print $d$ many decimal digits of $w$ with rounding, counting from the first nonzero digit, with reasonably-sized big integer arithmetic only, possibly avoiding divisions?

The current state-of-the-art algorithm to my knowledge, Ryū-printf, developed by Ulf Adams in 2019, already achieves this goal. Assuming IEEE-754 binary64, the maximum size of integers appearing inside the algorithm is $256$-bits, which is quite reasonable. It also does not directly do long division. As a consequence, Ryū-printf is very fast, and especially for large precision, it is much faster than other algorithms that use regular big integers.

However, the biggest problem with Ryū-printf is that it relies on a gigantic precomputed data table, which is about $\mathbf{102}$ KB. Compare this to the size of the Dragonbox table, which is only $9.7$ KB. Moreover, it is not so difficult to trade a little bit of performance to compress this $9.7$ KB table down to $584$ bytes. However, it is not obvious if we can do something similar for Ryū-printf.

Back in 2020, I tried to implement Ryū-printf and was able to come up with an implementation strategy that allows us to compress the table down into $39$ KB, while delivering performance that is more or less equivalent to the reference implementation. But you know, $39$ KB is still huge. More importantly, I was not aware of any way to reasonably trade more performance for a smaller table size. I mean, I knew of a way to make the table smaller by sacrificing some performance (and I’m pretty sure Ulf Adams is probably also aware of a similar idea), but I couldn’t make it smaller than, say, $10$ KB.

The large table size is essentially due to the possibility of excessive precision, which, as I pointed out several times, is not the common use case. I do not think it’s a good idea to have a table as large as $39$ KB just to be prepared for a very rare possibility. If we can reduce the size of the table while delivering a reasonable performance for the common case, i.e., the small precision case, then that would be nice.

So here is the second revision of our goal:

For a given binary floating-point number $w$ and a positive integer $d$, how to quickly print $d$ many decimal digits of $w$ with rounding, counting from the first nonzero digit, with reasonably-sized big integer arithmetic only, possibly avoiding divisions, and also with reasonably-sized precomputed cache data, preferably with a generic method of trading the performance with the data size? Also the performance when $d$ is small is more important than the performance when $d$ is large.

As a spoiler, this is what I concluded. Assuming IEEE-754 binary64, the exact same table from Dragonbox (of size $9.7$ KB, or $584$ bytes with a little bit of performance cost) is mostly enough when $d$ is small, and the heaviest arithmetic operation we need is $128$-bit $\times$ $64$-bit multiplication. For any other cases not covered by this table, we only need an additional table of size $\mathbf{3.6}$ KB, and the heaviest arithmetic operation we need is $192$-bit $\times$ $64$-bit multiplication.

Furthermore, it is possible to flexibly trade a larger maximum size of operand to multiplications for a smaller size of this additional table, without compromising the performance of the cases covered by the Dragonbox table. For instance, it is possible to reduce the size of the additional table to $\mathbf{580}$ bytes at the cost of requiring $960$-bit $\times$ $64$-bit multiplications instead of $192$-bit $\times$ $64$-bit multiplications.

Acknowledgement

Ryū-printf is the first source of inspiration, although what I ended up with is not directly based on it. Many other crucial inspirations came from private conversations with James Edward Anhalt III about his integer formatting algorithm and related topics. This paper by Lemire et al on remainder computation was also influential. The work by Raffaello Giulietti and Dmitry Nadezhin, on which my previous work on shortest roundtrip formatting is directly based, is where I learned about the relevancy of the concept of continued fractions, which is undoubtedly the most crucial element in the development of this algorithm. Thanks to Shengdun Wang for many discussions on possible optimizations. He is also the one who told me that fixed-precision formatting could be used as a subroutine for correct parsing back in 2019-2020. Special thanks to Seung uk Jang for reviewing this lengthy post.

The core idea

Now let’s get into the main idea of the algorithm I came up with. Following the notation I used in my paper on Dragonbox, we write our binary floating-point number as

\[w = \pm f_{c}\cdot 2^{e},\]

where $f_{c}$ is an unsigned integer and $e$ is an integer within certain range. For IEEE-754 binary32, $f_{c}$ is at most $2^{24}-1$ and $e$ is from $-149$ to $104$. More precisely, $f_{c}$ is also at least $2^{23}$ unless $e=-149$ (see this). Also, for IEEE-754 binary64, $f_{c}$ is at most $2^{53}-1$ and $e$ is from $-1074$ to $971$, and $f_{c}$ is also at least $2^{52}$ unless $e=-1074$. For simplicity, we will only focus on IEEE-754 binary64 in this post, but there is nothing deep about this assumption and most of the discussions can be extended to other common formats without fundamental difficulties.

Since we are also interested in printing out the digits of midpoints (for the application into parsing), we will in fact work with the form

\[w = \pm 2f_{c}\cdot 2^{e-1}\]

so that we can easily replace $w$ by the midpoint

\[m_{w}^{+} = \pm (2f_{c} + 1) \cdot 2^{e-1}\]

if we want. For convenience, we will just use the notation $w$ to mean either of the above two, and use the notation $n$ to denote the significand part of it, i.e., either $2f_{c}$ or $2f_{c}+1$.

Also, we will always assume that $w$ is strictly positive, since once the positive case is done, the other cases (zero, negative numbers, infinities, or NaN’s) can be handled easily.

So what does it mean to obtain decimal digits from $w$? By the $k$th decimal digit of $w$, we mean the number

\[\left(\left\lfloor w\cdot 10^{k} \right\rfloor \ \mathrm{mod}\ 10\right) = \left(\left\lfloor n \cdot (2^{e+k-1}\cdot 5^{k}) \right\rfloor \ \mathrm{mod}\ 10\right).\]

In fact, we do not need to take $\mathrm{mod}\ 10$. Rather, it is advantageous to consider a higher power of $10$ instead of $10$ for various reasons. Hence, we may consider

\[\left(\left\lfloor w\cdot 10^{k} \right\rfloor \ \mathrm{mod}\ 10^{\eta}\right) = \left(\left\lfloor n \cdot (2^{e+k-1}\cdot 5^{k}) \right\rfloor \ \mathrm{mod}\ 10^{\eta}\right)\]

for some positive integer $\eta$. Note that once we get the above which is an integer of (at most) $\eta$-many decimal digits, we can leverage fast integer formatting algorithms (like the one by James Edward Anhalt III) to extract decimal digits out of it. Considering $\eta>1$ essentially means that we work with a block of multiple decimal digits at once, rather than with individual digits. I will call this block of digits a segment. Of course, to really leverage fast integer formatting algorithms, we may need to choose $\eta$ to be not very big. Maybe the largest value for $\eta$ we can think of is $9$ because $10^{9}$ is the highest power of $10$ that fits in $32$-bits, or maybe $19$ if we consider $64$-bits instead. However, it turns out, we can in fact take $\eta$ even larger than that, which is a crucial factor for the space-time trade off we are aiming for. We will discuss this later.
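As a reference point, the segment defined above can be computed exactly (and slowly) with rational arithmetic. The following Python sketch is only a specification to test against, not the algorithm of this post; the function name `segment_reference` is mine.

```python
from decimal import Decimal
from fractions import Fraction

def segment_reference(w: float, k: int, eta: int) -> int:
    """floor(w * 10^k) mod 10^eta, computed exactly with rational arithmetic."""
    scaled = Fraction(w) * Fraction(10) ** k  # exact; k may be negative
    return (scaled.numerator // scaled.denominator) % 10 ** eta

# The 19th through 27th digits after the decimal point of the double nearest to 0.1.
print(segment_reference(0.1, 27, 9))
# Cross-check against the exact decimal expansion of the same double.
print(str(Decimal(0.1)).split(".")[1][18:27])
```

Everything that follows is about getting the same integer without ever forming the huge numerator and denominator that `Fraction` builds here.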

Abstractly speaking, what we want to do is to compute

\[\left\lfloor nx \right\rfloor \ \mathrm{mod}\ D\]

where $n=1,\ \cdots\ ,n_{\max}$ is a positive integer, $x$ is a positive rational number, and $D$ is a positive integer. What this expression means is this: we multiply $n$ to the numerator of $x$, divide it by the denominator of $x$, take the quotient, then divide the quotient by $D$ and take the remainder.

As I mentioned in the previous section, we want to avoid big integer arithmetic of arbitrary precision (especially division) so we do not want to do this computation as literally described above. The required precision for doing so is indeed quite big. For instance, let’s say $D=10^{9}$, $e=-1074$ and $k=560$ so that we are obtaining the 552nd through 560th digits of $w$ after the decimal point. Then the numerator of $x$ is $5^{560}$ which is $1301$-bit long, so we have to compute this $1301$-bit long number, multiply our $n$ into this $1301$-bit long number, and then divide it by $2^{515}$ (which is the denominator of $x$ in this case) which means throwing the last $515$-bits away, and then compute the division by $D$ for the resulting $800$-ish-bits number. Or, let $e=971$ and $k=-100$ so that we are obtaining the $9$ digits from the 109th down to the 101st digit before the decimal point. Then the numerator of $x$ is $2^{870}$ and the denominator is $5^{100}$, so we may need to first left-shift $n$ by $870$-bits, and then divide it by either $5^{100}$ after computing $5^{100}$ or by lower powers of $5$ iteratively. Either way, we end up with dividing a number that is $900$-ish-bits long.
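The sizes quoted in these two examples are easy to double-check with Python's arbitrary-precision integers. In this sketch I take $n$ to be the largest possible value $2^{54}-1$, which is my assumption for getting worst-case sizes.

```python
n_max = 2**54 - 1  # largest possible n = 2*f_c + 1 for binary64

# First example: e = -1074, k = 560, so x = 5^560 / 2^515.
numerator = 5**560
size_numerator = numerator.bit_length()
size_after_shift = ((n_max * numerator) >> 515).bit_length()

# Second example: e = 971, k = -100, so x = 2^870 / 5^100.
size_dividend = (n_max << 870).bit_length()
size_divisor = (5**100).bit_length()

print(size_numerator, size_after_shift)  # bits of 5^560, bits left to reduce mod 10^9
print(size_dividend, size_divisor)       # bits of n * 2^870, bits of 5^100
```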

(In fact, by performing big integer multiplications in decimal, that is, using this or a slight extension of it, we can avoid doing divisions by big integers. The idea, which I learned from Shengdun Wang, is that we can always turn the denominator of $x$ into a power of $10$ by adjusting the numerator accordingly, and in decimal, dividing by a power of $10$ is just a matter of cutting off some digits. The cost to pay is that multiplication of integers in decimal involves a lot of divisions by constant powers of $10$ (but fortunately of small dividends). Some cool things about this trick are that it generalizes trivially to any binary floating-point formats, and also that we can store precomputed powers of $5$ and $2$ if needed, while the total size of the table can be quite easily tuned according to any given requirements. This is all good, but I think the method that will be explained below is probably way faster.)

The following theorem from the paper on Dragonbox again proves itself to be very useful for computing $\left(\left\lfloor nx \right\rfloor\ \mathrm{mod}\ D\right)$:

Theorem 4.2.

Let $x$ be a positive real number and $n_{\max}$ a positive integer. Then for a positive real number $\xi$, we have the following.

If $x=\frac{p}{q}$ is a rational number with $q\leq n_{\max}$, then we have $\left\lfloor nx \right\rfloor = \left\lfloor n\xi \right\rfloor$ for all $n=1,\ \cdots\ ,n_{\max}$ if and only if $x \leq \xi < x + \frac{1}{vq}$ holds, where $v$ is the greatest integer such that $vp\equiv -1\ (\mathrm{mod}\ q)$ and $v\leq n_{\max}$.

If $x$ is either irrational or a rational number with the denominator strictly greater than $n_{\max}$, then we have $\left\lfloor nx \right\rfloor = \left\lfloor n\xi \right\rfloor$ for all $n=1,\ \cdots\ ,n_{\max}$ if and only if $\frac{p_{*}}{q_{*}} \leq \xi < \frac{p^{*}}{q^{*}}$ holds, where $\frac{p_{*}}{q_{*}}$, $\frac{p^{*}}{q^{*}}$ are the best rational approximations of $x$ from below and above, respectively, with the largest denominators $q_{*},q^{*}\leq n_{\max}$.

The core idea is to take $\xi=\frac{mD}{2^{Q}}$ for certain positive integers $m$ and $Q$ depending on $x$ and $n_{\max}$ (but not on $n$). Suppose $\xi=\frac{mD}{2^{Q}}$ satisfies the conditions given above, then, the following magic happens:

\[\begin{aligned} \left(\left\lfloor nx \right\rfloor \ \mathrm{mod}\ D\right) &= \left\lfloor nx \right\rfloor - \left\lfloor\frac{\left\lfloor nx \right\rfloor}{D} \right\rfloor D \\ &= \left\lfloor n\xi \right\rfloor - \left\lfloor\frac{\left\lfloor n\xi \right\rfloor}{D} \right\rfloor D \\ &= \left\lfloor n\xi \right\rfloor - \left\lfloor\frac{n\xi}{D} \right\rfloor D \\ &= \left\lfloor \frac{nmD}{2^{Q}} \right\rfloor - \left\lfloor \frac{nm}{2^{Q}}\right\rfloor D \\ &= \left\lfloor \frac{nmD - \lfloor nm/2^{Q} \rfloor D2^{Q}}{2^{Q}} \right\rfloor \\ &= \left\lfloor \frac{(nm - \lfloor nm/2^{Q} \rfloor 2^{Q})D}{2^{Q}} \right\rfloor \\ &= \left\lfloor \frac{(nm\ \mathrm{mod}\ 2^{Q})D}{2^{Q}} \right\rfloor. \end{aligned}\]

In other words, we first multiply $m$ to $n$, take the lowest $Q$-bits out of it, multiply $D$ to the resulting $Q$-bits, and then throw away the lowest $Q$-bits. Then what’s remaining is precisely equal to $\left(\left\lfloor nx \right\rfloor \ \mathrm{mod}\ D\right)$.

This trick is in fact no different from the idea presented in this paper by Lemire et al. But here we are relying on a more general result (Theorem 4.2) which gives a much better bound when the denominator of $x$ is large. A similar, slightly different idea is also used in the integer formatting algorithm I analyzed in the previous post. Really, I consider the algorithm presented here as a simple generalization of those works.

In addition to turning a modular operation into a multiplication followed by some bit manipulations, we get an even nicer achievement: we only need to know the lowest $Q$-bits of $m$, rather than all bits of $m$, because we are going to take $\mathrm{mod}\ 2^{Q}$ right after multiplying $n$ to it. That is, $\left(nm\ \mathrm{mod}\ 2^{Q}\right)$ is equal to $\left(n\left(m\ \mathrm{mod}\ 2^{Q}\right)\ \mathrm{mod}\ 2^{Q}\right)$.

By Theorem 4.2, given $x$ and $n_{\max}$, any $m$ and $Q$ such that $\xi = \frac{mD}{2^{Q}}$ satisfies the conditions will allow us to compute $\left(\left\lfloor nx \right\rfloor \ \mathrm{mod}\ D\right)$ using the formula

\[\label{eq:core formula} \left(\left\lfloor nx \right\rfloor \ \mathrm{mod}\ D\right) = \left\lfloor \frac{\left(n\left(m\ \mathrm{mod}\ 2^{Q}\right)\ \mathrm{mod}\ 2^{Q}\right)D}{2^{Q}} \right\rfloor.\]
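Here is a toy-sized Python check of this formula. The parameters are deliberately tiny so that every $n$ can be tested by brute force, and $Q$ is picked by a crude sufficient condition of my own ($2^{Q}\geq D\,n_{\max}\,q$, which forces $\xi - x < \frac{1}{n_{\max}q}$) rather than by the sharp bound of Theorem 4.2. The function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

def magic_number(x: Fraction, D: int, Q: int) -> int:
    """Lowest Q bits of m = ceil(2^Q * x / D); the higher bits are never needed."""
    return ceil(x * 2**Q / D) & ((1 << Q) - 1)

def floor_mod_fast(n: int, m_low: int, D: int, Q: int) -> int:
    """floor(n*x) mod D via multiply, mask, multiply, shift."""
    mask = (1 << Q) - 1
    return (((n * m_low) & mask) * D) >> Q

# Toy parameters (assumptions for illustration, far smaller than binary64).
x = Fraction(2**9, 5**3)
D = 100
n_max = 2000
Q = (D * n_max * x.denominator - 1).bit_length()  # smallest Q with 2^Q >= D*n_max*q
m_low = magic_number(x, D, Q)

for n in range(1, n_max + 1):
    expected = (n * x.numerator // x.denominator) % D
    assert floor_mod_fast(n, m_low, D, Q) == expected
print(Q)  # 25 for these parameters
```

The sharp bound from the theorem would usually allow a smaller $Q$ than this crude choice.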

Obviously, we want to choose $Q$ to be as small as possible because that will reduce the number of bits we need to throw into the computation, which both saves the memory and improves the performance. However, due to a reason that will become clear later in this section, it is beneficial to have a generic formula of $m$ that works for any large enough $Q$, rather than the one that allows us to choose the smallest $Q$. That generic formula we will use is:

\[m = \left\lceil \frac{2^{Q}x}{D} \right\rceil.\]

This choice of $m$ kinda makes sense, because morally $\xi=\frac{mD}{2^{Q}}$ is meant to be a good approximation of $x$, so $m$ should be a good approximation of $\frac{2^{Q}x}{D}$, but $m$ needs to be at least $\frac{2^{Q}x}{D}$ when the denominator of $x$ is small, due to the first case of Theorem 4.2, so we choose the ceiling.

With this $m$, $\xi = \frac{mD}{2^{Q}}$ is automatically at least $x$, so the left-hand sides of the inequalities given in Theorem 4.2 are always satisfied, thus we only need to care about the right-hand sides. That is, we look for $Q$ such that

\[\label{eq:core inequality} \xi = \frac{\left\lceil 2^{Q}x/D\right\rceil D}{2^{Q}} < \begin{cases} x + \frac{1}{vq} & \textrm{if $x=\frac{p}{q}$ is rational with $q\leq n_{\max}$},\\ \frac{p^{*}}{q^{*}} & \textrm{otherwise} \end{cases}\]

holds. Thanks to the wonderful theory of continued fractions, we can efficiently compute the right-hand side of the above inequality for any given $x$ and $n_{\max}$, allowing us to find the smallest $Q$ that satisfies the inequality.
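For small parameters, the search for the smallest $Q$ can be sketched as follows. This sketch only covers the second case of Theorem 4.2 (denominator of $x$ larger than $n_{\max}$), and it finds $\frac{p^{*}}{q^{*}}$ by a plain Stern-Brocot descent, which is my stand-in for a proper continued fraction computation and is far too slow for real binary64 sizes. All names and parameters are assumptions.

```python
from fractions import Fraction
from math import ceil

def best_approximations(x: Fraction, n_max: int):
    """Best approximations of x from below and above with denominators <= n_max."""
    assert x > 0 and x.denominator > n_max
    a, b, c, d = 0, 1, 1, 0  # invariant: a/b < x < c/d, with 1/0 meaning infinity
    while b + d <= n_max:
        p, q = a + c, b + d  # mediant; it can never equal x since q <= n_max
        if Fraction(p, q) < x:
            a, b = p, q
        else:
            c, d = p, q
    return Fraction(a, b), Fraction(c, d)

def xi(x: Fraction, D: int, Q: int) -> Fraction:
    return Fraction(ceil(x * 2**Q / D) * D, 2**Q)

def smallest_Q(x: Fraction, D: int, n_max: int) -> int:
    _, upper = best_approximations(x, n_max)
    Q = 0
    while xi(x, D, Q) >= upper:  # xi decreases toward x as Q grows
        Q += 1
    return Q

x, D, n_max = Fraction(2**20, 5**9), 100, 1000  # toy parameters
lower, upper = best_approximations(x, n_max)
Q = smallest_Q(x, D, n_max)
m = ceil(x * 2**Q / D)
mask = (1 << Q) - 1
for n in range(1, n_max + 1):
    fast = (((n * (m & mask)) & mask) * D) >> Q
    assert fast == (n * x.numerator // x.denominator) % D
print(Q, lower, upper)
```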

This leads us to the following strategy for computing $\left(\left\lfloor nx \right\rfloor\ \mathrm{mod}\ D\right)$:

For all values of $x$ that we care about, find the smallest $Q$ satisfying the inequality $\eqref{eq:core inequality}$, and store the lowest $Q$-bits of $m = \left\lceil\frac{2^{Q}x}{D}\right\rceil$ and $Q$ in a cache table. This is done only once, before runtime.
During runtime, for a given value of $x$, load the corresponding $(m\ \mathrm{mod}\ 2^{Q})$ and $Q$ from the cache table. Then using the formula $\eqref{eq:core formula}$, which only requires some multiplications and bit manipulations, we can compute $\left(\left\lfloor nx\right\rfloor\ \mathrm{mod}\ D\right)$ for any given $n$.

In practice, however, this strategy is not directly applicable to our situation, because too much information needs to be stored. To illustrate this, recall that in our situation we have $x = 2^{e+k-1}\cdot 5^{k}$, so a different pair $(e,k)$ corresponds to a different $x$. After doing some analysis (which will be done in a later section), one can figure out that there are about $540,000$ pairs $(e,k)$ that can appear in the computation, and given $D=10^{\eta}$, the smallest feasible $Q$ is around $120$ when $\eta=1$, and it is even larger for larger values of $\eta$. Hence, we already need at least about $540,000\times 120\textrm{-bits}\approx 7.7$ MB just to store the lowest $Q$-bits of $m$. And that is not even all, since we also need to store $Q$ itself. This is not acceptable.

Now the art is on how far we can compress this down. Indeed, there are several interesting features of the formula $\eqref{eq:core formula}$ we derived which together allow us to compactify this ridiculously large data into something much smaller. Let’s dig into them one by one.

(a) It is $k$ that matters, not $e$.

Recall that we have

\[m = \left\lceil \frac{2^{Q}x}{D} \right\rceil = \left\lceil 2^{Q+e+k-\eta-1}\cdot 5^{k-\eta} \right\rceil,\]

given $x = 2^{e+k-1}\cdot 5^{k}$ and $D=10^{\eta}$. Let’s for a while ignore the ceiling and pretend that it’s actually floor. Then what are the lowest $Q$-bits of the above integer?


Figure 1 - Bits of $5^{k-\eta}$ are shown in a row; bits in the window (red box) are the lowest $Q$-bits of $m$. The example shown is when $Q+e+k-\eta-1>0$. For the other case, the window will be to the left of the blue vertical line.

Those are of course bits of $5^{k-\eta}$, starting (MSB) from the $(e+k-\eta)$-th bit, ending (LSB) at the $(Q+e+k-\eta-1)$-th bit. So for a fixed $k$, a different choice of $e$ corresponds to a different choice of the window (the red box in the figure). The starting (the left-most) position of the window is determined by $e$ (once $k$ and $\eta$ are fixed), and the size of the window is determined by $Q$. Note that those windows corresponding to $e$’s have a lot of intersection between them. This means that we do not need to store $m$ for each pair of $e$ and $k$. Rather, for each $k$, we just store sufficiently many bits of $5^{k-\eta}$, just enough to cover all possible windows we may need to look at. In other words, for all $e$’s such that the pair $(e,k)$ is relevant (that is, can appear), we take the union of all the windows corresponding to those $e$’s to get a single large window, and then we store the bits inside that window.

Often, the resulting large window may contain some leading zeros, which we can remove to further reduce the amount of information that needs to be stored. This means that when we load the necessary bits from the cache table at runtime, the window corresponding to the given $e$ may go beyond the range of bits we actually have in the table. But in this case we can simply fill in those missing bits with zeros.

Now, let’s not forget that we need the ceiling rather than the floor, but this is not a tremendous issue because we can easily compute the ceiling from the floor just by checking if $2^{Q+e+k-\eta-1}\cdot 5^{k-\eta}$ is an integer, which is just a matter of inspecting the inequalities

\[Q+e+k-\eta-1 \geq 0\quad\textrm{and}\quad k-\eta \geq 0.\]
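The window idea can be sketched in Python as follows. One entry per $k$ holds the bits of $5^{k-\eta}$, and every $e$ just reads a different window out of it. The fixed width `W`, the function names, and the test parameters are assumptions for illustration; in particular, this sketch does not trim leading zeros or size the stored range tightly.

```python
from fractions import Fraction
from math import ceil

W = 2000  # number of fractional bits kept per entry; an arbitrary generous choice

def stored_bits(k: int, eta: int) -> int:
    """floor(2^W * 5^(k-eta)): one cache entry per k, shared by every e."""
    t = k - eta
    return (5**t) << W if t >= 0 else (1 << W) // 5 ** (-t)

def m_low_bits(e: int, k: int, eta: int, Q: int) -> int:
    """Lowest Q bits of m = ceil(2^(Q+e+k-eta-1) * 5^(k-eta)), read from the cache."""
    mask = (1 << Q) - 1
    s = Q + e + k - eta - 1
    assert s <= W, "the stored range must cover the window"
    window = (stored_bits(k, eta) >> (W - s)) & mask  # floor version of m, low Q bits
    is_integer = s >= 0 and k - eta >= 0              # the two inequalities above
    return window if is_integer else (window + 1) & mask

def m_low_bits_reference(e: int, k: int, eta: int, Q: int) -> int:
    value = Fraction(2) ** (Q + e + k - eta - 1) * Fraction(5) ** (k - eta)
    return ceil(value) % 2**Q

for k in (0, 9, 18, 27):
    for e in range(-300, 301, 50):
        assert m_low_bits(e, k, 9, 128) == m_low_bits_reference(e, k, 9, 128)
```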

(b) We need only one $k$ per $\eta$-many $k$’s.

Since we work with segments consisting of $\eta$ digits rather than individual digits, we do not need to consider all $k$’s. Instead, we choose a small enough $k_{\min}$ and only consider $k_{\min}$, $k_{\min} + \eta$, $k_{\min} + 2\eta$, $\cdots$, up to a large enough $k_{\min} + s\eta$. Hence, roughly speaking, choosing a big $\eta$ can result in a reduction in the size of the table by a factor of $\eta$.

(c) We don’t need to remember the smallest $Q$.

Recall that we not only need to store $m$ but also $Q$, which adds a non-negligible amount of static data. However, we do not actually need to precisely know the smallest $Q$ that does the job. More specifically, the $Q$ we use to compute the $m$ that we store in the table does not need to match the actual $Q$ we use at runtime, as long as the actual $Q$ we use at runtime is greater than or equal to the $Q$ we use for building the table.

What I mean is this. When we compute $m$ stored in the table, we want to find the minimum possible $Q$ that works, so that the window we get will be of the smallest possible size. However, it is okay to use a larger window at runtime if we want. To see why, recall that the left-most position of the window is not dependent on $Q$. Using a different $Q$ just changes the size of the window. Now, let $Q$ be the one that we used for building the table, and let $Q'$ be any integer with $Q'\geq Q$. Let $m'$ be what we will end up with if we try to load $Q'$-bits from the table instead of $Q$-bits, then we have

\[m' = \left\lceil\frac{2^{Q''}x}{D}\right\rceil 2^{Q'-Q''}\]

for some $Q\leq Q''\leq Q'$. Here we are assuming that, if the window of size $Q'$ goes beyond the right limit of the stored bits (which can occur since $Q'$ is chosen to be bigger than $Q$), then we fill in the missing bits with zeros, and also we perform the ceiling at the last bit we loaded from the cache. This is why we get this additional $Q''$. Note that, however, this $Q''$ is guaranteed to be between $Q$ and $Q'$ due to the construction of the cache table. Then the corresponding $\xi'$ is now given as

\[\xi' = \frac{m'D}{2^{Q'}} = \frac{\left\lceil 2^{Q''}x/D \right\rceil D}{2^{Q''}}.\]

Now since

\[m\cdot 2^{Q''-Q} = \left\lceil\frac{2^{Q}x}{D}\right\rceil 2^{Q''-Q} \geq \frac{2^{Q''}x}{D}\]

holds and the left-hand side is an integer, it follows that

\[m\cdot 2^{Q''-Q} \geq \left\lceil \frac{2^{Q''}x}{D} \right\rceil\]

holds, thus we get $\xi\geq \xi'$. Since $\xi'\geq x$ also holds because of the ceiling in its definition, $\xi'$ lies between $x$ and $\xi$, so it still satisfies the conditions listed in Theorem 4.2, thus the formula

\[\left(\left\lfloor nx \right\rfloor \ \mathrm{mod}\ D\right) = \left\lfloor \frac{\left(n(m'\ \mathrm{mod}\ 2^{Q'})\ \mathrm{mod}\ 2^{Q'}\right)D}{2^{Q'}} \right\rfloor\]

is still valid.
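The claim can be checked by brute force on a toy example. In the sketch below, $Q$ is again a crude sufficient value of my own choosing ($2^{Q}\geq D\,n_{\max}\,q$) rather than the smallest one, and I simply try every $Q\leq Q''\leq Q'$ in a small range; the parameters and names are assumptions.

```python
from fractions import Fraction
from math import ceil

def floor_mod_fast(n: int, m: int, D: int, Q: int) -> int:
    mask = (1 << Q) - 1
    return (((n * (m & mask)) & mask) * D) >> Q

x, D, n_max = Fraction(2**9, 5**3), 100, 2000      # toy parameters
Q = (D * n_max * x.denominator - 1).bit_length()    # crude sufficient Q, not the smallest

def still_valid(Q_prime: int, Q_mid: int) -> bool:
    # m' = ceil(2^Q'' x / D) * 2^(Q' - Q''): ceiling taken at bit Q'',
    # then zero-filled on the right up to Q' bits.
    m_prime = ceil(x * 2**Q_mid / D) << (Q_prime - Q_mid)
    return all(
        floor_mod_fast(n, m_prime, D, Q_prime) == (n * x.numerator // x.denominator) % D
        for n in range(1, n_max + 1)
    )

results = [still_valid(Qp, Qm) for Qp in range(Q, Q + 6) for Qm in range(Q, Qp + 1)]
print(all(results))
```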

This analysis leads us to two strategies for reducing the number of bits we need to store the values of $Q$:

Find the maximum among all the smallest $Q$’s for relevant $(e,k)$-pairs. Then at runtime, we will use this maximum $Q$ for all $(e,k)$. In this case, we do not need to store $Q$’s at all since it is a constant.
Alternatively, we can partition the set of $(e,k)$-pairs into groups which share a single $Q$, which is the largest of all the smallest $Q$’s for each member of the group. One way to make such a partition is to fix a positive integer $\ell$ (which we will refer to as the collapse factor), and group all $(e,k)$-pairs that share the same $k$ and the same quotient $\left\lfloor (e-e_{\min})/\ell \right\rfloor$. In this way, we can reduce the number of bits needed to store $Q$’s roughly by a factor of $\ell$. In addition to that, we may store $\left\lceil Q/64 \right\rceil$ instead of $Q$ and use $\left\lceil Q/64\right\rceil\cdot 64$ instead of $Q$, because (assuming a $64$-bit platform) it is the number of $64$-bit blocks, rather than the number of bits, which determines the amount of computation we need to perform. For example, there is no benefit of choosing $Q=125$ over choosing $Q=128$, so we just use $Q=128$ in that case. This then also reduces the required number of bits by a certain factor.

In practice, having $Q$ as a fixed constant seems to allow the compiler to perform many aggressive optimizations, so the advantage of the second strategy is only apparent when the values of $Q$ for different $x$’s vary significantly.

Decimal digit generation

In the last section, we have discussed when it is allowed to use the formula $\eqref{eq:core formula}$:

\[\left(\left\lfloor nx \right\rfloor \ \mathrm{mod}\ D\right) = \left\lfloor \frac{\left(n(m\ \mathrm{mod}\ 2^{Q})\ \mathrm{mod}\ 2^{Q}\right)D}{2^{Q}} \right\rfloor.\]

In this section, we will discuss how to actually leverage this formula to quickly compute decimal digits of $w = n\cdot 2^{e-1}$. For convenience, we will use the notation $m$ to really mean $(m\ \mathrm{mod}\ 2^{Q})$ in this section.

Let us recall what operations we need to do for computing the above.

First, we need to multiply $n$, which is of $64$-bits (it is a bit smaller than that, but that doesn’t matter a lot in this case), and $m$, which is of $Q$-bits. We only take the lowest $Q$-bits from the result.
Next, we multiply the resulting $Q$-bit number with $D=10^{\eta}$. $D$ fits in $64$-bits if $\eta\leq 19$. This time, we discard the lowest $Q$-bits and only take the higher bits.

Note that since $Q$ can be bigger than $64$, this inevitably involves some form of big integer arithmetic. Obviously, the required bit-width of this big integer arithmetic increases (which means that the computation will take more time) as $Q$ grows. In general, if we choose a larger $\eta$, then $Q$ gets bigger as well. Thus we have a space-time trade off here, because taking a larger $\eta$ makes the size of the table smaller.

When $\eta=1$, the maximum of all $Q$’s is about $120$, which means that the multiplications involved are between a $128$-bit integer and a $64$-bit integer. When $2\leq\eta\leq 19$, the maximum of all $Q$’s lies in between $129$ and $192$, which means that we need multiplications of a $192$-bit integer and a $64$-bit integer.

We will not seriously consider the $\eta=1$ case, because it not only requires a ridiculously big table ($70$ KB, even after all the compression schemes explained in the last section) but also has terrible performance, as we have to perform two $128$-bit $\times$ $64$-bit multiplications for each single digit.

When $20\leq \eta\leq 22$, $Q$ is still at most $192$, but now $D=10^{\eta}$ cannot fit inside $64$-bits. Thus one may think that probably $\eta=19$ is the best choice. However, it turns out that $20\leq \eta\leq 22$ is quite a sensible choice as well, because there is an elegant way of extracting the digits from $\eqref{eq:core formula}$, which basically works for any $\eta$: we don’t compute all of the digits at once, rather, we extract a smaller sequence of digits iteratively, from left to right.

What happens here is really no different from what James Anhalt’s algorithm does. If we want to know the first $\gamma_{1}<\eta$ decimal digits of $\left\lfloor \frac{(nm\ \mathrm{mod}\ 2^{Q})D}{2^{Q}} \right\rfloor,$ then we compute $\left\lfloor \frac{(nm\ \mathrm{mod}\ 2^{Q})d_{1}}{2^{Q}} \right\rfloor$ with $d_{1} = 10^{\gamma_{1}}$. If we want to know the next $\gamma_{2}$ digits, then compute $\left\lfloor \frac{(nmd_{1}\ \mathrm{mod}\ 2^{Q})d_{2}}{2^{Q}} \right\rfloor$ with $d_{2} = 10^{\gamma_{2}}$, and for the next $\gamma_{3}$ digits, compute $\left\lfloor \frac{(nmd_{1}d_{2}\ \mathrm{mod}\ 2^{Q})d_{3}}{2^{Q}} \right\rfloor$ with $d_{3} = 10^{\gamma_{3}}$, and so on, until we exhaust all $\eta$ digits. And this procedure can be nicely iterated: at each iteration, we have a $Q$-bit number stored in a buffer, and we multiply $d=10^{\gamma}$ to it. The right-most $Q$ bits of the result will be stored back into the buffer for the next iteration, and the remaining left-most bits are the output of the current iteration.
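The iteration is short enough to write down directly. In this Python sketch the buffer `r` is an arbitrary $192$-bit value standing in for $(nm\ \mathrm{mod}\ 2^{Q})$, and Python's big integers hide the limb-by-limb arithmetic a real implementation would do; the function name and the chunk sizes are my choices.

```python
def extract_chunks(r: int, Q: int, gammas):
    """Peel decimal chunks of gamma_1, gamma_2, ... digits off the fraction r / 2^Q."""
    mask = (1 << Q) - 1
    chunks = []
    for gamma in gammas:
        r *= 10**gamma
        chunks.append(r >> Q)  # the left-most bits are the output of this iteration
        r &= mask              # the right-most Q bits go back into the buffer
    return chunks

Q, eta = 192, 22
# Arbitrary 192-bit stand-in for (n*m mod 2^Q).
r = 0x0123456789ABCDEF_FEDCBA9876543210_0F1E2D3C4B5A6978

all_at_once = str((r * 10**eta) >> Q).zfill(eta)
sizes = [8, 8, 6]
joined = "".join(str(c).zfill(g) for c, g in zip(extract_chunks(r, Q, sizes), sizes))
print(joined == all_at_once)
```

Each chunk may have leading zeros, so it has to be zero-padded to its nominal width before being concatenated.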

This iterative approach allows for quite efficient computation of decimal digits in terms of the required number of multiplications. As an illustration, let us compare the required number of multiplications for the case $\eta=22$, $Q=192$ with the case of Ryū-printf. As in my Dragonbox paper, let us call $64$-bit $\times$ $64$-bit $\to 128$-bit multiplication the full multiplication, and $64$-bit $\times$ $64$-bit $\to 64$-bit multiplication (i.e., taking only the lower half of the result) the half multiplication. The point of distinguishing these two is that they have quite different performance characteristics on typical x64 machines. On other machines, there is often no direct instruction for doing the full multiplication, and we have to emulate it with multiple half multiplications. So generally full multiplications tend to be slower than half multiplications. Now let us analyze the number of full/half multiplications we need to perform.

In our iteration scheme, first we have to prepare the $Q$-bit number $(nm\ \mathrm{mod}\ 2^{Q})$, which means we multiply a $64$-bit integer $n$ with a $192$-bit integer $m$, where we only take the lowest $192$-bits from the result. This requires $2$ full multiplications and $1$ half multiplication. Next, we extract $16$ decimal digits (as a $64$-bit integer) from the segment by multiplying $10^{16}$. This requires $3$ full multiplications. And then, we divide this into two $8$-digits chunks (so that each of them fits in $32$-bits), which means that we divide it by $10^{8}$ and take the quotient and the remainder. This requires $1$ full multiplication and $1$ half multiplication, by applying the usual trick of turning a constant division into a multiplication followed by a shift. Next, we extract the remaining $6$ digits (as a $32$-bit integer) from our $22$-digits segment, which again requires $3$ full multiplications. At this point, we performed $9$ full multiplications and $2$ half multiplications to get three $32$-bit chunks (which I call subsegments) each consisting of $8$, $8$, and $6$ digits. Then, using James Anhalt’s algorithm, we need $4+4+3=11$ half multiplications for printing out all of them. Thus in total, we need $9$ full multiplications and $13$ half multiplications.

On the other hand, in Ryū-printf, we first multiply a $64$-bit integer by a $192$-bit cache entry, extract the upper $128$ bits from the result, shift it by a certain amount, divide it by $10^{9}$ and then take the remainder, to get a segment consisting of $9$ decimal digits. The first step of multiplying a $64$-bit integer and a $192$-bit integer requires $3$ full multiplications. Applying the usual trick, we can turn the division by $10^{9}$ into a multiplication, and in this case, since the dividend is $128$ bits wide, the magic number we use for multiplication is also $128$ bits wide. More precisely, we need to multiply two $128$-bit integers, and then take the second $64$-bit block from the resulting $256$-bit integer. This requires $3$ full multiplications and $1$ half multiplication. Next, to compute the remainder from the quotient, we need $1$ half multiplication (in $32$ bits). So we need to perform $6$ full multiplications and $2$ half multiplications to get a $32$-bit chunk consisting of $9$ digits. Then, again using James Anhalt’s algorithm, we need $5$ half multiplications for printing these $9$ digits out. Therefore, we need $6$ full multiplications and $7$ half multiplications in total.

Hence, on average, our scheme requires $9/22\approx 0.4$ full multiplications and $13/22\approx 0.6$ half multiplications per digit, while Ryū-printf requires $6/9\approx 0.7$ full multiplications and $7/9\approx 0.8$ half multiplications per digit. This means that our scheme has roughly $60$ % better throughput than Ryū-printf in terms of the required number of full multiplications, and roughly $30$ % better in terms of half multiplications.

The throughput for $\eta=18$, $Q=192$ is even better. Preparation of $(nm\ \mathrm{mod}\ 2^{Q})$ again requires $2$ full multiplications and $1$ half multiplication, and we extract all $18$ digits at once (as a $64$-bit integer) by performing $3$ full multiplications. Splitting it into two $9$-digit subsegments requires $1$ full multiplication and $1$ half multiplication, and printing them out requires $5+5=10$ half multiplications. So in total, we need $6$ full multiplications and $12$ half multiplications, or $6/18\approx 0.3$ full multiplications and $12/18\approx 0.7$ half multiplications per digit.
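The per-digit figures quoted above are easy to re-derive. The following Python snippet just redoes the arithmetic from the tallies in the text; the dictionary layout and names are only for illustration.

```python
# (full multiplications, half multiplications, digits produced) per segment,
# exactly as tallied in the text above.
COSTS = {
    "proposed, eta=22": (9, 13, 22),
    "proposed, eta=18": (6, 12, 18),
    "Ryu-printf, 9-digit segment": (6, 7, 9),
}

def per_digit(full, half, digits):
    """Average number of full and half multiplications per printed digit."""
    return full / digits, half / digits

for name, (full, half, digits) in COSTS.items():
    f, h = per_digit(full, half, digits)
    print(f"{name}: {f:.2f} full, {h:.2f} half per digit")
```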

Which $(e,k)$-pairs are relevant?

So far, we have mentioned “a relevant $(e,k)$-pair” (or the like) a lot, but have not really defined precisely what it means. So let’s talk about it.

Recall that we are trying to compute decimal digits of $w=n\cdot 2^{e-1}$. The way we do it is to compute $\left(\left\lfloor w\cdot 10^{k}\right\rfloor\ \mathrm{mod}\ D\right)$ with $D=10^{\eta}$. Note that, if $k$ is too large, then $w\cdot 10^{k}$ will always be an integer multiple of $D=10^{\eta}$, because $2$ is a divisor of $10$. Then we always get $\left(\left\lfloor w\cdot 10^{k}\right\rfloor\ \mathrm{mod}\ D\right)=0$, which means that we do not ever need to consider such a large $k$. Specifically, for given $e$, this always happens regardless of $n$ if:

$k\geq \eta$ when $e-1\geq0$, and
$k\geq \eta-e+1$ when $e-1<0$.

Hence, for given $k$, the pair $(e,k)$ is “relevant” only when $k\leq \eta-1$ if $e-1\geq 0$ and $k\leq\eta-e$ when $e-1<0$. In particular, we never need to consider a $k$ such that $k\geq \eta - e_{\min} + 1$ where $e_{\min}=-1074$.
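As a sanity check of the two cut-off conditions, here is a small Python sketch that compares them against exact rational arithmetic for a toy segment length. The helper names are hypothetical and the brute-force check is only meant for small parameters.

```python
import math
from fractions import Fraction

def segment_always_zero(e, k, eta):
    """True when floor(w * 10^k) mod 10^eta is 0 for every n, with w = n * 2^(e-1)."""
    if e - 1 >= 0:
        return k >= eta
    return k >= eta - e + 1

def segment(n, e, k, eta):
    """Exact value of floor(n * 2^(e-1) * 10^k) mod 10^eta, using rationals."""
    value = n * Fraction(2) ** (e - 1) * Fraction(10) ** k
    return math.floor(value) % 10**eta
```

Taking $n=1$ at the largest $k$ that is still allowed gives a nonzero segment, which shows that the bounds cannot be lowered uniformly in $n$.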

How about the other side? Given $e$, finding the smallest $k$ such that $(e,k)$ is relevant means to find the position of the first nonzero decimal digit of $w = n\cdot 2^{e-1}$, for the range of $n$ given. In fact, we already know how to do it. In the Dragonbox paper, we showed that it is possible to reliably compute $\lfloor w\cdot 10^{k}\rfloor$ for $k=\kappa - \left\lfloor e\log_{10}2\right\rfloor$ for $\kappa=2$ (or $\kappa=1$ for IEEE-754 binary32) by multiplying (an appropriate shift of) $n$ to the $k$-th table entry of the Dragonbox algorithm. Here, we are not taking $\mathrm{mod}\ D$, so there is no further digit we can extract by considering a smaller $k$. Therefore, this $\left\lfloor w\cdot 10^{k}\right\rfloor$ is the first segment that we ever need to look at.

The choice $k = \kappa - \left\lfloor e\log_{10}2\right\rfloor$ guarantees that this first segment is always nonzero. Indeed, it must contain at least $3$ decimal digits, because

\[\begin{align*} w\cdot 10^{k} &= n\cdot 2^{e-1}\cdot 10^{k} \\ &\geq 2^{e}\cdot 10^{\kappa - \left\lfloor e\log_{10}2\right\rfloor} \\ &\geq 2^{e}\cdot 10^{\kappa - e\log_{10}2} \\ &= 10^{\kappa} \end{align*}\]

where the right-hand side is of $(\kappa+1)$-digits. In fact, if $w$ is not a subnormal number, we have $n\geq 2^{53}$, so

\[\begin{align*} w\cdot 10^{k} &= n\cdot 2^{e-1}\cdot 10^{k} \\ &\geq 2^{52}\cdot 2^{e}\cdot 10^{\kappa - \left\lfloor e\log_{10}2\right\rfloor} \\ &\geq 2^{52}\cdot 10^{\kappa}, \end{align*}\]

which means that $\left\lfloor w\cdot 10^{k}\right\rfloor$ must be of at least $18$-digits. On the other hand, we know $n\leq 2^{54}-1$, so

\[\begin{align*} w\cdot 10^{k} &= n\cdot 2^{e-1}\cdot 10^{k} \\ &\leq (2^{54}-1)\cdot 2^{e-1}\cdot 10^{\kappa - \left\lfloor e\log_{10}2\right\rfloor} \\ &< \left(2^{53}-\frac{1}{2}\right)\cdot 2^{e}\cdot 10^{\kappa - e\log_{10}2 + 1} \\ &= \left(2^{53}-\frac{1}{2}\right)\cdot 10^{\kappa+1}, \end{align*}\]

so $\left\lfloor w\cdot 10^{k}\right\rfloor$ must be of at most $19$-digits.
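The $18$ to $19$ digit claim for normal numbers can be checked numerically. The sketch below is not how Dragonbox computes the first segment; it simply evaluates $\left\lfloor w\cdot 10^{k}\right\rfloor$ with exact rationals, and computes $\left\lfloor e\log_{10}2\right\rfloor$ from the decimal length of $2^{|e|}$ to avoid floating-point error. All function names are made up for this example.

```python
import math
from fractions import Fraction

KAPPA = 2  # the value used for IEEE-754 binary64 in the text

def floor_log10_pow2(e):
    """Exact floor(e * log10(2)), derived from the decimal length of 2^|e|."""
    if e >= 0:
        return len(str(2**e)) - 1
    # For e < 0, e*log10(2) is never an integer, so floor = -(digits of 2^|e|).
    return -len(str(2 ** (-e)))

def first_segment(n, e):
    """floor(w * 10^k) for w = n * 2^(e-1) and k = KAPPA - floor(e*log10(2))."""
    k = KAPPA - floor_log10_pow2(e)
    return math.floor(n * Fraction(2) ** (e - 1) * Fraction(10) ** k)
```

For any $n$ with $2^{53}\leq n\leq 2^{54}-1$, `len(str(first_segment(n, e)))` should come out as $18$ or $19$, and the value fits in $64$ bits.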

So, (with the exception of the case of subnormal numbers) this $64$-bit integer $\left\lfloor w\cdot 10^{k}\right\rfloor$ already gives us a pretty large number of digits of $w$, which means that we have already (mostly) solved the common case, i.e., the small precision case, of our problem at hand.

(Note that it is also possible to always extract $18\sim 19$ digits even from subnormal numbers by normalizing $w$, that is, by multiplying an appropriate power of $2$ to $n$ and subtracting the corresponding power from $e$. That requires a bit more table entries than the ones we used for Dragonbox though. Whether or not this is a good thing to do is not very clear to me at this point. The current implementation does not do normalization.)

Therefore, the pair $(e,k)$ is “relevant” only when $k$ is large enough so that computing

\[\left(\left\lfloor w\cdot 10^{k}\right\rfloor\ \mathrm{mod}\ D\right) = \left(\left\lfloor n\cdot 2^{e+k-1}\cdot 5^{k}\right\rfloor\ \mathrm{mod}\ D\right)\]

can produce at least one digit that is not contained in the first segment, which means that

\[k \geq \kappa - \left\lfloor e\log_{10}2\right\rfloor + 1\]

holds.

Therefore, for given $k$,

When $0\leq e-1\leq e_{\max}-1 = 970$, $(e,k)$ is relevant if and only if
\[\kappa - \left\lfloor e\log_{10}2\right\rfloor + 1 \leq k \leq \eta-1.\]
When $-1075 = e_{\min}-1 \leq e-1<0$, $(e,k)$ is relevant if and only if
\[\kappa - \left\lfloor e\log_{10}2\right\rfloor + 1 \leq k \leq \eta-e.\]

Recall that we will only consider $k$’s of the form $k = k_{\min} + s\eta$. Let us call the integer $s=0,1,2,\ \cdots\ $ the multiplier index. Once $k_{\min}$ has been chosen, the greatest multiplier index $s_{\max}$ can be figured out by finding the largest $s_{\max}$ satisfying the inequality

\[k_{\min} + s_{\max}\eta \leq \eta - e_{\min}\]

with $e_{\min}=-1074$, or equivalently,

\[s_{\max} = \left\lfloor \frac{\eta - e_{\min} - k_{\min}}{\eta}\right\rfloor.\]

To choose $k_{\min}$, note that it is enough to choose $k_{\min}$ such that when $e=e_{\max}$, $\left(\left\lfloor w\cdot 10^{k_{\min}}\right\rfloor\ \mathrm{mod}\ D\right)$ contains the first digit that cannot be obtained from the first segment. Since we extract $\eta$-digits at once, this means that

\[k_{\min} \leq \kappa - \left\lfloor e_{\max}\log_{10}2\right\rfloor + \eta\]

with $e_{\max} = 971$ is satisfied. No smaller $k$ ever needs to be considered.

In general, choosing a smaller $k_{\min}$ gives a larger $s_{\max}$, so more powers of $5$ have to be stored in the table, and we would therefore want to take the largest possible $k_{\min}$, that is, $k_{\min} = \kappa - \left\lfloor e_{\max}\log_{10}2\right\rfloor + \eta$. However, it turns out that choosing a slightly smaller $k_{\min}$ can actually make the table a little bit smaller. In the implementation, I did some trial-and-error type experiments to find the optimal $k_{\min}$ that minimizes the table size.

In this way, we can determine for which $k$ the bits of $5^{k-\eta}$ need to be stored in the cache table.

For given $e$, the first multiplier index, which gives the first digit that cannot be obtained from the first segment, can be obtained by finding the smallest $s$ satisfying

\[k = k_{\min} + s\eta \geq \kappa - \left\lfloor e\log_{10}2\right\rfloor + 1,\]

or equivalently,

\[s = \left\lceil \frac{(\kappa - \left\lfloor e\log_{10}2\right\rfloor) - k_{\min} + 1}{\eta} \right\rceil = \left\lfloor \frac{(\kappa - \left\lfloor e\log_{10}2\right\rfloor) - k_{\min} + \eta}{\eta} \right\rfloor.\]
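The index formulas of this section translate directly into integer arithmetic. Here is a hedged Python sketch; the function names are illustrative, and $k_{\min}$ is taken at its largest admissible value rather than the experimentally tuned one used in the implementation.

```python
def floor_log10_pow2(e):
    """Exact floor(e * log10(2)), derived from the decimal length of 2^|e|."""
    if e >= 0:
        return len(str(2**e)) - 1
    return -len(str(2 ** (-e)))

def largest_k_min(kappa, e_max, eta):
    """Largest k_min that still reaches the first new digit when e = e_max."""
    return kappa - floor_log10_pow2(e_max) + eta

def max_multiplier_index(eta, e_min, k_min):
    """Largest s with k_min + s*eta <= eta - e_min."""
    return (eta - e_min - k_min) // eta

def first_multiplier_index(e, kappa, k_min, eta):
    """Smallest s with k_min + s*eta >= kappa - floor(e*log10(2)) + 1."""
    first_new_k = kappa - floor_log10_pow2(e) + 1
    return -((k_min - first_new_k) // eta)  # ceiling division
```

With Python's floor division, the ceiling form and the floor form of the last formula give the same value, which is the equality stated above.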

The cache table

At this point, almost the whole algorithm has been explained. In this section, I collect some missing details that are needed to actually build the cache table.

(a) How do we arrange the computed bits of $5^{k-\eta}$ in the memory?

For each multiplier index $s=0,\ \cdots\ ,s_{\max}$, we find all the necessary bits of $5^{k-\eta}$ with $k=k_{\min}+s\eta$. After cutting off all leading zeros, we simply stitch all of these bits together and store them in an array of $64$-bit integers. In theory, the first bit for each $k$ in this case will always be $1$, so we can even remove those bits, but let’s not do that as it will complicate the implementation a lot for a marginal benefit.
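To illustrate the stitching, here is a simplified Python sketch. It packs the binary expansions of a few positive integers (exact powers of $5$ serve as stand-ins for "the necessary bits of $5^{k-\eta}$") into $64$-bit words and records where each one starts. The layout details (MSB-first order, zero padding of the last word) and the names are assumptions made for this example, not a description of the real table.

```python
WORD_BITS = 64

def pack_bits(values):
    """Concatenate binary expansions (no leading zeros) into 64-bit words.

    Returns the list of words and the position of each value's first bit,
    counting from 0 at the most significant bit of words[0].
    """
    stream, total, starts = 0, 0, []
    for v in values:
        width = v.bit_length()
        starts.append(total)
        stream = (stream << width) | v
        total += width
    padding = -total % WORD_BITS  # zero-fill the tail of the last word
    stream <<= padding
    n_words = (total + padding) // WORD_BITS
    words = [(stream >> (WORD_BITS * (n_words - 1 - i))) & (2**WORD_BITS - 1)
             for i in range(n_words)]
    return words, starts

def read_bits(words, start, width):
    """Read `width` bits beginning at bit position `start`."""
    stream = 0
    for w in words:
        stream = (stream << WORD_BITS) | w
    shift = WORD_BITS * len(words) - start - width
    return (stream >> shift) & ((1 << width) - 1)
```

Note how a single word can contain bits from two adjacent values, which is why a reader of such a table has to mask out the neighbours.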

(b) How to locate the necessary bits from the table, for given $e$ and $k$?

For each multiplier index $s$, we store needed metadata. There are three different kinds of information in this metadata.

First, we store the position of the first stored bit of $5^{k-\eta}$ in the cache table.
Second, we store the offset value which, when added to the exponent $e$, yields the position of the first needed bit of $5^{k-\eta}$ from the table for given $e$, or in other words, the starting position of the window. Since the window shifts to the right by $1$ if we increase $e$ by $1$, such an offset value is enough for determining the position. Note that this starting position of the window can go beyond the left limit of the stored bits because we removed all the leading zeros.

(This in particular means that the offset value can be negative. In the implementation, thus I shifted all of these offset values by some amount to make everything nonnegative. My intention was that by doing so we can possibly use a smaller unsigned integer type for storing those offsets. In the end the size of the integer type wasn’t affected but I left it as it is.)
Finally, we also need a similar offset for the $Q$-table, defined to be the unique value which, when added with $\left\lfloor \frac{e-e_{\min}}{\ell}\right\rfloor$ yields the index in the $Q$-table. This information is only needed when we use the collapse factor $\ell$ for compressing $Q$’s, and is not needed if we use a fixed constant $Q$.

Summary of how it works

Here is how the algorithm works in a nutshell:

Precompute powers of $5$ up to the necessary precision and store them.
Convert the given floating-point number into a “fixed-point form” by multiplying appropriate bits loaded from those precomputed powers of $5$ into the significand of the input.
Iteratively extract the digits from the computed fixed-point form, just like James Anhalt’s algorithm.

And here is a little bit more detailed version of the summary:

Given a floating point number $w = n\cdot 2^{e-1}$, we want to compute $\left(\left\lfloor w\cdot 10^{k} \right\rfloor\ \mathrm{mod}\ 10^{\eta}\right)$ for some $k$’s.
For the first several digits, we can just multiply an appropriate entry from the Dragonbox table, or anything like that will work.
To prepare for the case when that will not yield sufficiently many digits, we find $Q$ that makes the equality $\begin{aligned} \left(\left\lfloor w\cdot 10^{k} \right\rfloor\ \mathrm{mod}\ 10^{\eta}\right) &= \left(\left\lfloor n\cdot (2^{e+k-1}\cdot 5^{k}) \right\rfloor \ \mathrm{mod}\ 10^{\eta}\right) \\ &= \left\lfloor \frac{(n\left\lceil 2^{Q+e+k-\eta-1}\cdot 5^{k-\eta} \right\rceil \ \mathrm{mod}\ 2^{Q})\cdot 10^{\eta}}{2^{Q}} \right\rfloor \end{aligned}$ hold for all $n$ using Theorem 4.2, for all relevant $(e,k)$-pairs.
For each $k$, find the range of bits of $5^{k-\eta}$ that appears in $\left\lfloor 2^{Q+e+k-\eta-1}\cdot 5^{k-\eta} \right\rfloor \ \mathrm{mod}\ 2^{Q}$ for all $e$ such that $(e,k)$ is one of the pairs considered in the previous step.
Compute those bits and store them in a static data table, possibly along with $Q$’s.
At runtime, for given $e$ and $k$ we choose the needed bits from the table and compute $\left(n\left(\left\lceil 2^{Q+e+k-\eta-1}\cdot 5^{k-\eta} \right\rceil \ \mathrm{mod}\ 2^{Q}\right)\ \mathrm{mod}\ 2^{Q}\right).$
By iteratively multiplying a power of $10$ and taking the upper bits, we can compute the digits of $\left(\left\lfloor w\cdot 10^{k} \right\rfloor\ \mathrm{mod}\ 10^{\eta}\right)$ from left to right.
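The steps above can be sketched end to end in Python. One important caveat: instead of the sharp $Q$ obtained from Theorem 4.2, this sketch uses a deliberately conservative choice, $2^{Q} > n_{\max}\cdot D\cdot q$ where $q$ is the denominator of $x = 2^{e+k-1}\cdot 5^{k}$, which is enough to guarantee $\lfloor n\xi\rfloor=\lfloor nx\rfloor$ but is far larger than what the real algorithm needs. The function names are illustrative.

```python
import math
from fractions import Fraction

def segment_exact(n, e, k, eta):
    """Reference value of floor(n * 2^(e+k-1) * 5^k) mod 10^eta."""
    x = Fraction(2) ** (e + k - 1) * Fraction(5) ** k
    return math.floor(n * x) % 10**eta

def segment_fixed_point(n, e, k, eta, n_max):
    """Same value via the fixed-point formula, for any 1 <= n <= n_max."""
    D = 10**eta
    x = Fraction(2) ** (e + k - 1) * Fraction(5) ** k
    # Conservative stand-in for Theorem 4.2 (an assumption of this sketch):
    # pick Q with 2^Q > n_max * D * denominator(x).
    Q = (n_max * D * x.denominator).bit_length()
    m = math.ceil(x * 2**Q / D)          # m = ceil(2^Q * x / D)
    state = (n * (m % 2**Q)) % 2**Q      # only the low Q bits of m matter
    return (state * D) >> Q              # floor(state * D / 2^Q)
```

The point of the actual algorithm is that a much smaller $Q$ (at most $192$ for $\eta\leq 22$, as discussed earlier) already works, so that `state` fits in a few machine words.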

Rounding

After generating the requested number of digits, we have to perform rounding. It sounds simple, but it is in fact an incredibly complicated issue. The root cause of the difficulty is the many ramifications of different cases that all need different handling. However, fortunately, most of those cases can be categorized into two which we will explore in this section.

Rounding inside a subsegment

A segment, which consists of $\eta$-digits, is divided into several subsegments that are small enough to fit inside $32$-bits. The most common case of rounding is when it happens inside a subsegment.

Let’s say we have a $9$-digit subsegment, and we need to round at the $t$-th digit of the subsegment for some $1\leq t\leq 8$. So there is at least one digit left in the subsegment that will not be printed. In this case, the rounding direction is determined as follows, assuming the default rounding rule:

If the remaining $(9-t)$-digits form a number that is strictly bigger than $10^{9-t}/2$, then round-up.
If that number is strictly smaller than $10^{9-t}/2$, then round-down.
When that number is exactly $10^{9-t}/2$, then:
- If the currently printed number is odd, then round-up.
- If there are further nonzero digits after the current subsegment, then round-up.
- Otherwise, round-down.

The problem is that $t$ is in general a runtime variable, not a constant. So it sounds like we have to compute this $(9-t)$-digits number at runtime. But it turns out, we can still do this check without computing it.

The idea is again based on the “fixed-point fractions trick” I learned from James Anhalt’s algorithm. It goes as follows.

Let’s say we have a $9$-digit subsegment. Let’s call it $n$ (this $n$ has nothing to do with the $n$ we have been using in previous sections). In order to print $n$, we find a good approximation of the rational number $n/10^{8}$ as a $64$-bit fixed-point fractional number, where the upper $32$-bits represent the integer part and the lower $32$-bits represent the fractional part. More precisely, we find a $64$-bit integer $y$ such that
\[\frac{2^{32}n}{10^{8}} \leq y < \frac{2^{32}(n+1)}{10^{8}}\]
holds. (See the previous post for details.)
The first digit of $n$ can then be obtained as $\left\lfloor\frac{y}{2^{32}}\right\rfloor$. Also the next $(t-1)$-digits can be obtained by multiplying $10^{t-1}$ to the lower $32$-bits of $y$ and extracting the upper $32$-bits from the result. We assume that we already have done this to print $t$-digits from $n$. Let us call the remaining lower half of the result $y_{1}$.
Then the remaining $(9-t)$-digits can be obtained from $y_{1}$ by multiplying $10^{9-t}$ to it and then extracting the upper $32$-bits. But the thing is, we don’t need to precisely know these digits. All we want to do is to compare those digits with $10^{9-t}/2$. Note that, with $k=9-t$,
\[\left\lfloor \frac{y_{1}\cdot 10^{k}}{2^{32}}\right\rfloor \geq 10^{k}/2\]
holds if and only if $y_{1} \geq 2^{31}$, if and only if the MSB of $y_{1}$ is $1$. Hence, we round-down if the MSB is $0$.
However, we cannot be sure what to do if the MSB is $1$, because that can mean either the equality or the strict inequality. To distinguish those two cases, we have to inspect the inequality
\[\left\lfloor \frac{y_{1}\cdot 10^{k}}{2^{32}}\right\rfloor \geq \frac{10^{k}}{2} + 1,\]
which is equivalent to
\[y_{1} \geq 2^{31} + \frac{2^{32}}{10^{k}}, \quad\textrm{or}\quad y_{1} \geq \left\lceil 2^{31} + \frac{2^{32}}{10^{k}}\right\rceil.\]
The right-hand side is not a constant since $k$ is a runtime variable, but we can simply store all possible values of it in yet another precomputed static table and then index it with $k-1$ (note that $k$ cannot be zero).
Still, we may need to check if there are further nonzero digits left after the current subsegment. I will not go into the details of this since it is sort of a massive case-by-case thing, but overall the idea is that this is just a matter of doing the integer-check of a number of the form $n\cdot 2^{e+k-e_{0}}\cdot 5^{k-k_{0}}$ (this $n$ is the $n$ from previous sections), which then is just a matter of counting factors of $2$ and $5$ inside $n$.
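The comparison trick described above can be sketched in Python as follows. To keep the example short, $y$ is computed by an exact ceiling division, which satisfies the required bounds but is not the multiply-by-a-magic-constant method of the previous post; the table name `THRESHOLDS` and the function name are also just placeholders.

```python
# THRESHOLDS[k-1] = ceil(2^31 + 2^32 / 10^k) for k = 1, ..., 8.
THRESHOLDS = [2**31 + -(-(2**32) // 10**k) for k in range(1, 9)]

def classify_remainder(n, t):
    """Compare the last 9-t digits of the 9-digit subsegment n with 10^(9-t)/2.

    Returns -1 (strictly below half), 0 (exactly half) or 1 (strictly above),
    without ever computing those digits.
    """
    k = 9 - t
    y = -(-(n << 32) // 10**8)             # ceil(2^32 * n / 10^8), within the bounds
    low = y & 0xFFFFFFFF                    # fractional part after the first digit
    y1 = (low * 10 ** (t - 1)) & 0xFFFFFFFF # what is left after printing t digits
    if y1 < 2**31:                          # MSB of y1 is 0
        return -1
    return 1 if y1 >= THRESHOLDS[k - 1] else 0
```

A result of `0` is the only case that needs the extra parity and "further nonzero digits" checks.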

Rounding at the subsegment boundary

The above doesn’t work for $t=9$ since there is no more digit at all to inspect. To obtain further digits we have to look at the next subsegment. But that is not strictly needed: we only need to know just one more bit. Suppose that the current subsegment is equal to $\left(\left\lfloor w\cdot 10^{k}\right\rfloor\ \mathrm{mod}\ 10^{9}\right)$ for some $k$ (this $k$ is a different $k$ from previous sections because $k-k_{\min}$ in this case does not need to be a multiple of $\eta$). Then this means that we need to know the first fractional bit of the rational number $w\cdot 10^{k}$. Assume that we were able to get this bit upfront. Then the rounding conditions can be checked as follows:

If the next bit is $0$, then round-down.
If the next bit is $1$ and the last digit printed is odd, then round-up.
If the next bit is $1$ and if there is at least one nonzero fractional bit of $w\cdot 10^{k}$ after that, then round-up.
Otherwise, round-down.
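Written as code, the four rules above collapse into a two-line decision. This is a Python sketch with made-up parameter names; how the three inputs are obtained is the subject of the rest of this section.

```python
def round_up_at_boundary(next_bit, last_digit_is_odd, has_nonzero_after):
    """Round-half-to-even decision at a subsegment boundary.

    next_bit:           first fractional bit of w * 10^k (0 or 1)
    last_digit_is_odd:  parity of the last printed digit
    has_nonzero_after:  whether any later fractional bit is nonzero
    """
    if next_bit == 0:
        return False  # strictly below one half: round down
    # At least one half: round up unless it is an exact tie on an even digit.
    return last_digit_is_odd or has_nonzero_after
```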

Checking if there are further nonzero bits is again the matter of counting factors of $2$ and $5$, so let us focus on how to get this one more additional bit. I will not go into all the details, but the core idea is this. Recall that we are computing the number

\[\left\lfloor nx \right\rfloor\ \mathrm{mod}\ D.\]

If we want to compute one more bit than this, then what we do is to compute

\[\left\lfloor 2nx \right\rfloor\ \mathrm{mod}\ 2D\]

instead. Indeed, if $b$ is the first fractional bit of $nx$, then $\left\lfloor 2nx\right\rfloor = 2\left\lfloor nx\right\rfloor + b$, so

\[\begin{aligned} 2\left(\left\lfloor nx\right\rfloor\ \mathrm{mod}\ D\right) + b &= 2\left\lfloor nx\right\rfloor - 2D\left\lfloor\frac{\left\lfloor nx\right\rfloor}{D}\right\rfloor+b \\ &= \left\lfloor 2nx\right\rfloor - \left\lfloor\frac{nx}{D}\right\rfloor 2D \\ &= \left\lfloor 2nx\right\rfloor - \left\lfloor\frac{2nx}{2D}\right\rfloor 2D\\ &= \left\lfloor 2nx\right\rfloor - \left\lfloor\frac{\left\lfloor 2nx\right\rfloor }{2D}\right\rfloor 2D\\ &= \left(\left\lfloor 2nx \right\rfloor\ \mathrm{mod}\ 2D\right). \end{aligned}\]

Thus, $b$ can be obtained by inspecting the last bit of the right-hand side, and the remaining bits precisely constitute the value of $\left(\left\lfloor nx\right\rfloor\ \mathrm{mod}\ D\right)$.
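The identity is easy to test with exact rationals. Below is a small Python sketch; the function name is illustrative, and `nx` is passed in directly as a `Fraction` rather than computed from $n$ and $x$.

```python
import math
from fractions import Fraction

def segment_with_next_bit(nx, D):
    """Return (floor(nx) mod D, first fractional bit of nx) from one value."""
    doubled = math.floor(2 * nx) % (2 * D)
    return doubled >> 1, doubled & 1
```

For example, `segment_with_next_bit(Fraction(7, 2), 10)` splits $\lfloor 7\rfloor \bmod 20 = 7$ into the segment $3$ and the bit $1$, matching $7/2 = 3.5$.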

Now, suppose that $2\xi = \frac{2mD}{2^{Q}}$ is a good enough approximation of $2x$ in the sense that the conditions of Theorem 4.2 are satisfied when $x$ is replaced by $2x$ and $\xi$ is replaced by $2\xi$. Then we obtain

\[\begin{aligned} \left(\left\lfloor 2nx \right\rfloor \ \mathrm{mod}\ 2D\right) &= \left\lfloor 2nx \right\rfloor - \left\lfloor\frac{\left\lfloor 2nx \right\rfloor}{2D} \right\rfloor 2D \\ &= \left\lfloor 2n\xi \right\rfloor - \left\lfloor\frac{\left\lfloor 2n\xi \right\rfloor}{2D} \right\rfloor 2D \\ &= \left\lfloor 2n\xi \right\rfloor - \left\lfloor\frac{n\xi}{D} \right\rfloor 2D \\ &= \left\lfloor \frac{2nmD}{2^{Q}} \right\rfloor - \left\lfloor \frac{nm}{2^{Q}}\right\rfloor 2D \\ &= \left\lfloor \frac{2nmD - \lfloor nm/2^{Q} \rfloor 2D\cdot 2^{Q}}{2^{Q}} \right\rfloor \\ &= \left\lfloor \frac{(nm - \lfloor nm/2^{Q} \rfloor 2^{Q})2D}{2^{Q}} \right\rfloor \\ &= \left\lfloor \frac{(nm\ \mathrm{mod}\ 2^{Q})2D}{2^{Q}} \right\rfloor. \end{aligned}\]

So, in this case the additional bit can be simply obtained by multiplying $2D$ instead of $D$ to $(nm\ \mathrm{mod}\ 2^{Q})$, without changing anything else.

In order to make this computation valid, we need to have the inequality

\[2\xi = \frac{\left\lceil 2^{Q}x/D\right\rceil 2D}{2^{Q}} < \begin{cases} 2x + \frac{1}{\tilde{v}\tilde{q}} & \textrm{if $2x=\frac{\tilde{p}}{\tilde{q}}$ is rational with $\tilde{q}\leq n_{\max}$},\\ \frac{\tilde{p}^{*}}{\tilde{q}^{*}} & \textrm{otherwise} \end{cases}\]

instead of $\eqref{eq:core inequality}$, where things with $\tilde{\cdot}$ denote what we obtain from the corresponding things without $\tilde{\cdot}$ when we replace $x$ by $2x$. The above inequality is in general a little bit stricter than $\eqref{eq:core inequality}$, so in fact we have to use this instead of $\eqref{eq:core inequality}$ for computing $m$’s and $Q$’s if we want to ensure correct rounding. Roughly speaking, replacing $\eqref{eq:core inequality}$ by the above increases $Q$ by $1$ overall, but fortunately this does not radically change any of the discussion so far.

Now, suppose we have

\[\left(\left\lfloor 2nx \right\rfloor \ \mathrm{mod}\ 2D\right) = \left\lfloor \frac{(nm\ \mathrm{mod}\ 2^{Q})2D}{2^{Q}} \right\rfloor.\]

Assume that we already have extracted $\gamma_{1}$-digits from it so that we have

\[(nmd_{1}\ \mathrm{mod}\ 2^{Q})\]

as the iteration state where $d_{1} = 10^{\gamma_{1}}$, and our current subsegment is the next $\gamma_{2}$-digits, which means that it can be obtained as

\[\left\lfloor \frac{(nmd_{1}\ \mathrm{mod}\ 2^{Q})d_{2}}{2^{Q}} \right\rfloor\]

with $d_{2}=10^{\gamma_{2}}$. Then, to obtain one more bit, we simply compute

\[\left\lfloor \frac{(nmd_{1}\ \mathrm{mod}\ 2^{Q})2d_{2}}{2^{Q}} \right\rfloor\]

instead. The needed bit is then the last bit of the above, while the remaining bits constitute the actual subsegment.
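In terms of the iteration state, this amounts to one extra factor of $2$ in the multiplier. Here is a minimal Python sketch, again with big integers standing in for the $Q$-bit buffer and with an illustrative function name:

```python
def next_chunk_with_bit(state, Q, gamma):
    """Extract the next gamma digits plus one extra bit for rounding.

    `state` is the current Q-bit iteration state, e.g. (n*m*d1 mod 2^Q).
    Returns (subsegment, bit), where `bit` is the first bit that follows
    the subsegment in the fixed-point expansion.
    """
    top = (state * 2 * 10**gamma) >> Q  # multiply by 2*d2 instead of d2
    return top >> 1, top & 1
```

The returned subsegment agrees with what multiplying by $d_{2}$ alone would give, so nothing else in the iteration has to change.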

Actual implementation

Here is an actual implementation. You can quickly experiment with it here.

In the implementation, several different cache tables are prepared. The table used for computing the first segment is called the main cache table, which consists of the same entries as in Dragonbox. The one used for further digits is called the extended cache table. The implementation allows the users to choose which specific table to use for each of those two.

There are two main cache tables, just like in the Dragonbox implementation: the full table and the compressed table, consisting of $9904$ bytes and $584$ bytes, respectively.

There are three extended cache tables:

the long extended cache table, consisting of $3680$ bytes, generated with the segment length $\eta=22$ assuming constant $Q=192$,
the compact extended cache table, consisting of $1212$ bytes, generated with the segment length $\eta=80$ and the collapse factor $\ell=64$, and
the super-compact extended cache table, consisting of $580$ bytes, generated with the segment length $\eta=252$ and the collapse factor $\ell=128$.

The implementation currently does not support the compact table.

And here is how it performs:


Figure 2 - Performance benchmark.

Red: Proposed algorithm with the full ($9904$ bytes) cache table and the long ($3680$ bytes) extended cache table.
Green: Proposed algorithm with the compressed ($584$ bytes) cache table and the super-compact ($580$ bytes) extended cache table.
Blue: Ryū-printf (reference implementation).
Purple: fmtlib (a variant of Grisu3 with Dragon4 fallback).

So, the proposed algorithm performs:

Better than Ryū-printf for the small digits case, which is mostly covered by the main cache table,
Comparable to Ryū-printf for the large digits case, when supplied with the long extended cache,
Worse than Ryū-printf but significantly faster than fmtlib (Dragon4) for the large digits case, when supplied with the super-compact extended cache.

WARNING

The implementation has NEVER been polished for real-world usage. In particular, I have NOT done any fair amount of formal testing of this implementation. Furthermore, the code is ultra-messy and is full of almost-same-but-slightly-different copies of the same code blocks, which is a perfect recipe for bugs.

I’m pretty confident about the correctness of the algorithm itself though. I believe that any error is likely due to some careless mistakes rather than some fundamental issue in the algorithm.

Possible performance issues of the algorithm

Now, let’s talk about some problematic aspects of our algorithm which can result in bad performance.

Deep hierarchy

Taking large segment length $\eta$ (e.g. $\eta=22$) is a great way to reduce the size of the table, but it has a cost: it introduces deep levels of hierarchy. Let me elaborate what I mean by that.

As pointed out in the previous post, the common wisdom for printing integers is to work with two digits at a time, rather than one. So when we print integers we basically split the input integer into pairs of digits. This introduces one level of hierarchy: individual digits and pairs of digits. Recall also from the previous post that, when we are working with $64$-bit integers rather than $32$-bit integers, it is beneficial to split the $64$-bit input into several $32$-bit chunks, because usually $32$-bit math is cheaper. This introduces another level of hierarchy: individual digits, pairs of digits, and groups of pairs of digits fitting into $32$-bits.

Now, if $\eta$ is bigger than $9$, we cannot store the whole segment into $32$-bits, and if $\eta$ is bigger than $19$, even $64$-bits are insufficient. For $\eta=22$, we need at least three $32$-bit subsegments. For example, we split a segment into two $8$-digits subsegments and one $6$-digits subsegment. Note that getting a subsegment from a segment, following the method explained in the Decimal digit generation section, requires $Q$-bit math which is much more expensive than $64$-bit math. So we do not repeat the iteration $3$ times, rather, we group two subsegments into a single $64$-bit integer, so we need only $2$ iterations and then we separate this $64$-bit integer into two subsegments (which involves $64$-bit math). And this introduces yet another level of hierarchy.

And the bottom-line is, hierarchy is bad in terms of the complexity of the actual implementation. It may introduce more branching, more code, and thus in particular more instruction cache pollution. Roughly speaking, introducing one more level of hierarchy is like converting a flat for loop into a nested loop. In our case it is actually worse than that because it complicates the rounding logic a lot.

In comparison, (the standard implementation of) Ryū-printf only involves more or less three levels of hierarchy: individual digits, pairs of digits, and $9$-digit segments. We could achieve the same depth of hierarchy by just choosing $\eta=9$, but that will of course bloat the size of the static table by a factor of approximately $22/9$. I suspect that choosing $\eta=18$ may be the sweet spot and result in the best performance, but I did not run an actual experiment.

Too much compression

In Ryū-printf, loading the required cache bits from the static table is a no-brainer once you have the index. You just load the table entry located at that index, job done. But our compression scheme complicates this procedure a lot. For example, our compression scheme mandates us to fill in missing leading and trailing bits with zero. Also, due to the compact bitwise packing, a $64$-bit block from the cache table may contain bits from multiple different powers of $5$, so we have to manually remove all the wrong bits “flooding” from adjacent powers of $5$. Also, we have to perform bitwise shifts spanning multiple $64$-bit blocks, and the direction of the shift can depend on the input. Due to all of these (and more), the algorithm spends a considerable amount of time loading the required bits from the table even before it actually starts multiplying them. And note that this is not a one-time cost, as we need to do this each time we need a new segment, so it impacts the throughput quite a lot. This is probably the reason why our implementation doesn’t show a clear win over Ryū-printf in terms of throughput.

Again, by not applying many of our compression strategies (e.g. removing leading zeros), this complication can be relaxed as much as we want. But I decided to compress the table as much as possible at any cost of performance loss, because the small precision case is already mostly covered by the main cache table (the Dragonbox table) alone, and the extended table is essentially used only for pretty large precisions. Anyway, it’s worth noting that this is another place where there is a space-time trade-off.

Overlapping digits

Using two separate tables, one for first several digits and another for further digits, has an issue of possible overlapping digits. What happens here is that, for the first several digits, we basically select any $k$ that maximizes the number of digits we get at once. That is, we select $k$ such that $\left\lfloor w\cdot 10^{k}\right\rfloor$ is maximized while still fitting inside $64$-bits. This allows us to squeeze out $18\sim 19$ digits (for normal numbers, and possibly less for subnormal numbers). However, when we need further digits, we cannot select the right $k$ that will give us the $\eta$-digits that immediately follow the ones we got from the first segment, because we only store one power of $5$ per $\eta$ exponents. In the worst case scenario, the $k$ we select will give us only one new digit. Furthermore, the number of these “overlapping digits” between the first segment (which we get using the main cache table) and the second segment (which we get using the extended cache table) can vary depending on the input. However, this is again a price worth paying, because this separation of two tables allows us to compress the extended cache table as much as we want, while not compromising the performance for the small digits case.

Having to load/store

If we choose $Q$ to vary, then it somewhat mandates the access to the stored blocks in the digit generation procedure to be through the stack rather than registers. This is because there is usually no concept of “arrays” of registers, and it is not possible to dynamically index the registers. In theory, it should be possible to load the memory into the register only once and use it indefinitely because the maximum size of the array is fixed (and often small, say $3 = 192/64$). This means that the dynamic indexing could be converted into constant indexing plus some branching. However, this is complex enough that there may be no benefit in doing so, and no actual compiler seems to do something like this. This is why I chose a fixed constant $Q$ for $\eta=22$ in the implementation.
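To make the “constant indexing plus some branching” idea concrete, here is a minimal sketch assuming at most three $64$-bit blocks. The names `block_by_memory` and `block_by_branching` are hypothetical and are not taken from the actual implementation; they only contrast the two access patterns.

```cpp
#include <cstdint>

// Dynamic indexing: the blocks must live in addressable memory
// (typically the stack), because registers cannot be indexed at run time.
inline std::uint64_t block_by_memory(std::uint64_t const* blocks, unsigned index) {
  return blocks[index];
}

// Constant indexing plus branching: every block can stay in a register,
// at the cost of a branch (or conditional moves) on each access.
inline std::uint64_t block_by_branching(std::uint64_t b0, std::uint64_t b1,
                                        std::uint64_t b2, unsigned index) {
  switch (index) {
    case 0: return b0;
    case 1: return b1;
    default: return b2;  // Assumes index <= 2.
  }
}
```

Both functions return the same value for `index` in $\{0,1,2\}$; whether the second form is actually faster depends on the compiler and the surrounding code.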

Conclusion

It is possible to achieve performance comparable to Ryū-printf with a much smaller amount of static data. To be fair, I should say that the benchmark I’ve done is not quite fair, because the reference implementation of Ryū-printf does not use many of the integer-formatting tricks I used in my implementation. I think the Ryū-printf implementation could be made a lot faster by adopting them.

However, I would dare say that the proposed algorithm is overall just strictly superior, because the core formula behind it is simpler than the corresponding formula in Ryū-printf. Both formulae are more or less equally powerful in terms of their ability to compute $\left(\left\lfloor nx\right\rfloor\ \mathrm{mod}\ 10^{\eta}\right)$ for the same segment size $\eta$, but the formula for the proposed algorithm is simpler and requires fewer multiplications. If we give up many of the compression ideas given here and adopt an approach similar to Ryū-printf, but use the core formula from the proposed algorithm instead, then I don’t doubt that the resulting algorithm would perform way better than Ryū-printf. However, I haven’t bothered to conduct this experiment because my main motivation was to develop an algorithm that only requires a small cache table.

There are still tons of places where things can get further improved, and it would be interesting to see how it could be used to implement an arbitrary-precision floating-point parser in the future. I hope we can get there soon!

Appendix: Fixed-point fraction trick revisited

The implementation relies extensively on the fixed-point fraction trick explained in the previous post. Because there are so many different combinations of the parameters, I felt obliged to take another shot at it and come up with a better analysis. Here is what I got.

Fixed-length case

Here I describe a better analysis of the fixed-length case treated in the last section of the previous post. Recall that for given $n$, we want to find $y$ such that

\[\frac{2^{D}n}{10^{k}} \leq y < \frac{2^{D}(n+1)}{10^{k}}.\]

First of all, why not generalize this a little bit:

\[\frac{np}{q} \leq y < \frac{(n+1)p}{q}.\]

So our original problem is when $p = 2^{D-k}$ and $q=5^{k}$. We attempt to let

\[y = \left\lfloor\frac{nm}{2^{L}}\right\rfloor + 1\]

and see if there is a possible choice of $m$ and $L$. Again, let’s generalize this a little bit and look at

\[y = \left\lfloor n\xi\right\rfloor + 1\]

instead. (Note that these generalizations are not just for the sake of generalizations; the point is really to simplify the analysis by throwing away all the details that play little to no role in the core arguments.) Then the inequality we need to consider is

\[\label{eq:jeaiii fixed-length} \frac{1}{n}\left\lceil\frac{np}{q}\right\rceil - \frac{1}{n} \leq \xi < \frac{1}{n}\left\lceil\frac{(n+1)p}{q}\right\rceil - \frac{1}{n}.\]

Hence, we need to maximize the left-hand side and minimize the right-hand side of the above inequality over the given range of $n$ to see if there is a feasible choice of $\xi$.

First, let us write

\[\left\lceil\frac{np}{q}\right\rceil = \frac{np}{q} + \frac{r}{q}\]

where $0\leq r < q$ is an integer. Then the left-hand side of $\eqref{eq:jeaiii fixed-length}$ becomes

\[\frac{p}{q} - \frac{q - r}{nq}.\]

Hence, we want to find $n$ that maximizes the above, or equivalently, minimizes

\[\frac{q-r}{n}.\]

This minimization problem is almost equivalent to what’s done in the first case of Theorem 4.2. Indeed, we can apply the same proof idea here. Let $v$ be the greatest integer such that $vp\equiv 1\ (\mathrm{mod}\ q)$ and $v\leq n_{\max}$. We claim that $n=v$ is the minimizer of the above. Suppose not, then there exists $n\leq n_{\max}$ such that

\[\frac{q-r}{n} < \frac{1}{v}\]

where $r$ is the smallest nonnegative integer such that $np\equiv -r\ (\mathrm{mod}\ q)$. Multiplying both sides by $nvp$, we get

\[vp(q-r)< np.\]

However, since both sides are congruent to $-r$ modulo $q$, there should exist a positive integer $e$ such that

\[np = vp(q-r) + eq.\]

Now, since $p,q$ are coprime and both $np$ and $vp(q-r)$ are multiples of $p$, it follows that $e$ is a multiple of $p$, so in particular $e\geq p$. Then we get

\[n = v(q-r) + (e/p)q \geq v + q.\]

This in particular implies $v+q\leq n_{\max}$, but that contradicts the definition of $v$. Hence, $v$ must be the minimizer.
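The claim can also be sanity-checked by brute force for small parameters. The following is a minimal sketch, not part of the actual implementation; the names `fraction`, `min_ratio_by_brute_force` and `greatest_v` are made up for this illustration, and the code assumes that $q\cdot n_{\max}$ fits in $64$ bits.

```cpp
#include <cstdint>

struct fraction {
  std::uint64_t num;  // q - r
  std::uint64_t den;  // n
};

// Smallest value of (q - r)/n over n in [1, n_max], where r is the smallest
// nonnegative integer with n*p ≡ -r (mod q).
inline fraction min_ratio_by_brute_force(std::uint64_t p, std::uint64_t q,
                                         std::uint64_t n_max) {
  fraction best{q, 1};  // (q - r)/n never exceeds q/1.
  for (std::uint64_t n = 1; n <= n_max; ++n) {
    auto const np_mod_q = ((p % q) * (n % q)) % q;
    // q - r equals (n*p mod q) when that is nonzero, and q otherwise (r = 0).
    auto const q_minus_r = np_mod_q == 0 ? q : np_mod_q;
    // Compare q_minus_r/n < best.num/best.den by cross-multiplication.
    if (q_minus_r * best.den < best.num * n) best = fraction{q_minus_r, n};
  }
  return best;
}

// Greatest v <= n_max with v*p ≡ 1 (mod q), or 0 if there is none.
inline std::uint64_t greatest_v(std::uint64_t p, std::uint64_t q,
                                std::uint64_t n_max) {
  for (std::uint64_t v = n_max; v >= 1; --v) {
    if (((p % q) * (v % q)) % q == 1 % q) return v;
  }
  return 0;
}
```

For any admissible input, the fraction returned by `min_ratio_by_brute_force` should equal $\frac{1}{v}$ with $v$ given by `greatest_v`, which is exactly what the argument above asserts.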

As a result, we can rewrite $\eqref{eq:jeaiii fixed-length}$ as

\[\frac{p}{q} - \frac{1}{vq} \leq \xi < \frac{1}{n}\left\lceil\frac{(n+1)p}{q}\right\rceil - \frac{1}{n}.\]

For the right-hand side, note that

\[\frac{(n+1)p}{q} = \frac{np}{q} + \frac{p}{q} = \left\lceil\frac{np}{q}\right\rceil + \frac{p - r}{q},\]

so

\[\left\lceil \frac{(n+1)p}{q}\right\rceil = \left\lceil\frac{np}{q}\right\rceil + \left\lceil \frac{p - r}{q} \right\rceil = \frac{np}{q} + \frac{r}{q} + \left\lceil \frac{p - r}{q} \right\rceil.\]

Hence, the right-hand side of $\eqref{eq:jeaiii fixed-length}$ can be written as

\[\frac{p}{q} + \frac{1}{n} \left(\left\lceil \frac{p - r}{q} \right\rceil - \frac{q-r}{q}\right).\]

So we have to minimize this. Well, I don’t know if there is a general and precise way of doing that, but we can find a good enough lower bound for typical numbers we care about. A thing to note here is the fact that $\left\lceil\frac{p-r}{q}\right\rceil$ can only have two different values: $\left\lceil\frac{p}{q}\right\rceil$ and $\left\lceil\frac{p}{q}\right\rceil - 1$. The first one is obtained when $r$ is strictly less than $(p\ \mathrm{mod}\ q)$, and the second one is obtained otherwise. Hence, if we ignore the factor $\frac{1}{n}$, then

\[\left\lceil\frac{p-r}{q}\right\rceil - \frac{q-r}{q}\]

is minimized when $r = (p\ \mathrm{mod}\ q)$. Indeed, this $r$ is clearly the one that minimizes the above among those $r$ with $\left\lceil \frac{p-r}{q}\right\rceil = \left\lceil\frac{p}{q}\right\rceil - 1$. The one that minimizes the above among those $r$ with $\left\lceil \frac{p-r}{q}\right\rceil = \left\lceil\frac{p}{q}\right\rceil$, on the other hand, is $r=0$, in which case we get

\[\left\lceil\frac{p-r}{q}\right\rceil - \frac{q-r}{q} = \left\lceil\frac{p}{q}\right\rceil - 1,\]

which is bigger than the value we get for $r = (p\ \mathrm{mod}\ q)$, which is equal to

\[\left\lfloor \frac{p}{q} \right\rfloor - 1 + \frac{r}{q} = \frac{p}{q} - 1.\]

(This is valid even for $q=1$.)

Now, considering the factor $\frac{1}{n}$, I have no idea if I can find the tight bound. However, at least we get a good lower bound

\[\frac{p}{q} + \frac{1}{n_{\max}}\left(\frac{p}{q} - 1\right)\]

for the optimal value.

Consequently, we get a sufficient condition

\[\frac{p}{q} - \frac{1}{vq} \leq \xi < \frac{p}{q} + \frac{1}{n_{\max}}\left(\frac{p}{q} - 1\right)\]

to have

\[n = \left\lfloor \frac{q(\left\lfloor n\xi\right\rfloor + 1)}{p} \right\rfloor.\]

Specializing to the case $\xi = \frac{m}{2^{L}}$ then gives

\[\label{eq:jeaiii fixed-length specialized} \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \leq m < \frac{2^{L}p}{q} + \frac{2^{L}}{n_{\max}}\left(\frac{p}{q} - 1\right).\]

Hence, it is enough to find $L$ such that the ceiling of the left-hand side is strictly smaller than the ceiling of the right-hand side, and in this case we take

\[m = \left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil\]

(or any other $m$ that is greater than or equal to the above and strictly smaller than the ceiling of the right-hand side of $\eqref{eq:jeaiii fixed-length specialized}$.)

Note that $v$, the greatest integer such that $vp\equiv 1\ (\mathrm{mod}\ q)$ and $v\leq n_{\max}$, is determined independently of $L$.
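The search for $L$ and $m$ described above can be sketched as follows. This is only an illustration under stated assumptions: the name `find_fixed_length_params` and the struct are hypothetical, the code relies on the non-standard `unsigned __int128` type (available in GCC and Clang), and it assumes that $p,q$ are coprime with $p>q>1$ and that the inputs are small enough for every intermediate product to fit in $128$ bits. It uses the fact that $\frac{pv-1}{q}$ is an integer, so both sides of the inequality can be compared exactly.

```cpp
#include <cstdint>
#include <optional>

// Assumption: the compiler provides unsigned __int128 (GCC and Clang do).
using uint128 = unsigned __int128;

struct fixed_length_params {
  std::uint64_t v;
  int L;
  std::uint64_t m;
};

// Searches for the smallest L (up to max_L) admitting an integer m with
//   2^L p/q - 2^L/(vq) <= m < 2^L p/q + (2^L/n_max)(p/q - 1).
inline std::optional<fixed_length_params>
find_fixed_length_params(std::uint64_t p, std::uint64_t q,
                         std::uint64_t n_max, int max_L) {
  // Greatest v <= n_max with v*p ≡ 1 (mod q), found by scanning downwards.
  std::uint64_t v = n_max;
  while (v != 0 && (uint128(p % q) * (v % q)) % q != 1) --v;
  if (v == 0) return std::nullopt;

  // The left-hand side is 2^L (pv - 1)/(vq) = 2^L t/v, where t is exact.
  uint128 const t = (uint128(p) * v - 1) / q;
  // The right-hand side is 2^L (p(n_max + 1) - q)/(n_max q).
  uint128 const rhs_num = uint128(p) * (n_max + 1) - q;
  uint128 const rhs_den = uint128(n_max) * q;

  for (int L = 0; L <= max_L; ++L) {
    uint128 const m = ((t << L) + v - 1) / v;  // Ceiling of the left-hand side.
    if (m * rhs_den < (rhs_num << L)) {
      return fixed_length_params{v, L, std::uint64_t(m)};
    }
  }
  return std::nullopt;
}
```

For example, `find_fixed_length_params(std::uint64_t(1) << 28, 625, 999999, 32)` corresponds to $D=32$, $k=4$, $n\in[0,10^{6})$.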

Some example applications

Here I collected some example applications of the above analysis that I used in the actual implementation of the algorithm.

$D=32$, $k=4$, $n\in[0,10^{6})$.

In this case, we have
- $p=2^{D-k}=2^{28}$, $q=5^{k}=5^{4}$.
- $\mathrm{ModInv}(p,q)=196$, so $v=\left\lfloor\frac{n_{\max}-196}{q}\right\rfloor q + 196 = 999571$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=0$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =429497$, which is of $19$-bits.

$D=32$, $k=5$, $n\in[0,10^{6})$.

In this case, we have
- $p=2^{D-k}=2^{27}$, $q=5^{k}=5^{5}$.
- $\mathrm{ModInv}(p,q)=1642$, so $v=\left\lfloor\frac{n_{\max}-1642}{q}\right\rfloor q + 1642 = 998517$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=4$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =687195$, which is of $20$-bits.

$D=32$, $k=5$, $n\in[0,10^{7})$.

In this case, we have
- $p=2^{D-k}=2^{27}$, $q=5^{k}=5^{5}$.
- $\mathrm{ModInv}(p,q)=1642$, so $v=\left\lfloor\frac{n_{\max}-1642}{q}\right\rfloor q + 1642 = 9998517$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=8$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =10995117$, which is of $24$-bits.

$D=32$, $k=6$, $n\in[0,10^{7})$.

In this case, we have
- $p=2^{D-k}=2^{26}$, $q=5^{k}=5^{6}$.
- $\mathrm{ModInv}(p,q)=12659$, so $v=\left\lfloor\frac{n_{\max}-12659}{q}\right\rfloor q + 12659 = 9997034$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=12$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =17592187$, which is of $25$-bits.

$D=32$, $k=6$, $n\in[0,10^{8})$.

In this case, we have
- $p=2^{D-k}=2^{26}$, $q=5^{k}=5^{6}$.
- $\mathrm{ModInv}(p,q)=12659$, so $v=\left\lfloor\frac{n_{\max}-12659}{q}\right\rfloor q + 12659 = 99997034$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=15$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =140737489$, which is of $28$-bits.

$D=32$, $k=7$, $n\in[0,10^{8})$.

In this case, we have
- $p=2^{D-k}=2^{25}$, $q=5^{k}=5^{7}$.
- $\mathrm{ModInv}(p,q)=56568$, so $v=\left\lfloor\frac{n_{\max}-56568}{q}\right\rfloor q + 56568 = 99978443$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=18$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =112589991$, which is of $27$-bits.

$D=32$, $k=7$, $n\in[0,10^{9})$.

In this case, we have
- $p=2^{D-k}=2^{25}$, $q=5^{k}=5^{7}$.
- $\mathrm{ModInv}(p,q)=56568$, so $v=\left\lfloor\frac{n_{\max}-56568}{q}\right\rfloor q + 56568 = 999978443$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=20$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =450359963$, which is of $29$-bits.

$D=32$, $k=8$, $n\in[0,10^{9})$.

In this case, we have
- $p=2^{D-k}=2^{24}$, $q=5^{k}=5^{8}$.
- $\mathrm{ModInv}(p,q)=35011$, so $v=\left\lfloor\frac{n_{\max}-35011}{q}\right\rfloor q + 35011 = 999644386$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=24$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =720575941$, which is of $30$-bits.

$D=64$, $k=8$, $n\in[0,10^{10})$.

In this case, we have
- $p=2^{D-k}=2^{56}$, $q=5^{k}=5^{8}$.
- $\mathrm{ModInv}(p,q)=233416$, so $v=\left\lfloor\frac{n_{\max}-233416}{q}\right\rfloor q + 233416 = 9999842791$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=0$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =184467440738$, which is of $38$-bits.

$D=64$, $k=9$, $n\in[0,10^{10})$.

In this case, we have
- $p=2^{D-k}=2^{55}$, $q=5^{k}=5^{9}$.
- $\mathrm{ModInv}(p,q)=857457$, so $v=\left\lfloor\frac{n_{\max}-857457}{q}\right\rfloor q + 857457 = 9998904332$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=0$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =18446744074$, which is of $35$-bits.

$D=64$, $k=13$, $n\in[0,10^{14})$.

In this case, we have
- $p=2^{D-k}=2^{51}$, $q=5^{k}=5^{13}$.
- $\mathrm{ModInv}(p,q)=888719312$, so $v=\left\lfloor\frac{n_{\max}-888719312}{q}\right\rfloor q + 888719312 = 99999668016187$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii fixed-length specialized}$ is $L=26$, and in this case $m=\left\lceil \frac{2^{L}p}{q} - \frac{2^{L}}{vq} \right \rceil =123794003928539$, which is of $47$-bits.
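For the $D=32$ examples with a small range of $n$, the resulting constants can be verified exhaustively. Below is a minimal sketch of such a check; the function name `check_fixed_length` is hypothetical, $D=32$ is hard-coded, and the code assumes that both $nm$ and $10^{k}y$ fit in $64$ bits.

```cpp
#include <cstdint>

// Checks n == floor(10^k * y / 2^32) with y = floor(n*m / 2^L) + 1
// for every n in [0, n_end).
inline bool check_fixed_length(std::uint64_t m, int L, int k,
                               std::uint64_t n_end) {
  std::uint64_t pow10 = 1;
  for (int i = 0; i < k; ++i) pow10 *= 10;
  for (std::uint64_t n = 0; n < n_end; ++n) {
    auto const y = ((n * m) >> L) + 1;
    if (((y * pow10) >> 32) != n) return false;
  }
  return true;
}
```

For instance, `check_fixed_length(429497, 0, 4, 1000000)` exercises the first example above, and `check_fixed_length(687195, 4, 5, 1000000)` the second one.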

Variable-length case

Recall that all we want to do is to obtain $y$ satisfying

\[\frac{np}{q} \leq y < \frac{(n+1)p}{q}.\]

In this section, let us analyze the choice

\[y = \left\lfloor\frac{nm}{2^{L}}\right\rfloor,\]

or more generally

\[y = \left\lfloor n\xi\right\rfloor,\]

or equivalently,

\[\label{eq:jeaiii variable-length} \frac{1}{n}\left\lceil\frac{np}{q}\right\rceil \leq \xi < \frac{1}{n}\left\lceil\frac{(n+1)p}{q}\right\rceil.\]

This inequality almost always does not have any solution if we consider the full range $[1,n_{\max}]$ of $n$, but it often does have a solution if we have a big enough lower bound on $n$; that is, we consider the range $[n_{\min},n_{\max}]$ instead for some $n_{\min} < n_{\max}$.

Again, our goal is to find an upper bound of the left-hand side and a lower bound of the right-hand side. At this point I still don’t know if there is an elegant strategy to figure out the optimal bounds, but here I explain a way to obtain good enough bounds, which I guess are better than the ones I derived in the previous post. Just like we did in the previous section, let us write

\[\left\lceil\frac{np}{q}\right\rceil = \frac{np}{q} + \frac{r}{q}\]

where $0\leq r < q$ is an integer, which then gives us

\[\left\lceil \frac{(n+1)p}{q}\right\rceil = \left\lceil\frac{np}{q}\right\rceil + \left\lceil \frac{p - r}{q} \right\rceil = \frac{np}{q} + \frac{r}{q} + \left\lceil \frac{p - r}{q} \right\rceil.\]

Then we can rewrite $\eqref{eq:jeaiii variable-length}$ as

\[\frac{p}{q} + \frac{r}{nq} \leq \xi < \frac{p}{q} + \frac{1}{n} \left(\frac{r}{q} + \left\lceil\frac{p-r}{q}\right\rceil\right).\]

Clearly, the second term of the left-hand side is bounded above by

\[\frac{q-1}{n_{\min}q},\]

and the second term of the right-hand side is bounded below by

\[\frac{1}{n_{\max}}\left(\frac{r}{q} + \frac{p-r}{q}\right) = \frac{p}{n_{\max}q}.\]

Thus, we obtain the following sufficient condition:

\[\frac{p}{q} + \frac{q-1}{n_{\min}q} \leq \xi < \frac{p}{q} + \frac{p}{n_{\max}q}\]

to have

\[n = \left\lfloor \frac{q\left\lfloor n\xi\right\rfloor}{p} \right\rfloor.\]

Specializing to the case $\xi = \frac{m}{2^{L}}$ then gives

\[\label{eq:jeaiii variable-length specialized} \frac{2^{L}p}{q} + \frac{2^{L}(q-1)}{n_{\min}q} \leq m < \frac{2^{L}p}{q} + \frac{2^{L}p}{n_{\max}q}.\]
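As a quick worked instance of the inequality above (the numbers are rounded, so treat this as a sanity check rather than a proof), take $D=32$, $k=7$ and $n\in[10^{8},10^{9})$, so that $p=2^{25}$, $q=5^{7}$, $n_{\min}=10^{8}$ and $n_{\max}=10^{9}-1$. With $L=20$ we get

\[\frac{2^{L}p}{q} = \frac{2^{45}}{5^{7}} = 450359962.7370496,\quad \frac{2^{L}(q-1)}{n_{\min}q} \approx 0.0105,\quad \frac{2^{L}p}{n_{\max}q} \approx 0.4504,\]

so the inequality reads approximately

\[450359962.7475 \leq m < 450359963.1874,\]

and $m=450359963$ is the only integer solution.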

Some example applications

Here I collected some example applications of the above analysis that I used in the actual implementation of the algorithm.

$D=64$, $k=18$, $n\in[10^{18},10^{19})$.

In this case, we have
- $p=2^{D-k}=2^{46}$, $q=5^{k}=5^{18}$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii variable-length specialized}$ is $L=56$, and in this case $m=\left\lceil \frac{2^{L}p}{q} + \frac{2^{L}(q-1)}{n_{\min}q} \right \rceil =1329227995784915873$, which is of $61$-bits.

$D=64$, $k=17$, $n\in[10^{18},10^{19})$.

In this case, we have
- $p=2^{D-k}=2^{47}$, $q=5^{k}=5^{17}$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii variable-length specialized}$ is $L=55$, and in this case $m=\left\lceil \frac{2^{L}p}{q} + \frac{2^{L}(q-1)}{n_{\min}q} \right \rceil =6646139978924579365$, which is of $63$-bits. Or, when $L=56$, we get $m=13292279957849158730$ which is of $64$-bits.

$D=64$, $k=17$, $n\in[10^{17},10^{18})$.

In this case, we have
- $p=2^{D-k}=2^{47}$, $q=5^{k}=5^{17}$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii variable-length specialized}$ is $L=52$, and in this case $m=\left\lceil \frac{2^{L}p}{q} + \frac{2^{L}(q-1)}{n_{\min}q} \right \rceil =830767497365572421$, which is of $60$-bits.

$D=64$, $k=16$, $n\in[10^{17},10^{18})$.

In this case, we have
- $p=2^{D-k}=2^{48}$, $q=5^{k}=5^{16}$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii variable-length specialized}$ is $L=48$, and in this case $m=\left\lceil \frac{2^{L}p}{q} + \frac{2^{L}(q-1)}{n_{\min}q} \right \rceil =519229685853482763$, which is of $59$-bits. Or, when $L=52$, we get $m=8307674973655724206$ which is of $63$-bits.

$D=64$, $k=16$, $n\in[10^{16},10^{17})$.

In this case, we have
- $p=2^{D-k}=2^{48}$, $q=5^{k}=5^{16}$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii variable-length specialized}$ is $L=44$, and in this case $m=\left\lceil \frac{2^{L}p}{q} + \frac{2^{L}(q-1)}{n_{\min}q} \right \rceil =32451855365842673$, which is of $55$-bits.

$D=64$, $k=15$, $n\in[10^{16},10^{17})$.

In this case, we have
- $p=2^{D-k}=2^{49}$, $q=5^{k}=5^{15}$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii variable-length specialized}$ is $L=41$, and in this case $m=\left\lceil \frac{2^{L}p}{q} + \frac{2^{L}(q-1)}{n_{\min}q} \right \rceil =40564819207303341$, which is of $56$-bits. Or, when $L=44$, we get $m=324518553658426727$ which is of $59$-bits.

$D=32$, $k=8$, $n\in[10^{8},10^{9})$.

In this case, we have
- $p=2^{D-k}=2^{24}$, $q=5^{k}=5^{8}$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii variable-length specialized}$ is $L=24$, and in this case $m=\left\lceil \frac{2^{L}p}{q} + \frac{2^{L}(q-1)}{n_{\min}q} \right \rceil =720575941$, which is of $30$-bits.

$D=32$, $k=7$, $n\in[10^{8},10^{9})$.

In this case, we have
- $p=2^{D-k}=2^{25}$, $q=5^{k}=5^{7}$.
The smallest $L$ that allows an integer solution to $\eqref{eq:jeaiii variable-length specialized}$ is $L=20$, and in this case $m=\left\lceil \frac{2^{L}p}{q} + \frac{2^{L}(q-1)}{n_{\min}q} \right \rceil =450359963$, which is of $29$-bits.
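The two $D=32$ examples can be spot-checked with a short program. The sketch below is only an illustration: the name `check_variable_length` is hypothetical, $D=32$ is hard-coded, the loop samples $n$ with a stride instead of visiting every value, and the code assumes that both $nm$ and $10^{k}y$ fit in $64$ bits.

```cpp
#include <cstdint>

// Checks n == floor(10^k * y / 2^32) with y = floor(n*m / 2^L)
// (no "+ 1" this time) for n = n_begin, n_begin + stride, ... below n_end,
// and additionally for n = n_end - 1.
inline bool check_variable_length(std::uint64_t m, int L, int k,
                                  std::uint64_t n_begin, std::uint64_t n_end,
                                  std::uint64_t stride) {
  std::uint64_t pow10 = 1;
  for (int i = 0; i < k; ++i) pow10 *= 10;
  auto const ok = [&](std::uint64_t n) {
    return ((((n * m) >> L) * pow10) >> 32) == n;
  };
  for (std::uint64_t n = n_begin; n < n_end; n += stride) {
    if (!ok(n)) return false;
  }
  return ok(n_end - 1);
}
```

Calling it with a range that starts far below $n_{\min}$, for example `check_variable_length(720575941, 24, 8, 1, 2, 1)`, is expected to fail, which illustrates why the lower bound on $n$ is needed.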

Faster integer formatting - James Anhalt (jeaiii)’s algorithm

2022-02-16T00:00:00-08:00

This post is about an ingenious algorithm for printing integers into decimal strings. It sounds like an extremely simple problem, but it is in fact considerably more complicated than one might imagine. Let us define more precisely what we want to do: we take an integer of specific bit-width and a byte buffer, and convert the input integer into a string consisting of its decimal digits, and then write it into the given buffer. For simplicity, we will assume that the integer is unsigned and is of $32$-bits. So, we want to implement the following function written in C++:

char* itoa(std::uint32_t n, char* buffer) {
  // Convert n into decimal digit string and write it into buffer.
  // Returns the position right next to the last character written.
}

There are numerous algorithms for doing this, and I will dig into a clever algorithm invented by James Anhalt (jeaiii), which seems to be the fastest known algorithm at the point of writing this post.

Disclaimer

I actually have not looked (and will not look) carefully at his code (MACROS, oh my god 😱) and have no idea what precisely was the method of analysis he had in mind. All I write here is purely my own analysis inspired by reading these lines of comment he wrote:

// 1. form a 7.32 bit fixed point numner: t = u * 2^32 / 10^log10(u)
// 2. convert 2 digits at a time [00, 99] by lookup from the integer portion of the fixed point number (the upper 32 bits)
// 3. multiply the fractional part (the lower 32 bits) by 100 and repeat until 1 or 0 digits left
// 4. if 1 digit left mulptipy by 10 and convert it (add '0')
//
// N == log10(u)
// finding N by binary search for a 32bit number N = [0, 9]
// this is fast and selected such 1 & 2 digit numbers are faster (2 branches) than long numbers (4 branches) for the 10 cases
//          
//      /\____________
//     /  \______     \______
//    /\   \     \     \     \
//   0  1  /\    /\    /\    /\
//        2  3  4  5  6  7  8  9

So it is totally possible that in fact what’s written in this post has nothing to do with what he actually did; however, I strongly believe that what I ended up with is more or less equivalent to what his code is doing, modulo some minor differences.

Naïve implementations

The very first problem that anyone who tries to implement such a function will face is that we want to write digits from left to right, but naturally we compute the digits from right to left. Hence, unless we know the number of decimal digits in the input upfront, we do not know the exact position in the buffer that we can write our digits into. There are several different strategies to cope with this issue. Probably the simplest (and quite effective) one is to just print the digit from right to left but into a temporary buffer, and after we get all digits of $n$ we copy the temporary buffer back to the destination buffer. At the point of obtaining the left-most decimal digit of $n$, we also get the length of the string, so we know what exact bytes to copy.

With this strategy, we can think of the following implementation:

#include <cstdint>
#include <cstring>

char* itoa_naive(std::uint32_t n, char* buffer) {
  char temp[10];
  char* ptr = temp + sizeof(temp) - 1;
  while (n >= 10) {
    *ptr = char('0' + (n % 10));
    n /= 10;
    --ptr;
  }
  *ptr = char('0' + n);
  auto length = temp + sizeof(temp) - ptr;
  std::memcpy(buffer, ptr, length);
  return buffer + length;
}

(Demo: https://godbolt.org/z/7G7ecs7r4)

The size of the temporary buffer is set to $10$, because that’s the maximum possible decimal length for std::uint32_t.

The mismatch between the order of computation and the order of desired output is indeed a quite nasty problem, but let us forget about this issue for a while because there is something more interesting to say here.

There are several performance issues in this code, and one of them is the division by $10$. Of course, since the divisor is a known constant, our lovely compiler will automatically convert the division into multiply-and-shift (see this classic paper for example), so we do not need to worry about the dreaded idiv instruction, which is notorious for its poor performance. However, for simple enough algorithms like this, multiplication can still be a performance killer, so it is reasonable to expect that we will get better performance by reducing the number of multiplications.

Regarding this, Andrei Alexandrescu popularized the idea of generating two digits per division, thereby halving the required number of multiplications. That is, we first prepare a lookup table for converting $2$-digit integers into strings. Then in the loop we perform divisions by $100$, rather than $10$, to get $2$ digits per division. The following code illustrates this:

static constexpr char radix_100_table[] = {
    '0', '0', '0', '1', '0', '2', '0', '3', '0', '4',
    '0', '5', '0', '6', '0', '7', '0', '8', '0', '9',
    '1', '0', '1', '1', '1', '2', '1', '3', '1', '4',
    '1', '5', '1', '6', '1', '7', '1', '8', '1', '9',
    '2', '0', '2', '1', '2', '2', '2', '3', '2', '4',
    '2', '5', '2', '6', '2', '7', '2', '8', '2', '9',
    '3', '0', '3', '1', '3', '2', '3', '3', '3', '4',
    '3', '5', '3', '6', '3', '7', '3', '8', '3', '9',
    '4', '0', '4', '1', '4', '2', '4', '3', '4', '4',
    '4', '5', '4', '6', '4', '7', '4', '8', '4', '9',
    '5', '0', '5', '1', '5', '2', '5', '3', '5', '4',
    '5', '5', '5', '6', '5', '7', '5', '8', '5', '9',
    '6', '0', '6', '1', '6', '2', '6', '3', '6', '4',
    '6', '5', '6', '6', '6', '7', '6', '8', '6', '9',
    '7', '0', '7', '1', '7', '2', '7', '3', '7', '4',
    '7', '5', '7', '6', '7', '7', '7', '8', '7', '9',
    '8', '0', '8', '1', '8', '2', '8', '3', '8', '4',
    '8', '5', '8', '6', '8', '7', '8', '8', '8', '9',
    '9', '0', '9', '1', '9', '2', '9', '3', '9', '4',
    '9', '5', '9', '6', '9', '7', '9', '8', '9', '9'
};

char* itoa_two_digits_per_div(std::uint32_t n, char* buffer) {
  char temp[8];
  char* ptr = temp + sizeof(temp);
  while (n >= 100) {
    ptr -= 2;
    std::memcpy(ptr, radix_100_table + (n % 100) * 2, 2);
    n /= 100;
  }
  if (n >= 10) {
    std::memcpy(buffer, radix_100_table + n * 2, 2);
    buffer += 2;
  }
  else {
    buffer[0] = char('0' + n);
    buffer += 1;
  }
  auto remaining_length = temp + sizeof(temp) - ptr;
  std::memcpy(buffer, ptr, remaining_length);
  return buffer + remaining_length;
}

(Demo: https://godbolt.org/z/vnMTf7s9r)
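Each iteration of the loop above needs both `n / 100` and `n % 100`. The following is a minimal sketch of how that pair is typically lowered for $32$-bit inputs once the division has been turned into multiply-and-shift; the names `quot_rem` and `divmod100` are made up for this illustration, and the exact instructions a given compiler emits may differ.

```cpp
#include <cstdint>

struct quot_rem {
  std::uint32_t quot;
  std::uint32_t rem;
};

inline quot_rem divmod100(std::uint32_t n) {
  // First multiplication: the quotient via multiply-and-shift,
  // where 1374389535 == ceil(2^37 / 100).
  auto const quot = std::uint32_t((std::uint64_t(n) * 1374389535u) >> 37);
  // Second multiplication: the remainder recovered from the quotient.
  auto const rem = n - quot * 100;
  return {quot, rem};
}
```

The magic constant is exact for every $32$-bit $n$, because $100\cdot 1374389535 - 2^{37} = 28$ and $28n < 2^{37}$ for all $n<2^{32}$.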

The core idea of James Anhalt’s algorithm

So, with the above trick of grouping $2$ digits, how many multiplications do we need for integers of, say, $10$ decimal digits? Note that we need to compute both the quotient and the remainder, and as far as I know at least $2$ multiplications should be performed to get both of them and there is no way to do it with just one multiplication. Hence, for each pair of $2$ digits, we need to perform $2$ multiplications, thus for integers with $10$ digits we need $8$ multiplications, since we need $4$ divisions to separate $5$ pairs of $2$ digits.

Quite surprisingly, in fact we can almost halve that number again into $5$, which (I believe) is the core benefit of James Anhalt’s algorithm. The crux of the idea can be summarized as follows: given $n$, we find an integer $y$ satisfying

\[n = \left\lfloor\frac{10^{k}y}{2^{D}}\right\rfloor\]

for some nonnegative integer constants $k$ and $D$.

This transformation is a big deal, because once we have such a $y$, we can extract $2$ digits of $n$ per multiplication. To see how, recall that in general

\[\left\lfloor\frac{a}{bc}\right\rfloor =\left\lfloor\frac{\lfloor a/b \rfloor}{c}\right\rfloor\]

holds for any positive integers $a,b,c$; that is, dividing $a$ by $bc$ is equivalent to dividing $a$ by $b$ first and then by $c$. Therefore, for any $l\leq k$, we have

\[\left\lfloor\frac{10^{k-l}y}{2^{D}}\right\rfloor = \left\lfloor\frac{n}{10^{l}}\right\rfloor.\]

So, for example, let $k=l=8$, then

\[\left\lfloor\frac{n}{10^{8}}\right\rfloor = \left\lfloor\frac{y}{2^{D}}\right\rfloor.\]

Assuming that $n$ is of $10$ digits, the left-hand side is precisely the first $2$ digits of $n$, while the right-hand side is just the right-shift of $y$ by $D$-bits.

On the other hand, the next $2$ digits of $n$ can be computed as

\[\left(\left\lfloor\frac{n}{10^{6}}\right\rfloor \ \operatorname{mod}\ 10^{2}\right) = \left(\left\lfloor\frac{10^{2}y}{2^{D}}\right\rfloor \ \operatorname{mod}\ 10^{2}\right).\]

Note that, if we write $y$ as $y=2^{D}q + r$ where $q$ is the quotient and $r$ is the remainder, then

\[10^{2}y = 2^{D}(10^{2}q) + 10^{2}r,\]

so

\[\left(\left\lfloor\frac{10^{2}y}{2^{D}}\right\rfloor \ \operatorname{mod}\ 10^{2}\right) = \left(\left(10^{2}q + \left\lfloor\frac{10^{2}r}{2^{D}}\right\rfloor\right) \ \operatorname{mod}\ 10^{2}\right) = \left(\left\lfloor\frac{10^{2}r}{2^{D}}\right\rfloor \ \operatorname{mod}\ 10^{2}\right).\]

Also, since $r<2^{D}$, $\left\lfloor\frac{10^{2}r}{2^{D}}\right\rfloor$ is strictly less than $10^{2}$, so we get

\[\left(\left\lfloor\frac{n}{10^{6}}\right\rfloor \ \operatorname{mod}\ 10^{2}\right) = \left\lfloor\frac{10^{2}r}{2^{D}}\right\rfloor.\]

This means that, in order to compute the next $2$ digits of $n$, we first obtain the remainder of $y$ divided by $2^{D}$, then multiply it by $10^{2}$, and then obtain the quotient of the result divided by $2^{D}$. In other words, we just need to take the lowest $D$ bits, multiply them by $10^{2}$, and then right-shift the result by $D$ bits. As you can see, we only need $1$ multiplication here.

This trend continues: to compute the next $2$ digits of $n$, we again take the lowest $D$ bits, multiply them by $10^{2}$, and right-shift the result by $D$ bits, so in particular we only need $1$ multiplication for generating each pair of $2$ digits. Indeed, it can be inductively shown that if we write $y_{0}=y$ and $y_{i+1} = 10^{2}(y_{i}\ \operatorname{mod}\ 2^{D})$, then

\[\left(\left\lfloor\frac{n}{10^{k-2i}}\right\rfloor \ \operatorname{mod}\ 10^{2}\right) = \left\lfloor\frac{y_{i}}{2^{D}}\right\rfloor\]

holds for each $i=0,1,2,3,4$.
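The recurrence can be tried out directly. The sketch below is an illustration only and makes two assumptions that are not part of the discussion so far: it picks $D=32$, $k=8$, and it takes $y$ to be the smallest integer satisfying the defining equation, namely $\left\lceil\frac{2^{32}n}{10^{8}}\right\rceil$, computed with an honest division. That of course defeats the purpose of the trick, but it is convenient for checking the claim. The name `digit_pairs` is hypothetical.

```cpp
#include <array>
#include <cstdint>

// Splits n into 5 two-digit groups using y_0 = ceil(2^32 * n / 10^8) and
// y_{i+1} = 100 * (y_i mod 2^32). The division is here only for checking.
inline std::array<std::uint32_t, 5> digit_pairs(std::uint32_t n) {
  constexpr std::uint64_t low_mask = 0xFFFFFFFFu;
  std::uint64_t y = ((std::uint64_t(n) << 32) + 99999999u) / 100000000u;
  std::array<std::uint32_t, 5> pairs{};
  for (int i = 0; i < 5; ++i) {
    pairs[i] = std::uint32_t(y >> 32);  // floor(y_i / 2^D)
    y = (y & low_mask) * 100;           // y_{i+1}
  }
  return pairs;
}
```

For example, `digit_pairs(1234567890u)` yields the groups $12, 34, 56, 78, 90$.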

How to compute $y$?

Alright, so we get that having $y$ satisfying

\[n = \left\lfloor\frac{10^{k}y}{2^{D}}\right\rfloor\]

is pretty useful for our purpose. The next question is how to find such $y$. Note that the above equality is equivalent to the inequality

\[n \leq \frac{10^{k}y}{2^{D}} < n+1,\]

or equivalently,

\[\tag{$*$} \frac{2^{D}n}{10^{k}} \leq y < \frac{2^{D}(n+1)}{10^{k}}.\]

Assuming $2^{D}\geq 10^{k}$, $y=\left\lceil\frac{2^{D}n}{10^{k}}\right\rceil$ will obviously do the job, but computing $\left\lceil\frac{2^{D}n}{10^{k}}\right\rceil$ might be nontrivial. Hence, let us first try something easier:

\[y = n\left\lceil\frac{2^{D}}{10^{k}}\right\rceil.\]

Here, $\left\lceil\frac{2^{D}}{10^{k}}\right\rceil$ is just a constant and is not dependent on $n$, so computing $y$ is just a single integer multiplication.

With $k=8$ and $n<2^{32}$, we can show that this $y$ always satisfies $(*)$ if we take $D\geq 57$; we just need to find the smallest $D$ satisfying $(2^{32}-1)\left(\left\lceil\frac{2^{D}}{10^{k}}\right\rceil - \frac{2^{D}}{10^{k}}\right) < \frac{2^{D}}{10^{k}}$. Hence, choosing $D=57$, we get the magic number

\[\left\lceil\frac{2^{D}}{10^{k}}\right\rceil = 1441151881.\]

This leads us to the following code for always printing $10$ digits of given $n$ with possible leading zeros:

char* itoa_always_10_digits(std::uint32_t n, char* buffer) {
    constexpr auto mask = (std::uint64_t(1) << 57) - 1;
    auto y = n * std::uint64_t(1441151881);
    std::memcpy(buffer + 0, radix_100_table + int(y >> 57) * 2, 2);
    y &= mask;
    y *= 100;
    std::memcpy(buffer + 2, radix_100_table + int(y >> 57) * 2, 2);
    y &= mask;
    y *= 100;
    std::memcpy(buffer + 4, radix_100_table + int(y >> 57) * 2, 2);
    y &= mask;
    y *= 100;
    std::memcpy(buffer + 6, radix_100_table + int(y >> 57) * 2, 2);
    y &= mask;
    y *= 100;
    std::memcpy(buffer + 8, radix_100_table + int(y >> 57) * 2, 2);

    return buffer + 10;
}

(Demo: https://godbolt.org/z/9c4Mb76hc)

The constant $1441151881$ is only of $31$-bits and multiplications are performed in $64$-bits so there is no overflow.
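The choice $D=57$ can be double-checked by searching for the smallest admissible $D$ with a few lines of code. This is a minimal sketch, with the hypothetical name `smallest_shift`; it multiplies the condition $n_{\max}\left(\left\lceil\frac{2^{D}}{10^{k}}\right\rceil - \frac{2^{D}}{10^{k}}\right) < \frac{2^{D}}{10^{k}}$ by $10^{k}$ to stay within integers, and assumes that $n_{\max}\cdot 10^{k}$ fits in $64$ bits.

```cpp
#include <cstdint>

// Smallest D with 2^D >= 10^k such that y = n * ceil(2^D / 10^k) satisfies (*)
// for all n <= n_max, using the integer form of the test
//   n_max * (ceil(2^D / 10^k) * 10^k - 2^D) < 2^D.
// Returns -1 if no D <= 63 works.
inline int smallest_shift(int k, std::uint64_t n_max) {
  std::uint64_t pow10 = 1;
  for (int i = 0; i < k; ++i) pow10 *= 10;
  for (int D = 0; D <= 63; ++D) {
    auto const pow2 = std::uint64_t(1) << D;
    if (pow2 < pow10) continue;
    auto const magic = (pow2 + pow10 - 1) / pow10;  // ceil(2^D / 10^k)
    if (n_max * (magic * pow10 - pow2) < pow2) return D;
  }
  return -1;
}
```

With $k=8$ and $n_{\max}=2^{32}-1$, the call is `smallest_shift(8, 4294967295)`.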

Consideration of variable length

One can easily modify the above algorithm to omit printing leading decimal zeros and align the output to the left-most position of the buffer. However, the resulting algorithm is not very nice; although it never performs more than $5$ multiplications, it always performs exactly $5$ of them, even for short numbers like $n=15$.

What James Anhalt did with this is complete separation of the code paths for all possible lengths of $n$. I mean, something like this:

//      /\____________
//     /  \______     \______
//    /\   \     \     \     \
//   0  1  /\    /\    /\    /\
//        2  3  4  5  6  7  8  9
char* itoa_var_length(std::uint32_t n, char* buffer) {
  if (n < 100) {
    if (n < 10) {
      // 1 digit.
    }
    else {
      // 2 digits.
    }
  }
  else if (n < 100'0000) {
    if (n < 1'0000) {
      if (n < 1'000) {
        // 3 digits.
      }
      else {
        // 4 digits.
      }
    }
    else {
      if (n < 10'0000) {
        // 5 digits.
      }
      else {
        // 6 digits.
      }
    }
  }
  else if (n < 1'0000'0000) {
    if (n < 1000'0000) {
      // 7 digits.
    }
    else {
      // 8 digits.
    }
  }
  else if (n < 10'0000'0000) {
    // 9 digits.
  }
  else {
    // 10 digits.
  }
}

It sounds pretty crazy, but it does the job quite well.

Now, recall that our main job is to find $y$ satisfying

\[n = \left\lfloor\frac{10^{k}y}{2^{D}}\right\rfloor.\]

Note that the choice $k=8$ was to make sure that $\left\lfloor\frac{y}{2^{D}}\right\rfloor$ is the first $2$ digits, given that $n$ is of $10$ digits. Since $n$ is not always of $10$ digits, we may take different $k$’s for each branch. For example, when $n$ is of $3$ digits, we may want to choose $k=2$. Then since $n\leq 999$ in this case, the choice

\[y = n\left\lceil\frac{2^{D}}{10^{k}}\right\rceil\]

is valid for any $D\geq 12$, as $999\cdot \left(\left\lceil\frac{2^{12}}{10^{2}}\right\rceil - \frac{2^{12}}{10^{2}}\right) < \frac{2^{12}}{10^{2}}$ holds. In this case, it is in fact better to choose $D=32$ rather than $D=12$, because on platforms such as x86, obtaining the lower half of a $64$-bit integer is basically a no-op.

As a result, the first digit of $n$ can be computed as

\[\left\lfloor\frac{n}{10^{2}}\right\rfloor = \left\lfloor\frac{y}{2^{32}}\right\rfloor = \left\lfloor\frac{\lceil 2^{32}/10^{2} \rceil n}{2^{32}}\right\rfloor = \left\lfloor\frac{42949673\cdot n}{2^{32}}\right\rfloor,\]

and the remaining $2$ digits can be computed as

\[\left(\left\lfloor\frac{n}{10^{0}}\right\rfloor \ \operatorname{mod}\ 10^{2}\right)= \left\lfloor\frac{10^{2}(y\ \operatorname{mod}\ 2^{32})}{2^{32}}\right\rfloor.\]
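To make the two formulas above concrete, here is a small sketch that checks them by brute force. The function name `check_3_and_4_digits` is just something I made up for this illustration, and the loop covers $4$-digit inputs as well, since the same $k=2$, $D=32$ and magic number apply there.

```cpp
#include <cstdint>

// Returns true if, for every n in [100, 9999], the multiply-and-shift steps
// reproduce floor(n / 100) and n mod 100.
bool check_3_and_4_digits() {
  for (std::uint32_t n = 100; n <= 9999; ++n) {
    // y = n * ceil(2^32 / 10^2); the upper 32 bits hold the leading digit(s).
    std::uint64_t y = n * std::uint64_t(42949673);
    if (std::uint32_t(y >> 32) != n / 100) return false;
    // Keep the lower 32 bits (y mod 2^32), multiply by 100,
    // and take the upper 32 bits again to get the last two digits.
    y = std::uint32_t(y) * std::uint64_t(100);
    if (std::uint32_t(y >> 32) != n % 100) return false;
  }
  return true;
}
```

Note that the cast to `std::uint32_t` plays the role of $y\ \operatorname{mod}\ 2^{32}$, which is why no explicit mask is needed when $D=32$.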

Similarly, we can choose $D=32$ (with the above $y$ with different $k$’s) for $n$’s up to $6$ digits (so $k=0,2,4$), but for larger $n$ our simplistic analysis does not allow us to do so. For $n$’s with $7$ or $8$ digits, we set $k=6$, and it can be shown that $D=47$ does the job. For $n$’s with $9$ or $10$ digits, we set $k=8$, and as we have already seen $D=57$ does the job. With these choices of parameters, we get the following code:

//      /\____________
//     /  \______     \______
//    /\   \     \     \     \
//   0  1  /\    /\    /\    /\
//        2  3  4  5  6  7  8  9
char* itoa_var_length(std::uint32_t n, char* buffer) {
  if (n < 100) {
    if (n < 10) {
      // 1 digit.
      buffer[0] = char('0' + n);
      return buffer + 1;
    }
    else {
      // 2 digits.
      std::memcpy(buffer, radix_100_table + n * 2, 2);
      return buffer + 2;
    }
  }
  else if (n < 100'0000) {
    if (n < 1'0000) {
      // 3 or 4 digits.
      // 42949673 = ceil(2^32 / 10^2)
      auto y = n * std::uint64_t(42949673);
      if (n < 1'000) {
        // 3 digits.
        buffer[0] = char('0' + int(y >> 32));
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 1, radix_100_table + int(y >> 32) * 2, 2);
        return buffer + 3;
      }
      else {
        // 4 digits.
        std::memcpy(buffer + 0, radix_100_table + int(y >> 32) * 2, 2);
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 2, radix_100_table + int(y >> 32) * 2, 2);
        return buffer + 4;
      }
    }
    else {
      // 5 or 6 digits.
      // 429497 = ceil(2^32 / 10^4)
      auto y = n * std::uint64_t(429497);
      if (n < 10'0000) {
        // 5 digits.
        buffer[0] = char('0' + int(y >> 32));
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 1, radix_100_table + int(y >> 32) * 2, 2);
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 3, radix_100_table + int(y >> 32) * 2, 2);
        return buffer + 5;
      }
      else {
        // 6 digits.
        std::memcpy(buffer + 0, radix_100_table + int(y >> 32) * 2, 2);
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 2, radix_100_table + int(y >> 32) * 2, 2);
        y = std::uint32_t(y) * std::uint64_t(100);
        std::memcpy(buffer + 4, radix_100_table + int(y >> 32) * 2, 2);
        return buffer + 6;
      }
    }
  }
  else if (n < 1'0000'0000) {
    // 7 or 8 digits.
    // 140737489 = ceil(2^47 / 10^6)
    auto y = n * std::uint64_t(140737489);
    constexpr auto mask = (std::uint64_t(1) << 47) - 1;
    if (n < 1000'0000) {
      // 7 digits.
      buffer[0] = char('0' + int(y >> 47));
      y = (y & mask) * 100;
      std::memcpy(buffer + 1, radix_100_table + int(y >> 47) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 3, radix_100_table + int(y >> 47) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 5, radix_100_table + int(y >> 47) * 2, 2);
      return buffer + 7;
    }
    else {
      // 8 digits.
      std::memcpy(buffer + 0, radix_100_table + int(y >> 47) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 2, radix_100_table + int(y >> 47) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 4, radix_100_table + int(y >> 47) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 6, radix_100_table + int(y >> 47) * 2, 2);
      return buffer + 8;
    }
  }
  else {
    // 9 or 10 digits.
    // 1441151881 = ceil(2^57 / 10^8)
    constexpr auto mask = (std::uint64_t(1) << 57) - 1;
    auto y = n * std::uint64_t(1441151881);
    if (n < 10'0000'0000) {
      // 9 digits.
      buffer[0] = char('0' + int(y >> 57));
      y = (y & mask) * 100;
      std::memcpy(buffer + 1, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 3, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 5, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 7, radix_100_table + int(y >> 57) * 2, 2);
      return buffer + 9;

    }
    else {
      // 10 digits.
      std::memcpy(buffer + 0, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 2, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 4, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 6, radix_100_table + int(y >> 57) * 2, 2);
      y = (y & mask) * 100;
      std::memcpy(buffer + 8, radix_100_table + int(y >> 57) * 2, 2);
      return buffer + 10;
    }
  }
}

(Demo: https://godbolt.org/z/froGhEn3s)

Note: The paths for $(2k-1)$-digits case and $2k$-digits case share a lot of code, so one might try to merge the code for printing $(2k-2)$ remaining digits while leaving in separate branches only the code for printing the leading $1$ or $2$ digits. However, it seems that such a refactoring causes the code to perform worse, probably because the number of additions performed is increased. Nevertheless, that is also one viable option, especially regarding the code size.

Better choices for $y$

The above code is pretty good for $n$’s up to $6$ digits, but not so much for longer $n$’s, as we have to perform masking in addition to multiplication and shift for each $2$ digits. This is because $D$ is not equal to $32$, so it would be beneficial if we could choose $D=32$ even for $n$’s with more than $6$ digits. And it turns out that we can.

The reason we had to choose $D>32$ was due to our poor choice of $y$:

\[y = n\left\lceil\frac{2^{D}}{10^{k}}\right\rceil.\]

Recall that we do not need to choose $y$ like this; all we need to ensure is that $y$ satisfies the inequality

\[\tag{$*$} \frac{2^{D}n}{10^{k}} \leq y < \frac{2^{D}(n+1)}{10^{k}}.\]

Slightly generalizing what we have done, let us suppose that we want to obtain $y$ by computing

\[y = \left\lfloor\frac{nm}{2^{L}}\right\rfloor\]

for some positive integer constants $m$ and $L$. Then the inequality $(*)$ can be rewritten as

\[\tag{$**$} \frac{1}{n}\left\lceil\frac{2^{D}n}{10^{k}}\right\rceil \leq \frac{m}{2^{L}} < \frac{1}{n}\left\lceil\frac{2^{D}(n+1)}{10^{k}}\right\rceil.\]

At the time of writing this post, I am not quite sure if there is an elegant way to obtain the precise admissible range of $m$ and $L$ for the above inequality with any given range of $n$, but a reasonable guess is that

\[m = \left\lceil\frac{2^{D+L}}{10^{k}}\right\rceil + 1\]

will often do the job. Indeed, in this case we have

\[\frac{mn}{2^{L}} \geq \frac{2^{D}n}{10^{k}} + \frac{n}{2^{L}},\]

so the left-hand side of $(**)$ is always satisfied if

\[2^{L} \leq n\]

holds for all $n$ in the range. On the other hand, we have

\[\frac{mn}{2^{L}} < \frac{2^{D}n}{10^{k}} + \frac{n}{2^{L-1}},\]

so the right-hand side of $(**)$ is always satisfied if

\[\frac{n}{2^{L-1}}\leq \frac{2^{D}}{10^{k}},\]

or equivalently,

\[\frac{10^{k}n}{2^{D-1}} \leq 2^{L}\]

holds for all $n$ in the range.

Thus for example, when $n$ is of $7$ or $8$ digits (so $n\in[10^{6}, 10^{8}-1]$), $k=6$, and $D=32$, it is enough to have

\[\frac{10^{6}(10^{8} - 1)}{2^{31}} \leq 2^{L} \leq 10^{6},\]

which is equivalent to

\[16 \leq L \leq 19.\]

Hence, we take $L = 16$ and accordingly $m = 281474978$. Of course, since $m$ is even we can equivalently take $L = 15$ and $m = 140737489$ as well, so this method of analysis is clearly far from being tight.
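The two sufficient conditions are easy to evaluate mechanically. The following is only a sketch of that simplistic analysis (not of the exact admissible range), and the names `admissible_extra_shifts` and `guessed_multiplier` are hypothetical helpers introduced here.

```cpp
#include <cstdint>
#include <utility>

// Range {L_min, L_max} of L allowed by the two sufficient conditions
//   10^k * n_max / 2^(D-1) <= 2^L <= n_min,
// or {-1, -1} if no L works. Assumes D >= 1, only tries L with D + L <= 62,
// and assumes 10^k * n_max fits in 64 bits.
std::pair<int, int> admissible_extra_shifts(std::uint64_t pow10_k, int D,
                                            std::uint64_t n_min,
                                            std::uint64_t n_max) {
  int lo = -1, hi = -1;
  for (int L = 0; D + L <= 62; ++L) {
    const bool lower_ok = pow10_k * n_max <= (std::uint64_t(1) << (L + D - 1));
    const bool upper_ok = (std::uint64_t(1) << L) <= n_min;
    if (lower_ok && upper_ok) {
      if (lo < 0) lo = L;
      hi = L;
    }
  }
  return {lo, hi};
}

// m = ceil(2^(D+L) / 10^k) + 1, the guess used in the text (needs D + L <= 62).
std::uint64_t guessed_multiplier(std::uint64_t pow10_k, int D, int L) {
  const std::uint64_t pow2 = std::uint64_t(1) << (D + L);
  return (pow2 + pow10_k - 1) / pow10_k + 1;
}
```

For $7$ or $8$ digits, `admissible_extra_shifts(1000000, 32, 1000000, 99999999)` should return the pair $(16, 19)$, and `guessed_multiplier(1000000, 32, 16)` should return $281474978$.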

(In fact, it can be exhaustively verified that the left-hand side of $(**)$ is maximized when $n=1000795$, while the right-hand side is minimized when $n=10^{8}-1$, which together yield the inequality

\[\frac{4298381796}{1000795} \leq \frac{m}{2^{L}} < \frac{429496729600}{99999999}.\]

The minimum $L$ allowing an integer solution for $m$ to the above inequality is $L=15$ and in this case $m = 140737489$ is the unique solution.)

When $n$ is of $9$ or $10$ digits, a similar analysis does not give the best result; for $10$ digits in particular, it fails to give any admissible choice of $L$ and $m$. Nevertheless, it can be exhaustively verified that, when $k=8$ and $D=32$, if we set $L = 25$ and

\[m = \left\lceil\frac{2^{D+L}}{10^{k}}\right\rceil + 1 = 1441151882,\]

then

\[n = \left\lfloor \frac{10^{k}\lfloor nm/2^{L} \rfloor}{2^{D}} \right\rfloor\]

holds for all $n\in [10^{8}, 10^{9}-1]$, and similarly, if we set $L = 25$ and

\[m = \left\lceil\frac{2^{D+L}}{10^{k}}\right\rceil = 1441151881,\]

then

\[n = \left\lfloor \frac{10^{k}\lfloor nm/2^{L} \rfloor}{2^{D}} \right\rfloor\]

holds for all $n\in [10^{9}, 2^{32}-1]$, so we can still take

\[y = \left\lfloor \frac{nm}{2^{L}} \right\rfloor\]

with the above $L$ and $m$.
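An exhaustive check over billions of inputs takes a while, so here is a cheaper spot check as a sketch. It is of course not a proof, and `regenerate` and `spot_check` are names I made up for it. Instead of evaluating $\left\lfloor 10^{k}y/2^{D}\right\rfloor$ directly (which would need more than $64$ bits for the largest inputs), it regenerates $n$ from $y$ two digits at a time, exactly as the printing code does, and compares the result with $n$.

```cpp
#include <cstdint>

// Rebuilds n from y = floor(n * m / 2^L): the upper 32 bits are the leading
// digit(s), and each further step yields the next two decimal digits.
// Assumes m < 2^31 so that n * m fits in 64 bits.
std::uint64_t regenerate(std::uint32_t n, std::uint64_t m, int L, int pairs) {
  std::uint64_t prod = (n * m) >> L;
  std::uint64_t value = prod >> 32;
  for (int i = 0; i < pairs; ++i) {
    prod = std::uint32_t(prod) * std::uint64_t(100);
    value = value * 100 + (prod >> 32);
  }
  return value;
}

// Tests every `stride`-th n in [first, last] together with `last` itself,
// always extracting four 2-digit groups after the leading part (k = 8).
bool spot_check(std::uint32_t first, std::uint32_t last, std::uint64_t m,
                int L, std::uint32_t stride) {
  for (std::uint64_t n = first; n <= last; n += stride) {
    if (regenerate(std::uint32_t(n), m, L, 4) != n) return false;
  }
  return regenerate(last, m, L, 4) == last;
}
```

For instance, `spot_check(100000000, 999999999, 1441151882, 25, 9973)` covers the $9$-digit range and `spot_check(1000000000, 0xFFFFFFFF, 1441151881, 25, 9973)` covers the $10$-digit range.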

(In fact, applying what we have done for $7$ or $8$ digits to the case of $9$ digits gives $L=26$ and $m=2882303763$. However, $1441151882$ is a better magic number than $2882303763$, because the former fits in $31$ bits while the latter needs $32$ bits. This seemingly trivial difference actually matters quite a bit on platforms like x86, because when computing $y=\left\lfloor\frac{nm}{2^{L}}\right\rfloor$, we want to leverage the fast imul instruction, but imul sign-extends the input immediate constant when performing $64$-bit multiplication. Hence, if the magic number needs $32$ bits, the multiplication cannot be done in a single instruction, and the magic number must first be loaded into a register and zero-extended.)

Therefore, we are indeed able to always choose $D=32$, which results in the following code:

char* itoa_better_y(std::uint32_t n, char* buffer) {
  std::uint64_t prod;

  auto get_next_two_digits = [&]() {
    prod = std::uint32_t(prod) * std::uint64_t(100);
    return int(prod >> 32);
  };
  auto print_1 = [&](int digit) {
    buffer[0] = char(digit + '0');
    buffer += 1;
  };
  auto print_2 = [&] (int two_digits) {
    std::memcpy(buffer, radix_100_table + two_digits * 2, 2);
    buffer += 2;
  };
  auto print = [&](std::uint64_t magic_number, int extra_shift, auto remaining_count) {
    prod = n * magic_number;
    prod >>= extra_shift;
    auto two_digits = int(prod >> 32);

    if (two_digits < 10) {
      print_1(two_digits);
      for (int i = 0; i < remaining_count; ++i) {
        print_2(get_next_two_digits());
      }
    }
    else {
      print_2(two_digits);
      for (int i = 0; i < remaining_count; ++i) {
        print_2(get_next_two_digits());
      }
    }
  };

  if (n < 100) {
    if (n < 10) {
      // 1 digit.
      print_1(n);
    }
    else {
      // 2 digits.
      print_2(n);
    }
  }
  else {
    if (n < 100'0000) {
      if (n < 1'0000) {
        // 3 or 4 digits.
        // 42949673 = ceil(2^32 / 10^2)
        print(42949673, 0, std::integral_constant<int, 1>{});
      }
      else {
        // 5 or 6 digits.
        // 429497 = ceil(2^32 / 10^4)
        print(429497, 0, std::integral_constant<int, 2>{});
      }
    }
    else {
      if (n < 1'0000'0000) {
        // 7 or 8 digits.
        // 281474978 = ceil(2^48 / 10^6) + 1
        print(281474978, 16, std::integral_constant<int, 3>{});
      }
      else {
        if (n < 10'0000'0000) {
          // 9 digits.
          // 1441151882 = ceil(2^57 / 10^8) + 1
          prod = n * std::uint64_t(1441151882);
          prod >>= 25;
          print_1(int(prod >> 32));
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
        }
        else {
          // 10 digits.
          // 1441151881 = ceil(2^57 / 10^8)
          prod = n * std::uint64_t(1441151881);
          prod >>= 25;
          print_2(int(prod >> 32));
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
          print_2(get_next_two_digits());
        }
      }
    }
  }
  return buffer;
}

(Demo: https://godbolt.org/z/7TaqYa9h1)

Note: Looking at a port of James Anhalt’s original algorithm, it seems that the above code is probably a little bit better than the original implementation (which can be confirmed in the benchmark below) because the original algorithm performs an addition after the first multiplication and shift, for digit length longer than some value. With our choice of magic numbers, that is not necessary.

Benchmark

Alright, now let’s compare the performance of these implementations.

Link: https://quick-bench.com/q/ieIpsBhWC751YUhyVS2OE83dTO4

itoa_var_length_naive is a straightforward variation of itoa_var_length doing the naive quotient/remainder computation instead of playing with $y$. Well, compared to itoa_var_length_naive, the performance benefit of itoa_better_y seems not very impressive to be honest. Nevertheless, I still think the idea behind the algorithm is pretty intriguing.

Back to fixed-length case

So far we have only looked at the case of $32$-bit unsigned integers. For $64$-bit integers, what people typically do is to first divide the input number into several segments of $32$-bit integers, and then print them using methods for $32$-bit integers. In this case, typically only the first segment is of variable length and the remaining segments are of fixed length. As we can see in the benchmark above, when the length is known we can do a lot better. For simplicity of the following discussion, let us suppose that the input integer is at most of $9$ digits and we want to always print $9$ digits with possible leading zeros.

While what we have done in itoa_always_10_digits is not so bad, we can certainly do better by choosing $D=32$ which eliminates the need for performing masking at each generation of $2$ digits. Recall that all we need to do is to find an integer $y$ satisfying

\[\tag{$*$} \frac{2^{D}n}{10^{k}} \leq y < \frac{2^{D}(n+1)}{10^{k}}\]

for given $n$. Since we want to always print $9$ digits, we take $k=8$. What’s different from the previous case is that now $n$ can be any integer in the range $[1,10^{9}-1]$ (ignoring $n=0$ case, but that is not a big deal), in particular it can be very small. In this case, one can show by exhaustively checking all possible $n$’s that the inequality

\[\tag{$**$} \frac{1}{n}\left\lceil\frac{2^{D}n}{10^{k}}\right\rceil \leq \frac{m}{2^{L}} < \frac{1}{n}\left\lceil\frac{2^{D}(n+1)}{10^{k}}\right\rceil\]

does not have a solution, because the maximum value of the left-hand side is bigger than the minimum value of the right-hand side. Therefore, it is not possible to compute $y$ by just performing a multiplication followed by a shift.

Instead, we choose

\[y = \left\lfloor \frac{nm}{2^{L}} \right\rfloor + 1,\]

which means that we indeed omit masking at each generation of $2$ digits, but at the cost of additionally performing an increment for the initial step of computing $y$.

With this choice of $y$, $(*)$ becomes

\[\tag{$**'$} \frac{1}{n}\left\lceil\frac{2^{D}n}{10^{k}}\right\rceil - \frac{1}{n} \leq \frac{m}{2^{L}} < \frac{1}{n}\left\lceil\frac{2^{D}(n+1)}{10^{k}}\right\rceil - \frac{1}{n}\]

instead of $(**)$. Then we can perform a similar analysis to conclude that $L = 25$ with

\[m = \left\lceil \frac{2^{D+L}}{10^{k}} \right\rceil = 1441151881\]

do the job. In fact, an exhaustive check shows that we can even take $L = 24$ and $m = 720575941$.

EDIT: Recall that in the above, the range of $n$ is restricted to $[0,10^{9}-1]$. It turns out that $L = 24$ with $m = 720575941$ works only up to $n = 1133989877$, and starts to produce errors for $n \geq 1133989878$.
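Put together, the fixed-length case could look like the following sketch. The function name `to_9_digits` is made up, it returns a `std::string` rather than writing into a buffer, and it splits each pair of digits with `/ 10` and `% 10` instead of a lookup table, purely to keep the example self-contained.

```cpp
#include <cstdint>
#include <string>

// Sketch: always produces 9 decimal digits of n (with leading zeros),
// assuming n < 10^9.
std::string to_9_digits(std::uint32_t n) {
  // y = floor(n * m / 2^L) + 1 with m = 720575941 = ceil(2^56 / 10^8), L = 24.
  std::uint64_t prod = ((n * std::uint64_t(720575941)) >> 24) + 1;
  std::string out;
  // The upper 32 bits are the single leading digit.
  out.push_back(char('0' + int(prod >> 32)));
  for (int i = 0; i < 4; ++i) {
    // With D = 32 the cast keeps the lower 32 bits, so no mask is needed.
    prod = std::uint32_t(prod) * std::uint64_t(100);
    const int two_digits = int(prod >> 32);
    out.push_back(char('0' + two_digits / 10));
    out.push_back(char('0' + two_digits % 10));
  }
  return out;
}
```

For example, `to_9_digits(1000)` should give `"000001000"`. The only extra cost compared to a plain multiply-and-shift is the single increment in the first line.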

I derived a better, but still incomplete analysis here. After a while, I eventually obtained a complete algorithm for computing the optimal bounds, and wrote a program that does the analysis. The algorithm itself is not published anywhere at this moment.

Concluding remarks

I applied a minor variation of the algorithm explained here to my Dragonbox implementation to speed up digit generation, and the result was quite satisfactory. I think the complete branching for all possible lengths of the input is probably not a brilliant idea for generic applications, but the idea of coming up with $y$ that enables generating each pair of $2$ digits by only performing a multiply-and-shift is very clever and useful. I expect that this same trick might be applicable to other problems as well, for example fixed-precision floating-point formatting.

Continued fractions and their application into fast computation of $\lfloor nx\rfloor$

2021-01-03T00:00:00-08:00

When I was working on Dragonbox and Grisu-Exact (which are float-to-string conversion algorithms with some nice properties) I had to come up with a fast method for computing things like $\lfloor n\log_{10}2 \rfloor$ or $\lfloor n\log_{2}10 \rfloor$, or more generally $\lfloor nx\rfloor$ for some integer $n$ and a fixed positive real number $x$. Actually, the sign of $x$ isn’t extremely important, but let us just assume $x>0$ for simplicity.

At that time I just took a fairly straightforward approach, which is to compute the multiplication of $n$ with a truncated binary expansion of $x$. That is, select an appropriate nonnegative integer $k$ and then compute $\lfloor \frac{nm}{2^{k}}\rfloor$ where $m$ is $2^{k}$ multiplied by the binary expansion of $x$ truncated at the $k$-th fractional digit, which is a nonnegative integer. Hence, the whole computation can be done with one integer multiplication followed by one arithmetic shift, which is indeed quite fast.

Here is an example. Consider $x=\log_{10}2$. According to Wolfram Alpha, its hexadecimal expansion is 0x0.4d104d427de7fbcc47c4acd605be48.... If we choose, for example, $k=20$, then $m=\mathtt{0x4d104}=315652$. Hence, we try to compute $\lfloor nx \rfloor$ by computing

\[\left\lfloor \frac{315652 \cdot n}{2^{20}}\right\rfloor\]

instead, or in C code:

  (315652 * n) >> 20

With a manual verification, it can be seen that this computation is valid for all $\vert n\vert\leq 1650$. However, when $n=1651$, the above formula gives $496$ while the correct value is $497$, and when $n=-1651$, the above gives $-497$ while the correct value is $-498$. Obviously, one would expect that the range of valid inputs will increase if we use a larger value for $k$ (thus using more digits of $x$). That is certainly true in general, but it will at the same time make the computation more prone to overflow.
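The "manual verification" can be sketched as a simple brute-force loop. This is not the code I used back then; `first_mismatch` is a name made up for this illustration, and the reference value is computed with `double`, which is an assumption that is only safe for small ranges like the one here, where $n\log_{10}2$ stays much farther from any integer than the rounding error of `double`.

```cpp
#include <cmath>
#include <cstdint>

// Smallest n in [1, limit] for which the shortcut (m * n) >> k disagrees with
// floor(n * log10(2)), either for n or for -n; returns 0 if there is none.
std::int32_t first_mismatch(std::int64_t m, int k, std::int32_t limit) {
  const double x = std::log10(2.0);
  for (std::int32_t n = 1; n <= limit; ++n) {
    // Right-shifting a negative value is an arithmetic shift on mainstream
    // compilers, and is guaranteed to be one since C++20.
    const std::int64_t shortcut_pos = (m * n) >> k;
    const std::int64_t shortcut_neg = (m * -n) >> k;
    const auto exact_pos = std::int64_t(std::floor(n * x));
    const auto exact_neg = std::int64_t(std::floor(-n * x));
    if (shortcut_pos != exact_pos || shortcut_neg != exact_neg) return n;
  }
  return 0;
}
```

With $m=315652$ and $k=20$, `first_mismatch(315652, 20, 3000)` should return $1651$, in agreement with the range $\vert n\vert\leq 1650$ above.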

Note that in order to make this computation truly fast, we want to keep things inside a machine word size. For example, we may want the result of multiplication $nm$ to fit inside 32-bits. Actually, even on 64-bit platforms, 32-bit arithmetic is often faster, so we still want things to be in 32-bits. For example, on platforms like x86_64, (if I recall correctly) there is no instruction for integer multiplication of a register and a 64-bit immediate constant, so if the constant $m$ is of 64-bits, we need to first load it into a register. If $m$ fits inside 32-bits, we don’t need to do that because there is a multiplication instruction for 32-bit constants, and this results in faster code.

(Please don’t say “you need a benchmark for such a claim”. I mean, yeah, that’s true, but firstly, once upon a time I witnessed a noticeable performance difference between these two, so even though I threw that code into the trash bin a long time ago I still believe in this claim to some extent, and secondly and more importantly, the difference of 32-bit vs 64-bit is not the point of this post.)

Hence, if for example we choose $k=30$, then $m=323228496$, so we must have $\vert n\vert\leq \lfloor (2^{31}-1)/m \rfloor = 6$ in order to ensure $nm$ to fit in 32-bits.

(In fact, at least on x86_64, it seems that there is no difference in performance of multiplying a 64-bit register and a 32-bit constant to produce a 64-bit number, versus multiplying a 32-bit register and a 32-bit constant to produce a 32-bit number, so we might only want to ensure $m$ to be in 32-bits and $mn$ to be in 64-bits, but let us just assume that we want $nm$ to be also in 32-bits for simplicity. What matters here is that we want to keep things inside a fixed number of bits.)

As a result, these competing requirements will admit a sweet spot that gives the maximum range of $n$. In this specific case, the sweet spot turns out to be $k=21$ and $m=631305$, which allows $n$ up to $\vert n \vert\leq 2620$.

So far, everything is nice: we obtained a nice formula for computing $\lfloor nx \rfloor$ and we have found a range of $n$ that makes the formula valid. Except that we have zero theoretical estimate on how the size of the range of $n$ is related to $k$ and $x$. I just claimed that the range can be verified manually, which is indeed what I did when working on Dragonbox and Grisu-Exact.

To be precise, I did have some estimate at the time of developing Grisu-Exact, which I briefly explained here (it has little to do with this post so readers don’t need to look at it really), but this estimate is too rough and gives a poor bound on the range of valid $n$’s.

Quite recently, I realized that the concept of so-called continued fractions is a very useful tool for this kind of stuff. In a nutshell, the reason why this concept is so useful is that it gives us a way to enumerate all the best rational approximations of any given real number, and from there we can derive a very detailed analysis of the mentioned method of computing $\lfloor nx \rfloor$.

Continued fractions

A continued fraction is a way of writing a number $x$ in a form

\[x = a_{0} + \dfrac{b_{1}}{a_{1} + \dfrac{b_{2}}{a_{2} + \dfrac{b_{3}}{a_{3} + \cdots}}},\]

either as a finite or infinite sequence. We will only consider the case of simple continued fractions, which means that all of $b_{1},b_{2},\ \cdots\ $ are 1. From now on, by a continued fraction we will always refer to a simple continued fraction. Also, $a_{0}$ is assumed to be an integer while all other $a_{i}$’s are assumed to be positive integers. We denote the continued fraction with the coefficients $a_{0},a_{1},a_{2},\ \cdots\ $ as $[a_{0};a_{1},a_{2},\ \cdots\ ]$.

With these conventions, there is a fairly simple algorithm for obtaining a continued fraction expansion of $x$. First, let $a_{0}:=\lfloor x \rfloor$. Then $x-a_{0}$ is always in the interval $[0,1)$. If it is $0$ that means $x$ is an integer so stop there. Otherwise, compute the reciprocal $x_{1}:= \frac{1}{x-a_{0}}\in(1,\infty)$. Then let $a_{1}:=\lfloor x_{1} \rfloor$, then again $x_{1}-a_{1}$ lies in $[0,1)$. If it is zero, stop, and if not, compute the reciprocal $x_{2}:=\frac{1}{x_{1}-a_{1}}$ and continue.

Here is an example from Wikipedia; consider $x=\frac{415}{93}$. As $415=4\cdot 93 + 43$, we obtain $a_{0}=4$ and $x_{1}=\frac{93}{43}$. Then from $93=2\cdot43+7$ we get $a_{1}=2$ and $x_{2}=\frac{43}{7}$, and similarly $a_{2}=6$, $x_{3}=7$. Since $x_{3}$ is an integer, we get $a_{3}=7$ and the process terminates. As a result, we obtain the continued fraction expansion

\[\frac{415}{93} = 4 + \dfrac{1}{2+\dfrac{1}{6+\dfrac{1}{7}}}.\]

It can be easily shown that this procedure terminates if and only if $x$ is a rational number. In fact, when $x=\frac{p}{q}$ is a rational number (whenever we say $\frac{p}{q}$ is a rational number, we implicitly assume that it is in its reduced form; that is, $q$ is a positive integer and $p$ is an integer coprime to $q$), then the coefficients $a_{i}$’s are nothing but those appearing in the Euclidean algorithm for computing the GCD of $p$ and $q$ (which is a priori assumed to be $1$ of course).
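For rational inputs, the procedure can be sketched in a few lines. The function name `continued_fraction` is just a name chosen for this illustration, and the input does not actually need to be in reduced form for the loop to work.

```cpp
#include <cstdint>
#include <vector>

// Coefficients [a0; a1, ..., ai] of p/q (with q > 0), obtained by running
// the Euclidean algorithm on p and q.
std::vector<std::int64_t> continued_fraction(std::int64_t p, std::int64_t q) {
  std::vector<std::int64_t> coefficients;
  while (q != 0) {
    // Use floor division so that a0 = floor(p/q) also holds for negative p.
    std::int64_t a = p / q;
    std::int64_t r = p % q;
    if (r < 0) {
      r += q;
      a -= 1;
    }
    coefficients.push_back(a);
    // Taking the reciprocal of the fractional part r/q gives q/r.
    p = q;
    q = r;
  }
  return coefficients;
}
```

For example, `continued_fraction(415, 93)` should produce the coefficients $4, 2, 6, 7$ from the example above.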

When $x$ is rational and $x=[a_{0};a_{1},\ \cdots\ ,a_{i}]$ is the continued fraction expansion of $x$ obtained from the above algorithm, then either $x$ is an integer (so $i=0$) or $a_{i}>1$. Then $[a_{0};a_{1},\ \cdots\ ,a_{i-1},a_{i}-1,1]$ is another continued fraction expansion of $x$. Then it can be shown that these two are the only continued fraction expansions for $x$. For example, we have the following alternative continued fraction expansion

\[\frac{415}{93} = 4 + \dfrac{1}{2+\dfrac{1}{6+\dfrac{1}{6+\dfrac{1}{1}}}}\]

of $\frac{415}{93}$, and these two expansions are the only continued fraction expansions of $\frac{415}{93}$.

When $x$ is an irrational number, we can run the same algorithm to get an infinite sequence, and the obtained continued fraction expansion of $x$ is the unique one. Here is a list of examples from Wikipedia:

$\sqrt{19} = [4;2,1,3,1,2,8,2,1,3,1,2,8,\ \cdots\ ]$
$e = [2;1,2,1,1,4,1,1,6,1,1,8,\ \cdots\ ]$
$\pi = [3;7,15,1,292,1,1,1,2,1,3,1,\ \cdots\ ]$

Also, according to Wolfram Alpha, the continued fraction expansion of $\log_{10}2$ is

\[\log_{10}2 = [0; 3, 3, 9, 2, 2, 4, 6, 2, 1, 1, 3, 1, 18,\ \cdots\ ].\]

Given a continued fraction $[a_{0};a_{1},\ \cdots\ ]$, the rational number $[a_{0};a_{1},\ \cdots\ ,a_{i}]$ obtained by truncating the sequence at the $i$ th position is called the $i$ th convergent of the continued fraction. As it is a rational number we can write it uniquely as

\[[a_{0};a_{1},\ \cdots\ ,a_{i}] = \frac{p_{i}}{q_{i}}\]

for a positive integer $q_{i}$ and an integer $p_{i}$ coprime to $q_{i}$. Then one can derive a recurrence relation

\[\begin{cases} p_{i} = p_{i-2} + a_{i}p_{i-1}, &\\ q_{i} = q_{i-2} + a_{i}q_{i-1} \end{cases}\]

with the initial conditions $(p_{-2},q_{-2}) = (0,1)$ and $(p_{-1},q_{-1})=(1,0)$.

For example, for $\log_{10}2 = [0; 3, 3, 9, 2, 2, 4, 6, 2, 1, 1, 3, 1, 18,\ \cdots\ ]$, using the recurrence relation we can compute several initial convergents:

\[\frac{p_{0}}{q_{0}}=\frac{0}{1},\quad \frac{p_{1}}{q_{1}}=\frac{1+3\cdot 0}{0+3\cdot 1}=\frac{1}{3},\quad \frac{p_{2}}{q_{2}}=\frac{0+3\cdot 1}{1+3\cdot 3}=\frac{3}{10}, \\ \frac{p_{3}}{q_{3}}=\frac{1+9\cdot 3}{3+9\cdot 10}=\frac{28}{93},\quad \frac{p_{4}}{q_{4}}=\frac{3+2\cdot 28}{10+2\cdot 93}=\frac{59}{196}, \\ \frac{p_{5}}{q_{5}}=\frac{28+2\cdot 59}{93+2\cdot 196}=\frac{146}{485},\quad \frac{p_{6}}{q_{6}}=\frac{59+4\cdot 146}{196+4\cdot 485}=\frac{643}{2136}, \\ \frac{p_{7}}{q_{7}}=\frac{146+6\cdot 643}{485+6\cdot 2136}=\frac{4004}{13301},\quad \frac{p_{8}}{q_{8}}=\frac{643+2\cdot 4004}{2136+2\cdot 13301}=\frac{8651}{28738},\]

and so on.
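The recurrence translates directly into code. Here is a minimal sketch; the function name `convergents` and the representation of $\frac{p_{i}}{q_{i}}$ as a `std::pair` are choices made for this illustration only.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Convergents p_i / q_i of [a0; a1, a2, ...], computed with the recurrence
//   p_i = p_{i-2} + a_i * p_{i-1},   q_i = q_{i-2} + a_i * q_{i-1}.
std::vector<std::pair<std::int64_t, std::int64_t>>
convergents(const std::vector<std::int64_t>& a) {
  std::int64_t p_prev2 = 0, q_prev2 = 1;  // (p_{-2}, q_{-2})
  std::int64_t p_prev1 = 1, q_prev1 = 0;  // (p_{-1}, q_{-1})
  std::vector<std::pair<std::int64_t, std::int64_t>> result;
  for (const std::int64_t a_i : a) {
    const std::int64_t p = p_prev2 + a_i * p_prev1;
    const std::int64_t q = q_prev2 + a_i * q_prev1;
    result.push_back({p, q});
    p_prev2 = p_prev1;
    q_prev2 = q_prev1;
    p_prev1 = p;
    q_prev1 = q;
  }
  return result;
}
```

Feeding it the coefficients $0, 3, 3, 9, 2, 2, 4, 6, 2$ should reproduce the nine convergents listed above, ending with $\frac{8651}{28738}$.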

Note that the sequence of convergents above is converging to $\log_{10}2$ rapidly. Indeed, the error $\left\vert\frac{p_{2}}{q_{2}} - x\right\vert$ of the second convergent is about $1.03\times 10^{-3}$, and $\left\vert\frac{p_{4}}{q_{4}} - x\right\vert$ is about $9.59\times 10^{-6}$. For the sixth convergent, the error $\left\vert\frac{p_{6}}{q_{6}} - x\right\vert$ is about $3.31\times 10^{-8}$. This is way better than other types of rational approximations, let’s say by truncated decimal expansions of $\log_{10}2$, because in that case the denominator must grow approximately as large as $10^{8}$ in order to achieve the error of order $10^{-8}$, but that order of error was achievable by $\frac{p_{6}}{q_{6}}$ whose denominator is only $2136$.

Best rational approximations

A cool thing about continued fractions is that convergents are in fact among the best possible rational approximations in the following sense. Given a real number $x$, a rational number $\frac{p}{q}$ is called a best rational approximation of $x$, if for every rational number $\frac{a}{b}$ with $b\leq q$, we always have

\[\left\vert x - \frac{p}{q}\right\vert \leq \left\vert x - \frac{a}{b}\right\vert.\]

So, this means that there is no better approximation of $x$ by rational numbers of denominator no more than $q$.

In fact, whenever $\frac{p}{q}$ is a best rational approximation of $x$ with $q>1$, the equality in the inequality above is only achieved when $\frac{a}{b}=\frac{p}{q}$. Note that $q=1$ corresponds to the uninteresting case of approximating $x$ with integers, and obviously in this case there is the unique best approximation of $x$ for all real numbers $x$ except only when $x=n+\frac{1}{2}$ for some integer $n$, and for those exceptional cases there are precisely two best approximations, namely $n$ and $n+1$.

As hinted, given a continued fraction expansion of $x$, any convergent $\frac{p_{i}}{q_{i}}$ is a best rational approximation of $x$, except possibly for $i=0$ (which is also a best rational approximation if and only if $a_{1}>1$). It must be noted, however, that not every best rational approximation of $x$ is obtained as a convergent of its continued fraction expansion.

There is a nice description of all best rational approximations of $x$ in terms of convergents, but in this post that is kind of irrelevant. Rather, what matters for us is a method of enumerating all best rational approximations from below and from above. (In fact, the aforementioned description of all best rational approximations can be derived from there.)

Formally, we say a rational number $\frac{p}{q}$ is a best rational approximation from below of $x$ if $\frac{p}{q}\leq x$ and for any rational number $\frac{a}{b}$ with $\frac{a}{b}\leq x$ and $b\leq q$, we have

\[\frac{a}{b}\leq \frac{p}{q} \leq x.\]

Similarly, we say a rational number $\frac{p}{q}$ is a best rational approximation from above of $x$ if $\frac{p}{q}\geq x$ and for any rational number $\frac{a}{b}$ with $\frac{a}{b}\geq x$ and $b\leq q$, we have

\[\frac{a}{b}\geq \frac{p}{q} \geq x.\]

To describe what those best rational approximations from below/above look like, let $x=[a_{0};a_{1},\ \cdots\ ]$ be a continued fraction expansion of a real number $x$ and $\left(\frac{p_{i}}{q_{i}}\right)_{i}$ be the corresponding sequence of convergents. When $x$ is a non-integer rational number so that the expansion terminates after a nonzero number of steps, we always assume that the expansion is of the second form, that is, the last coefficient is assumed to be $1$.

It can be shown that the sequence $\left(\frac{p_{2i}}{q_{2i}}\right)_{i}$ of even convergents strictly increases to $x$ from below, while the sequence $\left(\frac{p_{2i+1}}{q_{2i+1}}\right)_{i}$ of odd convergents strictly decreases to $x$ from above. In other words, the convergence of convergents happens in a “zig-zag” manner, alternating between below and above of $x$.

As the approximation errors of the even convergents decrease to zero, any sufficiently good rational approximation $\frac{p}{q}\leq x$ from below must lie in between some $\frac{p_{2i}}{q_{2i}}$ and $\frac{p_{2i+2}}{q_{2i+2}}$. Similarly, any sufficiently good rational approximation $\frac{p}{q}\geq x$ from above must lie in between some $\frac{p_{2i+1}}{q_{2i+1}}$ and $\frac{p_{2i-1}}{q_{2i-1}}$.

Based on these facts, one can show the following: a rational number is a best rational approximation from below of $x$ if and only if it can be written as

\[\frac{p_{2i} + sp_{2i+1}}{q_{2i} + sq_{2i+1}}\]

for some integer $0\leq s\leq a_{2i+2}$, and it is a best approximation from above of $x$ if and only if it can be written as

\[\frac{p_{2i-1} + sp_{2i}}{q_{2i-1} + sq_{2i}}\]

for some integer $0\leq s\leq a_{2i+1}$.

In general, a rational number of the form

\[\frac{p_{i-1} + sp_{i}}{q_{i-1} + sq_{i}}\]

for an integer $0\leq s\leq a_{i+1}$ is called a semiconvergent. So in other words, semiconvergents are precisely the best rational approximations from below/above of $x$.

It can be shown that the sequence

\[\left(\frac{p_{2i} + sp_{2i+1}}{q_{2i} + sq_{2i+1}}\right)_{s=0}^{a_{2i+2}}\]

strictly monotonically increases from $\frac{p_{2i}}{q_{2i}}$ to $\frac{p_{2i+2}}{q_{2i+2}}$, and similarly the sequence

\[\left(\frac{p_{2i-1} + sp_{2i}}{q_{2i-1} + sq_{2i}}\right)_{s=0}^{a_{2i+1}}\]

strictly monotonically decreases from $\frac{p_{2i-1}}{q_{2i-1}}$ to $\frac{p_{2i+1}}{q_{2i+1}}$. Therefore, as $s$ increases, we get successively better approximations of $x$.

For the case of $x=\log_{10}2$, here are the lists of all best rational approximations from below:

\[\mathbf{0} < \frac{1}{4}< \frac{2}{7} < \mathbf{\frac{3}{10}} < \frac{31}{103} < \mathbf{\frac{59}{196}} < \frac{205}{681} < \frac{351}{1166} < \frac{497}{1651} < \mathbf{\frac{643}{2136}} <\ \cdots\ <\log_{10}2,\]

and from above:

\[1 > \frac{1}{2} > \mathbf{\frac{1}{3}} > \frac{4}{13} > \frac{7}{23} > \frac{10}{33} > \frac{13}{43} > \frac{16}{53} > \frac{19}{63} > \frac{22}{73} > \frac{25}{83} > \mathbf{\frac{28}{93}} >\ \cdots\ >\log_{10}2\]

with convergents highlighted in bold.
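The two lists above can be reproduced by brute force, straight from the definitions. The following Python sketch is only an illustration and not code from this post. It uses the fact that $\lfloor q\log_{10}2 \rfloor$ is one less than the number of decimal digits of $2^{q}$, so no floating-point arithmetic is involved. The function names are made up for this example.

```python
from fractions import Fraction

def floor_q_log10_2(q):
    """Exact floor(q * log10(2)) for q >= 0: decimal digit count of 2**q, minus one."""
    return len(str(2 ** q)) - 1

def best_from_below(q_max):
    """Best rational approximations of log10(2) from below with denominator <= q_max."""
    records = []
    for q in range(1, q_max + 1):
        candidate = Fraction(floor_q_log10_2(q), q)   # largest p/q not exceeding x
        if not records or candidate > records[-1]:    # strictly better than all before
            records.append(candidate)
    return records

def best_from_above(q_max):
    """Best rational approximations of log10(2) from above with denominator <= q_max."""
    records = []
    for q in range(1, q_max + 1):
        # log10(2) is irrational, so ceil(q*x) = floor(q*x) + 1 for every q >= 1.
        candidate = Fraction(floor_q_log10_2(q) + 1, q)
        if not records or candidate < records[-1]:
            records.append(candidate)
    return records

# Denominators of the approximations found, to compare with the lists in the text.
print([f.denominator for f in best_from_below(2200)])
print([f.denominator for f in best_from_above(100)])
```

Each new record is automatically in lowest terms, because an unreduced fraction can only tie with an earlier record and never beat it.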

Clearly, if $\frac{p}{q}$ is a best rational approximation of $x$ from below, then we must have $p=\lfloor qx \rfloor$. Indeed, if $p$ is strictly less than $\lfloor qx \rfloor$, then $\frac{p+1}{q}$ must be a strictly better approximation of $x$ which is still below $x$, and if $p$ is strictly greater than $\lfloor qx \rfloor$, then $\frac{p}{q}$ must be strictly greater than $x$.

Similarly, if $\frac{p}{q}$ is a best rational approximation of $x$ from above, then we must have $p=\lceil qx \rceil$.

Application into computation of $\lfloor nx \rfloor$

Alright, now let’s get back to where we started: when do we have the equality

\[\tag{$*$} \left\lfloor \frac{nm}{2^{k}} \right\rfloor = \lfloor nx \rfloor\]

for all $\vert n\vert \leq n_{\max}$? First, note that this equality can be equivalently written as the inequality

\[\lfloor nx \rfloor \leq \frac{nm}{2^{k}} < \lfloor nx \rfloor + 1.\]

The case $n>0$

We will first consider the case $n>0$. In this case, the inequality above can be rewritten as

\[\frac{\lfloor nx \rfloor}{n} \leq \frac{m}{2^{k}} < \frac{\lfloor nx \rfloor + 1}{n}.\]

Thus, $(*)$ holds for all $n=1,\ \cdots\ ,n_{\max}$ if and only if

\[\tag{$**$} \max_{n=1,\ \cdots\ ,n_{\max}}\frac{\lfloor nx \rfloor}{n} \leq \frac{m}{2^{k}} <\min_{n=1,\ \cdots\ ,n_{\max}}\frac{\lfloor nx \rfloor + 1}{n}.\]

Now, note that $\frac{\lfloor nx \rfloor}{n}$ and $\frac{\lfloor nx \rfloor + 1}{n}$ are rational approximations of $x$, where the former is smaller than or equal to $x$ and the latter is strictly greater than $x$.

Therefore, the left-hand side of $(**)$ is nothing but $\frac{p_{*}}{q_{*}}$, where $\frac{p_{*}}{q_{*}}$ is the best rational approximation from below of $x$ with the largest $q_{*}\leq n_{\max}$.

Similarly, the right-hand side of $(**)$ is nothing but $\frac{p^{*}}{q^{*}}$, where $\frac{p^{*}}{q^{*}}$ is the best rational approximation from above of $x$ with the largest $q^{*}\leq n_{\max}$, except when $x=\frac{p^{*}}{q^{*}}$ (which means that $x$ is rational and its denominator is at most $n_{\max}$). In that case the situation is a bit messier, and it is analyzed as follows.

Note that the case $x=\frac{p^{*}}{q^{*}}$ is the case considered in the classical paper by Granlund-Montgomery (Theorem 4.2), but here we will do a sharper analysis that gives us a condition that is not only sufficient but also necessary to have $(*)$ for all $n=1,\ \cdots\ ,n_{\max}$. The case we are considering here is when $x$ is rational and its denominator is at most $n_{\max}$, which means $\frac{p^{*}}{q^{*}}=\frac{p_{*}}{q_{*}}=x$, so let us drop those stars from our notation and just write $x=\frac{p}{q}$ for brevity. Then we want to find the minimizer of

\[\frac{\left\lfloor np/q \right\rfloor + 1}{n}\]

for $n=1,\ \cdots\ ,n_{\max}$. Let $r$ be the remainder of $np$ divided by $q$, then the above can be written as

\[\frac{(np-r)/q + 1}{n} = \frac{p}{q} + \frac{q - r}{qn},\]

so the task is to find the minimizer of $\frac{q-r}{n}$. One can expect that the minimum will be achieved when $r=q-1$, so let $u$ be the largest integer such that $u\leq n_{\max}$ and $up\equiv -1\ (\operatorname{mod}\,q)$. We claim that $u$ is an optimizer of $\frac{q-r}{n}$. Indeed, by the definition of $u$, we must have $n_{\max} < u + q$. Note that for any $n$,

\[\frac{q-r}{n} = \frac{1}{u}\cdot \frac{(q-r)u}{n},\]

so it suffices to show that $n$ must be at most $(q-r)u$, so suppose $n>(q-r)u$ on the contrary. Since $up\equiv -1\ (\operatorname{mod}\,q)$, we have $(q-r)up\equiv r\equiv np\ (\operatorname{mod}\,q)$. As we have $np>(q-r)up$, we must have

\[np = (q-r)up + eq\]

for some $e\geq 1$. However, since $np$ and $(q-r)up$ are both multiples of $p$, $eq$ must be a multiple of $p$ as well, but since $p$ and $q$ are coprime, it follows that $e$ is a multiple of $p$. Therefore,

\[n = (q-r)u + \frac{eq}{p} \geq (q-r)u + q \geq u + q,\]

but this contradicts $n_{\max} < u + q$, thus proving the claim.

As a result, we obtain

\[\min_{n=1,\ \cdots\ ,n_{\max}}\frac{\lfloor nx \rfloor + 1}{n} = x + \frac{1}{qu}\]

where $u$ is the largest integer such that $u\leq n_{\max}$ and $up\equiv -1\ (\operatorname{mod}\,q)$.

In summary, we get the following conclusions:

If $x$ is irrational or rational with denominator strictly greater than $n_{\max}$, then $(*)$ holds for all $n=1,\ \cdots\ ,n_{\max}$ if and only if
\[\frac{2^{k}p_{*}}{q_{*}} \leq m < \frac{2^{k}p^{*}}{q^{*}}.\]
If $x=\frac{p}{q}$ is rational with $q\leq n_{\max}$, then $(*)$ holds for all $n=1,\ \cdots\ ,n_{\max}$ if and only if
\[\frac{2^{k}p}{q} \leq m < \frac{2^{k}p}{q} + \frac{2^{k}}{qu}\]
where $u$ is the largest integer such that $u\leq n_{\max}$ and $up\equiv -1\ (\operatorname{mod}\,q)$.
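As a sanity check of the rational case, the closed form $x+\frac{1}{qu}$ can be compared against a direct minimization over $n$. The Python sketch below does this with exact rational arithmetic. The helper names are hypothetical, and the linear search for $u$ is chosen for clarity rather than speed.

```python
from fractions import Fraction
from math import gcd

def largest_u(p, q, n_max):
    """Largest u <= n_max with u*p ≡ -1 (mod q); assumes gcd(p, q) == 1 and q <= n_max."""
    if gcd(p, q) != 1 or q > n_max:
        raise ValueError("this sketch assumes gcd(p, q) == 1 and q <= n_max")
    for u in range(n_max, 0, -1):
        if (u * p + 1) % q == 0:
            return u
    raise ValueError("unreachable under the stated assumptions")

def min_upper_bound_brute_force(p, q, n_max):
    """min over n = 1..n_max of (floor(n*p/q) + 1) / n, computed directly."""
    return min(Fraction(n * p // q + 1, n) for n in range(1, n_max + 1))

def min_upper_bound_closed_form(p, q, n_max):
    """The same minimum according to the formula x + 1/(q*u)."""
    return Fraction(p, q) + Fraction(1, q * largest_u(p, q, n_max))

# Example with x = 3/7 and n_max = 20: both lines should print the same fraction.
print(min_upper_bound_brute_force(3, 7, 20))
print(min_upper_bound_closed_form(3, 7, 20))
```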

The case $n<0$

Next, let us consider the case $n<0$. In this case, $(*)$ is equivalent to

\[\frac{\lceil \vert n\vert x \rceil - 1}{\vert n\vert} = \frac{\lfloor nx \rfloor + 1}{n} < \frac{m}{2^{k}} \leq \frac{\lfloor nx \rfloor}{n} = \frac{\lceil \vert n\vert x \rceil}{\vert n\vert}.\]

Similarly to the case $n>0$, the minimum of the right-hand side is precisely $\frac{p^{*}}{q^{*}}$, where again $\frac{p^{*}}{q^{*}}$ is the best rational approximation from above of $x$ with the largest $q^{*}\leq n_{\max}$.

The maximum of the left-hand side is, as one can expect, a bit more involved. It is precisely $\frac{p_{*}}{q_{*}}$ (with the same definition as above) except when $x=\frac{p_{*}}{q_{*}}$, in which case we need some more analysis.

As in the case $n>0$, assume $x=\frac{p}{q}$ with $q\leq n_{\max}$, and let us find the maximizer of

\[\frac{\left\lceil np/q \right\rceil - 1}{n}\]

for $n=1,\ \cdots\ ,n_{\max}$. This time, let $r$ be the unique integer such that $0\leq r\leq q-1$ and

\[np = \left\lceil \frac{np}{q} \right\rceil q - r,\]

then we can rewrite our objective function as

\[\frac{(np+r)/q - 1}{n} = \frac{p}{q} - \frac{q-r}{qn}.\]

Therefore, a maximizer of the above is a minimizer of $\frac{q-r}{qn}$. This time $r=q-1$ corresponds to $np\equiv 1\ (\operatorname{mod}\,q)$, so by the same argument as in the previous case (with $-1$ replaced by $1$), a minimizer is the largest integer $u$ such that $u\leq n_{\max}$ and $up\equiv 1\ (\operatorname{mod}\,q)$.

Hence, we get the following conclusions:

If $x$ is irrational or rational with denominator strictly greater than $n_{\max}$, then $(*)$ holds for all $n=-1,\ \cdots\ ,-n_{\max}$ if and only if
\[\frac{2^{k}p_{*}}{q_{*}} < m \leq \frac{2^{k}p^{*}}{q^{*}}.\]
If $x=\frac{p}{q}$ is rational with $q\leq n_{\max}$, then $(*)$ holds for all $n=-1,\ \cdots\ ,-n_{\max}$ if and only if
\[\frac{2^{k}p}{q} - \frac{2^{k}}{qu} < m \leq \frac{2^{k}p}{q}\]
where $u$ is the largest integer such that $u\leq n_{\max}$ and $up\equiv 1\ (\operatorname{mod}\,q)$.

Conclusion and an example

Putting the two cases together, we get the following conclusion: if $x$ is irrational or rational with denominator strictly greater than $n_{\max}$, then $(*)$ holds for all $\vert n\vert \leq n_{\max}$ if and only if

\[\tag{$\square$} \frac{2^{k}p_{*}}{q_{*}} < m < \frac{2^{k}p^{*}}{q^{*}}.\]

For the case $x=\frac{p}{q}$ with $q\leq n_{\max}$, we cannot really conclude anything useful if we consider positive $n$’s and negative $n$’s altogether, because the inequality we get is

\[\frac{2^{k}p}{q} \leq m \leq \frac{2^{k}p}{q},\]

which admits a solution if and only if $\frac{2^{k}p}{q}$ is an integer, which can happen only when $q$ is a power of $2$. But for that case the problem is already trivial.

The numbers $x$ in the main motivating examples are irrational, so $(\square)$ is indeed a useful conclusion. Now, let us apply it to the case $x=\log_{10}2$ to see what we can get out of it. First, we need to see how $\frac{p_{*}}{q_{*}}$ and $\frac{p^{*}}{q^{*}}$ are determined for a given $n_{\max}$. This can be done in a systematic way, as shown below:

Find the last convergent $\frac{p_{i}}{q_{i}}$ whose denominator $q_{i}$ is at most $n_{\max}$.
If $i$ is odd, we conclude
\[\frac{p^{*}}{q^{*}} = \frac{p_{i}}{q_{i}} \quad\textrm{and}\quad \frac{p_{*}}{q_{*}} = \frac{p_{i-1} + sp_{i}}{q_{i-1} + sq_{i}}\]
where $s$ is the largest integer with $q_{i-1}+sq_{i}\leq n_{\max}$.
If $i$ is even, we conclude
\[\frac{p_{*}}{q_{*}} = \frac{p_{i}}{q_{i}} \quad\textrm{and}\quad \frac{p^{*}}{q^{*}} = \frac{p_{i-1} + sp_{i}}{q_{i-1} + sq_{i}}\]
where $s$ is the largest integer with $q_{i-1}+sq_{i}\leq n_{\max}$.
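The procedure above translates almost line by line into code. The Python sketch below is a hypothetical helper, not code from any of the libraries mentioned in this post. It takes the leading continued fraction coefficients as input, and the coefficient list for $\log_{10}2$ is an assumption of the sketch, chosen to be consistent with the convergents quoted in this post. The formal convergent $\frac{p_{-1}}{q_{-1}}=\frac{1}{0}$ is used to start the recurrence.

```python
from fractions import Fraction

# Leading coefficients [a0; a1, a2, ...] of log10(2). This list is an assumption
# of the sketch; it reproduces the convergents 1/3, 3/10, 28/93, 59/196, ...
LOG10_2_CF = [0, 3, 3, 9, 2, 2, 4, 6]

def one_sided_bounds(cf, n_max):
    """Return (p_*/q_*, p^*/q^*) for the number whose expansion starts with cf."""
    p_prev, q_prev = 1, 0          # formal convergent p_{-1}/q_{-1} = 1/0
    p, q = cf[0], 1                # p_0/q_0
    i = 0
    for a in cf[1:]:
        p_next, q_next = a * p + p_prev, a * q + q_prev
        if q_next > n_max:
            break                  # p/q is the last convergent with q <= n_max
        p_prev, q_prev, p, q = p, q, p_next, q_next
        i += 1
    else:
        raise ValueError("need more continued fraction coefficients for this n_max")
    s = (n_max - q_prev) // q      # largest s with q_{i-1} + s*q_i <= n_max
    convergent = Fraction(p, q)
    semiconvergent = Fraction(p_prev + s * p, q_prev + s * q)
    # Odd convergents lie above x, even convergents lie below x.
    if i % 2 == 1:
        return semiconvergent, convergent
    return convergent, semiconvergent

below, above = one_sided_bounds(LOG10_2_CF, 1000)
print(below, above)
```

With only eight coefficients supplied, this sketch can handle $n_{\max}$ up to $13300$ for $\log_{10}2$, and it raises an error beyond that rather than guessing.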

So, for example, if $n_{\max}=1000$, then $i=5$ with $\frac{p_{i}}{q_{i}}=\frac{146}{485}$ and $\frac{p_{i-1}}{q_{i-1}}=\frac{59}{196}$ so the maximum $s$ with $q_{i-1}+sq_{i}\leq n_{\max}$ is $s=1$, concluding

\[\frac{p^{*}}{q^{*}} = \frac{146}{485} \quad\textrm{and}\quad \frac{p_{*}}{q_{*}} = \frac{59 + 1\cdot 146}{196 + 1\cdot 485} = \frac{205}{681}.\]

Then, the minimum $k$ that allows the inequality

\[\tag{$\square$} \frac{2^{k}p_{*}}{q_{*}} < m < \frac{2^{k}p^{*}}{q^{*}}\]

to have a solution is $k=18$, and in that case $m=78913$ is the unique solution. (One can verify that $78913$ is precisely $\lfloor 2^{18}\log_{10}2 \rfloor$.)
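Finding the minimum $k$ is a short search once the two bounds are known. A minimal Python sketch, assuming the bounds are given as exact fractions (the function name and the cap `k_max` are made up for this example):

```python
from fractions import Fraction
from math import floor

def smallest_k_and_m(lower, upper, k_max=128):
    """Smallest k, with the smallest m for it, such that 2**k * lower < m < 2**k * upper."""
    for k in range(k_max + 1):
        m = floor(lower * 2 ** k) + 1   # smallest integer strictly above 2**k * lower
        if m < upper * 2 ** k:
            return k, m
    return None                          # no solution found up to k_max

# Bounds for n_max = 1000 taken from the text.
print(smallest_k_and_m(Fraction(205, 681), Fraction(146, 485)))
```

For the bounds $\frac{205}{681}$ and $\frac{146}{485}$ this search returns $k=18$ and $m=78913$, matching the values stated above.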

In fact, note that $\frac{p^{*}}{q^{*}}$ and $\frac{p_{*}}{q_{*}}$ stay the same for any $681\leq n_{\max}\leq 1165$; here, $1165$ is one less than $196 + 2\cdot 485=1166$, which is the moment when $\frac{p_{*}}{q_{*}}$ changes into the next semiconvergent

\[\frac{59+2\cdot 146}{196+2\cdot 485}=\frac{351}{1166}.\]

Therefore, with the choice $k=18$ and $m=78913$, we should have

\[\lfloor nx \rfloor = \left\lfloor \frac{78913\cdot n}{2^{18}} \right\rfloor\]

for all $\vert n\vert \leq 1165$. In fact, even with $\frac{p_{*}}{q_{*}}=\frac{351}{1166}$ we still have $(\square)$, so the above should hold until the next moment when $\frac{p_{*}}{q_{*}}$ changes, which is when $n_{\max} = 196 + 3\cdot 485=1651$. If $n_{\max}=1651$, then

\[\frac{p_{*}}{q_{*}} = \frac{59 + 3\cdot 146}{196 + 3\cdot 485} = \frac{497}{1651},\]

and in this case now the left-hand side of $(\square)$ is violated. Thus,

\[\lfloor nx \rfloor = \left\lfloor \frac{78913\cdot n}{2^{18}} \right\rfloor\]

holds precisely up to $\vert n\vert\leq 1650$ and the first counterexample is $n=\pm 1651$.

In a similar manner, one can see that $\frac{p_{*}}{q_{*}}=\frac{497}{1651}$ holds up to $n_{\max} = 196 + 4\cdot 485 - 1 = 2135$, and the minimum $k$ allowing $(\square)$ to have a solution is $k=20$, with $m=315653$ as the unique solution. (One can also verify that $315653$ is precisely $\lceil 2^{20}\log_{10}2 \rceil$, so this time it is not the truncated binary expansion of $\log_{10}2$, rather that plus $1$.) Hence,

\[\lfloor nx \rfloor = \left\lfloor \frac{315653\cdot n}{2^{20}} \right\rfloor\]

must hold at least up to $\vert n\vert\leq 2135$. When $n_{\max}=2136$, $\frac{p_{*}}{q_{*}}$ changes into $\frac{643}{2136}$, but $(\square)$ is still true with the same choice of $k$ and $m$, so the above formula must be valid at least for $\vert n\vert\leq 2620$; here $2621$ is the moment when $\frac{p^{*}}{q^{*}}$ changes from $\frac{146}{485}$ into $\frac{789}{2621}$. If $n_{\max}=2621$ so $\frac{p^{*}}{q^{*}}$ changes into $\frac{789}{2621}$, then the right-hand side of the inequality $(\square)$ is now violated, so $\vert n\vert\leq 2620$ is indeed the optimal range and $n=\pm 2621$ is the first counterexample.
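The ranges claimed above are small enough to check by exhaustive search. The sketch below is an independent check in Python, not part of the derivation. It obtains the exact value of $\lfloor n\log_{10}2 \rfloor$ from the number of decimal digits of $2^{\vert n\vert}$, and the function names are hypothetical.

```python
def floor_n_log10_2(n):
    """Exact floor(n * log10(2)) for any integer n, via the digit count of 2**|n|."""
    if n >= 0:
        return len(str(2 ** n)) - 1
    # log10(2) is irrational, so floor(n*x) = -(floor(|n|*x) + 1) for n < 0.
    return -len(str(2 ** (-n)))

def first_failure(m, k, search_limit=5000):
    """Smallest |n| for which floor(n*m / 2**k) differs from floor(n * log10(2))."""
    for n in range(1, search_limit + 1):
        for signed_n in (n, -n):
            # Python's >> on integers rounds toward negative infinity, like floor.
            if (signed_n * m) >> k != floor_n_log10_2(signed_n):
                return n
    return None

print(first_failure(78913, 18))    # the text predicts 1651
print(first_failure(315653, 20))   # the text predicts 2621
```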

In general, the transition can only occur at the denominators of semiconvergents. With this method, we can figure out the minimum $k$ that allows the computation up to the chosen transition point and which $m$ should be used for that choice of $k$. We want $m$ to be as small as possible, so $m=\left\lfloor\frac{2^{k}p_{*}}{q_{*}} \right\rfloor + 1$ is the best choice (it agrees with $\left\lceil\frac{2^{k}p_{*}}{q_{*}} \right\rceil$ unless $\frac{2^{k}p_{*}}{q_{*}}$ is an integer, since the left inequality in $(\square)$ is strict). This $m$ will probably be either the floor or the ceiling of $2^{k}x$, but we cannot determine which one is better by simply looking at the binary expansion of $x$. This indicates the flaw of the naive method I tried before.

Another application: minmax Euclid algorithm

At the very heart of Ryū, Ryū-printf, Grisu-Exact and Dragonbox (which are all float-to-string conversion algorithms), there is the so-called minmax Euclid algorithm.

The minmax Euclid algorithm was invented by Ulf Adams, the author of Ryū and Ryū-printf mentioned above, and it appears in his paper on Ryū. It is at the core of showing the sufficiency of the number of bits of each item in the precomputed lookup table. What it does is this: given positive integers $a,b$ and $N$, compute the minimum and maximum of $ag\,\operatorname{mod}\,b$ where $g$ is any integer ranging from $1$ to $N$. It sounds simple, but a naive exhaustive search doesn’t scale if $a,b,N$ are very large numbers. Indeed, in its application to float-to-string conversion algorithms, $a,b$ are extremely large integers and $N$ can be as large as $2^{54}$. For that wide a range of $g$, it is simply impossible to run an exhaustive search.

The rough idea of the minmax Euclid algorithm is that the coefficients appearing in the Euclid algorithm for computing the GCD of $a$ and $b$ tell us what the minimum and the maximum are. If you try to write down and see how $ag\,\operatorname{mod}\,b$ varies as $g$ increases, you will find out why this is the case. (A precise mathematical proof can be cumbersome, though.) Formalizing this idea will lead to the minmax Euclid algorithm.

In fact, the exact minmax Euclid algorithm described in the Ryū paper is not entirely correct because it can produce wrong outputs for some inputs. Thus, I had to give it some more thought after I realized the flaw in 2018 when I was trying to apply the algorithm to Grisu-Exact. Eventually, I came up with an improved and corrected version of it with a hopefully correct proof, which I wrote up in the paper on Grisu-Exact (Figure 3 and Theorem 4.3).

But more than 2 years after writing the long-winded and messy proof of the algorithm, I finally realized that in fact the algorithm is just an easy application of continued fractions. Well, I did not know anything about continued fractions back then, unfortunately!

The story is as follows. Obtaining the minimum and the maximum of $ag\,\operatorname{mod}\,b$ is equivalent to obtaining the minimum and the maximum of

\[\frac{ag\,\operatorname{mod}\,b}{b} = \frac{ag}{b} - \left\lfloor\frac{ag}{b}\right\rfloor = g\left(\frac{a}{b} - \frac{\lfloor ag/b \rfloor}{g}\right) = 1 - g\left(\frac{\lfloor ag/b \rfloor + 1}{g} - \frac{a}{b}\right),\]

which reminds us of approximating $\frac{a}{b}$ by $\frac{\lfloor ag/b \rfloor}{g}$ or by $\frac{\lfloor ag/b \rfloor + 1}{g}$, except that what we are minimizing is not the error itself, rather the error multiplied by $g$. Hence, we define the following concepts.

We say a rational number $\frac{p}{q}$ is a best rational approximation from below in the strong sense of $x$ if $\frac{p}{q}\leq x$ and for any rational number $\frac{a}{b}$ with $\frac{a}{b}\leq x$ and $b\leq q$, we have

\[\vert qx - p\vert \leq \vert bx - a\vert.\]

Similarly, we say a rational number $\frac{p}{q}$ is a best rational approximation from above in the strong sense of $x$ if $\frac{p}{q}\geq x$ and for any rational number $\frac{a}{b}$ with $\frac{a}{b}\geq x$ and $b\leq q$, we have

\[\vert qx - p\vert \leq \vert bx - a\vert.\]

Let us call best rational approximations from below/above defined before as “in the weak sense” to distinguish from the new definitions.

The reason why these are called “in the strong sense” is obvious; if we have

\[\vert qx - p\vert \leq \vert bx - a\vert,\]

then dividing both sides by $q$ gives

\[\left\vert x - \frac{p}{q}\right\vert \leq \frac{b}{q}\left\vert x - \frac{a}{b}\right\vert,\]

so with $b\leq q$ the right-hand side is at most $\left\vert x-\frac{a}{b}\right\vert$, so this implies that $\frac{p}{q}$ is a better approximation than $\frac{a}{b}$.

It is a well-known fact that, if we remove the directional conditions $\frac{a}{b}\leq x$ and $\frac{a}{b}\geq x$, thus asking for best rational approximations (in both directions) in the strong sense, then these are precisely the convergents. However, it turns out that with the directional conditions, the best rational approximations from below/above in the weak sense and in the strong sense are just the same. Hence, the best rational approximations from below/above in the strong sense are also precisely the semiconvergents. This fact is probably well-known as well, but I do not know of any reference at this point other than my own proof of it (which I do not write here). I guess this fact on semiconvergents is probably one of the necessary pieces for proving the other fact that convergents are precisely the best rational approximations in the strong sense, but I haven’t checked any proof of it so I do not know.

Anyway, because of this, the problem of finding the minimum and the maximum of $ag\,\operatorname{mod}\,b$ reduces to the problem of finding the semiconvergents that are below and above $\frac{a}{b}$ with the largest denominators among all semiconvergents of denominators at most $N$. This is essentially what the improved minmax Euclid algorithm is doing as described in my paper. I will not discuss further details here.
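To make the reduction concrete, here is a small Python sketch that computes the minimum and the maximum of $ag\,\operatorname{mod}\,b$ from the denominators of the relevant convergent and semiconvergent. It is not the algorithm from the Ryū or Grisu-Exact papers, only an illustration of the idea under two simplifying assumptions: $a$ and $b$ are coprime, and $N<b$ (so that $\frac{a}{b}$ itself is never reached). The function name is hypothetical.

```python
from math import gcd

def minmax_mod(a, b, n):
    """(min, max) of a*g mod b over g = 1..n, assuming gcd(a, b) == 1 and 1 <= n < b."""
    if gcd(a, b) != 1 or not 1 <= n < b:
        raise ValueError("this sketch assumes gcd(a, b) == 1 and 1 <= n < b")
    # q_prev, q are the denominators q_{i-1}, q_i of consecutive convergents of a/b,
    # starting from the formal convergent 1/0 and p_0/q_0 = floor(a/b)/1.
    q_prev, q = 0, 1
    num, den = a, b
    i = 0
    while True:
        num, den = den, num % den       # one step of the Euclid algorithm
        coeff = num // den              # next continued fraction coefficient a_{i+1}
        q_next = coeff * q + q_prev
        if q_next > n:
            break                       # q is the last convergent denominator <= n
        q_prev, q = q, q_next
        i += 1
    s = (n - q_prev) // q               # largest s with q_{i-1} + s*q_i <= n
    q_semi = q_prev + s * q
    # Odd convergents approximate a/b from above, even ones from below.
    q_below, q_above = (q_semi, q) if i % 2 == 1 else (q, q_semi)
    return a * q_below % b, a * q_above % b

print(minmax_mod(7, 10, 5))   # 7g mod 10 for g = 1..5 is 7, 4, 1, 8, 5, so (1, 8)
```

The loop performs the same divisions as the Euclid algorithm for $\gcd(a,b)$, which is why the running time does not depend on how large $N$ is.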

Computation of $\lfloor nx - y \rfloor$

When developing Dragonbox, I also had to come up with a method of computing

\[\left\lfloor n\log_{10}2 - \log_{10}\frac{4}{3} \right\rfloor.\]

So I did the same thing: I approximated both $\log_{10}2$ and $\log_{10}\frac{4}{3}$ by their respective truncated binary expansions, and computed something like

\[\left\lfloor \frac{nm - s}{2^{k}} \right\rfloor,\]

where $m,s$ are both positive integers, and manually found out the range of $n$ making the above formula valid.

More generally, we can think of computing $\lfloor nx - y \rfloor$ for fixed positive real numbers $x,y$. Again, signs of $x,y$ are not terribly important, but let us just assume $x,y>0$ to make our life a bit easier. Also, let us further assume $0

The rough idea is as follows. If we approximate $x$ by a rational number $\frac{p}{q}$ sufficiently well, then each $nx$ must be very close to $\frac{np}{q}$. Note that

\[\lfloor nx - y \rfloor = \begin{cases} \lfloor nx \rfloor & \textrm{if $nx - \lfloor nx \rfloor \geq y$},\\ \lfloor nx \rfloor - 1 & \textrm{if $nx - \lfloor nx \rfloor < y$}, \end{cases}\]

hence, what matters here is whether the fractional part of $nx$ is above or below $y$. Note that the fractional part of $nx$ must be approximately equal to $\frac{np\,\operatorname{mod}\,q}{q}$. Thus, if we find the unique integer $0\leq u \leq q-1$ such that $\frac{u}{q}\leq y < \frac{u+1}{q}$, then we can probably compare the fractional part of $nx$ with $y$ by just comparing $(np\,\operatorname{mod}\,q)$ with $u$. This is indeed the case if $y$ is sufficiently far from both $\frac{u}{q}$ and $\frac{u+1}{q}$, specifically, farther than the error of $\frac{np}{q}$ from $nx$.

So, in some sense the situation is better if $x$ and $y$ are “far apart” in the sense that the denominators showing up in approximating rationals of $x$ and $y$ are very different, and the situation is worse if there are a lot of common divisors between those denominators. Maybe someone better than me at number theory can formalize this into a precise language?

Anyway, with a high probability the distance from $y$ to $\frac{u}{q}$ and $\frac{u+1}{q}$ will be of $O\left(\frac{1}{q}\right)$, but it is well-known that the distance from $x$ to $\frac{p}{q}$ can be made to be of $O\left(\frac{1}{q^{2}}\right)$, which will allow the equality

\[\lfloor nx - y \rfloor = \left\lfloor \frac{np - u}{q} \right\rfloor\]

to hold for $O(q)$-many $n$’s. After getting this equality, the next step is to convert it into

\[\left\lfloor \frac{np - u}{q} \right\rfloor = \left\lfloor \frac{nm - s}{2^{k}} \right\rfloor\]

using a Granlund-Montgomery style analysis we did for $\lfloor nx \rfloor$ with rational $x$.
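The first of these two equalities can be explored numerically for $x=\log_{10}2$ and $y=\log_{10}\frac{4}{3}$. The Python sketch below is only an experiment harness under stated assumptions: it takes $\frac{p}{q}=\frac{643}{2136}$ from this post, uses $u=266$ (verified inside the code with exact integer arithmetic), and restricts to $n\geq 2$ so that $nx-y=\log_{10}\left(3\cdot 2^{n-2}\right)$ can be evaluated exactly by counting decimal digits. All names are made up for this example.

```python
P, Q = 643, 2136   # rational approximation of log10(2) taken from the text
U = 266            # candidate for floor(Q * log10(4/3)); checked right below

# U/Q <= log10(4/3) < (U+1)/Q  is equivalent to  3^Q * 10^U <= 4^Q < 3^Q * 10^(U+1).
assert 3 ** Q * 10 ** U <= 4 ** Q < 3 ** Q * 10 ** (U + 1)

def floor_exact(n):
    """Exact floor(n*log10(2) - log10(4/3)) for n >= 2, i.e. floor(log10(3 * 2**(n-2)))."""
    return len(str(3 * 2 ** (n - 2))) - 1

def first_mismatch(search_limit=3000):
    """Smallest n >= 2 where floor((n*P - U) / Q) differs from the exact value, if any."""
    for n in range(2, search_limit + 1):
        if (n * P - U) // Q != floor_exact(n):
            return n
    return None

print(first_mismatch())
```

The value printed is the first $n\geq 2$ at which the rational formula breaks down within the search limit, or `None` if it never does, which gives a quick way to compare the analytic bound with the actual range.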

Unfortunately, the bound I got by applying this analysis to the case $x=\log_{10}2$ and $y=\log_{10}\frac{4}{3}$ was not that great, particularly because $y$ is too close to $\frac{u+1}{q}$ for the choice $\frac{p}{q}=\frac{643}{2136}$, which is otherwise a very efficient and effective approximation of $x$. Well, I might be overlooking some things at this point, so I will probably have to give this a few more tries later.

(Edit on 02-10-2022) I included a better analysis of this in my new paper on Dragonbox, Section 4.4. But still I consider it incomplete, because it seems that the actual range of $n$ is much wider than what the analysis method described in the paper expects.

Acknowledgements

Thanks to Seung uk Jang for teaching me many things about continued fractions (including useful references) and other discussions relevant to this post.
