natdiv.go

     1  // Copyright 2009 The Go Authors. All rights reserved.
     2  // Use of this source code is governed by a BSD-style
     3  // license that can be found in the LICENSE file.
     4  
     5  /*
     6  
     7  Multi-precision division. Here be dragons.
     8  
     9  Given u and v, where u is n+m digits, and v is n digits (with no leading zeros),
    10  the goal is to return quo, rem such that u = quo*v + rem, where 0 ≤ rem < v.
    11  That is, quo = ⌊u/v⌋ where ⌊x⌋ denotes the floor (truncation to integer) of x,
    12  and rem = u - quo·v.
    13  
    14  
    15  Long Division
    16  
    17  Division in a computer proceeds the same as long division in elementary school,
    18  but computers are not as good as schoolchildren at following vague directions,
    19  so we have to be much more precise about the actual steps and what can happen.
    20  
    21  We work from most to least significant digit of the quotient, doing:
    22  
    23   • Guess a digit q, the number of v to subtract from the current
    24     section of u to zero out the topmost digit.
    25   • Check the guess by multiplying q·v and comparing it against
    26     the current section of u, adjusting the guess as needed.
    27   • Subtract q·v from the current section of u.
    28   • Add q to the corresponding section of the result quo.
    29  
    30  When all digits have been processed, the final remainder is left in u
    31  and returned as rem.
    32  
    33  For example, here is a sketch of dividing 5 digits by 3 digits (n=3, m=2).
    34  
    35  	                 q₂ q₁ q₀
    36  	         _________________
    37  	v₂ v₁ v₀ ) u₄ u₃ u₂ u₁ u₀
    38  	           ↓  ↓  ↓  |  |
    39  	          [u₄ u₃ u₂]|  |
    40  	        - [  q₂·v  ]|  |
    41  	        ----------- ↓  |
    42  	          [  rem  | u₁]|
    43  	        - [    q₁·v   ]|
    44  	           ----------- ↓
    45  	             [  rem  | u₀]
    46  	           - [    q₀·v   ]
    47  	              ------------
    48  	                [  rem   ]
    49  
    50  Instead of creating new storage for the remainders and copying digits from u
    51  as indicated by the arrows, we use u's storage directly as both the source
    52  and destination of the subtractions, so that the remainders overwrite
    53  successive overlapping sections of u as the division proceeds, using a slice
    54  of u to identify the current section. This avoids all the copying as well as
    55  shifting of remainders.
    56  
    57  Division of u with n+m digits by v with n digits (in base B) can in general
    58  produce at most m+1 digits, because:
    59  
    60    • u < B^(n+m)               [B^(n+m) has n+m+1 digits]
    61    • v ≥ B^(n-1)               [B^(n-1) is the smallest n-digit number]
    62    • u/v < B^(n+m) / B^(n-1)   [divide bounds for u, v]
    63    • u/v < B^(m+1)             [simplify]
    64  
    65  The first step is special: it takes the top n digits of u and divides them by
    66  the n digits of v, producing the first quotient digit and an n-digit remainder.
    67  In the example, q₂ = ⌊u₄u₃u₂ / v⌋.
    68  
    69  The first step divides n digits by n digits to ensure that it produces only a
    70  single digit.
    71  
    72  Each subsequent step appends the next digit from u to the remainder and divides
    73  those n+1 digits by the n digits of v, producing another quotient digit and a
    74  new n-digit remainder.
    75  
    76  Subsequent steps divide n+1 digits by n digits, an operation that in general
    77  might produce two digits. However, as used in the algorithm, that division is
    78  guaranteed to produce only a single digit. The dividend is of the form
    79  rem·B + d, where rem is a remainder from the previous step and d is a single
    80  digit, so:
    81  
    82   • rem ≤ v - 1                 [rem is a remainder from dividing by v]
    83   • rem·B ≤ v·B - B             [multiply by B]
    84   • d ≤ B - 1                   [d is a single digit]
    85   • rem·B + d ≤ v·B - 1         [add]
    86   • rem·B + d < v·B             [change ≤ to <]
    87   • (rem·B + d)/v < B           [divide by v]
    88  
    89  
    90  Guess and Check
    91  
    92  At each step we need to divide n+1 digits by n digits, but this is for the
    93  implementation of division by n digits, so we can't just invoke a division
    94  routine: we _are_ the division routine. Instead, we guess at the answer and
    95  then check it using multiplication. If the guess is wrong, we correct it.
    96  
    97  How can this guessing possibly be efficient? It turns out that the following
    98  statement (let's call it the Good Guess Guarantee) is true.
    99  
   100  If
   101  
   102   • q = ⌊u/v⌋ where u is n+1 digits and v is n digits,
   103   • q < B, and
   104   • the topmost digit of v = vₙ₋₁ ≥ B/2,
   105  
   106  then q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ satisfies q ≤ q̂ ≤ q+2. (Proof below.)
   107  
   108  That is, if we know the answer has only a single digit and we guess an answer
   109  by ignoring the bottom n-1 digits of u and v, using a 2-by-1-digit division,
   110  then that guess is at least as large as the correct answer. It is also not
   111  too much larger: it is off by at most two from the correct answer.
   112  
   113  Note that in the first step of the overall division, which is an n-by-n-digit
   114  division, the 2-by-1 guess uses an implicit uₙ = 0.
   115  
   116  Note that using a 2-by-1-digit division here does not mean calling ourselves
   117  recursively. Instead, we use an efficient direct hardware implementation of
   118  that operation.
   119  
   120  Note that because q is u/v rounded down, q·v must not exceed u: u ≥ q·v.
   121  If a guess q̂ is too big, it will not satisfy this test. Viewed a different way,
   122  the remainder r̂ for a given q̂ is u - q̂·v, which must be positive. If it is
   123  negative, then the guess q̂ is too big.
   124  
   125  This gives us a way to compute q. First compute q̂ with 2-by-1-digit division.
   126  Then, while u < q̂·v, decrement q̂; this loop executes at most twice, because
   127  q̂ ≤ q+2.
   128  
   129  
   130  Scaling Inputs
   131  
   132  The Good Guess Guarantee requires that the top digit of v (vₙ₋₁) be at least B/2.
   133  For example in base 10, ⌊172/19⌋ = 9, but ⌊18/1⌋ = 18: the guess is wildly off
   134  because the first digit 1 is smaller than B/2 = 5.
   135  
   136  We can ensure that v has a large top digit by multiplying both u and v by the
   137  right amount. Continuing the example, if we multiply both 172 and 19 by 3, we
   138  now have ⌊516/57⌋, the leading digit of v is now ≥ 5, and sure enough
   139  ⌊51/5⌋ = 10 is much closer to the correct answer 9. It would be easier here
   140  to multiply by 4, because that can be done with a shift. Specifically, we can
   141  always count the number of leading zeros i in the first digit of v and then
   142  shift both u and v left by i bits.
   143  
   144  Having scaled u and v, the value ⌊u/v⌋ is unchanged, but the remainder will
   145  be scaled: 172 mod 19 is 1, but 516 mod 57 is 3. We have to divide the remainder
   146  by the scaling factor (shifting right i bits) when we finish.
   147  
   148  Note that these shifts happen before and after the entire division algorithm,
   149  not at each step in the per-digit iteration.
   150  
   151  Note the effect of scaling inputs on the size of the possible quotient.
   152  In the scaled u/v, u can gain a digit from scaling; v never does, because we
   153  pick the scaling factor to make v's top digit larger but without overflowing.
   154  If u and v have n+m and n digits after scaling, then:
   155  
   156    • u < B^(n+m)               [B^(n+m) has n+m+1 digits]
   157    • v ≥ B^n / 2               [vₙ₋₁ ≥ B/2, so vₙ₋₁·B^(n-1) ≥ B^n/2]
   158    • u/v < B^(n+m) / (B^n / 2) [divide bounds for u, v]
   159    • u/v < 2 B^m               [simplify]
   160  
   161  The quotient can still have m+1 significant digits, but if so the top digit
   162  must be a 1. This provides a different way to handle the first digit of the
   163  result: compare the top n digits of u against v and fill in either a 0 or a 1.
   164  
   165  
   166  Refining Guesses
   167  
   168  Before we check whether u < q̂·v, we can adjust our guess to change it from
   169  q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ into the refined guess ⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋.
   170  Although not mentioned above, the Good Guess Guarantee also promises that this
   171  3-by-2-digit division guess is more precise and at most one away from the real
   172  answer q. The improvement from the 2-by-1 to the 3-by-2 guess can also be done
   173  without n-digit math.
   174  
   175  If we have a guess q̂ = ⌊uₙuₙ₋₁ / vₙ₋₁⌋ and we want to see if it also equal to
   176  ⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋, we can use the same check we would for the full division:
   177  if uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂, then the guess is too large and should be reduced.
   178  
   179  Checking uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂ is the same as uₙuₙ₋₁uₙ₋₂ - q̂·vₙ₋₁vₙ₋₂ < 0,
   180  and
   181  
   182  	uₙuₙ₋₁uₙ₋₂ - q̂·vₙ₋₁vₙ₋₂ = (uₙuₙ₋₁·B + uₙ₋₂) - q̂·(vₙ₋₁·B + vₙ₋₂)
   183  	                          [splitting off the bottom digit]
   184  	                      = (uₙuₙ₋₁ - q̂·vₙ₋₁)·B + uₙ₋₂ - q̂·vₙ₋₂
   185  	                          [regrouping]
   186  
   187  The expression (uₙuₙ₋₁ - q̂·vₙ₋₁) is the remainder of uₙuₙ₋₁ / vₙ₋₁.
   188  If the initial guess returns both q̂ and its remainder r̂, then checking
   189  whether uₙuₙ₋₁uₙ₋₂ < q̂·vₙ₋₁vₙ₋₂ is the same as checking r̂·B + uₙ₋₂ < q̂·vₙ₋₂.
   190  
   191  If we find that r̂·B + uₙ₋₂ < q̂·vₙ₋₂, then we can adjust the guess by
   192  decrementing q̂ and adding vₙ₋₁ to r̂. We repeat until r̂·B + uₙ₋₂ ≥ q̂·vₙ₋₂.
   193  (As before, this fixup is only needed at most twice.)
   194  
   195  Now that q̂ = ⌊uₙuₙ₋₁uₙ₋₂ / vₙ₋₁vₙ₋₂⌋, as mentioned above it is at most one
   196  away from the correct q, and we've avoided doing any n-digit math.
   197  (If we need the new remainder, it can be computed as r̂·B + uₙ₋₂ - q̂·vₙ₋₂.)
   198  
   199  The final check u < q̂·v and the possible fixup must be done at full precision.
   200  For random inputs, a fixup at this step is exceedingly rare: the 3-by-2 guess
   201  is not often wrong at all. But still we must do the check. Note that since the
   202  3-by-2 guess is off by at most 1, it can be convenient to perform the final
   203  u < q̂·v as part of the computation of the remainder r = u - q̂·v. If the
   204  subtraction underflows, decremeting q̂ and adding one v back to r is enough to
   205  arrive at the final q, r.
   206  
   207  That's the entirety of long division: scale the inputs, and then loop over
   208  each output position, guessing, checking, and correcting the next output digit.
   209  
   210  For a 2n-digit number divided by an n-digit number (the worst size-n case for
   211  division complexity), this algorithm uses n+1 iterations, each of which must do
   212  at least the 1-by-n-digit multiplication q̂·v. That's O(n) iterations of
   213  O(n) time each, so O(n²) time overall.
   214  
   215  
   216  Recursive Division
   217  
   218  For very large inputs, it is possible to improve on the O(n²) algorithm.
   219  Let's call a group of n/2 real digits a (very) “wide digit”. We can run the
   220  standard long division algorithm explained above over the wide digits instead of
   221  the actual digits. This will result in many fewer steps, but the math involved in
   222  each step is more work.
   223  
   224  Where basic long division uses a 2-by-1-digit division to guess the initial q̂,
   225  the new algorithm must use a 2-by-1-wide-digit division, which is of course
   226  really an n-by-n/2-digit division. That's OK: if we implement n-digit division
   227  in terms of n/2-digit division, the recursion will terminate when the divisor
   228  becomes small enough to handle with standard long division or even with the
   229  2-by-1 hardware instruction.
   230  
   231  For example, here is a sketch of dividing 10 digits by 4, proceeding with
   232  wide digits corresponding to two regular digits. The first step, still special,
   233  must leave off a (regular) digit, dividing 5 by 4 and producing a 4-digit
   234  remainder less than v. The middle steps divide 6 digits by 4, guaranteed to
   235  produce two output digits each (one wide digit) with 4-digit remainders.
   236  The final step must use what it has: the 4-digit remainder plus one more,
   237  5 digits to divide by 4.
   238  
   239  	                       q₆ q₅ q₄ q₃ q₂ q₁ q₀
   240  	            _______________________________
   241  	v₃ v₂ v₁ v₀ ) u₉ u₈ u₇ u₆ u₅ u₄ u₃ u₂ u₁ u₀
   242  	              ↓  ↓  ↓  ↓  ↓  |  |  |  |  |
   243  	             [u₉ u₈ u₇ u₆ u₅]|  |  |  |  |
   244  	           - [    q₆q₅·v    ]|  |  |  |  |
   245  	           ----------------- ↓  ↓  |  |  |
   246  	                [    rem    |u₄ u₃]|  |  |
   247  	              - [     q₄q₃·v      ]|  |  |
   248  	              -------------------- ↓  ↓  |
   249  	                      [    rem    |u₂ u₁]|
   250  	                    - [     q₂q₁·v      ]|
   251  	                    -------------------- ↓
   252  	                            [    rem    |u₀]
   253  	                          - [     q₀·v     ]
   254  	                          ------------------
   255  	                               [    rem    ]
   256  
   257  An alternative would be to look ahead to how well n/2 divides into n+m and
   258  adjust the first step to use fewer digits as needed, making the first step
   259  more special to make the last step not special at all. For example, using the
   260  same input, we could choose to use only 4 digits in the first step, leaving
   261  a full wide digit for the last step:
   262  
   263  	                       q₆ q₅ q₄ q₃ q₂ q₁ q₀
   264  	            _______________________________
   265  	v₃ v₂ v₁ v₀ ) u₉ u₈ u₇ u₆ u₅ u₄ u₃ u₂ u₁ u₀
   266  	              ↓  ↓  ↓  ↓  |  |  |  |  |  |
   267  	             [u₉ u₈ u₇ u₆]|  |  |  |  |  |
   268  	           - [    q₆·v   ]|  |  |  |  |  |
   269  	           -------------- ↓  ↓  |  |  |  |
   270  	             [    rem    |u₅ u₄]|  |  |  |
   271  	           - [     q₅q₄·v      ]|  |  |  |
   272  	           -------------------- ↓  ↓  |  |
   273  	                   [    rem    |u₃ u₂]|  |
   274  	                 - [     q₃q₂·v      ]|  |
   275  	                 -------------------- ↓  ↓
   276  	                         [    rem    |u₁ u₀]
   277  	                       - [     q₁q₀·v      ]
   278  	                       ---------------------
   279  	                               [    rem    ]
   280  
   281  Today, the code in divRecursiveStep works like the first example. Perhaps in
   282  the future we will make it work like the alternative, to avoid a special case
   283  in the final iteration.
   284  
   285  Either way, each step is a 3-by-2-wide-digit division approximated first by
   286  a 2-by-1-wide-digit division, just as we did for regular digits in long division.
   287  Because the actual answer we want is a 3-by-2-wide-digit division, instead of
   288  multiplying q̂·v directly during the fixup, we can use the quick refinement
   289  from long division (an n/2-by-n/2 multiply) to correct q to its actual value
   290  and also compute the remainder (as mentioned above), and then stop after that,
   291  never doing a full n-by-n multiply.
   292  
   293  Instead of using an n-by-n/2-digit division to produce n/2 digits, we can add
   294  (not discard) one more real digit, doing an (n+1)-by-(n/2+1)-digit division that
   295  produces n/2+1 digits. That single extra digit tightens the Good Guess Guarantee
   296  to q ≤ q̂ ≤ q+1 and lets us drop long division's special treatment of the first
   297  digit. These benefits are discussed more after the Good Guess Guarantee proof
   298  below.
   299  
   300  
   301  How Fast is Recursive Division?
   302  
   303  For a 2n-by-n-digit division, this algorithm runs a 4-by-2 long division over
   304  wide digits, producing two wide digits plus a possible leading regular digit 1,
   305  which can be handled without a recursive call. That is, the algorithm uses two
   306  full iterations, each using an n-by-n/2-digit division and an n/2-by-n/2-digit
   307  multiplication, along with a few n-digit additions and subtractions. The standard
   308  n-by-n-digit multiplication algorithm requires O(n²) time, making the overall
   309  algorithm require time T(n) where
   310  
   311  	T(n) = 2T(n/2) + O(n) + O(n²)
   312  
   313  which, by the Bentley-Haken-Saxe theorem, ends up reducing to T(n) = O(n²).
   314  This is not an improvement over regular long division.
   315  
   316  When the number of digits n becomes large enough, Karatsuba's algorithm for
   317  multiplication can be used instead, which takes O(n^log₂3) = O(n^1.6) time.
   318  (Karatsuba multiplication is implemented in func karatsuba in nat.go.)
   319  That makes the overall recursive division algorithm take O(n^1.6) time as well,
   320  which is an improvement, but again only for large enough numbers.
   321  
   322  It is not critical to make sure that every recursion does only two recursive
   323  calls. While in general the number of recursive calls can change the time
   324  analysis, in this case doing three calls does not change the analysis:
   325  
   326  	T(n) = 3T(n/2) + O(n) + O(n^log₂3)
   327  
   328  ends up being T(n) = O(n^log₂3). Because the Karatsuba multiplication taking
   329  time O(n^log₂3) is itself doing 3 half-sized recursions, doing three for the
   330  division does not hurt the asymptotic performance. Of course, it is likely
   331  still faster in practice to do two.
   332  
   333  
   334  Proof of the Good Guess Guarantee
   335  
   336  Given numbers x, y, let us break them into the quotients and remainders when
   337  divided by some scaling factor S, with the added constraints that the quotient
   338  x/y and the high part of y are both less than some limit T, and that the high
   339  part of y is at least half as big as T.
   340  
   341  	x₁ = ⌊x/S⌋        y₁ = ⌊y/S⌋
   342  	x₀ = x mod S      y₀ = y mod S
   343  
   344  	x  = x₁·S + x₀    0 ≤ x₀ < S    x/y < T
   345  	y  = y₁·S + y₀    0 ≤ y₀ < S    T/2 ≤ y₁ < T
   346  
   347  And consider the two truncated quotients:
   348  
   349  	q = ⌊x/y⌋
   350  	q̂ = ⌊x₁/y₁⌋
   351  
   352  We will prove that q ≤ q̂ ≤ q+2.
   353  
   354  The guarantee makes no real demands on the scaling factor S: it is simply the
   355  magnitude of the digits cut from both x and y to produce x₁ and y₁.
   356  The guarantee makes only limited demands on T: it must be large enough to hold
   357  the quotient x/y, and y₁ must have roughly the same size.
   358  
   359  To apply to the earlier discussion of 2-by-1 guesses in long division,
   360  we would choose:
   361  
   362  	S  = Bⁿ⁻¹
   363  	T  = B
   364  	x  = u
   365  	x₁ = uₙuₙ₋₁
   366  	x₀ = uₙ₋₂...u₀
   367  	y  = v
   368  	y₁ = vₙ₋₁
   369  	y₀ = vₙ₋₂...u₀
   370  
   371  These simpler variables avoid repeating those longer expressions in the proof.
   372  
   373  Note also that, by definition, truncating division ⌊x/y⌋ satisfies
   374  
   375  	x/y - 1 < ⌊x/y⌋ ≤ x/y.
   376  
   377  This fact will be used a few times in the proofs.
   378  
   379  Proof that q ≤ q̂:
   380  
   381  	q̂·y₁ = ⌊x₁/y₁⌋·y₁                      [by definition, q̂ = ⌊x₁/y₁⌋]
   382  	     > (x₁/y₁ - 1)·y₁                  [x₁/y₁ - 1 < ⌊x₁/y₁⌋]
   383  	     = x₁ - y₁                         [distribute y₁]
   384  
   385  	So q̂·y₁ > x₁ - y₁.
   386  	Since q̂·y₁ is an integer, q̂·y₁ ≥ x₁ - y₁ + 1.
   387  
   388  	q̂ - q = q̂ - ⌊x/y⌋                      [by definition, q = ⌊x/y⌋]
   389  	      ≥ q̂ - x/y                        [⌊x/y⌋ < x/y]
   390  	      = (1/y)·(q̂·y - x)                [factor out 1/y]
   391  	      ≥ (1/y)·(q̂·y₁·S - x)             [y = y₁·S + y₀ ≥ y₁·S]
   392  	      ≥ (1/y)·((x₁ - y₁ + 1)·S - x)    [above: q̂·y₁ ≥ x₁ - y₁ + 1]
   393  	      = (1/y)·(x₁·S - y₁·S + S - x)    [distribute S]
   394  	      = (1/y)·(S - x₀ - y₁·S)          [-x = -x₁·S - x₀]
   395  	      > -y₁·S / y                      [x₀ < S, so S - x₀ > 0; drop it]
   396  	      ≥ -1                             [y₁·S ≤ y]
   397  
   398  	So q̂ - q > -1.
   399  	Since q̂ - q is an integer, q̂ - q ≥ 0, or equivalently q ≤ q̂.
   400  
   401  Proof that q̂ ≤ q+2:
   402  
   403  	x₁/y₁ - x/y = x₁·S/y₁·S - x/y          [multiply left term by S/S]
   404  	            ≤ x/y₁·S - x/y             [x₁S ≤ x]
   405  	            = (x/y)·(y/y₁·S - 1)       [factor out x/y]
   406  	            = (x/y)·((y - y₁·S)/y₁·S)  [move -1 into y/y₁·S fraction]
   407  	            = (x/y)·(y₀/y₁·S)          [y - y₁·S = y₀]
   408  	            = (x/y)·(1/y₁)·(y₀/S)      [factor out 1/y₁]
   409  	            < (x/y)·(1/y₁)             [y₀ < S, so y₀/S < 1]
   410  	            ≤ (x/y)·(2/T)              [y₁ ≥ T/2, so 1/y₁ ≤ 2/T]
   411  	            < T·(2/T)                  [x/y < T]
   412  	            = 2                        [T·(2/T) = 2]
   413  
   414  	So x₁/y₁ - x/y < 2.
   415  
   416  	q̂ - q = ⌊x₁/y₁⌋ - q                    [by definition, q̂ = ⌊x₁/y₁⌋]
   417  	      = ⌊x₁/y₁⌋ - ⌊x/y⌋                [by definition, q = ⌊x/y⌋]
   418  	      ≤ x₁/y₁ - ⌊x/y⌋                  [⌊x₁/y₁⌋ ≤ x₁/y₁]
   419  	      < x₁/y₁ - (x/y - 1)              [⌊x/y⌋ > x/y - 1]
   420  	      = (x₁/y₁ - x/y) + 1              [regrouping]
   421  	      < 2 + 1                          [above: x₁/y₁ - x/y < 2]
   422  	      = 3
   423  
   424  	So q̂ - q < 3.
   425  	Since q̂ - q is an integer, q̂ - q ≤ 2.
   426  
   427  Note that when x/y < T/2, the bounds tighten to x₁/y₁ - x/y < 1 and therefore
   428  q̂ - q ≤ 1.
   429  
   430  Note also that in the general case 2n-by-n division where we don't know that
   431  x/y < T, we do know that x/y < 2T, yielding the bound q̂ - q ≤ 4. So we could
   432  remove the special case first step of long division as long as we allow the
   433  first fixup loop to run up to four times. (Using a simple comparison to decide
   434  whether the first digit is 0 or 1 is still more efficient, though.)
   435  
   436  Finally, note that when dividing three leading base-B digits by two (scaled),
   437  we have T = B² and x/y < B = T/B, a much tighter bound than x/y < T.
   438  This in turn yields the much tighter bound x₁/y₁ - x/y < 2/B. This means that
   439  ⌊x₁/y₁⌋ and ⌊x/y⌋ can only differ when x/y is less than 2/B greater than an
   440  integer. For random x and y, the chance of this is 2/B, or, for large B,
   441  approximately zero. This means that after we produce the 3-by-2 guess in the
   442  long division algorithm, the fixup loop essentially never runs.
   443  
   444  In the recursive algorithm, the extra digit in (2·⌊n/2⌋+1)-by-(⌊n/2⌋+1)-digit
   445  division has exactly the same effect: the probability of needing a fixup is the
   446  same 2/B. Even better, we can allow the general case x/y < 2T and the fixup
   447  probability only grows to 4/B, still essentially zero.
   448  
   449  
   450  References
   451  
   452  There are no great references for implementing long division; thus this comment.
   453  Here are some notes about what to expect from the obvious references.
   454  
   455  Knuth Volume 2 (Seminumerical Algorithms) section 4.3.1 is the usual canonical
   456  reference for long division, but that entire series is highly compressed, never
   457  repeating a necessary fact and leaving important insights to the exercises.
   458  For example, no rationale whatsoever is given for the calculation that extends
   459  q̂ from a 2-by-1 to a 3-by-2 guess, nor why it reduces the error bound.
   460  The proof that the calculation even has the desired effect is left to exercises.
   461  The solutions to those exercises provided at the back of the book are entirely
   462  calculations, still with no explanation as to what is going on or how you would
   463  arrive at the idea of doing those exact calculations. Nowhere is it mentioned
   464  that this test extends the 2-by-1 guess into a 3-by-2 guess. The proof of the
   465  Good Guess Guarantee is only for the 2-by-1 guess and argues by contradiction,
   466  making it difficult to understand how modifications like adding another digit
   467  or adjusting the quotient range affects the overall bound.
   468  
   469  All that said, Knuth remains the canonical reference. It is dense but packed
   470  full of information and references, and the proofs are simpler than many other
   471  presentations. The proofs above are reworkings of Knuth's to remove the
   472  arguments by contradiction and add explanations or steps that Knuth omitted.
   473  But beware of errors in older printings. Take the published errata with you.
   474  
   475  Brinch Hansen's “Multiple-length Division Revisited: a Tour of the Minefield”
   476  starts with a blunt critique of Knuth's presentation (among others) and then
   477  presents a more detailed and easier to follow treatment of long division,
   478  including an implementation in Pascal. But the algorithm and implementation
   479  work entirely in terms of 3-by-2 division, which is much less useful on modern
   480  hardware than an algorithm using 2-by-1 division. The proofs are a bit too
   481  focused on digit counting and seem needlessly complex, especially compared to
   482  the ones given above.
   483  
   484  Burnikel and Ziegler's “Fast Recursive Division” introduced the key insight of
   485  implementing division by an n-digit divisor using recursive calls to division
   486  by an n/2-digit divisor, relying on Karatsuba multiplication to yield a
   487  sub-quadratic run time. However, the presentation decisions are made almost
   488  entirely for the purpose of simplifying the run-time analysis, rather than
   489  simplifying the presentation. Instead of a single algorithm that loops over
   490  quotient digits, the paper presents two mutually-recursive algorithms, for
   491  2n-by-n and 3n-by-2n. The paper also does not present any general (n+m)-by-n
   492  algorithm.
   493  
   494  The proofs in the paper are remarkably complex, especially considering that
   495  the algorithm is at its core just long division on wide digits, so that the
   496  usual long division proofs apply essentially unaltered.
   497  */
   498  
   499  package big
   500  
   501  import "math/bits"
   502  
   503  // rem returns r such that r = u%v.
   504  // It uses z as the storage for r.
   505  func (z nat) rem(stk *stack, u, v nat) (r nat) {
   506  	if alias(z, u) {
   507  		z = nil
   508  	}
   509  	defer stk.restore(stk.save())
   510  	q := stk.nat(max(1, len(u)-(len(v)-1)))
   511  	_, r = q.div(stk, z, u, v)
   512  	return r
   513  }
   514  
   515  // div returns q, r such that q = ⌊u/v⌋ and r = u%v = u - q·v.
   516  // It uses z and z2 as the storage for q and r.
   517  // The caller may pass stk == nil to request that div obtain and release one itself.
   518  func (z nat) div(stk *stack, z2, u, v nat) (q, r nat) {
   519  	if len(v) == 0 {
   520  		panic("division by zero")
   521  	}
   522  
   523  	if len(v) == 1 {
   524  		// Short division: long optimized for a single-word divisor.
   525  		// In that case, the 2-by-1 guess is all we need at each step.
   526  		var r2 Word
   527  		q, r2 = z.divW(u, v[0])
   528  		r = z2.setWord(r2)
   529  		return
   530  	}
   531  
   532  	if u.cmp(v) < 0 {
   533  		q = z[:0]
   534  		r = z2.set(u)
   535  		return
   536  	}
   537  
   538  	if stk == nil {
   539  		stk = getStack()
   540  		defer stk.free()
   541  	}
   542  
   543  	q, r = z.divLarge(stk, z2, u, v)
   544  	return
   545  }
   546  
   547  // divW returns q, r such that q = ⌊x/y⌋ and r = x%y = x - q·y.
   548  // It uses z as the storage for q.
   549  // Note that y is a single digit (Word), not a big number.
   550  func (z nat) divW(x nat, y Word) (q nat, r Word) {
   551  	m := len(x)
   552  	switch {
   553  	case y == 0:
   554  		panic("division by zero")
   555  	case y == 1:
   556  		q = z.set(x) // result is x
   557  		return
   558  	case m == 0:
   559  		q = z[:0] // result is 0
   560  		return
   561  	}
   562  	// m > 0
   563  	z = z.make(m)
   564  	r = divWVW(z, 0, x, y)
   565  	q = z.norm()
   566  	return
   567  }
   568  
   569  // modW returns x % d.
   570  func (x nat) modW(d Word) (r Word) {
   571  	// TODO(agl): we don't actually need to store the q value.
   572  	var q nat
   573  	q = q.make(len(x))
   574  	return divWVW(q, 0, x, d)
   575  }
   576  
   577  // divWVW overwrites z with ⌊x/y⌋, returning the remainder r.
   578  // The caller must ensure that len(z) = len(x).
   579  func divWVW(z []Word, xn Word, x []Word, y Word) (r Word) {
   580  	r = xn
   581  	if len(x) == 1 {
   582  		qq, rr := bits.Div(uint(r), uint(x[0]), uint(y))
   583  		z[0] = Word(qq)
   584  		return Word(rr)
   585  	}
   586  	rec := reciprocalWord(y)
   587  	for i := len(z) - 1; i >= 0; i-- {
   588  		z[i], r = divWW(r, x[i], y, rec)
   589  	}
   590  	return r
   591  }
   592  
   593  // div returns q, r such that q = ⌊uIn/vIn⌋ and r = uIn%vIn = uIn - q·vIn.
   594  // It uses z and u as the storage for q and r.
   595  // The caller must ensure that len(vIn) ≥ 2 (use divW otherwise)
   596  // and that len(uIn) ≥ len(vIn) (the answer is 0, uIn otherwise).
   597  func (z nat) divLarge(stk *stack, u, uIn, vIn nat) (q, r nat) {
   598  	n := len(vIn)
   599  	m := len(uIn) - n
   600  
   601  	// Scale the inputs so vIn's top bit is 1 (see “Scaling Inputs” above).
   602  	// vIn is treated as a read-only input (it may be in use by another
   603  	// goroutine), so we must make a copy.
   604  	// uIn is copied to u.
   605  	defer stk.restore(stk.save())
   606  	shift := nlz(vIn[n-1])
   607  	v := stk.nat(n)
   608  	u = u.make(len(uIn) + 1)
   609  	if shift == 0 {
   610  		copy(v, vIn)
   611  		copy(u[:len(uIn)], uIn)
   612  		u[len(uIn)] = 0
   613  	} else {
   614  		lshVU(v, vIn, shift)
   615  		u[len(uIn)] = lshVU(u[:len(uIn)], uIn, shift)
   616  	}
   617  
   618  	// The caller should not pass aliased z and u, since those are
   619  	// the two different outputs, but correct just in case.
   620  	if alias(z, u) {
   621  		z = nil
   622  	}
   623  	q = z.make(m + 1)
   624  
   625  	// Use basic or recursive long division depending on size.
   626  	if n < divRecursiveThreshold {
   627  		q.divBasic(stk, u, v)
   628  	} else {
   629  		q.divRecursive(stk, u, v)
   630  	}
   631  
   632  	q = q.norm()
   633  
   634  	// Undo scaling of remainder.
   635  	if shift != 0 {
   636  		rshVU(u, u, shift)
   637  	}
   638  	r = u.norm()
   639  
   640  	return q, r
   641  }
   642  
   643  // divBasic implements long division as described above.
   644  // It overwrites q with ⌊u/v⌋ and overwrites u with the remainder r.
   645  // q must be large enough to hold ⌊u/v⌋.
   646  func (q nat) divBasic(stk *stack, u, v nat) {
   647  	n := len(v)
   648  	m := len(u) - n
   649  
   650  	defer stk.restore(stk.save())
   651  	qhatv := stk.nat(n + 1)
   652  
   653  	// Set up for divWW below, precomputing reciprocal argument.
   654  	vn1 := v[n-1]
   655  	rec := reciprocalWord(vn1)
   656  
   657  	// Invent a leading 0 for u, for the first iteration.
   658  	// Invariant: ujn == u[j+n] in each iteration.
   659  	ujn := Word(0)
   660  
   661  	// Compute each digit of quotient.
   662  	for j := m; j >= 0; j-- {
   663  		// Compute the 2-by-1 guess q̂.
   664  		qhat := Word(_M)
   665  
   666  		// ujn ≤ vn1, or else q̂ would be more than one digit.
   667  		// For ujn == vn1, we set q̂ to the max digit M above.
   668  		// Otherwise, we compute the 2-by-1 guess.
   669  		if ujn != vn1 {
   670  			var rhat Word
   671  			qhat, rhat = divWW(ujn, u[j+n-1], vn1, rec)
   672  
   673  			// Refine q̂ to a 3-by-2 guess. See “Refining Guesses” above.
   674  			vn2 := v[n-2]
   675  			x1, x2 := mulWW(qhat, vn2)
   676  			ujn2 := u[j+n-2]
   677  			for greaterThan(x1, x2, rhat, ujn2) { // x1x2 > r̂ u[j+n-2]
   678  				qhat--
   679  				prevRhat := rhat
   680  				rhat += vn1
   681  				// If r̂  overflows, then
   682  				// r̂ u[j+n-2]v[n-1] is now definitely > x1 x2.
   683  				if rhat < prevRhat {
   684  					break
   685  				}
   686  				// TODO(rsc): No need for a full mulWW.
   687  				// x2 += vn2; if x2 overflows, x1++
   688  				x1, x2 = mulWW(qhat, vn2)
   689  			}
   690  		}
   691  
   692  		// Compute q̂·v.
   693  		qhatv[n] = mulAddVWW(qhatv[0:n], v, qhat, 0)
   694  		qhl := len(qhatv)
   695  		if j+qhl > len(u) && qhatv[n] == 0 {
   696  			qhl--
   697  		}
   698  
   699  		// Subtract q̂·v from the current section of u.
   700  		// If it underflows, q̂·v > u, which we fix up
   701  		// by decrementing q̂ and adding v back.
   702  		c := subVV(u[j:j+qhl], u[j:j+qhl], qhatv[:qhl])
   703  		if c != 0 {
   704  			c := addVV(u[j:j+n], u[j:j+n], v)
   705  			// If n == qhl, the carry from subVV and the carry from addVV
   706  			// cancel out and don't affect u[j+n].
   707  			if n < qhl {
   708  				u[j+n] += c
   709  			}
   710  			qhat--
   711  		}
   712  
   713  		ujn = u[j+n-1]
   714  
   715  		// Save quotient digit.
   716  		// Caller may know the top digit is zero and not leave room for it.
   717  		if j == m && m == len(q) && qhat == 0 {
   718  			continue
   719  		}
   720  		q[j] = qhat
   721  	}
   722  }
   723  
   724  // greaterThan reports whether the two digit numbers x1 x2 > y1 y2.
   725  // TODO(rsc): In contradiction to most of this file, x1 is the high
   726  // digit and x2 is the low digit. This should be fixed.
   727  func greaterThan(x1, x2, y1, y2 Word) bool {
   728  	return x1 > y1 || x1 == y1 && x2 > y2
   729  }
   730  
   731  // divRecursiveThreshold is the number of divisor digits
   732  // at which point divRecursive is faster than divBasic.
   733  var divRecursiveThreshold = 40 // see calibrate_test.go
   734  
   735  // divRecursive implements recursive division as described above.
   736  // It overwrites z with ⌊u/v⌋ and overwrites u with the remainder r.
   737  // z must be large enough to hold ⌊u/v⌋.
   738  // This function is just for allocating and freeing temporaries
   739  // around divRecursiveStep, the real implementation.
   740  func (z nat) divRecursive(stk *stack, u, v nat) {
   741  	clear(z)
   742  	z.divRecursiveStep(stk, u, v, 0)
   743  }
   744  
   745  // divRecursiveStep is the actual implementation of recursive division.
   746  // It adds ⌊u/v⌋ to z and overwrites u with the remainder r.
   747  // z must be large enough to hold ⌊u/v⌋.
   748  // It uses temps[depth] (allocating if needed) as a temporary live across
   749  // the recursive call. It also uses tmp, but not live across the recursion.
   750  func (z nat) divRecursiveStep(stk *stack, u, v nat, depth int) {
   751  	// u is a subsection of the original and may have leading zeros.
   752  	// TODO(rsc): The v = v.norm() is useless and should be removed.
   753  	// We know (and require) that v's top digit is ≥ B/2.
   754  	u = u.norm()
   755  	v = v.norm()
   756  	if len(u) == 0 {
   757  		clear(z)
   758  		return
   759  	}
   760  
   761  	// Fall back to basic division if the problem is now small enough.
   762  	n := len(v)
   763  	if n < divRecursiveThreshold {
   764  		z.divBasic(stk, u, v)
   765  		return
   766  	}
   767  
   768  	// Nothing to do if u is shorter than v (implies u < v).
   769  	m := len(u) - n
   770  	if m < 0 {
   771  		return
   772  	}
   773  
   774  	// We consider B digits in a row as a single wide digit.
   775  	// (See “Recursive Division” above.)
   776  	//
   777  	// TODO(rsc): rename B to Wide, to avoid confusion with _B,
   778  	// which is something entirely different.
   779  	// TODO(rsc): Look into whether using ⌈n/2⌉ is better than ⌊n/2⌋.
   780  	B := n / 2
   781  
   782  	// Allocate a nat for qhat below.
   783  	defer stk.restore(stk.save())
   784  	qhat0 := stk.nat(B + 1)
   785  
   786  	// Compute each wide digit of the quotient.
   787  	//
   788  	// TODO(rsc): Change the loop to be
   789  	//	for j := (m+B-1)/B*B; j > 0; j -= B {
   790  	// which will make the final step a regular step, letting us
   791  	// delete what amounts to an extra copy of the loop body below.
   792  	j := m
   793  	for j > B {
   794  		// Divide u[j-B:j+n] (3 wide digits) by v (2 wide digits).
   795  		// First make the 2-by-1-wide-digit guess using a recursive call.
   796  		// Then extend the guess to the full 3-by-2 (see “Refining Guesses”).
   797  		//
   798  		// For the 2-by-1-wide-digit guess, instead of doing 2B-by-B-digit,
   799  		// we use a (2B+1)-by-(B+1) digit, which handles the possibility that
   800  		// the result has an extra leading 1 digit as well as guaranteeing
   801  		// that the computed q̂ will be off by at most 1 instead of 2.
   802  
   803  		// s is the number of digits to drop from the 3B- and 2B-digit chunks.
   804  		// We drop B-1 to be left with 2B+1 and B+1.
   805  		s := (B - 1)
   806  
   807  		// uu is the up-to-3B-digit section of u we are working on.
   808  		uu := u[j-B:]
   809  
   810  		// Compute the 2-by-1 guess q̂, leaving r̂ in uu[s:B+n].
   811  		qhat := qhat0
   812  		clear(qhat)
   813  		qhat.divRecursiveStep(stk, uu[s:B+n], v[s:], depth+1)
   814  		qhat = qhat.norm()
   815  
   816  		// Extend to a 3-by-2 quotient and remainder.
   817  		// Because divRecursiveStep overwrote the top part of uu with
   818  		// the remainder r̂, the full uu already contains the equivalent
   819  		// of r̂·B + uₙ₋₂ from the “Refining Guesses” discussion.
   820  		// Subtracting q̂·vₙ₋₂ from it will compute the full-length remainder.
   821  		// If that subtraction underflows, q̂·v > u, which we fix up
   822  		// by decrementing q̂ and adding v back, same as in long division.
   823  
   824  		// TODO(rsc): Instead of subtract and fix-up, this code is computing
   825  		// q̂·vₙ₋₂ and decrementing q̂ until that product is ≤ u.
   826  		// But we can do the subtraction directly, as in the comment above
   827  		// and in long division, because we know that q̂ is wrong by at most one.
   828  		mark := stk.save()
   829  		qhatv := stk.nat(3 * n)
   830  		clear(qhatv)
   831  		qhatv = qhatv.mul(stk, qhat, v[:s])
   832  		for i := 0; i < 2; i++ {
   833  			e := qhatv.cmp(uu.norm())
   834  			if e <= 0 {
   835  				break
   836  			}
   837  			subVW(qhat, qhat, 1)
   838  			c := subVV(qhatv[:s], qhatv[:s], v[:s])
   839  			if len(qhatv) > s {
   840  				subVW(qhatv[s:], qhatv[s:], c)
   841  			}
   842  			addTo(uu[s:], v[s:])
   843  		}
   844  		if qhatv.cmp(uu.norm()) > 0 {
   845  			panic("impossible")
   846  		}
   847  		c := subVV(uu[:len(qhatv)], uu[:len(qhatv)], qhatv)
   848  		if c > 0 {
   849  			subVW(uu[len(qhatv):], uu[len(qhatv):], c)
   850  		}
   851  		addTo(z[j-B:], qhat)
   852  		j -= B
   853  		stk.restore(mark)
   854  	}
   855  
   856  	// TODO(rsc): Rewrite loop as described above and delete all this code.
   857  
   858  	// Now u < (v<<B), compute lower bits in the same way.
   859  	// Choose shift = B-1 again.
   860  	s := B - 1
   861  	qhat := qhat0
   862  	clear(qhat)
   863  	qhat.divRecursiveStep(stk, u[s:].norm(), v[s:], depth+1)
   864  	qhat = qhat.norm()
   865  	qhatv := stk.nat(3 * n)
   866  	clear(qhatv)
   867  	qhatv = qhatv.mul(stk, qhat, v[:s])
   868  	// Set the correct remainder as before.
   869  	for i := 0; i < 2; i++ {
   870  		if e := qhatv.cmp(u.norm()); e > 0 {
   871  			subVW(qhat, qhat, 1)
   872  			c := subVV(qhatv[:s], qhatv[:s], v[:s])
   873  			if len(qhatv) > s {
   874  				subVW(qhatv[s:], qhatv[s:], c)
   875  			}
   876  			addTo(u[s:], v[s:])
   877  		}
   878  	}
   879  	if qhatv.cmp(u.norm()) > 0 {
   880  		panic("impossible")
   881  	}
   882  	c := subVV(u[:len(qhatv)], u[:len(qhatv)], qhatv)
   883  	if c > 0 {
   884  		c = subVW(u[len(qhatv):], u[len(qhatv):], c)
   885  	}
   886  	if c > 0 {
   887  		panic("impossible")
   888  	}
   889  
   890  	// Done!
   891  	addTo(z, qhat.norm())
   892  }
   893
View as plain text