CUDA Collatz
============

See `Wikipedia - Collatz Conjecture
<http://en.wikipedia.org/wiki/Collatz_conjecture>`_ and `On the 3x + 1 problem
<http://www.ericr.nl/wondrous/index.html>`_ for more information about this
intriguing mathematical problem.

I think this might be the world's fastest single-chip Collatz :term:`Delay
Record` calculator. With starting numbers in the 64-bit range, this app checks
around 8 billion numbers per second for :term:`Delay Record`\ s on a GTX570
graphics card.


Glossary
~~~~~~~~

N
  A positive number for which the :term:`Delay` is to be calculated. In
  theory it can be infinitely large, but this implementation limits the
  starting :term:`N` to 64 bits and any :term:`N` reached during the
  trajectory to 128 bits.

Delay
  The number of iterations of the Collatz rules on a given :term:`N` until
  the :term:`N` reaches the value 1.

Trajectory
  The path a specific :term:`N` follows from its first value until it reaches
  the value 1.

Delay Class
  All numbers :term:`N` that have a specific :term:`Delay`.

Class Record
  The lowest :term:`N` in a given :term:`Delay Class`. The first :term:`N`
  found to have a previously unencountered :term:`Delay` is a Class Record.
  If that :term:`Delay` is also in the highest :term:`Delay Class`, it is
  also a :term:`Delay Record`.

Delay Record
  The lowest :term:`N` that has a :term:`Delay` higher than the :term:`Delay`
  of the previous :term:`Delay Record`.


Overview of implemented optimizations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance of this app is achieved by a combination of high level and low
level optimizations. What follows is a brief overview of each optimization.


High level optimizations
------------------------

These optimizations are described on the `Wikipedia - Collatz Conjecture
<http://en.wikipedia.org/wiki/Collatz_conjecture>`_ page.

Skipping :term:`N`
``````````````````

Many :term:`N` can be ruled out as :term:`Delay Record`\ s without calculating
their :term:`Delay`\ s.

- All even :term:`N` are skipped because any even :term:`N` :term:`Delay
  Record` can be derived directly from the previous odd-numbered
  :term:`Delay Record`. (Skips 50% of all :term:`N`.)

- All :term:`N` of the form 3k+2 (N congruent to 2 modulo 3) are skipped
  because these numbers are not potential :term:`Delay Record`\ s. (Skips 33%
  of all remaining :term:`N`.)

- A table, called a sieve, is used to skip checking of many :term:`N`. The
  sieve checks whether paths come together. If two paths join, the upper one
  can never yield a :term:`Delay Record` (or Class Record) and can be skipped.
  (Skips approximately 80% of all remaining :term:`N`.)

  Example: Any :term:`N` of the form 8k+4 will reach 6k+4 after 3 steps, and
  so will any :term:`N` of the form 8k+5. Therefore no :term:`N` of the form
  8k+5 can be a Class Record (5 itself being the only exception). So no
  :term:`N` of the form 8k+5 needs to be checked, and all positions of the
  form 8k+5 in the sieve contain a zero.

After these optimizations have been performed, less than 7% of all :term:`N`
remain to actually be calculated. So, while the app checks around 8 billion
:term:`N`\ s per second, it calculates the :term:`Delay` of only around 560
million of them.
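
As a rough illustration of how these filters combine, here is a host-side
sketch (my own, not the project's code) that collects the offsets of the
:term:`N` that still need checking within one block; ``sieve`` is assumed to
hold the true/false values described above::

  #include <cstdint>
  #include <vector>

  // Collect the offsets within one block of N that still need their Delay
  // calculated: odd, not of the form 3k+2, and not excluded by the
  // join-detection sieve. n_base is the first N of the block.
  std::vector<uint32_t> surviving_offsets(const std::vector<bool>& sieve,
                                          uint64_t n_base)
  {
    std::vector<uint32_t> offsets;
    for (uint32_t b = 0; b < sieve.size(); ++b) {
      uint64_t n = n_base + b;
      if ((n & 1) == 0) continue;  // even N: derived from odd Delay Records
      if (n % 3 == 2) continue;    // N = 3k+2: never a Delay Record
      if (!sieve[b]) continue;     // path joins a lower one: skip
      offsets.push_back(b);
    }
    return offsets;
  }

Applying all three filters leaves the "less than 7%" of :term:`N` mentioned
above, and the surviving offsets are exactly what the GPU version stores in
its sieve offsets table (described under the low level optimizations).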


:term:`Delay` calculation optimizations
```````````````````````````````````````

The :term:`Delay`\ s for the :term:`N` that were not skipped must be
calculated. The following optimizations are used when calculating
:term:`Delay`\ s.

- Lookup tables are used to perform multiple steps of the Collatz calculation
  per iteration of the calculation loop (a sketch of how such tables can be
  built and used follows this list).

- A table, called a "tail", is used to avoid calculating the :term:`Delay`\ s
  of small :term:`N`\ s. Once :term:`N` drops below a certain value, the final
  :term:`Delay` is found by looking the remaining :term:`Delay` up in this
  table and adding it to the current :term:`Delay`.

- In addition, an early break technique has been tested. In this technique,
  :term:`N` is compared with known, earlier :term:`Delay Record`\ s, and the
  calculation is ended early when it is determined that :term:`N` cannot
  possibly become a :term:`Delay Record`. Unfortunately, the speed gained by
  ending calculations early was outweighed by the overhead of continually
  checking :term:`N` against a table of :term:`Delay Record`\ s, resulting in
  a net decrease in calculation speed. So, the early break optimization has
  been left in the code, but is disabled.
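
The following sketch shows how such step tables can be built and used. It
follows the standard c/d formulation (write N = 2^k * a + b; after k halved
steps of x/2 or (3x+1)/2, N becomes 3^c[b] * a + d[b] and the :term:`Delay`
grows by k + c[b]). The function and table names are mine, the tables are kept
separate here rather than interleaved, and ``tail`` stands for the
small-:term:`N` :term:`Delay` table described above::

  #include <cstdint>
  #include <vector>

  // Build c/d tables for k-step jumps. Each (3x+1)/2 counts as two Collatz
  // steps, so one jump adds k + c[b] to the Delay.
  void build_step_tables(int k, std::vector<uint32_t>& c,
                         std::vector<uint32_t>& d)
  {
    const uint32_t size = 1u << k;
    c.assign(size, 0);
    d.assign(size, 0);
    for (uint32_t b = 0; b < size; ++b) {
      uint64_t x = b;
      uint32_t odd = 0;
      for (int s = 0; s < k; ++s) {
        if (x & 1) { x = (3 * x + 1) / 2; ++odd; }
        else       { x /= 2; }
      }
      c[b] = odd;
      d[b] = static_cast<uint32_t>(x);  // fits in 32 bits for k <= 19
    }
  }

  // Delay of n using the tables. exp3[b] holds 3^c[b]; tail holds the Delays
  // of all N below tail.size(). Assumes tail.size() >= 2^k, so a jump can
  // never step past 1.
  uint32_t delay(uint64_t n, int k, const std::vector<uint32_t>& c,
                 const std::vector<uint32_t>& d,
                 const std::vector<uint32_t>& exp3,
                 const std::vector<uint32_t>& tail)
  {
    uint32_t steps = 0;
    while (n >= tail.size()) {
      const uint64_t b = n & ((1u << k) - 1);
      const uint64_t a = n >> k;
      n = static_cast<uint64_t>(exp3[b]) * a + d[b];  // k steps in one go
      steps += k + c[b];
    }
    return steps + tail[n];
  }

In the real implementation the multiply-add in the loop is done in 128-bit
arithmetic (see the PTX notes under the low level optimizations), since
3^c[b] * a can overflow 64 bits.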


Low level optimizations
-----------------------

These are specific to my CUDA implementation for the GPU.

- The sieve technique described on Wikipedia yields a table with a
  power-of-two number of entries. For instance, a 20-bit sieve contains 2^20 =
  1,048,576 entries. The table applies to all :term:`N`, to infinity. Each
  entry is a true/false value that says whether the corresponding :term:`N`,
  modulo the sieve size, can be skipped. In a straightforward GPU
  implementation, one would create 1,048,576 threads, have each check its
  true/false value in the table, and abort if the value is false. This would
  abort around 80% of the threads and calculations would proceed on the
  remaining 20%. On the GPU, threads are run in warps, and a warp has to keep
  running as long as at least one of its threads is active. In general, each
  warp would have a few threads active after sieving, so there would be almost
  no speed advantage on the GPU. The most fun I had with this project was
  finding out how to optimize this on the GPU. The solution turned out to be
  simple: transform the table of true/false values into a table of offsets to
  the true values. That way, only as many threads are started as there are
  true values, and no threads have to be aborted. Each thread determines which
  :term:`N` it should calculate by using its index to look up the
  corresponding offset in the sieve offsets table and adding it to the base
  :term:`N` (a kernel sketch follows this list).

- Instead of individually applying the three filters described above, I rolled
  them into a combined table. Because one of the filters removes all numbers
  of the form 3k+2 (one number out of three), this was accomplished by
  creating three variations of the table, each filtering out a different set
  of every three numbers, and, for each block of :term:`N`, selecting the one
  that filters out the correct numbers for that block's base :term:`N`.

- The step algorithm requires two tables, called c and d. It also requires
  that 3 raised to a power that depends on the lookup index be calculated for
  each lookup. Because the index into each of the tables and the index used in
  the power-of-3 calculation are the same for a given round of the loop, I
  created a table for the exp3 values and interleaved the three tables so that
  a single lookup finds the c, d and exp3 values. I found that 19 is the
  largest step size for which none of the values in the tables overflow 32
  bits. The size of the step tables doubles for each additional step; 19 steps
  take 2^19 entries * 4 words * 4 bytes = 8,388,608 bytes.

- The :term:`Delay` calculation loop was simplified by making sure that the
  step table is wider than the sieve bits.

- In C, there is no efficient way of doing math operations with a higher bit
  width than what the machine natively supports, because C provides no
  efficient way of capturing the carry flag and feeding it into subsequent
  calculations. The target GPU, GF110, is a 32-bit machine and this calculator
  does 128-bit calculations while calculating the :term:`Delay`, so it was
  written in PTX (a virtual machine assembly language for NVIDIA GPUs). This
  helped speed up other operations as well (a 128-bit addition in inline PTX
  is sketched below).
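
To make the first point above concrete, here is a minimal kernel sketch of the
offsets idea (not the project's actual kernel); ``delay_of`` is a stand-in that
uses plain 64-bit arithmetic instead of the multi-step tables and 128-bit
PTX::

  #include <cstdint>

  // Stand-in Delay calculation: plain 64-bit Collatz iteration. The real
  // kernel uses the multi-step tables and 128-bit PTX arithmetic instead.
  __device__ uint32_t delay_of(uint64_t n)
  {
    uint32_t steps = 0;
    while (n != 1) {
      n = (n & 1) ? 3 * n + 1 : n / 2;
      ++steps;
    }
    return steps;
  }

  // One thread per surviving sieve entry: thread i looks up its offset and
  // adds it to the block's base N, so no warp lane is wasted on a skipped N.
  __global__ void check_block(uint64_t n_base,
                              const uint32_t* sieve_offsets,
                              uint32_t n_offsets,
                              uint32_t* delays)
  {
    const uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_offsets) {
      return;
    }
    const uint64_t n = n_base + sieve_offsets[i];
    delays[i] = delay_of(n);
  }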
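
For the last point, the carry-chaining technique looks roughly like this when
expressed as inline PTX in CUDA C++. The project's delay loop is written
directly in PTX rather than through inline assembly, so this is only an
illustration of how the carry flag is propagated across 32-bit limbs::

  #include <cstdint>

  // 128-bit value as four 32-bit limbs, least significant first, matching a
  // 32-bit machine such as the GF110.
  struct u128
  {
    uint32_t w[4];
  };

  // a + b with the carry flag chained across limbs via add.cc / addc.
  __device__ __forceinline__ u128 add128(u128 a, u128 b)
  {
    u128 r;
    asm("add.cc.u32  %0, %4, %8;\n\t"
        "addc.cc.u32 %1, %5, %9;\n\t"
        "addc.cc.u32 %2, %6, %10;\n\t"
        "addc.u32    %3, %7, %11;\n\t"
        : "=r"(r.w[0]), "=r"(r.w[1]), "=r"(r.w[2]), "=r"(r.w[3])
        : "r"(a.w[0]), "r"(a.w[1]), "r"(a.w[2]), "r"(a.w[3]),
          "r"(b.w[0]), "r"(b.w[1]), "r"(b.w[2]), "r"(b.w[3]));
    return r;
  }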


Sieve generator
~~~~~~~~~~~~~~~

As described above, the sieve is a precomputed table that specifies the
:term:`N` for which no :term:`Delay Record`\ s are possible and which can thus
be skipped.

A 19-bit-wide sieve turned out to be the optimal size in my GPU
implementation. Initially, I thought that the optimal size would be the widest
sieve that would fit in GPU memory, so I went about creating an app that could
generate an arbitrarily wide sieve.

Generating a small sieve is simple. To generate a sieve, say, 10 bits wide,
1024k + i is calculated, where i loops from 0 to 1023. 10 steps of x/2 or
(3x+1)/2 are performed, after which a number of the form 3^p * k + r is
obtained. If several values of i end up with the same p and r, all of them
except the lowest one can be skipped.
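
A sketch of this simple method (my own illustration; none of the names are
from the project) tracks N symbolically as coeff * k + rest and checks each
new (coeff, rest) pair against all earlier ones, exactly as described::

  #include <cstdint>
  #include <utility>
  #include <vector>

  // Simple b-bit sieve construction. Start from 2^b * k + i; after b halved
  // steps the coefficient of k is a power of 3. An i whose (coeff, rest) pair
  // was already produced by a smaller i joins that path and can be skipped.
  std::vector<bool> make_small_sieve(int b)
  {
    const uint32_t size = 1u << b;
    std::vector<bool> keep(size, true);
    std::vector<std::pair<uint64_t, uint64_t>> seen;  // (coeff, rest) per i
    for (uint32_t i = 0; i < size; ++i) {
      uint64_t coeff = size;  // 2^b; N = coeff * k + rest
      uint64_t rest = i;
      for (int s = 0; s < b; ++s) {
        if (rest & 1) {             // coeff is still even, so N is odd
          coeff = coeff / 2 * 3;    // (3N + 1) / 2
          rest = (3 * rest + 1) / 2;
        } else {                    // N is even
          coeff /= 2;
          rest /= 2;
        }
      }
      // Check against every earlier entry.
      for (const auto& e : seen) {
        if (e.first == coeff && e.second == rest) { keep[i] = false; break; }
      }
      seen.emplace_back(coeff, rest);
    }
    return keep;
  }

The linear scan over ``seen`` for every new entry is what turns into the
Schlemiel the Painter's algorithm discussed next.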

However, this method does not work for generating a large sieve. The reason is
that the algorithm is slowed down by a `Schlemiel the Painter's algorithm
<http://en.wikipedia.org/wiki/Schlemiel_the_Painter%27s_algorithm>`_. For each
new entry in the table, the algorithm has to revisit all the previously
generated entries. As the number of entries increases, the algorithm keeps
slowing down, until it virtually grinds to a halt.

By analyzing the algorithm, I found that it could be implemented in a way that
does not require revisiting all the previously generated entries for each new
entry. The new algorithm makes it feasible to create large sieves. It works by
creating entries that can be sorted in such a way that only a single pass over
all the records is necessary.
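
One way this can work (a sketch under my reading of the description above, not
necessarily the record layout the generator actually uses) is to emit one
record per starting offset, sort by the (coeff, rest) key that identifies
joined paths, and keep the lowest offset in each run during a single pass;
``std::sort`` stands in here for an external-memory sort::

  #include <algorithm>
  #include <cstdint>
  #include <tuple>
  #include <vector>

  struct SieveRecord
  {
    uint64_t coeff;  // power-of-3 coefficient of k after `bits` halved steps
    uint64_t rest;   // additive remainder after `bits` halved steps
    uint64_t i;      // starting offset this record was generated from
  };

  std::vector<bool> make_sieve_by_sorting(int bits)
  {
    const uint64_t size = 1ull << bits;
    std::vector<SieveRecord> records(size);
    for (uint64_t i = 0; i < size; ++i) {
      uint64_t coeff = size;
      uint64_t rest = i;
      for (int s = 0; s < bits; ++s) {
        if (rest & 1) { coeff = coeff / 2 * 3; rest = (3 * rest + 1) / 2; }
        else          { coeff /= 2; rest /= 2; }
      }
      records[i] = {coeff, rest, i};
    }
    // Records with equal (coeff, rest) are paths that have joined. Sorting
    // groups them, and the lowest i sorts first within each group.
    std::sort(records.begin(), records.end(),
              [](const SieveRecord& a, const SieveRecord& b) {
                return std::tie(a.coeff, a.rest, a.i) <
                       std::tie(b.coeff, b.rest, b.i);
              });
    std::vector<bool> keep(size, false);
    for (uint64_t j = 0; j < size; ++j) {
      // Single pass: keep only the first record of each (coeff, rest) run.
      if (j == 0 || records[j].coeff != records[j - 1].coeff ||
          records[j].rest != records[j - 1].rest) {
        keep[records[j].i] = true;
      }
    }
    return keep;
  }

For a 2^35 sieve, the in-memory sort is replaced by an external-memory sort,
which is where STXXL comes in below.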
| 203 | + |
| 204 | +A sieve that would use 2GB of memory covers 2 (because we remove even numbered |
| 205 | +bits in the end) * 2GB * 8 (bits per byte) = 32gbit = 2^35 = 34 359 738 368 |
| 206 | +bits. To generate this sieve, it is necessary to have a sortable table with the |
| 207 | +same number of entries. Each entry is 16 bytes (optimized using bitfields). 16 |
| 208 | +bytes * 34 359 738 368 entry = 512GB of temporary storage. |
| 209 | + |
| 210 | +Unless one has a supercomputer with TBs of RAM, it is necessary to use disks for |
| 211 | +storage. I found a library called STXXL that implements STL for large datasets |
| 212 | +and includes algorithms that are efficient when using disk based storage. `STXXL |
| 213 | +<http://stxxl.sourceforge.net/>`_ enabled me to easily create an app that |
| 214 | +manipulates the data in much the same way as I would with regular STL. The |
| 215 | +stxxl::sort is not in-place. It requires the same amount of disk space as the |
| 216 | +size of the data being sorted, to store the sorted runs during sorting. So |
| 217 | +another 512GB is required during the step that sorts the entries. |
| 218 | + |
| 219 | +The same number of index records is also required, each is 64 bits + 8 bits = 9 |
| 220 | +bytes. This is less than the extra memory used by sorting the Collatz records, |
| 221 | +so the peak disk usage is 1TB. |
| 222 | + |
| 223 | +Adding 20% for overhead, I determined that around 1.2TB of disk space was |
| 224 | +required to generate a 2^35 sieve. At the time when I did this project, disks |
| 225 | +weren't that large, so I set up several of my largest disks in a JBOD |
| 226 | +configuration to hold the temporary data. The single file on there, that was |
| 227 | +over 1TB at one point, is still the biggest file I've seen. It took around two |
| 228 | +weeks to run the app, during which time the disks were working continuously. |


Technologies
~~~~~~~~~~~~

CUDA, PTX, C++, Boost


To do
~~~~~

- There is one unused 32-bit word of padding in the interleaved step table. It
  might be worth extending the exp3 values into this word, so that more steps
  can be done in one iteration.