
Commit 1b57e97

initial work
1 parent b0cfea9 commit 1b57e97

19 files changed: +2959 -0 lines changed

.gitignore

+2
@@ -4,3 +4,5 @@
 *.ptx
 *.cubin
 *.fatbin
+/libraries/
+/releases/

CMakeLists.txt

+52
@@ -0,0 +1,52 @@
cmake_minimum_required(VERSION 3.5)
project(cuda_collatz)
#set(CMAKE_VERBOSE_MAKEFILE ON)

set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_SOURCE_DIR}/bin)
set(SOURCE_DIR ${CMAKE_SOURCE_DIR}/src/cuda_c)
set(UTIL ${CMAKE_SOURCE_DIR}/src/util)

find_package(CUDA QUIET REQUIRED)
message(STATUS "Found CUDA version: ${CUDA_VERSION}")
#set(BUILD_SHARED_LIBS OFF)
#set(CUDA_SEPARABLE_COMPILATION ON)

find_package(Boost 1.58.0 EXACT REQUIRED COMPONENTS program_options system)

set(
    CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS};
    -gencode arch=compute_50,code=sm_50
    -gencode arch=compute_35,code=sm_35
    -std=c++11;
)
if(CMAKE_BUILD_TYPE STREQUAL "Debug")
    set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS}; -g -G -O0)
    message(WARNING "Building DEBUG version: ${CUDA_NVCC_FLAGS}")
else()
    set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS}; -O3)
    message(WARNING "Building RELEASE version: ${CUDA_NVCC_FLAGS}")
endif()

set(CMAKE_CXX_STANDARD 14)

# Specify include directories
include_directories(
    ${SOURCE_DIR}
    ${UTIL}
    ${Boost_INCLUDE_DIRS}
)

link_directories(
    ${Boost_LIBRARY_DIRS}
)

link_libraries(
    ${Boost_LIBRARIES}
)

cuda_add_executable(
    cuda_collatz
    ${SOURCE_DIR}/collatz.cpp
    ${SOURCE_DIR}/collatz_kernel.cu
    ${UTIL}/cuda_util.cpp
)

README.rst

+242
@@ -0,0 +1,242 @@
CUDA Collatz
============

See `Wikipedia - Collatz Conjecture
<http://en.wikipedia.org/wiki/Collatz_conjecture>`_ and `On the 3x + 1 problem
<http://www.ericr.nl/wondrous/index.html>`_ for more information about this
intriguing mathematical problem.

I think this might be the world's fastest single-chip Collatz :term:`Delay
Record` calculator. With starting numbers in the 64 bit range, this app checks
around 8 billion numbers per second for :term:`Delay Record`\ s on a GTX570
graphics card.


Glossary
~~~~~~~~

.. glossary::

   N
      A positive number for which the :term:`Delay` is to be calculated. In
      theory, it can be infinitely large, but this implementation limits the
      starting :term:`N` to 64 bits and any :term:`N` that may be reached
      during the trajectory to 128 bits.

   Delay
      The number of iterations of the Collatz rules on a given :term:`N` until
      :term:`N` reaches the value 1.

   Trajectory
      The path a specific :term:`N` follows from its first value until it
      reaches the value 1.

   Delay Class
      All numbers :term:`N` that have a specific :term:`Delay`.

   Class Record
      The lowest :term:`N` in a given :term:`Delay Class`.

      The first :term:`N` we find that has a previously unencountered
      :term:`Delay` is a Class Record. If that :term:`Delay` is also in the
      highest :term:`Delay Class`, the :term:`N` is also a :term:`Delay
      Record`.

   Delay Record
      The lowest :term:`N` that has a :term:`Delay` higher than the
      :term:`Delay` of the previous :term:`Delay Record`.
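
To make the :term:`Delay` definition concrete, here is a minimal, unoptimized
sketch (my own illustration, not the code in this repository; the function name
is a placeholder, and the real calculator uses 128 bit intermediates and the
optimizations described below):

.. code-block:: cuda

    #include <cstdint>

    // Count the Collatz iterations (the Delay) until n reaches 1.
    uint32_t collatz_delay(uint64_t n) {
        uint32_t delay = 0;
        while (n != 1) {
            n = (n & 1) ? 3 * n + 1 : n / 2;  // the two Collatz rules
            ++delay;
        }
        return delay;
    }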


Overview of implemented optimizations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The performance of this app is achieved by a combination of high level and low
level optimizations. What follows is a brief overview of each optimization.


High level optimizations
------------------------

These optimizations are described on the `Wikipedia - Collatz Conjecture
<http://en.wikipedia.org/wiki/Collatz_conjecture>`_ page.

Skipping :term:`N`
``````````````````

Many :term:`N` can be ruled out as :term:`Delay Record`\ s without calculating
their :term:`Delay`\ s.

- All even :term:`N` are skipped because any even :term:`N` :term:`Delay
  Record` can be derived directly from the previous odd numbered :term:`Delay
  Record`. (Skips 50% of all :term:`N`.)

- All :term:`N` of the form 3k+2 (:term:`N` congruent to 2 modulo 3) are
  skipped because these numbers are not potential :term:`Delay Record`\ s.
  (Skips 33% of all remaining :term:`N`.)

- A table, called a sieve, is used to skip checking of many :term:`N`. The
  sieve checks whether trajectories join. If two trajectories join, the upper
  one can never yield a :term:`Delay Record` (or Class Record) and can be
  skipped. (Skips approximately 80% of all remaining :term:`N`.)

Example: Any :term:`N` of the form 8k+4 will reach 6k+4 after 3 steps, and so
will any :term:`N` of the form 8k+5. Therefore no :term:`N` of the form 8k+5
can be a Class Record (5 itself being the only exception). So no :term:`N` of
the form 8k+5 needs to be checked, and all positions of the form 8k+5 in the
sieve contain a zero.

After these optimizations have been performed, less than 7% of all :term:`N`
remain to actually be calculated. So, while the app checks around 8 billion
:term:`N`\ s per second, it calculates the :term:`Delay` of only around 560
million of them.
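
A rough sketch of the first two filters, assuming a hypothetical helper
``can_skip`` (the sieve itself is a precomputed table, so its lookup is only
hinted at here):

.. code-block:: cuda

    #include <cstdint>

    // True if n can be ruled out without calculating its Delay,
    // using only the first two filters described above.
    bool can_skip(uint64_t n) {
        if ((n & 1) == 0) return true;  // even N: derived from the previous odd Delay Record
        if (n % 3 == 2) return true;    // N of the form 3k+2: not a potential Delay Record
        return false;                   // remaining N still go through the sieve lookup
    }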


:term:`Delay` calculation optimizations
```````````````````````````````````````

The :term:`Delay`\ s for the :term:`N` that were not skipped must be calculated.
The following optimizations are done when calculating :term:`Delay`\ s.

- Lookup tables are used to perform multiple steps of the Collatz calculation
  per iteration of the calculation loop.

- A table, called a "tail", is used to avoid calculating the :term:`Delay`\ s
  of small :term:`N`\ s. Once :term:`N` drops below a certain value, the final
  :term:`Delay` is found by looking the remaining :term:`Delay` up in this
  table and adding it to the current :term:`Delay` (see the sketch after this
  list).

- In addition, an early break technique was tested. In this technique,
  :term:`N` is compared with known, earlier :term:`Delay Record`\ s, and the
  calculation is ended early when it is determined that :term:`N` cannot
  possibly become a :term:`Delay Record`. Unfortunately, the speed gained by
  ending calculations early was outweighed by the overhead of continually
  checking :term:`N` against a table of :term:`Delay Record`\ s, resulting in a
  net decrease in calculation speed. So the early break optimization has been
  left in the code, but is disabled.
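
A sketch of the "tail" idea, assuming a hypothetical precomputed ``tail_delay``
table (the names, the table size, and the unoptimized single-step rule are
placeholders, not the interleaved multi-step tables used by the app):

.. code-block:: cuda

    #include <cstdint>

    const uint32_t TAIL_SIZE = 1u << 16;    // placeholder tail table size
    extern uint32_t tail_delay[TAIL_SIZE];  // tail_delay[i] = precomputed Delay of i

    // Iterate only until n drops below the tail table, then finish with one lookup.
    uint32_t delay_with_tail(uint64_t n) {
        uint32_t delay = 0;
        while (n >= TAIL_SIZE) {
            n = (n & 1) ? 3 * n + 1 : n / 2;
            ++delay;
        }
        return delay + tail_delay[n];
    }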


Low level optimizations
-----------------------

These are specific to my CUDA implementation for the GPU.

- The sieve technique described on Wikipedia yields a table that contains a
  power-of-two number of entries. For instance, a 20 bit sieve contains 2^20 =
  1,048,576 entries. The table is applicable to all :term:`N`, to infinity.
  Each entry is a true/false value that describes whether the corresponding
  :term:`N` (modulo the sieve size) can be skipped. In a straightforward GPU
  implementation, one would create 1,048,576 threads, have each check its
  true/false value in the table, and abort if the value is false. This would
  abort around 80% of the values, and calculations would proceed on 20%. On the
  GPU, however, threads run in warps, and a warp has to keep running as long as
  at least one of its threads is active. In general, each warp would have a few
  threads still active after sieving, so there would be almost no speed
  advantage on the GPU. The most fun I had with this project was finding out
  how to optimize this on the GPU. The solution turned out to be simple: all I
  had to do was transform the table of true/false values into a table of
  offsets to the true values. That way, only as many threads as there are true
  values need to be started, and no threads have to be aborted. Each thread
  determines which :term:`N` it should calculate by using its index to look up
  the corresponding offset in the sieve offsets table and adding it to the base
  :term:`N` (see the sketch after this list).

- Instead of individually applying the three filters described above, I rolled
  them into a combined table. Because one of the filters removes all numbers of
  the form 3k+2 (one number out of every three), this was accomplished by
  creating three variations of the table, each filtering a different set of
  every three numbers, and, for each block of :term:`N`, selecting the one that
  filters out the correct numbers for that :term:`N` base.

- The step algorithm requires two tables, called c and d. It also requires that
  3 to the power of the lookup index be calculated for each lookup. Because the
  indexes into each of the tables and the index used in the power-of-3
  calculation are the same for a given round of the loop, I created a table for
  the exp3 values and interleaved the three tables so that a single lookup
  finds the c, d, and exp3 values. I found that a step size of 19 is the
  largest step size for which none of the values in the tables overflow 32
  bits. The size of the step tables doubles for each additional step. 19 steps
  takes 2^19 * 4 * 4 = 8,388,608 bytes.

- The :term:`Delay` calculation loop was simplified by making sure that the
  step table is wider than the sieve bits.

- In C, there is no efficient way of doing math operations with a higher bit
  width than what is natively supported by the machine, because C does not
  provide an efficient way of capturing the carry flag and including it in
  subsequent calculations. The target GPU, GF110, is a 32 bit machine, and this
  calculator does 128 bit calculations while calculating the :term:`Delay`, so
  that part was written in PTX (a virtual machine assembly language for NVIDIA
  GPUs). This helped speed up other operations as well.
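
A rough CUDA sketch of the offsets idea (kernel and parameter names are my own
placeholders, not the actual kernel in collatz_kernel.cu): one thread is
launched per surviving sieve position instead of one per sieve entry.

.. code-block:: cuda

    #include <cstdint>

    // sieve_offsets[i] holds the i-th position that the sieve did not rule
    // out, so every launched thread has real work and none are aborted.
    __global__ void check_block(uint64_t base_n,
                                const uint32_t* sieve_offsets,
                                uint32_t num_offsets) {
        uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_offsets) return;
        uint64_t n = base_n + sieve_offsets[i];  // the N this thread checks
        // ... calculate the Delay of n and record it if it is a Delay Record ...
    }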


Sieve generator
~~~~~~~~~~~~~~~

As described above, the sieve is a precomputed table that specifies the
:term:`N` for which no :term:`Delay Record`\ s are possible and which can thus
be skipped.

A 19 bit wide sieve turned out to be the optimal size in my GPU implementation.
Initially, I thought that the optimal size for the sieve would be the widest
sieve that would fit in GPU memory, so I went about creating an app that could
create an arbitrarily wide sieve.

Generating a small sieve is simple. To generate a sieve, say, 10 bits wide,
1024k + i is considered, where i loops from 0 to 1023, and 10 steps of x/2 or
(3x+1)/2 are applied. After that, a number of the form 3^p * k + r is obtained.
If several values of i end up with the same p and r, all of them can be skipped
except the lowest one.
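
Here is a sketch of this naive generator (my own reconstruction; the function
name and the use of std::map are placeholders, and it tracks each number
symbolically as coef * k + rem, so it is only practical for small sieves since
everything stays in memory):

.. code-block:: cuda

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    // Build a sieve of width `bits`. For N = 2^bits * k + i, apply `bits`
    // steps of x/2 or (3x+1)/2 symbolically, tracking N as coef * k + rem.
    // Values of i that end with the same (coef, rem) join paths, so only the
    // lowest such i is kept (marked true) in the sieve.
    std::vector<bool> make_sieve(uint32_t bits) {
        const uint64_t size = 1ull << bits;
        std::vector<bool> sieve(size, false);
        std::map<std::pair<uint64_t, uint64_t>, uint64_t> seen;
        for (uint64_t i = 0; i < size; ++i) {
            uint64_t coef = size;  // coefficient of k, starts as 2^bits
            uint64_t rem = i;
            for (uint32_t step = 0; step < bits; ++step) {
                if (rem & 1) {            // odd: (3x+1)/2
                    coef = 3 * coef / 2;
                    rem = (3 * rem + 1) / 2;
                } else {                  // even: x/2
                    coef /= 2;
                    rem /= 2;
                }
            }
            // After `bits` steps, coef = 3^p and rem = r from the text above.
            if (seen.emplace(std::make_pair(coef, rem), i).second)
                sieve[i] = true;          // first (lowest) i with this (p, r)
        }
        return sieve;
    }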

However, this method does not work for generating a large sieve. The reason is
that the algorithm is slowed down by a `Schlemiel the Painter's algorithm
<http://en.wikipedia.org/wiki/Schlemiel_the_Painter%27s_algorithm>`_. For each
new entry in the table, the algorithm has to revisit all the previously
generated entries. As the number of entries increases, the algorithm keeps
slowing down, until it virtually grinds to a halt.

By analyzing the algorithm, I found that it could be implemented in a way that
does not require revisiting all the previously generated entries for each new
entry. The new algorithm makes it feasible to create large sieves. It works by
creating entries that can be sorted in such a way that only a single pass over
all the records is necessary.

A sieve that would use 2GB of memory covers 2 (because we remove the even
numbered bits in the end) * 2GB * 8 (bits per byte) = 32 Gbit = 2^35 =
34,359,738,368 bits. To generate this sieve, it is necessary to have a sortable
table with the same number of entries. Each entry is 16 bytes (optimized using
bitfields), so 16 bytes * 34,359,738,368 entries = 512GB of temporary storage.

Unless one has a supercomputer with TBs of RAM, it is necessary to use disks
for storage. I found a library called STXXL that implements STL for large
datasets and includes algorithms that are efficient when using disk based
storage. `STXXL <http://stxxl.sourceforge.net/>`_ enabled me to easily create
an app that manipulates the data in much the same way as I would with regular
STL. The stxxl::sort is not in-place: it requires the same amount of disk space
as the size of the data being sorted, to store the sorted runs during sorting.
So another 512GB is required during the step that sorts the entries.

The same number of index records is also required, each 64 bits + 8 bits = 9
bytes. This is less than the extra disk space used when sorting the Collatz
records, so the peak disk usage is 1TB.

Adding 20% for overhead, I determined that around 1.2TB of disk space was
required to generate a 2^35 sieve. At the time I did this project, disks
weren't that large, so I set up several of my largest disks in a JBOD
configuration to hold the temporary data. The single file on there, which was
over 1TB at one point, is still the biggest file I've seen. It took around two
weeks to run the app, during which time the disks were working continuously.


Technologies
~~~~~~~~~~~~

CUDA, PTX, C++, Boost


To do
~~~~~

- There is one unused 32 bit word used for padding in the interleaved step
  table. It might be worthwhile to extend the exp3 values into this word, so
  that more steps can be done in one iteration.
