I've done a bit of CUDA programming lately, to exercise some parallel code on my nVidia graphics card. I also implemented the same computations in Intel AVX-512 assembly code.
The code I wrote takes a bunch of points (262,144 = ##2^{18}##, to be exact) and calculates the slope and y-intercept of the regression line that best fits them. Since all the points were generated using a straight-line function, it's easy to tell whether the computed slope and intercept are correct. The two programs came in surprisingly close in elapsed time: about 8 milliseconds for the CUDA version and about 9 milliseconds for the AVX-512 version. Both versions were run on my Dell computer with a 10-core Xeon Silver processor. The nVidia card is a Quadro Pro P2000, with 8 multiprocessors and 128 cores per MP, for 1,024 cores in total.
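For anyone who wants to see the arithmetic involved, here is a minimal scalar sketch in Python. It is not the CUDA or AVX-512 code, just the textbook closed-form least-squares formulas, and I'm assuming that is the computation being described. The function name `fit_line` and the test line ##y = 3x + 2## are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    # Four sums over the data. Each one is a reduction, which is the part
    # a GPU or SIMD version could split across threads or vector lanes.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    denom = n * sum_xx - sum_x * sum_x
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")

    slope = (n * sum_xy - sum_x * sum_y) / denom
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Hypothetical test data: 2**18 points lying exactly on y = 3x + 2.
N = 2 ** 18
TRUE_SLOPE, TRUE_INTERCEPT = 3.0, 2.0
xs = [i / N for i in range(N)]
ys = [TRUE_SLOPE * x + TRUE_INTERCEPT for x in xs]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the points come from a known line, the check is simply whether the computed slope and intercept match the values used to generate the data, up to floating-point rounding.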
If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.