Interest in post comparing nVidia CUDA code vs. Intel AVX-512 code?

Mark44 · Mar 23, 2022

I've done a bit of CUDA programming lately, to exercise some parallel code on my nVidia graphics card. I also implemented the same computations in Intel AVX-512 assembly code.

The code I wrote takes a bunch of points (262,144 = ##2^{18}##, to be exact) and calculates the slope and y-intercept of the regression line that best fits them. Since all the points were generated using a straight-line function, it's easy to tell whether the computed slope and intercept are correct. The two programs came in surprisingly close in elapsed time, with about 8 milliseconds for the CUDA version, and about 9 milliseconds for the AVX-512 version. Both versions were run on my Dell computer with a 10-core Xeon Silver processor. The nVidia card is a Quadro Pro P2000, with 8 multiprocessors and 128 cores per MP.
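
For readers who want the computation spelled out, here is a minimal scalar sketch of a least-squares line fit in plain C++. It is not the CUDA or AVX-512 code described in this post. The names `Line` and `fit_line` are hypothetical, and the closed-form formulas are the standard ones for simple linear regression.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Slope and y-intercept of the fitted line y = slope * x + intercept.
struct Line {
    double slope;
    double intercept;
};

// Scalar reference version (hypothetical helper, not the thread author's code).
// One pass accumulates the four sums that the closed-form solution needs.
// Assumes equal-length inputs with at least two distinct x values.
Line fit_line(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    double sx = 0.0, sy = 0.0, sxy = 0.0, sxx = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sx  += x[i];
        sy  += y[i];
        sxy += x[i] * y[i];
        sxx += x[i] * x[i];
    }
    const double n = static_cast<double>(x.size());
    const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    const double intercept = (sy - slope * sx) / n;
    return {slope, intercept};
}
```

Feeding it points generated from a known line, as described above, gives a simple correctness check: the returned slope and intercept should match the generating line to within rounding error.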

If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.

jedishrfu · Mar 23, 2022

It would be interesting to run it with various languages C vs Java vs Julia vs Python. Julia in particular has some numerical computing with CUDA and Python has numpy although I'm not sure of its CUDA capabilities.

Mark44 · Mar 23, 2022

jedishrfu said:

It would be interesting to run it with various languages C vs Java vs Julia vs Python. Julia in particular has some numerical computing with CUDA and Python has numpy although I'm not sure of its CUDA capabilities.

My code is pretty much C/C++ (including CUDA extensions) for the CUDA version, and C/C++ with raw AVX-512 assembly for the other version. There are intrinsics for a lot of the AVX-512 and other assembly instructions, but I've never felt the need to use them.

I've never done anything in Julia, so can't say anything about it. Python + numpy would be very much slower, I believe. Recoding the C/C++ parts in Java might be anywhere from somewhat slower to a lot slower -- don't know.

pbuk · Mar 23, 2022

jedishrfu said:

It would be interesting to run it with various languages C vs Java vs Julia vs Python.

Maybe, but on a totally different level. This would simply show whether the particular implementation of the relevant language was capable of exploiting efficiencies in AVX-512, CUDA, or both.

jedishrfu said:

Python has numpy although I'm not sure of its CUDA capabilities.

It has none: for CUDA you need Numba. Which I think demonstrates my point: massively parallel numeric computing is not something that speeds up everything you do; it only speeds up code that is written and/or compiled specifically to take advantage of it.

pbuk · Mar 23, 2022

Mark44 said:

If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.

Yes it does. With 1024 cores you might expect the GPU to be at least 10x faster than the 10 x 512 / 64 = 80 parallel 64-bit computations of 10 AVX-512 cores. However, I wouldn't expect this GPU to shine in this test:

64-bit performance on the P2000 is not great (95 GFLOPS vs 3 TFLOPS for 32-bit).
##2^{18} \times 8## bytes is 2 GB, in 8 ms that's 25 GB/s. PCIe3 is 32 GB/s so I think this is the bottleneck rather than core processing.

Mark44 · Mar 23, 2022

pbuk said:

Yes it does, with 1024 cores you might expect the GPU to be at least 10x faster than the 10 x 512 / 64 = 80 parallel 64 bit computations of 10 AVX-512 cores,

But the AVX code isn't running in parallel, at least not based on anything I did. I've done some experimentation in the past on splitting a program up into threads, but there is so much overhead in comparison to the relatively small amount of work I'm doing, that it takes way longer with multiple threads than just running a single thread.

pbuk said:

##2^{18} \times 8## bytes is 2 GB

I didn't mention it, but the points are all doubles, so both programs are working with 2 GB of data.

pbuk · Mar 23, 2022

Mark44 said:

But the AVX code isn't running in parallel, at least not based on anything I did.

Well, it is executing 512 / 64 = 8 operations in parallel per core, but with only one core that is only 8 parallel operations vs. 1024 for the GPU.
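
The lane arithmetic can be written down as a small compile-time sketch. The helper names `lanes` and `peak_parallel_ops` are hypothetical, and the result is only a peak count of simultaneous 64-bit operations. It ignores threading overhead, clock speeds, and memory limits.

```cpp
#include <cstddef>

// Number of elements of a given width that fit in one SIMD register.
constexpr std::size_t lanes(std::size_t register_bits, std::size_t element_bits) {
    return register_bits / element_bits;
}

// Doubles handled per instruction by one 512-bit AVX-512 register.
constexpr std::size_t kDoubleLanes = lanes(512, 64);

// Peak simultaneous 64-bit operations if `cores` cores each issue one
// full-width instruction (an idealized upper bound, not a measurement).
constexpr std::size_t peak_parallel_ops(std::size_t cores) {
    return cores * kDoubleLanes;
}

static_assert(peak_parallel_ops(1) == 8, "single-threaded AVX-512 code");
static_assert(peak_parallel_ops(10) == 80, "all ten cores busy");
```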

Mark44 said:

I didn't mention it, but the points are all doubles, so both programs are working with 2 GB of data.

Yes, that's what I assumed, but 2 GB takes at least 6 ms over PCIe 3 or 8 ms over DDR4 3200, so the CPU/GPU performance is dominated by bus bandwidth in either case. You need something more intensive than the 6 FLOPs per 2 x 64-bit data point of simple linear least squares regression so the extra cores and GDDR5 bandwidth of the GPU can make a difference.
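
A rough way to see why so little work per point leaves the cores waiting on memory is to compute the arithmetic intensity, sketched below. The split of the 6 FLOPs into 2 multiplies and 4 adds is an assumption about the accumulation loop, not something stated in the thread, and the function names are hypothetical.

```cpp
// Assumed per-point work: running sums of x, y, x*y and x*x,
// i.e. 2 multiplies + 4 adds = 6 FLOPs.
constexpr double kFlopsPerPoint = 6.0;

// Each point is two 64-bit doubles (x and y).
constexpr double kBytesPerPoint = 2.0 * sizeof(double);

// FLOPs performed per byte of data that has to be moved.
constexpr double arithmetic_intensity() {
    return kFlopsPerPoint / kBytesPerPoint;
}

// FLOP rate that a given bandwidth (bytes per second) can keep fed,
// assuming every point is transferred exactly once.
constexpr double bandwidth_bound_flops(double bytes_per_second) {
    return bytes_per_second * arithmetic_intensity();
}
```

With the 32 GB/s PCIe 3 figure quoted earlier in the thread, `bandwidth_bound_flops(32e9)` comes to 12 GFLOPS, well under the 95 GFLOPS of 64-bit throughput quoted for the P2000, which is consistent with the bus rather than the cores being the limit.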

sysprog · Mar 23, 2022

Mark44 said:

If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.

I would be interested in reading it.

Mark44 · Apr 8, 2022

I'm still working on things. I'm doing it in two Insights articles, one for the CUDA part, and one for the AVX-512 part. The AVX-512 part will also include some timing comparisons. The CUDA part is pretty well done, but I haven't started on the AVX-512 writeup just yet -- it's tax season and I've been working on gathering info for the federal and state returns for my mother's estate.

sysprog · Apr 9, 2022

Blessings to you and yours in this time of your bereavement. Among your readership, I, and I'm confident many others, look forward to more of your good writings. Please carry on when you're ready.

Mark44 · Apr 13, 2022

I'm nearly done with two articles -- one on a CUDA application and the other that does approximately the same thing in AVX-512. One article is finished, and the other is nearly finished. I'm hoping to get them published by the end of this week, maybe.
Edit: I contacted Greg by PM, and he said that the earliest the articles could be published was next Tuesday.

Mark44 · Apr 20, 2022

The first of the two articles is now published -- https://www.physicsforums.com/threads/parallel-programming-on-an-nvidia-gpu.1014468/.

I expect the second to be published within a day or two.
