Interest in post comparing nVidia CUDA code vs. Intel AVX-512 code?

In summary, the programmer ported the code from Intel AVX-512 assembly code to parallel code using CUDA on their graphics card. The two programs came in surprisingly close in elapsed time, with about 8 milliseconds for the CUDA version, and about 9 milliseconds for the AVX-512 version. Both versions were run on the programmer's Dell computer with a 10-core Xeon Silver processor.
  • #1
I've done a bit of CUDA programming lately, to exercise some parallel code on my nVidia graphics card. I also implemented the same computations in Intel AVX-512 assembly code.

The code I wrote takes a bunch of points (262,144 = ##2^{18}##, to be exact) and calculates the slope and y-intercept of the regression line that best fits them. Since all the points were generated from a straight-line function, it's easy to tell whether the computed slope and intercept are correct. The two programs came in surprisingly close in elapsed time: about 8 milliseconds for the CUDA version, and about 9 milliseconds for the AVX-512 version. Both versions were run on my Dell computer with a 10-core Xeon Silver processor. The nVidia card is a Quadro P2000, with 8 multiprocessors and 128 cores per multiprocessor.

If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.
 
  • #2
It would be interesting to run it in various languages: C vs. Java vs. Julia vs. Python. Julia in particular has some numerical-computing support for CUDA, and Python has NumPy, although I'm not sure of its CUDA capabilities.
 
  • #3
jedishrfu said:
It would be interesting to run it in various languages: C vs. Java vs. Julia vs. Python. Julia in particular has some numerical-computing support for CUDA, and Python has NumPy, although I'm not sure of its CUDA capabilities.
My code is pretty much C/C++ (including the CUDA extensions) for the CUDA version, and C/C++ with raw AVX-512 assembly for the other version. There are intrinsics for a lot of the AVX-512 and other assembly instructions, but I've never felt the need to use them.
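For readers unfamiliar with intrinsics: they let you emit these vector instructions from C/C++ without writing assembly by hand. A hedged sketch of the intrinsics route (my own illustration, not the code from this thread; it includes a scalar fallback so it compiles even without AVX-512 support):

```cpp
#include <cstddef>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Sum an array of doubles. Compiled with AVX-512 support (-mavx512f),
// this processes 8 doubles per iteration in a ZMM register; otherwise
// a plain scalar loop runs.
double sum_doubles(const double* a, std::size_t n) {
#if defined(__AVX512F__)
    __m512d acc = _mm512_setzero_pd();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm512_add_pd(acc, _mm512_loadu_pd(a + i));
    double total = _mm512_reduce_add_pd(acc);  // horizontal add of the 8 lanes
    for (; i < n; ++i) total += a[i];          // leftover elements
    return total;
#else
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) total += a[i];
    return total;
#endif
}
```

The four sums of the regression could each be accumulated this way, with `_mm512_fmadd_pd` handling the products.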

I've never done anything in Julia, so I can't say anything about it. Python + NumPy would be very much slower, I believe. Recoding the C/C++ parts in Java might be anywhere from somewhat slower to a lot slower -- I don't know.
 
  • #4
jedishrfu said:
It would be interesting to run it with various languages C vs Java vs Julia vs Python.
Maybe, but on a totally different level: that would simply show whether a particular language implementation is capable of exploiting the efficiencies of AVX-512, CUDA, or both.
jedishrfu said:
Python has NumPy, although I'm not sure of its CUDA capabilities.
NumPy itself has none: for CUDA you need Numba. Which I think demonstrates my point: massively parallel numeric computing does not speed up everything you do; it only speeds up code that is written and/or compiled specifically to take advantage of it.
 
  • #5
Mark44 said:
If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.
Yes it does. With 1024 cores you might expect the GPU to be at least 10x faster than the 10 × 512 / 64 = 80 parallel 64-bit computations of 10 AVX-512 cores. However, I wouldn't expect this GPU to shine in this test:
  • 64-bit performance on the P2000 is not great (95 GFLOPS, vs. 3 TFLOPS for 32-bit).
  • ##2^{18} \times 8## bytes is 2 GB; in 8 ms that's 25 GB/s. PCIe 3 is 32 GB/s, so I think this is the bottleneck rather than core processing.
 
  • #6
pbuk said:
Yes it does. With 1024 cores you might expect the GPU to be at least 10x faster than the 10 × 512 / 64 = 80 parallel 64-bit computations of 10 AVX-512 cores.
But the AVX code isn't running in parallel across cores, at least not based on anything I did. I've done some experimentation in the past with splitting a program into threads, but there is so much overhead relative to the small amount of work I'm doing that it takes far longer with multiple threads than with a single thread.
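To illustrate what splitting such a reduction across threads involves (a generic sketch, not the code from these experiments), each thread sums its own slice into a private slot, and the main thread combines the partials. For inputs this small, creating and joining the threads can cost more than the summing itself, which matches the experience described above:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split a sum across nthreads worker threads.
double threaded_sum(const std::vector<double>& a, unsigned nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (a.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t lo = t * chunk;
            const std::size_t hi = std::min(a.size(), lo + chunk);
            double s = 0.0;
            for (std::size_t i = lo; i < hi; ++i) s += a[i];
            partial[t] = s;  // each thread writes only its own slot: no locking
        });
    }
    for (auto& w : workers) w.join();
    double total = 0.0;
    for (double s : partial) total += s;
    return total;
}
```

Thread creation and join overhead is on the order of tens of microseconds per thread, so with only a few arithmetic operations per element the serial loop often wins.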
pbuk said:
##2^{18} \times 8## bytes is 2 GB
I didn't mention it, but the points are all doubles, so both programs are working with 2 GB of data.
 
  • #7
Mark44 said:
But the AVX code isn't running in parallel, at least not based on anything I did.
Well, it is executing 512 / 64 = 8 operations in parallel per core, but with only one core that is just 8 parallel operations vs. 1024 for the GPU.

Mark44 said:
I didn't mention it, but the points are all doubles, so both programs are working with 2 GB of data.
Yes, that's what I assumed, but 2 GB takes at least 6 ms over PCIe 3, or 8 ms over DDR4-3200, so CPU and GPU performance are both dominated by bus bandwidth in this test. You need something more arithmetically intensive than the 6 FLOPs per 2 × 64-bit data point of simple linear least-squares regression before the extra cores and the GDDR5 bandwidth of the GPU can make a difference.
 
  • #8
Mark44 said:
If this piques the interest of enough people, I'll write something up explaining what I did. If not, I won't.
I would be interested in reading it.
 
  • #9
I'm still working on things. I'm doing it in two Insights articles, one for the CUDA part, and one for the AVX-512 part. The AVX-512 part will also include some timing comparisons. The CUDA part is pretty well done, but I haven't started on the AVX-512 writeup just yet -- it's tax season and I've been working on gathering info for the federal and state returns for my mother's estate.
 
  • #10
Blessings to you and yours in this time of your bereavement. Among your readership, I, and I'm confident many others, look forward to more of your good writings. Please carry on when you're ready.
 
  • #11
I'm nearly done with two articles -- one on a CUDA application and the other that does approximately the same thing in AVX-512. One article is finished, and the other is nearly finished. I'm hoping to get them published by the end of this week, maybe.
Edit: I contacted Greg by PM, and he said that the earliest the articles could be published was next Tuesday.
 

FAQ: Interest in post comparing nVidia CUDA code vs. Intel AVX-512 code?

What is the difference between nVidia CUDA code and Intel AVX-512 code?

nVidia CUDA is a platform for parallel computing on GPUs, while Intel AVX-512 is an instruction-set extension for vector (SIMD) processing on CPUs. This means that CUDA code is optimized for tasks that can be broken down into many smaller, parallel computations, while AVX-512 code is designed for tasks that can benefit from vectorization.

Which type of code is better for scientific simulations?

This depends on the specific needs of the simulation. CUDA code may be better for simulations that require a lot of parallel computations, while AVX-512 code may be better for simulations that require a lot of vector processing. It is important to carefully consider the requirements of the simulation before choosing a code type.

Is one type of code faster than the other?

It is difficult to make a general statement about the speed of CUDA vs. AVX-512 code. Both types of code have their strengths and weaknesses, and their performance may vary depending on the specific task and hardware being used. Benchmarking and testing are necessary to determine which code type is faster for a particular application.

Can CUDA and AVX-512 code be used together?

Yes, it is possible to use both types of code together in certain situations. For example, a program may use CUDA code for the majority of the computations, but also incorporate AVX-512 code for certain vector processing tasks. This can potentially lead to improved performance and efficiency.

Are there any limitations to using CUDA or AVX-512 code?

Both CUDA and AVX-512 have specific hardware and software requirements, so there may be limitations in terms of the systems and programs that can support them. Additionally, not all tasks may be suitable for parallel or vector processing, so it is important to carefully evaluate the needs of a project before deciding to use either type of code.
