Not a question - I just got started writing CUDA code to run on an NVIDIA card I bought a couple of months ago. I've been interested in highly performant code for a long time and have spent some time in the last few years tinkering with the technologies featured in Intel and AMD CPUs. These include code that directly accesses the floating-point unit (FPU), as well as the SIMD extensions added over the years, such as MMX, SSE, AVX, and a whole alphabet soup of other technologies.
The CUDA stuff offered by NVIDIA is pretty cool, making it possible to write some really fast code via massive parallelism.
Here's a very simple example to give you a little of the flavor of what's going on. There's a lot that's not shown, but there's enough here to show the power that's available.
The first snippet is "host" code, code that is mostly plain old C or C++ that will run on the CPU. Here we have two 50,000-element arrays being initialized. The "h" prefixes are reminders that these arrays live in host memory.
C:
// Host code
int numElements = 50000;
for (int i = 0; i < numElements; ++i)
{
    h_A[i] = .0625 * i;
    h_B[i] = h_A[i] + .0625;
}
.
.
.
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
.
.
.

The part that isn't plain old C code is the line between the ellipses - the call with the triple angle brackets. That's a CUDA extension to C that calls "device" code, code that runs on the graphics processing unit (GPU). A part that I omitted is the business of copying the two arrays on the host to corresponding arrays on the device; these latter arrays have "d" prefixes.
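In case it's helpful, here's roughly what that omitted plumbing looks like. This is a minimal sketch using the standard CUDA runtime API - my reconstruction for illustration, not the exact code from my program - and it leaves out the error checking that real code needs. The h_C array is a hypothetical host-side array for the result:

C:
// Hypothetical setup/teardown around the loop and launch shown above.
size_t size = numElements * sizeof(float);
float *d_A, *d_B, *d_C;
cudaMalloc((void **)&d_A, size);
cudaMalloc((void **)&d_B, size);
cudaMalloc((void **)&d_C, size);

// Copy the two host arrays to the device.
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

// Pick a launch configuration: enough threads to cover all 50,000 elements.
int threadsPerBlock = 256;   // my choice; any reasonable block size works
int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

// ... kernel launch goes here ...

// Copy the result back (h_C is a hypothetical host array for the sum)
// and release the device memory.
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);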
Below is the code for the function that adds the two vectors. A close look suggests that it adds one component of one array to the corresponding component of another and stores the result at the same index of a third array. In short, this code just adds two numbers together and stores their sum in a third. I thought we were adding two 50,000-element arrays. What gives here?
C:
// Device code
/* Vector addition: C = A + B.
 * This sample implements element-by-element vector addition.
 */
__global__ void vectorAdd(const float *A, const float *B, float *C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;   // this thread's global index
    if (i < N)
        C[i] = A[i] + B[i];
}

What's happening under the hood is that the CUDA runtime is generating lots of threads - hundreds or thousands of them - and they're all running concurrently. So instead of iterating through the two arrays sequentially, the additions are done in parallel, in very little time; for data-parallel work like this, a GPU can be an order of magnitude faster than even the latest CPUs.
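One detail worth calling out is the if (i < N) guard. Because the launch rounds the thread count up to a whole number of blocks, a few threads end up with indices past the end of the arrays. Working through the arithmetic with the block size I assumed above (threadsPerBlock = 256 - my pick, not something fixed by CUDA):

C:
// numElements (N) = 50000, threadsPerBlock = 256 (assumed)
// blocksPerGrid = (50000 + 255) / 256 = 196 blocks
// total threads = 196 * 256 = 50176
//
// Threads 50000 through 50175 compute an index i >= N,
// so the guard keeps them from writing past the end of C.
int i = blockDim.x * blockIdx.x + threadIdx.x;   // ranges 0 .. 50175 across the grid
if (i < N)                                       // only 0 .. 49999 do the add
    C[i] = A[i] + B[i];

Every element still gets exactly one thread, which is why the kernel needs no loop at all.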
Anyway, I'm pretty excited about this and thought I'd share what I've learned so far (which isn't very much - I'm still pretty much a n00b).