Why are GPUs fast?

In summary, GPUs (graphics processing units) are fast because they are specifically designed for highly parallel computation. They have thousands of smaller processing cores that can work on many pieces of data simultaneously, making them more efficient than traditional CPUs for tasks such as rendering graphics, video editing, and machine learning. GPUs also have dedicated, specially optimized memory that lets them move large amounts of data quickly. This makes them ideal for applications that need high-speed, efficient processing, and an essential component in modern computers and devices.
  • #1
Vanadium 50
I have said here a few times that GPUs are not fast, they are wide. But is that the whole story? I can see several differences, and I am curious as to which ones make the biggest difference.
  1. They are wide. Instead of 64 bits, they can be 2048 bits, and thus execute code 32x faster.
  2. The card doesn't have to run everything. If it's ill-suited, dump it on the CPU. The CPU has no choice.
  3. CPUs use a lot of silicon to do branch prediction. The secret to fast GPU code is "never branch if you can help it" so that silicon can be used for other things.
  4. Half-precision is a "thing". Doubles the speed if you don't need full precision (see the sketch after this list).
  5. GPU memory is optimized differently than CPU memory.
  6. Programming a GPU takes some skill, so doing it badly is harder.
So, which ones make the most difference? Have I missed any important ones?
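To make item 4 concrete, here is a minimal sketch of what using half precision can look like, assuming a CUDA-capable card of compute capability 5.3 or newer; the kernel and variable names are made up for illustration. Each thread works on a packed pair of 16-bit values, so one instruction covers two multiplies.
Code:
#include <cuda_fp16.h>

// Each thread scales a packed pair of fp16 values: one instruction, two multiplies.
__global__ void scale_half2(const __half2 *x, __half2 s, __half2 *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hmul2(x[i], s);
}
Whether this actually doubles the throughput depends on the hardware, but it shows where the factor of two comes from.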
 
  • Like
Likes Rive, Greg Bernhardt and FactChecker
  • #2
A GPU is a vector processor, like an early Cray supercomputer, on one chip.

It is not the thousand bit processor width, but the thousand processor width, that makes the difference. There can be one 32-bit processor for each pixel across the screen. The optimum SIMD.
 
  • Like
Likes FactChecker
  • #3
The IBM Cell processor paired a general-purpose CPU core with eight special-purpose processing units. These synergistic processing elements (SPEs) ran very fast until you branched. The Cell was effectively a hybrid, combining generic CPU features with GPU-like vector processing.

https://en.wikipedia.org/wiki/Cell_(processor)

Algorithms with minimal branching ran eight times faster, but when branching occurred, the deep pipeline had to be flushed and reloaded with new instructions, cutting the chip's effective speedup to roughly a factor of two.

One of the Cell processor's core issues was the effort required to parallelize more algorithms to extract more speed from the chip. Sadly, in practice the speedup was often only about a factor of two, so Intel's multi-core chips superseded it.

These chips were used in the PS3:

CPU: “Cell Broadband Engine,” a unique processor co-developed by Sony, Toshiba, and IBM, running at 3.2 GHz.
 
  • Informative
Likes FactChecker
  • #4
Vanadium 50 said:
Programming a GPU takes some skill, so doing it badly is harder.
That sounds backwards. If programming a GPU takes skill, then doing it badly should be easy, not harder than for a CPU.
 
  • #5
Baluncore said:
A GPU is a vector processor, like an early Cray supercomputer, on one chip.
I don't understand the distinction you are drawing, but glad you used the term SIMD.

If I want to do 32 logical ANDs on an x86, I give the ALU these values one at a time, and have it compute them. If I want to do this in a GPU, I load up a 2048 bit "register" with all 32 numbers and send them to the ALU at once.

I also have the freedom to arrange these 2048 bits differently if I choose - 64 32-bit numbers, 256 8-bit numbers, etc.

GPUs are fast because they are wide - they take big bites... er... bytes... er... no, bites.
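In CUDA terms, a minimal sketch of that picture (the kernel name is made up): launch exactly one warp of 32 threads and give each lane one 64-bit AND. The warp executes the AND as a single instruction across all lanes, which is the 2048-bit "register" view - 32 lanes x 64 bits.
Code:
// One warp, one 64-bit AND per lane: 32 lanes x 64 bits = 2048 bits per instruction.
__global__ void and32(const unsigned long long *a,
                      const unsigned long long *b,
                      unsigned long long *c)
{
    int lane = threadIdx.x;       // 0..31
    c[lane] = a[lane] & b[lane];
}

// launched as: and32<<<1, 32>>>(d_a, d_b, d_c);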
 
  • #6
@Vanadium 50 what am I missing in post #4? I assume you have a reason for your item 6 but I can't see what it would be.
 
  • #7
phinds said:
then doing it badly should be easy, not harder than for a CPU.
Because if you do it badly, it won't work at all. :smile:

The concept of being able to hack something together like spaghetti-python on a GPU isn't really a thing. You need to know enough about data structures and data motion to get the data on and off the GPU. Or nothing will happen.
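As a concrete illustration of "getting the data on and off", here is a minimal host-side sketch using the CUDA runtime API; h_a, h_b, d_a, d_b, kernel, blocks and threads are placeholders, not anything from this thread. Nothing runs on the card until the inputs have been allocated and copied over, and the result is invisible until it is copied back.
Code:
size_t bytes = n * sizeof(float);
float *d_a = nullptr, *d_b = nullptr;

cudaMalloc(&d_a, bytes);                                // allocate on the card
cudaMalloc(&d_b, bytes);
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // host -> device: real time spent here
kernel<<<blocks, threads>>>(d_a, d_b, n);               // the actual GPU work
cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost);    // device -> host: more time spent here
cudaFree(d_a);
cudaFree(d_b);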
 
  • Informative
  • Like
Likes jedishrfu and phinds
  • #8
However, as with most programming, doing it badly may still appear to work until edge cases are found where it fails; and then there is the kind of bad where it just doesn't work at all.
 
  • #9
Fundamentally, the thing that absolutely kills GPU performance is data motion. "Bad" GPU code has too much data motion - moving data on and off the card unnecessarily. While it is always possible to write bad code, you don't usually get a lot of this because it takes more writing.

This assumes that someone already knows that this is a SIMD architecture, so a bramble of branches is a mistake.
 
  • #10
Vanadium 50 said:
If I want to do 32 logical ANDs on an x86, I give the ALU these values one at a time, and have it compute them. If I want to do this in a GPU, I load up a 2048 bit "register" with all 32 numbers and send them to the ALU at once.
Not necessarily one at a time. The computer I'm using right now (Dell tower with Xeon Silver CPU) has 32 64-byte-wide registers that can process 32 16-bit quantities in a single operation. Not quite the bandwidth of the GPU on this computer, but still well above the capabilities of the normal x86 CPU.
 
  • Like
Likes jedishrfu
  • #11
Yes, and both CPUs and GPUs contain multiple cores. And if I really only had 32 ANDs, there would be no point in sending them to the GPU at all. 32 million, on the other hand....
 
  • #12
Mark44 said:
Not necessarily one at a time.
So let's discuss how GPUs work. Suppose I wanted to calculate π by the world's slowest algorithm.

[tex]\frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7}... [/tex].

So, I code that up:
Code:
for N = 1 to many
   if N is odd
      sum = sum + 1/(2N-1)
   else
      sum = sum - 1/(2N-1)
   end if
end for

And I discover it is slow.

The thing that's killing it is the branch. Because what's really going on is a 2048-bit calculation, I can't branch on some and not on others. So the GPU can only run 64 bits at a time. This costs me a factor of 32.

OK, so let's fix this. Let's do all the odds and then the evens, and then subtract them. That doesn't work because those series are divergent.

All right, so, maybe we send it an array and have it add them: hand it (1, -3, +5, -7, ...) and so on and have the GPU do the math. This is less bad, but you are spending a lot of CPU time creating that array, and probably more time getting it onto the GPU than the GPU spends on the calculation.

No, the best thing to do is to remove the branch:

Code:
for N = 1 to many step 2
    sum = sum + 1/(2N-1) -1/(2N+1)
end for
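For what it's worth, a minimal CUDA sketch of that branch-free version, where pair i is 1/(4i+1) - 1/(4i+3): every thread in a warp runs the same code path, so nothing diverges. The kernel name is made up, the atomicAdd is only there to keep the sketch short (a proper reduction would be faster), and double-precision atomicAdd assumes compute capability 6.0 or newer.
Code:
__global__ void pi_pairs(double *sum, long long pairs)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < pairs) {
        double term = 1.0 / (4.0 * i + 1.0) - 1.0 / (4.0 * i + 3.0);  // one pair per thread
        atomicAdd(sum, term);   // quick-and-dirty reduction
    }
}

// host side: pi_pairs<<<(pairs + 255) / 256, 256>>>(d_sum, pairs);  then pi is about 4 * sum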
 
  • #13
Vanadium 50 said:
No, the best thing to do is to remove the branch:

Code:
for N = 1 to many step 2
    sum = sum + 1/(2N-1) -1/(2N+1)
end for
You still have a branch, one that comes after each iteration of the loop. Of course, modern processors incorporate branch-prediction logic that can lessen the impact of having to invalidate the processor pipeline.

An improvement to your loop with its single instruction that adds two terms per iteration is to add four or eight or sixteen or more per loop iteration. Even so, it's not clear to me how to take advantage of SIMD (single instruction multiple data) instructions with this algorithm that would take advantage of the parallelism capabilities of GPUs.
 
  • Like
Likes Vanadium 50 and jedishrfu
  • #14
##\pi/4 = 1-1/3+1/5-1/7+1/9-...## is not a hard calculation to make parallel. The calculation on a computer must be finite. You can set a total limit of summing 10 million terms and calculate 10 summations of one million each in parallel. The lack of any dependence between the parallel calculations is key.
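A minimal CUDA sketch of that decomposition, generalized to one slice per thread (names are made up): each thread owns an independent slice of the series, sums it into one partial result, and the much smaller array of partials is added up afterwards. No thread depends on any other.
Code:
__global__ void pi_chunks(double *partial, long long pairs_per_thread)
{
    long long t = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long first = t * pairs_per_thread;               // this thread's slice
    double s = 0.0;
    for (long long k = first; k < first + pairs_per_thread; ++k)
        s += 1.0 / (4.0 * k + 1.0) - 1.0 / (4.0 * k + 3.0);
    partial[t] = s;                                        // no dependence between threads
}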
 
  • #15
Mark44 said:
You still have a branch, one that comes after each iteration of the loop.
Yes, but all 32 take the same branch - not 16 in one direction and 16 in the other. That's the key - think of it as one giant execution unit operating on 2k bit words.

Branch prediction is a big area of research. Unfortunately, it is running against compiler optimization. The simplest thing to do in prediction is "always go back; it's probably a loop" which stops working when the optimizer starts unrolling loops.

You could lump these together 4 at a time, 8 at a time, whatever. What does that save? Not computation time - the computations are the same. It saves you the time it takes to do the reduction. There is also a minor issue - "many" gets partitioned out to all the 2K execution units there are. If you make each chunk of the problem too big, the compiler can't distribute it among all the execution units.
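To make the "all 32 take the same branch" point concrete, a hedged CUDA sketch with made-up kernel names: in the first kernel the branch depends only on the loop counter, so every lane in a warp agrees and the warp never splits; in the second, lanes within a warp disagree, and the hardware runs the two sides one after the other.
Code:
__global__ void uniform_branch(float *x, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];
    for (int k = 0; k < iters; ++k)   // same trip count in every lane: a uniform branch
        v = 0.5f * v + 1.0f;
    x[i] = v;
}

__global__ void divergent_branch(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x & 1)              // odd and even lanes take different paths
        x[i] = expf(x[i]);            // even lanes idle while this runs
    else
        x[i] = logf(x[i]);            // ...and odd lanes idle while this runs
}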
 
  • #16
Vanadium 50 said:
So let's discuss how GPUs work. Suppose I wanted to calculate π by the world's slowest algorithm.
To evaluate the series in parallel, group the terms in pairs: [1/n - 1/(n+2)].
Start with the smallest terms on the right first, so round-off errors are minimised.

Initialise each processor with a value n. Each processor unit evaluates two reciprocals by long division, and their difference. Other processors are evaluating the difference of other pairs of terms, in parallel.

The terms are then summed from right to left as they are generated across the processor array.
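A minimal host-side sketch of that right-to-left summation, assuming the pair differences have already come back in an array called partials with the largest first (the names are made up): adding the smallest terms first keeps them from being swallowed by rounding.
Code:
double total = 0.0;
for (long long i = npairs - 1; i >= 0; --i)   // right to left: smallest magnitudes first
    total += partials[i];
double pi_estimate = 4.0 * total;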
 
  • #17
Vanadium 50 said:
Branch prediction is a big area of research. Unfortunately, it is running against compiler optimization. The simplest thing to do in prediction is "always go back; it's probably a loop" which stops working when the optimizer starts unrolling loops.
It's my understanding that a lot of the benefit of branch prediction comes from starting early retrieval of data from slower memory. I don't see how unrolling could hurt that.
 
  • #18
If you don't unroll, you guess right very often with very little circuitry. If you do unroll, you guess right less often and you need more transistors to guess well.
 
  • Like
Likes FactChecker
  • #19
Vanadium 50 said:
If you don't unroll, you guess right very often with very little circuitry. If you do unroll, you guess right less often and you need more transistors to guess well.
I guess I don't understand what assembly code is produced by unrolling.

ADDED: The way I imagined unrolling would remove the conditional branch.
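For what it's worth, a sketch of what unrolling the earlier branch-free loop by four might look like (illustrative host code, not actual compiler output; "many" is whatever bound you picked): the loop-end branch is taken a quarter as often, but unless the trip count is a compile-time constant it does not go away.
Code:
double sum = 0.0;
long long N;
for (N = 1; N + 6 <= many; N += 8) {           // four pairs per iteration
    sum += 1.0 / (2*N - 1)  - 1.0 / (2*N + 1)
         + 1.0 / (2*N + 3)  - 1.0 / (2*N + 5)
         + 1.0 / (2*N + 7)  - 1.0 / (2*N + 9)
         + 1.0 / (2*N + 11) - 1.0 / (2*N + 13);
}
for (; N <= many; N += 2)                      // leftover pairs
    sum += 1.0 / (2*N - 1) - 1.0 / (2*N + 1);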
 
  • #20
The internals of the processor elements will need the ability to perform a long division, without branching. That will require a comparison or subtraction, followed by a result determined register selection.
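A minimal sketch of that kind of branch-free step, using a restoring long division on 16-bit inputs (the function name is made up, and it assumes den > 0): each bit is decided by a comparison whose 0-or-1 result selects the subtraction and the quotient bit, with no if/else. The fixed-count loop is the same in every lane, so it never diverges, and it can be fully unrolled.
Code:
__host__ __device__ unsigned int div16_branchless(unsigned int num, unsigned int den)
{
    unsigned int rem = 0, quot = 0;
    #pragma unroll
    for (int bit = 15; bit >= 0; --bit) {
        rem = (rem << 1) | ((num >> bit) & 1u);   // bring down the next numerator bit
        unsigned int take = (rem >= den);          // comparison gives 0 or 1, no branch
        rem  -= take * den;                        // subtract only when it fits
        quot |= take << bit;                       // record the quotient bit
    }
    return quot;
}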
 
  • #21
You can definitely improve on the calculation with algebra. However, my general reaction is to let the compiler figure it out.

The first step in performance is to get all the hardware working on the problem.
 