Why are GPUs fast?

In summary, GPUs (graphics processing units) are fast because they are specifically designed for highly parallel computation. They have thousands of smaller processing cores that can work on many pieces of data simultaneously, making them more efficient than traditional CPUs for tasks such as rendering graphics, video editing, and machine learning. GPUs also have dedicated, specially optimized memory that lets them move large amounts of data quickly. This makes them ideal for applications that need high-speed, efficient processing, and an essential component in modern computers and devices.
  • #1
Vanadium 50
I have said here a few times that GPUs are not fast, they are wide. But is that the whole story? I can see several differences, and I am curious as to which ones make the biggest difference.
  1. They are wide. Instead of 64 bits, they can be 2048 bits, and thus execute code 32x faster.
  2. The card doesn't have to run everything. If it's ill-suited, dump it on the CPU. The CPU has no choice.
  3. CPUs use a lot of silicon to do branch prediction. The secret to fast GPU code is "never branch if you can help it" so that silicon can be used for other things.
  4. Half-precision is a "thing". Doubles the speed if you don't need full precision (see the sketch after this list).
  5. GPU memory is optimized differently than CPU memory.
  6. Programming a GPU takes some skill, so doing it badly is harder.
So, which ones make the most difference? Have I missed any important ones?
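To make item 4 concrete, here is a minimal sketch of what using half precision can look like, assuming a CUDA-capable card of compute capability 5.3 or newer; the kernel and variable names are made up for illustration. Each thread works on a packed pair of 16-bit values, so one instruction covers two multiplies.
Code:
#include <cuda_fp16.h>

// Each thread scales a packed pair of fp16 values: one instruction, two multiplies.
__global__ void scale_half2(const __half2 *x, __half2 s, __half2 *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hmul2(x[i], s);
}
Whether this actually doubles the throughput depends on the hardware, but it shows where the factor of two comes from.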
 
  • Like
Likes Rive, Greg Bernhardt and FactChecker
  • #2
A GPU is a vector processor, like an early Cray supercomputer, on one chip.

It is not the thousand bit processor width, but the thousand processor width, that makes the difference. There can be one 32-bit processor for each pixel across the screen. The optimum SIMD.
 
  • Like
Likes FactChecker
  • #3
The IBM Cell processor paired a general-purpose CPU core with eight special-purpose processing units. These synergistic processing elements (SPEs) ran very fast until you branched. The Cell was effectively a hybrid, combining generic CPU features with GPU-like vector processing.

https://en.wikipedia.org/wiki/Cell_(processor)

Algorithms with minimal branching ran eight times faster, but when branching occurred, the deep pipeline had to be flushed and reloaded with new instructions, cutting the chip's effective speedup to roughly a factor of two.

One of the Cell processor's core issues was the effort required to parallelize more algorithms to extract more speed from the chip. Sadly, in practice the speedup was often only about a factor of two, so Intel's multi-core chips superseded it.

These chips were used in the PS3:

CPU: “Cell Broadband Engine,” a unique processor co-developed by Sony, Toshiba, and IBM, running at 3.2 GHz.
 
  • Informative
Likes FactChecker
  • #4
Vanadium 50 said:
Programming a GPU takes some skill, so doing it badly is harder.
That sounds backwards. If programming a GPU takes skill, then doing it badly should be easy, not harder than for a CPU.
 
  • #5
Baluncore said:
A GPU is a vector processor, like an early Cray supercomputer, on one chip.
I don't understand the distinction you are drawing, but glad you used the term SIMD.

If I want to do 32 logical ANDs on an x86, I give the ALU these values one at a time, and have it compute them. If I want to do this in a GPU, I load up a 2048 bit "register" with all 32 numbers and send them to the ALU at once.

I also have the freedom to arrange these 2048 bits differently if I choose - 64 32-bit numbers, 256 8-bit numbers, etc.

GPUs are fast because they are wide - they take big bites... er... bytes... er... no, bites.
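In CUDA terms, a minimal sketch of that picture (the kernel name is made up): launch exactly one warp of 32 threads and give each lane one 64-bit AND. The warp executes the AND as a single instruction across all lanes, which is the 2048-bit "register" view - 32 lanes x 64 bits.
Code:
// One warp, one 64-bit AND per lane: 32 lanes x 64 bits = 2048 bits per instruction.
__global__ void and32(const unsigned long long *a,
                      const unsigned long long *b,
                      unsigned long long *c)
{
    int lane = threadIdx.x;       // 0..31
    c[lane] = a[lane] & b[lane];
}

// launched as: and32<<<1, 32>>>(d_a, d_b, d_c);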
 
  • #6
@Vanadium 50 what am I missing in post #4? I assume you have a reason for your item 6 but I can't see what it would be.
 
  • #7
phinds said:
then doing it badly should be easy, not harder than for a CPU.
Because if you do it badly, it won't work at all. :smile:

The concept of being able to hack something together like spaghetti-python on a GPU isn't really a thing. You need to know enough about data structures and data motion to get the data on and off the GPU. Or nothing will happen.
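As a concrete illustration of "getting the data on and off", here is a minimal host-side sketch using the CUDA runtime API; h_a, h_b, d_a, d_b, kernel, blocks and threads are placeholders, not anything from this thread. Nothing runs on the card until the inputs have been allocated and copied over, and the result is invisible until it is copied back.
Code:
size_t bytes = n * sizeof(float);
float *d_a = nullptr, *d_b = nullptr;

cudaMalloc(&d_a, bytes);                                // allocate on the card
cudaMalloc(&d_b, bytes);
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);    // host -> device: real time spent here
kernel<<<blocks, threads>>>(d_a, d_b, n);               // the actual GPU work
cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost);    // device -> host: more time spent here
cudaFree(d_a);
cudaFree(d_b);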
 
  • Informative
  • Like
Likes jedishrfu and phinds
  • #8
However, as with most programming, doing it badly may still appear to work until edge cases are found where it fails; and then there is the kind of bad where it just doesn't work at all.
 
  • #9
Fundamentally, the thing that absolutely kills GPU performance is data motion. "Bad" GPU code has too much data motion - moving data on and off the card unnecessarily. While it is always possible to write bad code, you don't usually get a lot of this because it takes more writing.

This assumes that someone already knows that this is a SIMD architecture, so a bramble of branches is a mistake.
 
  • #10
Vanadium 50 said:
If I want to do 32 logical ANDs on an x86, I give the ALU these values one at a time, and have it compute them. If I want to do this in a GPU, I load up a 2048 bit "register" with all 32 numbers and send them to the ALU at once.
Not necessarily one at a time. The computer I'm using right now (Dell tower with Xeon Silver CPU) has 32 64-byte-wide registers that can process 32 16-bit quantities in a single operation. Not quite the bandwidth of the GPU on this computer, but still well above the capabilities of the normal x86 CPU.
 
  • Like
Likes jedishrfu
  • #11
Yes, and both CPUs and GPUs contain multiple cores. And if I really only had 32 ANDs, there would be no point in sending them to the GPU at all. 32 million, on the other hand....
 
  • #12
Mark44 said:
Not necessarily one at a time.
So let's discuss how GPUs work. Suppose I wanted to calculate π by the world's slowest algorithm.

[tex]\frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7}... [/tex].

So, I code that up:
Code:
for N = 1 to many
   if N is odd
      sum = sum + 1/(2N-1)
   else
      sum = sum - 1/(2N-1)
   end if
end for

And I discover it is slow.

The thing that's killing it is the branch. Because what's really going on is a 2048-bit calculation, I can't branch on some and not on others. So the GPU can only run 64 bits at a time. This costs me a factor of 32.

OK, so let's fix this. Let's do all the odds and then the evens, and then subtract them. That doesn't work because those series are divergent.

All right, so, maybe we send it an array and have it add them: hand it (1, -3, +5, -7, ...) and so on and have the GPU do the math. This is less bad, but you are spending a lot of CPU time creating that array, and probably more time getting it onto the GPU than the GPU spends on the calculation.

No, the best thing to do is to remove the branch:

Code:
for N = 1 to many step 2
    sum = sum + 1/(2N-1) -1/(2N+1)
end for
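For what it's worth, a minimal CUDA sketch of that branch-free version, where pair i is 1/(4i+1) - 1/(4i+3): every thread in a warp runs the same code path, so nothing diverges. The kernel name is made up, the atomicAdd is only there to keep the sketch short (a proper reduction would be faster), and double-precision atomicAdd assumes compute capability 6.0 or newer.
Code:
__global__ void pi_pairs(double *sum, long long pairs)
{
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < pairs) {
        double term = 1.0 / (4.0 * i + 1.0) - 1.0 / (4.0 * i + 3.0);  // one pair per thread
        atomicAdd(sum, term);   // quick-and-dirty reduction
    }
}

// host side: pi_pairs<<<(pairs + 255) / 256, 256>>>(d_sum, pairs);  then pi is about 4 * sum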
 
  • #13
Vanadium 50 said:
No, the best thing to do is to remove the branch:

Code:
for N = 1 to many step 2
    sum = sum + 1/(2N-1) -1/(2N+1)
end for
You still have a branch, one that comes after each iteration of the loop. Of course, modern processors incorporate branch-prediction logic that can lessen the impact of having to invalidate the processor pipeline.

An improvement to your loop with its single instruction that adds two terms per iteration is to add four or eight or sixteen or more per loop iteration. Even so, it's not clear to me how to take advantage of SIMD (single instruction multiple data) instructions with this algorithm that would take advantage of the parallelism capabilities of GPUs.
 
  • Like
Likes Vanadium 50 and jedishrfu
  • #14
##\pi/4 = 1-1/3+1/5-1/7+1/9-...## is not a hard calculation to make parallel. The calculation on a computer must be finite. You can set a total limit of summing 10 million terms and calculate 10 summations of one million each in parallel. The lack of any dependence between the parallel calculations is key.
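A minimal CUDA sketch of that decomposition, generalized to one slice per thread (names are made up): each thread owns an independent slice of the series, sums it into one partial result, and the much smaller array of partials is added up afterwards. No thread depends on any other.
Code:
__global__ void pi_chunks(double *partial, long long pairs_per_thread)
{
    long long t = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    long long first = t * pairs_per_thread;               // this thread's slice
    double s = 0.0;
    for (long long k = first; k < first + pairs_per_thread; ++k)
        s += 1.0 / (4.0 * k + 1.0) - 1.0 / (4.0 * k + 3.0);
    partial[t] = s;                                        // no dependence between threads
}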
 
  • #15
Mark44 said:
You still have a branch, one that comes after each iteration of the loop.
Yes, but all 32 take the same branch - not 16 in one direction and 16 in the other. That's the key - think of it as one giant execution unit operating on 2k bit words.

Branch prediction is a big area of research. Unfortunately, it is running against compiler optimization. The simplest thing to do in prediction is "always go back; it's probably a loop" which stops working when the optimizer starts unrolling loops.

You could lump these together 4 at a time, 8 at a time, whatever. What does that save? Not computation time - the computations are the same. It saves you the time it takes to do the reduction. There is also a minor issue - "many" gets partitioned out to all the 2K execution units there are. If you make each chunk of the problem too big, the compiler can't distribute it among all the execution units.
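To make the "all 32 take the same branch" point concrete, a hedged CUDA sketch with made-up kernel names: in the first kernel the branch depends only on the loop counter, so every lane in a warp agrees and the warp never splits; in the second, lanes within a warp disagree, and the hardware runs the two sides one after the other.
Code:
__global__ void uniform_branch(float *x, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];
    for (int k = 0; k < iters; ++k)   // same trip count in every lane: a uniform branch
        v = 0.5f * v + 1.0f;
    x[i] = v;
}

__global__ void divergent_branch(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x & 1)              // odd and even lanes take different paths
        x[i] = expf(x[i]);            // even lanes idle while this runs
    else
        x[i] = logf(x[i]);            // ...and odd lanes idle while this runs
}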
 
  • #16
Vanadium 50 said:
So let's discuss how GPUs work. Suppose I wanted to calculate π by the world's slowest algorithm.
To evaluate the series in parallel, group the terms in pairs: [1/n - 1/(n+2)].
Start with the smallest terms on the right first, so round-off errors are minimised.

Initialise each processor with a value n. Each processor unit evaluates two reciprocals by long division, and their difference. Other processors are evaluating the difference of other pairs of terms, in parallel.

The terms are then summed from right to left as they are generated across the processor array.
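A minimal host-side sketch of that right-to-left summation, assuming the pair differences have already come back in an array called partials with the largest first (the names are made up): adding the smallest terms first keeps them from being swallowed by rounding.
Code:
double total = 0.0;
for (long long i = npairs - 1; i >= 0; --i)   // right to left: smallest magnitudes first
    total += partials[i];
double pi_estimate = 4.0 * total;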
 
  • #17
Vanadium 50 said:
Branch prediction is a big area of research. Unfortunately, it is running against compiler optimization. The simplest thing to do in prediction is "always go back; it's probably a loop" which stops working when the optimizer starts unrolling loops.
It's my understanding that a lot of the benefit of branch prediction comes from starting early retrieval of data from slower memory. I don't see how unrolling could hurt that.
 
  • #18
If you don't unroll, you guess right very often with very little circuitry. If you do unroll, you guess right less often and you need more transistors to guess well.
 
  • Like
Likes FactChecker
  • #19
Vanadium 50 said:
If you don't unroll, you guess right very often with very little circuitry. If you do unroll, you guess right less often and you need more transistors to guess well.
I guess I don't understand what assembly code is produced by unrolling.

ADDED: The way I imagined unrolling would remove the conditional branch.
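For what it's worth, a sketch of what unrolling the earlier branch-free loop by four might look like (illustrative host code, not actual compiler output; "many" is whatever bound you picked): the loop-end branch is taken a quarter as often, but unless the trip count is a compile-time constant it does not go away.
Code:
double sum = 0.0;
long long N;
for (N = 1; N + 6 <= many; N += 8) {           // four pairs per iteration
    sum += 1.0 / (2*N - 1)  - 1.0 / (2*N + 1)
         + 1.0 / (2*N + 3)  - 1.0 / (2*N + 5)
         + 1.0 / (2*N + 7)  - 1.0 / (2*N + 9)
         + 1.0 / (2*N + 11) - 1.0 / (2*N + 13);
}
for (; N <= many; N += 2)                      // leftover pairs
    sum += 1.0 / (2*N - 1) - 1.0 / (2*N + 1);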
 
  • #20
The internals of the processor elements will need the ability to perform a long division, without branching. That will require a comparison or subtraction, followed by a result determined register selection.
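A minimal sketch of that kind of branch-free step, using a restoring long division on 16-bit inputs (the function name is made up, and it assumes den > 0): each bit is decided by a comparison whose 0-or-1 result selects the subtraction and the quotient bit, with no if/else. The fixed-count loop is the same in every lane, so it never diverges, and it can be fully unrolled.
Code:
__host__ __device__ unsigned int div16_branchless(unsigned int num, unsigned int den)
{
    unsigned int rem = 0, quot = 0;
    #pragma unroll
    for (int bit = 15; bit >= 0; --bit) {
        rem = (rem << 1) | ((num >> bit) & 1u);   // bring down the next numerator bit
        unsigned int take = (rem >= den);          // comparison gives 0 or 1, no branch
        rem  -= take * den;                        // subtract only when it fits
        quot |= take << bit;                       // record the quotient bit
    }
    return quot;
}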
 
  • #21
You can definitely improve on the calculation with algebra. However, my general reaction is to let the compiler figure it out.

The first step in performance is to get all the hardware working on the problem.
 