Branchless programming

Presentations

Branchless programming

How to use modern CPUs effectively

Early CPU architectures

images/intel8080.png

How to make such beast faster?

images/cpu1.png

Classic RISC pipeline

images/pipeline.png

Classic RISC pipeline

Problems in classic RISC pipeline

Data hazards

SUB r3,r4 -> r10     ; writes r3 - r4 to r10
AND r10,r3 -> r11    ; writes r10 & r3 to r11

images/hazard.png

Data hazards

“Bubbles” in pipeline

images/bubble.png

Data forwarding

images/forwarding.png

Branches

Conditional branches

CMP r0, r1     ; compare content of two registers
BNE non_equal  ; branch if not equal

Conditional branches - solutions

Branch delay slot

LD   r0, 0        ; instruction before jump
JMP  foobar       ; jump
ADD  r0, r1 -> r2 ; first delay slot
SUB  r4, r5 -> r6 ; second delay slot
LD   r0, 0        ; instruction before jump
CALL printf       ; function (subroutine) call
ADD  r0, r1 -> r2 ; first delay slot
SUB  r4, r5 -> r6 ; second delay slot
LD   r0, 0        ; instruction before jump
RET               ; return from subroutine
ADD  r0, r1 -> r2 ; first delay slot
SUB  r4, r5 -> r6 ; second delay slot

Branch prediction

images/branch_predictor.png

Modern architectures

Even developers can help

Branchless programming

def min(a, b):
    if a < b:
        return a
    else
        return b
def min_branchless(a, b):
    return a * (a < b) + b * (b <= a)

Branchless programming

for (int i = 0; i < N; i++)
    a[i] = rand() % 100;

volatile int s;

for (int i = 0; i < N; i++)
    if (a[i] < 50)
        s += a[i];
for (int i = 0; i < N; i++)
    s += (a[i] < 50) * a[i];

Graph

Images

Branchless Programming https://en.algorithmica.org/hpc/pipelining/branchless/