C-5.1
5 Code Parallelization

Classes of processors and parallelism:

* Processors with instruction level parallelism (ILP): parallel functional units (FU1, FU2, FU3, ...);
  VLIW and superscalar processors execute several instructions in parallel.
  The compiler analyzes sequential programs to exhibit potential parallelism on the instruction level;
  dependences between computations restrict the parallelized instruction sequence.

* Pipelined processors (pipeline processor model with stages S3, S2, S1):
  sequential code is scheduled for pipelining.
  The compiler arranges instructions for shortest execution time: instruction scheduling.

* Data parallel processors (SIMD): many functional units (FU0 ... FU31) execute the same instruction
  on different data; e. g. the loop

      for i := 0 to 31 do c[i] := a[i] + b[i];

  is one instruction! The compiler analyzes loops to execute them in parallel:
  loop transformation, array transformation.

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 501

Objectives:
3 abstractions of processor parallelism

In the lecture:
* explain the abstract models
* relate to real processors
* explain the instruction scheduling tasks

Suggested reading:
Kastens / Übersetzerbau, Section 8.5

Questions:
* What has to be known about instruction execution in order to solve the instruction scheduling problem in the compiler?

--------------------------------------------------------------------------------
C-5.2
5.1 Instruction Scheduling: Data Dependence Graph

Exhibit potential fine-grained parallelism among operations. Sequential code is over-specified!

Data dependence graph (DDG) for a basic block:
Node: operation; edge a -> b: operation b uses the result of operation a.

Example for a basic block:

   1: t1 := a
   2: t2 := b
   3: t3 := t1 + t2
   4: x  := t3
   5: t4 := c
   6: t5 := t3 + t4
   7: y  := t5
   8: t6 := d
   9: t7 := e
  10: t8 := t6 + t7
  11: z  := t8

The ti are symbolic registers; they store intermediate results and obey the single assignment rule.

[Figure: the data dependence graph of the basic block. Operations 1, 2, 5, 8, 9 are roots; their results t1, t2, t4, t6, t7 flow into operations 3, 6, and 10; t3 flows into 4 and 6, t5 into 7, t8 into 11; the leaves assign x, y, z.]

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 502

Objectives:
DDG exhibits parallelism

In the lecture:
* Show where sequential code is over-specified.
* Derive reordered sequences from the DDG.
* Single assignment for ti: ti contains exactly one value; ti is not reused for other values.
* Without that assumption, further dependences have to manifest the order of assignments to those registers.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5, Abb. 8.5-1

Assignments:
* Write the operations of the basic block in a different order, such that the effect is not changed and the same DDG is produced.

Questions:
* Why does this example have so much freedom for rearranging operations?
* Why are further dependences necessary if registers are allocated?

--------------------------------------------------------------------------------
C-5.3
List Scheduling

Input:  data dependence graph
Output: a schedule of at most k operations per cycle, such that all dependences point forward; the DDG is arranged in levels.

Algorithm:
A ready list contains all operations that are not yet scheduled, but whose predecessors are scheduled.
Iterate: select from the ready list up to k operations for the next cycle (heuristic); update the ready list.

* The algorithm is optimal only for trees.
* Heuristic: keep the ready list sorted by distance to an end node, e. g.

      (1 2 5) (8 9 3) (6 10 4) (7 11)

  without this heuristic: (1 8 9) (2 5 10) (3 11) (6 4) (7)

  ( ) encloses the operations of one cycle.

Scheduled DDG arranged in levels (k = 3):

  cycle 1:  1  2  5
  cycle 2:  3  8  9
  cycle 3:  6  4  10
  cycle 4:  7  11

Critical paths determine the minimal schedule length, e. g. 1 -> 3 -> 6 -> 7.

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 503

Objectives:
A simple fundamental scheduling algorithm

In the lecture:
* Explain the algorithm using the example.
* Show variants of orders in the ready list, and their consequences.
* Explain the heuristic.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5.1

Assignments:
* Write the parallel code for this example.

Questions:
* Explain the heuristic with respect to critical paths.
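Supplementary to this slide (not part of the original notes): a minimal Python sketch of list scheduling with the distance-to-end heuristic, assuming unit execution times and k identical FUs; the data structures and names (succs, dist_to_end, list_schedule) are illustrative. Applied to the DDG of C-5.2 it reproduces the schedule shown above.

    # DDG of slide C-5.2: an edge a -> b means "b uses the result of a".
    succs = {1: [3], 2: [3], 3: [4, 6], 5: [6], 6: [7],
             8: [10], 9: [10], 10: [11], 4: [], 7: [], 11: []}
    preds = {n: [m for m in succs if n in succs[m]] for n in succs}

    def dist_to_end(n):
        # longest path (in edges) from n to an end node of the DDG
        return 0 if not succs[n] else 1 + max(dist_to_end(s) for s in succs[n])

    def list_schedule(k):
        # at most k operations per cycle; dependences point forward
        scheduled, cycles = set(), []
        while len(scheduled) < len(succs):
            ready = [n for n in succs if n not in scheduled
                     and all(p in scheduled for p in preds[n])]
            ready.sort(key=dist_to_end, reverse=True)   # heuristic of C-5.3
            cycles.append(ready[:k])
            scheduled.update(ready[:k])
        return cycles

    print(list_schedule(3))   # [[1, 2, 5], [3, 8, 9], [6, 10, 4], [7, 11]]

Sorting the ready list by the longest remaining path prefers operations on critical paths; without this heuristic the selection may need five cycles instead of four, as noted on the slide.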
--------------------------------------------------------------------------------
C-5.4
Variants and Restrictions for List Scheduling

* Allocate as soon as possible, ASAP (C-5.3), or as late as possible, ALAP.
* Operations have unit execution time (C-5.3); with different execution times the selection avoids conflicts with already allocated operations.
* Operations may run only on specific functional units (e. g. 2 int FUs, 2 float FUs).
* Resource restrictions between operations, e. g. <= 1 load or store per cycle.

The scheduled DDG models the number of needed registers:
* an arc represents the use of an intermediate result,
* the cut width through a level gives the number of registers needed.

  cycle 1:  1  2  5
                          cut width 3
  cycle 2:  3  8  9
                          cut width 4   (the value t3 is used twice)
  cycle 3:  6  4  10
                          cut width 2
  cycle 4:  7  11

The tighter the schedule, the more registers are needed (register pressure).

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 504

Objectives:
A simple fundamental scheduling algorithm

In the lecture:
* Explain ASAP and ALAP.
* Explain restrictions on the selection of operations.
* Show how the register need is modeled.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5.1

Assignments:
* The algorithm allocates an operation as soon as possible (ASAP). Describe a variant of the algorithm which allocates an operation as late as possible (ALAP).
* Describe a variant that allocates operations of different execution times.

Questions:
* Compare the way register need is modeled with the approach of Belady for register allocation.
* Why do tight schedules need more registers?

--------------------------------------------------------------------------------
C-5.5
Instruction Scheduling for Pipelining

Instruction pipeline with 3 stages.
[Figure: pipeline stages 3, 2, 1 processing an instruction sequence I4 I5 nop I6 nop I7 ...]

Dependent instructions may not follow one another immediately.
The schedule rearranges the operation sequence to minimize the number of delays:

  without scheduling:        with scheduling:
   1: t1 := a                 1: t1 := a
   2: t2 := b                 2: t2 := b
   nop                        5: t4 := c
   3: t3 := t1 + t2           3: t3 := t1 + t2
   nop                        8: t6 := d
   4: x  := t3                9: t7 := e
   5: t4 := c                 6: t5 := t3 + t4
   nop                       10: t8 := t6 + t7
   6: t5 := t3 + t4           4: x  := t3
   nop                        7: y  := t5
   7: y  := t5               11: z  := t8
   8: t6 := d
   9: t7 := e                no delays
   nop
  10: t8 := t6 + t7
   nop
  11: z  := t8

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 505

Objectives:
Restrictions for pipelining

In the lecture:
* Requirements of pipelining processors.
* The compiler reorders to meet the requirements and inserts nops (empty operations) if necessary.
* Some processors accept operations that are too close together and delay the second one by a hardware interlock.
* Hardware bypasses may relax the requirements.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5.2

Questions:
* Why are no nops needed in this example?
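Supplementary (not part of the original notes): the delay requirement can be checked mechanically. The Python sketch below counts the nops an instruction order needs, assuming a result may be used at the earliest two cycles after the producing operation starts (one delay slot, as in the example above); defs, uses and nops_needed are illustrative names.

    # registers written and read by the operations of C-5.2 (x, y, z omitted: never read)
    defs = {1: 't1', 2: 't2', 3: 't3', 5: 't4', 6: 't5', 8: 't6', 9: 't7', 10: 't8'}
    uses = {3: ['t1', 't2'], 4: ['t3'], 6: ['t3', 't4'], 7: ['t5'],
            10: ['t6', 't7'], 11: ['t8']}

    def nops_needed(order, gap=2):
        # assumes every use is preceded by its definition in 'order'
        cycle, ready_at, nops = 0, {}, 0
        for op in order:
            start = max([cycle] + [ready_at[r] for r in uses.get(op, [])])
            nops += start - cycle                  # empty cycles inserted before op
            if op in defs:
                ready_at[defs[op]] = start + gap   # result usable 'gap' cycles later
            cycle = start + 1
        return nops

    print(nops_needed([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))    # 6 nops without scheduling
    print(nops_needed([1, 2, 5, 3, 8, 9, 6, 10, 4, 7, 11]))    # 0 nops with scheduling

For the sequential order it reports 6 nops, for the rearranged order of the slide none.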
--------------------------------------------------------------------------------
C-5.6
Instruction Scheduling Algorithm for Pipelining

Algorithm: modified list scheduling; the ready list carries additional information.
Select from the ready list such that the selected operation
* has a sufficient distance to all predecessors in the DDG,
* has many successors (heuristic),
* has a long path to the end node (heuristic).
Insert an empty operation if none is selectable.

  opr.          1  2  5  8  9  3  6  4  10  7  11
  # succ        1  1  1  1  1  2  1  0   1  0   0
  to end        3  3  2  2  2  2  1  1   1  0   0
  sched. cycle  1  2  3  5  6  4  7  9   8 10  11

Result with scheduling:

  cycle  1:   1: t1 := a
  cycle  2:   2: t2 := b
  cycle  3:   5: t4 := c
  cycle  4:   3: t3 := t1 + t2
  cycle  5:   8: t6 := d
  cycle  6:   9: t7 := e
  cycle  7:   6: t5 := t3 + t4
  cycle  8:  10: t8 := t6 + t7
  cycle  9:   4: x  := t3
  cycle 10:   7: y  := t5
  cycle 11:  11: z  := t8

[Figure: the data dependence graph of slide C-5.2.]

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 506

Objectives:
Adapted list scheduling

In the lecture:
* Explain the algorithm using the example.
* Explain the selection criteria.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5.2

--------------------------------------------------------------------------------
C-5.6b
Reused registers: anti- and output-dependences

u -f-> v   flow-dependence:   u writes a value before v uses it
u -a-> v   anti-dependence:   u uses a value before v overwrites it
u -o-> v   output-dependence: u writes a value before v overwrites it

DDG with symbolic registers ti: flow-dependences only.
DDG with reused registers ti:   flow-, anti-, and output-dependences.

[Figure: on the left, the DDG of slide C-5.2 with symbolic registers and flow-dependences only; on the right, the same basic block with reused registers, where additional anti-dependence (a) and output-dependence (o) edges constrain the order of the operations.]

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 506b

Objectives:
Understand anti- and output-dependences

In the lecture:
Explain anti- and output-dependences:
* Reuse of registers introduces new dependences.

--------------------------------------------------------------------------------
C-5.6d
DDG with Loop Carried Dependences

Factorial computation:

  program:              seq. machine code:
  i = 0; f = 1;         L:  beq r1, r2 : exit
  while (i != n)            add r1, 1  : r1
  { i = i + 1;              mul r5, r1 : r5
    f = f * i;              add r8, 4  : r8
    m[i] = f;               sto r5     : m[r8]
  }                         bra L

[Figure: data dependence graph of the loop body with flow-dependences on r1, r5, r8, anti-dependences, and control-dependences; the flow-dependences on r1, r5, and r8 reach into the subsequent iteration (loop carried dependences).]

u -f-> v   flow-dependence:    u writes before v uses; such a dependence may reach into the subsequent iteration
u -a-> v   anti-dependence:    u uses a value before v overwrites it
u -o-> v   output-dependence:  u writes before v overwrites
u -c-> v   control-dependence: u has to be executed before v (u or v may branch)

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 506d

Objectives:
Loop carried dependences

In the lecture:
Explain loop carried dependences:
* the 4 kinds,
* they occur because a new value is stored in the same register on every iteration,
* they are relevant because we are going to merge operations of several iterations.

Questions:
* Explain why loops with arrays can have dependences into later iterations that are not the next one. Give an example.
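Supplementary (not part of the original notes): a small Python sketch that classifies the dependences between two operations u and v (u executed before v) from their def and use register sets. The reuse of t1 in the example call is hypothetical; it only illustrates how register reuse adds an anti-dependence of the kind shown on C-5.6b.

    def dependences(def_u, use_u, def_v, use_v):
        kinds = []
        if def_u & use_v: kinds.append('flow')     # u writes a register that v reads
        if use_u & def_v: kinds.append('anti')     # u reads a register that v overwrites
        if def_u & def_v: kinds.append('output')   # u and v write the same register
        return kinds

    # hypothetical reuse: operation 8 writes d into t1 instead of t6
    # u = operation 3: t3 := t1 + t2      v = operation 8: t1 := d
    print(dependences({'t3'}, {'t1', 't2'}, {'t1'}, set()))   # ['anti']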
--------------------------------------------------------------------------------
C-5.6u
Loop unrolling

Loop unrolling: a technique for parallelization of loops.

A single loop body does not exhibit enough parallelism => sparse schedule.
Schedule the code (copies) of several adjacent iterations together => more compact schedule.

[Figure: sequential loop, parallel schedule for a single body, the loop unrolled 3 times, and the more compact parallel schedule for the unrolled loop.]

A prologue and an epilogue are needed to take care of iteration numbers that are not multiples of the unroll factor.

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 506u

Objectives:
Understand the idea of loop unrolling

In the lecture:
* Compare the single body schedule to the schedule of the unrolled loop.
* Explain the consequences of loop carried dependences.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5.2

--------------------------------------------------------------------------------
C-5.7
Software Pipelining

Software pipelining: a technique for parallelization of loops.

A single loop body does not exhibit enough parallelism => sparse schedule.
Overlap the execution of several adjacent iterations => compact schedule.

The pipelined loop body
* contains each operation of the original sequential body,
* the operations belong to several iterations,
* they are tightly scheduled,
* its length is the initiation interval II, which is shorter than the original body.

Prologue and epilogue: initiation and finalization code.

[Figure: the iterations of a sequential loop compared to the software pipelined loop: prologue, pipelined loop of length II, epilogue; every II cycles a new iteration is initiated.]

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 507

Objectives:
Understand the underlying idea

In the lecture:
* Explain the underlying idea.
* II is both the length of the pipelined loop and the time between the starts of two successive iterations.

Questions:
Explain:
* The shorter the initiation interval is, the greater is the parallelism, and the more compact is the schedule.

--------------------------------------------------------------------------------
C-5.8
Transform Loops by Software Pipelining

Technique:
1. Construct the data dependence graph for the loop body, including loop carried dependences.
2. Choose a small initiation interval II - not smaller than #instructions / #FUs.
3. Make a "modulo schedule" s for the loop body: two instructions i1 in cycle c1 and i2 in cycle c2 can not be scheduled on the same FU if c1 mod II = c2 mod II.
4. If (3) does not succeed without conflict, increase II and repeat from 3.
5. Allocate the instructions of s in the new loop of length II: an instruction ij scheduled in cycle cj is allocated to cycle cj mod II.
6. Construct prologue and epilogue.

[Figure: modulo schedule for a loop body with II = 2: the instructions of iterations 1, 2, 3 (denoted 11 ... 15, 21 ... 25, 31 ... 35) are placed in cycles 0 to 5; the column "cycle mod II" alternates between 0 and 1, and no FU is used twice in the same modulo cycle. A loop body fragment "... = t1; ...; t1 = ...;" in the margin illustrates a loop carried dependence.]

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 508

Objectives:
Understand the technique

In the lecture:
* Explain the algorithm.
* Explain reasons for conflicts in step 4.

Questions:
Explain:
* The shorter the initiation interval is, the greater is the parallelism, and the more compact is the schedule.
* The transformed loop contains each instruction of the loop body exactly once.
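Supplementary (not part of the original notes): the resource check of step 3 in a few lines of Python. The FU names and cycle numbers are those of the loop body schedule on the next slide (C-5.10); for brevity the sketch treats every operation as occupying a single cycle, although mul and sto actually need two.

    def modulo_conflict(fu1, c1, fu2, c2, II):
        # step 3: same FU and equal cycle numbers modulo II means a conflict
        return fu1 == fu2 and c1 % II == c2 % II

    # loop body schedule of C-5.10 as (FU, cycle) pairs, II = 2
    body = [('CTR', 0), ('ADD', 1), ('ADD', 2), ('MUL', 2), ('MEM', 4), ('CTR', 7)]
    ok = all(not modulo_conflict(f1, c1, f2, c2, 2)
             for i, (f1, c1) in enumerate(body)
             for (f2, c2) in body[i + 1:])
    print(ok)   # True: the two ADDs differ mod 2, and so do beq (cycle 0) and bra (cycle 7)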
--------------------------------------------------------------------------------
C-5.10
Result of Software Pipelining

VLIW processor with 4 dedicated FUs: ADD, MUL, MEM, CTR; mul and sto need 2 cycles.

Schedule of the loop body for II = 2:

  t  tm  ADD              MUL              MEM              CTR
  0  0                                                      L: beq r1, r2 : exit
  1  1   add r1, 1 : r1
  2  0   add r8, 4 : r8   mul r5, r1 : r5
  3  1                    ... mul
  4  0                                     sto r5 : m[r8]
  5  1                                     ... sto
  6  0
  7  1                                                      bra L

add and sto are both in tm = 0: sto reads r8 before add writes it.
bra is not placed in cycle 6: there it would collide with beq (tm = 0).

Software pipeline with II = 2:

  t  tm  ADD                 MUL              MEM              CTR
  0  0                                                         beq r1, r2 : exit    prologue
  1  1   add r1, 1 : r1                                                              prologue
  2  0   add r8, 4 : r8      mul r5, r1 : r5                   beq r1, r2 : ex      prologue
  3  1   add r1, 1 : r1      ... mul                                                 prologue
  4  0   L: add r8, 4 : r8   mul r5, r1 : r5  sto r5 : m[r8]   beq r1, r2 : ex      pipelined loop
  5  1   add r1, 1 : r1      ... mul          ... sto          bra L                 with II = 2
  6  1   ex:                 ... mul          ... sto                                epilogue
  7  0                                        sto r5 : m[r8]                         epilogue
  8  1                                        ... sto                                epilogue
  9  0                                                         bra exit              epilogue

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 510

Objectives:
A software pipeline for a VLIW processor

In the lecture:
Explain
* the properties of the VLIW processor,
* the schedule,
* the software pipeline.

Assignments:
* Make a table of run-times in cycles for n = 1, 2, ... iterations, and compare the figures without and with software pipelining.

--------------------------------------------------------------------------------
C-5.11 / PPJ-50
5.2 / 6. Data Parallelism: Loop Parallelization

Development steps (automated by compilers):

* nested loops operating on arrays,
  sequential execution of the iteration space

      DECLARE B[0..N, 0..N+1]
      FOR I := 1 .. N
        FOR J := 1 .. I
          B[I,J] := B[I-1,J] + B[I-1,J-1]
        END FOR
      END FOR

* analyze data dependences:
  data-flow between definitions and uses of array elements

* transform loops:
  keep data dependences forward in time

* parallelize inner loop(s):
  map to a field or vector of processors

* map arrays to processors
  such that many accesses are local; transform index spaces

[Figure: iteration space of the example loop over (i, j) with 1 <= j <= i <= N, and the transformed index space after the loop transformation.]

--------------------------------------------------------------------------------
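A closing supplement (not part of the original slides): in the example of C-5.11, B[I,J] depends only on elements of row I-1, so all data dependences point from one I iteration to the next and the inner J loop may be executed in parallel. In the Python sketch below, N and the initialization of row 0 are illustrative assumptions; the vectorized row update stands in for the data parallel (SIMD) execution of C-5.1.

    import numpy as np

    N = 8
    B = np.zeros((N + 1, N + 2))
    B[0, :] = 1                                  # illustrative initialization of row 0

    for I in range(1, N + 1):                    # outer loop stays sequential
        J = np.arange(1, I + 1)
        # whole row I computed at once: reads only row I-1, writes only row I
        B[I, J] = B[I - 1, J] + B[I - 1, J - 1]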