C-5.1
5 Code Parallelization

Classes of processors and parallelism:

* Processors with instruction level parallelism (ILP): parallel functional units (FU1, FU2, FU3, ...);
  VLIW and superscalar processors execute several instructions in parallel.
  The compiler analyzes sequential programs to exhibit potential parallelism on the instruction level;
  dependences between computations restrict the parallelized instruction sequence.

* Pipelined processors (pipeline processor model with stages S3, S2, S1):
  sequential code is scheduled for pipelining.
  The compiler arranges instructions for shortest execution time: instruction scheduling.

* Data parallel processors (SIMD): many functional units (FU0 ... FU31) execute the same instruction
  on different data; e. g. the loop

      for i := 0 to 31 do c[i] := a[i] + b[i];

  is one instruction! The compiler analyzes loops to execute them in parallel:
  loop transformation, array transformation.

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 501

Objectives:
3 abstractions of processor parallelism

In the lecture:
* explain the abstract models
* relate to real processors
* explain the instruction scheduling tasks

Suggested reading:
Kastens / Übersetzerbau, Section 8.5

Questions:
* What has to be known about instruction execution in order to solve the instruction scheduling problem in the compiler?

--------------------------------------------------------------------------------
C-5.2
5.1 Instruction Scheduling: Data Dependence Graph

Exhibit potential fine-grained parallelism among operations. Sequential code is over-specified!

Data dependence graph (DDG) for a basic block:
Node: operation; edge a -> b: operation b uses the result of operation a.

Example for a basic block:

   1: t1 := a
   2: t2 := b
   3: t3 := t1 + t2
   4: x  := t3
   5: t4 := c
   6: t5 := t3 + t4
   7: y  := t5
   8: t6 := d
   9: t7 := e
  10: t8 := t6 + t7
  11: z  := t8

The ti are symbolic registers; they store intermediate results and obey the single assignment rule.

[Figure: the data dependence graph of the basic block. Operations 1, 2, 5, 8, 9 are roots; their results t1, t2, t4, t6, t7 flow into operations 3, 6, and 10; t3 flows into 4 and 6, t5 into 7, t8 into 11; the leaves assign x, y, z.]

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 502

Objectives:
DDG exhibits parallelism

In the lecture:
* Show where sequential code is over-specified.
* Derive reordered sequences from the DDG.
* Single assignment for ti: ti contains exactly one value; ti is not reused for other values.
* Without that assumption, further dependences have to manifest the order of assignments to those registers.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5, Abb. 8.5-1

Assignments:
* Write the operations of the basic block in a different order, such that the effect is not changed and the same DDG is produced.

Questions:
* Why does this example have so much freedom for rearranging operations?
* Why are further dependences necessary if registers are allocated?

--------------------------------------------------------------------------------
C-5.3
List Scheduling

Input:  data dependence graph
Output: a schedule of at most k operations per cycle, such that all dependences point forward; the DDG is arranged in levels.

Algorithm:
A ready list contains all operations that are not yet scheduled, but whose predecessors are scheduled.
Iterate: select from the ready list up to k operations for the next cycle (heuristic); update the ready list.

* The algorithm is optimal only for trees.
* Heuristic: keep the ready list sorted by distance to an end node, e. g.

      (1 2 5) (8 9 3) (6 10 4) (7 11)

  without this heuristic: (1 8 9) (2 5 10) (3 11) (6 4) (7)

  ( ) encloses the operations of one cycle.

Scheduled DDG arranged in levels (k = 3):

  cycle 1:  1  2  5
  cycle 2:  3  8  9
  cycle 3:  6  4  10
  cycle 4:  7  11

Critical paths determine the minimal schedule length, e. g. 1 -> 3 -> 6 -> 7.

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 503

Objectives:
A simple fundamental scheduling algorithm

In the lecture:
* Explain the algorithm using the example.
* Show variants of orders in the ready list, and their consequences.
* Explain the heuristic.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5.1

Assignments:
* Write the parallel code for this example.

Questions:
* Explain the heuristic with respect to critical paths.
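Supplementary to this slide (not part of the original notes): a minimal Python sketch of list scheduling with the distance-to-end heuristic, assuming unit execution times and k identical FUs; the data structures and names (succs, dist_to_end, list_schedule) are illustrative. Applied to the DDG of C-5.2 it reproduces the schedule shown above.

    # DDG of slide C-5.2: an edge a -> b means "b uses the result of a".
    succs = {1: [3], 2: [3], 3: [4, 6], 5: [6], 6: [7],
             8: [10], 9: [10], 10: [11], 4: [], 7: [], 11: []}
    preds = {n: [m for m in succs if n in succs[m]] for n in succs}

    def dist_to_end(n):
        # longest path (in edges) from n to an end node of the DDG
        return 0 if not succs[n] else 1 + max(dist_to_end(s) for s in succs[n])

    def list_schedule(k):
        # at most k operations per cycle; dependences point forward
        scheduled, cycles = set(), []
        while len(scheduled) < len(succs):
            ready = [n for n in succs if n not in scheduled
                     and all(p in scheduled for p in preds[n])]
            ready.sort(key=dist_to_end, reverse=True)   # heuristic of C-5.3
            cycles.append(ready[:k])
            scheduled.update(ready[:k])
        return cycles

    print(list_schedule(3))   # [[1, 2, 5], [3, 8, 9], [6, 10, 4], [7, 11]]

Sorting the ready list by the longest remaining path prefers operations on critical paths; without this heuristic the selection may need five cycles instead of four, as noted on the slide.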
--------------------------------------------------------------------------------
C-5.4
Variants and Restrictions for List Scheduling

* Allocate as soon as possible, ASAP (C-5.3), or as late as possible, ALAP.
* Operations have unit execution time (C-5.3); with different execution times the selection avoids conflicts with already allocated operations.
* Operations may run only on specific functional units (e. g. 2 int FUs, 2 float FUs).
* Resource restrictions between operations, e. g. <= 1 load or store per cycle.

The scheduled DDG models the number of needed registers:
* an arc represents the use of an intermediate result,
* the cut width through a level gives the number of registers needed.

  cycle 1:  1  2  5
                          cut width 3
  cycle 2:  3  8  9
                          cut width 4   (the value t3 is used twice)
  cycle 3:  6  4  10
                          cut width 2
  cycle 4:  7  11

The tighter the schedule, the more registers are needed (register pressure).

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 504

Objectives:
A simple fundamental scheduling algorithm

In the lecture:
* Explain ASAP and ALAP.
* Explain restrictions on the selection of operations.
* Show how the register need is modeled.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5.1

Assignments:
* The algorithm allocates an operation as soon as possible (ASAP). Describe a variant of the algorithm which allocates an operation as late as possible (ALAP).
* Describe a variant that allocates operations of different execution times.

Questions:
* Compare the way register need is modeled with the approach of Belady for register allocation.
* Why do tight schedules need more registers?

--------------------------------------------------------------------------------
C-5.5
Instruction Scheduling for Pipelining

Instruction pipeline with 3 stages.
[Figure: pipeline stages 3, 2, 1 processing an instruction sequence I4 I5 nop I6 nop I7 ...]

Dependent instructions may not follow one another immediately.
The schedule rearranges the operation sequence to minimize the number of delays:

  without scheduling:        with scheduling:
   1: t1 := a                 1: t1 := a
   2: t2 := b                 2: t2 := b
   nop                        5: t4 := c
   3: t3 := t1 + t2           3: t3 := t1 + t2
   nop                        8: t6 := d
   4: x  := t3                9: t7 := e
   5: t4 := c                 6: t5 := t3 + t4
   nop                       10: t8 := t6 + t7
   6: t5 := t3 + t4           4: x  := t3
   nop                        7: y  := t5
   7: y  := t5               11: z  := t8
   8: t6 := d
   9: t7 := e                no delays
   nop
  10: t8 := t6 + t7
   nop
  11: z  := t8

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 505

Objectives:
Restrictions for pipelining

In the lecture:
* Requirements of pipelining processors.
* The compiler reorders to meet the requirements and inserts nops (empty operations) if necessary.
* Some processors accept operations that are too close together and delay the second one by a hardware interlock.
* Hardware bypasses may relax the requirements.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5.2

Questions:
* Why are no nops needed in this example?
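Supplementary (not part of the original notes): the delay requirement can be checked mechanically. The Python sketch below counts the nops an instruction order needs, assuming a result may be used at the earliest two cycles after the producing operation starts (one delay slot, as in the example above); defs, uses and nops_needed are illustrative names.

    # registers written and read by the operations of C-5.2 (x, y, z omitted: never read)
    defs = {1: 't1', 2: 't2', 3: 't3', 5: 't4', 6: 't5', 8: 't6', 9: 't7', 10: 't8'}
    uses = {3: ['t1', 't2'], 4: ['t3'], 6: ['t3', 't4'], 7: ['t5'],
            10: ['t6', 't7'], 11: ['t8']}

    def nops_needed(order, gap=2):
        # assumes every use is preceded by its definition in 'order'
        cycle, ready_at, nops = 0, {}, 0
        for op in order:
            start = max([cycle] + [ready_at[r] for r in uses.get(op, [])])
            nops += start - cycle                  # empty cycles inserted before op
            if op in defs:
                ready_at[defs[op]] = start + gap   # result usable 'gap' cycles later
            cycle = start + 1
        return nops

    print(nops_needed([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]))    # 6 nops without scheduling
    print(nops_needed([1, 2, 5, 3, 8, 9, 6, 10, 4, 7, 11]))    # 0 nops with scheduling

For the sequential order it reports 6 nops, for the rearranged order of the slide none.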
--------------------------------------------------------------------------------
C-5.6
Instruction Scheduling Algorithm for Pipelining

Algorithm: modified list scheduling; the ready list carries additional information.
Select from the ready list such that the selected operation
* has a sufficient distance to all predecessors in the DDG,
* has many successors (heuristic),
* has a long path to the end node (heuristic).
Insert an empty operation if none is selectable.

  opr.          1  2  5  8  9  3  6  4  10  7  11
  # succ        1  1  1  1  1  2  1  0   1  0   0
  to end        3  3  2  2  2  2  1  1   1  0   0
  sched. cycle  1  2  3  5  6  4  7  9   8 10  11

Result with scheduling:

  cycle  1:   1: t1 := a
  cycle  2:   2: t2 := b
  cycle  3:   5: t4 := c
  cycle  4:   3: t3 := t1 + t2
  cycle  5:   8: t6 := d
  cycle  6:   9: t7 := e
  cycle  7:   6: t5 := t3 + t4
  cycle  8:  10: t8 := t6 + t7
  cycle  9:   4: x  := t3
  cycle 10:   7: y  := t5
  cycle 11:  11: z  := t8

[Figure: the data dependence graph of slide C-5.2.]

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 506

Objectives:
Adapted list scheduling

In the lecture:
* Explain the algorithm using the example.
* Explain the selection criteria.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5.2

--------------------------------------------------------------------------------
C-5.6b
Reused registers: anti- and output-dependences

u -f-> v   flow-dependence:   u writes a value before v uses it
u -a-> v   anti-dependence:   u uses a value before v overwrites it
u -o-> v   output-dependence: u writes a value before v overwrites it

DDG with symbolic registers ti: flow-dependences only.
DDG with reused registers ti:   flow-, anti-, and output-dependences.

[Figure: on the left, the DDG of slide C-5.2 with symbolic registers and flow-dependences only; on the right, the same basic block with reused registers, where additional anti-dependence (a) and output-dependence (o) edges constrain the order of the operations.]

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 506b

Objectives:
Understand anti- and output-dependences

In the lecture:
Explain anti- and output-dependences:
* Reuse of registers introduces new dependences.

--------------------------------------------------------------------------------
C-5.6d
DDG with Loop Carried Dependences

Factorial computation:

  program:              seq. machine code:
  i = 0; f = 1;         L:  beq r1, r2 : exit
  while (i != n)            add r1, 1  : r1
  { i = i + 1;              mul r5, r1 : r5
    f = f * i;              add r8, 4  : r8
    m[i] = f;               sto r5     : m[r8]
  }                         bra L

[Figure: data dependence graph of the loop body with flow-dependences on r1, r5, r8, anti-dependences, and control-dependences; the flow-dependences on r1, r5, and r8 reach into the subsequent iteration (loop carried dependences).]

u -f-> v   flow-dependence:    u writes before v uses; such a dependence may reach into the subsequent iteration
u -a-> v   anti-dependence:    u uses a value before v overwrites it
u -o-> v   output-dependence:  u writes before v overwrites
u -c-> v   control-dependence: u has to be executed before v (u or v may branch)

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 506d

Objectives:
Loop carried dependences

In the lecture:
Explain loop carried dependences:
* the 4 kinds,
* they occur because a new value is stored in the same register on every iteration,
* they are relevant because we are going to merge operations of several iterations.

Questions:
* Explain why loops with arrays can have dependences into later iterations that are not the next one. Give an example.
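Supplementary (not part of the original notes): a small Python sketch that classifies the dependences between two operations u and v (u executed before v) from their def and use register sets. The reuse of t1 in the example call is hypothetical; it only illustrates how register reuse adds an anti-dependence of the kind shown on C-5.6b.

    def dependences(def_u, use_u, def_v, use_v):
        kinds = []
        if def_u & use_v: kinds.append('flow')     # u writes a register that v reads
        if use_u & def_v: kinds.append('anti')     # u reads a register that v overwrites
        if def_u & def_v: kinds.append('output')   # u and v write the same register
        return kinds

    # hypothetical reuse: operation 8 writes d into t1 instead of t6
    # u = operation 3: t3 := t1 + t2      v = operation 8: t1 := d
    print(dependences({'t3'}, {'t1', 't2'}, {'t1'}, set()))   # ['anti']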
--------------------------------------------------------------------------------
C-5.6u
Loop unrolling

Loop unrolling: a technique for parallelization of loops.

A single loop body does not exhibit enough parallelism => sparse schedule.
Schedule the code (copies) of several adjacent iterations together => more compact schedule.

[Figure: sequential loop, parallel schedule for a single body, the loop unrolled 3 times, and the more compact parallel schedule for the unrolled loop.]

A prologue and an epilogue are needed to take care of iteration numbers that are not multiples of the unroll factor.

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 506u

Objectives:
Understand the idea of loop unrolling

In the lecture:
* Compare the single body schedule to the schedule of the unrolled loop.
* Explain the consequences of loop carried dependences.

Suggested reading:
Kastens / Übersetzerbau, Section 8.5.2

--------------------------------------------------------------------------------
C-5.7
Software Pipelining

Software pipelining: a technique for parallelization of loops.

A single loop body does not exhibit enough parallelism => sparse schedule.
Overlap the execution of several adjacent iterations => compact schedule.

The pipelined loop body
* contains each operation of the original sequential body,
* the operations belong to several iterations,
* they are tightly scheduled,
* its length is the initiation interval II, which is shorter than the original body.

Prologue and epilogue: initiation and finalization code.

[Figure: the iterations of a sequential loop compared to the software pipelined loop: prologue, pipelined loop of length II, epilogue; every II cycles a new iteration is initiated.]

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 507

Objectives:
Understand the underlying idea

In the lecture:
* Explain the underlying idea.
* II is both the length of the pipelined loop and the time between the starts of two successive iterations.

Questions:
Explain:
* The shorter the initiation interval is, the greater is the parallelism, and the more compact is the schedule.

--------------------------------------------------------------------------------
C-5.8
Transform Loops by Software Pipelining

Technique:
1. Construct the data dependence graph for the loop body, including loop carried dependences.
2. Choose a small initiation interval II - not smaller than #instructions / #FUs.
3. Make a "modulo schedule" s for the loop body: two instructions i1 in cycle c1 and i2 in cycle c2 can not be scheduled on the same FU if c1 mod II = c2 mod II.
4. If (3) does not succeed without conflict, increase II and repeat from 3.
5. Allocate the instructions of s in the new loop of length II: an instruction ij scheduled in cycle cj is allocated to cycle cj mod II.
6. Construct prologue and epilogue.

[Figure: modulo schedule for a loop body with II = 2: the instructions of iterations 1, 2, 3 (denoted 11 ... 15, 21 ... 25, 31 ... 35) are placed in cycles 0 to 5; the column "cycle mod II" alternates between 0 and 1, and no FU is used twice in the same modulo cycle. A loop body fragment "... = t1; ...; t1 = ...;" in the margin illustrates a loop carried dependence.]

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 508

Objectives:
Understand the technique

In the lecture:
* Explain the algorithm.
* Explain reasons for conflicts in step 4.

Questions:
Explain:
* The shorter the initiation interval is, the greater is the parallelism, and the more compact is the schedule.
* The transformed loop contains each instruction of the loop body exactly once.
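Supplementary (not part of the original notes): the resource check of step 3 in a few lines of Python. The FU names and cycle numbers are those of the loop body schedule on the next slide (C-5.10); for brevity the sketch treats every operation as occupying a single cycle, although mul and sto actually need two.

    def modulo_conflict(fu1, c1, fu2, c2, II):
        # step 3: same FU and equal cycle numbers modulo II means a conflict
        return fu1 == fu2 and c1 % II == c2 % II

    # loop body schedule of C-5.10 as (FU, cycle) pairs, II = 2
    body = [('CTR', 0), ('ADD', 1), ('ADD', 2), ('MUL', 2), ('MEM', 4), ('CTR', 7)]
    ok = all(not modulo_conflict(f1, c1, f2, c2, 2)
             for i, (f1, c1) in enumerate(body)
             for (f2, c2) in body[i + 1:])
    print(ok)   # True: the two ADDs differ mod 2, and so do beq (cycle 0) and bra (cycle 7)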
--------------------------------------------------------------------------------
C-5.10
Result of Software Pipelining

VLIW processor with 4 dedicated FUs: ADD, MUL, MEM, CTR; mul and sto need 2 cycles.

Schedule of the loop body for II = 2:

  t  tm  ADD              MUL              MEM              CTR
  0  0                                                      L: beq r1, r2 : exit
  1  1   add r1, 1 : r1
  2  0   add r8, 4 : r8   mul r5, r1 : r5
  3  1                    ... mul
  4  0                                     sto r5 : m[r8]
  5  1                                     ... sto
  6  0
  7  1                                                      bra L

add and sto are both in tm = 0: sto reads r8 before add writes it.
bra is not placed in cycle 6: there it would collide with beq (tm = 0).

Software pipeline with II = 2:

  t  tm  ADD                 MUL              MEM              CTR
  0  0                                                         beq r1, r2 : exit    prologue
  1  1   add r1, 1 : r1                                                              prologue
  2  0   add r8, 4 : r8      mul r5, r1 : r5                   beq r1, r2 : ex      prologue
  3  1   add r1, 1 : r1      ... mul                                                 prologue
  4  0   L: add r8, 4 : r8   mul r5, r1 : r5  sto r5 : m[r8]   beq r1, r2 : ex      pipelined loop
  5  1   add r1, 1 : r1      ... mul          ... sto          bra L                 with II = 2
  6  1   ex:                 ... mul          ... sto                                epilogue
  7  0                                        sto r5 : m[r8]                         epilogue
  8  1                                        ... sto                                epilogue
  9  0                                                         bra exit              epilogue

--------------------------------------------------------------------------------
Lecture Compilation Methods SS 2013 / Slide 510

Objectives:
A software pipeline for a VLIW processor

In the lecture:
Explain
* the properties of the VLIW processor,
* the schedule,
* the software pipeline.

Assignments:
* Make a table of run-times in cycles for n = 1, 2, ... iterations, and compare the figures without and with software pipelining.

--------------------------------------------------------------------------------
C-5.11 / PPJ-50
5.2 / 6. Data Parallelism: Loop Parallelization

Development steps (automated by compilers):

* nested loops operating on arrays,
  sequential execution of the iteration space

      DECLARE B[0..N, 0..N+1]
      FOR I := 1 .. N
        FOR J := 1 .. I
          B[I,J] := B[I-1,J] + B[I-1,J-1]
        END FOR
      END FOR

* analyze data dependences:
  data-flow between definitions and uses of array elements

* transform loops:
  keep data dependences forward in time

* parallelize inner loop(s):
  map to a field or vector of processors

* map arrays to processors
  such that many accesses are local; transform index spaces

[Figure: iteration space of the example loop over (i, j) with 1 <= j <= i <= N, and the transformed index space after the loop transformation.]

--------------------------------------------------------------------------------
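A closing supplement (not part of the original slides): in the example of C-5.11, B[I,J] depends only on elements of row I-1, so all data dependences point from one I iteration to the next and the inner J loop may be executed in parallel. In the Python sketch below, N and the initialization of row 0 are illustrative assumptions; the vectorized row update stands in for the data parallel (SIMD) execution of C-5.1.

    import numpy as np

    N = 8
    B = np.zeros((N + 1, N + 2))
    B[0, :] = 1                                  # illustrative initialization of row 0

    for I in range(1, N + 1):                    # outer loop stays sequential
        J = np.arange(1, I + 1)
        # whole row I computed at once: reads only row I-1, writes only row I
        B[I, J] = B[I - 1, J] + B[I - 1, J - 1]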