PPJ-38 5. Data Parallelism: Barriers

Many processes execute the same operations at the same time on different data, usually on elements of regular data structures: arrays, sequences, matrices, lists.

Data parallelism as an architectural model of parallel computers:
* vector machines, e.g. Cray
* SIMD machines (Single Instruction Multiple Data), e.g. Connection Machine, MasPar
* GPUs (Graphics Processing Units): massively parallel processors on graphics cards

Data parallelism as a programming model for parallel computers:
* computations on arrays in nested loops
* analyze data dependences of computations, transform and parallelize loops
* iterative computations in rounds, synchronized with barriers
* systolic computations: 2 phases are iterated: compute - shift data to neighbour processes

Applications mainly in technical and scientific computing, e.g.
* fluid mechanics
* image processing
* solving differential equations
* finite element method in design systems

(c) 2011 bei Prof. Dr. Uwe Kastens
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 38
Objectives: Overview of the notions of data parallelism
In the lecture: Explain the notions
--------------------------------------------------------------------------------

PPJ-39 Data parallelism as an architectural model

SIMD machine: Single Instruction Multiple Data
* very many processors, massively parallel, e.g. a 32 x 64 processor field
* local memory for each processor
* same instructions executed in lock step
* fast communication in lock step
* fixed topology, usually a grid
* machine types e.g. Connection Machine, MasPar
(diagram: a control program driving a field of processors)
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 39
Objectives: Architecture of a SIMD computer
In the lecture: Explanation of the properties
--------------------------------------------------------------------------------

PPJ-40 Data parallelism as a programming model
* regular data structures (arrays, lists) are mapped onto a field of processors
* processes execute the same program on individual data in lock step
* communication with neighbours in the same direction in lock step

Simple example, matrix addition C = A + B:

sequential:
  for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
      c[i,j] = a[i,j] + b[i,j];

data parallel:
  distribute A, B
  c = a + b      // 1 step!
  collect C

* this can be parallelized directly, since there are no data dependences
* the data mapping is trivial: array element [i,j] on process [i,j]
* communication is not needed
* no algorithmic idea is needed

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 40
Objectives: Idea of loop parallelization
In the lecture:
* explain the example,
* show the reasons for the simplicity of the parallelization
Questions:
* Give examples for array operations that can be parallelized with similar ease.
--------------------------------------------------------------------------------

PPJ-41 Example: prefix sums

input: sequence a of numbers
output: sequence s of the sums of the prefixes of a: s[i] = Sigma (j = 0 .. i) a[j]

a values:  5 3 1  2  1  3
s results: 5 8 9 11 12 15

parallel algorithmic idea, computation in rounds with doubled distances:
round r = 0: initialize s[i] = a[i]; then each s[i] adds s[i-1]
round r = 1: each s[i] adds s[i-2]
round r = 2: each s[i] adds s[i-4]
(additions are omitted where the index would be negative)
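The rounds sketched above can be checked with a small sequential simulation; the following is a minimal sketch (class and method names are mine), where cloning the array before each round plays the role of the barrier that separates reading old values from writing new ones:

```java
// Sequential simulation of the data-parallel prefix-sum rounds:
// in each round, every position i with i - dist >= 0 adds the value
// at distance dist; the distance doubles after every round.
class PrefixSumRounds {
    static int[] prefixSums(int[] a) {
        int n = a.length;
        int[] s = a.clone();                  // round 0: s[i] = a[i]
        for (int dist = 1; dist < n; dist *= 2) {
            int[] old = s.clone();            // snapshot = "barrier" between read and write
            for (int i = dist; i < n; i++)
                s[i] = old[i] + old[i - dist];
        }
        return s;
    }

    public static void main(String[] args) {
        // the example of the slide: 5 3 1 2 1 3  ->  5 8 9 11 12 15
        System.out.println(java.util.Arrays.toString(
            prefixSums(new int[]{5, 3, 1, 2, 1, 3})));
    }
}
```

For n elements the simulation performs ceil(log2 n) rounds, compared with n-1 additions in the sequential case.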
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 41
Objectives: Understand the parallel computation of prefix sums
In the lecture: Explain
* the task,
* the algorithmic idea,
* how to exploit associativity,
* computations in rounds,
* duplication of distance
Questions:
* What is the formula for the number of steps in the sequential and in the parallel case?
--------------------------------------------------------------------------------

PPJ-41a Example: prefix sums (2)

input: sequence a of numbers
output: sequence s of the sums of the prefixes of a: s[i] = Sigma (j = 0 .. i) a[j]

parallel algorithmic idea: in rounds with distances d = 1, 2, 4, ..., every process p with p - d >= 0 computes s[p] = s[p - d] + s[p].

Proof for process p = 0 .. n-1:

Invariant SUM: s[p] = a[p-d+1] + ... + a[p]
with d = 1, 2, 4, ..., m <= n the distance before the next round
(terms with negative indices are omitted).

Induction begin: d = 1; s[p] = a[p] holds by the initialization.

Induction step: the computation s[p] = s[p - d] + s[p] yields
s[p] = (a[p-2d+1] + ... + a[p-d]) + (a[p-d+1] + ... + a[p]) = a[p-2d+1] + ... + a[p];
substituting 2d for d re-establishes SUM for the next round.

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 41a
Objectives: Prove the parallel computation of prefix sums correct
In the lecture: Explain
* the proof
--------------------------------------------------------------------------------

PPJ-42 Prefix sums: applied methods
* computational scheme reduction: all array elements are combined using a reduction operation (here: addition)
* iterative computation in rounds: in each round all processes perform a computation step
* duplication of distance: in each round data is exchanged with a neighbour at twice the distance of the previous round
* barrier synchronization: processes may not enter the next round before all processes have finished the previous one
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 42
Objectives: Point out the methods
In the lecture:
* Explain the methods for the prefix sums.
* Point out other applications of these methods.
--------------------------------------------------------------------------------

PPJ-43 Barriers

Several processes meet at a common point of synchronization.

Rule: All processes must have reached the barrier (for the j-th time) before one of them leaves it (for the j-th time).

Applications:
* iterative computations, where iteration j uses results of iteration j-1
* separation of computational phases

Scheme:

public void run () {
  do {
    computeNewValues (i);
    b.barrier();
  } while (!converged);
}

Implementation techniques for barriers:
* central controller: monitor or coordination process
* worker processes coordinated as a tree
* worker processes symmetrically coordinated (butterfly barrier, dissemination barrier)

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 43
Objectives: Understand the concept of barriers
In the lecture: Explain
* the barrier rule,
* the relation to the prefix sums,
* applications.
--------------------------------------------------------------------------------

PPJ-44 Barrier implemented by a monitor

The monitor stops a given number of processes and releases them together:

class BarrierMonitor {
  private int processes;   // number of processes to be synchronized
  private int arrived = 0; // number of processes arrived at the barrier

  public BarrierMonitor (int procs) { processes = procs; }

  synchronized public void barrier () {
    arrived++;
    if (arrived < processes)
      try { wait(); }
      catch (InterruptedException e) {} // an interruption destroys the barrier behaviour
    else {
      arrived = 0;   // reset arrival count
      notifyAll();   // release the other processes
    }
  }
}
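A small demo of how the monitor can be used (the BarrierMonitor is the class from the slide, repeated here so the sketch is self-contained; the worker scaffolding and all names around it are mine): each worker increments its round counter, and after the barrier all workers must observe every counter at the same value. A second barrier separates this check from the next round.

```java
// Demo: 4 workers run 3 rounds; the barrier guarantees that no worker
// enters round r+1 before all workers have finished round r.
class BarrierDemo {
    static class BarrierMonitor {                 // monitor from the slide
        private int processes, arrived = 0;
        BarrierMonitor(int procs) { processes = procs; }
        synchronized void barrier() {
            arrived++;
            if (arrived < processes)
                try { wait(); } catch (InterruptedException e) {}
            else { arrived = 0; notifyAll(); }    // last arrival releases all
        }
    }

    static final int N = 4, ROUNDS = 3;
    static final int[] round = new int[N];        // per-worker round counter
    static volatile boolean allInSameRound = true;

    public static void main(String[] args) {
        BarrierMonitor bm = new BarrierMonitor(N);
        Thread[] t = new Thread[N];
        for (int i = 0; i < N; i++) {
            final int id = i;
            t[i] = new Thread(() -> {
                for (int r = 0; r < ROUNDS; r++) {
                    round[id]++;
                    bm.barrier();                 // now all workers are in the same round
                    for (int other = 0; other < N; other++)
                        if (round[other] != round[id]) allInSameRound = false;
                    bm.barrier();                 // separates the check from the next round
                }
            });
            t[i].start();
        }
        for (Thread th : t)
            try { th.join(); } catch (InterruptedException e) {}
        System.out.println("consistent rounds: " + allInSameRound);
    }
}
```

Without the two barriers a fast worker could race ahead, and the check would see differing round counters.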
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 44
Objectives: Understand the monitor implementation
In the lecture: Explain
* the implementation,
* why waiting in a loop is not necessary.
Questions:
* Why does this central solution cause a bottleneck?
--------------------------------------------------------------------------------

PPJ-45 Distributed tree barrier

Barrier synchronization of the worker processes organized as a binary tree; the bottleneck of central synchronization is avoided.

2 synchronization variables (flags) at each node:
* arrived: all processes in the subtree have arrived; is propagated upward
* continue: all processes in the subtree may continue; is propagated downward
(diagram: binary tree of nodes, each carrying the flags a and c)

disadvantage: different code is needed for the root, for inner nodes, and for leaves

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 45
Objectives: Understand the tree barrier
In the lecture: Explain
* the principle of the 2 phases,
* the advantage of the distributed solution.
--------------------------------------------------------------------------------

PPJ-45a 2 Rules for Synchronization Using Flags

A flag synchronizes exactly 2 processes: one process waits until the flag is set; the other process sets the flag. It may be implemented by a monitor in Java.

Flag rules:
1. The process that waits for a flag resets it.
2. A flag that is set may not be set again before it has been reset.

Consequence: no state change is lost.

timeline (flag states f==0, f==1, f==0, ...):
process p waits for f==1, then resets f:=0;
process q ensures f==0 before it sets f:=1.
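One possible Java monitor for such a flag, as asked for in the assignment below (a sketch, not the official solution; class and method names are mine). Both operations block until their flag rule allows them to proceed:

```java
// A flag for synchronization between exactly 2 processes.
// Rule 1: the waiting process resets the flag.
// Rule 2: a set flag may not be set again before it has been reset.
class Flag {
    private boolean flag = false;

    // called by the waiting process: wait for f==1, then reset f:=0 (rule 1)
    synchronized void waitAndReset() {
        while (!flag)
            try { wait(); } catch (InterruptedException e) {}
        flag = false;
        notifyAll();               // a setter blocked by rule 2 may proceed
    }

    // called by the setting process: ensure f==0, then set f:=1 (rule 2)
    synchronized void set() {
        while (flag)
            try { wait(); } catch (InterruptedException e) {}
        flag = true;
        notifyAll();               // wake the waiting process
    }

    public static void main(String[] args) throws InterruptedException {
        Flag ready = new Flag();
        Thread worker = new Thread(() -> {
            ready.waitAndReset();  // blocks until main sets the flag
            System.out.println("worker released");
        });
        worker.start();
        ready.set();
        worker.join();
    }
}
```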
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 45a
Objectives: Understand flag synchronization
In the lecture: Explain
* the general flag rules.
Assignments:
* Design a Java class for flag synchronization between 2 processes. Ensure that the flag rules are obeyed.
--------------------------------------------------------------------------------

PPJ-45b Distributed tree barrier implementation

2 synchronization variables (flags) at each node:
* arrived: all processes in the subtree have arrived; is propagated upward
* continue: all processes in the subtree may continue; is propagated downward
Initially all flags are reset.

code for an inner node; an x marks the lines that leaves and the root execute, too:

                                                leaf  root
  execute this.task();                           x     x
  wait for left.arrived; reset left.arrived;           x
  wait for right.arrived; reset right.arrived;         x
  set this.arrived;                              x
  wait for this.continue; reset this.continue;   x
  set left.continue;                                   x
  set right.continue;                                  x

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 45b
Objectives: Understand the tree barrier
In the lecture: Explain
* the different code for the 3 kinds of nodes.
Assignments:
* Write the code for the 3 kinds of nodes using objects of the flag class.
--------------------------------------------------------------------------------

PPJ-46 Symmetric, distributed barrier (dissemination)

Processes synchronize pairwise in rounds with doubled distances. N processes are synchronized after r rounds if N <= 2^r.

In round s process i indicates its arrival and then waits for the arrival of process (i + N - 2^(s-1)) modulo N.

(diagram: processes 0..5 in rounds 1..3; an arc from i to j means that j waits for the arrival of i)

After r rounds each process is synchronized with every other one. Proof idea: for each process i every other process occurs in a tree of processes which have synchronized (in)directly with i.
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 46
Objectives: Understand the dissemination barrier
In the lecture:
* Symmetric code for arbitrarily many processes.
* An arc from i to j in the diagram means that j waits for the arrival of i.
* Show the synchronization for pairs.
* No cyclic waiting, because the arrival is indicated first, then the partner is waited for.
* After the last round all processes are synchronized, because for every process p a binary tree exists such that p is its root, all processes are in that tree, and the arcs are waiting pairs from the diagram forming paths from the leaves to the root.
Questions:
* Write the synchronization code.
* Show one of the binary trees.
--------------------------------------------------------------------------------

PPJ-46a Symmetric, distributed barrier: implementation

In round s process i indicates its arrival and waits for the arrival of process (i + N - 2^(s-1)) modulo N.

Code for each process:

execute this.task();
// synchronize:
s = 0;
while (N > 2^s) {
  s++;
  wait for f == 0; set f = 1;                    // indicate own arrival (flag rules!)
  partner = p[(i + N - 2^(s-1)) modulo N];
  wait for partner.f == 1; reset partner.f = 0;  // wait for the partner's arrival
}

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 46a
Objectives: Understand the dissemination barrier
In the lecture:
* Processes have to wait before they set AND before they reset the flag.
* Symmetric code for arbitrarily many processes.
Questions:
* Write the synchronization code.
* Show one of the binary trees.
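The round and partner computations of the slide can be written out directly; a small sketch (class and method names are mine) with 2^(s-1) expressed as a shift:

```java
// Dissemination barrier bookkeeping: number of rounds and the partner
// whose arrival process i awaits in round s (s = 1 .. rounds(n)).
class Dissemination {
    // smallest r with n <= 2^r
    static int rounds(int n) {
        int r = 0;
        while ((1 << r) < n) r++;
        return r;
    }

    // process whose arrival i awaits in round s: (i + n - 2^(s-1)) mod n
    static int partner(int i, int s, int n) {
        return (i + n - (1 << (s - 1))) % n;
    }

    public static void main(String[] args) {
        int n = 6;                              // the 6 processes of the diagram
        for (int s = 1; s <= rounds(n); s++) {
            System.out.print("round " + s + ":");
            for (int i = 0; i < n; i++)
                System.out.print(" " + i + "<-" + partner(i, s, n));
            System.out.println();
        }
    }
}
```

For N = 6 this yields 3 rounds (6 <= 2^3); process 0 waits for process 5 in round 1, for process 4 in round 2, and for process 2 in round 3.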
--------------------------------------------------------------------------------

PPJ-47 Prefix sums with barriers

class PrefixSum extends Thread {
  private int procNo;        // number of this process
  private BarrierMonitor bm; // barrier object

  public PrefixSum (int p, BarrierMonitor b) { procNo = p; bm = b; }

  public void run () {
    int addIt = 0, dist = 1; // distance
    // global arrays a and s
    s[procNo] = a[procNo];   // initialize result array
    bm.barrier();
    // invariant SUM: s[procNo] == a[procNo-dist+1]+...+a[procNo]
    while (dist < N) {
      if (procNo - dist >= 0)
        addIt = s[procNo - dist]; // save the value before it is overwritten
      bm.barrier();
      if (procNo - dist >= 0)
        s[procNo] += addIt;
      bm.barrier();
      dist = dist * 2;            // doubled distance
    }
  }
}

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 47
Objectives: Examples for synchronization points
In the lecture: Explain
* the invariant,
* the access of s[procNo],
* the reasons for the 3 synchronization points.
Questions:
* Explain the reasons for the 3 synchronization points.
--------------------------------------------------------------------------------

PPJ-48 Prefix sums in a synchronous parallel programming model

Notation in Modula-2* with synchronous (and asynchronous) loops for parallel machines:

VAR a, s, t: ARRAY [0..N-1] OF INTEGER;
VAR dist: CARDINAL;
BEGIN ...
  FORALL i: [0..N-1] IN SYNC        (* parallel loop in lock step *)
    s[i] := a[i];
  END;
  dist := 1;
  WHILE dist < N DO
    FORALL i: [0..N-1] IN SYNC      (* parallel loop in lock step *)
      IF (i - dist) >= 0 THEN
        t[i] := s[i - dist];        (* implicit barrier *)
        s[i] := s[i] + t[i];        (* for each statement *)
      END
    END;
    dist := dist * 2;
  END
END

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 48
Objectives: Implicit barriers
In the lecture:
* Explain the language constructs.
* If expressions were evaluated in lock step, too, the array t could be omitted.
* The MasPar SIMD machine would be programmed similarly.
Questions:
* Explain the execution if the values were not saved in t[i].
--------------------------------------------------------------------------------

PPJ-49 Finding list ends: data parallel approach

input: int array link stores lists; link[i] contains the index of the successor of i, or nil
output: int array last; last[i] contains the index of the last element of the list starting at i
method: worker process i computes last[i] = last[last[i]] in log N rounds

int d = 1;
last[i] = link[i];
barrier;
while (d < N) {
  int newlast = nil;
  if (last[i] != nil && last[last[i]] != nil)
    newlast = last[last[i]];
  barrier;
  if (newlast != nil)
    last[i] = newlast;
  barrier;
  d = 2*d;
}
// invariant: last[i] points to the end of those lists which are not longer than d

(diagrams: the lists after each round)

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 49
Objectives: Data parallelism not only for arrays!
In the lecture: Explain
* parallel scanning of lists,
* doubling distances for lists,
* last[last[i]],
* that it is only useful if the ends of many long lists are searched.
Questions:
* What role does the distance d play here?
--------------------------------------------------------------------------------

C-5.11 / PPJ-50 5.2 / 6. Data Parallelism: Loop Parallelization

Development steps (automated by compilers):

* nested loops operating on arrays, sequential execution of the iteration space:

  DECLARE B[0..N, 0..N+1]
  FOR I := 1 .. N
    FOR J := 1 .. I
      B[I,J] := B[I-1,J] + B[I-1,J-1]
    END FOR
  END FOR

* analyze data dependences: data-flow from definitions to uses of array elements
* transform loops: keep data dependences forward in time
* parallelize inner loop(s): map to a field or vector of processors
* map arrays to processors such that many accesses are local; transform index spaces

(diagrams: iteration spaces in (i, j) coordinates before and after the transformation)
--------------------------------------------------------------------------------
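The dependences of the loop nest above illustrate the point: every element of row I reads only row I-1, so for a fixed I all iterations of the inner J loop are independent and may run in parallel, while the outer I loop must stay sequential. A sketch in Java (the class name and the initialization B[0,0] = 1 are my assumptions; the parallel stream stands in for the field of processors):

```java
// B[I][J] := B[I-1][J] + B[I-1][J-1]: row i depends only on row i-1,
// so the inner loop over j is parallelized, the outer loop is not.
class LoopParallel {
    static int[][] compute(int n) {
        int[][] b = new int[n + 1][n + 2];
        b[0][0] = 1;                          // assumed initialization
        for (int i = 1; i <= n; i++) {        // sequential: row i needs row i-1
            final int row = i;
            java.util.stream.IntStream.rangeClosed(1, i).parallel()
                .forEach(j -> b[row][j] = b[row - 1][j] + b[row - 1][j - 1]);
        }
        return b;
    }

    public static void main(String[] args) {
        // with B[0,0] = 1 the nest computes Pascal's triangle row by row
        System.out.println(java.util.Arrays.toString(compute(4)[4]));
    }
}
```

Each parallel iteration writes a distinct element b[row][j] and reads only the row produced before the stream starts, so no two iterations conflict.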