PPJ-38 5. Data Parallelism: Barriers

Many processes execute the same operations at the same time on different data, usually on elements of regular data structures: arrays, sequences, matrices, lists.

Data parallelism as an architectural model of parallel computers:
* vector machines, e.g. Cray
* SIMD machines (Single Instruction Multiple Data), e.g. Connection Machine, MasPar
* GPUs (Graphics Processing Units): massively parallel processors on graphics cards

Data parallelism as a programming model for parallel computers:
* computations on arrays in nested loops
* analyze data dependences of computations, transform and parallelize loops
* iterative computations in rounds, synchronized with barriers
* systolic computations: 2 phases are iterated: compute - shift data to neighbour processes

Applications mainly in technical and scientific computing, e.g.
* fluid mechanics
* image processing
* solving differential equations
* finite element method in design systems

(c) 2011 bei Prof. Dr. Uwe Kastens
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 38
Objectives: Overview of the notions of data parallelism
In the lecture: Explain the notions
--------------------------------------------------------------------------------

PPJ-39 Data parallelism as an architectural model

SIMD machine: Single Instruction Multiple Data
* very many processors, massively parallel, e.g. a 32 x 64 processor field
* local memory for each processor
* same instructions executed in lock step
* fast communication in lock step
* fixed topology, usually a grid
* machine types e.g. Connection Machine, MasPar
(diagram: a control program driving a field of processors)
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 39
Objectives: Architecture of a SIMD computer
In the lecture: Explanation of the properties
--------------------------------------------------------------------------------

PPJ-40 Data parallelism as a programming model
* regular data structures (arrays, lists) are mapped onto a field of processors
* processes execute the same program on individual data in lock step
* communication with neighbours in the same direction in lock step

Simple example, matrix addition C = A + B:

sequential:
  for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
      c[i,j] = a[i,j] + b[i,j];

data parallel:
  distribute A, B
  c = a + b      // 1 step!
  collect C

* this can be parallelized directly, since there are no data dependences
* the data mapping is trivial: array element [i,j] on process [i,j]
* communication is not needed
* no algorithmic idea is needed

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 40
Objectives: Idea of loop parallelization
In the lecture:
* explain the example,
* show the reasons for the simplicity of the parallelization
Questions:
* Give examples for array operations that can be parallelized with similar ease.
--------------------------------------------------------------------------------

PPJ-41 Example: prefix sums

input: sequence a of numbers
output: sequence s of the sums of the prefixes of a: s[i] = Sigma (j = 0 .. i) a[j]

a values:  5 3 1  2  1  3
s results: 5 8 9 11 12 15

parallel algorithmic idea, computation in rounds with doubled distances:
round r = 0: initialize s[i] = a[i]; then each s[i] adds s[i-1]
round r = 1: each s[i] adds s[i-2]
round r = 2: each s[i] adds s[i-4]
(additions are omitted where the index would be negative)
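The rounds sketched above can be checked with a small sequential simulation; the following is a minimal sketch (class and method names are mine), where cloning the array before each round plays the role of the barrier that separates reading old values from writing new ones:

```java
// Sequential simulation of the data-parallel prefix-sum rounds:
// in each round, every position i with i - dist >= 0 adds the value
// at distance dist; the distance doubles after every round.
class PrefixSumRounds {
    static int[] prefixSums(int[] a) {
        int n = a.length;
        int[] s = a.clone();                  // round 0: s[i] = a[i]
        for (int dist = 1; dist < n; dist *= 2) {
            int[] old = s.clone();            // snapshot = "barrier" between read and write
            for (int i = dist; i < n; i++)
                s[i] = old[i] + old[i - dist];
        }
        return s;
    }

    public static void main(String[] args) {
        // the example of the slide: 5 3 1 2 1 3  ->  5 8 9 11 12 15
        System.out.println(java.util.Arrays.toString(
            prefixSums(new int[]{5, 3, 1, 2, 1, 3})));
    }
}
```

For n elements the simulation performs ceil(log2 n) rounds, compared with n-1 additions in the sequential case.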
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 41
Objectives: Understand the parallel computation of prefix sums
In the lecture: Explain
* the task,
* the algorithmic idea,
* how to exploit associativity,
* computations in rounds,
* duplication of distance
Questions:
* What is the formula for the number of steps in the sequential and in the parallel case?
--------------------------------------------------------------------------------

PPJ-41a Example: prefix sums (2)

input: sequence a of numbers
output: sequence s of the sums of the prefixes of a: s[i] = Sigma (j = 0 .. i) a[j]

parallel algorithmic idea: in rounds with distances d = 1, 2, 4, ..., every process p with p - d >= 0 computes s[p] = s[p - d] + s[p].

Proof for process p = 0 .. n-1:

Invariant SUM: s[p] = a[p-d+1] + ... + a[p]
with d = 1, 2, 4, ..., m <= n the distance before the next round
(terms with negative indices are omitted).

Induction begin: d = 1; s[p] = a[p] holds by the initialization.

Induction step: the computation s[p] = s[p - d] + s[p] yields
s[p] = (a[p-2d+1] + ... + a[p-d]) + (a[p-d+1] + ... + a[p]) = a[p-2d+1] + ... + a[p];
substituting 2d for d re-establishes SUM for the next round.

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 41a
Objectives: Prove the parallel computation of prefix sums correct
In the lecture: Explain
* the proof
--------------------------------------------------------------------------------

PPJ-42 Prefix sums: applied methods
* computational scheme reduction: all array elements are combined using a reduction operation (here: addition)
* iterative computation in rounds: in each round all processes perform a computation step
* duplication of distance: in each round data is exchanged with a neighbour at twice the distance of the previous round
* barrier synchronization: processes may not enter the next round before all processes have finished the previous one
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 42
Objectives: Point out the methods
In the lecture:
* Explain the methods for the prefix sums.
* Point out other applications of these methods.
--------------------------------------------------------------------------------

PPJ-43 Barriers

Several processes meet at a common point of synchronization.

Rule: All processes must have reached the barrier (for the j-th time) before one of them leaves it (for the j-th time).

Applications:
* iterative computations, where iteration j uses results of iteration j-1
* separation of computational phases

Scheme:

public void run () {
  do {
    computeNewValues (i);
    b.barrier();
  } while (!converged);
}

Implementation techniques for barriers:
* central controller: monitor or coordination process
* worker processes coordinated as a tree
* worker processes symmetrically coordinated (butterfly barrier, dissemination barrier)

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 43
Objectives: Understand the concept of barriers
In the lecture: Explain
* the barrier rule,
* the relation to the prefix sums,
* applications.
--------------------------------------------------------------------------------

PPJ-44 Barrier implemented by a monitor

The monitor stops a given number of processes and releases them together:

class BarrierMonitor {
  private int processes;   // number of processes to be synchronized
  private int arrived = 0; // number of processes arrived at the barrier

  public BarrierMonitor (int procs) { processes = procs; }

  synchronized public void barrier () {
    arrived++;
    if (arrived < processes)
      try { wait(); }
      catch (InterruptedException e) {} // an interruption destroys the barrier behaviour
    else {
      arrived = 0;   // reset arrival count
      notifyAll();   // release the other processes
    }
  }
}
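A small demo of how the monitor can be used (the BarrierMonitor is the class from the slide, repeated here so the sketch is self-contained; the worker scaffolding and all names around it are mine): each worker increments its round counter, and after the barrier all workers must observe every counter at the same value. A second barrier separates this check from the next round.

```java
// Demo: 4 workers run 3 rounds; the barrier guarantees that no worker
// enters round r+1 before all workers have finished round r.
class BarrierDemo {
    static class BarrierMonitor {                 // monitor from the slide
        private int processes, arrived = 0;
        BarrierMonitor(int procs) { processes = procs; }
        synchronized void barrier() {
            arrived++;
            if (arrived < processes)
                try { wait(); } catch (InterruptedException e) {}
            else { arrived = 0; notifyAll(); }    // last arrival releases all
        }
    }

    static final int N = 4, ROUNDS = 3;
    static final int[] round = new int[N];        // per-worker round counter
    static volatile boolean allInSameRound = true;

    public static void main(String[] args) {
        BarrierMonitor bm = new BarrierMonitor(N);
        Thread[] t = new Thread[N];
        for (int i = 0; i < N; i++) {
            final int id = i;
            t[i] = new Thread(() -> {
                for (int r = 0; r < ROUNDS; r++) {
                    round[id]++;
                    bm.barrier();                 // now all workers are in the same round
                    for (int other = 0; other < N; other++)
                        if (round[other] != round[id]) allInSameRound = false;
                    bm.barrier();                 // separates the check from the next round
                }
            });
            t[i].start();
        }
        for (Thread th : t)
            try { th.join(); } catch (InterruptedException e) {}
        System.out.println("consistent rounds: " + allInSameRound);
    }
}
```

Without the two barriers a fast worker could race ahead, and the check would see differing round counters.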
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 44
Objectives: Understand the monitor implementation
In the lecture: Explain
* the implementation,
* why waiting in a loop is not necessary.
Questions:
* Why does this central solution cause a bottleneck?
--------------------------------------------------------------------------------

PPJ-45 Distributed tree barrier

Barrier synchronization of the worker processes organized as a binary tree; the bottleneck of central synchronization is avoided.

2 synchronization variables (flags) at each node:
* arrived: all processes in the subtree have arrived; is propagated upward
* continue: all processes in the subtree may continue; is propagated downward
(diagram: binary tree of nodes, each carrying the flags a and c)

disadvantage: different code is needed for the root, for inner nodes, and for leaves

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 45
Objectives: Understand the tree barrier
In the lecture: Explain
* the principle of the 2 phases,
* the advantage of the distributed solution.
--------------------------------------------------------------------------------

PPJ-45a 2 Rules for Synchronization Using Flags

A flag synchronizes exactly 2 processes: one process waits until the flag is set; the other process sets the flag. It may be implemented by a monitor in Java.

Flag rules:
1. The process that waits for a flag resets it.
2. A flag that is set may not be set again before it has been reset.

Consequence: no state change is lost.

timeline (flag states f==0, f==1, f==0, ...):
process p waits for f==1, then resets f:=0;
process q ensures f==0 before it sets f:=1.
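One possible Java monitor for such a flag, as asked for in the assignment below (a sketch, not the official solution; class and method names are mine). Both operations block until their flag rule allows them to proceed:

```java
// A flag for synchronization between exactly 2 processes.
// Rule 1: the waiting process resets the flag.
// Rule 2: a set flag may not be set again before it has been reset.
class Flag {
    private boolean flag = false;

    // called by the waiting process: wait for f==1, then reset f:=0 (rule 1)
    synchronized void waitAndReset() {
        while (!flag)
            try { wait(); } catch (InterruptedException e) {}
        flag = false;
        notifyAll();               // a setter blocked by rule 2 may proceed
    }

    // called by the setting process: ensure f==0, then set f:=1 (rule 2)
    synchronized void set() {
        while (flag)
            try { wait(); } catch (InterruptedException e) {}
        flag = true;
        notifyAll();               // wake the waiting process
    }

    public static void main(String[] args) throws InterruptedException {
        Flag ready = new Flag();
        Thread worker = new Thread(() -> {
            ready.waitAndReset();  // blocks until main sets the flag
            System.out.println("worker released");
        });
        worker.start();
        ready.set();
        worker.join();
    }
}
```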
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 45a
Objectives: Understand flag synchronization
In the lecture: Explain
* the general flag rules.
Assignments:
* Design a Java class for flag synchronization between 2 processes. Ensure that the flag rules are obeyed.
--------------------------------------------------------------------------------

PPJ-45b Distributed tree barrier implementation

2 synchronization variables (flags) at each node:
* arrived: all processes in the subtree have arrived; is propagated upward
* continue: all processes in the subtree may continue; is propagated downward
Initially all flags are reset.

code for an inner node; an x marks the lines that leaves and the root execute, too:

                                                leaf  root
  execute this.task();                           x     x
  wait for left.arrived; reset left.arrived;           x
  wait for right.arrived; reset right.arrived;         x
  set this.arrived;                              x
  wait for this.continue; reset this.continue;   x
  set left.continue;                                   x
  set right.continue;                                  x

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 45b
Objectives: Understand the tree barrier
In the lecture: Explain
* the different code for the 3 kinds of nodes.
Assignments:
* Write the code for the 3 kinds of nodes using objects of the flag class.
--------------------------------------------------------------------------------

PPJ-46 Symmetric, distributed barrier (dissemination)

Processes synchronize pairwise in rounds with doubled distances. N processes are synchronized after r rounds if N <= 2^r.

In round s process i indicates its arrival and then waits for the arrival of process (i + N - 2^(s-1)) modulo N.

(diagram: processes 0..5 in rounds 1..3; an arc from i to j means that j waits for the arrival of i)

After r rounds each process is synchronized with every other one. Proof idea: for each process i every other process occurs in a tree of processes which have synchronized (in)directly with i.
--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 46
Objectives: Understand the dissemination barrier
In the lecture:
* Symmetric code for arbitrarily many processes.
* An arc from i to j in the diagram means that j waits for the arrival of i.
* Show the synchronization for pairs.
* No cyclic waiting, because the arrival is indicated first, then the partner is waited for.
* After the last round all processes are synchronized, because for every process p a binary tree exists such that p is its root, all processes are in that tree, and the arcs are waiting pairs from the diagram forming paths from the leaves to the root.
Questions:
* Write the synchronization code.
* Show one of the binary trees.
--------------------------------------------------------------------------------

PPJ-46a Symmetric, distributed barrier: implementation

In round s process i indicates its arrival and waits for the arrival of process (i + N - 2^(s-1)) modulo N.

Code for each process:

execute this.task();
// synchronize:
s = 0;
while (N > 2^s) {
  s++;
  wait for f == 0; set f = 1;                    // indicate own arrival (flag rules!)
  partner = p[(i + N - 2^(s-1)) modulo N];
  wait for partner.f == 1; reset partner.f = 0;  // wait for the partner's arrival
}

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 46a
Objectives: Understand the dissemination barrier
In the lecture:
* Processes have to wait before they set AND before they reset the flag.
* Symmetric code for arbitrarily many processes.
Questions:
* Write the synchronization code.
* Show one of the binary trees.
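The round and partner computations of the slide can be written out directly; a small sketch (class and method names are mine) with 2^(s-1) expressed as a shift:

```java
// Dissemination barrier bookkeeping: number of rounds and the partner
// whose arrival process i awaits in round s (s = 1 .. rounds(n)).
class Dissemination {
    // smallest r with n <= 2^r
    static int rounds(int n) {
        int r = 0;
        while ((1 << r) < n) r++;
        return r;
    }

    // process whose arrival i awaits in round s: (i + n - 2^(s-1)) mod n
    static int partner(int i, int s, int n) {
        return (i + n - (1 << (s - 1))) % n;
    }

    public static void main(String[] args) {
        int n = 6;                              // the 6 processes of the diagram
        for (int s = 1; s <= rounds(n); s++) {
            System.out.print("round " + s + ":");
            for (int i = 0; i < n; i++)
                System.out.print(" " + i + "<-" + partner(i, s, n));
            System.out.println();
        }
    }
}
```

For N = 6 this yields 3 rounds (6 <= 2^3); process 0 waits for process 5 in round 1, for process 4 in round 2, and for process 2 in round 3.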
--------------------------------------------------------------------------------

PPJ-47 Prefix sums with barriers

class PrefixSum extends Thread {
  private int procNo;        // number of this process
  private BarrierMonitor bm; // barrier object

  public PrefixSum (int p, BarrierMonitor b) { procNo = p; bm = b; }

  public void run () {
    int addIt = 0, dist = 1; // distance
    // global arrays a and s
    s[procNo] = a[procNo];   // initialize result array
    bm.barrier();
    // invariant SUM: s[procNo] == a[procNo-dist+1]+...+a[procNo]
    while (dist < N) {
      if (procNo - dist >= 0)
        addIt = s[procNo - dist]; // save the value before it is overwritten
      bm.barrier();
      if (procNo - dist >= 0)
        s[procNo] += addIt;
      bm.barrier();
      dist = dist * 2;            // doubled distance
    }
  }
}

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 47
Objectives: Examples for synchronization points
In the lecture: Explain
* the invariant,
* the access of s[procNo],
* the reasons for the 3 synchronization points.
Questions:
* Explain the reasons for the 3 synchronization points.
--------------------------------------------------------------------------------

PPJ-48 Prefix sums in a synchronous parallel programming model

Notation in Modula-2* with synchronous (and asynchronous) loops for parallel machines:

VAR a, s, t: ARRAY [0..N-1] OF INTEGER;
VAR dist: CARDINAL;
BEGIN ...
  FORALL i: [0..N-1] IN SYNC        (* parallel loop in lock step *)
    s[i] := a[i];
  END;
  dist := 1;
  WHILE dist < N DO
    FORALL i: [0..N-1] IN SYNC      (* parallel loop in lock step *)
      IF (i - dist) >= 0 THEN
        t[i] := s[i - dist];        (* implicit barrier *)
        s[i] := s[i] + t[i];        (* for each statement *)
      END
    END;
    dist := dist * 2;
  END
END

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 48
Objectives: Implicit barriers
In the lecture:
* Explain the language constructs.
* If expressions were evaluated in lock step, too, the array t could be omitted.
* The MasPar SIMD machine would be programmed similarly.
Questions:
* Explain the execution if the values were not saved in t[i].
--------------------------------------------------------------------------------

PPJ-49 Finding list ends: data parallel approach

input: int array link stores lists; link[i] contains the index of the successor of i, or nil
output: int array last; last[i] contains the index of the last element of the list starting at i
method: worker process i computes last[i] = last[last[i]] in log N rounds

int d = 1;
last[i] = link[i];
barrier;
while (d < N) {
  int newlast = nil;
  if (last[i] != nil && last[last[i]] != nil)
    newlast = last[last[i]];
  barrier;
  if (newlast != nil)
    last[i] = newlast;
  barrier;
  d = 2*d;
}
// invariant: last[i] points to the end of those lists which are not longer than d

(diagrams: the lists after each round)

--------------------------------------------------------------------------------
Lecture Parallel Programming WS 2014/2015 / Slide 49
Objectives: Data parallelism not only for arrays!
In the lecture: Explain
* parallel scanning of lists,
* doubling distances for lists,
* last[last[i]],
* that it is only useful if the ends of many long lists are searched.
Questions:
* What role does the distance d play here?
--------------------------------------------------------------------------------

C-5.11 / PPJ-50 5.2 / 6. Data Parallelism: Loop Parallelization

Development steps (automated by compilers):

* nested loops operating on arrays, sequential execution of the iteration space:

  DECLARE B[0..N, 0..N+1]
  FOR I := 1 .. N
    FOR J := 1 .. I
      B[I,J] := B[I-1,J] + B[I-1,J-1]
    END FOR
  END FOR

* analyze data dependences: data-flow from definitions to uses of array elements
* transform loops: keep data dependences forward in time
* parallelize inner loop(s): map to a field or vector of processors
* map arrays to processors such that many accesses are local; transform index spaces

(diagrams: iteration spaces in (i, j) coordinates before and after the transformation)
--------------------------------------------------------------------------------
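The dependences of the loop nest above illustrate the point: every element of row I reads only row I-1, so for a fixed I all iterations of the inner J loop are independent and may run in parallel, while the outer I loop must stay sequential. A sketch in Java (the class name and the initialization B[0,0] = 1 are my assumptions; the parallel stream stands in for the field of processors):

```java
// B[I][J] := B[I-1][J] + B[I-1][J-1]: row i depends only on row i-1,
// so the inner loop over j is parallelized, the outer loop is not.
class LoopParallel {
    static int[][] compute(int n) {
        int[][] b = new int[n + 1][n + 2];
        b[0][0] = 1;                          // assumed initialization
        for (int i = 1; i <= n; i++) {        // sequential: row i needs row i-1
            final int row = i;
            java.util.stream.IntStream.rangeClosed(1, i).parallel()
                .forEach(j -> b[row][j] = b[row - 1][j] + b[row - 1][j - 1]);
        }
        return b;
    }

    public static void main(String[] args) {
        // with B[0,0] = 1 the nest computes Pascal's triangle row by row
        System.out.println(java.util.Arrays.toString(compute(4)[4]));
    }
}
```

Each parallel iteration writes a distinct element b[row][j] and reads only the row produced before the stream starts, so no two iterations conflict.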