Matrix factorizations on multicores with OpenMP (Calcul Réparti et Grid Computing)

alfredo.buttari@enseeiht.fr

for an up-to-date version of the slides:
http://buttari.perso.enseeiht.fr
Introduction

Objective of these lectures
Present the main issues and challenges we have to face when programming shared memory (multicore) computers. Understand the limits of an algorithm and show techniques to achieve better parallelism and performance.

Why
Because multicore processors are ubiquitous

How
We will take a reference algorithm from a book, we will show why it is not suited for parallelization and we will modify it producing different variants that achieve better and better parallelism and performance
Introduction

Which algorithm?
The Householder QR factorization
  ▶ because you have studied it (*Algèbre Linéaire Numérique*, first year)
  ▶ because there’s a lot of literature about it and its parallelization

Which tools for the parallelization?
OpenMP
  ▶ because you have studied it (*Systèmes Concurrents*, second year)
  ▶ because it’s simple
Part I

(Very) Brief OpenMP resume
Section 1

OpenMP
OpenMP (Open specifications for MultiProcessing) is an Application Program Interface (API) to explicitly direct multi-threaded, shared memory parallelism.

- Comprised of three primary API components:
  - Compiler directives (OpenMP is a compiler technology)
  - Runtime library routines
  - Environment variables
- Portable:
  - Specifications for C/C++ and Fortran
  - Already available on many systems (including Linux, Win, IBM, SGI etc.)
- Full specs
  http://openmp.org
- Tutorial
  https://computing.llnl.gov/tutorials/openMP/
How to program multicores: OpenMP

OpenMP is based on a *fork-join* execution model:

- Execution is started by a single thread called *master thread*
- when a parallel region is encountered, the master thread spawns a set of threads
- the set of instructions enclosed in a parallel region is executed
- at the end of the parallel region all the threads synchronize and terminate leaving only the master
How to program multicores: OpenMP

Parallel regions and other OpenMP constructs are defined by means of compiler directives:

```c
#include <omp.h>

main () {
    int var1, var2, var3;
    /* Serial code */

#pragma omp parallel private(var1, var2)
    shared(var3)
    {
        /* Parallel section executed by all threads */
    }
    /* Resume serial code */
}
```

```fortran
program hello
    integer :: var1, var2, var3
    ! Serial code

    !$omp parallel private(var1, var2)
    !$omp& shared(var3)
    ! Parallel section executed by all threads

    !$omp end parallel
    ! Resume serial code

end program hello
```
OpenMP: worksharing constructs

The DO/for directive:

```fortran
program do_example

  integer :: i, chunk
  integer, parameter :: n=1000, chunksize=100
  real(kind(1.d0)) :: a(n), b(n), c(n)

  ! Some sequential code...
  chunk = chunksize

  !$omp parallel shared(a,b,c) private(i)

  !$omp do
do i = 1, n
    c(i) = a(i) + b(i)
  end do
  !$omp end do

  !$omp end parallel

end program do_example
```
OpenMP: the task construct

Example of the TASK construct:

```fortran
program example_task
    integer :: i, n
    n = 10

    !$omp parallel
    !$omp master
        do i=1, n
        !$omp task firstprivate(i)
            call tsub(i)
        !$omp end task
        end do
    !$omp end master
    !$omp end parallel

    stop
end program example_task

subroutine tsub(i)
    integer :: i
    integer :: iam, nt, omp_get_num_threads, &
        &omp_get_thread_num

    iam = omp_get_thread_num()
    nt = omp_get_num_threads()

    write(*,')(''iam::,i2,'' nt::,i2,'' i::,i4''))iam,nt,i

    return
end subroutine tsub
```

<table>
<thead>
<tr>
<th>iam</th>
<th>nt</th>
<th>i</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>4</td>
<td>2</td>
</tr>
<tr>
<td>0</td>
<td>4</td>
<td>4</td>
</tr>
<tr>
<td>1</td>
<td>4</td>
<td>1</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td>0</td>
<td>4</td>
<td>7</td>
</tr>
<tr>
<td>2</td>
<td>4</td>
<td>6</td>
</tr>
<tr>
<td>1</td>
<td>4</td>
<td>8</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
<td>9</td>
</tr>
<tr>
<td>0</td>
<td>4</td>
<td>10</td>
</tr>
</tbody>
</table>
Data scoping in tasks

The data scoping clauses shared, private and firstprivate, when used with the task construct are not related to the threads but to the tasks.

- **shared(x)** means that *x* means that within the task (at the moment when it is executed) *x* is the same variable (not its value but really the same memory location) as when the task was created.

- **private(x)** means that *x* is private to the task, i.e., when the task is created, a brand new variable *x* is created as well. This new copy goes out of scope (i.e., does not exist anymore) when the task is finished.

- **firstprivate(x)** means that *x* is private to the task, i.e., when the task is created, a brand new variable *x* is created as well and its value is set to be the same as the value of *x* in the enclosing context at the moment when the task is created. This new copy goes out of scope (i.e., does not exist anymore) when the task is finished.
Data scoping in tasks

```fortran
program example_task
    integer :: x, y, z, j
    !$omp parallel private(x,y)
    ...
    !$omp master
    j = 2
    !$omp task ! x is implicitly private, j shared
    x = j
    !$omp end task

    j = 4
    !$omp task ! y is implicitly private, j shared
    y = j
    !$omp end task

    !$omp taskwait

    z = x+y
    !$omp end master
    ...
    !$omp end parallel
end program example_task
```
Data scoping in tasks

```fortran
program example_task

    integer :: x, y, z, j

!$omp parallel private(x,y)
...
!$omp master

j = 2
!$omp task shared(x) firstprivate(j)
x = j
!$omp end task

j = 4
!$omp task shared(y) firstprivate(j)
y = j
!$omp end task

!$omp taskwait

z = x+y

!$omp end master
...
!$omp end parallel

end program example_task
```
OpenMP: the task construct

The `depend` clause enforces additional constraints on the scheduling of tasks.

Task dependences are derived from the dependence-type of a depend clause and its list items, where dependence-type is one of the following:

- The **in** dependence-type. The generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout dependence-type list.

- The **out** and **inout** dependence-types. The generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout dependence-type list.
OMP tasks: example

Thanks to the specified dependencies the OpenMP runtime can build a graph of dependencies and schedule the tasks accordingly.

```
!$omp parallel
!$omp single
!$omp task depend(out:a)
  a = f_a()
!$omp end task

!$omp task depend(out:b)
  b = f_b()
!$omp end task

!$omp task depend(out:c)
  c = f_c()
!$omp end task

!$omp task depend(in:b,c) depend(out:x)
  x = f1(b, c)
!$omp end task

!$omp task depend(in:a,x) depend(out:y)
  y = f2(a, x)
!$omp end task

!$omp end single
!$omp end parallel
```
Target architecture

Our target architecture has four, hexa-core AMD Opteron 8431 processors connected through HyperTransport links in a ring topology with a memory module attached to each of them.

- Clock frequency 2.4GHz → peak performance in DP
  \[ = 2.4 \times 2 \times 2 = 9.6 \text{ Gflop/s} \] (the first 2 is for SSE units, the second is for dual-issue);

- Memory frequency = 667 MHz → peak memory bandwidth
  \[ = 0.667 \times 2 \times 8 = 10.6 \text{ GB/s} \] (the 2 is for dual-channel and the 8 is for the bus width).
Part II

Roofline model and BLAS routines
Performance modeling and evaluation

How to model and evaluate the performance of a sequential and/or multithreaded code?

Assumptions:

- An operation is a succession of data transfers and computations.
- Computations and data transfers can be done at the same time: we will assume that while doing some computations, we can prefetch data for the computations step.
- Once in the cache, the access to data costs nothing.
Performance modeling and evaluation

= $\frac{1}{9.6} = 0.102 \ \text{nsec}$

= $\frac{8}{10.6} = 0.755 \ \text{nsec}$
Performance modeling and evaluation

= 1/9.6 = 0.102 nsec

= 8/10.6 = 0.755 nsec

1.325 Gflop/s
Performance modeling and evaluation

= 1/9.6 = 0.102 nsec

= 8/10.6 = 0.755 nsec

1.325 Gflop/s

2.650 Gflop/s
Performance modeling and evaluation

- $\frac{1}{9.6} = 0.102$ nsec
- $\frac{8}{10.6} = 0.755$ nsec

- 1.325 Gflop/s
- 2.650 Gflop/s
- 9.600 Gflop/s
Operational intensity

The Operational Intensity is defined as the number of operations per data transfer.
How is this quantity important? A modern processor (single-core for the moment) has a typical peak performance around 10 Gflop/s for double-precision computations and a typical memory bandwidth of 10 GB/s = 10/8 GDW/s. This means that a single core can process data much faster than the memory can stream it. This is why cache memories exist.

Caches
Data that is reused over time (temporal locality) or which is used right after some other contiguous data (spatial locality) can be kept in cache memories and accessed much faster.

Therefore, if we can reuse a data multiple times once we have brought it into cache, performance improves.
The roofline model \[1\] is a method for computing a performance upper bound as a function of the operational intensity:

\[
\text{Attainable Gflop/s} = \min \left\{ \begin{array}{c}
\text{Peak Floating-point Performance} \\
\text{Peak Memory Bandwidth} \times \text{Operational intensity}
\end{array} \right\}
\]
The roofline model

The **Roofline model** [1] is a method for computing a performance upper bound as a function of the *operational intensity*:

\[
\text{Attainable Gflop/s} = \min \left\{ \frac{9.6}{10.6/8} \times \text{Operational intensity} \right\}
\]

![Roofline model for 1 core](image-url)
The roofline model

How is the roofline model adapted to the case of multiple cores?
The roofline model

How is the roofline model adapted to the case of multiple cores?
The roofline model

How is the roofline model adapted to the case of multiple cores?
The roofline model

How is the roofline model adapted to the case of multiple cores?

Roofline model for 7 cores
The roofline model

How is the roofline model adapted to the case of multiple cores?
The BLAS [2, 3, 4] library offers subroutines for basic linear algebra operations (it’s called Basic Linear Algebra Subroutines for a reason...). These routines are grouped in three levels

**Level-1**

vector or vector-vector operations such as

- _axpy:_ \( y = \alpha x + y \)
- _dot:_ \( dot = x^T y \)
- _nrm2:_ \( nrm2 = \|x\| \)
BLAS routines

**Level-2**

matrix-vector operations such as

- \_gemv: \( y = \alpha Ax + \beta y \)
- \_ger: \( A = \alpha xy^T + A \)
- \_trsv: \( x = T^{-1}x \) (\( T \) triangular)

**Level-3**

matrix-matrix operations such as

- \_gemm: \( C = \alpha AB + \beta C \)
- \_trsm: \( \alpha X = T^{-1}X \) (\( T \) triangular)
BLAS routines

From the point of view of performance, there is a considerable difference among the three levels and it is due to the different ratios between floating-point operations and memory-to-cpu data transfers:

- **Level-1** routines typically perform $O(n)$ floating-point operations on $O(n)$ values
- **Level-2** routines typically perform $O(n^2)$ floating-point operations on $O(n^2)$ values
- **Level-3** routines typically perform $O(n^3)$ floating-point operations on $O(n^2)$ values

This means that, unlike in Level-1 and 2 where each coefficient is used only once (no locality, operational intensity $O(1)$), in Level-3 routines each coefficient is used $n$ times (lots of locality, operational intensity $O(n)$).
BLAS routines

Although we normally handle matrices that don’t fit in any cache level, the locality can be exploited with blocking. Assume the operation $A = A + BC$ with $A$, $B$ and $C$ square of size 1000 and a fully-associative cache of size 1000 coefficients with line size of 1 coefficient.

\[
\begin{array}{c}
A \\
\begin{array}{|c|c|c|}
\hline
\end{array} \\
\end{array} 
\quad 
\begin{array}{c}
B \\
\begin{array}{|c|c|c|}
\hline
\end{array} \\
\end{array} 
\quad 
\begin{array}{c}
C \\
\begin{array}{|c|c|c|}
\hline
\end{array} \\
\end{array}
\]
BLAS routines

Although we normally handle matrices that don’t fit in any cache level, the locality can be exploited with blocking. Assume the operation $A = A + BC$ with $A$, $B$ and $C$ square of size 1000 and a fully-associative cache of size 1000 coefficients with line size of 1 coefficient.

\begin{align*}
\text{unblocked} &= 2n^3 = 2^G \\
\text{blocked} &= (b^2 + 2bn) \times ((n/b)^2) = 67M (b = 30)
\end{align*}
Although we normally handle matrices that don’t fit in any cache level, the locality can be exploited with blocking. Assume the operation $A = A + BC$ with $A$, $B$ and $C$ square of size 1000 and a fully-associative cache of size 1000 coefficients with line size of 1 coefficient.
BLAS routines

Although we normally handle matrices that don’t fit in any cache level, the locality can be exploited with blocking. Assume the operation $A = A + BC$ with $A$, $B$ and $C$ square of size 1000 and a fully-associative cache of size 1000 coefficients with line size of 1 coefficient.

\[
\begin{align*}
\text{A} & \quad \text{B} & \quad \text{C} \\
\begin{array}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\end{array} & \begin{array}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\end{array} & \begin{array}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\end{array} \\
\begin{array}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\end{array} & \begin{array}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\end{array} & \begin{array}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\end{array} \\
\end{align*}
\]
BLAS routines

Although we normally handle matrices that don’t fit in any cache level, the locality can be exploited with blocking. Assume the operation $A = A + BC$ with $A$, $B$ and $C$ square of size 1000 and a fully-associative cache of size 1000 coefficients with line size of 1 coefficient.

number of cache misses:

- **unblocked** $= 2n^3 = 2G$
BLAS routines

Although we normally handle matrices that don’t fit in any cache level, the locality can be exploited with blocking. Assume the operation $A = A + BC$ with $A$, $B$ and $C$ square of size 1000 and a fully-associative cache of size 1000 coefficients with line size of 1 coefficient.

Number of cache misses:

- **unblocked** $= 2n^3 = 2G$
BLAS routines

Although we normally handle matrices that don’t fit in any cache level, the locality can be exploited with blocking. Assume the operation $A = A + BC$ with $A$, $B$ and $C$ square of size 1000 and a fully-associative cache of size 1000 coefficients with line size of 1 coefficient.

number of cache misses:

- unblocked $= 2n^3 = 2G$
BLAS routines

Although we normally handle matrices that don’t fit in any cache level, the locality can be exploited with blocking. Assume the operation $A = A + BC$ with $A$, $B$ and $C$ square of size 1000 and a fully-associative cache of size 1000 coefficients with line size of 1 coefficient.

\[
\text{number of cache misses:} \quad \begin{array}{c}
\text{unblocked} = 2n^3 = 2G \\
\end{array}
\]
BLAS routines

Although we normally handle matrices that don’t fit in any cache level, the locality can be exploited with blocking. Assume the operation $A = A + BC$ with $A$, $B$ and $C$ square of size 1000 and a fully-associative cache of size 1000 coefficients with line size of 1 coefficient.

![Matrix A](image1)

![Matrix B](image2)

![Matrix C](image3)

Number of cache misses:

- Unblocked $= 2n^3 = 2G$
BLAS routines

Although we normally handle matrices that don’t fit in any cache level, the locality can be exploited with blocking. Assume the operation $A = A + BC$ with $A$, $B$ and $C$ square of size 1000 and a fully-associative cache of size 1000 coefficients with line size of 1 coefficient.

Number of cache misses:
- **unblocked** = $2n^3 = 2G$
- **blocked** = $(b^2 + 2bn) \times ((n/b)^2) = 67M \quad (b = 30)$
BLAS routines

Because modern processors are equipped with multiple levels of cache memories plus TLB, nested blocking is commonly used to achieve a good memory locality at each level:
BLAS routines

Because modern processors are equipped with multiple levels of cache memories plus TLB, nested blocking is commonly used to achieve a good memory locality at each level:
BLAS routines

Because modern processors are equipped with multiple levels of cache memories plus TLB, nested blocking is commonly used to achieve a good memory locality at each level:
Because modern processors are equipped with multiple levels of cache memories plus TLB, nested blocking is commonly used to achieve a good memory locality at each level:

More about BLAS optimization techniques can be found in ATLAS [5], or GotoBLAS [6, 7].
BLAS routines

Here is how Level-1, 2 and 3 BLAS perform on our reference architecture:

<table>
<thead>
<tr>
<th>routine</th>
<th>#ops</th>
<th>#data</th>
<th>Gflop/s</th>
<th>RM</th>
</tr>
</thead>
<tbody>
<tr>
<td>daxpy</td>
<td>$2n$</td>
<td>$3n$</td>
<td>0.46</td>
<td>0.88</td>
</tr>
<tr>
<td>dgemv</td>
<td>$2n^2$</td>
<td>$n^2 + 2n$</td>
<td>1.18</td>
<td>2.65</td>
</tr>
<tr>
<td>dgemm</td>
<td>$2n^3$</td>
<td>$4n^2$</td>
<td>8.95</td>
<td>9.60</td>
</tr>
</tbody>
</table>

Roofline model for 1 core
Multithreaded BLAS routines

As long as we stay on a single socket, it doesn’t help much to use more cores for memory bound operations like Level-2 BLAS.
Part III

Dense factorizations on multicores
Point QR factorization

The “Matrix Computations” book by Golub and Van Loan [8] presents the Householder algorithm for the QR factorization of a matrix $A$ of size $m \times n$ as

$$
\text{for } j = 1 : n \\
[\nu, \beta] = \text{house}(A(j : m, j)) \\
A(J : m, j : n) = (I_{m-j+1} - \beta \nu \nu^T)A(j : m, j : n) \\
\text{if } j < m \\
A(j + 1 : m, j) = \nu(2 : m - j + 1) \\
\text{end} \\
\text{end}
$$

This algorithm decomposes the matrix $A$ into the product of an orthogonal matrix $Q$ and a triangular one $R$ and can be used to solve linear systems of any shape. Its cost is (assuming $m \geq n$)

$$
\sum_{k=1}^{n} 4(m - k + 1)(n - k) = 2n^2(m - \frac{n}{3})
$$
Householder QR

The $V$ and $R$ matrices replace the $A$ matrix:

\begin{verbatim}
  do  k = min(m,n)
    ! Generate elementary reflector $H(k)$
    ! to annihilate $A(k+1:m,k)$
    call dlarfp(m-k+1, a( k, k ), a(min(k+1,m),k), &
                 & 1, tau( k ) )
    ! Apply $H(i)$ to $A(k:m,k+1:n)$ from the left
    akk = a(k,k)
    a(k,k) = 1.d0
    call dlarf('left', m-k+1, n-k, a(k,k), 1, tau(k), &
                & a(k,k+1), m, work )
    a(k,k) = ak
  end do
\end{verbatim}

LAPACK point QR: \texttt{dgeqr2}

This algorithm/code is never used in practice because it is extremely inefficient because it is based on Level-2 BLAS operations (\texttt{dlarf})
Block algorithms

The point QR algorithm we have seen so far only use Level-1 and 2 routines: `dnrm2` (inside `dlarfp`), `dgemv` (inside `dlarf`) etc. As a result, it cannot achieve a reasonable performance on modern processors with (deep) cache hierarchies.

Considerable performance improvements can be obtained if operations are rearranged in such a way that the factorization is achieved through the following steps:

1. factorize a small portion of the matrix using unblocked, Level-2 BLAS code and accumulate the computed transformations. This small portion is commonly referred to as panel;
2. apply the accumulated transformations to the trailing submatrix at once using Level-3 routines;
3. repeat steps 1 and 2 on the trailing submatrix.

These algorithms are commonly called block (or blocked) and are at the base of the LAPACK library.
Block Householder QR

Take a matrix $A$ of size $m \times n$ and partition it as below with $A_{11}$ of size $b \times b$, $b \ll m, n$ and $A_{22}$ of size $(m - b) \times (n - b)$.

The basic step of the block QR factorization is

\[
Q_1^T \begin{bmatrix}
A_{11} & A_{12} \\
A_{21} & A_{22}
\end{bmatrix} = \begin{bmatrix}
R_{11} & R_{12} \\
\tilde{A}_{22}
\end{bmatrix}
\]

achieved through the following operations [14]

\[
\begin{bmatrix}
A_{11} \\
A_{21}
\end{bmatrix} \rightarrow \begin{bmatrix}
V_{11} \\
V_{21}
\end{bmatrix}, \ [R_{11}], \ [T_1]
\]

\[
\begin{bmatrix}
R_{12} \\
\tilde{A}_{22}
\end{bmatrix} = \left( I - \begin{bmatrix}
V_{11} \\
V_{21}
\end{bmatrix} \cdot [T_{11}^T] \cdot \begin{bmatrix} V_{11}^T & V_{21}^T \end{bmatrix} \right) \begin{bmatrix}
A_{12} \\
A_{22}
\end{bmatrix}
\]
Block Householder QR

\begin{verbatim}
do k=1, n, b
    ! factorize the panel and compute T
    call dgeqrt(m-k+1, b, a(k,k), lda, &
                & T, ldt, &
                & w, info)
    ! use T and V to update the trailing submatrix
    call dgemqrt('l', 't', &
                & m-k+1, n-k-b+1, b, ib, &
                & a(k,k), lda, &    ! the V reflectors
                & t, ldt, &    ! the T matrix
                & a(k,k+b), lda, &    ! the trailing submatrix
                & w, info)
end do
\end{verbatim}

LAPACK block QR: \texttt{dgeqrf} (more or less)

The \texttt{dgemqrt} routine uses Level-3 operations such as \texttt{dgemm} and \texttt{dtrmm}.
Block factorizations

Experimental results on AMD Opteron 2.4GHz:

- reference LAPACK from Netlib
- ACML BLAS

![Graph showing block vs block Gflop/s for different matrix sizes. The graph compares performance using dpotrf, dgeqrf, and dgetrf.](image)
LAPACK block factorizations are essentially an interleaved sequence of Level-2 (the panels reductions) and Level-3 (the trailing submatrix updates) operations. The first are not parallelizable (to some extent), the second yes.

The naïve way of speeding up LAPACK factorizations on multicore systems is simply to link with a multithreaded BLAS. This is commonly referred to as fork-join parallelism.

Amdahl says this is not a good choice.

In the following we will focus on the QR factorization (without the hassle of pivoting) but the same holds for LU and Cholesky.
The forks and joins are more evident if the trailing submatrix update is explicitly parallelized with a 1d block-column partitioning.
Fork-join multithreading

The forks and joins are more evident if the trailing submatrix update is explicitly parallelized with a 1d block-column partitioning
Fork-join multithreading

The forks and joins are more evident if the trailing submatrix update is explicitly parallelized with a 1d block-column partitioning.
Fork-join multithreading

The forks and joins are more evident if the trailing submatrix update is explicitly parallelized with a 1d block-column partitioning.
do k=1, n, b
   ! factorize the panel and compute T
   call dgeqrt(m-k+1, b, a(k,k), lda, &
             & T, ldt, &
             & w, info)
   ! use T and V to update the trailing submatrix
   !$omp parallel do
   do c=k+b, n, b
      call dgemqrt('l', 't', 'f', 'c', &
                 & n-k+1, b, b, ib, &
                 & a(k,k), lda, &  ! the V reflectors
                 & t, ldt, &       ! the T matrix
                 & a(k,c), lda, &  ! the trailing submatrix
                 & w, ldw)
   end do
   !$omp end parallel
end do
DAG and critical path

The dependency graph for the previous algorithm (assuming 5 block-columns) is

- Critical Path: the longest directed path between input and output nodes
- Critical Path Length: the sum of weights along the critical path
- Average Degree of Concurrency: $\frac{\sum_{i \in DG} W_i}{\sum_{i \in CP} W_i}$
Fork-join multithreading: performance model

- A panel of size $mb \times b$ takes $\frac{2}{3}(3m - 1)b^3$ flops (or $2(3m - 1)$ time units)
- An update of size $mb \times b$ takes $(4m - 2)b^3$ flops (or $3(4m - 2)$ time units)

Assume a matrix of size $mb \times nb$ with $m = 4$, $n = 3$

Best possible speedup is

$$\frac{\sum_{i \in DG} w_i}{\sum_{i \in CP} w_i} = \frac{162}{120} = 1.35$$
Fork-join multithreading: performance model

For a matrix of size $mb \times nb$ (with $b$ being the block-column size), the best possible speedup with the fork-join approach is

$$\frac{\sum_{k=1}^{n} 2(3(m - k + 1) - 1) + (n - k)(4(m - k + 1) - 1)}{\sum_{k=1}^{n} 2(3(m - k + 1) - 1) + (4(m - k + 1) - 1)}$$

$$= \frac{2mn^2 - \frac{2}{3}n^3}{6mn - 3n^2}$$
Fork-join multithreading

As expected, this actually does not lead to great performance.

The achieved speedup is less than 9 on 24 cores.
Fork-join multithreading

The poor scalability is due to the fact that all the threads but one are idle when the panel reduction is performed.
Lookahead

It has to be noted that the reduction of panel $k + 1$ can be done independently of the update of all the columns $k + 2, \ldots, n$ with respect to panels $1, \ldots, k$. Therefore, as soon as the update of column $k + 1$ with respect to panel $k$ is done, the reduction of panel $k + 1$ can be executed and, possibly, overlapped with other update operations.

The panel reduction is not parallelized but rather hidden behind update operations.
Lookahead

It has to be noted that the reduction of panel $k+1$ can be done independently of the update of all the columns $k+2, \ldots, n$ with respect to panels $1, \ldots, k$. Therefore, as soon as the update of column $k+1$ with respect to panel $k$ is done, the reduction of panel $k+1$ can be executed and, possibly, overlapped with other update operations.

The panel reduction is not parallelized but rather hidden behind update operations.
Lookahead

It has to be noted that the reduction of panel $k + 1$ can be done independently of the update of all the columns $k + 2, \ldots, n$ with respect to panels $1, \ldots, k$.

Therefore, as soon as the update of column $k + 1$ with respect to panel $k$ is done, the reduction of panel $k + 1$ can be executed and, possibly, overlapped with other update operations.

The panel reduction is not parallelized but rather hidden behind update operations.
Lookahead

It has to be noted that the reduction of panel $k + 1$ can be done independently of the update of all the columns $k + 2, \ldots, n$ with respect to panels $1, \ldots, k$. Therefore, as soon as the update of column $k + 1$ with respect to panel $k$ is done, the reduction of panel $k + 1$ can be executed and, possibly, overlapped with other update operations.

The panel reduction is not parallelized but rather hidden behind update operations.
1D multithreading with lookahead

This technique is called lookahead [15] and actually gives better performance

Lookahead performance

Lookahead performance

Gflop/s

# cores
1D multithreading with lookahead

The overlap between panel reductions and updates is clear in the execution traces.
1D multithreading with lookahead

Lookahead can be seen as a way to pipeline different iterations of the outer loop of a factorization. Multiple iterations can be pipelined and their number is commonly referred to as lookahead depth.

![Lookahead performance graph](attachment://lookahead_graph.png)
1D multithreading with lookahead

A depth 2 gives even slightly better performance. Can we have more? In distributed memory the lookahead depth had to be limited to avoid excessive memory consumption. What about shared memory?

Lookahead performance

[Graph showing Gflop/s vs. # cores for different lookahead depths]
1D multithreading with lookahead

An infinite depth lookahead does not yield good results. Why does this happen?
1D multithreading with lookahead

The reason for the poor performance with infinite lookahead lies in cache locality. When many iterations are pipelined, panels are moved in and out of cache memories depending on what update is performed.

![Cumulative computing time graph](image-url)

**Cumulative computing time**

- depth 0
- depth 1
- depth 2
- depth inf

<table>
<thead>
<tr>
<th># cores</th>
<th>depth 0</th>
<th>depth 1</th>
<th>depth 2</th>
<th>depth inf</th>
</tr>
</thead>
<tbody>
<tr>
<td>6</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr>
<td>12</td>
<td>1.5</td>
<td>2.5</td>
<td>3.5</td>
<td>4.5</td>
</tr>
<tr>
<td>18</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
</tr>
<tr>
<td>24</td>
<td>2.5</td>
<td>3.5</td>
<td>4.5</td>
<td>5.5</td>
</tr>
</tbody>
</table>
1D multithreading with lookahead

The previous graph shows two important facts

1. The more cores are used, the higher is the cumulative times, i.e., the slower each single operation is. This is, again, due to data locality (either in caches or in NUMA memory modules). This mostly happens because cores operate on data that do not necessarily reside in their own memory hierarchy.

2. The deeper the lookahead is, the higher is the cumulative times, i.e., the slower each single operation is. This is due to the fact that with shallow lookaheads panels has a higher chance of staying in cache memories and of being reused for multiple update operations.
The limits of 1D parallelism

1D parallelization does not work well on small matrices or on *tall-and-skinny* (i.e., strongly overdetermined) ones.

This is due to the fact that a 1D partitioning does not deliver enough concurrency in these cases.
The limits of 1D parallelism

This can be overcome by applying to the matrix a 2D partitioning, commonly referred to as 2D block partitioning or, less frequently, tile partitioning
The tile QR algorithm based on the 2D block decomposition [16] can be roughly described like this:

1. At step \( k \) the QR factorization of the diagonal tile is done.
2. The previous transformation modifies all the other tiles along the \( k \)-th row.
3. then the diagonal tile (which is now an upper triangle) is used to annihilate all the subdiagonal tiles \( i \) one at a time.
4. Each of the previous operations also modifies the tiles along rows \( k \) and \( i \)
Define the following kernels

**DGEQRT**: \( A_{kk} \rightarrow (V_{kk}, R_{kk}, T_{kk}) = QR(A_{kk}) \)

**DGEMQRT**: \( A_{kj}, V_{kk}, T_{kk} \rightarrow R_{kj} = (I - V_{kk} T_{kk} V_{kk}^T) A_{kj} \)

**DTSQRT**: \[
\begin{bmatrix}
R_{kk} \\
A_{ik}
\end{bmatrix} 
\rightarrow 
(V_{ik}, T_{ik}, R_{kk}) = QR
\begin{bmatrix}
R_{kk} \\
A_{ik}
\end{bmatrix}
\]

**DTSMQRT**: \[
\begin{bmatrix}
R_{kj} \\
A_{ij}
\end{bmatrix}, V_{ik}, T_{ik} 
\rightarrow 
\begin{bmatrix}
R_{kj} \\
A_{ij}
\end{bmatrix} = (I - V_{ik} T_{ik}^T V_{ik}^T) \begin{bmatrix}
R_{kj} \\
A_{ij}
\end{bmatrix}
\]
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
Tile QR

The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
Tile QR

The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the \( k \)-th diagonal tile
2. **DGEMQRT**: update the \( k \)-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row \( i \), column \( k \)
4. **DTSMQRT**: update rows \( k \) and \( i \) wrt DTSQRT
The (sequential) tile QR factorization proceeds like this:

1. **DGEQRT**: factorize the \( k \)-th diagonal tile
2. **DGEMQRT**: update the \( k \)-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row \( i \), column \( k \)
4. **DTSMQRT**: update rows \( k \) and \( i \) wrt DTSQRT
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the \( k \)-th diagonal tile
2. **DGEMQRT**: update the \( k \)-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row \( i \), column \( k \)
4. **DTSMQRT**: update rows \( k \) and \( i \) wrt DTSQRT
The (sequential) tile QR factorization proceeds like this:

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
Tile QR

The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT
The (sequential) tile QR factorization proceeds like this

1. **DGEQRT**: factorize the $k$-th diagonal tile
2. **DGEMQRT**: update the $k$-th row wrt the DGEQRT
3. **DTSQRT**: use the diagonal tile to kill a subdiagonal one on row $i$, column $k$
4. **DTSMQRT**: update rows $k$ and $i$ wrt DTSQRT

![Diagram of the tile QR factorization process]
This is the sequential algorithm (with simplified function interfaces)

```plaintext
do k=1, min(mb,nb)
  call dgeqrt(a(k,k))
  do j=k+1,nb
    call dgemqrt(a(k,k), a(k,j))
  end do
  do i=k+1, mb
  call dtsqrt(a(k,k), a(i,k))
    do j=k+1, nb
      call dtmsqrt(a(k,k), a(i,k), a(k,j), a(i,j))
    end do
  end do
end do
```

Sequential tile QR
Tiled QR: critical path analysis

Tasks costs in $b^3/3$:

- DGEQRT : 4
- DGEMQRT: 6
- DTSQRT : 6
- DTSMQRT: 12

For a $mb \times nb$ matrix with $m = 4$, $n = 3$ the dependency graph is...
Tiled QR: critical path analysis

Tasks costs in $b^3/3$:

- $DGEQRT : 4$
- $DGEMQRT: 6$
- $DTSQRT : 6$
- $DTSMQRT: 12$

For a $mb \times nb$ matrix with $m = 4$, $n = 3$ the dependency graph is

\[
\begin{align*}
\sum_{i \in DG} w_i & = 162 \\
\sum_{i \in CP} w_i & = 70
\end{align*}
\]

and the best achievable speedup is

\[
\frac{\sum_{i \in DG} w_i}{\sum_{i \in CP} w_i} = \frac{162}{70} = 2.31
\]
Tiled QR: critical path analysis

Assuming $n < m$

\[
\frac{\sum_{k=1}^{n} 4 + (n - k) \times 6 + (m - k) \times 6 + (m - k - 1) \times (n - k) \times 12}{4 + 6 + (m - 2) \times 12 + (n - 1) \times (12 + 6)}
\]

Much better theoretical speedups with respect to the 1D parallelization.
Tile QR

The tile algorithm exposes much more concurrency than the block-column one thanks to the 2D data (and workload) partitioning.

However, its execution pattern is much more complex which renders the parallelization more difficult. One idea, that has recently received great interest, consists in defining a task for each of the elementary operations calls, and then arranging all these tasks into a Directed Acyclic Graph (DAG) which is nothing more than the graph of dependencies.

Modern tools exists that automatically build the DAG and use it for dynamically scheduling its tasks to the available processing units. These tools are also commonly called runtime execution engines and are based on what we normally refer to as a Data-Flow or Task-Flow programming models. Examples of such tools are QUARK (used in the PLASMA [17] project), Parsec, StarPU or even OpenMP 4.0.
This can be parallelized “easily” using the DEPEND clause of the OpenMP TASK construct

do k=1, min(mb,nb)
   !$omp task depend(inout:a(k,k))
   call dgeqrt(a(k,k))
   !$omp end task
end do

do j=k+1,nb
   !$omp task depend(in:a(k,k)) depend(inout:a(k,j))
   call dgemqrt(a(k,k), a(k,j))
   !$omp end task
end do

do i=k+1, mb
   !$omp task depend(inout:a(k,k), a(i,k))
   call dtsqrt(a(k,k), a(i,k))
   !$omp end task
end do

do j=k+1, nb
   !$omp task depend(in:a(k,k),a(i,k))
   !$omp& depend(inout:a(k,j),a(i,j))
   call dtsmqrt(a(k,k), a(i,k), a(k,j), a(i,j))
   !$omp end task
end do
end do
end do

Sequential tile QR
The DAG expresses the dependencies between the tasks: a task can be executed only when all the other tasks to which it is connected by an incoming edge are completed.

Based on this principle, a task whose dependencies are satisfied can be dynamically assigned to a running thread which then takes care of executing the corresponding job.

Threads work asynchronously and no synchronizations (other than those in the DAG) are imposed by the execution model. This minimizes the threads idle time.
All the classical methods for scheduling the tasks in a dependency graph can be used for minimizing the makespan (execution time) based on a number of criteria; e.g.

- assign higher priority to tasks which lie along the critical path
- define a loose coupling between threads and tasks in order to improve data locality
- assign tasks to preferred threads and employ work stealing techniques in case of starvation
As shown by the execution traces, the threads idle time is reduced (apart from the tail)

The execution trace is much more dense (almost no idle time). An extra overhead is added to handle the increased number of tasks.
Tile QR

In both cases the better performance/scalability can be explained with the increased concurrency exposed by the tile algorithm.
Tile QR

- on the big, square matrix tile QR starts off slower but then scales better
Tile QR

- on the big, square matrix tile QR starts off slower but then scales better
- tile QR has the same (better?) performance and scalability on the smaller matrix
Tile QR

- on the big, square matrix tile QR starts off slower but then scales better
- tile QR has the same (better?) performance and scalability on the smaller matrix

In both cases the better performance/scalability can be explained with the increased concurrency exposed by the tile algorithm
You may have noticed that all the tasks related to tiles along a column form a chain in the DAG. This means that their execution is serialized. This is only slightly better than what we had with a 1D partitioning mostly because the reduction/update on a column is not parallelized but only decomposed in multiple tasks that can be interleaved with others.

Therefore, we can still expect that this version of the tile algorithm suffers on tall and skinny matrices.

Thanks to the great flexibility of Householder transformations, it is possible to devise different algorithms that are more suited to this kind of matrices [18]
Tile QR

Such algorithms are based on the idea that multiple tiles along a column can be annihilated concurrently.
Tile QR

Such algorithms are based on the idea that multiple tiles along a column can be annihilated concurrently.

For this we need two new kernels, DTTQRT and DTTMQRT. The first computes the QR factorization of a triangle on top of a triangle and the second does the corresponding updates.
Tile QR

Such algorithms are based on the idea that multiple tiles along a column can be annihilated concurrently.
Tile QR

Such algorithms are based on the idea that multiple tiles along a column can be annihilated concurrently.
Tile QR

Such algorithms are based on the idea that multiple tiles along a column can be annihilated concurrently.

For this we need two new kernels, $\text{DTTQRT}$ and $\text{DTTMQRT}$. The first computes the QR factorization of a triangle on top of a triangle and the second does the corresponding updates.
Tile QR

Such algorithms are based on the idea that multiple tiles along a column can be annihilated concurrently.
Tile QR

Such algorithms are based on the idea that multiple tiles along a column can be annihilated concurrently.
Such algorithms are based on the idea that multiple tiles along a column can be annihilated concurrently.
Tile QR

Such algorithms are based on the idea that multiple tiles along a column can be annihilated concurrently.

For this we need two new kernels, \( \text{DTTQRT} \) and \( \text{DTTMQRT} \). The first computes the QR factorization of a triangle on top of a triangle and the second does the corresponding updates.
Tile QR

Such algorithms are based on the idea that multiple tiles along a column can be annihilated concurrently.

For this we need two new kernels, DTTQRT and DTTMQRT. The first computes the QR factorization of a triangle on top of a triangle and the second does the corresponding updates.
Critical path analysis

DTTQRT and DTTMQRT take, respectively 2 and 6 time units.
For a $mb \times nb$ matrix with $m = 4$, $n = 3$ the dependency graph is...
Critical path analysis

$\text{DTTQRT}$ and $\text{DTTMQRT}$ take, respectively 2 and 6 time units. For a $mb \times nb$ matrix with $m = 4$, $n = 3$ the dependency graph is

and the best achievable speedup is

$$\frac{\sum_{i \in DG} w_i}{\sum_{i \in CP} w_i} = \frac{162}{50} = 3.24$$
Critical path analysis

The binary panel reduction provides better parallelism for extremely tall and skinny matrices but suffer from poor pipelining of successive panel stages on less overdetermined ones.
Tile QR

Note that different approaches are possible and that the reduction tree can have different shapes. Using these algorithms the tile QR algorithm scales on tall and skinny matrices as well as on square ones
Tile QR

Note that different approaches are possible and that the reduction tree can have different shapes.
Using these algorithms the tile QR algorithm scales on tall and skinny matrices as well as on square ones
Data access

Ideally, when an operation is rich in Level-3 BLAS routines, the data transfers (be it from/to memory or on a network) could be completely hidden and, virtually, costless. Unfortunately this is not the case in practice and even Level-3 BLAS operations can take a serious hit if data access is not carefully done, especially in a NUMA architecture.
Consider a target architecture with four, hexa-core AMD Istanbul processors connected through HyperTransport links in a ring topology with a memory module attached to each of them:
Target architecture

Consider a target architecture with four, hexa-core AMD Istanbul processors connected through HyperTransport links in a ring topology with a memory module attached to each of them:

- The bandwidth depends on the number of hops
- The bandwidth drops because of memory conflicts when multiple threads are active

6.2 GDW/s
4.0 GDW/s
3.1 GDW/s

The bandwidth depends on the number of hops
Target architecture

Consider a target architecture with four, hexa-core AMD Istanbul processors connected through HyperTransport links in a ring topology with a memory module attached to each of them:

- The bandwidth depends on the number of hops
- The bandwidth drops because of memory conflicts when multiple threads are active
Memory allocation policies

Should we do something to reduce conflicts?

Matrix

NUMA-0

Normal allocation

NUMA-1

NUMA-2

NUMA-3

Memory interleaving should provide a more uniform distribution of data that (supposedly) reduces conflicts and increases the memory bandwidth.
Memory allocation policies

Should we do something to reduce conflicts?

Matrix

Interleaved allocation

(numactl -i all)

Memory interleaving should provide a more uniform distribution of data that (supposedly) reduces conflicts and increases the memory bandwidth
Memory allocation policies

Note that the same can be achieved (also with different granularities) using the **first touch rule**

**First touch rule**
Independently from who did the allocation, a memory page is allocated close to the thread who first “touches” (references) it.

For example, multiple threads can be used to initialize data in order to achieve an even distribution on the available NUMA memory modules.
Memory allocation policies

The memory interleave policy delivers much better performance and scalability on NUMA architectures.
Memory allocation policies

The memory interleave policy deliver much better performance and scalability on NUMA architectures

Local vs interleaved allocation

![Graph showing performance comparison between local and interleaved allocation]
Part IV

Performance Analysis
Performance evaluation

The performance of a complex, parallel algorithm is hard to evaluate, especially on a complex architecture.

The classical way to evaluate the performance and scalability of a parallel code are the speedup and the parallel efficiency:

$$S = \frac{t^-(1)}{t^-(p)}, \quad e = \frac{t^-(1)}{t^-(p) \times p}$$

where $t^-(1)$ is the time to run the best sequential algorithm on 1 process and $t^-(p)$ is the time to execute the parallel algorithm on $p$ processes.

This metrics, however do not contain much information and are not useful to identify the factors that limit the performance and scalability of the code.
Performance evaluation

In order to achieve a more detailed performance analysis, we can observe that (independently of the algorithm)

\[ t(p) = \frac{t_t(p) + t_i(p) + t_c(p) + t_o(p)}{p} \]

where

- **\( t_t(p) \)**: The total time spent in tasks which represent the workload of the application.
- **\( t_i(p) \)**: The total idle time spent waiting for dependencies between tasks to be satisfied. \( t_i(1) := 0 \)
- **\( t_c(p) \)**: The total time spent performing explicit communications that are not overlapped by computations. This corresponds to the time spent by processes waiting for data to be transferred on their associated memory node before being able to execute a task. \( t_c(1) := 0 \)
- **\( t_o(p) \)**: this is the total overhead for handling parallelism (e.g., create and schedule tasks). \( t_o(1) := 0 \)
Performance evaluation

\[
e(p) = \frac{t^-(1)}{t^-(p) \times p} = \frac{t^-(1)}{t^-_t(p) + t^-_o(p) + t^-_c(p) + t^-_i(p)} =
\]

where:

- **\( e_a \): the algorithm efficiency**, which measures how the overall efficiency is reduced by the data partitioning and the use of parallel algorithms.

- **\( e_t \): the task efficiency**, measures the loss of tasks efficiency due to parallelism (memory contention, cache, NUMA etc.).

- **\( e_o \): the overhead efficiency**, which measures the cost of the overhead with respect to the actual work done.

- **\( e_c \): the communication efficiency**, which measures the cost of communications with respect to the actual work done due to data transfers between workers.

- **\( e_p \): the pipeline efficiency**, which measures how much concurrency is available and how well it is used.
Performance evaluation: observations

\[ e_a = \frac{t^-(1)}{t^+(1)} \]

The algorithm used for a sequential execution is commonly different than the one used for a parallel execution. Also, no data and/or operations partitioning is commonly applied in a sequential execution, whereas in parallel data and/or operations are partitioned to achieve better concurrency. Therefore \( t_p^- (1) \leq t_t^- (1) \) because:

- the parallel algorithm may perform more operations if it is beneficial for parallelism
- in the parallel algorithm operations are done at a smaller granularity which implies lower tasks efficiency
Performance evaluation: observations

$$e_t = \frac{t_t^\equiv(1)}{t_t^\equiv(p)}$$

When an algorithm is executed sequentially, all the data are stored in its own memory (or NUMA module) and, therefore, can be accessed fast. When the algorithm is executed in parallel on a shared memory machine, the time of each single task may increase because of poorer cache behavior, remote accesses in the NUMA system (i.e., implicit communications) and contentions on the shared memory.

$$e_o = \frac{t_t^\equiv(p)}{t_t^\equiv(p) + t_r^\equiv(p)}$$

Commonly the overhead is relatively small but, for example, if tasks are of very small granularity the relative cost for handling them grows and thus this efficiency is lesser.
Performance evaluation: observations

\[ e_c = \frac{t_t^{-}(p) + t_r^{-}(p)}{t_t^{-}(p) + t_r^{-}(p) + t_c^{-}(p)} \]

Explicit communications obviously reduce the performance and scalability of a code wrt the sequential execution. If the data distribution is bad, this cost may become dominant. In shared memory, multithreaded parallelism, there are no explicit communications (i.e., \( t_c(p) = 0 \)) and thus we will ignore this metric here. Note that there are implicit communications in the NUMA system which decrease the tasks efficiency \( e_t \).
Performance evaluation: observations

\[ e_p = \frac{t_t(p) + t_r(p) + t_c(p)}{t_t(p) + t_r(p) + t_c(p) + t_i(p)} \]

If, for example, we have a DAG which is a chain, this efficiency will be very small no matter what, because only one process will be working at any time and all the others will be idle. If, however, concurrency is available, this efficiency can still be low because of a bad scheduling policy.
Increasing the block size

- improves $e_a$ and $e_t$ because of the large granularity
- reduces the overhead because there are fewer tasks and of bigger size
- reduces $e_p$ because of lower concurrency available
Performance evaluation: experiments

Size: 21504x1536 -- 1D vs 2D algorithm

- 1D has better \( e_a \), \( e_t \) and \( e_o \) because of the larger granularity
- 2D has much much better \( e_p \) because of the higher concurrency available
- a smaller \( bh \) gives more concurrency (better \( e_p \)) but lower \( e_a \) and \( e_t \)
References


References


