#### Solutions you can apply to similar parallelization problems

In this article, I will discuss a simple, hands-on approach to optimizing the runtime of an interactive algorithm — the Jacobi method. More specifically, we’ll look at ways to apply it to solve the Laplace Equation for Heat Transfer.

For that, I used two similar CUDA (Compute Unified Device Architecture) approaches. This article discusses both solutions, their pros and cons, and some theoretical basis you can apply to similar parallelization problems.

First of all, let’s briefly discuss the problem: the Laplace Equation for Heat Transfer, more specifically, its solution using the Jacobi Iteractive Method. This problem was taken from the 16th Marathon of Parallel Programming.

In it, it calculates how heat transfers locally over time using the following equation:

The Jacobi iterative method is a means for iteratively calculating the solution to a differential equation by continuously refining the solution until the answer has converged upon a stable solution or some fixed number of steps have been completed, and the answer is either deemed good enough or unconverged.

The example code represents a 2D plane of material that has been divided into a grid of equally sized cells. As heat is applied to the center of this plane, the Laplace equation dictates how the heat will transfer from grid point to grid point over time.

To calculate the temperature of a given grid point for the next time iteration, one calculates the average of the temperatures of the neighboring grid points from the current iteration.

Once the next value for each grid point is calculated, those values become the current temperature, and the calculation continues. At each step, the maximum temperature change across all grid points will determine if the problem converged upon a steady state—a straightforward solution.

The main Jacobi iterative loop of this equation (In C) ended up being the following:

// Jacobi iteration

// This loop will end if either

// the maximum change reaches below a set threshold (convergence)

// or a fixed number of maximum iterations have completed

while ( hasError && iter < iter_max_num ) {

hasError = 0;

// calculates the Laplace equation to determine each cell's next value

for( int i = 1; i < size-1; i++) {

for(int j = 1; j < size-1; j++) {

new_grid[i][j] = 0.25 * (grid[i][j+1] + grid[i][j-1] +

grid[i-1][j] + grid[i+1][j]);

double errorCurrent = fabs(new_grid[i][j] - grid[i][j]);

if(errorCurrent > CONV_THRESHOLD ){

hasError = 1;

}

}

}

aux = grid;

grid = new_grid;

new_grid = aux;

iter++;

}

And the complete solution can be found in the gist below:

For a 2048 x 2048 matrix and a convergence threshold of 15000 iterations, this algorithm takes roughly 2.43 seconds (Yeah, this is a lot). This means this is not scalable and points towards a possible parallel optimization.

There are many options to make a program parallel (And, therefore, faster). A simple solution that can be easily applied to a C program is using the OpenMP (Open Multi-Processing) API — a couple of pragma directives and the speedup would be notorious.

But if we want to go the extra mile, we can take advantage of the GPU to reach an even bigger speedup. The programming part should be pretty straightforward, given the program is small overall (Only 140 lines of code total). The main problem when dealing with these types of programs is identifying the parallelization points.

In this case, it is possible to observe that the point with the most computational complexity (Therefore, where most operations occur) is in the main Jacobi loop inside each interaction (Refer to Code Block 1). This loop has complexity O(n²), with n being the grid size, the first argument passed onto the code.

It is easy to observe the new_grid variable depends only on the values of the grid. The access to the latter is read-only, so an excellent starting point is to fill all variables inside new_grid in parallel.

Unfortunately, each step depends on the previous step to be complete, which means it is impossible to further optimize our code outside of a few tweaks and memory usage optimization.

Upon using CUDA, we will break down the code into two sections to be executed in the GPU. This first section will identify which part of the grid (Defined by the variables i and j) will be executed and then apply the Laplace Equation to them.

The variables are defined using blockIdx*, *blockDim*, *threadIdx. This is to identify themselves among the GPU structure, which will be discussed below.

The convergence criteria for each tile are stored in a CUDA’s shared memory variable (Defined by the __shared__** **qualifier). The loop keeps going if at least a single tile passes the convergence criteria. Observe also that atomic functions are used to have a bit more efficiency.

Finally, the __syncthreads()** **function call ensures the shared memory is synchronized across all Jacobi function calls.

__global__ void Jacobi(double* phi, double* phi_old, double* max_diff, int N) {

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

double diff = 0.0;

if (i > 0 && i < N-1 && j > 0 && j < N-1) {

phi[i*N+j] = 0.25 * (phi_old[(i-1)*N+j] + phi_old[(i+1)*N+j] + phi_old[i*N+j-1] + phi_old[i*N+j+1]);

diff = fabs(phi[i*N+j] - phi_old[i*N+j]);

}

__shared__ double s_max_diff[TILE_SIZE*TILE_SIZE];

int tid = threadIdx.x + threadIdx.y * blockDim.x;

s_max_diff[tid] = diff;

__syncthreads();

for (int s=blockDim.x*blockDim.y/2; s>0; s/=2) {

if (tid < s) {

s_max_diff[tid] = fmax(s_max_diff[tid], s_max_diff[tid+s]);

}

__syncthreads();

}

if (tid == 0) {

atomicMax_double(max_diff, s_max_diff[0]);

}

}

The first section is defined by the __global__** **qualifier. This defines a kernel function, the function executed on the GPU. They are executed in parallel and called by the CPU. Later, we will see how the kernel function is called within the code.

__device__ double atomicMax_double(double* address, double val)

{

unsigned long long int* address_as_ull = (unsigned long long int*)address;

unsigned long long int old = *address_as_ull, assumed;

do {

assumed = old;

old = atomicCAS(address_as_ull, assumed,

__double_as_longlong(fmax(val, __longlong_as_double(assumed))));

} while (assumed != old);

return __longlong_as_double(old);

}

The second section is defined by the __device__** **qualifier, which defines functions executed on the GPU, but not callable on the CPU. It serves more to help the “main” __global__** **function and better organize the GPU-executed code.

This atomicMax_double** **function is just a common macro used to compare double variables inside the GPU.

To utilize these GPU functions, we must first define the architecture of the reserved GPU space the code will access. The 2D grid will be divided into tiles, and each tile will be passed to a three-dimensional block structure. The block structure represents several parallelization units that share a local memory space. The grid represents a structure of blocks.

#define CONV_THRESHOLD 1.0e-4f // threshold of convergence

#define TILE_SIZE 16 // empirically decided

...

// set the grid and block sizes

dim3 grid((N+TILE_SIZE-1)/TILE_SIZE, (N+TILE_SIZE-1)/TILE_SIZE, 1);

dim3 block(TILE_SIZE, TILE_SIZE, 1);

// Jacobi iteration

while (hasError && iter < MAX_ITER) {

hasError = 0;

d_temp = d_phi_old;

d_phi_old = d_phi;

d_phi = d_temp;

// launch the kernel

diff = 0;

cudaMemcpy(d_diff, &diff, sizeof(double), cudaMemcpyHostToDevice);

Jacobi<<<grid, block>>>(d_phi, d_phi_old, d_diff, N);

// copy the diff value back to the host

cudaMemcpy(&diff, d_diff, sizeof(double), cudaMemcpyDeviceToHost);

// check for convergence

if (diff > CONV_THRESHOLD) {

hasError = 1;

}

iter++;

}

In this specific case, the tile size was defined as 16 since this gave the biggest speedup upon testing. The block size is then defined as (16, 16, 1), and the grid size is defined to partition the 2D grid of size optimally N — grid = ( (N+15)/16, (N+15)/16, 1).

The line which defines the function to be called in parallel and used within the GPU is:

Jacobi<<<grid, block>>>(d_phi, d_phi_old, d_diff, N);

For a 2048 x 2048 matrix and a convergence threshold of 15,000 iterations, this algorithm takes roughly 0.134 seconds. This is a major improvement from the sequential solution, but we can still push it further.

The full code for this solution is shown below:

Even with all these optimizations, there is still a major change we can make to the __global__** **Jacobi function to speed it up.

By utilizing the tiling process, in which submatrixes of size 16 x 16 are copied to the shared memory of each block (With each block containing 16 x 16 = 196 threads), each thread can change the value of the shared memory locally.

The synchronization process is way quicker by changing the code to access a smaller portion of the entire GPU shared memory, passed to each block. A single synchronization is necessary to join the tiles at the end and change the variables necessary to keep the loop going, but just the fact that each thread is operating with a much smaller chunk of memory already gives us a good speedup.

The final version of the Jacobi function is shown below:

__global__ void Jacobi(double* phi, double* phi_old, double* max_diff, int n) {

__shared__ double tile[TILE_SIZE+2][TILE_SIZE+2];

__shared__ double s_max_diff[TILE_SIZE*TILE_SIZE];

double diff = 0.0;

int tx = threadIdx.x;

int ty = threadIdx.y;

int bx = blockIdx.x * blockDim.x;

int by = blockIdx.y * blockDim.y;

int i = bx + tx;

int j = by + ty;

// load the tile into shared memory

if (i < n && j < n) {

tile[tx+1][ty+1] = phi_old[i*n+j];

}

if (tx == 0 && i > 0) {

tile[tx][ty+1] = phi_old[(i-1)*n+j];

}

if (tx == TILE_SIZE-1 && i < n-1) {

tile[tx+2][ty+1] = phi_old[(i+1)*n+j];

}

if (ty == 0 && j > 0) {

tile[tx+1][ty] = phi_old[i*n+(j-1)];

}

if (ty == TILE_SIZE-1 && j < n-1) {

tile[tx+1][ty+2] = phi_old[i*n+(j+1)];

}

__syncthreads();

if (i > 0 && i < n-1 && j > 0 && j < n-1) {

double tmp = 0.25 * (tile[tx][ty+1] + tile[tx+2][ty+1] + tile[tx+1][ty] + tile[tx+1][ty+2]);

diff = fabs(tmp - tile[tx+1][ty+1]);

tile[tx+1][ty+1] = tmp;

}

__syncthreads();

if (i < n && j < n) {

phi[i*n + j] = tile[tx+1][ty+1];

}

// Change 3 - Put each diff value

int tid = threadIdx.x + threadIdx.y * blockDim.x;

s_max_diff[tid] = diff;

__syncthreads();

for (int s=blockDim.x*blockDim.y/2; s>0; s/=2) {

if (tid < s) {

s_max_diff[tid] = fmax(s_max_diff[tid], s_max_diff[tid+s]);

}

__syncthreads();

}

if (tid == 0) {

atomicMax_double(max_diff, s_max_diff[0]);

}

}

With this change, we reach, for a 2048 x 2048 matrix and a convergence threshold of 15,000 iterations, a final time of 0.118 seconds.

For reference, the full code of this second CUDA solution can be found below:

Aaaand that’s it. This is a simple study I did in college, but it was so interesting I thought I might as well share it here. Thank you to everyone who read this so far, and I hope it helped you understand more about parallel programming and how to mess around the GPU using CUDA.

Optimizing Heat Transfer in a 2D Grid Simulation’s Runtime Using CUDA: A Parallel Programming Study was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.