CUDA Assignment: CME213 Heat Diffusion Equation

Requirement

This assignment builds on the previous assignment's theme of examining memory access patterns. You will implement a finite difference solver for the 2D heat diffusion equation in several different ways in order to compare the performance characteristics of the different implementations.

Background on the heat diffusion PDE

To solve this PDE, we are going to discretize both the temporal and the spatial derivatives. To do this, we define a two-dimensional grid G(i, j), 1 ≤ i ≤ n(x), 1 ≤ j ≤ n(y), where we denote by n(x) (resp. n(y)) the number of points on the x-axis (resp. y-axis). At each time step, we will evaluate the temperature and its derivatives at these grid points.

While we will consistently use a first-order discretization scheme for the temporal derivative, we will use a 2nd, 4th, or 8th order discretization of the spatial derivatives.

If we denote by T(i, j) the temperature at time t at grid point (i, j), the 2nd order spatial discretization scheme can be written as follows.
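
The update equation itself is not reproduced here. As a sketch of the standard first-order-in-time, second-order-in-space update (writing T_new for the temperature at the next time step, and using the Courant numbers C(x) and C(y) defined below; the exact coefficients used by the starter code live in its stencil function):

T_new(i, j) = T(i, j)
            + C(x) * [ T(i+1, j) - 2 T(i, j) + T(i-1, j) ]
            + C(y) * [ T(i, j+1) - 2 T(i, j) + T(i, j-1) ]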

The C(x) (xcfl in the code) and C(y) (ycfl in the code) constants are called Courant numbers. They depend on the temporal discretization step as well as the spatial discretization steps. To ensure the stability of the discretization scheme, they have to stay below a maximum value given by the Courant-Friedrichs-Lewy condition. You do not have to worry about this, because the starter code already picks the temporal discretization step so as to maximize the Courant numbers while ensuring stability.
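
For reference only (this is the classical stability bound for the explicit second-order scheme, quoted here as a standard textbook result rather than taken from the starter code):

C(x) + C(y) ≤ 1/2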

The starter code also contains host and device functions named stencil that contain the coefficients that go into the update equation. Therefore, you do not need to work out the different order updates yourself; you only need to understand how this function works and pass in the arguments correctly.
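
As a purely illustrative sketch (the helper name stencil2, its signature, and the flattened-index convention below are assumptions made for this example, not the starter code's interface), a 2nd-order version of such a function could look like this:

__host__ __device__
float stencil2(const float* curr, int width, int idx, float xcfl, float ycfl) {
    // Central second differences in x and y, scaled by the Courant numbers,
    // added to the current temperature at flattened index idx.
    return curr[idx]
         + xcfl * (curr[idx + 1] - 2.f * curr[idx] + curr[idx - 1])
         + ycfl * (curr[idx + width] - 2.f * curr[idx] + curr[idx - width]);
}

The higher-order variants follow the same pattern with wider coefficient sets; the starter code's stencil function already provides them.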

Boundary conditions

The starter code contains the functions that update the boundary conditions for you (see the file BC.h, in particular the function gpuUpdateBCsOnly), including the points in the border region, which has a width of order / 2. This way, you do not have to worry about the stencil reaching outside the grid as you approach the wall.

Various implementations

In this assignment, we are going to implement two different kernels (and you can do a third one for extra credit); a sketch of possible launch configurations follows the list:

  • Global (function gpuStencil): this kernel will use global memory and each thread will update exactly one point of the mesh. You should use a 2D grid with n(x) × n(y) threads total.

  • Block (function gpuStencilLoop): this kernel will also use global memory. Each thread will update numYPerStep points of the mesh (these points form a vertical line). You should also use a 2D grid with n(x) × n(y) / numYPerStep threads total.

  • (Extra Credit) Shared (function gpuShared): this kernel will use shared memory. A group of threads must load a piece of the mesh in shared memory and then compute the new values of the temperatures on the mesh. Each thread will load and update several elements.
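
As a rough sketch of how the two graded kernels might be launched (the 32 × 8 block shape and the variable names below are illustrative assumptions, not the starter code's choices; nx and ny stand for the number of points to update in each dimension):

dim3 threads(32, 8);

// Global kernel: one thread per grid point, i.e. an nx x ny grid of threads.
dim3 blocksGlobal((nx + threads.x - 1) / threads.x,
                  (ny + threads.y - 1) / threads.y);
// gpuStencil<<<blocksGlobal, threads>>>(/* ... */);

// Block (loop) kernel: each thread covers numYPerStep points in y, so
// numYPerStep times fewer threads are needed in that dimension.
dim3 blocksLoop((nx + threads.x - 1) / threads.x,
                (ny + threads.y * numYPerStep - 1) / (threads.y * numYPerStep));
// gpuStencilLoop<<<blocksLoop, threads>>>(/* ... */);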

Parameter file

The parameters used in the computation are read from the file params.in. You will need to modify some parameters in this file (see the description of the starter code) to answer some of the questions. Note, however, that this file will not be submitted through the submission script.

Here is the list of parameters, in the order they appear in the file:

int nx_, ny_;    // number of grid points in each dimension
double lx_, ly_; // extent of physical domain in each dimension
int iters_; // number of iterations to do
int order_; // order of discretization
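
For illustration only, a params.in laid out in that order might look like the following (the values are made up for this example; use the grid sizes, iteration counts, and orders the questions ask for):

512 512
1.0 1.0
100
4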

Starter code

The starter code is composed of the following files:

  • main.cu — This is the CUDA driver file. Do not modify this file.
  • gpuStencil.cu — This is the file containing the kernels. You will need to modify this file.
  • Makefile — make will build the binary. make clean will remove build files as well as debug output. Do not modify this file.
  • params.in — This file contains a basic set of parameters. For debugging, performance testing, and to answer the questions, you will need to modify this file. The only parameters you should modify are nx, ny (line 1) and order (line 4). This file however will not get submitted through the script.
  • simParams.h and simParams.cpp — These files contain the data structure necessary to handle the parameters of the problem. Do not modify these files.
  • Grid.h and Grid.cu — These files contain the data structure that models the grid used to solve the PDE. Do not modify these files.
  • BC.h — This file contains the class boundary_conditions that will allow us to update the boundary conditions during the simulation. Do not modify this file.
  • hw3.sh — This script is used to submit jobs to the queue.

Note The files in the starter code contain some additional information about the implementation in the form of comments. Additionally, the CPU implementation should provide a clearer picture of the method and should aid your GPU work.

Question 1

(30 points) Implement the function gpuStencil that runs the solver using global memory. You must also fill in the function gpuComputation. The difference (in terms of the norms) between your implementation and the reference solution should be in the expected range.
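
A rough sketch of the one-thread-per-point mapping is given below; the parameter names, the border width of 1 (order 2), and the stencil2 helper from the earlier sketch are assumptions, so check gpuStencil.cu for the real signature:

__global__
void gpuStencil(float* next, const float* curr, int gx, int nx, int ny,
                float xcfl, float ycfl) {
    // Assumed convention: nx and ny count the interior points to update,
    // and gx is the padded width of one grid row.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < nx && j < ny) {
        int idx = (j + 1) * gx + (i + 1);   // skip the border of width 1
        next[idx] = stencil2(curr, gx, idx, xcfl, ycfl);
    }
}

A typical gpuComputation would then loop over the time steps, apply the boundary-condition update, launch the kernel, and swap the current and next grids.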

Question 2

(35 points) Implement the function gpuStencilLoop that runs the solver using global memory but where each thread computes several points on the grid. You must also fill in the function gpuComputationLoop.
The difference (in terms of the norms) between your implementation and the reference solution should be in the expected range.
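
Again only as a sketch under the same assumptions as before (hypothetical parameter names, a border of width 1, the stencil2 helper), each thread might cover a short vertical strip like this:

__global__
void gpuStencilLoop(float* next, const float* curr, int gx, int nx, int ny,
                    float xcfl, float ycfl, int numYPerStep) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int jFirst = (blockIdx.y * blockDim.y + threadIdx.y) * numYPerStep;
    if (i >= nx) return;
    // Each thread updates numYPerStep vertically adjacent points.
    for (int k = 0; k < numYPerStep; ++k) {
        int j = jFirst + k;
        if (j < ny) {
            int idx = (j + 1) * gx + (i + 1);   // skip the border of width 1
            next[idx] = stencil2(curr, gx, idx, xcfl, ycfl);
        }
    }
}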

Question 3

(15 points) Plot the bandwidth (GB/s) as a function of the grid size (in units of MegaPoints) for the following grid sizes: 256×256; 512×512; 1024×1024; 2048×2048; 4096×4096.
You must produce 2 plots (or 3 plots if you choose to do the extra credit), as follows (one possible way of estimating the bandwidth is sketched after this list):

  1. For order = 4, plot the bandwidth for the 2 (or 3) different algorithms.
  2. For the block implementation, plot the bandwidth for the 3 different orders.
  3. If you implemented the shared algorithm, plot the bandwidth for the 3 different orders.
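
One possible way of estimating the effective bandwidth (an assumed counting convention that charges one read and one write of each interior point per iteration; a stricter count would also include the extra stencil reads) is:

bandwidth ≈ 2 × n(x) × n(y) × sizeof(float) × iters / (elapsed time)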

Question 4

(20 points) Which kernel (global, block or shared) and which order gives the best performance (your answer may depend on the grid size)? Explain the performance results you got in Question 3.

Question 5

(20 points Extra Credit) Implement the function gpuShared that runs the solver using shared memory. You should also fill in the function gpuComputationShared. Note that, if you attempt the extra credit, you also have to answer the parts of Questions 3 and 4 that relate to this implementation.
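
A minimal sketch of the shared-memory tiling idea follows, simplified to the 2nd-order stencil and one updated point per thread (the assignment asks each thread to load and update several elements); the tile shape, parameter names, and border width of 1 are assumptions:

#define BX 32
#define BY 8

__global__
void gpuShared(float* next, const float* curr, int gx, int nx, int ny,
               float xcfl, float ycfl) {
    __shared__ float tile[BY + 2][BX + 2];   // +2 in each dimension for the halo

    int i = blockIdx.x * BX + threadIdx.x;   // interior-point coordinates
    int j = blockIdx.y * BY + threadIdx.y;

    // Top-left corner of this block's tile (halo included) in the padded
    // grid, assuming a border of width 1 so that the padded grid is
    // (nx + 2) x (ny + 2) with row stride gx.
    int x0 = blockIdx.x * BX;
    int y0 = blockIdx.y * BY;

    // Cooperative load: the BX*BY threads stride over the (BX+2)*(BY+2)
    // tile entries, including the halo.
    int tid = threadIdx.y * BX + threadIdx.x;
    for (int k = tid; k < (BX + 2) * (BY + 2); k += BX * BY) {
        int lx = k % (BX + 2);
        int ly = k / (BX + 2);
        int px = x0 + lx;                    // padded-grid coordinates
        int py = y0 + ly;
        if (px < nx + 2 && py < ny + 2)
            tile[ly][lx] = curr[py * gx + px];
    }
    __syncthreads();

    if (i < nx && j < ny) {
        int lx = threadIdx.x + 1;            // this thread's point inside the tile
        int ly = threadIdx.y + 1;
        float c = tile[ly][lx];
        next[(j + 1) * gx + (i + 1)] =
            c + xcfl * (tile[ly][lx + 1] - 2.f * c + tile[ly][lx - 1])
              + ycfl * (tile[ly + 1][lx] - 2.f * c + tile[ly - 1][lx]);
    }
}

In the graded version each thread would cover several points, so both the cooperative load and the compute phase would handle more than one entry per thread, and gpuComputationShared would drive the time-stepping as in the earlier questions.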

Total number of points: 100 (+20 Extra credit)