Introduction
In this second part of the final project, we provide further details about the grading policy and introduce you to the starter code. You can also find instructions for running and profiling the code on the cluster and submitting your work.
Grading details
Please refer to Part I for the overall grading information. Here we explain in detail how we determine the correctness of the code and how we test its performance. We have set up four test cases (with corresponding grading modes in the code) for testing correctness and performance. These test cases, or grading modes, can be run by passing command line arguments to the program. More details about them are given in later sections.
Outline
You can find the grading outline below. More details are given in the subsections that follow.
- Preliminary Report
- Final Report
GEMM correctness
Since the GEMM function is a building block of any neural network implementation and will be an important tool in your arsenal, we test the GEMM implementation separately from the overall code. We have provided a function prototype called myGEMM for you in gpu_func.cu, which takes as inputs two scalars a, b and three matrices A, B, C, and returns the result D = a*A*B + b*C in C (in place).
Your job is to fill in this function, and we will test your implementation on two sets of inputs that are relevant to this project. You are welcome to use this myGEMM function in your parallel training, but you don't have to; it exists only for the purpose of grading.
We test this correctness by running grading mode 4, which runs the myGEMM function alone. This myGEMM function is called only by rank 0 in the grading mode, i.e., for this part you just need to write kernels to do GEMM on a single GPU.
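To make the myGEMM contract concrete, here is a minimal CPU-side reference sketch of the operation the kernel must implement (the function name refGEMM is ours, not part of the starter code; column-major storage is assumed, as in Armadillo). You could use something like this as a correctness oracle while developing your kernels:

```cpp
#include <cstddef>

// Hypothetical CPU reference for the myGEMM contract: C <- a*A*B + b*C.
// Column-major storage (as in Armadillo): A is M x K, B is K x N, C is M x N.
void refGEMM(const double* A, const double* B, double* C,
             double a, double b, int M, int N, int K) {
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < M; ++i) {
            double acc = 0.0;
            for (int k = 0; k < K; ++k)
                acc += A[i + k * M] * B[k + j * K];  // (A*B)(i,j)
            C[i + j * M] = a * acc + b * C[i + j * M];  // in-place update of C
        }
    }
}
```

Comparing your kernel's output against such a reference on small random matrices is an easy way to catch indexing bugs before running the grading mode.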
Overall correctness
In large neural network problems, a common issue encountered is the aggregation of rounding errors or inconsistencies.
Unfortunately, the implementations of several operations are not exactly the same on the CPU and GPU. Sources of differences include the exp() operation used in the Softmax and Sigmoid functions, FMA (fused multiply-add), and the order of operations. There are also differences at the hardware level of implementation. These discrepancies are usually on the order of 1e-16 for double precision calculations, but they can build up over time. In general, the larger the learning rate, the more unstable the algorithm becomes with respect to roundoff errors. These discrepancies might not lead to any parameter blow-up, but they can create significant differences between the CPU and GPU solutions. This makes determining correctness challenging.
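The FMA effect mentioned above can be demonstrated on the CPU alone. The sketch below (function names are ours, purely illustrative) computes x*x + c two ways; with fused multiply-add the product x*x is kept exact before the addition, while the separate version rounds it first:

```cpp
#include <cmath>

// x*x + c with the product rounded to double before the addition.
double squarePlusSeparate(double x, double c) {
    double prod = x * x;   // product is rounded here
    return prod + c;
}

// The same expression with a fused multiply-add: the product stays exact.
double squarePlusFused(double x, double c) {
    return std::fma(x, x, c);
}
```

For example, with x = 1 + 2^-27 and c = -(1 + 2^-26), the exact product x*x = 1 + 2^-26 + 2^-54 loses its last term when rounded, so the separate version returns exactly 0 while the fused version returns 2^-54 ≈ 5.6e-17 — the same order of magnitude as the discrepancies described above.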
In order to tackle this, we have set up three test cases for determining correctness, in the form of grading modes. In all of these modes, we consider the max. norm of the difference between the final CPU and GPU results (parameters W(1), W(2), b(1), b(2)). If this max. norm is greater than a set threshold (1e-7) for any case, your code fails correctness for that case. The actual max. norm values we observe are much lower than this, but we have relaxed the threshold to give you some leeway. Apart from passing the three correctness tests, the precision of the CPU and GPU implementations on the validation set must be very close.
The hyper-parameters for the three test cases are as follows:
- Low learning rate: 0.001, large # iterations: 40 epochs;
- Medium learning rate: 0.01, medium # iterations: 10 epochs;
- High learning rate: 0.025, small # iterations: 1 epoch.
Grading modes 1, 2, and 3 run the above three test cases, respectively.
Note: In order to get full credit on overall code correctness, the above thresholds must be met by a fully parallel code running on 4 GPUs through four different processes (or CPU threads) using MPI and CUDA. If the code runs on a single GPU or does not use GPUs at all (just MPI), you will lose a significant portion of the grade. Similarly, if you run four processes but only one of them uses a GPU, you will again lose points. Here, when we say running on GPUs, we expect all the GEMM, Softmax, and Sigmoid calculations to be done on GPUs.
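The correctness criterion described above amounts to an infinity-norm comparison per parameter. A minimal sketch (the helper name is ours, not from the starter code) of the check applied to one parameter array, e.g. the flattened W(1) from the CPU and GPU runs:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Sketch of the grading criterion: the max. norm of the element-wise
// difference between the CPU and GPU copies of a parameter must stay
// below the 1e-7 threshold.
bool withinThreshold(const double* cpu, const double* gpu, std::size_t n,
                     double threshold = 1e-7) {
    double maxDiff = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        maxDiff = std::max(maxDiff, std::fabs(cpu[i] - gpu[i]));
    return maxDiff <= threshold;
}
```

Running a check like this yourself after each of the three test cases is a quick way to catch a divergence before submitting.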
GEMM Performance
This refers to the performance of your myGEMM function, which we test by running the code in grading mode 4. The grade is based on the performance of your GEMM function (in terms of the time taken) relative to other students in the class. The exact method for calculating this relative grade will be determined later, depending on the range of performances we see.
In the code, we run the myGEMM function repeatedly for a number of iterations. This is currently set to 10, but we might change it based on the performance we see in the submissions; this should not affect your implementation.
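The benchmarking loop above can be sketched as a small timing harness. Nothing here is from the starter code — `work` stands in for one myGEMM invocation, and the repeat count matches the current grading setting:

```cpp
#include <chrono>
#include <functional>

// Illustrative timing harness for the repeated-GEMM benchmark described
// above: call the workload 'repeats' times (currently 10 in the grader)
// and return the total wall-clock time in milliseconds.
double timeRepeated(const std::function<void()>& work, int repeats = 10) {
    auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < repeats; ++it)
        work();  // one myGEMM call would go here
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

When timing real GPU kernels, remember that kernel launches are asynchronous, so a device synchronization is needed before reading the clock for the measurement to be meaningful.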
Caveat: If your GEMM implementation does not pass the GEMM correctness test, you will not receive any points for performance.
Overall Performance
This refers to the performance of your full NN code. Here we use the default settings of the program to benchmark performance (time taken). Again, the grade is based on your performance relative to other students in the class; the exact method for calculating this relative grade will be determined later, depending on the range of performances we see.
Caveat: If you do not pass the overall correctness tests, you will be penalized; we will determine the penalty on a case-by-case basis.
Running instructions
We have provided a sample .bashrc file in sample_bashrc. You can replace your current ~/.bashrc (or bash profile) file on the cluster with this, or copy the relevant portions into your current .bashrc file. The modules that should be loaded are as follows:

module add shared
module add slurm
module add gcc/4.8.5
module add cuda75
module add mvapich2/gcc/64/2.1
module add intel-cluster-runtime/intel64/3.7
(Please load gcc/4.8.5 instead of the default gcc, because nvcc does not support gcc versions 4.9 and up.)
Make sure all the above modules are loaded. If you changed your .bashrc file, you may have to source it for the changes to take effect; alternatively, you can exit your ssh session and log back in. You can see the modules that have been loaded by using:
module list
With the correct modules loaded, run:
./init.sh
This downloads the MNIST dataset and installs the Armadillo library. You only need to do this the first time after you download the code.
Edit the job script run.sh to add command line arguments or to change the number of processes you want to run with. By default, we request 4 processes and 4 GPUs on a single node in the cluster. Using a single node reduces MPI overhead, since communication across nodes is slower than within a node. Note that the program prints the number of MPI processes and CUDA devices at the very beginning, to help you make sure you are running it correctly.
Submit the job script run.sh using sbatch as follows:
sbatch run.sh
You can check whether your job is still running via the command squeue.