UserDoc Blog

We have now started running the Linpack benchmark on CyTERRA (the previous experiments ran on Euclid).

In our input file (HPL.dat) we used a block size of 256 and a memory percentage of 80%, because that was the best combination we found in the previous step.

We are now running the application on 4 nodes with 12 processes per node, for a total of 48 processes.
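
For reference, the relevant HPL.dat lines for this setup look roughly like the following (the 6 x 8 grid is just one reasonable choice for 48 processes, and Ns, the problem size, is set from the 80% memory percentage and is not shown here):

1            # of NBs
256          NBs    (the block size chosen in the previous step)
1            # of process grids (P x Q)
6            Ps     (a 6 x 8 grid for the 48 processes)
8            Qs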

In this phase we want to see how different compiler flags affect the Linpack execution time. The compiler we used is the Intel compiler.

We tried many different flag combinations and compared the execution times against a build without any optimization (the -O0 flag).

The flags that were used are the following (a sketch of where they go in the makefile follows the list):

  • -O2
  • -O3 -xSSE4.2 (the -xSSE4.2 flag was not used together with the -O2 optimization flag)
  • -w -nocompchk (makefile default flags)
  • -parallel
  • -parallel -par-threshold (the -par-threshold flag can only be used together with the -parallel flag)
  • -fp-model fast
  • -DFASTSWAP
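
In Make.em64t the compiler flags are set through the CCFLAGS variable (next to CC, which selects the compiler). A combination from the list above is tried by editing that line, roughly like this; the exact default contents of CC and CCFLAGS depend on the mp_linpack version, so treat these lines as a sketch (-DFASTSWAP is a preprocessor define, so it is passed the same way as the other flags):

# the MPI compiler wrapper (built with the Intel compiler)
CC           = mpicc
# one of the flag combinations under test; HPL_DEFS carries the includes and defines
CCFLAGS      = $(HPL_DEFS) -O2 -DFASTSWAP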

Let's see the time results.  

We divide our results into two tables. The first table, presented below, contains the results obtained when we used different flag combinations together with the -O3 -xSSE4.2 optimization.

The second table contains the results obtained when we used different flag combinations together with the -O2 optimization.

First results table (combinations with the -O3 -xSSE4.2 optimization):


A simple combination example is -O3 -xSSE4.2 -fp-model fast -DFASTSWAP, with 6936.52 seconds.

Second results table (combinations with the -O2 optimization):

A simple combination example is -O2 -parallel -par-threshold, with 5878.06 seconds.

Before proceeding to our conclusions, we have to look at the execution time of the Linpack benchmark without any optimization flags (-O0).
The execution time with the -O0 flag is 5813.08 seconds.

Conclusions:
As we can see from the time tables above, we get the best execution time when using the -O2 -DFASTSWAP optimization flags (5740.72 seconds). This is only about 1.25% better than the execution time without any optimization flags, so the flags give us almost no improvement in the execution time of the Linpack benchmark.
Another observation is that the GFlops numbers are high: we measured 490 GFlops for the best case and 405 GFlops for the worst case.
The expected performance according to http://hpl-calculator.sourceforge.net/ is 430 GFlops, so with the -O2 -DFASTSWAP flags we actually get better performance than the calculator's estimate (490 GFlops, as stated above).

So, for our next experiments we are going to use the -O2 -DFASTSWAP optimization flags, as this combination gives a better time result than every other one we tried.
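
For completeness, the rebuild-and-rerun cycle we follow when switching to a new flag combination is roughly the following (the make targets and the run.job script are the ones described in the build notes further down; it assumes CCFLAGS in Make.em64t has already been edited):

# run from inside the mp_linpack directory
make arch=em64t clean_arch_all     # remove the previous build
make arch=em64t                    # rebuild xhpl with the new flags
cd bin/em64t
qsub run.job                       # resubmit the benchmark job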

PS: One could try more or different Intel compiler optimizations, depending on the application's characteristics. You can take a look at Intel's compiler optimizations on the following website: http://software.intel.com/sites/products/collateral/hpc/compilers/compiler_qrg12.pdf

The first test was to find the best combination of block size and the percentage of memory that will be used (e.g. if a node has 20 GByte of memory and the percentage is 80%, then only 16 GByte will be used).

I first visited the http://hpl-calculator.sourceforge.net/ site and submitted my system's data, which are the following:

Number of nodes: 1  (because I used only one node for these tests)

Cores per node: 8

Speed per core: 2.27 GHz

Memory per node: 16GByte

Operations per cycle: 4
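
As a quick sanity check (this back-of-the-envelope figure is mine, not something the calculator reports), these numbers give a theoretical peak of 8 cores x 2.27 GHz x 4 operations per cycle = 72.64 GFlops per node, so the expected GFlops values mentioned further down are a fraction of that peak.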

For this test I didn't use any compiler optimizations; that will be done in another test. For now, all we care about is finding the best combination of block size and memory percentage.

I used the following combinations (a small sketch of how the memory percentage translates into a problem size follows the list):

block size: 128,  memory percentage: 80%, 84%, 90%

block size: 152,  memory percentage: 80%, 84%, 90%

block size: 176,  memory percentage: 80%, 84%, 90%

block size: 200,  memory percentage: 80%, 84%, 90%

block size: 224,  memory percentage: 80%, 84%, 90%

block size: 256,  memory percentage: 80%, 84%, 90%
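
As a side note, the way a memory percentage translates into an HPL problem size (Ns) is roughly the heuristic below. This is only a sketch of what http://hpl-calculator.sourceforge.net/ computes for the 16 GByte node above; the exact rounding the calculator applies may differ:

# rough HPL problem size (Ns) from the memory percentage -- illustrative only
MEM_BYTES=$((16 * 1024 * 1024 * 1024))   # 16 GByte of memory per node
FRACTION=0.80                            # memory percentage under test
NB=256                                   # block size under test
# N is roughly sqrt(fraction * memory / 8), since each matrix element is an
# 8-byte double; N is then rounded down to a multiple of the block size NB
N=$(echo "sqrt($FRACTION * $MEM_BYTES / 8)" | bc -l)
N=$(echo "($N / $NB) * $NB" | bc)
echo "Suggested Ns for NB=$NB at $FRACTION memory: $N"

For 80% of 16 GByte this prints an Ns of a bit over 41000.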

The time results for each combination are as follows:


     
The resulting graph is as follows:

        

The x-axis of the graph represents the block size and the y-axis represents the execution time in seconds.
Each line colour represents a different memory percentage (blue for 80%, red for 84% and green for 90%).
As we can see, we get the best results when the memory percentage is 80%; as the memory percentage increases, the execution time also increases.
Another observation is that the execution time decreases for bigger block sizes (e.g. at 80% memory percentage the execution time is 3351.53 seconds for block size 128 and 2893.64 seconds for block size 256).

We get the smallest execution time for a memory percentage of 80% and a block size of 256, so we will proceed with this combination as it gives us the best time results.

What about the number of GFlops? The result is not what we expected. The biggest number of GFlops we got is 23.85, for a block size of 256 and a memory percentage of 90%.
The expected number of GFlops is 57 for 80% memory percentage, 60 for 84% and 64 for 90%.
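
A note on why the 90% runs report the highest GFlops even though they take the longest: HPL computes the GFlops figure from the operation count of the solve, roughly

GFlops = (2/3 * N^3 + 3/2 * N^2) / (time in seconds * 10^9)

and a higher memory percentage means a larger problem size N, so in these runs the operation count grows faster than the execution time does.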

First, we have to edit the makefile, which is named "Make.em64t". We have to point it to the directories of the MPI installation we are going to use (OpenMPI built with the Intel compiler, in our case).

To be specific, we have to change these lines:

#MPdir        = /opt/intel/mpi/3.0
#MPinc        = -I$(MPdir)/include64
#MPlib        = $(MPdir)/lib64/libmpi.a

to:

MPdir        = /opt/soft/mpi/openmpi-1.4.2/intel/noib
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpi.a

Before we proceed, it is good to load the Intel compiler and OpenMPI modules. This can be done by executing the command: module load intel openmpi

After that, we have to run these two commands in the terminal (we have to be in the mp_linpack directory to run them):

(1) make arch=em64t clean_arch_all
(2) make arch=em64t

Then the xhpl executable and the HPL.dat input file are created and placed in the directory bin/em64t.
After that, we move to that directory (cd bin/em64t).
We then have to edit the HPL.dat file. A website that helps us do that is http://hpl-calculator.sourceforge.net
We submit our system's data, and the website gives us suggestions about the values we should use in the HPL input file (HPL.dat).

To properly understand the parameters of the HPL input file, we have to take a look at the TUNING file, which is under the mp_linpack directory.
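
For orientation, these are the HPL.dat lines we mainly care about; the values shown here are placeholders rather than the ones used in our tests, and the TUNING file explains every parameter in detail:

1            # of problems sizes (N)
30000        Ns     (problem size, derived from the memory we want to use)
1            # of NBs
128          NBs    (block size)
1            # of process grids (P x Q)
2            Ps     (P x Q must equal the number of MPI processes; 2 x 4 = 8 here,
4            Qs      matching the -np 8 in the job script below)
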
After we edit the HPL input file, we make a job script to execute the application.
A simple script could be the following: 

##### run.job ---this script executes the xhpl executable #####

#!/bin/bash
#PBS -V
#PBS -N Lintest
#PBS -q batch
#PBS -l nodes=1:ppn=8,walltime=24:00:00

cd $PBS_O_WORKDIR

mpirun -np 8 path_to_executable_directory/xhpl

##### end of script file #####

We submit our script by executing the command: qsub run.job
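
Once the job has finished, the timing and GFlops figures can be pulled out of the benchmark output. Depending on the "device out" setting in HPL.dat the results go either to an HPL.out file or to standard output (which PBS captures in a file named after the job, e.g. Lintest.o<jobid>); the two commands below are a sketch assuming the HPL.out case:

qstat -u $USER            # check whether the job is still queued or running
grep -A 2 "^T/V" HPL.out  # print the result line with N, NB, P, Q, Time and Gflops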

P.S.: You can find more details about how to run the Linpack benchmark by reading the HOWTO.pdf file.