We have now started running the Linpack benchmark on CyTERRA (the previous experiments ran on Euclid).
In our input file (HPL.dat) we used a block size of 256 and a memory percentage of 80%, because that was the best combination we found in the previous step.
We now run the application on 4 nodes with 12 processes per node (48 processes in total).
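As a rough illustration of how these HPL.dat settings relate to each other, here is a minimal sketch. It assumes the usual rule of thumb that the N x N matrix of doubles should fill the chosen fraction of total memory, with N rounded down to a multiple of the block size. The memory per node (48 GiB) is a hypothetical value, not the real CyTERRA figure, and the function names are ours.

```python
import math


def hpl_problem_size(nodes, mem_per_node_gib, mem_fraction, block_size):
    """Largest N (a multiple of block_size) whose N x N matrix of doubles
    fits in mem_fraction of the total memory."""
    total_bytes = nodes * mem_per_node_gib * 1024**3
    n = int(math.sqrt(total_bytes * mem_fraction / 8))  # 8 bytes per double
    return (n // block_size) * block_size


def process_grids(total_procs):
    """All (P, Q) pairs with P * Q == total_procs and P <= Q."""
    return [(p, total_procs // p)
            for p in range(1, math.isqrt(total_procs) + 1)
            if total_procs % p == 0]


NODES = 4
PROCS_PER_NODE = 12
MEM_PER_NODE_GIB = 48  # hypothetical value, replace with the real node memory

TOTAL_PROCS = NODES * PROCS_PER_NODE
print(TOTAL_PROCS)                 # 48
print(process_grids(TOTAL_PROCS))  # [(1, 48), (2, 24), (3, 16), (4, 12), (6, 8)]
print(hpl_problem_size(NODES, MEM_PER_NODE_GIB, 0.80, 256))
```

The P x Q grid in HPL.dat has to multiply out to the number of processes, so with 48 processes only the grids printed above are possible.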
In this phase, we want to see how different compiler flags affect the Linpack execution time. The compiler we used is the Intel compiler.
We tried many different flag combinations and compared each execution time with the time obtained without any optimization (-O0 flag).
The flags we used are the following:
- -O3 -xSSE4.2 (the -xSSE4.2 flag was not used together with the -O2 optimization flag)
- -w -nocompchk (makefile default flags)
- -parallel -par-threshold (the -par-threshold flag can only be used together with the -parallel flag)
- -fp-model fast
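One possible way to enumerate candidate combinations from the list above is sketched below. It only encodes the two constraints stated in the list (-xSSE4.2 only with -O3, and -par-threshold only with -parallel). It is not the exact set of combinations we ran, and the grouping into bases, parallel choices and on/off groups is our assumption.

```python
from itertools import product

# Base optimization levels; -xSSE4.2 appears only together with -O3.
BASES = ["-O2", "-O3 -xSSE4.2"]
# Auto-parallelization choices; -par-threshold never appears without -parallel.
PARALLEL = ["", "-parallel", "-parallel -par-threshold"]
# Flag groups that are simply switched on or off.
OPTIONAL = ["-w -nocompchk", "-fp-model fast"]


def flag_combinations():
    """Return every candidate flag string that respects the two constraints."""
    combos = []
    on_off = [(False, True)] * len(OPTIONAL)
    for base, par, *switches in product(BASES, PARALLEL, *on_off):
        parts = [base]
        parts += [flag for flag, enabled in zip(OPTIONAL, switches) if enabled]
        if par:
            parts.append(par)
        combos.append(" ".join(parts))
    return combos


combos = flag_combinations()
print(len(combos))  # 2 bases * 3 parallel choices * 2**2 on/off switches = 24
for line in combos[:3]:
    print(line)
```

Each string can then be placed in the compiler flags variable of the HPL Makefile before rebuilding and rerunning the benchmark.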
Let's look at the timing results.
We divide our results into 2 tables. The first table, presented below, contains the results obtained with different combinations based on the -O3 -xSSE4.2 optimization.
The second table contains the results obtained with different combinations based on the -O2 optimization.
First results table (combinations with the -O3 optimization):
A simple combination example is -O3 -xSSE4.2 -fp-model fast -DFASTSWAP, with 6936.52 seconds.
Second results table (combinations with the -O2 optimization):
A simple combination example is -O2 -parallel -par-threshold, with 5878.06 seconds.
Before proceeding to our conclusions, we need the execution time of the Linpack benchmark without any optimization flags (-O0).
The execution time with the -O0 flag is 5813.08 seconds.
As the timing tables above show, the best execution time is obtained with the -O2 -DFASTSWAP flags (5740.72 seconds). This is only about 1.2% better than the execution time without any optimization flags (5813.08 seconds), so the compiler flags bring almost no improvement in the execution time of the Linpack benchmark.
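The comparison against the -O0 baseline can be reproduced with a few lines of arithmetic. The sketch below uses only the three timings quoted in this post; the helper name is ours. A positive value means faster than -O0, a negative value means slower.

```python
BASELINE_O0 = 5813.08  # seconds, run with -O0

# Timings quoted in this post, in seconds.
timings = {
    "-O2 -DFASTSWAP": 5740.72,
    "-O2 -parallel -par-threshold": 5878.06,
    "-O3 -xSSE4.2 -fp-model fast -DFASTSWAP": 6936.52,
}


def improvement_percent(baseline, seconds):
    """Relative time saved with respect to the baseline, in percent."""
    return (baseline - seconds) / baseline * 100.0


for flags, seconds in timings.items():
    print(f"{flags}: {improvement_percent(BASELINE_O0, seconds):+.2f}%")
```

Note that the two example combinations quoted with the tables are both slower than the -O0 run.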
Another conclusion is that the achieved GFlops are high: 490 GFlops in the best case and 405 GFlops in the worst case.
The expected performance according to http://hpl-calculator.sourceforge.net/ is 430 GFlops. Our best result (490 GFlops, with the -O2 -DFASTSWAP flags) is about 14% above that estimate.
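To put these numbers in context, the sketch below compares the measured GFlops with the calculator estimate. It also shows the operation count HPL uses to turn a run time into GFlops, as far as we know: (2/3)N^3 + (3/2)N^2. The function names are ours, and the problem size N is left as a parameter because it is not listed in this post.

```python
def hpl_gflops(n, seconds):
    """GFlops for problem size n and wall time in seconds, using the
    (2/3)*n^3 + (3/2)*n^2 operation count."""
    return ((2.0 / 3.0) * n**3 + 1.5 * n**2) / seconds / 1e9


def percent_vs_expected(measured, expected):
    """How far the measured GFlops are above (+) or below (-) the estimate."""
    return (measured / expected - 1.0) * 100.0


EXPECTED_GFLOPS = 430.0  # estimate from the HPL calculator

for label, measured in [("best", 490.0), ("worst", 405.0)]:
    delta = percent_vs_expected(measured, EXPECTED_GFLOPS)
    print(f"{label}: {measured:.0f} GFlops, {delta:+.1f}% vs expected")
```

For a fixed N the GFlops figure is inversely proportional to the run time, so the ranking by GFlops is the same as the ranking by execution time.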
So, for our next experiments we are going to use the -O2 -DFASTSWAP flags, as they give a better execution time than every other compiler flag combination we tried.
PS: You could try more or different Intel compiler optimizations depending on your application's characteristics. The Intel compiler optimizations are listed at the following website: http://software.intel.com/sites/products/collateral/hpc/compilers/compiler_qrg12.pdf