Optimization with OpenMP on Blue Gene/Q

Replace the vectorized FORALL loops in sweep_scheme.f90 with OpenMP-parallelized DO loops. For example, replace:

        DO i=mB(1,1), mB(1,2)
           FORALL(j=mB(2,1):mB(2,2),k=mB(3,1):mB(3,2))
              beforesweepstep_%data(beforesweepstep_%x(i),j,k,1,1:NrHydroVars) = &
                   Info%q(index+i,j,k,1:NrHydroVars)
           END FORALL
        END DO

by

        !$OMP PARALLEL DO PRIVATE(k,j,i) COLLAPSE(3)
        DO k=mB(3,1),mB(3,2)
           DO j=mB(2,1),mB(2,2)
              DO i=mB(1,1), mB(1,2)
                 beforesweepstep_%data(1:NrHydroVars,1,i,j,beforesweepstep_%x(k)) = &
                      Info%q(i,j,index+k,1:NrHydroVars)
              END DO
           END DO
        END DO
        !$OMP END PARALLEL DO

Test results on Blue Streak:

  1. 128^3 + 4 AMR, Current Revision Running Time on 512 cores: 224.57 (Tasks per node=16)

With 32 threads per node (tasks per node x OMP_NUM_THREADS = 32):

Tasks per node   OMP_NUM_THREADS   Total Running Time
1                32                3375.17
2                16                2019.94
4                8                 1265.58
8                4                 1052.74
16               2                 907.62
32               1                 1151.02

With 64 threads per node (tasks per node x OMP_NUM_THREADS = 64):

Tasks per node   OMP_NUM_THREADS   Total Running Time
1                64                >3600
2                32                2039.45
4                16                1181.27
8                8                 946.2
16               4                 741.81
32               2                 737.68
64               1                 877.07

  2. 32^3 + 4 AMR, Current Revision Running Time on 512 cores: 33.26 (Tasks per node=16)

With 64 threads per node (tasks per node x OMP_NUM_THREADS = 64):

Tasks per node   OMP_NUM_THREADS   Total Running Time
1                64                191.42
2                32                122.68
4                16                82.43
8                8                 70.78
16               4                 72.78
32               2                 85.65
64               1                 129.95

With 32 threads per node (tasks per node x OMP_NUM_THREADS = 32):

Tasks per node   OMP_NUM_THREADS   Total Running Time
1                32                164.59
2                16                105.90
4                8                 86.67
8                4                 79.62
16               2                 84.98
32               1                 128.47
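
To put the hybrid timings next to the current-revision baselines quoted above, the ratio can be computed directly from the table values. The following is only a minimal sketch: the helper name `slowdown` is hypothetical, the numbers are copied from the tables, and it assumes the hybrid and baseline runs are directly comparable.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original scripts):
# ratio of a hybrid MPI/OpenMP running time to a baseline running time.
slowdown() {
    # $1 = hybrid running time, $2 = baseline running time
    awk -v t="$1" -v base="$2" 'BEGIN { printf "%.2f\n", t / base }'
}

# Fastest hybrid run of the first test case (32 tasks x 2 threads) vs. its baseline
slowdown 737.68 224.57
# Fastest hybrid run of the second test case (8 tasks x 8 threads) vs. its baseline
slowdown 70.78 33.26
```

A ratio above 1 means the hybrid run was slower than the baseline.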

A typical job submission script on Blue Streak looks like this:

#!/bin/bash
#SBATCH -J strongTest
#SBATCH --nodes=32 
#SBATCH --ntasks-per-node=4
#SBATCH -p debug 
#SBATCH -t 01:00:00

module purge
module load mpi-xl
module load hdf5-1.8.8-MPI-XL
module load fftw-3.3.2-MPI-XL
module load hypre-2.8.0b-MPI-XL

ulimit -s unlimited
# 4 MPI tasks per node (set above) x 16 OpenMP threads per task = 64 threads per node
export OMP_NUM_THREADS=16
srun astrobear > strong_4ThreadsperNode_X16.log
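
Each sweep in the tables above keeps the product of tasks per node and OMP_NUM_THREADS fixed at 32 or 64; the script above is the 4 x 16 case. A minimal sketch of how the settings for such a sweep could be enumerated is given below. The function name `thread_layouts` is a hypothetical helper, not part of the original scripts, and the loop only prints the settings rather than submitting jobs.

```shell
#!/bin/bash
# Hypothetical helper: list every (tasks per node, OMP_NUM_THREADS) pair
# whose product equals the number of threads used per node.
# Assumes the thread count is a power of two (32 or 64 in the tables).
thread_layouts() {
    local hw_threads=$1
    local tasks=1
    while [ "$tasks" -le "$hw_threads" ]; do
        echo "$tasks $((hw_threads / tasks))"
        tasks=$((tasks * 2))   # step through powers of two only
    done
}

# Print the settings each run of the 64-thread sweep would need.
thread_layouts 64 | while read -r tasks threads; do
    echo "--ntasks-per-node=$tasks OMP_NUM_THREADS=$threads"
done
```

For each printed pair, the first value would go into the `#SBATCH --ntasks-per-node` line and the second into `export OMP_NUM_THREADS`.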
                                      

Swapping the DO loop order to i, j, k changes the running time only slightly compared with the k, j, i case:

        !$OMP PARALLEL DO PRIVATE(i,j,k) COLLAPSE(3)
        DO i=mB(1,1), mB(1,2)
           DO j=mB(2,1),mB(2,2)
              DO k=mB(3,1),mB(3,2)
                 beforesweepstep_%data(1:NrHydroVars,1,i,j,beforesweepstep_%x(k)) = &
                      Info%q(i,j,index+k,1:NrHydroVars)
              END DO
           END DO
        END DO
        !$OMP END PARALLEL DO

Tasks per node   OMP_NUM_THREADS   Total Running Time
1                16                >3600
2                16                2099.57
4                16                1208.53
8                8                 912.56
16               4                 758.78
16               2                 969.74
16               1                 1436.98
