wiki:OptimizingScrambler


Optimizing Scrambler

List of current culprits

|| || Fixed Grid || AMR ||
|| Serial || || Effective filling fractions due to small grids ||
|| Parallel || CreateChildren, Scatter, Re-allocating grids, not favoring proportioned grids || Load Balancing, Additional Fractionation, Scatter ||

Single Processor Fixed Grid

First let's consider a single processor run without any AMR, as this will give a good baseline for comparison. As expected, nearly all of the time (99.1%) is spent advancing the level 0 grid. 0.5% is spent in ApplyOverlaps on level 0, and the next most expensive part of the algorithm (0.07%) is the CreateChildrens calls on levels -2 and -1 (most likely because they have to calculate Hilbert values and sum up costmaps along each direction over the entire fixed grid). Here are the largest time sinks for a single processor fixed grid run.

|| Routine || % Time || Level ||
|| Advance || 99.07 || 0 ||
|| ApplyOverlaps || .51 || 0 ||
|| SyncFluxes || .09 || 0 ||
|| CreateChildrens || .07 || -1 ||
|| CreateChildrens || .07 || -2 ||
|| InitInfos || .05 || 0 ||
|| AfterFixups || .03 || 0 ||

And the total runtime was 444.841 seconds

Of course the current distribution algorithm would not parallelize the 4th and 5th entries in the list; these routines are serial. As we run the same problem on more and more processors, this overhead will become costly. If we include all of the serial parts of the distribution we get 0.2% for a 3D fixed grid run, which means that by the time we reach roughly 500 processors this serial time would be comparable to the time spent advancing and our efficiency would be 50% at best. These steps could, however, be eliminated for a fixed grid run. Alternatively, the routines for calculating the Hilbert values of a region and determining the distribution of root grids could be done in parallel using all of the processors.
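As a rough illustration of that limit, here is a minimal Amdahl's-law sketch (not part of the original analysis) assuming a fixed 0.2% serial fraction:

{{{#!python
# Minimal Amdahl's-law sketch: how a fixed serial fraction of the work limits
# parallel efficiency.  serial_fraction = 0.002 corresponds to the ~0.2% of
# serial distribution work quoted above.

def parallel_efficiency(n_procs, serial_fraction):
    """Efficiency relative to one processor when serial_fraction of the
    single-processor runtime cannot be parallelized."""
    runtime = serial_fraction + (1.0 - serial_fraction) / n_procs
    return 1.0 / (n_procs * runtime)

for n in (1, 64, 500, 4000):
    print(n, parallel_efficiency(n, 0.002))
    # ~0.89 at 64 processors, ~0.50 at 500, ~0.11 at 4000
}}}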

Single Processor AMR

Next let's consider what happens if we add one level of AMR while keeping the effective resolution constant.

|| Routine || % Time || Level ||
|| Advance || 93.91 || 1 ||
|| Advance || 4.18 || 0 ||
|| ProlongateParentsData || .79 || 1 ||
|| ApplyOverlaps || .36 || 1 ||
|| AfterApplyOverlaps || .32 || 1 ||
|| SyncFluxes || .16 || 1 ||

And the total runtime was 668 seconds

Now the distribution time on levels -2 and -1 has dropped by a factor of 8, so the serial part of the distribution does not inhibit scaling to 4000+ processors. As we add more and more levels of AMR this will become a smaller and smaller percentage, so it should not be a severe hindrance to scaling out to 100,000+ processors, assuming that we have sufficient levels of AMR.

It is surprising that the Advance on level 1 takes 22 times as long as the advance on level 0, when in principle it should only take 16 times as long if the filling fraction were 100%. If we look at the actual filling fraction (i.e., what we would see if we counted the number of refined cells in VisIt) we see that the mean filling fraction is of order .170, which is roughly consistent with a puck of radius .2 and thickness .25 in a box that is 1x1x1: (pi*.2^2*.25)/1 = .0314.

A naive calculation would put the run time at 444.841*(.170+1/16) = 103.4 seconds - not the 688 seconds that it actually took. To understand why the calculation took so long we have to consider the effective filling fraction.

Adjusting the filling fractions and the minimum grid size

To see why the effective filling fraction is so much higher than the actual filling fraction we have to consider the work that goes into updating a grid. If we assume for the moment that the cost of a grid is solely a function of the number of cells that need to be updated (here we are ignoring extra computation near the boundaries), we still have to consider the fact that each grid is responsible for taking two time steps. So let's assume that after the first time step each grid needs data within mbc cells. Then the number of cell updates for a grid that is n_x x n_y x n_z would be (n_x+2*mbc)*(n_y+2*mbc)*(n_z+2*mbc) + n_x*n_y*n_z, since the first step must be taken on the grid extended by mbc cells in each direction.

For a 4x4x4 grid with an mbc of 2 this would give an effective efficiency of (8^3+4^3)/(2*4^3) = (512+64)/128 = 4.5!!! It would take 4.5 times as long to update these cells if the grid is isolated than if it is embedded in a larger grid. This estimate is a lower bound since there are additional calculations at the boundary that we have ignored - but these effectively increase the mbc by mbc/2…
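A minimal sketch of this lower-bound estimate, assuming the simple two-step ghost-cell model described above (the measured cost ratios quoted later on this page also include the boundary work this count ignores):

{{{#!python
# Lower-bound estimate of the cost ratio for updating an isolated nx x ny x nz
# grid versus the same cells embedded in a larger grid: the first of the two
# time steps is taken on the region extended by mbc cells in each direction,
# and any additional boundary work is ignored.

def effective_cost_ratio(nx, ny, nz, mbc):
    extended = (nx + 2 * mbc) * (ny + 2 * mbc) * (nz + 2 * mbc)  # first step
    interior = nx * ny * nz                                      # second step
    embedded = 2 * nx * ny * nz        # two steps with no ghost-zone overhead
    return (extended + interior) / embedded

print(effective_cost_ratio(4, 4, 4, mbc=2))     # 4.5, as in the example above
print(effective_cost_ratio(32, 32, 32, mbc=2))  # ~1.21 for a large grid
}}}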

Even for a fixed grid simulation, the efficiencies drop with grid size. For example here is a plot of efficiencies for 2D grids from 2x2 to 32x32. And here is a plot of efficiencies for 3D grids from 2x2x2 to 32x32x32. (Note the larger scatter for the 3D case perhaps due to cache limits etc…)

For the simulation above we had a minimum grid size of 2, an mbc of 4, and a filling fraction of .9, which all led to an effective filling fraction of 1.8 even though the actual filling fraction was .170. This explains the longer run time.

Tuning the filling fraction and the minimum grid size

Before considering multiple processors, let's try raising the minimum grid size (to 8) and lowering the filling fraction (to .7) in order to encourage larger grids.

|| Routine || % Time || Level ||
|| Advance || 88.89 || 1 ||
|| Advance || 9.86 || 0 ||
|| ProlongateParentsData || .44 || 1 ||
|| AfterOverlaps || .18 || 1 ||
|| ApplyOverlaps || .14 || 1 ||
|| SyncFluxes || .05 || 1 ||

And the total run time was 301.3 seconds

Actual filling fraction = .228; effective filling fraction = .753

0th order overall expected AMR efficiency = (.753+1/16) = .816, compared with the measured 301.3/444.841 = .677.

If we consider that a 16x16x16 base grid has an effective cost ratio of 1.37 and that a 32x32x32 base grid has an effective cost ratio of 1.16, then the overall AMR efficiency should have been [16^3/2 x 1.37 + 32^3 x .753] / [32^3 x 1.16] = .72, which is close enough to the .67 to be caused by a number of possible factors (including errors in estimating the actual computational cost, which were of order 7%).

Modifying the NewSubGrids routine

Clearly, since the ghost costs can be substantial, it would be worth taking into account the cost due to child grid fractionization when creating new sub grids. Instead of using the filling fraction and the minimum number of cells to constrain refinement regions, the routine could minimize the collective advance times of all of a grid's children. Of course the number of permutations to consider when blindly grouping cells into grids goes like N! where N is the number of refined cells - so a semi-intelligent algorithm would need to be written. One could use the current algorithm with a somewhat high filling fraction of .9 to first generate a collection of child grids, then consider pairwise groupings of those child grids into child 'super-grids', and then repeat this process until no new grids are grouped (see the sketch below). This would be of order N^2 log N(?) where N is the number of child grids, and would save overall computational time as long as individual grids did not generate more than say 1000 child grids (which is probably a conservative estimate).
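A sketch of this kind of greedy pairwise grouping (hypothetical helper names, not the actual NewSubGrids code; the cost model is the same two-step ghost-cell estimate used above, and the real bookkeeping is omitted):

{{{#!python
# Greedy pairwise grouping of child grids into "super-grids".  Grids are
# axis-aligned boxes ((i0, j0, k0), (i1, j1, k1)) in cell indices (inclusive).
# This simplified sketch only weighs the two-step ghost-cell cost against the
# cost of advancing any unrefined cells swallowed by the merged box.

MBC = 4  # number of ghost cells needed after the first of the two steps

def advance_cost(box):
    """Estimated cell updates to advance an isolated grid for two steps."""
    (i0, j0, k0), (i1, j1, k1) = box
    nx, ny, nz = i1 - i0 + 1, j1 - j0 + 1, k1 - k0 + 1
    extended = (nx + 2 * MBC) * (ny + 2 * MBC) * (nz + 2 * MBC)
    return extended + nx * ny * nz

def bounding_box(a, b):
    return (tuple(min(x, y) for x, y in zip(a[0], b[0])),
            tuple(max(x, y) for x, y in zip(a[1], b[1])))

def group_grids(grids):
    """Repeatedly merge the pair of grids whose merged bounding box is cheaper
    to advance than the two grids separately, until no such pair remains."""
    grids = list(grids)
    improved = True
    while improved and len(grids) > 1:
        improved = False
        best = None
        for i in range(len(grids)):
            for j in range(i + 1, len(grids)):
                merged = bounding_box(grids[i], grids[j])
                gain = (advance_cost(grids[i]) + advance_cost(grids[j])
                        - advance_cost(merged))
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, i, j, merged)
        if best is not None:
            _, i, j, merged = best
            grids = [g for k, g in enumerate(grids) if k not in (i, j)]
            grids.append(merged)
            improved = True
    return grids

# Two small adjacent child grids merge into one cheaper super-grid.
children = [((0, 0, 0), (3, 3, 3)), ((4, 0, 0), (7, 3, 3))]
print(group_grids(children))  # -> [((0, 0, 0), (7, 3, 3))]
}}}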

Multiple Processor Fixed Grid

When considering multiple processor runs, we need to combine the times spent on each processor doing various tasks. Here we average the actual time spent by each processor on each task - not the percentages. If we are doing strong scaling tests, then the size of the fixed region assigned to each processor will decrease, leading to a decreased efficiency even before communication costs are considered.

2 Processors on a local workstation (grass)

If we take the 32^3 fixed grid run above (with an effective efficiency of 1.16) and run it on two processors, we would have two 16x32x32 grids each with an effective efficiency of 1.28, so our run time should drop not by a factor of 2 but only by a factor of 2*1.16/1.28 = 1.8125, giving a run time of 245 seconds. The actual run time was 267 seconds, so we still need to account for the missing 22 seconds, or roughly 11%. The error in estimating the work load cannot completely explain this discrepancy since the errors in workload estimations become small as the grids become larger. If we look at the time spent in the different stages of the algorithm we find…

|| Routine || % Time || Level ||
|| Advance || 93.94 || 0 ||
|| RecvGridsFromParents || 1.55 || 0 ||
|| CreateChildrens || 1.52 || -1 ||
|| PrintAdvance || 1.08 || 0 ||
|| RecvOverlaps || .70 || 0 ||
|| SendOverlaps || .38 || 0 ||
|| ApplyOverlaps || .33 || 0 ||
|| RecvFluxes || .13 || 0 ||

And the total runtime was 269 seconds

  • The CreateChildrens time on level -1 can be explained by the additional Hilbert calculations and summation that occur when the -1 grid splits into two. In a single processor run there is no reason to calculate Hilbert values or sum up costmaps; I checked, and Scrambler was calculating Hilbert values anyway. It now checks whether there are multiple children and multiple processors before doing so, to streamline the distribution process (see the sketch below). This will speed up single processor fixed grid runs as well as parallel AMR runs, since in both cases you often have several grids with only one processor assigned to their subtree.
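A sketch of the guard described above (hypothetical placeholder names; the real routine carries much more state):

{{{#!python
# Only do the (serial) Hilbert ordering and costmap summation when there is
# actually a distribution decision to make.  Names here are placeholders.

def distribute_children(children, processors):
    """Assign child grids to processors; returns (child, processor list) pairs."""
    if len(children) == 1:
        # A single child inherits every processor: no Hilbert values needed.
        return [(children[0], list(processors))]
    if len(processors) == 1:
        # A single processor takes every child: again no ordering needed.
        return [(child, list(processors)) for child in children]
    # Otherwise fall through to the full Hilbert/costmap distribution.
    raise NotImplementedError("full distribution not sketched here")

print(distribute_children(["child_A"], ["proc_0", "proc_1"]))
# -> [('child_A', ['proc_0', 'proc_1'])] without computing any Hilbert values
}}}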

After making those changes we re-ran the single processor fixed grid run and got the following performance. Note that the Advance percentage increased by .14% while the time spent creating children on levels -2 and -1 went from .07% each to 0.

|| Routine || % Time || Level ||
|| Advance || 99.21 || 0 ||
|| ApplyOverlaps || .51 || 0 ||
|| SyncFluxes || .09 || 0 ||
|| InitInfos || .05 || 0 ||
|| AfterFixups || .03 || 0 ||

Out of curiosity we re-ran the single processor AMR run with the 8-cell minimum grid size and the .7 filling fraction and got marginal improvements. The total run time decreased by 8 seconds, or 2.7%, but this may have been due to the processor having better access to resources. To test this we re-ran the same setup and got a very consistent run time.

|| Routine || % Time || Level ||
|| Advance || 89.08 || 1 ||
|| Advance || 9.67 || 0 ||
|| ProlongateParentsData || .45 || 1 ||
|| AfterOverlaps || .18 || 1 ||
|| ApplyOverlaps || .14 || 1 ||
|| SyncFluxes || .05 || 1 ||
  • The RecvGridsFromParents time is the average time processor 1 spent waiting for processor 0 to finish the CreateChildrens on level -1. Since these are both averages over the two processors, the actual time spent in each stage is 3.04% and 3.10%, so collectively this CreateChildrens call accounts for about 3% of the runtime on each processor. This stage will not scale, so we can expect it to seriously degrade performance. If 3% of the runtime is unparallelized, then the relative efficiency should scale like 1/(N(.03+.97/N)) = 1/(.03N+.97), which on 64 processors is 34%. For small N the corresponding inefficiency is just .03*(N-1). This compounds the inefficiencies you get from strong scaling due to smaller grids. It should, however, accurately predict the current weak scaling (except for inefficiencies from overloading nodes).
  • PrintAdvance can also be explained by the scatter in computational times for the two processors. Even if the load is technically balanced, if the cpus are on the same machine they will be competing for resources, which can make them complete their tasks at slightly different rates. PrintAdvance is the first collective communication that occurs after the advance, and as such is the place where any de-synchronization that occurred during the advance will be felt. There is currently another synchronization call that happens at the beginning of Advance, so any synchronization costs unrelated to advances would likely show up within the Advance. To avoid this we moved the collective communication call into scheduleadvances instead.
  • RecvOverlaps and RecvFluxes use blocking communication and as such will show effects from network latency. However, RecvOverlaps transfers 3 or 4 times as much data as RecvFluxes and so would be approximately 3 or 4 times as high if latency were not an issue. (This is however on the same blade, so network latency should not be a factor here anyways…) The .51% spent in ApplyOverlaps on the single processor run, combined with the additional surface area produced by two grids instead of just one, can easily account for the time spent in RecvOverlaps, SendOverlaps, & ApplyOverlaps. The same can be said for RecvFluxes, SendFluxes, & SyncFluxes.
  • If we combine the inefficiency due to the CreateChildren on level -1 with the inefficiency produced by the smaller grids, we get a combined inefficiency of .03*(2-1)+(1.34-1.18) = .03+.16 = .19, so a previous run that took 444 seconds should now take 444/2*(1.19) = 264.18 seconds. This is close to the actual 269 seconds, and the difference can be accounted for by the scatter in the advance step (see the sketch after this list). We should expect a 16% inefficiency for this strong scaling case due only to grid sizes, or a 3.0% inefficiency for weak scaling, without changing the algorithm. To test this we ran a 64x32x32 base grid for the same number of steps on two processors. It completed in 503 seconds, which gives an inefficiency of 13%. This is much larger than the expected 3.0% inefficiency, and the total cpu time spent advancing grids on each processor is much higher than for the single processor run. Upon inspecting the output, however, we noticed that the Hilbert split algorithm chose to split the 64x32x32 grid into two pieces that were 64x32x16, not 32x32x32! This could easily explain the low efficiency. We modified the base grid to be 32x32x64 to see what would happen if the two grids were actually 32x32x32. The run time actually went up again!!! This time to 540 seconds. The main culprit appeared to be the synchronization time, which was 2.4%; for the 64x32x16 domains this was only 1%. It is puzzling why the same size grids would now have different update times - unless the presence of the field loop actually changed the advance times. To address this we ran the same setup but without any field loop. This should guarantee that both processors perform the same calculations, and the PrintAdvance time should only reflect the processor's speed. The PrintAdvance time dropped back down to 1.87% and the run time dropped to 527 seconds. This scatter was likely due to the workstation being used for other tasks.
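A small sketch reproducing the two-processor estimate from the first bullet above (the 0.03 serial fraction and the 1.34/1.18 cost ratios are the measured values quoted there):

{{{#!python
# Predicted runtime when a serial fraction of the work and the extra
# ghost-zone cost of the smaller per-processor grids are combined.

def predicted_runtime(t_one_proc, n_procs, serial_fraction,
                      split_cost_ratio, whole_cost_ratio):
    inefficiency = (serial_fraction * (n_procs - 1)
                    + (split_cost_ratio - whole_cost_ratio))
    return t_one_proc / n_procs * (1.0 + inefficiency)

print(predicted_runtime(444.0, 2, 0.03, 1.34, 1.18))
# ~264.2 seconds, compared with the measured 269 seconds
}}}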

Modifying algorithm for fixed grid runs

Since level n+1 nodes are aged and destroyed during level n calls to AMR, it was possible to simply call AMR(0) after the initial AMRStart(-2) to keep the level 0 grids persistent. The calls to RecvGridsFromParents, etc… needed to be placed inside a conditional that checks whether n > BaseLevel instead of n > -2.
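A schematic of that change (Python-flavored sketch with stubbed routines; the names mirror the ones in the text, but the real code is Scrambler's):

{{{#!python
# Stubbed sketch of the persistent fixed-grid control flow described above.

BaseLevel = 0        # -2 for a normal AMR run, 0 for a persistent fixed grid

def AMRStart(level):
    print("building levels", level, "through", BaseLevel, "once")

def RecvGridsFromParents(n):
    print("receiving new level", n, "grids from their parents")

def Advance(n):
    print("advancing level", n)

def AMR(n):
    if n > BaseLevel:            # previously the test was n > -2
        RecvGridsFromParents(n)  # skipped for the persistent base-level grids
    Advance(n)
    # ... recursion to finer levels, flux syncs, etc. omitted ...

AMRStart(-2)                     # initial hierarchy setup, as before
for step in range(3):
    AMR(BaseLevel)               # fixed grid: advance the persistent level 0
                                 # grids without recreating them each step
}}}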

1, 2, 4, & 8 processors on grass

After modifying the algorithm, we ran a 3D fieldloop on a 32x32x64 grid on 1, 2, 4, & 8 processors on grass - and here is a table summarizing the results

|| NP || Time (s) || Cell Updates || Effective Cell Updates || % Time Advancing || % Time Waiting Post Advance || Actual Efficiency || Grid-size Corrected Efficiency || Advance/Synchronization Corrected Efficiency ||
|| 1 || 92.035 || 1441792 || 3132008 || 99.81 || 0 || 1 || 1 || 99.81 ||
|| 2 || 46.2833 || 1441792 || 3249532 || 97.5 || 2.25 || 0.994 || 1.032 || 99.75 ||
|| 4 || 28.105 || 1441792 || 3420032 || 98.32 || 1.24 || 0.819 || 0.894 || 99.56 ||
|| 8 || 23.068 || 1441792 || 3760944 || 96.67 || 2.61 || 0.499 || 0.599 || 99.28 ||

From the Cell Updates column you can see that in all four runs the same total number of cells were advanced. If we look at the actual efficiency (column 7), which is the single processor wall time divided by NP times the wall time, we see that it drops fairly quickly once we are past 2 processors. To investigate the reason we first need to consider the modifications to the computational cost produced by ghost zones. The fourth column is an estimate of this additional computational cost in terms of effective cell updates; runs with more processors have more effective cell updates to perform. If we weight the actual efficiency by this additional cost, we get the grid-size corrected efficiency, which again only scales well out to two processors.

What we are really interested in is what fraction of the time processors spend actually advancing grids. This is given in column 5, where we see a similar trend although not as severe. The fact that this does not scale with the grid-size corrected efficiency implies either that our estimates of the work load for smaller grids are wrong (by 36%, which is unlikely) or that the cpus are not as dedicated to the problem, which would not be surprising given that these simulations were run on an 8-processor personal workstation. Because the processors are restricted, they no longer advance their grids at the same rate. This introduces an overall loss as well as an additional loss due to the need to resynchronize after each step. If we combine the time spent advancing with the time spent waiting for synchronization due to cpu resource limitations, the combined fraction remains above 99% even for the 8 processor run - so less than 1% is lost doing everything else.
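For reference, the derived columns in these tables are consistent with the following definitions (a reconstruction from the numbers in the tables, not a Scrambler routine):

{{{#!python
# Reconstruction of the derived columns, checked against the 8-processor row
# of the table above.

def derived_columns(np, time, eff_updates, time_1proc, eff_updates_1proc,
                    pct_advancing, pct_waiting):
    actual_eff = time_1proc / (np * time)
    # weight by the extra ghost-zone work the multi-processor run must do
    gridsize_corrected = actual_eff * eff_updates / eff_updates_1proc
    # fraction of wall time spent either advancing or waiting to resynchronize
    advance_sync_corrected = pct_advancing + pct_waiting
    return actual_eff, gridsize_corrected, advance_sync_corrected

print(derived_columns(8, 23.068, 3760944, 92.035, 3132008, 96.67, 2.61))
# -> (~0.499, ~0.599, 99.28), matching the 8-processor row
}}}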

1, 2, 4, & 8 Processors on BlueHive

Strong Scaling

In order to test this we performed the same simulation on a dedicated machine (bluehive) while remaining on a single node with 8 cpus.

|| NP || Time (s) || Cell Updates || Effective Cell Updates || % Time Advancing || % Time Waiting Post Advance || Actual Efficiency || Grid-size Corrected Efficiency || Advance/Synchronization Corrected Efficiency ||
|| 1 || 87.891 || 1441792 || 1627032 || 99.82 || 0 || 1 || 1 || 99.81 ||
|| 2 || 43.717 || 1441792 || 1632664 || 97.26 || 0.45 || 1.005 || 1.009 || 99.71 ||
|| 4 || 30.889 || 1441792 || 2178704 || 98.64 || 0.75 || 0.711 || 0.953 || 99.39 ||
|| 8 || 19.501 || 1441792 || 2185040 || 96.98 || 0.35 || 0.563 || 0.757 || 99.33 ||

Here we see the time spent waiting has dropped significantly and that the semi-modified (grid-size corrected) efficiency has improved. It is still not keeping pace with the % time advancing, which indicates either that our estimates of the time spent updating smaller grids are wrong, or that the competition between different cpus for memory is reducing the performance.

Weak Scaling

To test this we ran a weak scaling test with constant grid sizes per processor.

|| NP || Time (s) || Cell Updates || Effective Cell Updates || % Time Advancing || % Time Waiting Post Advance || Actual Efficiency || Grid-size Corrected Efficiency || Advance/Synchronization Corrected Efficiency ||
|| 1 || 43.679 || 720896 || 816332 || 99.83 || 0 || 1 || 1 || 99.81 ||
|| 2 || 43.717 || 1441792 || 1632664 || 99.26 || 0.45 || .999 || .999 || 99.71 ||
|| 4 || 50.924 || 2883584 || 3265328 || 99.29 || 0.27 || .858 || .858 || 99.56 ||
|| 8 || 55.935 || 5767168 || 6530656 || 98.77 || 0.53 || .781 || .781 || 99.3 ||

Now we see the basic efficiency has improved to where the previous semi-modified efficiency was. It still, however, does not keep pace with the advance efficiencies. This strongly suggests that multiple cpus on a single node compete for memory resources and may cause a 20% drop in performance.

Fair Scaling

To further test this we performed another strong scaling test of a 128^3 domain on 1, 2, 4, & 8 processors as well as 'one' simulation where we ran 8 single processor 64^3 runs simultaneously on the same node.

|| nJobs @ NP || Time (s) || Cell Updates || Effective Cell Updates || % Time Advancing || % Time Waiting Post Advance || Actual Efficiency || Grid-size Corrected Efficiency || Advance/Synchronization Corrected Efficiency ||
|| 1 @ 1 || 1284.2 || 23068672 || 23783859 || 99.87 || 0 || 1 || 1 || 99.81 ||
|| 1 @ 2 || 611.4 || 23068672 || 23768624 || 99.57 || 0.17 || 1.05 || 1.05 || 99.74 ||
|| 1 @ 4 || 327.3 || 23068672 || 24510508 || 99.00 || 0.42 || .981 || 1.01 || 99.42 ||
|| 1 @ 8 || 209.8 || 23068672 || 24527448 || 97.74 || 1.23 || .765 || .789 || 98.97 ||
|| 8 @ 1 || 202 || 23068672 || 24527448 || 99.79 || 0 || .795 || .795 || 99.79 ||

If we compare the basic efficiencies we see that by running 8 copies of the same single-cpu job, the performance drops by roughly 20%.

Fixed Grid Scaling Results

For all of the fixed grid scaling tests we ran a 3D MHD advection problem for 11 time steps at various resolutions on bluehive, with the densest packing of cpus unless otherwise noted.

Strong Scaling

Looking at the actual run time compared to the ideal run time, we see a bump that begins at 8 processors that is due to the drop in advance speeds from processor resource contention. There is also a general trend of increasing inefficiency due to the decreasing grid sizes inherent in strong scaling calculations. This is, however, a small effect since the smallest grid is still rather large at 32x32x32. If we increased the number of processors by another factor of 8 we would expect to begin seeing larger inefficiencies due to smaller grid sizes. These two effects aside, the scaling is excellent (except for the bump at 64 processors, which needs to be investigated).

Results of strong scaling test on bluehive

When comparing strong and weak scaling it is more convenient to think in terms of efficiencies rather than run times. Below is the same data, but we've divided the ideal run time by the other run times.

We've also included the correction for advance synchronization times.

Results of strong scaling test on bluehive showing efficiencies

|| NP || Time (s) || Cell Updates || Effective Cell Updates || % Time Advancing || % Time Waiting Post Advance || Actual Efficiency || Grid-size Corrected Efficiency || Advance/Synchronization Corrected Efficiency || Pre-Advance Synchronization ||
|| 1 || 1247.3 || 23068672 || 23783859 || 99.87 || 0.00 || 1.000 || 1.000 || 99.87 || 0.00 ||
|| 2 || 606.0 || 23068672 || 23768624 || 99.6 || 0.23 || 1.029 || 1.028 || 99.83 || 0.01 ||
|| 4 || 324.8 || 23068672 || 24510508 || 99.56 || 0.15 || 0.960 || 0.990 || 99.71 || 0.03 ||
|| 8 || 208.0 || 23068672 || 24527448 || 98.50 || 1.16 || 0.750 || 0.773 || 99.66 || 0.06 ||
|| 16 || 109.6 || 23068672 || 25277164 || 96.57 || 2.79 || 0.711 || 0.756 || 99.36 || 0.35 ||
|| 32 || 57.49 || 23068672 || 26019840 || 95.56 || 2.26 || 0.678 || 0.742 || 97.82 || 1.75 ||
|| 64 || 32.44 || 23068672 || 26122624 || 84.35 || 1.28 || 0.601 || 0.660 || 85.63 || 13.55 ||

Weak Scaling

For the weak scaling tests we ensured that each processor was assigned a 32x32x32 cube and that the overall aspect ratio of the simulation was never more than 2. We see very similar behavior to the strong scaling efficiencies, except for the differences in advance efficiencies for the simulations on 16 and 32 processors. Note that the fair scaling point is the actual efficiency of an 8 processor job spread across 8 nodes.

Results of weak scaling test on bluehive showing efficiencies

|| NP || Time (s) || Cell Updates || Effective Cell Updates || % Time Advancing || % Time Waiting Post Advance || Actual Efficiency || Advance/Synchronization Corrected Efficiency || Pre-Advance Synchronization ||
|| 1 || 20.81 || 360448 || 408166 || 99.83 || 0.00 || 1.000 || 99.83 || 0 ||
|| 2 || 20.91 || 720896 || 816332 || 99.61 || 0.17 || 0.995 || 99.78 || 0 ||
|| 4 || 21.62 || 1441792 || 1632664 || 99.41 || 0.30 || 0.962 || 99.71 || 0.03 ||
|| 8 || 27.49 || 2883584 || 3265328 || 99.11 || 0.43 || 0.758 || 99.54 || 0.06 ||
|| 16 || 30.31 || 5767168 || 6877508 || 95.17 || 4.07 || 0.686 || 99.24 || 0.41 ||
|| 32 || 33.06 || 11534336 || 13032624 || 95.86 || 2.77 || 0.629 || 98.63 || 1.02 ||
|| 64 || 32.44 || 23068672 || 26122624 || 84.35 || 1.28 || 0.641 || 85.63 || 13.55 ||
|| 8* || 21.47 || 2883584 || 3265328 || 97.48 || 0.72 || 0.969 || 98.2 || 1.35 ||

*8 processors on 8 separate nodes

AMR Multi-Processor Scaling

For a discussion on how Scrambler currently handles load balancing see LoadBalancing

1 Level of AMR
