wiki:ShellScheme


Shell Algorithm

Fundamentally we have a series of computational cells that need to be updated using surrounding values. Some of those values will be locally maintained (stored and updated) while other cells will be externally maintained. These external surrounding cells will need to be copied and stored somehow so that the local cells can be updated. One approach is to send out all of the data that external cells will need, then begin updating all internal cells that can be updated - sending any new data that becomes available - and finally wait for external cell data before updating cells that require it. Of course each message that we receive allows us to update new cells - so we could wait on any message, unpack it, update any cells that can now be updated, and then post any new possible sends. If each cell only needed its six immediate neighboring cells, we could create an integer field containing the number of neighbors each cell is still waiting for. Each time a new cell is received it would decrease this counter in the surrounding six cells by 1, and cells whose counter reached 0 could be safely updated. This approach could be extended to all cells within a certain surrounding region. After updating each cell we could store the new value, mark the cell as updated and continue.
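As an illustration, here is a minimal Python sketch of this counter bookkeeping. The array names, helper functions and the assumption of exactly six face neighbors are ours, not the actual code.

{{{#!python
import numpy as np

# Sketch of the neighbor-counter bookkeeping described above (names and
# data layout are illustrative).  remaining[i,j,k] counts how many of the
# six face neighbors of cell (i,j,k) are still missing; a counter that
# reaches zero means the cell can be updated safely.
OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def init_counters(shape):
    # Every cell starts waiting on all six neighbors; in practice the
    # counters would be pre-decremented for neighbors that are already
    # local or that lie on a physical boundary.
    return np.full(shape, 6, dtype=np.int32)

def receive_cell(idx, value, data, remaining, ready):
    """Store one arriving cell value and decrement its neighbors' counters.

    Neighbors whose counter hits zero are appended to 'ready' - they can
    now be updated.
    """
    data[idx] = value
    for off in OFFSETS:
        n = tuple(i + o for i, o in zip(idx, off))
        if all(0 <= c < s for c, s in zip(n, data.shape)):
            remaining[n] -= 1
            if remaining[n] == 0:
                ready.append(n)

# As each message is unpacked, call receive_cell() per cell, then update
# everything currently in 'ready' before going back to wait on messages.
}}}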

This approach would work fine if cell updates involved completely independent calculations. Unfortunately this is not the case for all but the simplest of algorithms. Intercell fluxes, for example, will be used by both the 'left' and 'right' cells. We could, however, create an integer array that tracks when each flux can safely be calculated - and then flag it once it has in fact been calculated. Each intermediate piece of the calculation would then have an integer stored on each cell that counts down to when that piece of the algorithm can be computed. Each piece - after being computed - would then decrease the counters of the stencil pieces that directly rely on it. This additional book-keeping would allow us to perform calculations as soon as possible - even in complicated geometries. It does, however, require us to store the intermediate stencil pieces - at least until all of the dependent stencil pieces are updated. This would likely lead to poor memory usage as well as reduced vectorization. This would be sub-cellular AMR. It would, however, allow us to avoid redundant computations between neighboring processors - since the first processor to compute a stencil piece could send it to the neighboring processor, which could then mark it as completed, and so on.
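A sketch of what that piece-level bookkeeping could look like, again in Python. The piece names ('prim', 'flux', 'update') and the example dependency table are hypothetical stand-ins for whatever intermediate quantities a real scheme has.

{{{#!python
from collections import defaultdict

def build_counters(dependencies):
    """dependencies: {piece: [pieces it needs]} -> (counters, dependents)."""
    counters = {p: len(needs) for p, needs in dependencies.items()}
    dependents = defaultdict(list)
    for piece, needs in dependencies.items():
        for n in needs:
            dependents[n].append(piece)
    return counters, dependents

def complete(piece, counters, dependents, ready):
    """Mark 'piece' as computed locally or received from a neighbor."""
    for dep in dependents[piece]:
        counters[dep] -= 1
        if counters[dep] == 0:
            ready.append(dep)        # this piece can now be computed

# For the first-order scheme discussed later: prim(i) needs q(i),
# flux(i) needs prim(i-1) and prim(i), update(i) needs flux(i), flux(i+1).
deps = {('prim', 1): [('q', 1)],
        ('flux', 1): [('prim', 0), ('prim', 1)],
        ('update', 1): [('flux', 1), ('flux', 2)]}
counters, dependents = build_counters(deps)
ready = []
complete(('q', 1), counters, dependents, ready)     # q(1) becomes available
complete(('prim', 0), counters, dependents, ready)  # prim(0) arrives from a neighbor
}}}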

This sub-cellular AMR gives the greatest flexibility but requires tracking the status of each calculation. Currently we group cells into patches and update the grids in one direction, so that the status for a collection of cells is contained in one number - the sweep index. We also do not currently keep track of when data has arrived on a cell-by-cell basis - or even on a grid-by-grid basis - but only on a stage-by-stage basis. So if the sweep depends on external data, it can only proceed after confirming that all of the data for the stage has arrived.

  • So we monitor communication completion simply by the Global Communication Status.
  • And we monitor the advance status by a node-traversing pointer and a current sweep index.

Since grids are entirely maintained on a single processor, cells within the central core of a grid could be updated without waiting for any communication. Then cells surrounding the core could be updated as data is received - or after all data is received for the stage - or after all data is received for the grid. Storing when all data has been received for a grid would allow grids surrounded by neighbors on the same processor to begin updating before any communication has occurred. Of course this would not help for fixed-grid problems unless each processor split its grid into 27 pieces so that the central piece could be fully ghosted. However, the redundant computations done between grids would likely destroy the benefit of being able to begin advancing pre-communication. Of course if we could find a way to avoid this redundant computation (i.e. ghost intermediate stencil pieces) then this approach could work. We could even create meta-grids that would only ghost stencil pieces within themselves - and then track when meta-grids could be updated. We would of course need to check whether stencil pieces had been updated before re-calculating them - but a WHERE mask could accomplish this (especially if they were all initialized to -999).

Overlapping computation with ghosting

1D Fixed Grid Case

Let's assume for a moment that a processor has two adjacent grids (in 1D), and let's assume that we have a simple Godunov method: in order to advance we have to convert each cell to primitive form, then use those primitive values in neighboring cells to calculate the Riemann flux at the cell interface. The dependencies (and rough shares of the cost) are:

  • Qi is computed from Fi & Fi+1 (conservative update, ~5% of the cost)
  • Fi is computed from Pi-1 & Pi (Riemann calculation, ~90%)
  • Pi is computed from Qi (conversion to primitive form, ~5%)

In order to update cell 'i' we need

  • Fi & Fi+1
  • Pi-1, Pi, & Pi+1
  • Qi-1, Qi, & Qi+1

Thus this scheme requires knowledge of the two immediate neighbors (or a ghost zone 1 cell wide) to perform all of the computations. Let's assume we have two grids that are adjacent. Then the order of computations for updating the rightmost cell of grid 1 (cell i) and the leftmost cell of grid 2 (cell i+1) is:

Grid 1                  Grid 2
Qi-1 → Pi-1             Qi → Pi
Qi → Pi                 Qi+1 → Pi+1
Qi+1 → Pi+1             Qi+2 → Pi+2
(Pi-1 & Pi) → Fi        (Pi & Pi+1) → Fi+1
(Pi & Pi+1) → Fi+1      (Pi+1 & Pi+2) → Fi+2
(Fi & Fi+1) → Qi        (Fi+1 & Fi+2) → Qi+1

Comparing these two lists, we see that there are three redundant calculations needed by both grids

Qi→Pi
Qi+1→Pi+1
(Pi & Pi+1)→Fi+1

Our ghosting of Q by one cell allows us to perform these redundant calculations independently. If instead we sent Qi to the rightmost grid, allowed it to perform these calculations and then received Fi+1 back, we would essentially be communicating the same amount of data, but avoiding the redundant computation. We have, however, created the need for an extra round of 'messaging' between these grids and we have delayed the ability of grid 1 to complete its update independently. Of course if grid 2 performed the redundant computation first and then immediately sent back Fi+1, then the first grid might receive Fi+1 before it could calculate it on its own. This is at least true for grids on the same processor.

So for a 1D calculation - every grid could send its rightmost Q to its right neighbor, receive the left ghost Q from its left neighbor, calculate the leftmost flux, send it back to its left neighbor, and then continue calculating fluxes from the left side.

Of course there is still the initial pause waiting for the left ghost Q. This time could be spent calculating fluxes from the 2nd cell onward. Of course if we calculate all of the fluxes before returning to the first cell, we will not be able to send the leftmost flux to our left neighbor until after all of the computation has finished - and our left neighbor will then be left waiting after it has finished advancing. An optimal strategy would then be:

  • Send the rightmost cell to the right neighbor grid.
  • Update the left half of the grid (starting at the 2nd cell)
  • Wait for the ghost cell
  • Calculate the 1st flux
  • Send the flux to the left neighbor
  • Update the right half of the grid
  • Wait for the rightmost flux from the right neighbor.
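A sketch of this strategy in Python with mpi4py follows; the library choice, the periodic decomposition, the placeholder upwind flux and the dt/dx factor 'nu' are all illustrative assumptions rather than the actual implementation.

{{{#!python
# Run with e.g.: mpirun -np 4 python this_script.py (each rank owns one grid).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

mx, nu = 64, 0.5                    # cells per rank, dt/dx (illustrative)
q = np.random.rand(mx)              # conserved variable, cells 0..mx-1
f = np.empty(mx + 1)                # f[i] = flux at the left face of cell i

def riemann(ql, qr):
    return ql                       # placeholder: upwind flux for speed +1

# Send our rightmost cell to the right neighbor; post receives for the
# left ghost cell and for the flux that will come back from the right.
ghost_q, ghost_f = np.empty(1), np.empty(1)
send_q = comm.Isend(q[mx - 1:mx], dest=right, tag=0)
recv_q = comm.Irecv(ghost_q, source=left, tag=0)
recv_f = comm.Irecv(ghost_f, source=right, tag=1)

# Update the left half of the grid, starting at the 2nd cell (local data only).
half = mx // 2
for i in range(1, half + 1):
    f[i] = riemann(q[i - 1], q[i])
qn = q.copy()
qn[1:half] = q[1:half] - nu * (f[2:half + 1] - f[1:half])

# Wait for the ghost cell, calculate the 1st flux, send it back to the
# left neighbor, and finish the 1st cell.
recv_q.Wait()
f[0] = riemann(ghost_q[0], q[0])
send_f = comm.Isend(f[0:1], dest=left, tag=1)
qn[0] = q[0] - nu * (f[1] - f[0])

# Update the right half of the grid (all but the last cell is local work).
for i in range(half + 1, mx):
    f[i] = riemann(q[i - 1], q[i])
qn[half:mx - 1] = q[half:mx - 1] - nu * (f[half + 1:mx] - f[half:mx - 1])

# Wait for the rightmost flux from the right neighbor, finish the last cell.
recv_f.Wait()
f[mx] = ghost_f[0]
qn[mx - 1] = q[mx - 1] - nu * (f[mx] - f[mx - 1])

MPI.Request.Waitall([send_q, send_f])
q[:] = qn
}}}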

It would be better to perform the middle three steps above as soon as the necessary data has arrived - this however requires frequent polling of the communication - or a threaded approach. If the time to half-advance the grid is much larger than the communication delay, then this is not too much of a problem. If the communication delay becomes large relative to the advance time, then it is better to:

  • Send the left cell to the left neighbor and the right cell to the right neighbor
  • Update the interior of the grid (2:mx-1)
  • Wait for the ghost cells from the left and right neighbor
  • Calculate the leftmost and rightmost fluxes and update the left and right cells
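The same problem with this simpler exchange pattern might look like the following (same illustrative assumptions as the previous sketch).

{{{#!python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

mx, nu = 64, 0.5
q = np.random.rand(mx)
f = np.empty(mx + 1)

def riemann(ql, qr):
    return ql                               # placeholder upwind flux

# Exchange the edge cells with both neighbors.
ghost_l, ghost_r = np.empty(1), np.empty(1)
reqs = [comm.Isend(q[0:1], dest=left, tag=0),
        comm.Isend(q[mx - 1:mx], dest=right, tag=1),
        comm.Irecv(ghost_l, source=left, tag=1),
        comm.Irecv(ghost_r, source=right, tag=0)]

# Update the interior (cells 2:mx-1 in the 1-based notation above)
# while the messages are in flight.
for i in range(1, mx):
    f[i] = riemann(q[i - 1], q[i])
qn = q.copy()
qn[1:mx - 1] = q[1:mx - 1] - nu * (f[2:mx] - f[1:mx - 1])

# Wait for both ghost cells, compute the two boundary fluxes
# (redundantly with the neighbors) and finish the edge cells.
MPI.Request.Waitall(reqs)
f[0] = riemann(ghost_l[0], q[0])
f[mx] = riemann(q[mx - 1], ghost_r[0])
qn[0] = q[0] - nu * (f[1] - f[0])
qn[mx - 1] = q[mx - 1] - nu * (f[mx] - f[mx - 1])
q[:] = qn
}}}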

Of course in both approaches, we need P1 at two different times. If we want to avoid re-calculating P1 (on the same processor for the same grid) we need to cache P1 so that its lifetime spans the communication delay, and in the second approach we need to store Pmx as well. We could reduce this time by starting in the middle of the grid and advancing outwards, so that we calculate P1 (and Pmx) at the end, right before waiting for Q0 (and Qmx+1).

For more complicated stencils we might need to store more than just P1 (e.g. P1-2, dq1, etc.). Determining how to cache these for different schemes is not entirely straightforward - however, all of the information needed for determining what to store is already required by the sweep method.

2D Fixed Grid

Moving to 2D complicates the picture in two ways…

  • First, we can now have 3^2-1=8 neighbors instead of 3^1-1=2 neighbors (where we are counting kitty-corner neighbors).
  • Second, the region we need to update after receiving ghost cells is a rectangular shell, so we need to cache Pi,j in a 1-cell-wide rectangular shell.

We can still sweep the interior grid to update Q - and cache the values - however after we receive ghost cells, we need to be able to re-use the cached values without re-calculating them.

We need to be able to modify the stencil ranges based on what values have already been calculated - or use WHERE statements - which would certainly be the simplest.

If we stored or knew the calculated indices for each stencil piece - we could determine the new update box ranges for the second sweep.

IsTimeShift could repopulate the cached values where available, given the lead/lag and the overall ranges - and the updated box ranges could determine where to calculate new values. If we still swept in x we would have either one or two ranges for y. In 3D we would have a square in Y and Z - where the z ranges depend on the y-range. IsTimeShift could also return a collection of 1 to 4 index ranges that outline the 2D rectangular shell to update.

Of course we still need a way to cache the stencil pieces in these shells. These shells could be broken into pieces as follows:

1D

  • P(os(1,1):is(1,1))
  • P(is(1,2):os(1,2))

2D

  • P(os(1,1):is(1,1), os(2,1):os(2,2))
  • P(is(1,1):is(1,2), os(2,1):is(2,1))
  • P(is(1,1):is(1,2), is(2,2):os(2,2))
  • P(is(1,2):os(1,2), os(2,1):os(2,2))

3D

  • P(os(1,1):is(1,1), os(2,1):os(2,2), os(3,1):os(3,2))
  • P(is(1,1):is(1,2), os(2,1):is(2,1), os(3,1):os(3,2))
  • P(is(1,1):is(1,2), is(2,1):is(2,2), os(3,1):is(3,1))
  • P(is(1,1):is(1,2), is(2,1):is(2,2), is(3,2):os(3,2))
  • P(is(1,1):is(1,2), is(2,2):os(2,2), os(3,1):os(3,2))
  • P(is(1,2):os(1,2), os(2,1):os(2,2), os(3,1):os(3,2))
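These pieces follow a simple pattern: for each dimension in turn, a 'low' slab and a 'high' slab, with earlier dimensions clamped to the inner box and later dimensions spanning the outer box. A small Python sketch that generates them for any dimension (the function name and representation are made up):

{{{#!python
# Generate the rectangular pieces of the shell between an outer box and
# an inner box, following the os(d,1):os(d,2) / is(d,1):is(d,2) notation
# above (inclusive bounds).
def shell_slabs(outer, inner):
    """outer, inner: lists of (lo, hi) pairs, one per dimension."""
    slabs, prefix = [], []
    for d, ((olo, ohi), (ilo, ihi)) in enumerate(zip(outer, inner)):
        rest = list(outer[d + 1:])       # trailing dims span the outer box
        slabs.append(prefix + [(olo, ilo)] + rest)   # slab below the inner box
        slabs.append(prefix + [(ihi, ohi)] + rest)   # slab above the inner box
        prefix = prefix + [(ilo, ihi)]   # earlier dims now clamped to inner
    return slabs

# 1D: reproduces P(os(1,1):is(1,1)) and P(is(1,2):os(1,2))
print(shell_slabs([(0, 9)], [(2, 7)]))
# 2D: reproduces the four pieces listed above (in a slightly different order)
print(shell_slabs([(0, 9), (0, 9)], [(2, 7), (2, 7)]))
}}}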

Implementation Plan

  • Add a cache structure to the stencil structure.
  • Add stencils to the info structure.
  • Update IsTimeShift to return up to 4 ranges to update - and modify all routines in sweep to iterate over these ranges.
  • Update IsTimeShift to populate stencil data from the cache.

1D

Let's assume that we have a 1D system on two processors and that we are implementing a zeroth-order Godunov scheme with no limiters. The calculations that need to be performed are:

  • f(i,n)=RS(q(i,n),q(i-1,n))
  • q(i,n+1)=q(i,n)+f(i+1,n)-f(i,n)
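A direct transcription of these two formulas as a serial Python/numpy sketch; an illustrative upwind flux stands in for RS, with the dt/dx factor folded into the flux so the update reads exactly as above.

{{{#!python
import numpy as np

nu = 0.5                      # dt/dx (illustrative)

def RS(q_i, q_im1):
    # Placeholder "Riemann solver": upwind flux for advection at speed +1,
    # scaled so that q + f(i+1) - f(i) is the usual upwind update.
    return -nu * q_im1

def step(q):
    """q holds mx interior cells plus one ghost cell at each end."""
    mx = q.size - 2
    f = np.empty(mx + 2)
    f[1:mx + 2] = RS(q[1:mx + 2], q[0:mx + 1])               # f(i,n) = RS(q(i,n), q(i-1,n))
    qn = q.copy()
    qn[1:mx + 1] = q[1:mx + 1] + f[2:mx + 2] - f[1:mx + 1]   # q(i,n+1) = q(i,n) + f(i+1,n) - f(i,n)
    return qn

q = np.random.rand(10)        # 8 interior cells + 2 ghost cells
q = step(q)
}}}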

If at the beginning of each step processor 1 has q(1:4,n) and processor 2 has q(5:8,n), and at the end of the step processor 1 has q(1:4,n+1) and processor 2 has q(5:8,n+1), then we have different options. We could

  • send q each way and calculate the boundary flux redundantly
  • send q one way and calculate the boundary flux before sending it back the other way.

In the first option, an update takes max(L + 1RS + 1U, 5RS + 4U), where L is the message latency, RS the cost of one Riemann solve and U the cost of one cell update. The second option takes max(2L + 1RS + 1U, 4RS + 4U).

The first option is therefore faster when the second option's critical path is dominated by its two message latencies - that is, when 2L > 4RS + 3U. Otherwise the second option is faster, by up to 1 RS. For a small 1D problem the latency condition is nearly always met. For two 1D grids of size N, the condition becomes 2L > N*RS + (N-1)*U. For a large 3D simulation N could be replaced by the number of cells per processor.

Here we have only considered latency. The first option requires more bandwidth…
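The comparison can be written down as a toy cost model (a sketch only; the generalization to N cells per processor follows the expressions above):

{{{#!python
# Toy cost model for the two options above: L = message latency,
# RS = one Riemann solve, U = one cell update, N = cells per processor.
# Bandwidth is ignored, as noted in the text.
def t_redundant(L, RS, U, N):       # option 1: exchange q both ways
    return max(L + RS + U, (N + 1) * RS + N * U)

def t_flux_back(L, RS, U, N):       # option 2: send q one way, flux comes back
    return max(2 * L + RS + U, N * RS + N * U)

# A small grid where latency dominates favours the redundant option:
print(t_redundant(L=100, RS=1, U=1, N=4), t_flux_back(L=100, RS=1, U=1, N=4))
# A large grid where computation dominates favours the flux-back option (by 1 RS):
print(t_redundant(L=100, RS=1, U=1, N=10000), t_flux_back(L=100, RS=1, U=1, N=10000))
}}}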

2D

Now

  • fx(i,j,n)=RS(q(i,j,n),q(i-1,j,n))
  • fy(i,j,n)=RS(q(i,j,n),q(i,j-1,n))
  • q(i,j,n+1)=q(i,j,n)+fx(i+1,j,n)-fx(i,j,n)+fy(i,j+1,n)-fy(i,j,n)
  • option 1
    • receive q(0,1:my) and q(1:mx,0) from the left and lower processors respectively, and send q(mx,1:my) and q(1:mx,my) to the right and upper processors.
    • calculate fluxes fx(1:mx,1:my) and fy(1:mx,1:my)
    • send fx(1,1:my) and fy(1:mx,1) to the left and lower processors respectively, and receive fx(mx+1,1:my) and fy(1:mx,my+1) from the right and upper processors respectively.
  • option 2
    • receive q(0,1:my), q(1:mx,0), q(mx+1,1:my) and q(1:mx, my+1)
    • calculate fx(1:mx+1,1:my), fy(1:mx,1:my+1)
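A numpy sketch of the compute part of option 2, assuming the four ghost strips have already been filled into a single ghosted array; the placeholder flux is the same illustrative upwind flux used earlier.

{{{#!python
import numpy as np

nu = 0.25                      # dt/dx (illustrative), folded into the flux

def RS(q_here, q_behind):
    return -nu * q_behind      # placeholder upwind flux

def step_2d(q):
    """q carries one ghost layer on each side: shape (mx+2, my+2)."""
    mx, my = q.shape[0] - 2, q.shape[1] - 2
    # fx(i,j) = RS(q(i,j), q(i-1,j)) for i = 1..mx+1, j = 1..my
    fx = RS(q[1:mx + 2, 1:my + 1], q[0:mx + 1, 1:my + 1])
    # fy(i,j) = RS(q(i,j), q(i,j-1)) for i = 1..mx, j = 1..my+1
    fy = RS(q[1:mx + 1, 1:my + 2], q[1:mx + 1, 0:my + 1])
    qn = q.copy()
    # q(i,j,n+1) = q(i,j,n) + fx(i+1,j) - fx(i,j) + fy(i,j+1) - fy(i,j)
    qn[1:mx + 1, 1:my + 1] = (q[1:mx + 1, 1:my + 1]
                              + fx[1:, :] - fx[:-1, :]
                              + fy[:, 1:] - fy[:, :-1])
    return qn

q = np.random.rand(10, 12)     # 8x10 interior cells plus ghost layers
q = step_2d(q)
}}}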

The time for option 1 is max(L + (mx+my)*RS + L + (mx+my)*U, ((mx+1)*my + (my+1)*mx)*RS + mx*my*U).

Each stencil involves the pieces q(:,:), hdq, limiter, qlx, qlr, fx, fy.
