Efficient Parallelization for AMR MHD Multiphysics Calculations; Implementation in AstroBEAR
- Abstract
- Current and future AMR simulations will require algorithms that are highly parallelized and manage memory efficiently. In the redesign of AstroBEAR we have attempted to employ new techniques to achieve both of these goals. Patch-based AMR often employs ghost cells to decouple the hyperbolic advances of each grid, allowing each patch to be advanced independently. In AstroBEAR we exploit this independence to advance patches across multiple refinement levels independently, with preference given to the finer-level patches. This allows for global load balancing instead of level-by-level load balancing and allows for greater parallelization across both space and AMR level, which can improve performance in deep simulations with many levels of refinement. With regard to memory management, we have employed a distributed tree algorithm so that processors no longer need to store the entire tree structure. We have also implemented a sweep method that pipelines the computations required for updating the fluid variables with unsplit algorithms, which can dramatically reduce the memory overhead required for intermediate variables.
- Introduction
- AstroBEAR history??? overview??? Do we want to frame this as a redesign of AstroBEAR 1.0
- Larger and larger clusters alone will not allow larger simulations to be performed in the same wall time without either an increase in CPU speed or a decrease in the workload of each processor. Since CPU speeds are not expected to keep pace with the requirements of large simulations, the only option is to decrease the workload of each processor. This, however, requires highly parallelized and efficient algorithms for managing the AMR infrastructure and the necessary computations. AstroBEAR, unlike many other AMR tree management codes (Paramesh, BoxLib), uses a distributed tree so that each processor stores only the portion of the AMR structure it needs in order to carry out its computations and perform the necessary communications. While the memory needed to store the full tree is currently small compared to the resources typically available to a CPU, future clusters will likely have much less memory per processor, similar to what is already seen in GPUs. AstroBEAR also uses a distributed control structure that mirrors the nested AMR grid hierarchy: processors have child processors just as AMR grids have child grids. These child processors receive new grids, along with all necessary tree information, from their parent processor(s). This eliminates the need for global sharing of tree data. Additionally, the allocation of resources among child grids is done using a Hilbert space-filling curve. This allows neighboring processors to be physically close on the network (or on the same core) and allows AstroBEAR to take advantage of the network topology.
- AstroBEAR also utilizes interlevel threading to allow advances on different levels to occur independently. This allows for total load balancing across all refinement levels instead of balancing each level independently. This becomes especially important for deep simulations (simulations with low filling fractions but many levels of AMR) as opposed to shallow simulations (high filling fractions and only a few levels of AMR). Processors with coarse grids can advance their grids while processors with finer grids advance theirs. Without this capability, base grids would need to be large enough to be distributed across all of the processors, and for simulations with large base grids to finish in a reasonable wall time, often only a few levels of AMR can be used. With interlevel threading this restriction is lifted.
- Distributed Tree/Control Algorithm
- Many if not all current AMR codes are built upon either BoxLib or Paramesh, both of which opt to store the entire AMR structure on each processor. This, however, requires additional memory and communication (7 floats per patch, compared to 256 floats for the data of a 2D 4x4 hydro patch with 2 ghost zones). While this additional memory requirement is negligible for problems run on typical CPUs on hundreds of processors, it can become considerable on thousands or tens of thousands of processors. Since efficient memory use and management is expected to be a requirement for HPC (high performance computing) down the road, AstroBEAR uses a distributed tree algorithm in which each processor is only aware of the nodes 'surrounding' its own nodes. This includes its coarser parent node and finer child nodes, as well as physically adjacent neighbors and temporally adjacent preceding and succeeding overlaps. When new nodes are created, the parent nodes determine what information needs to be shared and with whom, and then populate the child's local connections. The basic algorithm for updating local trees is as follows:
| Step | New Nodes | Old Nodes |
|---|---|---|
| First step | | |
| 1 | Receive new nodes along with their parents, neighbors, and preceding overlaps from parent processors | Receive new succeeding overlaps from parent processors |
| 2 | Create new children and determine on which child processors they will go | |
| 3 | Determine which remote preceding nodes might have children that would overlap with its own children and send the relevant children info | Determine which remote succeeding nodes might have created children that would overlap with its own children and send the relevant children info |
| 4 | Determine which remote neighboring nodes might have children that would neighbor its own children and send the relevant children info | |
| 5 | Determine which local neighboring nodes have neighboring children | |
| 6 | Determine which local preceding nodes have children that overlap with its own | Determine which local succeeding nodes have children that overlap with its own |
| 7 | Receive new children from remote neighboring nodes and determine which of the neighbors' children neighbor its own children | |
| 8 | Receive children from remote preceding nodes and determine which of those children overlap with its own | Receive children from remote succeeding nodes and determine which of those children overlap with its own |
| 9 | For each remote child, send the child's info as well as information about its parents, neighbors, & preceding overlaps | For each remote child, send the child's succeeding overlaps |
| Successive steps | | |
| 10 | Create new children and determine on which child processor they will go | |
| 11 | Determine which remote neighboring nodes might have old/new children that would overlap/neighbor its own new children and send the relevant children info | |
| 12 | Determine which local neighboring nodes might have old/new children that would overlap/neighbor its own new children | |
| 13 | Receive new children from remote neighboring nodes and determine which of the neighbors' children neighbor/overlap its new/old children | |
| 14 | For each new remote child, send the child's information, and the information about its parent, neighbors, & preceding overlaps | |
| 15 | For each old remote child, send the child's succeeding overlaps | |

*Note that physically disjoint grids can overlap with each other's ghost zones, so the 1st generation of a node's neighbor's children can overlap with the node's 2nd generation of children in steps 11-13.
*See the section on distribution for the calculation of parent and child processors.
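To make this local-tree bookkeeping concrete, below is a minimal sketch of the per-node information a processor might keep; the type and field names (NodeDescriptor, LocalNode, etc.) are illustrative assumptions, not AstroBEAR's actual declarations. Each remote connection is stored only as a lightweight descriptor, so no processor ever holds the full tree.

    ! Hedged sketch (assumed names) of the distributed-tree bookkeeping:
    ! each processor stores, for every node it owns, only that node's
    ! parent, children, neighbors, and preceding/succeeding overlaps.
    MODULE LocalTreeSketch
       IMPLICIT NONE

       TYPE NodeDescriptor                  ! what we know about a remote node
          INTEGER :: level                  ! AMR level of the node
          INTEGER :: rank                   ! MPI rank that owns it
          INTEGER :: mB(3,2)                ! bounding box in cell indices
       END TYPE NodeDescriptor

       TYPE LocalNode                       ! a node owned by this processor
          INTEGER :: level
          INTEGER :: mB(3,2)
          TYPE(NodeDescriptor)              :: parent
          TYPE(NodeDescriptor), ALLOCATABLE :: children(:)    ! finer child nodes
          TYPE(NodeDescriptor), ALLOCATABLE :: neighbors(:)   ! same-level adjacent nodes
          TYPE(NodeDescriptor), ALLOCATABLE :: preceding(:)   ! overlaps from the previous step
          TYPE(NodeDescriptor), ALLOCATABLE :: succeeding(:)  ! overlaps for the next step
       END TYPE LocalNode
    END MODULE LocalTreeSketch

Steps 1-15 above then amount to exchanging just these descriptors between parent and child processors and between neighboring or overlapping nodes.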
- Threaded MultiLevelAdvance
- Many if not all current AMR codes perform grid updates across all levels in a prescribed order (0, 1, 2, 2, 1, 2, 2, 0, ...). Good performance then requires each level update to be balanced across all processors. Load balancing each level, however, often leads to artificial fragmentation of grids and requires each processor to have grids on every level. This limits the depth of the AMR tree, and the resulting simulations are often performed with large base grids and only 1 or 2 levels of AMR. The benefit of AMR, however, is best realized in scenarios with large dynamic ranges, and having only 2 levels of AMR severely limits the dynamic range that can be followed. In AstroBEAR we allow each level's grid advances to happen independently, so that processors with coarse grids can update those grids while other processors with fine grids update theirs.
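As a rough illustration of interlevel threading (not AstroBEAR's actual threading implementation), the sketch below uses OpenMP tasks so that each level's queue of grid advances can proceed as soon as its own data dependencies allow; AdvanceGrids and SyncWithParentLevel are hypothetical placeholder routines.

    ! Hedged sketch: advance each refinement level's grids in an independent
    ! task so processors holding only coarse grids are not forced to wait for
    ! the fine levels to finish sub-cycling.  AdvanceGrids and
    ! SyncWithParentLevel are hypothetical placeholders, not AstroBEAR routines.
    SUBROUTINE ThreadedMultiLevelAdvance(maxlevel)
       IMPLICIT NONE
       INTEGER, INTENT(IN) :: maxlevel
       INTEGER :: level

       !$OMP PARALLEL
       !$OMP SINGLE
       DO level = 0, maxlevel
          !$OMP TASK FIRSTPRIVATE(level)
          CALL AdvanceGrids(level)          ! advance this processor's grids on 'level'
          CALL SyncWithParentLevel(level)   ! prolongation/restriction/flux fixes with level-1
          !$OMP END TASK
       END DO
       !$OMP END SINGLE
       !$OMP END PARALLEL
    END SUBROUTINE ThreadedMultiLevelAdvance

In practice the ordering constraints between levels (ghost data from the parent, flux corrections back to it) would be enforced inside these routines rather than by a prescribed global schedule.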
- Load Balancing
- As was mentioned before, threading the level advances removes the need for balancing each level separately and instead allows for global load balancing. When distributing the children of a level's nodes, each processor first calculates the average remaining workload on each coarser level and the number of remaining advances on the current level that can be completed before the coarser level completes. Each processor then calculates the workload imbalance per level step as well as the new child load per level step. These two numbers are then globally shared, and each processor calculates its share of the new level's workload. The workloads per processor are then partitioned so that each processor can determine from which processors it should expect to receive new child grids and to which processors it should give new grids. It then sorts its children in Hilbert order before distributing them, in order, among its child processors. If the load of the next child grid is larger than the load assigned to the next child processor, the algorithm can optionally attempt to split (artificially fragment) the child grid into proportionate pieces before assigning them to the child processors. Usually this is avoided and the first child processor is given the entire child grid, since future distributions of finer grids can compensate for this inequality. If, however, we are distributing finest-level grids, then this splitting is used. The splitting is also used on the coarsest root grid, since its size can in general be very large.
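Below is a minimal sketch of the final hand-out step, under the assumptions that the children have already been sorted into Hilbert order and that each child processor's target share of the new workload has already been computed; the routine name and the simple greedy assignment are illustrative only, and the optional splitting of oversized children is omitted.

    ! Hedged sketch: assign Hilbert-ordered child grids to child processors
    ! whose remaining shares of the new workload are already known.  If a
    ! child's load exceeds the next processor's remaining share it is handed
    ! over whole (no splitting), as described above for non-finest levels.
    SUBROUTINE DistributeChildren(load, share, owner)
       IMPLICIT NONE
       REAL, INTENT(IN)     :: load(:)    ! work of each child, in Hilbert order
       REAL, INTENT(INOUT)  :: share(:)   ! remaining share of each child processor
       INTEGER, INTENT(OUT) :: owner(:)   ! processor assigned to each child
       INTEGER :: ichild, iproc

       iproc = 1
       DO ichild = 1, SIZE(load)
          ! move on to the next processor once the current one's share is used up
          DO WHILE (iproc < SIZE(share) .AND. share(iproc) <= 0.0)
             iproc = iproc + 1
          END DO
          owner(ichild) = iproc
          share(iproc)  = share(iproc) - load(ichild)
       END DO
    END SUBROUTINE DistributeChildren

On the finest level (and for the root grid) the oversized children would instead be split into proportionate pieces before this assignment.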
- Hyperbolic engine
- Currently AstroBEAR's hyperbolic solver uses a Godunov-type unsplit integrator that utilizes the CTU+CT integration scheme (Stone & Gardiner 08). Unsplit integrators, however, often require many intermediate variables to be stored globally during a grid update, which can require 10-50x the space needed to store the array of conservative variables. For GPU-type systems, this would considerably restrict the size of a grid that could be updated. In AstroBEAR we implement a sweep method that essentially pipelines any sequence of calculations into a one-dimensional pass across the grid in which variables are stored only as long as they are needed. This is ideally suited for GPU calculations, in which a CPU could constantly be sending in new field values and retrieving updated field values while the GPU uses a minimum of memory.
- The pipelining is done automatically provided that the dependencies between stencil pieces are explicitly stated. For example, consider a simple 2D 1st-order Godunov method in which the initial state is stored in q and the updated fields are stored in Q. The x and y fluxes are stored in fx and fy, and the left and right interface states are stored in qLx, qLy and qRx, qRy respectively. We also adopt the convention that stencil pieces stored on cell edges (i.e. qLx, qRx, fx) at position i-1/2 are stored in their respective arrays with the index i.
| Stated dependency | Actual calculation |
|---|---|
| Set_Dependency(qLx, q, (/-1,-1,0,0/)) | qLx%data(i,j) = q%data(i-1,j) |
| Set_Dependency(qRx, q, (/0,0,0,0/)) | qRx%data(i,j) = q%data(i,j) |
| Set_Dependency(qLy, q, (/0,0,-1,-1/)) | qLy%data(i,j) = q%data(i,j-1) |
| Set_Dependency(qRy, q, (/0,0,0,0/)) | qRy%data(i,j) = q%data(i,j) |
| Set_Dependency(fx, qLx, (/0,0,0,0/)) and Set_Dependency(fx, qRx, (/0,0,0,0/)) | fx%data(i,j) = riemann(qLx%data(i,j), qRx%data(i,j)) |
| Set_Dependency(fy, qLy, (/0,0,0,0/)) and Set_Dependency(fy, qRy, (/0,0,0,0/)) | fy%data(i,j) = riemann(qLy%data(i,j), qRy%data(i,j)) |
| Set_Dependency(Q, fx, (/0,1,0,0/)), Set_Dependency(Q, fy, (/0,0,0,1/)), and Set_Dependency(Q, q, (/0,0,0,0/)) | Q%data(i,j) = q%data(i,j) + fx%data(i,j) - fx%data(i+1,j) + fy%data(i,j) - fy%data(i,j+1) |
- Then, assuming that we know the size of Q we want updated, we can set Q%range=(/1,mx, 1,my, 1,mz/) and work backwards to determine what range of values we need to calculate for fx, fy, qRx, qLx, qRy, qLy, and q (a minimal sketch of this backward propagation follows below). To pipeline, we first determine windows for each stencil that slide across the grid from left to right. As the window crosses the stencil's range we begin calculating the values for the stencil along the right edge of the window. We then hang on to the calculated values as long as they are within the stencil's window.
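Here is a minimal sketch of the backward range propagation, under the assumption that each dependency is stored as the four offsets (/xlo, xhi, ylo, yhi/) used in the table above; the routine name GrowRange and the range layout are illustrative only. Each producer's range starts out empty and is widened by every consumer that depends on it.

    ! Hedged sketch: widen the range over which a producer stencil (e.g. fx)
    ! must be computed, given the range of a consumer stencil (e.g. Q) and the
    ! stated dependency offsets (/xlo, xhi, ylo, yhi/).  Before the first call
    ! the producer's range should be initialized to an empty interval
    ! (lo = HUGE(0), hi = -HUGE(0)).
    SUBROUTINE GrowRange(consumer_range, offsets, producer_range)
       IMPLICIT NONE
       INTEGER, INTENT(IN)    :: consumer_range(2,2)  ! (dimension, lo/hi)
       INTEGER, INTENT(IN)    :: offsets(4)           ! (/xlo, xhi, ylo, yhi/)
       INTEGER, INTENT(INOUT) :: producer_range(2,2)

       producer_range(1,1) = MIN(producer_range(1,1), consumer_range(1,1) + offsets(1))
       producer_range(1,2) = MAX(producer_range(1,2), consumer_range(1,2) + offsets(2))
       producer_range(2,1) = MIN(producer_range(2,1), consumer_range(2,1) + offsets(3))
       producer_range(2,2) = MAX(producer_range(2,2), consumer_range(2,2) + offsets(4))
    END SUBROUTINE GrowRange

Applied to the 2D example above with Q needed on (1:mx, 1:my), the (/0,1,0,0/) dependency forces fx onto (1:mx+1, 1:my), qLx and qRx inherit that range from fx, and q is ultimately needed on (0:mx+1, 0:my+1).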
Without pipelining we would then simply do the following:
    FORALL(i=Q%range(1,1):Q%range(1,2), j=Q%range(2,1):Q%range(2,2), k=Q%range(3,1):Q%range(3,2))
       Q%data(i,j,k) = q%data(i,j,k)                    &
                     + fx%data(i,j,k) - fx%data(i+1,j,k) &
                     + fy%data(i,j,k) - fy%data(i,j+1,k) &
                     + fz%data(i,j,k) - fz%data(i,j,k+1)
    END FORALL
To pipeline we have to update one slice in x at a time so instead we have
    DO index = q%range(1,1), q%range(1,2)
       IF (istime(q, index, i)) THEN
          ...
       END IF
       ...
       ...
       IF (istime(Q, index, i)) THEN
          FORALL(j=Q%range(2,1):Q%range(2,2), k=Q%range(3,1):Q%range(3,2))
             Q%data(Q%x(i),j,k) = q%data(q%x(i),j,k)                            &
                                + fx%data(fx%x(i),j,k) - fx%data(fx%x(i+1),j,k) &
                                + fy%data(fy%x(i),j,k) - fy%data(fy%x(i),j+1,k) &
                                + fz%data(fz%x(i),j,k) - fz%data(fz%x(i),j,k+1)
          END FORALL
       END IF
    END DO
where index is the x position within the large array that we are currently in the process of updating. The %x(:) component is an array that maps a local physical x offset, relative to the row we are updating, to an i-index in a thin array that is only as wide as needed. Finally, the istime function returns the index i that represents which position within the Q%data array we should be calculating. In the above example, the window for q would be three cells wide in the sweep direction, the window for fx would be two cells wide, and every other window would be only one cell wide; q would be retrieved two cycles before Q is updated and stored, while qLx, qRx, & fx would be calculated one cycle before. The dependencies in each calculation are thus explicitly stated and then used to determine the physical extent over which each variable is needed, as well as when during the sweep each variable is first calculated and how long each calculated value is stored.
- Source engine
- AstroBEAR's source engine uses a 2-step Runge-Kutta to try to update values within error bounds. If the error is too large, it switches to a Runge-Kutta 4/5 scheme with variable time stepping to achieve the desired accuracy.
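A hedged sketch of this strategy for a single cell and a single quantity follows; the error estimate shown (comparing the two-step result against a forward Euler step) and the routine names src and rk45_substep are illustrative assumptions, not AstroBEAR's actual source integrator.

    ! Hedged sketch of the source update: try a cheap two-step (midpoint)
    ! Runge-Kutta first, and fall back to a hypothetical adaptive RK45
    ! integrator only if the estimated error exceeds the tolerance.
    SUBROUTINE SourceUpdate(q, dt, tol)
       IMPLICIT NONE
       REAL, INTENT(INOUT) :: q          ! one conserved quantity in one cell
       REAL, INTENT(IN)    :: dt, tol
       REAL, EXTERNAL      :: src        ! hypothetical source term, dq/dt = src(q)
       REAL :: q1, q2, err

       q1  = q + 0.5*dt*src(q)           ! half-step predictor
       q2  = q + dt*src(q1)              ! midpoint (2nd-order) update
       err = ABS(q2 - (q + dt*src(q)))   ! crude error estimate vs forward Euler

       IF (err <= tol) THEN
          q = q2
       ELSE
          CALL rk45_substep(q, dt, tol)  ! hypothetical adaptive RK45 with sub-stepping
       END IF
    END SUBROUTINE SourceUpdate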
- Elliptic engine
- Solving elliptic equations across an unstructured mesh requires several iterations of communication interleaved with computation, so we decided to use a library called HYPRE… It operates on both structured and semi-structured meshes; however, our use is limited to structured meshes. In order to maintain level-advance independence for threading purposes, it was necessary to use forward Euler time integration of the coarse elliptic variables as the boundary conditions for finer-level grids.
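A minimal sketch of this boundary treatment, with assumed names: the coarse elliptic variable (e.g. the gravitational potential) and its time derivative from the last coarse solve are simply extrapolated forward in time to the moment at which the fine level needs boundary values, so fine-level advances never have to wait on a future coarse solve.

    ! Hedged sketch: forward Euler extrapolation of the coarse elliptic
    ! variable to supply boundary values for a fine-level solve at a time
    ! dt_fine after the last coarse solve.  Spatial prolongation onto the
    ! fine boundary cells is assumed to happen separately.
    SUBROUTINE FineBoundaryValues(phi_coarse, dphidt_coarse, dt_fine, phi_boundary)
       IMPLICIT NONE
       REAL, INTENT(IN)  :: phi_coarse(:)      ! coarse values along the fine-grid boundary
       REAL, INTENT(IN)  :: dphidt_coarse(:)   ! their time derivative from the coarse solve
       REAL, INTENT(IN)  :: dt_fine            ! time elapsed since the coarse solve
       REAL, INTENT(OUT) :: phi_boundary(:)

       phi_boundary = phi_coarse + dt_fine*dphidt_coarse
    END SUBROUTINE FineBoundaryValues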
- The supergridding experiment…
- Performance Results
AMR information about other codes
| Code | AMR Library | Type | Globally stored tree |
|---|---|---|---|
| Ramses | | cell based | yes |
| Flash | Paramesh | fixed patch based | yes |
| Orion | Chombo | patch based | yes |
| Pluto | Chombo | patch based | yes |
| Enzo | | patch based | yes |
| Zeus | Paramesh | fixed patch based | yes |
| Nirvana | | patch based | |
cite Elmegreen, future trends in computing, from the 2010 IAUS proceedings