wiki:ScramblerPaper


  • Introduction
    • Larger and larger clusters alone will not allow larger and larger simulations to be performed in the same wall time without either an increase in CPU speed or a decrease in the workload of each processor. Since CPU speeds are not expected to keep pace with the requirements of large simulations, the only option is to decrease the workload of each processor. This, however, requires highly parallelized and efficient algorithms for managing the AMR infrastructure and the necessary computations. Scrambler, unlike many other AMR codes, uses a distributed tree so that each processor stores only the portion of the AMR structure it needs in order to carry out its computations. While the memory needed to replicate the full tree is currently small compared to the resources typically available to a CPU, future clusters are expected to have much less memory per processor, similar to what is already seen in GPUs. Additionally, Scrambler uses a distributed control structure that mirrors the nested AMR grid hierarchy: processors have child processors just as AMR grids have child grids. These child processors receive new grids and all necessary tree information from their parent, which eliminates the need for global sharing of tree data. Processors only need to communicate with parent processors (processors containing parent grids), neighbor processors (processors containing adjacent grids), overlapping processors (processors containing previous AMR grids that overlap with the processor's current grids), and child processors (processors assigned to child grids). Finally, the allocation of resources among child grids is done using a Hilbert space-filling curve (see the ordering sketch at the end of this section). This keeps neighboring grids on processors that are physically close on the network (or on the same core) and allows Scrambler to take advantage of the network topology.
    • In addition to being completely parallelized between processors, Scrambler makes use of threading to allow multiple AMR levels to advance in parallel. This allows for global load balancing instead of balancing each level of the AMR hierarchy separately. This becomes especially important for deep simulations (simulations with low filling fractions but many levels of AMR) as opposed to shallow simulations (high filling fractions and only a few levels of AMR). Processors with coarse grids can advance their grids while processors with finer grids simultaneously advance theirs. Without this capability, base grids would need to be large enough to be distributed across all of the processors, and for simulations with large base grids to finish in a reasonable wall time, only a few levels of AMR can be used. With threading this restriction is lifted.
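    • The following is a minimal sketch, not code from the Scrambler source, of how child grids can be ordered along a 2D Hilbert space-filling curve before being handed out to child processors. The function name, the choice of Python, and the example grid coordinates are ours for exposition; the conversion itself is the standard iterative coordinate-to-index algorithm.

{{{#!python
# A minimal sketch (not Scrambler source) of ordering child grids along a
# 2D Hilbert space-filling curve. 'order' is the number of bits per
# coordinate, so the curve covers a (2^order x 2^order) index space.
def hilbert_index(order, x, y):
    """Convert integer grid coordinates (x, y) to a Hilbert curve index."""
    d = 0
    s = 1 << (order - 1)
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the curve remains continuous.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Hypothetical child grids identified by the integer coordinates of their
# lower corners; sorting by Hilbert index keeps spatial neighbors adjacent
# in the distribution order.
children = [(3, 1), (0, 0), (2, 2), (1, 3)]
children.sort(key=lambda c: hilbert_index(4, c[0], c[1]))
print(children)
}}}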
  • Distributed Tree/Control Algorithm
    • Many if not all current AMR codes opt to store the entire AMR structure on each processor. This, however, requires additional memory as well as communication (7 floats per patch, compared to 256 floats of field data for a 2D 4x4 hydro patch with 2 ghost zones). While this additional memory requirement is negligible for problems run on typical CPUs across hundreds of processors, it can become considerable on thousands or tens of thousands of processors. Since it is expected that efficient memory use and management will be required for HPC down the road, we implemented a distributed tree algorithm in which each processor is only aware of the local tree. After creating its children, each processor decides which surrounding processors need to be notified of its new children. This sharing of children happens between grids that are physically neighboring in space, as well as between previous or future grids that overlap in time. Each grid then receives new neighboring/overlapping children and checks which of them neighbor/overlap its own children. When distributing its children to child processors, it includes each child's neighbors/overlaps so that the child processor is aware of its surrounding tree (see the sketch below). More detailed algorithm
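    • The sketch below is a simplified illustration, with hypothetical names (`GridNode`, `children_to_send`), of the local-tree bookkeeping described above: after a grid creates its children, only neighbors whose regions physically touch a new child are told about it. For simplicity all boxes are written in the same index space; the real code must also handle refinement ratios and temporal overlaps.

{{{#!python
# A minimal sketch (hypothetical names, not Scrambler source) of deciding
# which neighboring processors must be notified of newly created children.

def boxes_touch(a, b):
    """True if boxes a, b = ((xlo, ylo), (xhi, yhi)) overlap or share a face/corner."""
    (alo, ahi), (blo, bhi) = a, b
    return all(ahi[i] >= blo[i] and bhi[i] >= alo[i] for i in range(2))

class GridNode:
    """One node of the locally known AMR tree on this processor."""
    def __init__(self, box, rank):
        self.box = box          # ((xlo, ylo), (xhi, yhi))
        self.rank = rank        # processor that owns the grid
        self.children = []      # locally created child grids
        self.neighbors = []     # adjacent grids on the same level

def children_to_send(grid):
    """For each neighbor's processor, list the new children it must be told about."""
    messages = {}
    for nbr in grid.neighbors:
        to_share = [c for c in grid.children if boxes_touch(c.box, nbr.box)]
        if to_share:
            messages[nbr.rank] = to_share
    return messages

# Hypothetical usage: one neighbor to the right, two newly created children.
me  = GridNode(((0, 0), (8, 8)), rank=0)
nbr = GridNode(((8, 0), (16, 8)), rank=1)
me.neighbors = [nbr]
me.children = [GridNode(((6, 2), (8, 4)), rank=0), GridNode(((1, 1), (3, 3)), rank=0)]
# Only the child touching x = 8 needs to be shared with rank 1.
print({rank: [g.box for g in grids] for rank, grids in children_to_send(me).items()})
}}}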

  • Threaded MultiLevelAdvance
    • Many if not all current AMR codes perform grid updates across all levels in a prescribed order (0, 1, 2, 2, 1, 2, 2, 0, …). Good performance then requires each level update to be balanced across all processors. Load balancing each level, however, often leads to artificial fragmentation of grids and requires each processor to have grids on every level. This limits the depth of the AMR tree, and the resulting simulations are often performed with large base grids and only 1 or 2 levels of AMR. The benefit of AMR, however, is greatest for scenarios with large dynamic ranges, and having only 2 levels of AMR severely limits the dynamic range that can be followed. In AstroBEAR we allow each level's grid advances to happen independently, so that processors with coarse grids can update those grids while other processors with fine grids update theirs (see the threading sketch below)… Include some figures from here
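    • A minimal sketch of the threading idea follows. It is not the AstroBEAR implementation: it assumes a time refinement ratio of 2, uses one Python thread per level, and stands in the real inter-level synchronization (ghost data, flux corrections) with a simple "parent step finished" condition. The point it illustrates is that each level advances in its own thread and only waits on the coarser step it nests inside, rather than all levels following one prescribed serial order.

{{{#!python
# A minimal sketch (illustrative only) of advancing AMR levels in separate
# threads so coarse- and fine-level updates can overlap in time.
import threading

NLEVELS = 3
ROOT_STEPS = 2

finished = [0] * NLEVELS          # completed steps per level
cond = threading.Condition()      # guards 'finished'

def advance(level, step):
    # Placeholder for the real hyperbolic/source/elliptic update of one level.
    print(f"level {level} advancing step {step}")

def level_worker(level):
    nsteps = ROOT_STEPS * (2 ** level)       # refinement ratio 2 in time
    for step in range(1, nsteps + 1):
        if level > 0:
            # Wait until the coarse step this substep nests inside is done,
            # so coarse boundary data are available.
            parent_step = (step + 1) // 2
            with cond:
                cond.wait_for(lambda: finished[level - 1] >= parent_step)
        advance(level, step)
        with cond:
            finished[level] = step
            cond.notify_all()

threads = [threading.Thread(target=level_worker, args=(l,)) for l in range(NLEVELS)]
for t in threads: t.start()
for t in threads: t.join()
}}}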
  • Load Balancing
    • As was mentioned before, threading the level advances removes the need for balancing each level individually and instead allows for global load balancing. During each root (level 0) step there will be 1 set of level 1 grids, 2 sets of level 2 grids, 4 sets of level 3 grids, and 2^(n-1) sets of level n grids that will need to be distributed. Before each new distribution of level n grids, each processor evaluates how much work remains to be finished on levels coarser than n (0:n-1) as well as the new workload associated with its new children. These two numbers are then shared among all processors so that each processor can determine its share of the next level's grids. (Include more detailed calculation; a simplified sketch follows this list.) Each processor then determines the new workloads, which processors it should expect to receive new child grids from, and which processors it should give new grids to. It then sorts its children in Hilbert order before distributing them in order among its child processors. If the load of the next child grid is larger than the load assigned to the next child processor, the algorithm can optionally attempt to split (artificially fragment) the child grid into proportionate pieces before assigning them to the child processors. Usually this is avoided and the first child processor is given the entire child grid, since future distributions of finer grids can compensate for this inequality. If, however, we are distributing finest-level grids, then this splitting is used. The splitting is also used on the coarsest root grid since its size can in general be very large. Wiki page with more related info to incorporate
    • Hilbert Ordering / Splitting
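    • The following is a simplified sketch of the share calculation and the in-order hand-out of Hilbert-sorted children. The function names, the averaging formula, and the greedy assignment are our own simplifications: processors already above the average total load take no new grids, whole grids are handed out in Hilbert order, and the optional splitting of oversized grids is omitted.

{{{#!python
# A minimal sketch (hypothetical, not Scrambler source) of global load balancing:
# each processor's share of the new level-n grids is weighted by how much work
# it still has left on coarser levels.

def target_new_work(remaining, new_total):
    """remaining[p]: work left on levels 0..n-1 for processor p.
    new_total: total workload of the new level-n grids.
    Returns the level-n workload each processor should receive."""
    nproc = len(remaining)
    average = (sum(remaining) + new_total) / nproc   # ideal total load per processor
    # Processors already above the average take no new grids (simplified).
    return [max(0.0, average - r) for r in remaining]

def distribute(children, targets):
    """Hand out Hilbert-ordered children, moving to the next processor once
    the running total fills that processor's cumulative share. Splitting of
    oversized children is omitted here."""
    assignment = {p: [] for p in range(len(targets))}
    p = 0
    cumulative_target = targets[0]
    assigned = 0.0
    for child, load in children:                     # already in Hilbert order
        while p < len(targets) - 1 and assigned >= cumulative_target:
            p += 1
            cumulative_target += targets[p]
        assignment[p].append(child)
        assigned += load
    return assignment

# Example: 4 processors with leftover coarse work, and 6 new child grids.
remaining = [10.0, 4.0, 6.0, 2.0]
children = [("g0", 3.0), ("g1", 2.0), ("g2", 4.0), ("g3", 1.0), ("g4", 2.0), ("g5", 3.0)]
targets = target_new_work(remaining, sum(load for _, load in children))
print(targets)
print(distribute(children, targets))
}}}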
  • Hyperbolic engine
    • Currently AstroBEAR's hyperbolic solver uses a Godunov-type unsplit integrator that utilizes the CTU+CT integration scheme (Stone & Gardiner 08). Unsplit integrators, however, often require that many intermediate variables be stored globally during a grid update, which can require 10-50x the space needed for the array of conservative variables. For GPU-type systems, this would considerably restrict the size of a grid that could be updated. In AstroBEAR we implement a sweep method that essentially pipelines any sequence of calculations into a one-dimensional pass across the grid, where variables are only stored as long as they are needed. The dependencies in each calculation are explicitly stated and then used to determine the physical extent over which each variable is needed, as well as when during the sweep the variable is first calculated and how long each calculated value is stored (a simplified 1D sketch is given below). This is ideally suited for GPU calculations, in which a CPU could constantly be sending in new field values and retrieving updated field values while the GPU uses a minimum of memory.
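    • The sketch below is only a toy 1D illustration of the pipelining idea, not the CTU+CT scheme itself: it uses first-order upwind advection so that the only intermediate quantity is the interface flux, and shows how each intermediate value is kept just long enough (one trailing face) for the downstream update that needs it, instead of being stored over the whole grid.

{{{#!python
# A toy 1D sketch of the sweep idea: intermediate quantities (here, interface
# fluxes) are computed during a single pass and retained only as long as a
# downstream calculation still needs them.

def sweep_update(q, vel, dt, dx):
    """First-order upwind advection of q with constant velocity vel > 0.
    Cells 0 and len(q)-1 act as ghost zones and are not updated."""
    n = len(q)
    q_new = list(q)
    left_flux = vel * q[0]                 # flux through the face between cells 0 and 1
    for i in range(1, n - 1):
        right_flux = vel * q[i]            # flux through the face between cells i and i+1
        q_new[i] = q[i] - dt / dx * (right_flux - left_flux)
        left_flux = right_flux             # only the trailing face flux is retained
    return q_new

# Example: advect a square pulse one step to the right.
q = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
print(sweep_update(q, vel=1.0, dt=0.5, dx=1.0))
}}}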
  • Source engine
    • AstroBEAR's source engine uses a two-step Runge-Kutta scheme to try to update values within specified error bounds. If the error is too large, it switches to a Runge-Kutta 45 integrator with variable time stepping to achieve the desired accuracy (see the control-flow sketch below).
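    • The sketch below shows only the control flow: a two-stage (midpoint) Runge-Kutta step with an embedded error estimate, falling back to finer sub-steps when the estimate exceeds the tolerance. The fallback here is simple recursive step halving rather than the RK45 integrator the text describes, and all names, tolerances, and the example source term are illustrative.

{{{#!python
# A hedged sketch of the source-update control flow (scalar ODE for brevity).

def rk2_step(f, y, t, dt):
    """Midpoint (two-stage) Runge-Kutta step; also returns the first-order
    (forward Euler) result used for the embedded error estimate."""
    k1 = f(t, y)
    k2 = f(t + 0.5 * dt, y + 0.5 * dt * k1)
    return y + dt * k2, y + dt * k1

def source_update(f, y, t, dt, tol, depth=0):
    """Advance y by dt; if the error estimate is too large, fall back to
    adaptive sub-stepping (stand-in for the RK45 fallback)."""
    y2, y1 = rk2_step(f, y, t, dt)
    error = abs(y2 - y1)
    if error <= tol or depth > 20:
        return y2
    half = source_update(f, y, t, 0.5 * dt, tol, depth + 1)
    return source_update(f, half, t + 0.5 * dt, 0.5 * dt, tol, depth + 1)

# Example: a stiff-ish cooling term dy/dt = -10 y over one hydro step.
print(source_update(lambda t, y: -10.0 * y, 1.0, 0.0, 0.1, 1e-4))
}}}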
  • Elliptic engine
    • Solving elliptic equations across an unstructured mesh requires several iterations of communication interleaved with computation, so we decided to use the HYPRE library… It operates on both structured and semi-structured meshes; however, our use is limited to structured meshes. In order to maintain level-advance independence for threading purposes, it was necessary to use forward Euler time integration of coarse elliptic variables to supply the boundary conditions for finer-level grids (see the sketch below).
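    • A minimal sketch of that boundary-condition choice follows: because a fine level advances without waiting for the coarse level's next elliptic solve, the coarse elliptic variable is extrapolated forward in time (forward Euler) to the fine level's time and used as the fine-grid boundary value. The function name and the scalar example values are illustrative only.

{{{#!python
# A minimal sketch of forward-Euler extrapolation of a coarse elliptic variable
# to supply boundary values for a fine-level substep.

def fine_boundary_potential(phi_coarse, dphi_dt_coarse, t_coarse, t_fine):
    """Extrapolate the coarse elliptic variable to the fine level's time."""
    return phi_coarse + (t_fine - t_coarse) * dphi_dt_coarse

# Example: coarse solve gave phi = 2.0 and d(phi)/dt = -0.5 at t = 0.0;
# a fine substep at t = 0.025 uses the extrapolated value on its boundary.
print(fine_boundary_potential(2.0, -0.5, 0.0, 0.025))
}}}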
  • The supergridding experiment…
  • Performance Results
