 * AstroBEAR history??? overview??? Do we want to frame this as a redesign of AstroBEAR 1.0?
 * Larger and larger clusters alone will not allow larger and larger simulations to be performed in the same wall time without either an increase in CPU speed or a decrease in the workload of each processor. Since CPU speeds are not expected to keep pace with the requirements of large simulations, the only option is to decrease the workload of each processor. This, however, requires highly parallelized and efficient algorithms for managing the AMR infrastructure and the necessary computations. AstroBEAR, unlike many other AMR tree management codes (Paramesh, BoxLib), uses a distributed tree so that each processor is only aware of the portion of the AMR structure it needs in order to carry out its computations and perform the necessary communications. While the memory needed to store the full tree is currently small compared to the resources typically available to a CPU, future clusters will likely have much less memory per processor, similar to what is already seen in GPUs. AstroBEAR also uses a distributed control structure that mirrors the nested AMR grid hierarchy: processors have child processors just as AMR grids have child grids. These child processors receive new grids and all necessary tree information from their parent processor(s), which eliminates the need for global sharing of tree data. Processors only need to communicate with parent processors (processors containing parent grids), neighbor processors (processors containing adjacent grids), overlapping processors (processors containing previous AMR grids that overlap with the processor's current grids), and child processors (processors assigned to child grids). Additionally, the allocation of resources among child grids is done using a Hilbert space-filling curve. This allows neighboring processors to be physically close on the network (or on the same core) and allows AstroBEAR to take advantage of the network topology.
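For illustration only, here is a minimal Python sketch of Hilbert-curve load balancing (this is not AstroBEAR's actual implementation; the function names, the dictionary-based grid records, and the work estimates below are all hypothetical). Child grids are ordered by the Hilbert index of their positions, and the ordered list is then cut into contiguous, roughly equal-work chunks, one per processor:

{{{#!python
def hilbert_index(order, x, y):
    """Map integer coordinates (x, y) on a 2**order x 2**order mesh to a
    position along the 2D Hilbert curve (standard xy2d algorithm)."""
    n = 2 ** order
    s, d = n // 2, 0
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/flip the quadrant so the curve remains continuous
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def distribute_grids(grids, nprocs, order=10):
    """Assign each grid a processor rank by sorting along the Hilbert curve
    and splitting the ordered list into contiguous, equal-work pieces."""
    grids = sorted(grids, key=lambda g: hilbert_index(order, g['x'], g['y']))
    total = sum(g['work'] for g in grids)
    assignment, rank, acc = {}, 0, 0.0
    for g in grids:
        assignment[g['id']] = rank
        acc += g['work']
        if rank < nprocs - 1 and acc > (rank + 1) * total / nprocs:
            rank += 1
    return assignment

# Four hypothetical child grids: lower-corner positions plus a cell-count
# work estimate.  Grids 0 and 2 are spatially adjacent and land on the
# same rank because they are adjacent along the Hilbert curve.
grids = [{'id': 0, 'x': 0, 'y': 0, 'work': 64},
         {'id': 1, 'x': 8, 'y': 0, 'work': 64},
         {'id': 2, 'x': 0, 'y': 8, 'work': 128},
         {'id': 3, 'x': 8, 'y': 8, 'work': 64}]
print(distribute_grids(grids, nprocs=2))   # e.g. {0: 0, 2: 0, 3: 1, 1: 1}
}}}

Because grids that are close in space receive close Hilbert indices, cutting the one-dimensional ordering into contiguous chunks tends to keep neighboring grids on the same or nearby ranks, which is the locality property the note above attributes to the space-filling curve.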
 * AstroBEAR also utilizes interlevel threading to allow advances to occur on different levels independently. This allows for total load balancing across all refinement levels instead of balancing each level independently. This becomes especially important for deep simulations (low filling fractions but many levels of AMR) as opposed to shallow simulations (high filling fractions and only a few levels of AMR). Processors with coarse grids can advance their grids while processors with finer grids simultaneously advance theirs. Without this capability, base grids would need to be large enough to be distributed across all of the processors, and for simulations with such large base grids to finish in a reasonable wall time, often only a few levels of AMR can be used. With interlevel threading this restriction is lifted.
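To make the interlevel threading idea concrete, here is a toy Python sketch of the control flow only (it is not AstroBEAR's implementation, which advances many grids per level and communicates via MPI; advance_grids is a placeholder and a refinement ratio of 2 is assumed). Each refinement level runs in its own thread, and a fine level waits only until its parent level has advanced far enough to supply boundary data:

{{{#!python
import threading

FINEST = 2                           # levels 0 (coarsest) .. FINEST
level_time = [0.0] * (FINEST + 1)    # time each level has reached
cond = threading.Condition()

def advance_grids(level, dt):
    """Placeholder for the real per-level hydro update on local grids."""
    pass

def advance_level(level, coarse_dt):
    nsub = 2 ** level                # refinement ratio of 2 assumed
    dt = coarse_dt / nsub
    t = 0.0
    for _ in range(nsub):
        with cond:
            # A fine level cannot run ahead of its parent, since it needs
            # prolongated ghost-zone data from the parent's solution.
            while level > 0 and level_time[level - 1] < t + dt:
                cond.wait()
        advance_grids(level, dt)     # levels advance concurrently here
        t += dt
        with cond:
            level_time[level] = t
            cond.notify_all()

threads = [threading.Thread(target=advance_level, args=(lev, 1.0))
           for lev in range(FINEST + 1)]
for th in threads:
    th.start()
for th in threads:
    th.join()
}}}

The point of the sketch is that a processor holding only coarse grids never has to sit idle while other processors subcycle finer levels; work on every level can be in flight at once, so the load can be balanced over the whole hierarchy rather than level by level.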
 * Many if not all current AMR codes are built upon either BoxLib or Paramesh, both of which opt to store the entire AMR structure on each processor. This, however, requires additional memory as well as communication (7 floats per patch, compared to 256 floats for the data of a 2D 4x4 hydro patch with 2 ghost zones). While this additional memory requirement is negligible for problems run on typical CPUs across 100's of processors, it can become considerable on 1000's or 10000's of processors. Since it is expected that efficient memory use and management will be required for HPC (high performance computing) down the road, AstroBEAR uses a distributed tree algorithm in which each processor is only aware of the nodes surrounding its own nodes: the coarser parent nodes and finer child nodes, as well as physically adjacent neighbors and temporally adjacent overlaps. The tree is maintained with the following steps (see also the [wiki:DistributedTree more detailed algorithm]); a small sketch of the neighbor/overlap bookkeeping follows the list.
   * New Nodes
     * Create new children and determine which processor each will go to.
     * Determine which overlapping nodes might have children that would overlap its own children and send the relevant children info.
     * Determine which neighboring nodes might have children that would neighbor its own children and send the relevant children info.
     * Receive new children from neighbors and determine which of the neighbors' children neighbor its own children.
     * Receive old children from overlaps and determine which of the overlaps' children overlap its own children.
     * For each child that was distributed off processor, send to the child's processor the child's location and the locations of its parent, neighbors, and overlaps.
   * Old Nodes
     * Determine which overlapping nodes might have new children that would overlap its own children and send the relevant children info.
     * Receive new children from overlaps and determine which of the overlaps' children overlap its own children.
     * For each old child that was distributed off processor, send to the child's processor the child's new overlaps.
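As a rough sketch of the bookkeeping in the steps above (the names Node, touches, link_received_children, and pack_child_for_remote are invented for this illustration and are not AstroBEAR's actual data structures or routines), each processor keeps only its own nodes plus their parents, neighbors, and overlaps, and when child lists are exchanged it records only the received children that actually touch its own children:

{{{#!python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One AMR patch in the local tree (illustrative only)."""
    bounds: tuple                 # ((xlo, ylo), (xhi, yhi)) in physical coords
    rank: int                     # processor the node is assigned to
    children: list = field(default_factory=list)
    neighbors: list = field(default_factory=list)
    overlaps: list = field(default_factory=list)

def touches(a, b):
    """True if two boxes overlap or share a boundary in every direction."""
    (alo, ahi), (blo, bhi) = a.bounds, b.bounds
    return all(alo[d] <= bhi[d] and blo[d] <= ahi[d] for d in range(len(alo)))

def link_received_children(my_node, received_children, relation):
    """After a neighboring (or overlapping) node sends its new children,
    keep only the ones that actually neighbor (or overlap) our own children.
    Everything else is discarded, so a processor never holds the full tree."""
    for child in my_node.children:
        links = getattr(child, relation)        # 'neighbors' or 'overlaps'
        links.extend(rc for rc in received_children if touches(child, rc))

def pack_child_for_remote(child, parent):
    """Information sent along when a child is handed to another processor:
    its own location plus the locations of its parent, neighbors, and
    overlaps -- the 'surrounding tree' the remote processor needs."""
    return {'bounds': child.bounds,
            'parent': parent.bounds,
            'neighbors': [n.bounds for n in child.neighbors],
            'overlaps': [o.bounds for o in child.overlaps]}
}}}

Here the relation argument would be 'neighbors' for the new-node steps and 'overlaps' for the old-node steps, and pack_child_for_remote corresponds to the final step of each list, where a child handed off to another processor is shipped together with the locations of its parent, neighbors, and overlaps.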