 * As was mentioned before, threading the levels removes the need for balancing each level individually and instead allows for global load balancing. When distributing children of level [[latex($l$)]] nodes, each processor first calculates the average remaining workload that needs to be accomplished per level [[latex($l$)]] step. If there are [[latex($n_l$)]] level [[latex($l$)]] steps remaining before level [[latex($m$)]] needs to finish its updates, then the remaining workload on level [[latex($m$)]] is divided by [[latex($n_l$)]].

During each root step (level 0) there will be 1 set of level 1 grids, R sets of level 2 grids, R^2^ sets of level 3 grids, and R^n-1^ sets of level n grids that will need to be distributed, where R is the refinement ratio. The basic idea is that before each new distribution of level n grids, each processor evaluates how much work remains to be finished on the levels coarser than n (levels 0 through n-1) per level n-1 step, as well as the new workload associated with its new children over one level n-1 advance (or R level n advances). These two numbers are then shared among all processors so that each processor can determine its share of the next level's grids, chosen so that if the AMR patch configuration were static, each processor would finish updating every coarser grid at the end of the last level n step.
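As a small worked example of this counting (assuming a refinement ratio of R = 2 and n = 3 refined levels):

{{{
#!latex
With $R=2$ and $n=3$, each root step requires $1$ distribution of level 1 grids, $R=2$ distributions of level 2 grids, and $R^2=4$ distributions of level 3 grids, i.e. $\sum_{l=1}^{n} R^{l-1}=7$ distributions in total.
}}}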
The balancing goal can be stated in terms of per-level update times:

{{{
#!latex
Define $T_l=\displaystyle\sum_{p=0}^Pt_l^p$ as the total time required to complete the current updates of grids on level $l$, where $t_l^p$ is the time required to update the level $l$ grids on processor $p$. Also define $R_l=\displaystyle\sum_{p=0}^Pr_l^p$ as the remaining time required to complete the current updates of grids on level $l$, where $r_l^p$ is the remaining time required to update the level $l$ grids on processor $p$. We then want to distribute $T_N$ so as to minimize the time spent waiting after each level's update.
}}}
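A minimal bookkeeping sketch of these per-level quantities (names are hypothetical; in practice the per-processor times would come from the scheduler's own timers):

{{{
#!python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LevelTiming:
    """Timing bookkeeping for one AMR level, indexed by processor."""
    t: List[float] = field(default_factory=list)  # t_l^p: time to update level l grids on processor p
    r: List[float] = field(default_factory=list)  # r_l^p: remaining time for level l grids on processor p

    @property
    def T(self) -> float:  # T_l: total update time summed over processors
        return sum(self.t)

    @property
    def R(self) -> float:  # R_l: total remaining time summed over processors
        return sum(self.r)

# Example: one level on four processors.
level = LevelTiming(t=[4.0, 3.0, 5.0, 4.0], r=[1.0, 0.0, 2.0, 1.0])
print(level.T, level.R)  # 16.0 4.0
}}}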
In more detail: when distributing the children of level [[latex($l$)]] nodes, each processor first calculates the remaining workload [[latex($W_{l'}^p$)]] on each coarser level [[latex($l'$)]], where [[latex($0 \le l' \le l$)]], and the number of remaining level [[latex($l$)]] advances [[latex($n_{l'}^l$)]] that can be completed before level [[latex($l'$)]] completes. Each processor then calculates its workload imbalance per level [[latex($l$)]] step, [[latex($\delta^p_l = \displaystyle\sum_{l'=0}^l \frac{W_{l'}^p}{n_{l'}^l}$)]], as well as its new child load per level [[latex($l$)]] step, [[latex($W_{l+1}^p$)]]. These two numbers [[latex($\delta_l^p$)]] and [[latex($W_{l+1}^p$)]] are globally shared, and each processor then calculates its share of the new level [[latex($l+1$)]] workload, [[latex($\epsilon^p=\frac{1}{NP}\displaystyle\sum_{p'=1}^{NP}\left(W_{l+1}^{p'}+\delta^{p'}_l\right)-\delta_l^p$)]], which satisfies [[latex($\displaystyle\sum_{p=1}^{NP} \epsilon^p = \displaystyle\sum_{p=1}^{NP} W_{l+1}^p$)]]. The new child workloads [[latex($W_{l+1}^p$)]] are then partitioned over the shares [[latex($\epsilon^p$)]] so that each processor can determine from which processors it should expect to receive new child grids as well as to which processors it should give new grids.
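A minimal sketch of this share calculation (Python with NumPy; the function name is hypothetical, and the inputs are assumed to have already been globally shared, e.g. with an MPI allgather):

{{{
#!python
import numpy as np

def new_child_shares(delta, w_child):
    """epsilon^p: each processor's target share of the new level l+1 load,
    given its coarse-level imbalance per step (delta_l^p) and the child
    load it generated locally (W_{l+1}^p)."""
    delta = np.asarray(delta, dtype=float)
    w_child = np.asarray(w_child, dtype=float)
    n_procs = delta.size
    # Global average of (new child load + imbalance) minus the local
    # imbalance: processors behind on coarser levels receive less new work.
    epsilon = (w_child + delta).sum() / n_procs - delta
    assert np.isclose(epsilon.sum(), w_child.sum())  # total child load is conserved
    return epsilon

# Example: processor 0 is the furthest behind on its coarser levels.
print(new_child_shares(delta=[3.0, 1.0, 0.0, 0.0], w_child=[4.0, 4.0, 2.0, 2.0]))
# -> [1. 3. 4. 4.]
}}}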
Each processor then sorts its children in a Hilbert order before distributing them, in order, among its child processors. If the load of the next child grid is larger than the load assigned to the next child processor, the algorithm can optionally attempt to split (artificially fragment) the child grid into proportionate pieces before assigning them to the child processors. Usually this is avoided and the first child processor is given the entire child grid, since future distributions of finer grids can compensate for this inequality. If, however, we are distributing finest-level grids, then this splitting is used. The splitting is also used on the coarsest root grid, since its size can in general be very large. [wiki:ScramblerScheduling Wiki page with more related info to incorporate]
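And a sketch of the in-order distribution with optional splitting (assuming the children are already Hilbert-sorted and each carries a scalar load estimate; the `"*"` labeling of fragments is a hypothetical stand-in for the actual grid-splitting machinery):

{{{
#!python
def distribute_children(children, shares, allow_split=False):
    """Assign Hilbert-ordered children to child processors in order.

    children -- list of (name, load) pairs, already sorted in Hilbert order
    shares   -- target load per child processor (the epsilon^p values)
    Returns one list of (name, load) assignments per child processor.
    """
    assignments = [[] for _ in shares]
    p, remaining = 0, shares[0]
    for name, load in children:
        # Optionally fragment a child that overflows the current share
        # (used for finest-level grids and for the large root grid).
        while allow_split and load > remaining and p < len(shares) - 1:
            assignments[p].append((name + "*", remaining))  # proportionate piece
            load -= remaining
            p, remaining = p + 1, shares[p + 1]
        # Otherwise the current processor simply takes the whole child;
        # later distributions of finer grids compensate for the imbalance.
        assignments[p].append((name, load))
        remaining -= load
        if remaining <= 0 and p < len(shares) - 1:
            p, remaining = p + 1, shares[p + 1]
    return assignments

# Example: three children distributed over the shares computed above.
print(distribute_children([("A", 5.0), ("B", 4.0), ("C", 3.0)],
                          [1.0, 3.0, 4.0, 4.0], allow_split=True))
# -> [[('A*', 1.0)], [('A*', 3.0)], [('A', 1.0), ('B*', 3.0)], [('B', 1.0), ('C', 3.0)]]
}}}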