Changes between Version 11 and Version 12 of ScramblerPaper


Timestamp:
05/13/11 13:21:25 (14 years ago)
Author:
Jonathan
Comment:

  • ScramblerPaper

    v11 v12  
    4141
    4242* Load Balancing
    43  * As was mentioned before, level threading removes the need to balance each level individually and instead allows for global load balancing.  When distributing children of level [[latex($l$)]] nodes, each processor first calculates the average remaining workload on each coarser level [[latex($W_{l'}^p \mbox{ where } 0 \le l' \le l$)]] and the number of remaining level [[latex($l$)]] advances [[latex($n_{l'}^l$)]] that can be completed before level [[latex($l'$)]] completes.  Each processor then calculates the workload imbalance per level [[latex($l$)]] step [[latex($\delta^p_l = \displaystyle\sum_{l'=0}^l \frac{W_{l'}^p}{n_{l'}^l}$)]] as well as the new child load per level [[latex($l$)]] step [[latex($W_{l+1}^p$)]].  These two numbers [[latex($\delta_l^p$)]] and [[latex($W_{l+1}^p$)]] are then globally shared.  Each processor then calculates its share of the new level [[latex($l+1$)]] workload [[latex($\epsilon^p=\frac{\displaystyle\sum_{p=1}^{NP} \left(W_{l+1}^p+\delta^p_l\right)}{NP}-\delta_l^p$)]] so that [[latex($\displaystyle\sum_{p=1}^{NP} \epsilon^p = \displaystyle\sum_{p=1}^{NP} W_{l+1}^p$)]].  The workloads per processor [[latex($W_{l+1}^p$)]] are then partitioned over [[latex($\epsilon^p$)]] so that each processor can determine from which processors it should expect to receive new child grids as well as to which processors it should give new grids.  It then sorts its children in a Hilbert order before distributing them in order among its child processors.  If the load of the next child grid is larger than the load assigned to the next child processor, the algorithm can optionally attempt to split (artificially fragment) the child grid into proportionate pieces before assigning them to the child processors.  Usually this is avoided and the first child processor is given the entire child grid, since future distributions of finer grids can compensate for this inequality.  If, however, we are distributing finest-level grids, then this splitting is used.  
The splitting is also used on the coarsest root grid since its size can in general be very large. [wiki:ScramblerScheduling Wiki page with more related info to incorporate]
    44  * Hilbert Ordering / Splitting
     43 * As was mentioned before, level threading removes the need to balance each level individually and instead allows for global load balancing.  When distributing children of level [[latex($l$)]] nodes, each processor first calculates the average remaining workload on each coarser level [[latex($W_{l'}^p \mbox{ where } 0 \le l' \le l$)]] and the number of remaining level [[latex($l$)]] advances [[latex($n_{l'}^l$)]] that can be completed before level [[latex($l'$)]] completes.  Each processor then calculates the workload imbalance per level [[latex($l$)]] step [[latex($\delta^p_l = \displaystyle\sum_{l'=0}^l \frac{W_{l'}^p}{n_{l'}^l}$)]] as well as the new child load per level [[latex($l$)]] step [[latex($W_{l+1}^p$)]].  These two numbers [[latex($\delta_l^p$)]] and [[latex($W_{l+1}^p$)]] are then globally shared.  Each processor then calculates its share of the new level [[latex($l+1$)]] workload [[latex($\epsilon^p=\frac{\displaystyle\sum_{p=1}^{NP} \left(W_{l+1}^p+\delta^p_l\right)}{NP}-\delta_l^p$)]] so that [[latex($\displaystyle\sum_{p=1}^{NP} \epsilon^p = \displaystyle\sum_{p=1}^{NP} W_{l+1}^p$)]].  The workloads per processor [[latex($W_{l+1}^p$)]] are then partitioned over [[latex($\epsilon^p$)]] so that each processor can determine from which processors it should expect to receive new child grids as well as to which processors it should give new grids.  It then sorts its children in a Hilbert order before distributing them in order among its child processors.  If the load of the next child grid is larger than the load assigned to the next child processor, the algorithm can optionally attempt to split (artificially fragment) the child grid into proportionate pieces before assigning them to the child processors.  Usually this is avoided and the first child processor is given the entire child grid, since future distributions of finer grids can compensate for this inequality.  If, however, we are distributing finest-level grids, then this splitting is used.  
The splitting is also used on the coarsest root grid since its size can in general be very large.
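
As an illustration, the share calculation above can be sketched in a few lines of Python (a minimal sketch with invented names; AstroBEAR itself is Fortran and MPI-parallel, whereas here all processors' numbers are simply gathered in lists):

```python
# Sketch of the global load-balancing share calculation described above.
# W[p] holds processor p's new child (level l+1) workload W_{l+1}^p and
# delta[p] its per-step imbalance delta_l^p.  Names are illustrative.

def child_shares(W, delta):
    """Return each processor's share eps[p] of the new level l+1 work."""
    NP = len(W)
    total = sum(W[p] + delta[p] for p in range(NP))
    # eps^p = (sum_p (W_{l+1}^p + delta_l^p)) / NP - delta_l^p
    return [total / NP - delta[p] for p in range(NP)]

W = [10.0, 30.0, 20.0]     # new child loads per processor
delta = [2.0, -1.0, -1.0]  # existing imbalance per processor
eps = child_shares(W, delta)
# The shares absorb the imbalance and still sum to the total child
# workload: sum(eps) == sum(W) == 60.0, with eps == [18.0, 21.0, 21.0]
```

Overloaded processors (large [[latex($\delta_l^p$)]]) receive a smaller share of the new work, while the total distributed work is conserved.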
    4544
    4645* Hyperbolic engine
    47  * Currently AstroBEAR's hyperbolic solver uses a Godunov-type unsplit integrator that utilizes the CTU+CT integration scheme (Stone & Gardiner 08).  Unsplit integrators, however, often require that many intermediate variables be stored globally during a grid update, which can require 10-50x the space needed for storing the array of conservative variables.  For GPU-type systems, this would considerably restrict the size of a grid that could be updated.  In AstroBEAR we implement a sweep method that essentially pipelines any sequence of calculations into a one-dimensional pass across the grid where variables are only stored as long as they are needed.  The dependencies in each calculation are explicitly stated and then used to determine the physical extent over which each variable is needed, as well as when during the sweep each variable is first calculated and how long each calculated value is stored.  This is ideally suited for GPU calculations in which a CPU could constantly be sending in new field values and retrieving updated field values while the GPU uses a minimum of memory.
     46 * Currently AstroBEAR's hyperbolic solver uses a Godunov-type unsplit integrator that utilizes the CTU+CT integration scheme (Stone & Gardiner 08).  Unsplit integrators, however, often require that many intermediate variables be stored globally during a grid update, which can require 10-50x the space needed for storing the array of conservative variables.  For GPU-type systems, this would considerably restrict the size of a grid that could be updated.  In AstroBEAR we implement a sweep method that essentially pipelines any sequence of calculations into a one-dimensional pass across the grid where variables are only stored as long as they are needed.  This is ideally suited for GPU calculations in which a CPU could constantly be sending in new field values and retrieving updated field values while the GPU uses a minimum of memory. 
     47 * The pipelining is done automatically provided that the dependencies between stencil pieces are explicitly stated.  For example, consider a simple 1D first-order Godunov scheme in which the initial state is stored in {{{q}}} and the updated fields are stored in {{{Q}}}.  The fluxes are stored in {{{fx}}}, and the left and right interface states are stored in {{{qLx}}} and {{{qRx}}} respectively.  We also adopt the convention that stencil pieces stored on cell edges ({{{qLx, qRx, fx}}}) at position {{{i-1/2}}} are stored in their respective arrays with the index {{{i}}}.
     48
     49 ||  Stated dependency  ||  Actual calculation  ||
     50 ||Set_Dependency(qLx, q, (/-1,-1/))||qLx%data(i)=q%data(i-1)||
     51 ||Set_Dependency(qRx, q, (/0,0/))||qRx%data(i)=q%data(i)||
     52 ||Set_Dependency(fx, qLx, (/0,0/))|| fx%data(i)=riemann(qLx%data(i),qRx%data(i))||
     53 ||Set_Dependency(fx, qRx, (/0,0/)) || ||
     54 ||Set_Dependency(Q, fx, (/0,1/))||  Q%data(i)=q%data(i)+fx%data(i)-fx%data(i+1) ||
     55 ||Set_Dependency(Q, q, (/0,0/))|| ||
     56 * Then, assuming that we know the size of {{{Q}}} we want updated, we can set {{{Q%range=(/1,mx, 1, my, 1, mz/)}}} and work backwards to determine what range of values we need to calculate for {{{fx}}}, {{{qRx}}}, {{{qLx}}}, & {{{q}}}.  Without pipelining we would then simply do the following:
     57{{{
     58FORALL(i=Q%range(1,1):Q%range(1,2), j=Q%range(2,1):Q%range(2,2), k=Q%range(3,1):Q%range(3,2))
     59  Q%data(i,j,k)=q%data(i,j,k)+fx%data(i,j,k)-fx%data(i+1,j,k)+fy%data(i,j,k)-fy%data(i,j+1,k)+fz%data(i,j,k)-fz%data(i,j,k+1)
     60END FORALL
     61}}}
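The backward range determination can be sketched as a small fixed-point propagation over the declared dependencies (a Python illustration with invented names; the {{{deps}}} table below simply transcribes the {{{Set_Dependency}}} calls above for the x direction):

```python
# Working backwards from Q's desired x-range through the declared
# dependencies.  deps[var] lists (source_var, (lo, hi)) offset ranges,
# mirroring the Set_Dependency table above (illustrative sketch only).
deps = {
    "Q":   [("fx", (0, 1)), ("q", (0, 0))],
    "fx":  [("qLx", (0, 0)), ("qRx", (0, 0))],
    "qLx": [("q", (-1, -1))],
    "qRx": [("q", (0, 0))],
}

def required_ranges(target, target_range):
    """Propagate the needed x-range of `target` back to its sources."""
    ranges = {target: target_range}
    stack = [target]
    while stack:
        var = stack.pop()
        lo, hi = ranges[var]
        for src, (dlo, dhi) in deps.get(var, []):
            need = (lo + dlo, hi + dhi)
            if src in ranges:  # widen an already-recorded requirement
                need = (min(need[0], ranges[src][0]),
                        max(need[1], ranges[src][1]))
            if ranges.get(src) != need:
                ranges[src] = need
                stack.append(src)   # re-propagate the widened range
    return ranges

# Updating Q on cells 1..mx requires fx on 1..mx+1 and q on 0..mx+1.
```

This reproduces the ghost-zone widths one would derive by hand: each layer of dependency widens the required region of the coarser inputs.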
     62To pipeline, we have to update one slice in x at a time, so instead we have
     63{{{
     64DO index = q%range(1,1):q%range(1,2)
     65
     66  IF (istime(q, index, i)) THEN
     67    ...
     68  END IF
     69
     70  ...
     71  ...
     72
     73
     74  IF (istime(Q, index, i)) THEN
     75    FORALL(j=Q%range(2,1):Q%range(2,2), k=Q%range(3,1):Q%range(3,2))
     76        Q%data(Q%x(i),j,k)=q%data(q%x(i),j,k)+fx%data(fx%x(i),j,k)-fx%data(fx%x(i+1),j,k)+fy%data(fy%x(i),j,k)-fy%data(fy%x(i),j+1,k)+fz%data(fz%x(i),j,k)-fz%data(fz%x(i),j,k+1)
     77    END FORALL
     78  END IF
     79END DO
     80}}}
     81where {{{index}}} is the x position within the large array that we are currently in the process of updating.  The {{{%x(:)}}} array maps a local physical x offset with respect to the row being updated to an i-index in a thin array that is only as wide as is needed.  Finally, the {{{istime}}} function returns the index {{{i}}} that represents which position within the variable's data array we should be calculating.  In the above 1D example, {{{q}}} would be three cells wide, {{{fx}}} would be two cells wide, and everything else would be only one cell wide.  {{{q}}} would be retrieved two cycles before updating/storing {{{Q}}}, while {{{qLx}}}, {{{qRx}}}, & {{{fx}}} would be calculated one cycle before.
     82
     83The dependencies in each calculation are explicitly stated and then used to determine the physical extent over which each variable is needed, as well as when during the sweep each variable is first calculated and how long each calculated value is stored. 
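
The thin-array storage can be sketched as a circular buffer (a Python stand-in with invented names; the real {{{%x(:)}}} mapping and {{{istime}}} bookkeeping in AstroBEAR are more involved):

```python
# Sketch of the thin-array storage used by the sweep: each variable
# keeps only `width` x-slices alive, and x(pos) maps a physical x
# position to a slot in that small buffer.  Illustrative only.

class ThinArray:
    def __init__(self, width):
        self.width = width          # e.g. 3 for q, 2 for fx, 1 for Q
        self.data = [None] * width

    def x(self, pos):
        """Map a physical x position to a slot in the thin buffer."""
        return pos % self.width

    def store(self, pos, value):
        self.data[self.x(pos)] = value

    def load(self, pos):
        return self.data[self.x(pos)]

q = ThinArray(3)   # Q(i) ultimately needs q(i-1), q(i), q(i+1)
fx = ThinArray(2)  # Q(i) needs fx(i) and fx(i+1)
q.store(5, "q5"); q.store(6, "q6"); q.store(7, "q7")
# All three slices Q(6) depends on are live in the 3-wide buffer;
# storing q(8) next would recycle the slot that held q(5), which is
# exactly when q(5) is no longer needed by any pending calculation.
```

The buffer width for each variable falls directly out of the dependency declarations, which is why memory use stays near the minimum during the sweep.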
    4884
    4985* Source engine