Changes between Version 18 and Version 19 of u/johannjc


Timestamp: 04/15/12 14:36:18
Author: Jonathan

[[latex($\Omega t_r+\alpha = \alpha_0 - \Omega \frac{d}{v_w}$)]]

= Threading =

== When is it possible? ==

 * If we can decouple the coarser advances from the finer advances, then we can use threading to speed up the computation.  For hyperbolic equations we can use extended ghost zones to decouple the advances, but for elliptic/parabolic equations some care must be taken to allow for communication between disjoint finer grids embedded within a continuous coarser grid.  In particular, we need estimates for the time derivatives of various quantities in the boundary zones between coarse and fine grids.  There are two options (a sketch of the first follows this list):
  * Use the time derivatives from the previous coarse advance to predict the time derivatives for the current coarse advance.  One can still use extended ghost zones to avoid issues that arise from large fluid gradients, since most elliptic/parabolic quantities vary more smoothly in time.
  * Lag the finer grid advances by one step, so that the previous coarse grid time derivatives become the current fine grid time derivatives.  This, however, requires special care with the synchronization of fluxes etc. from the fine grids, as these will be lagged by one step.  This solution is much more difficult to implement, and large lagged corrections could lead to instabilities.

 * If we don't have any elliptic or parabolic equations, then we can just use extended ghost zones.  Care must be taken not to modify the values of any variables that might be shared by more than one level until every advance thread has completed (e.g. particle masses due to accretion, particle positions, outflow velocities, temporary work arrays, and so on).
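
As a concrete illustration of the first option, here is a minimal sketch (Python, with hypothetical names and array layout) of extrapolating a smoothly varying elliptic/parabolic quantity into the coarse-fine boundary zones using the time derivative saved from the previous coarse advance:

{{{#!python
import numpy as np

def predict_boundary(phi, dphi_dt_prev, t_coarse, t_fine):
    """Estimate boundary-zone values of a smoothly varying quantity phi
    at an intermediate fine-grid time t_fine, by extrapolating from the
    coarse time t_coarse with the time derivative measured during the
    *previous* coarse advance (option 1 above)."""
    return phi + dphi_dt_prev * (t_fine - t_coarse)

# Example: a fine grid needs boundary values halfway through a coarse step.
phi = np.array([1.0, 1.2, 1.5])           # values at the coarse time
dphi_dt_prev = np.array([0.2, 0.1, 0.0])  # lagged time derivatives
print(predict_boundary(phi, dphi_dt_prev, t_coarse=0.0, t_fine=0.5))
}}}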

== Two Approaches ==
 * If we do use threading, there are two ways in which it can be advantageous, depending on how we distribute grids (a toy sketch of both follows this list).  We can use either a
   1. Global breadth first approach to distribution (i.e. the first few processors have all the level 0 grids, the next few processors have all the level 1 grids, and so on).  Think of running multiple simulations where the coarser simulations provide boundary values for the finer simulations.
   2. Local breadth first distribution (i.e. every processor has a level 0 grid, a level 1 grid, and so on).  This is a similar distribution to the non-threaded approach, although corrections can be made to subsequent distributions depending on how much coarse work has been completed.  Levels don't have to be exactly balanced at every level step to get perfect performance, as long as each level is balanced globally (over the entire root step).
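
A toy sketch of the two distributions (Python; grid work loads are plain numbers and all names are illustrative):

{{{#!python
def global_breadth_first(grids_by_level, n_procs):
    # Approach 1: each level gets its own block of processors (equal
    # blocks here for simplicity; in practice each block would be sized
    # by that level's share of the work).  Assumes n_procs >= #levels.
    block = n_procs // len(grids_by_level)
    assignment = {p: [] for p in range(n_procs)}
    for lev, grids in enumerate(grids_by_level):
        for i, work in enumerate(grids):
            assignment[lev * block + i % block].append((lev, work))
    return assignment

def local_breadth_first(grids_by_level, n_procs):
    # Approach 2: deal every level's grids across *all* processors, so
    # each processor ideally holds work on every level.
    assignment = {p: [] for p in range(n_procs)}
    for lev, grids in enumerate(grids_by_level):
        for i, work in enumerate(grids):
            assignment[i % n_procs].append((lev, work))
    return assignment

grids = [[8.0], [4.0, 4.0], [2.0, 2.0, 2.0, 2.0]]  # work on levels 0..2
print(global_breadth_first(grids, 6))
print(local_breadth_first(grids, 6))
}}}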

== Discussion ==
 * Both approaches may require grids to be artificially split into smaller pieces, though the second approach requires at least L*N grids, where L is the number of levels and N the number of processors, while the first approach only requires max(L, N) grids (see the example after this list).
 * Additionally, the first approach uses smaller level communicators for things like elliptic solves or synchronizing ghost data.
 * The second approach allows processors to stay busy while waiting on blocking communications, as long as they have coarser work loads to advance.  (Though creating more work is not the best way to achieve good parallelization.)
 * One can imagine that if the number of cells per processor per level is large, then the additional artificial fragmentation introduces less overhead, at which point the ability to switch between different level advances could be beneficial, especially with large network latency.  With small grids the overhead becomes much larger, at which point the first approach might be more beneficial, although predicting the work loads on finer levels becomes more difficult, as it will depend on the shape of each grid and whether or not it is split across more than one processor.
 * The problem is that, with processor speeds no longer increasing, higher resolution simulations will have to run with fewer and fewer zones per core per level to complete in the same wall time, which is the very limit where both approaches become difficult.
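
To put illustrative numbers on the first point (values purely hypothetical):

{{{#!python
L, N = 4, 512     # hypothetical number of levels and processors
print(L * N)      # approach 2: at least 2048 grid pieces (one per processor per level)
print(max(L, N))  # approach 1: only 512
}}}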

 * A hybrid approach would be to try to give processors grids on every level, but not at the expense of artificially fragmenting grids.  One could imagine a round robin approach to grid distribution that ignores the sizes, as long as no single processor gets overloaded.  Each processor would calculate a list of processors along with the desired work loads, as well as the maximum work loads, that it could use to distribute its locally created children.  It would then try to distribute its children using the desired work loads, but with the ability to use the maximum work loads to avoid fragmentation.  This is like a local knapsack approach, but with stretchable sacks (sketched below).  One could also perform a global knapsack algorithm, provided the number of processors was small.
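
A greedy sketch of the stretchable-sack idea (Python; the first-fit-decreasing strategy and all names are illustrative assumptions, not the actual distribution code):

{{{#!python
def distribute_children(children, desired, maximum):
    """Place child grids whole, preferring processors under their desired
    load and 'stretching' up to the maximum load before fragmenting.

    children -- {child_id: work}; desired/maximum -- {proc: load limit}
    """
    load = {p: 0.0 for p in desired}
    placement = {}
    for child, work in sorted(children.items(), key=lambda kv: -kv[1]):
        # First choice: processors still under their desired load.
        fits = [p for p in load if load[p] + work <= desired[p]]
        if not fits:
            # Stretch the sack: go up to the maximum load rather than
            # splitting the child grid into fragments.
            fits = [p for p in load if load[p] + work <= maximum[p]]
        if not fits:
            # Last resort: overload the least loaded processor (here the
            # real code would have to fragment the grid after all).
            fits = [min(load, key=load.get)]
        p = min(fits, key=lambda q: load[q])
        placement[child] = p
        load[p] += work
    return placement

children = {'g1': 5.0, 'g2': 3.0, 'g3': 3.0}
desired = {0: 4.0, 1: 4.0, 2: 4.0}
maximum = {0: 6.0, 1: 6.0, 2: 6.0}
print(distribute_children(children, desired, maximum))  # {'g1': 0, 'g2': 1, 'g3': 2}
}}}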

== Calculating the desired and maximum work loads ==

 * The desired work load can be calculated by trying to balance the projected remaining work load over all coarser levels.
 * The maximum work load can be calculated by trying to balance the projected remaining work load over all levels, assuming the processor does not receive any finer level work load.  (Both are sketched below.)
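
In code form (a sketch; the per-level remaining-work estimates are assumed given, and a real estimate would weight each level by its number of sub-steps per root step):

{{{#!python
def desired_load(remaining, level, n_procs):
    # Balance the projected remaining work on this level and all
    # coarser levels (0..level) evenly across processors.
    return sum(remaining[:level + 1]) / n_procs

def max_load(remaining, n_procs):
    # Balance the projected remaining work on *all* levels, assuming
    # this processor is given no finer-level work.
    return sum(remaining) / n_procs

remaining = [10.0, 40.0, 160.0]  # projected remaining work per level
print(desired_load(remaining, level=1, n_procs=8))  # (10+40)/8 = 6.25
print(max_load(remaining, n_procs=8))               # 210/8 = 26.25
}}}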

=== More on the max work load ===

What is the maximum combined work we can give each processor?  That is, how much work can this processor take on level n+1 before it will hold up synchronizations at the next level n, n-1, etc.?

=== Consider 2 levels of AMR ===
 * Suppose we are distributing level 0 and we have an estimate for the total level 1 workload.
 * If we give a single processor a level 0 load > (level 0 + level 1)/MPI_NP, then it will keep the other processors waiting at the next level 0 sync.

 * So the max level 0 load is just (total work load for all levels per level 0 step) / MPI_NP.
 * Then, when distributing the level 1 work load, we have to check that nobody will be waiting at the level 1 synchronization as well as the level 0 synchronization.  So how long before any one processor goes idle after a level 1 step?  It is that processor's level 1 workload plus its remaining level 0 work load.  So the quantity to balance is (level 1 workload + remaining level 0 workload), and the max level 1 workload assigned to a processor is whatever keeps its combined load from exceeding the minimum combined load across processors (one reading of this rule is sketched below).
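
One way to read that rule in code (a sketch; remaining_coarse, the level 0 work each processor still has when the level 1 work is handed out, and the balancing target are my own formulation):

{{{#!python
def max_level0_load(w0_total, w1_total, n_procs):
    # A processor given more level 0 work than its fair share of the
    # combined level 0 + level 1 work will stall the next level 0 sync.
    return (w0_total + w1_total) / n_procs

def max_level1_loads(remaining_coarse, w1_total):
    # Balance each processor's combined (level 1 + remaining level 0)
    # load: a processor that still owes more level 0 work gets less
    # level 1 work.
    n_procs = len(remaining_coarse)
    target = (sum(remaining_coarse) + w1_total) / n_procs
    return [max(0.0, target - rc) for rc in remaining_coarse]

print(max_level0_load(100.0, 300.0, n_procs=4))    # 100.0 per processor
print(max_level1_loads([0.0, 20.0, 40.0], 120.0))  # [60.0, 40.0, 20.0]
}}}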
    141