Changes between Version 17 and Version 18 of u/johannjc


Timestamp: 04/15/12 13:19:33
Author: Jonathan

= Threading =

== When is it possible? ==

 * If we can decouple the coarser advances from the finer advances, then we can use threading to speed up the computation.  For hyperbolic equations we can use extended ghost zones to decouple the advance, but for elliptic/parabolic equations some care must be taken to allow for communication between disjoint finer grids embedded within a continuous coarser grid.  In particular, we need estimates for the time derivatives of various quantities in the boundary zones between coarse and fine grids.  There are two options:
  * Use the old time derivatives from the previous coarse advance to predict time derivatives for the current coarse advance (see the sketch after this list).  One can still use extended ghost zones to avoid issues that arise from large fluid gradients, since most elliptic/parabolic quantities vary more smoothly in time.
  * Lag the finer grid advances by one step, so that the previous coarse grid time derivatives become the current fine grid time derivatives.  This, however, requires special care with the synchronization of fluxes etc. from the fine grids, as these will be lagged by one step.  This option is much more difficult to implement and could result in large lagged corrections leading to instabilities.

 * If we don't have any elliptic or parabolic equations, then we can just use extended ghost zones.  Care must be taken not to modify the values of any variables that might be shared by more than one level (i.e. particle masses due to accretion, particle positions, outflow velocities, temporary work arrays, and so on) until every advance thread has completed.

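Below is a minimal sketch of the first option, assuming the boundary-zone values of an elliptic/parabolic quantity are saved at the last two coarse times.  The routine and variable names are purely illustrative.

{{{#!python
import numpy as np

def predict_boundary(phi_prev, phi_curr, t_prev, t_curr, t_fine):
    """Linearly extrapolate coarse boundary-zone data to a fine-grid time.

    phi_prev, phi_curr : boundary-zone values at the previous two coarse times
    t_prev, t_curr     : those coarse times
    t_fine             : time at which the fine grids need boundary values
    """
    dphi_dt = (phi_curr - phi_prev) / (t_curr - t_prev)  # old time derivative
    return phi_curr + dphi_dt * (t_fine - t_curr)        # predicted boundary value

# Example: the current coarse advance spans t=1 to t=2, and the fine grids
# take two substeps that need boundary values at t=1.5 and t=2.0
phi_prev = np.array([1.0, 1.2, 1.1])   # boundary zones at t=0
phi_curr = np.array([1.1, 1.3, 1.2])   # boundary zones at t=1
for t_fine in (1.5, 2.0):
    print(predict_boundary(phi_prev, phi_curr, 0.0, 1.0, t_fine))
}}}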
== Two Approaches ==
 * If we do use threading, there are two ways in which it can be advantageous, depending on how we distribute grids (see the sketch after this list).  We can use either:
   1. A global breadth-first distribution (i.e. the first few processors hold all the level 0 grids, the next few processors hold all the level 1 grids, and so on).  Think of running multiple simulations where the coarser simulations provide boundary values for the finer simulations.
   2. A local breadth-first distribution (i.e. every processor has a level 0 grid, a level 1 grid, and so on).  This is similar to the non-threaded distribution, although corrections can be made to subsequent distributions depending on how much coarse work has been completed.  Levels don't have to be exactly balanced every level step to get perfect performance, as long as each level is balanced globally over the entire root step.

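To make the two distributions concrete, here is a toy sketch.  The grid sizes, processor counts, and function names are all made up, and `grids[l]` is just a list of grid work loads on level l.

{{{#!python
def global_breadth_first(grids, nproc):
    """Approach 1: partition the processors by level, so that all of a
    level's grids live on their own subset of ranks."""
    total = sum(sum(level) for level in grids)
    ranks_by_level, rank = {}, 0
    for l, level in enumerate(grids):
        # give each level a share of ranks proportional to its work load
        n = max(1, round(nproc * sum(level) / total))
        ranks_by_level[l] = list(range(rank, min(rank + n, nproc)))
        rank += n
    return ranks_by_level

def local_breadth_first(grids, nproc):
    """Approach 2: deal every level's grids out round-robin, so each rank
    ends up with a level 0 grid, a level 1 grid, and so on."""
    return {l: [g % nproc for g in range(len(level))]
            for l, level in enumerate(grids)}

grids = [[8.0] * 4, [4.0] * 8, [2.0] * 16]   # 3 levels of made-up work loads
print(global_breadth_first(grids, 4))        # level -> ranks holding that level
print(local_breadth_first(grids, 4))         # level -> owning rank of each grid
}}}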
== Discussion ==
 * Both approaches may require grids to be artificially split into smaller pieces, though the second approach requires at least L*N grids, where L is the number of levels and N is the number of processors, while the first approach only requires max(L, N) grids.
 * Additionally, the first approach uses smaller level communicators for things like elliptic solves or synchronizing ghost data.
 * The second approach allows processors to stay busy while waiting on blocking communications, as long as they have coarser work loads to advance - though creating more work is not the best way to achieve good parallelization.
 * One can imagine that if the number of cells per processor per level is large, then the additional artificial fragmentation introduces less overhead, at which point the ability to switch between different level advances could be beneficial - especially with large network latency.  With small grids the overhead becomes much greater, at which point the first approach might be more beneficial - although predicting the work loads on finer levels becomes more difficult, since they depend on the shape of each grid and whether or not it is split across more than one processor.
 * The problem is that, with processor speeds no longer increasing, higher resolution simulations will have to run with fewer and fewer zones per core per level to complete in the same wall time - which is exactly the limit in which both approaches become difficult.

 * A hybrid approach would be to try to give processors grids on every level, but not at the expense of artificially fragmenting grids.  One could imagine a round-robin approach to grid distribution that ignores grid sizes, as long as no single processor gets overloaded.  Each processor would calculate a list of processors along with desired work loads, as well as maximum work loads, to use when distributing its locally created children.  It would then try to distribute its children using the desired work loads, but with the ability to go up to the maximum work loads to avoid fragmentation - like a local knapsack approach, but with stretchable sacks (see the sketch below).  One could also perform a global knapsack algorithm, provided the number of processors was small.

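A sketch of the ''stretchable sack'' idea for distributing a processor's locally created children, assuming each child grid must go to exactly one processor (no fragmentation).  Here `desired` and `maximum` are the per-processor work loads described in the next section, and all of the names are illustrative.

{{{#!python
def distribute_children(children, procs, desired, maximum):
    """Greedy 'stretchable knapsack': fill each processor toward its desired
    work load, but let it stretch up to its maximum rather than split
    (fragment) a child grid across processors."""
    loads = {p: 0.0 for p in procs}
    owner = {}
    for i, work in sorted(enumerate(children), key=lambda c: -c[1]):
        # prefer the processor with the most room below its desired load
        best = max(procs, key=lambda p: desired[p] - loads[p])
        if loads[best] + work > maximum[best]:
            # even stretching won't fit, so fall back to the least loaded processor
            best = min(procs, key=lambda p: loads[p])
        loads[best] += work
        owner[i] = best
    return owner, loads

# Example with made-up work loads
procs = [0, 1, 2]
children = [5.0, 3.0, 2.0, 2.0, 1.0]
desired = {0: 4.0, 1: 4.0, 2: 4.0}
maximum = {0: 6.0, 1: 6.0, 2: 6.0}
print(distribute_children(children, procs, desired, maximum))
}}}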
== Calculating the desired and maximum work loads ==

 * The desired work load for a processor can be calculated by trying to balance the projected remaining work load for all coarser levels.
 * The maximum work load can be calculated by trying to balance the projected remaining work load for all levels, assuming the processor does not receive any finer level work load.
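Below is one possible way to turn those two rules into numbers, assuming we track, for each level, the projected remaining work and the work already assigned to this processor.  The function is only a sketch of that bookkeeping, with illustrative names.

{{{#!python
def desired_and_max_load(remaining, assigned, child_level, nproc):
    """remaining[l] : projected remaining work on level l over all processors
    assigned[l]  : work already assigned to this processor on level l
    Returns the (desired, maximum) additional work this processor should be
    given while children on level `child_level` are being distributed."""
    have = sum(assigned[:child_level + 1])
    # desired: a fair share of the remaining work on the coarser levels
    # (up to and including the children's level)
    desired = max(0.0, sum(remaining[:child_level + 1]) / nproc - have)
    # maximum: a fair share of the remaining work on *all* levels, assuming
    # this processor receives no finer-level work later on
    maximum = max(0.0, sum(remaining) / nproc - have)
    return desired, maximum

# Example: 3 levels of projected work, this processor already holds 4 units
# of level 0 work, and level 1 children are being handed out among 4 ranks
print(desired_and_max_load([16.0, 32.0, 64.0], [4.0, 0.0, 0.0], 1, 4))   # (8.0, 24.0)
}}}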

=== More on the max work load ===

What is the maximum combined work we can give each processor?

How much work can this processor take on level n+1 before it will hold up synchronizations on levels n, n-1, etc.?
=== Consider 2 levels of AMR ===
 * Suppose we are distributing level 0 and we have an estimate for the total level 1 work load.
 * If we give a single processor a level 0 load > (level 0 + level 1)/MPI_NP, then it will keep the other processors waiting at the next level 0 sync.
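As a worked example with made-up numbers:

{{{#!python
# Made-up numbers: total level 0 work, estimated level 1 work, and MPI_NP
level0_work, level1_work, mpi_np = 40.0, 80.0, 4

# Any processor given more level 0 work than this fair share of the combined
# level 0 + level 1 work will keep the others waiting at the level 0 sync
max_level0_load = (level0_work + level1_work) / mpi_np
print(max_level0_load)   # 30.0
}}}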