

Running An AstroBEAR Simulation

NOTE: This page assumes that you have compiled AstroBEAR and set up a problem directory, and that you are currently logged into the cluster on which you want to run your simulation.


Single-Processor

Note to Rice Cluster Users: The Rice cluster manages jobs using a special scheduler; for Rice cluster usage instructions, click here.

First, you need to find a free node on your cluster to run your job on. Log into a candidate node with ssh NODE_NAME and check the list of running processes with the top command. For instance, to check the 10th node on orda:

ssh orda10
top

If you don't see anyone else using the node (look for usernames other than root), then the node is free. NOTE: multi-core nodes can run multiple processes at once; you can run xbear on a dual-core node alongside one other user process and neither of you will suffer any loss of performance. Consequently, xbear users are encouraged to "stack" several single-processor jobs onto the same node so that more nodes remain free for multi-processor jobs.

If the node is not free (i.e., number of user processes ≥ number of cores), then log out and move on to the next one in the cluster. Cluster node sequences can be found here.
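
If you have several candidate nodes to check, a short shell loop can save some typing. The sketch below is only a convenience, and the node names orda10 through orda12 are placeholders; it lists the owners of all processes other than root's on each node:

for node in orda10 orda11 orda12; do
    echo "=== $node ==="
    ssh $node "ps -e -o user= | grep -v root | sort | uniq -c"
done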

While logged into a free node, move into your problem directory and type:

nohup ./xbear > outfile.out &
tail -f outfile.out 
  • nohup keeps your job running even if your connection to the cluster is closed
  • > outfile.out redirects xbear's output to the file outfile.out
  • & runs the process in the background, freeing the terminal for further input
  • tail -f prints new lines of xbear output as they are written to outfile.out.
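
To check on or stop a single-processor run later, a minimal sketch (assuming the executable is still named xbear) is:

ps -u $USER | grep xbear     # confirm the run is still going and note its PID
kill PID                     # stop the run, replacing PID with the number shown by ps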

Nothing requires you to name the executable "xbear" for every job. For example, the command:

mv xbear newbear

will rename xbear to newbear, and the commands above are easily modified to match the new executable name. This is useful when you have multiple simulations running at once and want to check the status of a particular one with the top or ps commands.
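
A quick sketch of the renamed workflow (jetbear and jetrun.out are just placeholder names, not anything AstroBEAR requires):

mv xbear jetbear
nohup ./jetbear > jetrun.out &
ps -u $USER | grep jetbear     # only this simulation's process is listed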


Multi-Processor

Note to Rice Cluster Users: The Rice cluster manages jobs using a special scheduler; for instructions on how to use the Rice cluster, click here.

First, decide how many processors you want to use. This will probably be determined by problem complexity, desired computational speed, and the number of free nodes.

Type ssh NODE_NAME and check the list of running processes using the top command. For instance, to check the 10th node on orda:

ssh orda10
top

If you don't see anyone else using the node (look for usernames other than root), then the node is free. Multi-core nodes can run multiple processes at once; be sure to compare the number of cores on the node to the number of user processes already running on it.

If the node is not free (i.e., number of user processes ≥ number of cores), then log out and move on to the next one in the cluster. Cluster node sequences can be found here.
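
If you are not sure how many cores a node has, one way to check (assuming the nodes run Linux) is to count the processor entries in /proc/cpuinfo:

ssh orda10 "grep -c '^processor' /proc/cpuinfo"     # prints the number of cores on orda10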

Once you have decided which nodes to use, modify host.def in the problem directory. On nova, the contents of a host.def file might look like this:

#nova cpu=2
#nova201 cpu=2
#nova202 cpu=2
#nova203 cpu=2
#nova204 cpu=2
nova205 cpu=2
nova206 cpu=1
#nova207 cpu=2
#nova208 cpu=2
#nova209 cpu=2
#nova210 cpu=2
#nova211 cpu=2
#nova212 cpu=2
#nova213 cpu=2
#nova214 cpu=2
#nova301 cpu=2
#nova302 cpu=2
#nova303 cpu=2
#nova304 cpu=2
#nova305 cpu=2
#nova306 cpu=2
#nova307 cpu=2
#nova308 cpu=2
#nova309 cpu=2
#nova310 cpu=2
#nova311 cpu=2
#nova312 cpu=2
#nova313 cpu=2
#nova314 cpu=2

Lines prefixed with # are commented out; these nodes will not be used. To change the selected nodes, simply uncomment the ones you want and comment out the ones you don't. Since nova has two processors on each node, the cpu=2 setting must be present to use both of them. Note that nova206 has cpu=1, indicating that someone else was probably using one of its processors when this file was written.
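
The total number of CPUs on the uncommented lines is the number you will pass to mpirun below. As a convenience, you can let the shell add them up; for the example above this prints 3 (two CPUs on nova205 plus one on nova206):

grep -v '^#' host.def | awk -F'cpu=' '{ total += $2 } END { print total }'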

To start running mpibear, log in to one of the nodes you will be using and move to the problem directory. Type the following commands:

lamboot host.def
nohup mpirun -np number_of_processors ./mpibear > firstrun.out &
tail -f firstrun.out

replacing number_of_processors with the number of processors to use.
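
For example, with the host.def shown above (two CPUs on nova205 and one on nova206), the run would use three processors:

lamboot host.def
nohup mpirun -np 3 ./mpibear > firstrun.out &
tail -f firstrun.out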

Important: make sure to enter lamboot host.def before any multi-processor run. This starts communication between the processors you intend to use, as specified by host.def. Without the host.def parameter, lamboot will assume all processes must run on the one node you're currently logged into, resulting in severely reduced performance for that node.

To terminate your parallel run, type:

wipe host.def

from the same directory where you started the run (you can do this from any node running your job). This terminates all the mpibear processes you started on the nodes specified in host.def, which is especially important if you are stopping in the middle of a run; killing only one process would leave the processes on the other nodes hanging.
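
If you want to confirm that nothing was left behind, you can check each node you were using; for instance (nova205 stands for whichever nodes were listed in your host.def):

ssh nova205 'ps -u $USER | grep mpibear'     # should print nothing after wipe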
