Running An AstroBEAR Simulation
NOTE: This page assumes that you have compiled AstroBEAR and set up a problem directory, and that you are currently logged into the cluster on which you want to run your simulation.
Single-Processor
Note to Rice Cluster Users: The Rice cluster manages jobs using a special scheduler; for Rice cluster usage instructions, click here.
First, you need to find a free node on your cluster to run your job. Type ssh NODE_NAME and check the list of running processes using the top command. For instance, to check the 10th node on orda:
ssh orda10
top
If you don't see anyone using the node (look for usernames other than root) then the node is free. NOTE: multi-core systems can have multiple processes running; you can run xbear on a dual-core node with one other user process and neither one of you will suffer any loss of performance. Consequently, xbear users are encouraged to "stack" many single-processor jobs onto the same node to reduce multi-processor job loads.
If the node is not free (i.e., number of user processes ≥ number of cores), then log out and move on to the next one in the cluster. Cluster node sequences can be found here.
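If you have to hunt through many nodes, a quick survey from the head node can save time. This is a minimal sketch, assuming the nodes are named orda1 through orda14 (adjust the names and count to match your cluster):
# Survey the load on each node (node names assumed; adjust to your cluster).
for i in $(seq 1 14); do
  echo "=== orda$i ==="
  ssh orda$i uptime
done
uptime reports each node's load average; a load well below the number of cores usually means the node has room for another job, but log in and check top before committing to it.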
While logged into a free node, move into your problem directory and type:
nohup ./xbear > outfile.out & tail -f outfile.out
- nohup keeps your job running even if your connection to the cluster gets closed.
- > outfile.out redirects xbear's output to the outfile.out file.
- & backgrounds the process, freeing the terminal for further input.
- tail -f prints new lines of xbear output as they are written to outfile.out.
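You can stop watching the output at any time with Ctrl-C; the job keeps running because of nohup. A minimal sketch for checking on the job later, assuming the executable is still named xbear and the output file is outfile.out:
ps -u $USER | grep xbear
tail -f outfile.out
The first command confirms the process is still alive; the second resumes following the output.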
Nothing requires you to name the executable "xbear" for every job. The command:
mv xbear newbear
will rename xbear to newbear, and the commands above are easily modified to match the new executable name. This is useful if you have multiple simulations running at once and want to check the status of one using the top or ps commands.
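For example, a job started from the renamed executable can be launched and watched exactly as before; the output file name newrun.out here is just an illustration:
nohup ./newbear > newrun.out & tail -f newrun.out
Each job then appears under its own executable name in top and ps, which makes them easy to tell apart.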
Multi-Processor
Note to Rice Cluster Users: The Rice cluster manages jobs using a special scheduler; for instructions on how to use the Rice cluster, click here.
First, decide how many processors you want to use. This will probably be determined by problem complexity, desired computational speed, and the number of free nodes.
Type ssh NODE_NAME and check the list of running processes using the top command. For instance, to check the 10th node on orda:
ssh orda10
top
If you don't see anyone using the node (look for usernames other than root) then the node is free. Multi-core processors can have multiple processes running on them; be sure to compare the number of cores on the node to the number of user processes running on it.
If the node is not free (i.e., number of user processes ≥ number of cores), then log out and move on to the next one in the cluster. Cluster node sequences can be found here.
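If you are not sure how many cores a node has, you can check from the node itself. On most Linux clusters the processor count is listed in /proc/cpuinfo:
grep -c ^processor /proc/cpuinfo
Compare that count to the number of non-root processes reported by top to decide whether the node has a free core.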
Once you have decided which nodes to use, modify host.def in the problem directory. On nova, the contents of a host.def file might look like this:
#nova cpu=2
#nova201 cpu=2
#nova202 cpu=2
#nova203 cpu=2
#nova204 cpu=2
nova205 cpu=2
nova206 cpu=1
#nova207 cpu=2
#nova208 cpu=2
#nova209 cpu=2
#nova210 cpu=2
#nova211 cpu=2
#nova212 cpu=2
#nova213 cpu=2
#nova214 cpu=2
#nova301 cpu=2
#nova302 cpu=2
#nova303 cpu=2
#nova304 cpu=2
#nova305 cpu=2
#nova306 cpu=2
#nova307 cpu=2
#nova308 cpu=2
#nova309 cpu=2
#nova310 cpu=2
#nova311 cpu=2
#nova312 cpu=2
#nova313 cpu=2
#nova314 cpu=2
Lines prefixed with # are commented out; these nodes will not be used. To change the selected nodes, simply uncomment the ones you want and comment out the ones you don't. Since nova has two processors on each node, cpu=2 must be set to use both of them. Note that nova206 has cpu=1, indicating that someone else was probably using one of the processors on nova206 at the time.
To start running mpibear, log in to one of the nodes you will be using and move to the problem directory. Type the following commands:
lamboot host.def
nohup mpirun -np number_of_processors ./mpibear > firstrun.out & tail -f firstrun.out
replacing number_of_processors with the number of processors to use.
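Rather than counting processors by hand, you can total the CPUs enabled in host.def with standard shell tools; this is only a convenience, not part of AstroBEAR itself:
grep -v '^#' host.def | awk -F= '{sum += $2} END {print sum}'
With the example host.def above (nova205 cpu=2 and nova206 cpu=1 uncommented), this prints 3, which is the value to pass to mpirun -np.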
Important: make sure to enter lamboot host.def before any multi-processor run. This starts communication between the processors you intend to use, as specified by host.def. Without the host.def parameter, lamboot will assume all processes must run on the one node you're currently logged into, resulting in severely reduced performance for that node.
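After lamboot finishes, it is worth confirming which nodes actually joined. Standard LAM/MPI installations provide a lamnodes command that lists every booted node and its CPU count (check that your installation includes it):
lamnodes
If a node you expected is missing from the list, re-check host.def and run lamboot host.def again.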
To terminate your parallel run, type:
wipe host.def
from the same directory where you started the run (you can do so from any node running your job). This will terminate all the mpibear processes you have started on the nodes specified in host.def. This is especially important to do if you are terminating in the middle of a run; if you only kill one process, you will leave several processes hanging on any other nodes you were using.
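If you suspect stray processes survived anyway (after a crashed run, for example), you can check each node by hand. A minimal sketch, using the nodes from the example host.def above and assuming the executable is still named mpibear:
ssh nova205 "ps -u $USER | grep mpibear"
ssh nova206 "ps -u $USER | grep mpibear"
Any leftover processes can be killed directly on that node, or cleaned up with LAM's lamclean command if the LAM daemons are still running.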