TORQUE

Links
TORQUE Admin Manual

Austrian Grid Wiki

Howto Build a Basic Gentoo Beowulf Cluster

Specify Compute Nodes
Syntax of nodes file:

The [:ts] option marks the node as timeshared. Timeshared nodes are listed by the server in the node status report, but the server does not allocate jobs to them.

The [np=] option specifies the maximum number of processes a node is allowed to run.

The node processor count can be automatically detected by the TORQUE server if auto_node_np is set to TRUE. This can be set using the command qmgr -c "set server auto_node_np = True". Setting auto_node_np to TRUE overrides the value of np set in $TORQUECFG/server_priv/nodes.

The [properties] option allows you to specify arbitrary strings to identify the node. Property strings are alphanumeric characters only and must begin with an alphabetic character.

Comment lines are allowed in the nodes file if the first non-white space character is the pound sign (#).
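Putting these options together, a $TORQUECFG/server_priv/nodes file might look like this (the node names, processor counts, and property strings are illustrative):

```sh
# compute nodes: np sets the processor count, remaining words are properties
node01 np=4 cluster01 rackA
node02 np=4 cluster01 rackA
node03 np=8 cluster01 rackB
# timeshared head node: reported in node status, but no jobs are allocated to it
master:ts
```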

Creating a passwordless key for all nodes
Since your home directory is NFS-mounted across all nodes, you only need to create one key in your home directory and it will automatically be present on every node. Here is the sequence to perform:
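First, generate the key pair (an RSA key is assumed here; press Enter at the file-location prompt to accept the default path):

```sh
ssh-keygen -t rsa
```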

The ssh-keygen command will prompt for a passphrase; leave it empty, since we don't want to be asked for one when logging onto the nodes. We then add the newly generated key to the authorized_keys file:
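Assuming the default key location from the previous step, the public key can be appended like this:

```sh
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```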

Now we must log onto all the nodes so that their unique signature is added to our ssh configuration. To make the process simpler, we can loop the process as such:
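A minimal sketch of such a loop, assuming the nodes are named node1 through node20 (adjust the names and range to your cluster):

```sh
# running a trivial command over ssh adds each node's host key
# to ~/.ssh/known_hosts on first contact
for i in $(seq 1 20); do
    ssh node$i hostname
done
```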

This will log you onto each node and fetch the hostname value (we use hostname so that ssh is only used to launch a simple command and doesn't actually open a session on the node). Here is an example output; note that some of the nodes aren't available (ssh: node20: Name or service not known) and some of them were already registered (they simply return their hostname):

Changing Node State
A common task is to prevent jobs from running on a particular node by marking it offline with pbsnodes -o nodename. Once a node has been marked offline, the scheduler will no longer consider it available for new jobs. Simply use pbsnodes -c nodename when the node is returned to service.
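For example, to take a node out of service and later return it (the node name node07 is illustrative):

```sh
pbsnodes -o node07   # mark node07 offline; no new jobs will be scheduled on it
pbsnodes -l          # verify: node07 should now be listed as offline
pbsnodes -c node07   # clear the offline state once the node is healthy again
```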

Also useful is pbsnodes -l, which lists all nodes with an interesting state, such as down, unknown, or offline. This provides a quick glance at nodes that might be having a problem.

Limit real memory use
To limit real memory use and have the job killed if going over this limit do:
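For example, requesting a 2 GB real-memory limit at submission time (the script name is illustrative; mem is TORQUE's per-job physical memory resource, and enforcement of the limit depends on how pbs_mom is configured):

```sh
qsub -l mem=2gb myjob.sh
```

Equivalently, the limit can be placed in the job script itself as a directive: #PBS -l mem=2gb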

Using tracejob to Locate Job Failures
The tracejob utility extracts job status and job events from accounting records, mom log files, server log files, and scheduler log files. Using it can help identify where, how, and why a job failed. This tool takes a job id as a parameter as well as arguments to specify which logs to search, how far into the past to search, and other conditions.

Syntax

tracejob [-a|s|l|m|q|v|z] [-c count] [-w size] [-p path] [-n days] [-f filter_type] <JOBID>

-p : path to PBS_SERVER_HOME
-w : number of columns of your terminal
-n : number of days in the past to look for job(s) [default 1]
-f : filter out types of log entries; multiple -f's can be specified: error, system, admin, job, job_usage, security, sched, debug, debug2, or absolute numeric hex equivalent
-z : toggle filtering excessive messages
-c : what message count is considered excessive
-a : don't use accounting log files
-s : don't use server log files
-l : don't use scheduler log files
-m : don't use mom log files
-q : quiet mode - hide all error messages
-v : verbose mode - show more error messages
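For example, to trace a job while searching the last two days of logs (the job id 1234 is illustrative):

```sh
tracejob -n 2 1234
```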

How do I exclude a node from running a job?
Sometimes it is useful to exclude a specific node from running your jobs. This can be due to hardware or software problems on that node, for instance when the node seems to have trouble with the interconnect.

The simplest way to do this is to submit a dummy job to this node:
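A sketch of such a dummy job, assuming the problematic node is named node20 (adjust the node name to your cluster; the walltime should cover the sleep duration):

```sh
echo "sleep 600" | qsub -l nodes=node20 -l walltime=00:12:00 -N dummy
```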

This job will then run sleep for 600 seconds, and you can submit your real job afterwards; it will run on the other nodes. This will cost you some CPU hours off your quota, but let us know and we will refund them to you later.

Hold and resume jobs
To hold a job in the queue:
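Assuming job id 1234 (illustrative):

```sh
qhold 1234
```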

To release a held job in the queue:
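Again assuming job id 1234 (illustrative):

```sh
qrls 1234
```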