Cluster Usage Notes

Becoming root

To become root, type:

 su -

then enter the root password. The "-" in "su -" is important: it starts a login shell, so root's startup files are read. Typing su alone skips them and will not work.

Adding a user account

To add a user account use the command:

 useradd -u uid user

where uid is an integer either supplied by OISM or 1 greater than the highest UID already in the /etc/passwd file.
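
The "1 greater" rule can be computed instead of read off by eye. The following is a minimal sketch, not part of the cluster's own tooling: next_uid is a hypothetical helper, and its default bounds of 1000 and 60000 are assumptions chosen to skip system accounts and high-numbered entries such as nobody. Adjust them to the UID range actually used on the cluster.

```shell
# next_uid FILE [MIN] [MAX]   (hypothetical helper, not an existing command)
# Print one more than the highest UID in FILE that lies between MIN and MAX.
# MIN and MAX default to 1000 and 60000, which are assumed values.
next_uid() {
    awk -F: -v min="${2:-1000}" -v max="${3:-60000}" '
        $3 >= min && $3 <= max && $3 > top { top = $3 }
        END { print (top ? top + 1 : min) }
    ' "$1"
}
```

It could then be combined with the command above, for example:

 useradd -u $(next_uid /etc/passwd) user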

On burn, home directories are in a different location, so use:

 useradd -d /home4/user -u uid user

Next, use the following commands to set and synchronize the password:

 passwd user
 pwconv
 passsync

Add information to /etc/passwd about the user you just added (so in a year we'll know who the account was for). For example, change

 jdoe:x:12345:18660::/home/jdoe:/bin/bash

to

 jdoe:x:12345:18660:John Doe (NIST):/home/jdoe:/bin/bash

See /etc/passwd for other examples. Then type passsync to update your changes.
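
To preview the change before editing /etc/passwd by hand, something like the following sketch may help. set_gecos is a hypothetical helper that only prints the modified file contents; it does not write anything. It assumes the comment contains no colons or backslashes.

```shell
# set_gecos FILE USER COMMENT   (hypothetical helper)
# Print FILE with the fifth (comment) field of USER's entry set to COMMENT.
# The file itself is left untouched.
set_gecos() {
    awk -F: -v OFS=: -v user="$2" -v comment="$3" '
        $1 == user { $5 = comment }
        { print }
    ' "$1"
}
```

For example, set_gecos /etc/passwd jdoe "John Doe (NIST)" prints the file with the jdoe line changed as shown above. On systems that provide it, usermod -c "John Doe (NIST)" jdoe makes the same change in place.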

Adding a Samba account

Type the following command to add a Samba account (used to access directories on blaze or burn from a Windows PC):

 smbpasswd -a user

Removing a user account

Edit the /etc/passwd file and delete the user's entry.

Then update the password files by typing:

 pwconv
 passsync

Remove the user's files with:

 cd /home
 rm -r user_name
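
Because rm -r run as root in /home is unforgiving, a guarded wrapper can catch an empty or mistyped name. This is a sketch under assumptions, not an existing script on the cluster: remove_home is a hypothetical name, and the base directory defaults to /home (pass /home4 as the second argument on burn).

```shell
# remove_home USER [BASE]   (hypothetical helper)
# Remove BASE/USER after checking that USER is a plain, non-empty
# directory name. BASE defaults to /home.
remove_home() {
    user=$1
    base=${2:-/home}
    case $user in
        ''|.|..|*/*)
            echo "remove_home: refusing unsafe name '$user'" >&2
            return 1 ;;
    esac
    if [ ! -d "$base/$user" ]; then
        echo "remove_home: $base/$user is not a directory" >&2
        return 1
    fi
    rm -r -- "$base/$user"
}
```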

Setting up ssh keys

Generate a key pair by typing:

 ssh-keygen -t rsa

Then cd into the .ssh directory under your home directory and type:

 cat id_rsa.pub >> authorized_keys2
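
Running the cat command twice appends the key twice, and sshd commonly ignores key files whose permissions are too open. The sketch below covers both points. install_pubkey is a hypothetical helper, and the 700/600 modes are the conventional choice, not something these notes prescribe.

```shell
# install_pubkey SSH_DIR [PUBKEY]   (hypothetical helper)
# Append PUBKEY (default SSH_DIR/id_rsa.pub) to SSH_DIR/authorized_keys2
# unless that exact line is already present, then tighten permissions.
install_pubkey() {
    dir=$1
    pub=${2:-$dir/id_rsa.pub}
    auth=$dir/authorized_keys2
    touch "$auth"
    # -x whole-line match, -F fixed strings, -f read the pattern from the key file
    grep -qxFf "$pub" "$auth" || cat "$pub" >> "$auth"
    chmod 700 "$dir"
    chmod 600 "$auth"
}
```

Typical use would be install_pubkey ~/.ssh after ssh-keygen has finished.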

Managing nodes (rebooting, powering up and/or down)

Rebooting a node

If a node (say blaze001) becomes unresponsive, type the following command as root:

placeholder for power reset command

Rebooting the cluster

Log in to the blaze console as root.

placeholder for instructions to reboot cluster

Powering down the cluster

Log in to the blaze console directly as root and perform the following steps. DO NOT LOG IN FROM YOUR USER ACCOUNT. (The POWER_OFF.sh script kills all user processes, so a user shell used to become root would be killed too.)
cd /usr/local/bin
./POWER_OFF.sh
power_off
Press the power button on the UPSs located at the bottom of each cabinet (for now just the blaze and burn cabinets).
Turn off the four circuit breakers on the right side of the room (back of the cabinets).
Turn off the AC unit (this can be done at any time).

Powering up the cluster (reverse of power down instructions)

Turn on the four circuit breakers on the right side of the room (back of the cabinets). Wait a few minutes to give the network switches a chance to boot up (they are computers too!).
Press the power button on the UPSs located at the bottom of each cabinet (for now just the blaze and burn cabinets).
Log in to the blaze console directly as root and perform the following steps. DO NOT LOG IN FROM YOUR USER ACCOUNT. (The POWER_OFF.sh script kills all user processes, so a user shell used to become root would be killed too.)
cd /usr/local/bin
./POWER_ON.sh
Turn on the AC unit (this can be done at any time).

Setting clocks to the correct time

On blaze, run the following script as root:

/usr/local/bin/RESET_CLOCK.sh

Troubleshooting nodes and taking them off the batch queue

For example, to take blaze011 off the default (batch) queue, perform the following steps:

Edit /var/spool/torque/server_priv/nodes on blaze, and change the line

 node011 np=8 16g compute

to

 node011 np=8 16g testqueue
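
The edit above can be previewed without touching the nodes file. This is a rough sketch: retag_node is a hypothetical helper that prints the modified file to standard output, leaving it to the administrator to review the result and install it.

```shell
# retag_node FILE NODE OLD NEW   (hypothetical helper)
# Print FILE with the property OLD replaced by NEW on NODE's line only.
# Fields on the changed line are re-joined with single spaces.
retag_node() {
    awk -v node="$2" -v old="$3" -v new="$4" '
        $1 == node {
            for (i = 2; i <= NF; i++) if ($i == old) $i = new
        }
        { print }
    ' "$1"
}
```

For example: retag_node /var/spool/torque/server_priv/nodes node011 compute testqueue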

Create a new queue (mine is called testing_queue) using the following commands:

 qmgr -c "create queue testing_queue"
 qmgr -c "set queue testing_queue queue_type = Execution"
 qmgr -c "set queue testing_queue resources_default.neednodes = testqueue"
 qmgr -c "set queue testing_queue resources_default.nodes = 1"
 qmgr -c "set queue testing_queue enabled = True"
 qmgr -c "set queue testing_queue started = True"
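
If more than one test queue is ever needed, the same six commands can be generated for any queue name and node property. The sketch below is a dry run only: queue_commands is a hypothetical helper that prints the qmgr commands and does not execute them.

```shell
# queue_commands QUEUE PROPERTY   (hypothetical helper)
# Print the qmgr commands that would create QUEUE and bind it to nodes
# carrying PROPERTY in the nodes file. Nothing is executed.
queue_commands() {
    q=$1
    prop=$2
    printf 'qmgr -c "create queue %s"\n' "$q"
    for setting in \
        "queue_type = Execution" \
        "resources_default.neednodes = $prop" \
        "resources_default.nodes = 1" \
        "enabled = True" \
        "started = True"
    do
        printf 'qmgr -c "set queue %s %s"\n' "$q" "$setting"
    done
}
```

Running queue_commands testing_queue testqueue prints the commands listed above; after reviewing them, the output can be piped to sh as root.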

Restart pbs_server and maui

Important! You should restart the PBS server with the following commands (NOT pbs_server stop and NOT pbs_server restart), or you risk stopping jobs that are currently running:

 qterm -t quick
 /etc/init.d/pbs_server start
 /etc/init.d/maui restart

Now blaze011 no longer accepts jobs submitted to the default queue (batch). To run a job on it, name testing_queue explicitly, for example:

qfds.sh -r -q testing_queue casename.fds

Useful queuing commands

list available queues: qmgr -c 'p s'
delete a queue: qmgr -c 'delete queue fire60s'

When setting up torque, make sure the following is used:

 qmgr -c "set server scheduling = True"

Queues

blaze

Note: only batch is available now.

batch - blaze001->blaze029 - original blaze queue
batch2 - blaze036->blaze071 - nodes in 2nd blaze cabinet
batch3 - blaze072->blaze107 - nodes in 3rd blaze cabinet
batch4 - blaze108->blaze119 - leftover nodes

burn

batch - burn001->burn036

Torque configuration changes

A file named /etc/sysconfig/pbs_mom containing

 #!/bin/bash
 ulimit -s unlimited

was added to each compute node on the burn and blaze clusters. This ensures that an unlimited stack is available on every node of an Open MPI job.

Restarting ganglia

Restart gmond and gmetad on the head node with:

  /etc/init.d/gmond restart
  /etc/init.d/gmetad restart

Restart gmond on all nodes (gmetad does not run on compute nodes) with:

  /usr/local/bin/ganglia_restart.sh

Fixing cluster problems

Make a user's home directory readable

  chmod 755 ~username

Make all files in a directory tree readable (i.e., accessible to everyone)

cd to one level above the directory you wish to make readable and type the following command. If you don't own the directory, you'll need to be root.

  chmod +r -R directory_name
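
Note that +r grants read permission only; other users also need search (x) permission on each directory before they can descend into it. A possible alternative, offered as a sketch, uses the symbolic X mode, which adds execute permission to directories (and to files that are already executable) but not to ordinary files. make_tree_readable is a hypothetical wrapper name.

```shell
# make_tree_readable DIR   (hypothetical helper)
# Give everyone read access to DIR and everything below it, plus search
# permission on directories. Ordinary files do not become executable.
make_tree_readable() {
    chmod -R a+rX "$1"
}
```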

Samba is not working

See if the Samba daemon is running by typing:

  ps -el | grep smb

in a command shell. If you don't see anything (or even if you do), type as root:

  /etc/init.d/smb restart

to restart the daemon.

Checking disk usage

To see how much space (in kilobytes) is used by the directory named dir, type:

  du -ks dir

To see how much space is used by all files/directories in the current directory, type:

  du -ks *
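
When hunting for what is filling a disk, it usually helps to sort the output and to include hidden entries. The following is a minimal sketch; usage_by_size is a hypothetical helper, and it assumes a find that supports -mindepth and -maxdepth (GNU and BSD find both do).

```shell
# usage_by_size DIR   (hypothetical helper)
# Print the size in kilobytes of every entry directly under DIR,
# hidden ones included, largest first.
usage_by_size() {
    find "$1" -mindepth 1 -maxdepth 1 -exec du -ks {} + | sort -rn
}
```

For example, usage_by_size /home | head lists the ten largest home directories.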

Change email From addresses

Edit /etc/postmap/canonical by adding lines of the form:

 [email protected] [email protected]

Then type the following commands (on firevis):

 postmap /etc/init.d/canonical
 /etc/init.d/postfix restart
