Cluster Usage Notes
- To become root, type:
su -
then type the root password. The "-" in "su -" is important: it starts a login shell, so root's startup files are read. Typing just su will not set up root's environment.
- To add a user account use the command:
useradd -u uid user
where uid is an integer either supplied by OISM or one greater than the highest UID already used in the /etc/passwd file.
On burn use:
useradd -d /home4/user -u uid user
(since the home directory is in a different location)
- Next, use the following commands to set and synchronize the password:
passwd user
pwconv
passsync
- Add information to
/etc/passwd
about the user you just added (so in a year we'll know who the account was for). For example, change
jdoe:x:12345:18660::/home/jdoe:/bin/bash
to
jdoe:x:12345:18660:John Doe (NIST):/home/jdoe:/bin/bash
See /etc/passwd for other examples. Then type passsync to synchronize your changes.
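The UID lookup and the comment-field edit above can be sketched in shell. This is only an illustration, not part of the cluster's tooling: next_uid and set_gecos are hypothetical helper names, the cutoff of 60000 for special accounts such as nobody is an assumption, and set_gecos only prints the edited file so the result can be reviewed before /etc/passwd is touched.

```shell
#!/bin/sh
# Hypothetical helpers for the account steps above; names are illustrative.

# next_uid [FILE]: print one more than the highest UID in a passwd-format file.
# Assumption: UIDs of 60000 and above (such as "nobody") are special and skipped.
next_uid() {
    awk -F: '
        $3 ~ /^[0-9]+$/ && $3 + 0 < 60000 && $3 + 0 > max { max = $3 + 0 }
        END { print max + 1 }
    ' "${1:-/etc/passwd}"
}

# set_gecos USER COMMENT [FILE]: print FILE with USER's comment field replaced.
# Nothing is written back; review the output before changing /etc/passwd.
set_gecos() {
    awk -F: -v OFS=: -v user="$1" -v comment="$2" '
        $1 == user { $5 = comment }
        { print }
    ' "${3:-/etc/passwd}"
}
```

For example, next_uid /etc/passwd suggests a UID, and set_gecos jdoe "John Doe (NIST)" /etc/passwd | diff /etc/passwd - shows exactly which line would change.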
- Adding a Samba account
Type the following command to add a Samba account (used to access directories on blaze or burn from a Windows PC):
smbpasswd -a user
To remove a user account, edit the /etc/passwd file and delete the user's password entry.
Update the passwd file by typing:
pwconv
passsync
Remove the user's files with:
cd /home
rm -r user_name
- Type:
ssh-keygen -t rsa
- cd into the directory .ssh and type:
cat id_rsa.pub >> authorized_keys2
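Repeating the cat command above adds a duplicate line each time. A minimal sketch of a version that is safe to rerun follows. authorize_key is a hypothetical helper name, and the chmod 600 is an assumption based on sshd's usual refusal to trust key files that other users can write.

```shell
#!/bin/sh
# authorize_key PUBKEY_FILE AUTH_FILE
# Append the public key to AUTH_FILE only if that exact line is not already there.
authorize_key() {
    pub=$1
    auth=$2
    # Create the key list if needed and keep it private to the owner.
    touch "$auth" && chmod 600 "$auth" || return 1
    # -x matches whole lines, -F treats the key as a fixed string.
    grep -qxF -- "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
}
```

Typical use would be: cd ~/.ssh && authorize_key id_rsa.pub authorized_keys2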
If a node becomes unresponsive (say, blaze001), type the following command as root:
placeholder for power reset command
- Log in to the blaze console as root. (placeholder for instructions to reboot cluster)
- Log in to the blaze console as root and perform the following steps. Log in directly as root. DO NOT LOG IN FROM YOUR USER ACCOUNT. (The POWER_OFF.sh script kills all user processes, so a user shell used to become root would be killed.)
- cd /usr/local/bin
- ./POWER_OFF.sh
- power_off
- Press the power button on the UPSs located at the bottom of each cabinet (for now just the blaze and burn cabinets)
- Turn off the four circuit breakers on the right side of the room (back of cabinets)
- Turn off the AC unit (this can be done at any time)
- Turn on the four circuit breakers on the right side of the room (back of cabinets). Wait a few minutes to give the network switches a chance to boot up (they are computers too!)
- Press the power button on the UPSs located at the bottom of each cabinet (for now just the blaze and burn cabinets)
- Log in to the blaze console as root and perform the following steps. Log in directly as root. DO NOT LOG IN FROM YOUR USER ACCOUNT. (The POWER_OFF.sh script kills all user processes, so a user shell used to become root would be killed.)
- cd /usr/local/bin
- ./POWER_ON.sh
- Turn on the AC unit (this can be done at any time)
On blaze, run the following script as root:
/usr/local/bin/RESET_CLOCK.sh
For example, to take blaze011 off of the default (batch) queue, perform the following steps:
- Edit /var/spool/torque/server_priv/nodes on blaze, and change the following line
node011 np=8 16g compute
to
node011 np=8 16g testqueue
- Create a new queue (mine is called testing_queue) using the following commands:
qmgr -c "create queue testing_queue"
qmgr -c "set queue testing_queue queue_type = Execution"
qmgr -c "set queue testing_queue resources_default.neednodes = testqueue"
qmgr -c "set queue testing_queue resources_default.nodes = 1"
qmgr -c "set queue testing_queue enabled = True"
qmgr -c "set queue testing_queue started = True"
- Restart pbs_server and maui
Important! You should restart the PBS server with the following commands (NOT pbs_server stop and NOT pbs_server restart), or you risk stopping jobs that are currently running:
qterm -t quick
/etc/init.d/pbs_server start
/etc/init.d/maui restart
Now blaze011 will no longer accept jobs submitted to the default queue (batch). To run on it, request testing_queue explicitly, for example:
qfds.sh -r -q testing_queue casename.fds
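The nodes-file edit and the queue creation above can be previewed before anything is changed. The following is a minimal dry-run sketch: retag_node and queue_commands are hypothetical helper names, both only print, and retag_node assumes the queue property is the last field on the node's line, as in the example above.

```shell
#!/bin/sh
# Hypothetical dry-run helpers; both only print, nothing is changed.

# retag_node NODE PROPERTY [FILE]: print the nodes file with the last field
# of NODE's line replaced by PROPERTY (assumes the property comes last).
retag_node() {
    awk -v node="$1" -v prop="$2" '
        $1 == node { $NF = prop }
        { print }
    ' "${3:-/var/spool/torque/server_priv/nodes}"
}

# queue_commands QUEUE PROPERTY: print the qmgr commands that create QUEUE
# and tie it to nodes carrying PROPERTY.
queue_commands() {
    q=$1
    prop=$2
    printf 'qmgr -c "create queue %s"\n' "$q"
    printf 'qmgr -c "set queue %s queue_type = Execution"\n' "$q"
    printf 'qmgr -c "set queue %s resources_default.neednodes = %s"\n' "$q" "$prop"
    printf 'qmgr -c "set queue %s resources_default.nodes = 1"\n' "$q"
    printf 'qmgr -c "set queue %s enabled = True"\n' "$q"
    printf 'qmgr -c "set queue %s started = True"\n' "$q"
}
```

Running queue_commands testing_queue testqueue prints the six commands for review. Once they look right they can be piped to sh as root.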
- Useful queuing commands
list available queues: qmgr -c 'p s'
delete a queue: qmgr -c 'delete queue fire60s'
make sure the following is used when setting up torque: qmgr -c "set server scheduling = True"
Note: only batch is available now.
- batch - blaze001->blaze029 - original blaze queue
- batch2 - blaze036->blaze71 - nodes in 2nd blaze cabinet
- batch3 - blaze072->blaze107 - nodes in 3rd blaze cabinet
- batch4 - blaze108->blaze119 - leftover nodes
- batch - burn001->burn036
A file named /etc/sysconfig/pbs_mom containing
#!/bin/bash
ulimit -s unlimited
was added to each compute node on the burn and blaze clusters. This ensures that an unlimited stack size is available on each node of an Open MPI job.
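One way to check that the setting took effect is to print the stack limit from inside a job. A minimal sketch, assuming a shell whose ulimit accepts -s (bash and dash do). report_stack_limit is a hypothetical helper name, and its optional argument exists only so the function can be tried without changing real limits.

```shell
#!/bin/sh
# report_stack_limit [VALUE]: describe the stack limit of the current shell.
# VALUE overrides the live "ulimit -s" reading (useful for testing).
report_stack_limit() {
    limit=${1:-$(ulimit -s)}
    if [ "$limit" = "unlimited" ]; then
        echo "stack limit on $(uname -n): unlimited"
    else
        echo "stack limit on $(uname -n): ${limit} kB"
    fi
}
```

Calling report_stack_limit near the top of a job script shows what the compute node actually applied.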
- restart gmond and gmetad on head node with
/etc/init.d/gmond restart
/etc/init.d/gmetad restart
- restart gmond on all nodes with (gmetad does not run on compute nodes)
/usr/local/bin/ganglia_restart.sh
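The contents of ganglia_restart.sh are not reproduced in these notes. As a rough sketch of what a per-node restart loop could look like, the following prints one ssh command per node without running anything. node_names and restart_gmond_commands are hypothetical helper names, and the three-digit node numbering is taken from names such as blaze001.

```shell
#!/bin/sh
# node_names PREFIX FIRST LAST: print zero-padded node names, one per line.
node_names() {
    i=$2
    while [ "$i" -le "$3" ]; do
        printf '%s%03d\n' "$1" "$i"
        i=$((i + 1))
    done
}

# restart_gmond_commands PREFIX FIRST LAST: print (do not run) one ssh
# command per node that would restart gmond there.
restart_gmond_commands() {
    node_names "$1" "$2" "$3" | while read -r node; do
        echo "ssh $node /etc/init.d/gmond restart"
    done
}
```

For example, restart_gmond_commands blaze 1 29 lists the commands for blaze001 through blaze029.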
## Fixing cluster problems
- Make a user's home directory readable
chmod 755 ~username
- Make all files in a directory tree readable (i.e., accessible to everyone)
cd to one level above the directory you wish to make readable and type the following command. If you don't own the directory, you'll need to be root.
chmod +r -R directory_name
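Read permission alone does not let other users enter subdirectories, since directories also need the execute (search) bit. A minimal sketch that sets both uses chmod's symbolic X mode, which adds execute only to directories and to files that are already executable. make_readable is a hypothetical helper name.

```shell
#!/bin/sh
# make_readable DIR: give everyone read access to DIR and everything below it.
# The capital X adds execute (search) permission only to directories and to
# files that are already executable by someone.
make_readable() {
    chmod -R a+rX -- "$1"
}
```

Usage is the same as above: cd one level up, then make_readable directory_name.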
- Samba is not working
See if the Samba daemon is running by typing:
ps -el | grep smb
in a command shell. If you don't see anything (or even if you do) type as root:
/etc/init.d/smb restart
to restart the daemon
- Checking disk usage
To see how much space is used by the directory named dir, type:
du -ks dir
To see how much space is used by all files/directories in the current directory, type:
du -ks *
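When a disk fills up, it usually helps to see the largest entries first. A minimal sketch that sorts the du output follows. largest_entries is a hypothetical helper name, and hidden (dot) files are skipped, as with the command above.

```shell
#!/bin/sh
# largest_entries [DIR] [N]: list the N largest entries in DIR, sizes in kB,
# biggest first. Defaults: current directory, 10 entries.
largest_entries() {
    # The subshell keeps the cd from affecting the caller.
    ( cd "${1:-.}" && du -ks -- * | sort -rn | head -n "${2:-10}" )
}
```

For example, largest_entries /home 5 lists the five biggest home directories (run as root so every directory can be read).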
Edit /etc/postmap/canonical by adding lines of the form:
[email protected] [email protected]
Then type the following commands (on firevis):
postmap /etc/init.d/canonical
/etc/init.d/postfix restart