# SCG Primer
Welcome to SCG! Here is some information that should help you use the SCG Informatics Cluster efficiently. If you are looking to make an account, see how to request an account.
If you would like to contribute information, please open a pull request!
## Table of Contents
- Other resources
- Login nodes
- Directories
- Compute partitions
- Software modules on SCG
- Installing software on SCG
- SCG OnDemand
- Data access
- Sharing lab data in Oak storage
- SLURM basics
- Miscellaneous tidbits
## Other resources
- SCG docs: https://login.scg.stanford.edu/
- SCG Slack: http://srcc.slack.com. If you use SCG, I highly recommend joining this Slack workspace. If you post a question or an error that a Google search didn’t help you answer, you are likely to get responses from SCG staff or other members of the SCG community. It’s also the place to post software installation requests (`#scg-software-requests` channel). More information about this workspace can be found here.
## Login nodes
You start on a login node every time you log in to SCG. Login nodes are the same machines as the `interactive` partition, but your processes there have limited resources (16GB of memory and a restricted number of processes) until you start a session in the `interactive` partition. Login nodes are meant for navigating directories, starting `screen`/`tmux` sessions, and other minor, non-intensive processes.
Because there are several login nodes, and `screen`/`tmux` sessions are only available on the node on which they were initialized, I highly recommend adding an alias to your LOCAL `~/.bashrc` or `~/.bash_profile` to set a persistent login node (it doesn’t matter which one). That way, you will always log into the same login node whenever you connect to SCG, and your `screen`/`tmux` sessions will always be where you expect them. For example:
```bash
# replace SUNETID with *your* SUNET ID
echo 'alias scg="ssh SUNETID@login04.scg.stanford.edu"' >> ~/.bash_profile
```
There are four login nodes, so running any ONE of the following commands (each appends an alias to `~/.bash_profile`) will serve the same purpose as the command above:
```bash
# replace SUNETID with *your* SUNET ID
# only use ONE of these commands
echo 'alias scg="ssh SUNETID@login01.scg.stanford.edu"' >> ~/.bash_profile
echo 'alias scg="ssh SUNETID@login02.scg.stanford.edu"' >> ~/.bash_profile
echo 'alias scg="ssh SUNETID@login03.scg.stanford.edu"' >> ~/.bash_profile
```
The first time you add this line to your `~/.bash*` file, you have to run `source ~/.bash_profile` for the alias to take effect in your current terminal. After that, the `scg` command will be recognized every time you open a new terminal. Then just run `scg` to log in to SCG.
Note: `~/.bash_profile` and `~/.bashrc` are only automatically `source`d if your shell is `bash`. If your shell is `zsh`, add this line to `~/.zprofile` instead. If you are not sure which shell you are using, run `echo $SHELL`.

Note: Sometimes a single login node is being rebooted, and if that is the node you’ve specified in your alias above, you will have trouble connecting to SCG using the `scg` command while others do not. In this case, just log in using the standard `ssh SUNETID@login.scg.stanford.edu`, which directs you to an available login node (though you will not see any `screen`/`tmux` sessions you started on the other node). When the node is done rebooting, your `scg` command will work again.
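Each run of the `echo ... >> ~/.bash_profile` commands above appends another copy of the alias line. If you want a setup step that is safe to re-run, here is a sketch of a hypothetical helper (not an SCG tool) that only appends the line when it is not already in the file; pass `~/.zprofile` as the file if your shell is `zsh`.

```bash
#!/bin/bash
# Hypothetical helper: append a line to a file only if that exact line is not
# already present, so re-running your setup does not duplicate the alias.
add_line_once() {
    local line="$1" file="$2"
    touch "$file"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage on your LOCAL machine (replace SUNETID with *your* SUNET ID):
# add_line_once 'alias scg="ssh SUNETID@login04.scg.stanford.edu"' ~/.bash_profile
# source ~/.bash_profile
```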
## Directories
### Home directory (`/home/SUNETID`, also called `~` or `$HOME`)
Home directory quota is fixed at 32GB. Just about the only thing that should be there is software that you install.
### Lab directory
This is either `/labs/[LAB]` or `/oak/stanford/groups/[LAB]`, depending on your PI. Most of your files should be in `${LAB_DIR}/SUNETID`, where `${LAB_DIR}` stands for your lab directory. You may have to create this directory yourself when you first join a group on SCG.
Check the lab quota with `lfs quota -hg scg_lab_[LAB] /oak` or `lfs quota -hg oak_[LAB] /oak` (for `/labs/[LAB]` and `/oak/stanford/groups/[LAB]`, respectively).
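Which group name to pass to `lfs quota` depends on which kind of lab directory you have. As a small sketch, a hypothetical helper can derive the group name from the path, assuming (as the two commands above suggest) that the group is the lab directory's name with a `scg_lab_` or `oak_` prefix:

```bash
#!/bin/bash
# Hypothetical helper: map a lab directory (or a path inside it) to the group
# name used by `lfs quota -hg <group> /oak`, following the two patterns above.
lab_quota_group() {
    local dir="${1%/}" rest   # drop a trailing slash, if any
    case "$dir" in
        /labs/?*)
            rest="${dir#/labs/}"
            echo "scg_lab_${rest%%/*}" ;;       # keep only the [LAB] component
        /oak/stanford/groups/?*)
            rest="${dir#/oak/stanford/groups/}"
            echo "oak_${rest%%/*}" ;;
        *)
            echo "unrecognized lab directory: $dir" >&2
            return 1 ;;
    esac
}

# Usage on SCG (replace [LAB] with your lab's name):
# lfs quota -hg "$(lab_quota_group /labs/[LAB])" /oak
```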
### Scratch space
Every node on SCG (both login and batch) has scratch space mounted at `/tmp`. This scratch space is available only on that node, and the files you create there are visible only to your user. You have to copy anything you want to keep back to Oak if you want to be able to access it from other nodes or allow other users to see it.
There is no quota in this scratch space, but storage is not unlimited. Files are automatically deleted when space starts to run out, oldest first. Do not use `/tmp` for long-term storage. There are two main reasons to use it: it is a good place for temporary files you don’t care about that may take up a lot of space, and because it is local, any operations you do on files in `/tmp` don’t have to go over the network like they do for Oak. This can lead to a big improvement for some disk-intensive operations like alignment and variant/peak calling.
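The pattern described above (work in `/tmp`, then copy results back to Oak) can be wrapped in a small function. This is a sketch: `run_in_scratch`, the Oak path, and `my_tool` in the usage line are hypothetical, and the function must run on the node where the work happens, since each node has its own `/tmp`.

```bash
#!/bin/bash
# Sketch: run an I/O-heavy command inside a private directory on this node's
# /tmp, then copy everything it produced to permanent storage (e.g. Oak).
# Usage: run_in_scratch OUT_DIR COMMAND [ARGS...]
run_in_scratch() {
    local out_dir="$1"; shift
    local scratch status=0
    # uniquely named directory on node-local scratch
    scratch=$(mktemp -d /tmp/scratch.XXXXXX) || return 1
    # run the command with the scratch directory as its working directory
    if ( cd "$scratch" && "$@" ); then
        # copy the outputs (including hidden files) back to permanent storage
        mkdir -p "$out_dir" && cp -r "$scratch"/. "$out_dir"/ || status=$?
    else
        status=$?
    fi
    # do not leave files behind: space in /tmp is shared by everyone on the node
    rm -rf "$scratch"
    return "$status"
}

# Example (my_tool and the Oak path are placeholders):
# run_in_scratch /oak/stanford/groups/[LAB]/SUNETID/results my_tool --out results.txt
```

If the command fails, nothing is copied and the scratch directory is still removed, so a failed run does not leave partial outputs on Oak.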
## Compute partitions
### FREE `interactive` partition
A total of 16 cores and 128GB of memory is available for all of your running `interactive` jobs combined, PER PERSON. You can split up those resources any way you would like. Jobs in the `interactive` partition are limited to 24 hours.
Launch a process in the `interactive` partition after logging in. There are two ways to do this:
#### Interactive session
1. Start a `screen`/`tmux` session (you should do this for any process that you expect to take more than a minute, in case your connection to SCG is interrupted).
2. Launch a job in the `interactive` partition:
   ```bash
   sdev -c 1 -m 20 -t 24:00:00
   ```
   - `sdev` is a shortcut for `srun`, which is a SLURM command
   - `-c`: number of cores
   - `-m`: memory in GB
   - `-t`: time (format `DD-HH:MM:SS`)
Once the resources are allocated, you essentially get `ssh`-ed into a new bash session with the requested resources. Then you can start running scripts (almost) just like you would locally on your laptop.
Note: Use `srun` instead of `sdev` to bill project (instead of lab) accounts, e.g. `srun -p batch -A prj_[PROJECT] -t 1:00:00 -c 2 --mem-per-cpu=8G --pty /bin/bash` (without a unit suffix, `--mem-per-cpu` is interpreted in megabytes).
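The `-t` value uses the `DD-HH:MM:SS` format, so `24:00:00` and `1-00:00:00` both mean 24 hours, and `1-06:00:00` means 30 hours. If you prefer to think in hours, a hypothetical helper like the one below can build the string; keep in mind that `interactive` jobs are capped at 24 hours regardless.

```bash
#!/bin/bash
# Hypothetical helper: convert a whole number of hours into SLURM's
# D-HH:MM:SS time format (e.g. 30 -> 1-06:00:00, 2 -> 02:00:00).
slurm_time() {
    local hours="$1"
    local days=$(( hours / 24 )) rem=$(( hours % 24 ))
    if [ "$days" -gt 0 ]; then
        printf '%d-%02d:00:00\n' "$days" "$rem"
    else
        printf '%02d:00:00\n' "$rem"
    fi
}

# Usage on SCG:
# sdev -c 1 -m 20 -t "$(slurm_time 12)"
```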
#### Submit a job to the `interactive` partition with `sbatch`
If you are running polished code or a standard pipeline that you don’t need to worry about debugging, you may prefer to submit a job to the `interactive` job queue using `sbatch`.
First, write an `sbatch` script, e.g. `test_sbatch.sh`:
```bash
#!/bin/bash
# See `man sbatch` or https://slurm.schedmd.com/sbatch.html for descriptions
# of sbatch options.
#SBATCH --job-name=test
#SBATCH --cpus-per-task=1
#SBATCH --partition=interactive
#SBATCH --account=default
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=5G
# by default, output is written to slurm-<JOBID>.out in the directory
# you submitted the job from

set -e  # stop at the first command that fails
module load miniconda/3
python some_python_script.py  # this is the process I want to run with the sbatch script
```
Then submit the job using `sbatch test_sbatch.sh`. It is critical that the `--partition` flag is set to `interactive` if you don’t want to get billed.
Note that `--cpus-per-task` sets how many CPUs your job gets, i.e. roughly the number of processes or threads it can run in parallel.
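Requesting CPUs does not by itself make your commands use them. SLURM exports the `--cpus-per-task` value to the job as the `SLURM_CPUS_PER_TASK` environment variable (only when that option is given), so a script can use it to cap how many processes run at once. A sketch, with a hypothetical helper name, that runs one command per input file with at most that many running in parallel:

```bash
#!/bin/bash
# Sketch: run COMMAND once per file, with at most as many processes at a time
# as CPUs requested via --cpus-per-task (exported by SLURM as
# SLURM_CPUS_PER_TASK); falls back to 1 outside a job.
parallel_per_file() {
    local cmd="$1"; shift
    local n="${SLURM_CPUS_PER_TASK:-1}"
    # -0: NUL-separated input, -n 1: one file per command, -P: max parallel processes
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$n" "$cmd"
}

# Example inside an sbatch script with --cpus-per-task=4:
# parallel_per_file gzip sample1.fastq sample2.fastq sample3.fastq sample4.fastq
```

Multithreaded tools usually take a thread-count option instead; passing `"${SLURM_CPUS_PER_TASK:-1}"` to that option follows the same idea.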
### BILLED `batch` partition
The `batch` partition is ideal for parallelization beyond 16 cores or for processes requiring more than 128GB of memory. Billing is based on the number of CPUs requested × actual run time; SCG does not charge for memory usage (but if you request too much memory, your job might never run).
Use the `labstats` command to see how much CPU time each lab user has racked up during the billed month.
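As a worked example of that billing rule: a job that requests 4 CPUs and runs for 1.5 hours is billed 4 × 1.5 = 6 CPU-hours, however busy those CPUs actually were. The hypothetical helper below does this arithmetic from a CPU count and an elapsed time in `[DD-]HH:MM:SS` form (the form `sacct` uses for its `Elapsed` field); converting CPU-hours into a dollar amount depends on SCG’s current rate and is not covered here.

```bash
#!/bin/bash
# Hypothetical helper: CPU-hours billed for a job = CPUs allocated x run time.
# Usage: billed_cpu_hours NCPUS ELAPSED   (ELAPSED as [DD-]HH:MM:SS)
billed_cpu_hours() {
    local ncpus="$1" elapsed="$2" days=0 h m s
    if [[ "$elapsed" == *-* ]]; then
        days="${elapsed%%-*}"      # part before the dash
        elapsed="${elapsed#*-}"    # remaining HH:MM:SS part
    fi
    IFS=: read -r h m s <<< "$elapsed"
    # 10# forces base 10 so values like "08" are not read as octal
    local seconds=$(( ((10#$days * 24 + 10#$h) * 60 + 10#$m) * 60 + 10#$s ))
    awk -v n="$ncpus" -v sec="$seconds" 'BEGIN { printf "%.2f\n", n * sec / 3600 }'
}

# Example: billed_cpu_hours 4 01:30:00   # 4 CPUs for 1.5 hours
```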
Max resource requests on the `batch` partition are 1TB of memory and 48 CPUs. Memory per CPU is node-specific.
You submit jobs to the `batch` partition the same way, with a couple of added flags:
#### Interactive session ON THE BILLED `batch` PARTITION
The only reason you would do this is if you are out of resources on the `interactive` partition or you want to run a job with >128GB of memory interactively (unlikely).
1. Start a `screen`/`tmux` session (you should do this for any process that you expect to take more than a minute, in case your connection to SCG is interrupted).
2. Launch a job in the `batch` partition:
   ```bash
   sdev -c 1 -m 20 -t 24:00:00 -a [LAB_ACCOUNT] -p batch
   ```
   - `sdev` is a shortcut for `srun`, which is a SLURM command
   - `-c`: number of cores
   - `-m`: memory in GB; bump this up if you get a core dump or out-of-memory error
   - `-t`: time (format `DD-HH:MM:SS`)
   - `-a`: account (not sure what this is? Run `scgwhoami` when logged into SCG and look under `Available SLURM Accounts`)
   - `-p`: partition, either `interactive` (default, free) or `batch` (billed; requires `-a`)
Note: Use `srun` instead of `sdev` to bill project (instead of lab) accounts, e.g. `srun -p batch -A prj_[PROJECT] -t 1:00:00 -c 2 --mem-per-cpu=8G --pty /bin/bash` (without a unit suffix, `--mem-per-cpu` is interpreted in megabytes).
#### Submit a job to the BILLED `batch` partition with `sbatch`
If you are running polished code or a standard pipeline that you don’t need to worry about debugging, you may prefer to submit a job to the `batch` job queue using `sbatch`.
First, write an `sbatch` script, e.g. `test_sbatch.sh`:
```bash
#!/bin/bash
# See `man sbatch` or https://slurm.schedmd.com/sbatch.html for descriptions
# of sbatch options.
#SBATCH --job-name=test
#SBATCH --cpus-per-task=1
#SBATCH --partition=batch
#SBATCH --account=[LAB_ACCOUNT]
#SBATCH --time=12:00:00
#SBATCH --mem-per-cpu=5G
# by default, output is written to slurm-<JOBID>.out in the directory
# you submitted the job from

set -e  # stop at the first command that fails
module load miniconda/3
python some_python_script.py  # this is the process I want to run with the sbatch script
```
Then submit the job using `sbatch test_sbatch.sh`.
Note that `--cpus-per-task` sets how many CPUs your job gets (and is billed for), i.e. roughly the number of processes or threads it can run in parallel.
### `nih_s10` NIH Supercomputer
This system has its own partition. You can run jobs with many CPUs and lots of memory, and it also has NVIDIA GPUs for CUDA-accelerated software (typically deep learning or molecular dynamics). While it is free to use, it is very busy, so jobs usually have a long wait time, and it frequently suffers from extended downtime due to hardware instability. You can read more about it here.
### Job arrays
If you need to run a large number of similar jobs, you can use a job array with `sbatch` to automate this: https://slurm.schedmd.com/job_array.html. Large job arrays are likely to be expensive, so make sure you know what you’re doing.
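Here is a minimal sketch of one common pattern, assuming a hypothetical text file `samples.txt` with one input per line. SLURM sets `SLURM_ARRAY_TASK_ID` for every task in the array, and each task uses it to pick its own line. `--array=1-10%2` requests tasks 1 to 10 but runs at most 2 at a time, and `%A`/`%a` in `--output` are the array job ID and task index. `process_sample.py` is a placeholder for your own command; run a single sample first and check its usage with `seff` or `sacct` before launching the whole array.

```bash
#!/bin/bash
#SBATCH --job-name=array_test
#SBATCH --array=1-10%2
#SBATCH --partition=batch
#SBATCH --account=[LAB_ACCOUNT]
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=5G
#SBATCH --time=2:00:00
#SBATCH --output=%x_%A_%a.out

# Print line number N (1-based) of a file with one input per line.
nth_line() {
    sed -n "${2}p" "$1"
}

# Only runs inside an array job, where SLURM sets SLURM_ARRAY_TASK_ID.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    set -e
    sample=$(nth_line samples.txt "$SLURM_ARRAY_TASK_ID")
    module load miniconda/3
    python process_sample.py "$sample"   # placeholder for your own command
fi
```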
## Software modules on SCG
`module load <module>` is your friend. If you can Google the bioinformatics tool, chances are SCG has already installed the software and made it available as a module. If you see `command not found`, try loading a module.
Some `module` commands:
- `module avail`: Get a (long) list of existing modules; use arrow keys to scroll; `q` to exit
- `module keyword <keyword>`: Search for modules containing a keyword; use arrow keys to scroll; `q` to exit
- `module unload <module>`: Unload a module
- `module purge`: Unload all modules (i.e. revert back to login state)
- Reproducibility note: Computational reproducibility is vital for scientific rigor and advancement, so it is very important to use the same software version across batches of data and to report the version of each bioinformatics tool in the manuscript. Make a habit of using `module load <module>/<version>` in your scripts.
How can you find files associated with a module after you load it?
Almost all modules add the directory containing the module’s programs/scripts to the `PATH` variable, so printing the `PATH` entries one per line will show you that entry:
```bash
echo "$PATH" | tr ':' '\n'
```
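To narrow that list down, you can filter it for the module’s name (assuming the module’s install path contains it), or ask bash which executable a command name resolves to with the builtin `command -v`. A sketch; the helper name is hypothetical, and `miniconda` is used only because that module appears elsewhere on this page:

```bash
#!/bin/bash
# Hypothetical helper: print only the PATH entries containing a fixed string
# (e.g. the directory that a freshly loaded module added).
path_entries_matching() {
    printf '%s\n' "$PATH" | tr ':' '\n' | grep -F -- "$1"
}

# Usage on SCG:
# module load miniconda/3
# path_entries_matching miniconda   # directories added by the module
# command -v python                 # full path of the python that will run
```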
A few of the most critical modules, for example:
- BEFORE trying to run `python`:
  ```bash
  # load ONE of these
  module load miniconda/2  # python2
  module load miniconda/3  # python3
  ```
- BEFORE trying to run `R` or `Rscript`:
  ```bash
  # load ONE of these
  module load r/3.6
  module load r/3.5
  ```
- To load outdated modules, like older versions of R, run `module load legacy` first; then, for example, `module load r/3.4`.
## Installing software on SCG
Always check if a module exists before installing software on SCG. If you’re really sure you need to install it, there are a few ways to do it:
- For `R` packages, see Installing R packages.
- For `python` packages:
  - Load the `python` module corresponding to the version in which you want to install the package (e.g. `module load miniconda/3`).
  - Use `pip install --user PACKAGE` to install a package locally. Note that you will need to load the same module before trying to import this package in the future (e.g. `module load miniconda/3`).
- For anything else: Install it in your home directory (`/home/[SUNETID]`, also called `$HOME`) or in a `SOFTWARE` subfolder in your lab directory. Your home directory is recommended because the file system where home directories live is much faster than other file systems like the one used by `/oak`. This is especially important for tools that do a lot of file I/O, like conda/mamba/micromamba, so you will see a dramatic improvement in performance if you install conda/mamba/micromamba in `$HOME` instead of `/oak`.
For particularly tricky installations, or for anything you think might be useful to others on SCG, add a software installation request to the `#software-install-requests` channel in SCG’s Slack workspace. I tend to install software myself and also add a request to the channel for anything that’s not already installed on SCG.
Options for more advanced users include Linuxbrew (https://docs.brew.sh/Homebrew-on-Linux), your own Miniconda installation (https://docs.conda.io/en/latest/miniconda.html), and `local::lib` for Perl packages (but really, try to avoid Perl if possible!).
## SCG OnDemand
I <3 SCG OnDemand. If off-campus, you must be connected to the Stanford VPN to access OnDemand.
- The `Files` tab lets you do file I/O in your home, lab, or project paths
- `Interactive Apps` lets you run RStudio, Jupyter Notebooks, and other tools interactively while using SCG file systems and compute resources
### Interactive RStudio
You can specify the following environment variables to set the default working directory for RStudio sessions: `RSTUDIO_DATA_HOME` or `RSTUDIO_CONFIG_HOME`.
### Installing R packages
It is not recommended to install packages within OnDemand’s RStudio. Trying to do so will often result in an error (see here and here and here).
Instead, log in to SCG in your terminal, load the appropriate version of R with `module load r/[version]`, start the R prompt on the command line by running `R`, and install the package in R. For example, to install `data.table` in R v4.0.3, do the following:
```
ssh SUNETID@login.scg.stanford.edu  # or use your alias
module load r/4.0.3
R
> install.packages("data.table")
> q()
exit
```
Then you will be able to load the package in your RStudio session running R v4.0.3 using `library(data.table)`. Note that if the library has already been loaded, you will have to restart R and load it again to get the newly installed version.
## Data access
### Mount SCG locally with `samba`
Mounting files locally means you can directly edit SCG files in a desktop text editor instead of using a command-line text editor on SCG. One way to do this is with `samba`. Find platform-specific instructions about how to mount SCG with `samba` here. To mount native SCG files (e.g. `/labs` or `/projects`), use `smb://samba.scg.stanford.edu/`. If off-campus, you must be connected to the Stanford VPN to use `samba`.
For Montgomery Lab users: to mount Montgomery Lab Oak storage (e.g. `/oak/stanford/groups/smontgom`), use `smb://oak-smb-smontgom.stanford.edu`.
## Sharing lab data in Oak storage
If you are the owner of the folder you want to share, you can change permissions for groups and individuals using `setfacl`. See details in the links below. If you are not the owner of the folder, send an email to scg-action@lists.stanford.edu with your request, CCing your PI.
### Sharing data only with specific members in your group/lab
See this thread: https://srcc.slack.com/archives/C8CNSTB88/p1591307072353500
### Sharing data with SCG users outside of your group/lab
See this thread: https://srcc.slack.com/archives/C8CNSTB88/p1594668112463400
## SLURM basics
SLURM is the job scheduler that SCG uses. See SCG documentation of SLURM basics here. Here are a few commands to know:
### `squeue`
`squeue -u SUNETID` lists the jobs currently running or pending under your name, in both the `interactive` and `batch` partitions. Use the `-p` flag to restrict the list to one partition, e.g. `squeue -u SUNETID -p batch`.
### `sacct`
`sacct -j JOBID`, for example, tells you the amount of resources used by a recent job.
This is particularly important if you are planning to run a large number of jobs on the BILLED `batch` partition: run one job first, see how many resources it required, and limit the resources you request for your batch of jobs accordingly. This is ESPECIALLY important for the number of CPUs requested, since SCG charges per CPU hour. If you request 2 CPUs but your process only uses 1, you still pay for those 2 CPUs as long as your job runs. SCG charges based on the time your job actually runs, NOT the total time requested (i.e. request as much time as you want, knowing that jobs with longer time requests may sit in the queue longer, but be thoughtful with your CPU requests).
For example:
```bash
sacct -u SUNETID -o JOBID,JobName,MaxRSS,nCPUs,CPUTime --starttime 03/20
```
Use the `-o` flag to choose which fields to display, e.g.:
- `MaxRSS`: Roughly the memory required (request somewhat more than this)
- `nCPUs`: Number of CPUs requested
- `CPUTime`: Time billed by SCG (number of CPUs × elapsed time)
Use the `--starttime` flag to change the time range for which job stats are shown. By default, `sacct` only shows stats for jobs from 12:00 AM that day onward. Extend the time window with this flag (format: `MM/DD[/YY]-HH:MM[:SS]`).
See other fields here: https://slurm.schedmd.com/sacct.html
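`MaxRSS` is reported with a unit suffix (e.g. `5242880K`). Here is a sketch of a hypothetical helper that turns it into a whole number of GB to request, with 20% headroom; the 20% is an arbitrary stand-in for "somewhat more", so adjust it as needed. The usage line picks the largest `MaxRSS` across a job’s steps with GNU `sort -h`.

```bash
#!/bin/bash
# Hypothetical helper: convert a MaxRSS value from sacct (e.g. 5242880K, 900M,
# 3.5G) into a whole number of GB to request, with headroom (default 1.2x,
# an assumption). A value without a suffix is treated as bytes.
suggest_mem_gb() {
    local rss="$1" factor="${2:-1.2}"
    awk -v rss="$rss" -v f="$factor" 'BEGIN {
        unit = toupper(substr(rss, length(rss), 1))
        val = substr(rss, 1, length(rss) - 1) + 0
        if (unit == "K") gb = val / 1024 / 1024
        else if (unit == "M") gb = val / 1024
        else if (unit == "G") gb = val
        else if (unit == "T") gb = val * 1024
        else gb = (rss + 0) / 1024 / 1024 / 1024
        need = gb * f
        # round up to the next whole GB, with a minimum of 1
        n = int(need); if (need > n) n++
        if (n < 1) n = 1
        print n
    }'
}

# Usage on SCG (JOBID is a finished job):
# suggest_mem_gb "$(sacct -j JOBID -n -P -o MaxRSS | sort -h | tail -n 1)"
```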
### `seff`
`seff JOBID` provides a nice summary of a completed job, e.g. resources used and state (completed, failed, out of memory, etc.). I now find this more useful than `sacct`.
### `scancel`
Kill a job. Use `scancel JOBID` for a single job or `scancel -u SUNETID` for ALL of your jobs.
### Other SLURM tips and tricks
- `echo $SLURM_JOB_ID` will tell you if you’re inside a job (yes, you might forget when you get lost in the layers of `screen` sessions and interactive jobs).
- If a running job looks “stuck”, i.e. it’s been actively running much longer than you would expect, `ssh` into the node it’s running on and run `htop` to see if it looks active. If there are processes running but with very low CPU usage, something may have gone wrong. Run `scontrol requeue JOBID` to kill the job and resubmit it to the queue.
- To attach to the node where one of your jobs is running (to see what processes are running, for example), run `srun --jobid JOBID --pty bash -l`. This is like SSHing into that node. `exit` will end that session but not the job itself. (Alternatively, you can just `ssh` into the node displayed by `squeue`.)
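Building on the `SLURM_JOB_ID` check above, one option is to have your prompt show it, so you can tell at a glance whether a shell is inside a job. A sketch for your `~/.bashrc` on SCG; the function name is hypothetical, and it uses the bash-specific `[[ ]]` test:

```bash
# Hypothetical ~/.bashrc snippet (on SCG): prefix the prompt with the SLURM
# job ID when this shell is running inside a job, e.g. "[job 123456] $ ".
with_job_prefix() {
    local ps1="$1"
    # leave the prompt alone outside a job, or if it already has a job prefix
    if [ -n "${SLURM_JOB_ID:-}" ] && [[ "$ps1" != "[job "* ]]; then
        printf '%s' "[job ${SLURM_JOB_ID}] ${ps1}"
    else
        printf '%s' "$ps1"
    fi
}
PS1="$(with_job_prefix "${PS1:-}")"
```

The prefix check keeps the prompt from growing if `~/.bashrc` is sourced more than once in the same shell.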
## Miscellaneous tidbits
- Because of file system latency on shared storage, I recommend adding the `--latency-wait` flag to your `snakemake` calls. This makes the pipeline wait up to the specified number of seconds for an expected output file to appear before aborting with an error.
- Related to the point above, if you get a `File doesn't exist` error when running a tool/pipeline in `/oak` or `/labs`, try using `/tmp` as your working directory instead, i.e. do all the heavy file I/O in `/tmp` and move the final outputs to more permanent storage when the process completes. Note that `/tmp` is node-specific, unlike the file storage you’re used to, so you will have to log back into the same node to retrieve files generated in `/tmp` if you log out of SCG before transferring results (see more about `/tmp` scratch space here). You can use `sacct` (e.g. its `NodeList` field) to figure out which node a job ran on (see SLURM basics).
- See this thread in SCG Slack if you would like to keep track of resources for an interactive job.
- See these instructions for how to mount SCG directories locally with Samba.
- Globus (https://www.globus.org) is another option for transferring files; it does not require 2-factor authentication! As of April 2020, Globus can also be used with some cloud storage services. See this announcement for details.
- rclone (https://rclone.org) makes it possible to transfer data from SCG to Box (PHI approved!), Google Drive, Dropbox, Google Cloud etc. See this tutorial for more info.
- SCG is NOT PHI-approved.