# GPU Cluster Usage Guide
This guide explains how to run GPU-accelerated jobs on the SCG cluster. It covers partition policies, GPU requests, example job scripts, and best practices.
## Available Hardware

### GPU Hardware Comparison

The table below lists the GPU hardware available in the SCG cluster, ordered from most to least powerful:
| GPU Model | Memory | Location | Count |
|---|---|---|---|
| Nvidia H200 | 141 GB | Dell XE9680 | 8 |
| Nvidia A100 | 40/80 GB | Batch node | 2 |
| Nvidia P100 | 16 GB | SGI UV300 | 4 |
| Login GPU | Small | Login nodes | 4 |
### GPU Partitions

Slurm partitions divide the SCG cluster into groups of machines with similar resource configurations. Every job must be submitted to a specific partition.
| GPU Model | Partition | Max Walltime | Max GPUs | Purpose |
|---|---|---|---|---|
| Nvidia H200 | `gpu_short` | 4 hours | 1 | Quick tests / debugging |
| Nvidia H200 | `gpu_normal` | 24 hours | 2 | Standard research runs |
| Nvidia H200 | `gpu_long` | 72 hours | 2 | Long model training or pipelines |
| Nvidia A100 | `batch` | 336 hours | 2 | Standard research runs |
| Nvidia P100 | `nih_s10` | 168 hours | 4 | GPU use combined with large memory or many CPUs |
| Login GPU | `interactive` | 24 hours | 1 | Testing and debugging whether jobs run at all |
Note: GPU partitions require Full Tier access, except for the `nih_s10` and `interactive` partitions.
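These limits can be confirmed on the cluster itself. A quick Slurm query (using `gpu_normal` as an example) prints a partition's time limit, node count, and GPU resources:

```bash
# Show the time limit, node count, and generic resources (GPUs) of a partition
sinfo -p gpu_normal -o "%P %l %D %G"
```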
## Requesting GPUs

### Required SBATCH Directives

Every GPU job must include:
```bash
#SBATCH --account=<account>
#SBATCH --partition=<partition_name>
#SBATCH --gres=gpu:<number>
#SBATCH --time=<walltime>
```
### Example 1: Single GPU (Debugging)
```bash
#!/bin/bash
#SBATCH --job-name=gpu_debug
#SBATCH --account=<account>
#SBATCH --partition=gpu_short
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=gpu_debug_%j.out

module load cuda/13.0

# Confirm the GPU is visible before running any work
nvidia-smi

python test_gpu_script.py
```
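To run this example, save it to a file and submit it with `sbatch` (the filename below is arbitrary):

```bash
sbatch gpu_debug.sbatch   # prints the job ID
squeue -u $USER           # check the job's state in the queue
```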
### Example 2: Multi-GPU Training (Single Node)
```bash
#!/bin/bash
#SBATCH --job-name=multi_gpu_train
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out

module load cuda/13.0

# Environment variables for multi-GPU runs
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=WARN

# Distributed training across both GPUs on the node
srun python -m torch.distributed.run \
    --nproc_per_node=2 \
    train.py
```
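If you prefer not to launch through `srun`, recent PyTorch releases ship the equivalent `torchrun` entry point; a single-node sketch of the same launch:

```bash
# Equivalent single-node launch via the torchrun wrapper
torchrun --standalone --nproc_per_node=2 train.py
```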
## Interactive GPU Sessions
Request an interactive session for development and debugging:
```bash
srun --account=<account> --partition=gpu_short --gres=gpu:1 --time=02:00:00 --pty bash
```
Inside the session:
```bash
module load cuda/13.0
nvidia-smi
python my_script.py
```
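If your code uses PyTorch, a one-liner (assuming PyTorch is installed in your environment) confirms the framework can see the allocated GPU:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```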
## Using Containers

Containers must be run with the `--nv` flag to enable GPU support:
```bash
# Pull a container from NVIDIA GPU Cloud (NGC)
singularity pull docker://nvcr.io/nvidia/pytorch:24.12-py3

# Run with GPU support
singularity exec --nv pytorch_24.12-py3.sif python my_script.py
```
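Cluster filesystems are not always visible inside a container by default. If your data lives under a lab directory, you may need an explicit bind mount (paths here are placeholders):

```bash
# Bind a lab directory into the container and run with GPU support
singularity exec --nv --bind /labs/<lab>:/labs/<lab> \
    pytorch_24.12-py3.sif python my_script.py
```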
## Monitoring GPU Usage

### During Job Execution
Find your job’s compute node and SSH to it:
```bash
# Find your job's node
squeue -u $USER -o "%i %N"

# SSH to that node
ssh <node_name>

# Real-time monitoring (updates every second)
watch -n 1 nvidia-smi
```
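As an alternative to `watch`, `nvidia-smi` has a built-in device-monitoring mode that prints one line of statistics per GPU per interval:

```bash
# Stream utilization (u) and memory (m) metrics, one sample per second
nvidia-smi dmon -s um
```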
### Automated Logging in Job Scripts
Add GPU monitoring to your job script:
```bash
# Start background monitoring (add BEFORE your main command)
nvidia-smi --query-gpu=timestamp,index,name,utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv -l 10 -f gpu_${SLURM_JOB_ID}.csv &
GPU_MON_PID=$!

# Your main work
python train.py

# Stop monitoring
kill $GPU_MON_PID 2>/dev/null || true
```
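If the main command can fail or the job can hit its walltime, the `kill` at the end may never run. A `trap` guarantees the monitor is stopped on any exit; a sketch of the same pattern:

```bash
# Start the monitor in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
    --format=csv -l 10 -f gpu_${SLURM_JOB_ID}.csv &
GPU_MON_PID=$!

# Stop the monitor whenever the script exits, for any reason
trap 'kill $GPU_MON_PID 2>/dev/null || true' EXIT

python train.py
```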
## Data Loading Best Practices

### Use Node-Local Storage for I/O-Intensive Jobs

On SCG, the per-job scratch directory is:
```
/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID
```
Copy datasets to this location for faster access:
```bash
#!/bin/bash
#SBATCH --job-name=multi_gpu_train
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out

LOCAL_DATA=/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID/data
mkdir -p "$LOCAL_DATA"

# Copy data to local SSD/NVMe
echo "Copying data to local scratch..."
rsync -a /labs/<lab>/my_dataset/ "$LOCAL_DATA/my_dataset/"
echo "Data copy complete"

# Run training pointing to local data
python train.py --data_dir "$LOCAL_DATA/my_dataset"

# Cleanup
rm -rf "$LOCAL_DATA"
```
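Node-local scratch is finite and shared with other jobs on the node, so it is worth checking sizes before staging a large dataset:

```bash
# Free space on the node-local filesystem
df -h /local/scratch

# Size of the dataset you are about to copy
du -sh /labs/<lab>/my_dataset
```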
### Recommended PyTorch DataLoader for Training
```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,            # Usually <= --cpus-per-task
    pin_memory=True,          # Faster CPU→GPU transfer
    persistent_workers=True,  # Keep workers alive between epochs
    prefetch_factor=2,        # Batches prefetched per worker
)
```
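Rather than hard-coding `num_workers`, you can derive it from the Slurm allocation so the DataLoader always matches `--cpus-per-task`. A sketch, assuming your training script accepts a `--num_workers` flag (hypothetical):

```bash
# Match DataLoader workers to the CPUs Slurm actually granted (default 8)
python train.py --num_workers "${SLURM_CPUS_PER_TASK:-8}"
```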
## Complete Training Pipeline Example

A production-ready job script that combines the practices above:
```bash
#!/bin/bash
#SBATCH --job-name=train_resnet
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out

# Load environment
module load cuda/13.0
source ~/myenv/bin/activate

# Environment variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=WARN

# Create directories
mkdir -p checkpoints

# Copy data to local storage for faster I/O
LOCAL_DATA=/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID/data
echo "Copying dataset to $LOCAL_DATA..."
mkdir -p "$LOCAL_DATA"
rsync -a /labs/<lab>/datasets/imagenet/ "$LOCAL_DATA/imagenet/"
echo "Data copy complete at $(date)"

# Start GPU monitoring in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
    --format=csv -l 30 -f gpu_${SLURM_JOB_ID}.csv &
GPU_MON_PID=$!

# Run distributed training
echo "Starting training at $(date)"
srun python -m torch.distributed.run \
    --nproc_per_node=2 \
    train.py \
    --data_dir "$LOCAL_DATA/imagenet" \
    --checkpoint_dir checkpoints \
    --epochs 90 \
    --batch_size 128
echo "Training completed at $(date)"

# Cleanup
kill $GPU_MON_PID 2>/dev/null || true
rm -rf /local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID
```
## Common Issues & Troubleshooting

### Job stuck in queue?

Check partition status and current usage:

```bash
sinfo -p gpu_normal
squeue -p gpu_normal
```
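Slurm can also report its estimated start time for a pending job (the estimate comes from backfill scheduling and may be empty):

```bash
# Scheduler's estimated start time for a pending job
squeue -j <jobid> --start
```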
### Out of Memory (OOM) errors?

- Reduce the batch size in your training script
- Check memory usage: `nvidia-smi` shows 141 GB available per H200
### GPU not detected in job?

Verify the allocation and module loading:

```bash
# Check GPU visibility
echo $CUDA_VISIBLE_DEVICES
nvidia-smi

# Verify the CUDA module is loaded
module list
```
### Poor GPU utilization (<50%)?

CPU bottleneck:

- Increase `--cpus-per-task` (8-16 CPUs per GPU recommended)

I/O bottleneck:

- Copy data to node-local storage first (see the Data Loading section)
- Increase the DataLoader `num_workers`
- Enable data prefetching
## Best Practices

### 1. Choose the Right Partition

- Use `gpu_short` only for quick tests (<4 hours)
- Use `gpu_normal` for standard training runs
- Use `gpu_long` for extended training (>24 hours)
- Use `batch` for longer jobs with moderate GPU performance
- Use `nih_s10` if your job also needs many CPUs and/or large amounts of memory
- Use `interactive` to test whether your GPU libraries work at all
### 2. Request Only What You Need
- Request only the GPUs you’ll actually use
- Jobs wait in queue until resources are available
- Requesting fewer GPUs = faster queue times
### 3. Verify GPU Access

Always include `nvidia-smi` in job scripts to verify the GPU allocation:

```bash
nvidia-smi
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
```
### 4. Balance CPUs with GPUs

- Typical ratio: 8-16 CPUs per GPU
- Single GPU: `--cpus-per-task=8`
- Two GPUs: `--cpus-per-task=16`
### 5. Prefer Shorter Walltimes

- Shorter jobs start sooner in the queue
- Request only the time you need
- Use checkpointing to resume if time runs out (see the sketch below)
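A minimal resume pattern, assuming your training script writes checkpoints to `checkpoints/` and accepts a `--resume` flag (hypothetical):

```bash
# Resume from the newest checkpoint if one exists
RESUME=""
LATEST=$(ls -t checkpoints/*.pt 2>/dev/null | head -n 1)
if [ -n "$LATEST" ]; then
    RESUME="--resume $LATEST"
fi
python train.py $RESUME
```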
### 6. Monitor Resource Usage

- Always log GPU utilization
- Review logs after jobs complete (see the `sacct` example below)
- Optimize based on actual usage patterns
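Slurm's accounting database records what a finished job actually used; `sacct` is a standard way to review it:

```bash
# Elapsed time, CPU time, peak memory, and final state of a finished job
sacct -j <jobid> --format=JobID,Elapsed,TotalCPU,MaxRSS,State
```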