GPU Cluster Usage Guide

This guide explains how to run GPU-accelerated jobs on the SCG cluster. It covers partition policies, GPU requests, example job scripts, and best practices.


Available Hardware

GPU Hardware Comparison

The table below lists the GPU hardware available in the SCG cluster, ordered from most to least powerful:

GPU Model     Memory     Location      Count
Nvidia H200   141 GB     Dell XE9680   8
Nvidia A100   40/80 GB   Batch node    2
Nvidia P100   16 GB      SGI UV300     4
Login GPU     Small      Login nodes   4

GPU Partitions

Slurm partitions divide the SCG cluster into groups of machines with similar resource configurations. Every job submitted to the cluster must be directed to a specific partition.

GPU Model     Partition     Max Walltime   Max GPUs   Purpose
Nvidia H200   gpu_short     4 hours        1          Quick tests / debugging
Nvidia H200   gpu_normal    24 hours       2          Standard research runs
Nvidia H200   gpu_long      72 hours       2          Long model training or pipelines
Nvidia A100   batch         336 hours      2          Standard research runs
Nvidia P100   nih_s10       168 hours      4          GPU use combined with large memory or many CPUs
Login GPU     interactive   24 hours       1          Basic testing and debugging

Note: Access to GPU partitions requires Full Tier access, except for partitions nih_s10 and interactive.
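
If you want to confirm these limits on the cluster itself, Slurm can report each partition's time limit and GPU resources directly (output columns: partition, time limit, GRES, node count, state):

sinfo -p gpu_short,gpu_normal,gpu_long,batch,nih_s10,interactive -o "%P %l %G %D %t"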


Requesting GPUs

Required SBATCH Directives

Every GPU job must include:

#SBATCH --account=<account>
#SBATCH --partition=<partition_name>
#SBATCH --gres=gpu:<number>
#SBATCH --time=<walltime>

Example 1: Single GPU (Debugging)

#!/bin/bash
#SBATCH --job-name=gpu_debug
#SBATCH --account=<account>
#SBATCH --partition=gpu_short
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=gpu_debug_%j.out

module load cuda/13.0
nvidia-smi
python test_gpu_script.py

Example 2: Multi-GPU Training (Single Node)

#!/bin/bash
#SBATCH --job-name=multi_gpu_train
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out

module load cuda/13.0

# Environment variables for multi-GPU
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=WARN

# Distributed training
srun python -m torch.distributed.run \
    --nproc_per_node=2 \
    train.py

Interactive GPU Sessions

Request an interactive session for development and debugging:

srun --account=<account> --partition=gpu_short --gres=gpu:1 --time=02:00:00 --pty bash

Inside the session:

module load cuda/13.0
nvidia-smi
python my_script.py
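
If your environment includes PyTorch, a one-liner confirms that the framework can see the allocated GPU (this assumes PyTorch is installed in the active environment):

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"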

Using Containers

To give a container access to the allocated GPUs, run it with the --nv flag:

# Pull a container from NVIDIA GPU Cloud
singularity pull docker://nvcr.io/nvidia/pytorch:24.12-py3

# Run with GPU support
singularity exec --nv pytorch_24.12-py3.sif python my_script.py
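
As a sanity check, the same kind of GPU probe can be run inside the container; the NGC PyTorch image ships with PyTorch preinstalled:

singularity exec --nv pytorch_24.12-py3.sif \
    python -c "import torch; print(torch.cuda.get_device_name(0))"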

Monitoring GPU Usage

During Job Execution

Find your job’s compute node and SSH to it:

# Find your job's node
squeue -u $USER -o "%i %N"

# SSH to that node
ssh <node_name>

# Real-time monitoring (updates every second)
watch -n 1 nvidia-smi
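
If SSH to compute nodes is restricted on your cluster, an alternative on recent Slurm versions is to attach a step to your running job's allocation:

# Run nvidia-smi inside the job's existing allocation
srun --jobid=<jobid> --overlap --pty nvidia-smi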

Automated Logging in Job Scripts

Add GPU monitoring to your job script:

# Start background monitoring (add BEFORE your main command)
nvidia-smi --query-gpu=timestamp,index,name,utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv -l 10 -f gpu_${SLURM_JOB_ID}.csv &
GPU_MON_PID=$!

# Your main work
python train.py

# Stop monitoring 
kill $GPU_MON_PID 2>/dev/null || true
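
After the job finishes, the CSV log can be reviewed directly, for example as an aligned table:

column -s, -t gpu_<jobid>.csv | less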

Data Loading Best Practices

Use Node-Local Storage for I/O-Intensive Jobs

On SCG, the per-job local scratch directory is:

/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID

Copy datasets to this location for faster access:

#!/bin/bash
#SBATCH --job-name=multi_gpu_train
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out

LOCAL_DATA=/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID/data
mkdir -p "$LOCAL_DATA"

# Copy data to local SSD/NVMe
echo "Copying data to local scratch..."
rsync -a /labs/<lab>/my_dataset/ "$LOCAL_DATA/my_dataset/"
echo "Data copy complete"

# Run training pointing to local data
python train.py --data_dir $LOCAL_DATA/my_dataset

# Cleanup
rm -rf $LOCAL_DATA
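
Tune PyTorch DataLoader Settings

If your job uses PyTorch, configure the DataLoader so data loading keeps pace with the GPUs. The settings below are a reasonable starting point; keep num_workers at or below the value of --cpus-per-task:
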
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,              # Usually <= --cpus-per-task
    pin_memory=True,            # Faster CPU→GPU transfer
    persistent_workers=True,    # Keep workers alive between epochs
    prefetch_factor=2           # Prefetch batches
)

Complete Training Pipeline Example

A production-ready job script that combines the practices above:

#!/bin/bash
#SBATCH --job-name=train_resnet
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out

# Load environment
module load cuda/13.0
source ~/myenv/bin/activate

# Environment variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=WARN

# Create directories
mkdir -p checkpoints

# Copy data to local storage for faster I/O
LOCAL_DATA=/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID/data
echo "Copying dataset to $LOCAL_DATA..."
mkdir -p $LOCAL_DATA
rsync -a /labs/<lab>/datasets/imagenet/ $LOCAL_DATA/imagenet
echo "Data copy complete at $(date)"

# Start GPU monitoring
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
    --format=csv -l 30 -f gpu_${SLURM_JOB_ID}.csv &
GPU_MON_PID=$!

# Run distributed training
echo "Starting training at $(date)"
srun python -m torch.distributed.run \
    --nproc_per_node=2 \
    train.py \
    --data_dir $LOCAL_DATA/imagenet \
    --checkpoint_dir checkpoints \
    --epochs 90 \
    --batch_size 128

echo "Training completed at $(date)"

# Cleanup
kill $GPU_MON_PID 2>/dev/null || true
rm -rf /local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID

Common Issues & Troubleshooting

Job stuck in queue?

Check partition status and current usage:

sinfo -p gpu_normal
squeue -p gpu_normal
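
You can also ask Slurm for your job's estimated start time (shown once the scheduler has computed one):

squeue -j <jobid> --start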

Out of Memory (OOM) errors?

  • Reduce the batch size in your training script
  • Check actual memory usage with nvidia-smi; each H200 GPU has 141 GB of memory

GPU not detected in job?

Verify allocation and module loading:

# Check GPU visibility
echo $CUDA_VISIBLE_DEVICES
nvidia-smi

# Verify CUDA module loaded
module list

Poor GPU utilization (<50%)?

CPU bottleneck:

  • Increase --cpus-per-task (recommend 8-16 CPUs per GPU)

I/O bottleneck:

  • Copy data to local node storage first (see Data Loading section)
  • Increase DataLoader num_workers
  • Enable data prefetching

Best Practices

1. Choose the Right Partition

  • Use gpu_short only for quick tests (<4 hours)
  • Use gpu_normal for standard training runs
  • Use gpu_long for extended training (>24 hours)
  • Use batch for longer jobs where mid-range (A100) GPU performance is sufficient
  • Use nih_s10 if your job needs large amounts of CPUs and/or memory
  • Use interactive to quickly verify that your GPU software stack works at all

2. Request Only What You Need

  • Request only the GPUs you’ll actually use
  • Jobs wait in queue until resources are available
  • Requesting fewer GPUs = faster queue times

3. Verify GPU Access

Always include nvidia-smi in job scripts to verify GPU allocation:

nvidia-smi
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"

4. Balance CPUs with GPUs

  • Typical ratio: 8-16 CPUs per GPU
  • Single GPU: --cpus-per-task=8
  • Two GPUs: --cpus-per-task=16

5. Prefer Shorter Walltimes

  • Shorter jobs start sooner in the queue
  • Request only the time you need
  • Use checkpointing to resume if time runs out (see the sketch after this list)
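
A common checkpointing pattern is to have Slurm signal the job shortly before the walltime limit so it can save a final checkpoint. A minimal sketch; the --stop_file flag and STOP_REQUESTED marker file are hypothetical, so adapt them to however your training code saves and resumes state:

# In the script header: send SIGUSR1 to the batch shell 300 s before the limit
#SBATCH --signal=B:USR1@300

# When the signal arrives, drop a marker file that the training code polls
trap 'touch STOP_REQUESTED' USR1

# Run training in the background so the shell can handle the signal
python train.py --stop_file STOP_REQUESTED &
TRAIN_PID=$!
wait $TRAIN_PID   # returns early if the trap fires
wait $TRAIN_PID   # wait again so training can finish writing its checkpoint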

6. Monitor Resource Usage

  • Always log GPU utilization
  • Review logs and accounting data after jobs complete (see the sacct example below)
  • Optimize based on actual usage patterns
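
Slurm's accounting tools make the post-job review straightforward; sacct reports elapsed time, CPU time, peak memory, and allocated resources for a finished job:

sacct -j <jobid> --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,AllocTRES%40,State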