GPU Cluster Usage Guide

This guide explains how to run GPU-accelerated jobs on our HPC cluster with NVIDIA H200 GPUs. It covers partition policies, GPU requests, example job scripts, and best practices.


Available Hardware

Dell PowerEdge XE9680

  • CPU: 64 cores
  • Memory: 2.2 TB RAM
  • GPUs: 8 × NVIDIA H200 (141 GB HBM3 each)
  • Partitions: gpu_short, gpu_normal, gpu_long

Additional GPU Resources:

  • Login nodes: Small GPU for compiling and testing applications
  • UV300: 4 × NVIDIA Tesla P100 GPUs (partition: nih_s10)
  • Batch partition node: 2 × NVIDIA A100 GPUs (partition: batch)

GPU Hardware Comparison

GPU Model   Memory     Location      Max GPUs/Job
H200        141 GB     XE9680        2
A100        40/80 GB   Batch node    2
P100        16 GB      UV300         4
Login GPU   Small      Login nodes   Testing only

GPU Partitions

Partition    Max Walltime   Max GPUs   Purpose
gpu_short    4 hours        1          Quick tests / debugging
gpu_normal   24 hours       2          Standard research runs
gpu_long     72 hours       2          Long model training or pipelines

Note: Access to GPU partitions requires access to the batch partition.


Requesting GPUs

Required SBATCH Directives

Every GPU job must include:

#SBATCH --account=<account>
#SBATCH --partition=<partition_name>
#SBATCH --gres=gpu:<number>
#SBATCH --time=<walltime>
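
When you submit many related jobs (e.g. a parameter sweep), it can help to generate this header programmatically rather than editing scripts by hand. A minimal Python sketch; the account and partition values shown are placeholders, not real cluster names:

```python
# Sketch: generate the required SBATCH header for a GPU job.
# The account/partition arguments are placeholders, not real names.

def sbatch_header(account: str, partition: str, gpus: int, walltime: str) -> str:
    """Return the four required SBATCH directives as a script header."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --time={walltime}",
    ])

print(sbatch_header("mylab", "gpu_short", 1, "04:00:00"))
```

Write the result to a file and submit it with sbatch, or pipe it to `sbatch` directly.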

Partition Selection

For H200 GPUs (Dell XE9680):

  • gpu_short - Quick tests (4 hours, 1 GPU max)
  • gpu_normal - Standard runs (24 hours, 2 GPUs max)
  • gpu_long - Long training (72 hours, 2 GPUs max)

For P100 GPUs (UV300):

  • nih_s10 - Up to 4 GPUs

For A100 GPUs:

  • batch - Up to 2 GPUs
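
The H200 partition limits can also be encoded in a small helper that picks the most restrictive partition that fits a request, which keeps queue times down. A sketch based only on the limits listed in this guide:

```python
# Sketch: pick the most restrictive H200 partition that satisfies a
# request, using the walltime/GPU limits from the partition table.

H200_PARTITIONS = [
    # (name, max_hours, max_gpus) — ordered most to least restrictive
    ("gpu_short", 4, 1),
    ("gpu_normal", 24, 2),
    ("gpu_long", 72, 2),
]

def pick_partition(hours: float, gpus: int) -> str:
    for name, max_hours, max_gpus in H200_PARTITIONS:
        if hours <= max_hours and gpus <= max_gpus:
            return name
    raise ValueError("Request exceeds all H200 partition limits")

print(pick_partition(2, 1))   # a quick debug run
print(pick_partition(30, 2))  # a long training run
```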


Example 1: Single GPU (Debugging)

#!/bin/bash
#SBATCH --job-name=gpu_debug
#SBATCH --account=<account>
#SBATCH --partition=gpu_short
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=gpu_debug_%j.out

module load cuda/13.0
nvidia-smi
python test_gpu_script.py

Example 2: Multi-GPU Training (Single Node)

#!/bin/bash
#SBATCH --job-name=multi_gpu_train
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out

module load cuda/13.0

# Environment variables for multi-GPU
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=WARN

# Distributed training
srun python -m torch.distributed.run \
    --nproc_per_node=2 \
    train.py

Interactive GPU Sessions

Request an interactive session for development and debugging:

srun --account=<account> --partition=gpu_short --gres=gpu:1 --time=02:00:00 --pty bash

Inside the session:

module load cuda/13.0
nvidia-smi
python my_script.py

Using Containers

Singularity containers must be run with the --nv flag so the allocated GPUs are visible inside the container:

# Pull a container from NVIDIA GPU Cloud
singularity pull docker://nvcr.io/nvidia/pytorch:24.12-py3

# Run with GPU support
singularity exec --nv pytorch_24.12-py3.sif python my_script.py

Monitoring GPU Usage

During Job Execution

Find your job’s compute node and SSH to it:

# Find your job's node
squeue -u $USER -o "%i %N"

# SSH to that node
ssh <node_name>

# Real-time monitoring (updates every second)
watch -n 1 nvidia-smi

Automated Logging in Job Scripts

Add GPU monitoring to your job script:

# Start background monitoring (add BEFORE your main command)
nvidia-smi --query-gpu=timestamp,index,name,utilization.gpu,utilization.memory,memory.used,memory.total \
    --format=csv -l 10 -f gpu_${SLURM_JOB_ID}.csv &
GPU_MON_PID=$!

# Your main work
python train.py

# Stop monitoring 
kill $GPU_MON_PID 2>/dev/null || true
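
After the job finishes, the resulting CSV log can be summarized offline. A sketch using only the standard library; it assumes the query fields used above and nvidia-smi's default "value unit" CSV formatting (e.g. "87 %"):

```python
# Sketch: summarize a gpu_<jobid>.csv log written by the nvidia-smi
# command above. Assumes the default "value unit" CSV formatting.
import csv

def mean_gpu_utilization(path: str) -> float:
    """Average utilization.gpu (%) across all samples and GPUs."""
    samples = []
    with open(path, newline="") as f:
        reader = csv.reader(f, skipinitialspace=True)
        next(reader)  # skip the header row
        for row in reader:
            # utilization.gpu is the 4th column, e.g. "87 %"
            samples.append(float(row[3].rstrip(" %")))
    return sum(samples) / len(samples) if samples else 0.0
```

A sustained average well below 50% usually points to a CPU or I/O bottleneck (see Troubleshooting below).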

Data Loading Best Practices

Use Node-Local Storage for I/O-Intensive Jobs

On SCG, the per-job local scratch directory is:

/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID

Copy datasets to this location for faster access:

#!/bin/bash
#SBATCH --job-name=multi_gpu_train
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out

LOCAL_DATA=/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID/data
mkdir -p "$LOCAL_DATA"

# Copy data to local SSD/NVMe
echo "Copying data to local scratch..."
rsync -a /labs/<lab>/my_dataset/ "$LOCAL_DATA/my_dataset/"
echo "Data copy complete"

# Run training pointing to local data
python train.py --data_dir $LOCAL_DATA/my_dataset

# Cleanup
rm -rf $LOCAL_DATA

Tune DataLoader Settings

Configure the PyTorch DataLoader so the CPU side keeps the GPUs fed:

from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,              # Usually <= --cpus-per-task
    pin_memory=True,            # Faster CPU→GPU transfer
    persistent_workers=True,    # Keep workers alive between epochs
    prefetch_factor=2           # Prefetch batches
)
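
Rather than hard-coding num_workers, you can derive it from the CPUs Slurm actually allocated to the task. A sketch; reserving two cores for the main process is a heuristic assumption, not a cluster rule:

```python
# Sketch: derive DataLoader num_workers from the Slurm allocation.
# Reserving 2 cores for the main process is a heuristic assumption.
import os

def suggested_num_workers(reserved: int = 2) -> int:
    cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    return max(1, cpus - reserved)
```

With `--cpus-per-task=16` this yields 14 workers, keeping the total worker count within the allocation.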

Complete Training Pipeline Example

Production-ready job script with all best practices:

#!/bin/bash
#SBATCH --job-name=train_resnet
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out

# Load environment
module load cuda/13.0
source ~/myenv/bin/activate

# Environment variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=WARN

# Create directories
mkdir -p checkpoints

# Copy data to local storage for faster I/O
LOCAL_DATA=/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID/data
echo "Copying dataset to $LOCAL_DATA..."
mkdir -p $LOCAL_DATA
rsync -a /labs/<lab>/datasets/imagenet/ $LOCAL_DATA/imagenet
echo "Data copy complete at $(date)"

# Start GPU monitoring
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
    --format=csv -l 30 -f gpu_${SLURM_JOB_ID}.csv &
GPU_MON_PID=$!

# Run distributed training
echo "Starting training at $(date)"
srun python -m torch.distributed.run \
    --nproc_per_node=2 \
    train.py \
    --data_dir $LOCAL_DATA/imagenet \
    --checkpoint_dir checkpoints \
    --epochs 90 \
    --batch_size 128

echo "Training completed at $(date)"

# Cleanup
kill $GPU_MON_PID 2>/dev/null || true
rm -rf /local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID

Common Issues & Troubleshooting

Job stuck in queue?

Check partition status and current usage:

sinfo -p gpu_normal
squeue -p gpu_normal

Out of Memory (OOM) errors?

  • Reduce the batch size in your training script
  • Check memory usage with nvidia-smi; each H200 has 141 GB available

GPU not detected in job?

Verify allocation and module loading:

# Check GPU visibility
echo $CUDA_VISIBLE_DEVICES
nvidia-smi

# Verify CUDA module loaded
module list

Poor GPU utilization (<50%)?

CPU bottleneck:

  • Increase --cpus-per-task (recommend 8-16 CPUs per GPU)

I/O bottleneck:

  • Copy data to local node storage first (see Data Loading section)
  • Increase DataLoader num_workers
  • Enable data prefetching


Best Practices

1. Choose the Right Partition

  • Use gpu_short only for quick tests (<4 hours)
  • Use gpu_normal for standard training runs
  • Use gpu_long for extended training (>24 hours)

2. Request Only What You Need

  • Request only the GPUs you’ll actually use
  • Jobs wait in queue until resources are available
  • Requesting fewer GPUs = faster queue times

3. Verify GPU Access

Always include nvidia-smi in job scripts to verify GPU allocation:

nvidia-smi
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
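
The same check can be done from Python before any framework is imported; a minimal sketch that only reads the environment (note that CUDA_VISIBLE_DEVICES may also contain GPU UUIDs rather than indices):

```python
# Sketch: sanity-check which GPUs Slurm exposed to this job by parsing
# CUDA_VISIBLE_DEVICES. No CUDA libraries required. Non-numeric entries
# (GPU UUIDs) are skipped by this simple version.
import os

def visible_gpu_ids() -> list:
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(x) for x in raw.split(",") if x.strip().isdigit()]
```

Failing fast here (e.g. `assert visible_gpu_ids(), "no GPUs allocated"`) is cheaper than discovering a missing --gres flag an hour into a run.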

4. Balance CPUs with GPUs

  • Typical ratio: 8-16 CPUs per GPU
  • Single GPU: --cpus-per-task=8
  • Two GPUs: --cpus-per-task=16

5. Prefer Shorter Walltimes

  • Shorter jobs start sooner in the queue
  • Request only the time you need
  • Use checkpointing to resume if time runs out
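
The checkpointing bullet can be sketched with the standard library alone; a real training run would save framework state (model and optimizer) instead of a plain dict, but the atomic-write and resume pattern is the same:

```python
# Sketch: atomic save/resume of training state so a job can restart
# after hitting its walltime. Standard library only; real code would
# store framework state (model/optimizer) instead of a plain dict.
import json
import os

def save_checkpoint(state: dict, path: str) -> None:
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename: no partial checkpoint on kill

def load_checkpoint(path: str, default: dict) -> dict:
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return default
```

On resume, the script loads the last epoch and continues from there, so a chain of shorter jobs replaces one long one.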

6. Monitor Resource Usage

  • Always log GPU utilization
  • Review logs after jobs complete
  • Optimize based on actual usage patterns