GPU Cluster Usage Guide¶
This guide explains how to run GPU-accelerated jobs on our HPC cluster with NVIDIA H200 GPUs. It covers partition policies, GPU requests, example job scripts, and best practices.
Available Hardware¶
Dell PowerEdge XE9680
- CPU: 64 cores
- Memory: 2.2 TB RAM
- GPUs: 8 × NVIDIA H200 (141 GB HBM3 each)
- Partitions: gpu_short, gpu_normal, gpu_long
Additional GPU Resources:
- Login nodes: Small GPU for compiling and testing applications
- UV300: 4 × NVIDIA Tesla P100 GPUs (partition: nih_s10)
- Batch partition node: 2 × NVIDIA Tesla A100 GPUs (partition: batch)
GPU Hardware Comparison¶
| GPU Model | Memory | Location | Max GPUs/Job |
|---|---|---|---|
| H200 | 141 GB | XE9680 | 2 |
| A100 | 40/80 GB | Batch node | 2 |
| P100 | 16 GB | UV300 | 4 |
| Login GPU | Small | Login nodes | Testing only |
GPU Partitions¶
| Partition | Max Walltime | Max GPUs | Purpose |
|---|---|---|---|
| gpu_short | 4 hours | 1 | Quick tests / debugging |
| gpu_normal | 24 hours | 2 | Standard research runs |
| gpu_long | 72 hours | 2 | Long model training or pipelines |
Note: Access to GPU partitions requires access to the batch partition.
Requesting GPUs¶
Required SBATCH Directives¶
Every GPU job must include:
#SBATCH --account=<account>
#SBATCH --partition=<partition_name>
#SBATCH --gres=gpu:<number>
#SBATCH --time=<walltime>
Partition Selection¶
For H200 GPUs (Dell XE9680):
- gpu_short - Quick tests (4 hours, 1 GPU max)
- gpu_normal - Standard runs (24 hours, 2 GPUs max)
- gpu_long - Long training (72 hours, 2 GPUs max)
For P100 GPUs (UV300):
- nih_s10 - Up to 4 GPUs
For A100 GPUs:
- batch - Up to 2 GPUs
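The partition limits above can be expressed as a small lookup helper. This is an illustrative sketch, not part of the cluster tooling; the limits are taken from the tables in this guide, and the walltime caps for nih_s10 and batch are not stated here, so they are left unset.

```python
# Illustrative helper (not provided by the cluster): choose the smallest H200
# partition that fits a request, using the limits documented in this guide.
PARTITION_LIMITS = {
    "gpu_short":  {"max_hours": 4,  "max_gpus": 1},    # H200, quick tests
    "gpu_normal": {"max_hours": 24, "max_gpus": 2},    # H200, standard runs
    "gpu_long":   {"max_hours": 72, "max_gpus": 2},    # H200, long training
    "nih_s10":    {"max_hours": None, "max_gpus": 4},  # P100 on UV300 (walltime not stated)
    "batch":      {"max_hours": None, "max_gpus": 2},  # A100 (walltime not stated)
}

def pick_h200_partition(hours: int, gpus: int) -> str:
    """Return the smallest H200 partition that fits the request."""
    for name in ("gpu_short", "gpu_normal", "gpu_long"):
        lim = PARTITION_LIMITS[name]
        if hours <= lim["max_hours"] and gpus <= lim["max_gpus"]:
            return name
    raise ValueError("Request exceeds all H200 partition limits")
```

For example, a 10-hour, 2-GPU request lands in gpu_normal; anything over 72 hours or over 2 GPUs cannot be satisfied on the XE9680.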
Example 1: Single GPU (Debugging)¶
#!/bin/bash
#SBATCH --job-name=gpu_debug
#SBATCH --account=<account>
#SBATCH --partition=gpu_short
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=gpu_debug_%j.out
module load cuda/13.0
nvidia-smi
python test_gpu_script.py
Example 2: Multi-GPU Training (Single Node)¶
#!/bin/bash
#SBATCH --job-name=multi_gpu_train
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out
module load cuda/13.0
# Environment variables for multi-GPU
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=WARN
# Distributed training
srun python -m torch.distributed.run \
--nproc_per_node=2 \
train.py
Interactive GPU Sessions¶
Request an interactive session for development and debugging:
srun --account=<account> --partition=gpu_short --gres=gpu:1 --time=02:00:00 --pty bash
Inside the session:
module load cuda/13.0
nvidia-smi
python my_script.py
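For the last step, a minimal check script is often useful before launching real work. The sketch below (the filename quick_gpu_check.py is illustrative) only assumes that Slurm exports CUDA_VISIBLE_DEVICES inside the allocation; it degrades gracefully if PyTorch is not installed.

```python
# quick_gpu_check.py -- minimal sketch to confirm the allocated GPU is
# visible inside an interactive session. The script name is illustrative.
import os

def gpu_visibility_report() -> str:
    """Summarize what this process can see of the allocated GPUs."""
    # Slurm sets CUDA_VISIBLE_DEVICES to the GPU indices allocated to the job.
    lines = ["CUDA_VISIBLE_DEVICES: "
             + os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>")]
    try:
        import torch  # only available if PyTorch is installed in the env
        lines.append(f"PyTorch sees {torch.cuda.device_count()} GPU(s)")
    except ImportError:
        lines.append("PyTorch is not installed in this environment")
    return "\n".join(lines)

if __name__ == "__main__":
    print(gpu_visibility_report())
```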
Using Containers¶
Containers must be executed with GPU support using the --nv flag:
# Pull a container from NVIDIA GPU Cloud
singularity pull docker://nvcr.io/nvidia/pytorch:24.12-py3
# Run with GPU support
singularity exec --nv pytorch_24.12-py3.sif python my_script.py
Monitoring GPU Usage¶
During Job Execution¶
Find your job’s compute node and SSH to it:
# Find your job's node
squeue -u $USER -o "%i %N"
# SSH to that node
ssh <node_name>
# Real-time monitoring (updates every second)
watch -n 1 nvidia-smi
Automated Logging in Job Scripts¶
Add GPU monitoring to your job script:
# Start background monitoring (add BEFORE your main command)
nvidia-smi --query-gpu=timestamp,index,name,utilization.gpu,utilization.memory,memory.used,memory.total \
--format=csv -l 10 -f gpu_${SLURM_JOB_ID}.csv &
GPU_MON_PID=$!
# Your main work
python train.py
# Stop monitoring
kill $GPU_MON_PID 2>/dev/null || true
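After the job finishes, the gpu_<jobid>.csv file written above can be summarized offline. This sketch assumes nvidia-smi's default CSV layout: a header row naming each queried field, then rows whose utilization values look like "85 %".

```python
# Summarize a gpu_<jobid>.csv log produced by the nvidia-smi loop above.
# Assumes nvidia-smi's default CSV output: a header row, then data rows
# whose utilization fields look like "85 %".
import csv
from statistics import mean

def mean_gpu_utilization(csv_text: str) -> float:
    """Average utilization.gpu [%] across all samples and GPUs."""
    rows = list(csv.reader(csv_text.strip().splitlines()))
    header = [col.strip() for col in rows[0]]
    util_col = header.index("utilization.gpu [%]")
    # Strip the trailing " %" that nvidia-smi appends to each percentage.
    samples = [float(row[util_col].strip().rstrip(" %")) for row in rows[1:]]
    return mean(samples)
```

A consistently low average here usually points at the CPU or I/O bottlenecks discussed in the troubleshooting section.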
Data Loading Best Practices¶
Use Node-Local Storage for I/O-Intensive Jobs¶
On SCG, per-job scratch is:
/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID
Copy datasets to this location for faster access:
#!/bin/bash
#SBATCH --job-name=multi_gpu_train
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out
LOCAL_DATA=/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID/data
mkdir -p "$LOCAL_DATA"
# Copy data to local SSD/NVMe
echo "Copying data to local scratch..."
rsync -a /labs/<lab>/my_dataset/ "$LOCAL_DATA/my_dataset/"
echo "Data copy complete"
# Run training pointing to local data
python train.py --data_dir $LOCAL_DATA/my_dataset
# Cleanup
rm -rf $LOCAL_DATA
Recommended PyTorch DataLoader for Training¶
from torch.utils.data import DataLoader
train_loader = DataLoader(
dataset,
batch_size=32,
num_workers=8, # Usually <= --cpus-per-task
pin_memory=True, # Faster CPU→GPU transfer
persistent_workers=True, # Keep workers alive between epochs
prefetch_factor=2 # Prefetch batches
)
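To keep num_workers in step with your --cpus-per-task request instead of hard-coding it, you can read Slurm's SLURM_CPUS_PER_TASK environment variable. A sketch (the reserve parameter, for CPUs kept free for the main process, is an illustrative choice):

```python
# Derive the DataLoader worker count from the job's actual Slurm allocation.
# SLURM_CPUS_PER_TASK is set by Slurm inside the job environment.
import os

def dataloader_workers(reserve: int = 1) -> int:
    """CPUs allocated to the task, minus a reserve for the main process."""
    cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
    return max(1, cpus - reserve)
```

With --cpus-per-task=16 this yields 15 workers by default, so the same script adapts when you resubmit with a different CPU request.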
Complete Training Pipeline Example¶
Production-ready job script with all best practices:
#!/bin/bash
#SBATCH --job-name=train_resnet
#SBATCH --account=<account>
#SBATCH --partition=gpu_normal
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=1-00:00:00
#SBATCH --output=train_%j.out
# Load environment
module load cuda/13.0
source ~/myenv/bin/activate
# Environment variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=WARN
# Create directories
mkdir -p checkpoints
# Copy data to local storage for faster I/O
LOCAL_DATA=/local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID/data
echo "Copying dataset to $LOCAL_DATA..."
mkdir -p $LOCAL_DATA
rsync -a /labs/<lab>/datasets/imagenet/ $LOCAL_DATA/imagenet
echo "Data copy complete at $(date)"
# Start GPU monitoring
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
--format=csv -l 30 -f gpu_${SLURM_JOB_ID}.csv &
GPU_MON_PID=$!
# Run distributed training
echo "Starting training at $(date)"
srun python -m torch.distributed.run \
--nproc_per_node=2 \
train.py \
--data_dir $LOCAL_DATA/imagenet \
--checkpoint_dir checkpoints \
--epochs 90 \
--batch_size 128
echo "Training completed at $(date)"
# Cleanup
kill $GPU_MON_PID 2>/dev/null || true
rm -rf /local/scratch/$SLURM_JOB_USER/slrmtmp.$SLURM_JOBID
Common Issues & Troubleshooting¶
Job stuck in queue?¶
Check partition status and current usage:
sinfo -p gpu_normal
squeue -p gpu_normal
Out of Memory (OOM) errors?¶
- Reduce batch size in your training script
- Check memory usage: nvidia-smi shows up to 141 GB available per H200
GPU not detected in job?¶
Verify allocation and module loading:
# Check GPU visibility
echo $CUDA_VISIBLE_DEVICES
nvidia-smi
# Verify CUDA module loaded
module list
Poor GPU utilization (<50%)?¶
CPU bottleneck:
- Increase --cpus-per-task (recommend 8-16 CPUs per GPU)
I/O bottleneck:
- Copy data to local node storage first (see Data Loading section)
- Increase DataLoader num_workers
- Enable data prefetching
Best Practices¶
1. Choose the Right Partition¶
- Use gpu_short only for quick tests (<4 hours)
- Use gpu_normal for standard training runs
- Use gpu_long for extended training (>24 hours)
2. Request Only What You Need¶
- Request only the GPUs you’ll actually use
- Jobs wait in queue until resources are available
- Requesting fewer GPUs = faster queue times
3. Verify GPU Access¶
Always include nvidia-smi in job scripts to verify GPU allocation:
nvidia-smi
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
4. Balance CPUs with GPUs¶
- Typical ratio: 8-16 CPUs per GPU
- Single GPU: --cpus-per-task=8
- Two GPUs: --cpus-per-task=16
5. Prefer Shorter Walltimes¶
- Shorter jobs start sooner in the queue
- Request only the time you need
- Use checkpointing to resume if time runs out
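A framework-agnostic checkpoint/resume pattern can be sketched as below. The paths and state fields are illustrative, and PyTorch users would typically use torch.save / torch.load instead of pickle; the key idea is the atomic write, so a job killed at its walltime limit never leaves a half-written checkpoint behind.

```python
# Framework-agnostic checkpoint/resume sketch. Paths and state fields are
# illustrative; with PyTorch you would use torch.save / torch.load instead.
import os
import pickle

def save_checkpoint(state: dict, path: str) -> None:
    """Write atomically: temp file plus rename, so an interrupted job
    cannot leave a corrupt checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX filesystems

def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"epoch": 0}
    with open(path, "rb") as f:
        return pickle.load(f)
```

In a training loop you would call load_checkpoint once at startup, resume from the stored epoch, and call save_checkpoint at the end of each epoch.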
6. Monitor Resource Usage¶
- Always log GPU utilization
- Review logs after jobs complete
- Optimize based on actual usage patterns