(Part 13) 32 Days of GPU Testing from Beginner to Expert — Day 11: ResNet50 Training Tests
Introduction
ResNet50 is one of the most classic convolutional neural networks in deep learning. Since its introduction in 2015, it has served as the benchmark model for image classification and a standard workload for GPU performance testing.
In GPU server testing, ResNet50 training tests answer several important questions:
- Why choose ResNet50 as the benchmark? The model is mature, the implementations are well optimized, and the results are comparable across systems
- How is training performance measured? images/sec is the core metric
- How much does mixed precision help? Typically a 2-3x speedup
- How is multi-GPU scaling evaluated? The linear-scaling ratio is the key
All of these questions point to one core topic: ResNet50 training testing.
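The core metric, images/sec, is just the total number of images processed divided by wall-clock time. A minimal helper makes the arithmetic explicit (the function name is ours, for illustration only):

```python
def images_per_second(batch_size: int, num_gpus: int,
                      iterations: int, elapsed_seconds: float) -> float:
    """Throughput in images/sec: total images processed / wall time."""
    return batch_size * num_gpus * iterations / elapsed_seconds

# e.g. 8 GPUs, batch 256 each, 100 iterations in 100 s -> 2048 img/s
print(images_per_second(256, 8, 100, 100.0))
```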
Why ResNet50 Makes a Good Benchmark
┌─────────────────────────────────────────────────────┐
│          ResNet50 as a Test Benchmark               │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Model properties:                                  │
│  ├── Stable architecture: 50 layers, 25.6M params   │
│  ├── Moderate compute: 4.1 GFLOPs (224x224 input)   │
│  ├── Memory footprint: ~2-4 GB per GPU (batch=256)  │
│  └── Representative: a typical CNN architecture     │
│                                                     │
│  Mature ecosystem:                                  │
│  ├── Framework support: native in PyTorch/TF/MXNet  │
│  ├── Well optimized: deep cuDNN/TensorRT tuning     │
│  ├── Data availability: standard ImageNet dataset   │
│  └── Comparable results: industry reference values  │
│                                                     │
│  What it exercises:                                 │
│  ├── Compute: convolutions, BN, activations         │
│  ├── Memory bandwidth: heavy feature-map traffic    │
│  ├── Multi-GPU comm: gradient sync, data parallel   │
│  └── End to end: the full training pipeline         │
│                                                     │
└─────────────────────────────────────────────────────┘
Performance Reference Values
┌────────────────────────────────────────────────────────────────────┐
│      ResNet50 Training Performance Reference (ImageNet, batch=256) │
├──────────────┬─────────────┬─────────────┬─────────────┬──────────┤
│ GPU config   │ FP32        │ AMP         │ Scaling     │ Notes    │
│              │ (img/s)     │ (img/s)     │ eff. (%)    │          │
├──────────────┼─────────────┼─────────────┼─────────────┼──────────┤
│ 1x H100      │ 550-600     │ 900-1000    │ -           │          │
│ 1x A100 SXM  │ 280-320     │ 500-550     │ -           │          │
│ 1x A100 PCIe │ 260-300     │ 450-500     │ -           │          │
│ 8x H100      │ 4000-4400   │ 6500-7200   │ 90-95       │          │
│ 8x A100 SXM  │ 2000-2300   │ 3600-4000   │ 88-93       │          │
│ 8x A100 PCIe │ 1800-2100   │ 3200-3600   │ 85-90       │          │
│ 8x V100 SXM  │ 800-950     │ 1400-1600   │ 85-90       │          │
└──────────────┴─────────────┴─────────────┴─────────────┴──────────┘
Note: these figures vary with CPU, memory, storage, network, and software versions; treat them as rough references only.
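The scaling-efficiency column can be reproduced from the throughput numbers: measured multi-GPU img/s divided by the ideal of N times the single-GPU figure. A small sketch (the function name is illustrative, not from any benchmark suite):

```python
def scaling_efficiency(single_gpu_ips: float, multi_gpu_ips: float,
                       num_gpus: int) -> float:
    """Linear-scaling efficiency (%): measured multi-GPU throughput
    relative to the ideal of num_gpus x single-GPU throughput."""
    ideal = single_gpu_ips * num_gpus
    return multi_gpu_ips / ideal * 100.0

# e.g. one GPU at 300 img/s, eight GPUs at 2160 img/s -> 90% efficiency
print(f"{scaling_efficiency(300.0, 2160.0, 8):.1f}%")
```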
The ResNet50 Model
Model Architecture
┌─────────────────────────────────────────────────┐
│              ResNet50 Architecture              │
├─────────────────────────────────────────────────┤
│                                                 │
│  Input: 224×224×3 RGB image                     │
│                                                 │
│  Conv1:   7×7, 64, stride=2  → 112×112×64       │
│  MaxPool: 3×3, stride=2      → 56×56×64         │
│                                                 │
│  Conv2_x: 56×56×64 → 56×56×256                  │
│  ├── Bottleneck ×3                              │
│  └── 1×1,64 → 3×3,64 → 1×1,256 (+ shortcut)     │
│                                                 │
│  Conv3_x: 56×56×256 → 28×28×512                 │
│  ├── Bottleneck ×4                              │
│  └── stride=2 downsampling                      │
│                                                 │
│  Conv4_x: 28×28×512 → 14×14×1024                │
│  ├── Bottleneck ×6                              │
│  └── stride=2 downsampling                      │
│                                                 │
│  Conv5_x: 14×14×1024 → 7×7×2048                 │
│  ├── Bottleneck ×3                              │
│  └── stride=2 downsampling                      │
│                                                 │
│  AvgPool: 7×7 → 1×1                             │
│  FC: 2048 → 1000 (ImageNet classes)             │
│                                                 │
│  Total parameters: 25.6M                        │
│  Total FLOPs: 4.1G (224×224 input)              │
│                                                 │
└─────────────────────────────────────────────────┘
PyTorch Implementation
#!/usr/bin/env python3
# resnet50_model.py - ResNet50 model definition
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

def create_resnet50(pretrained=False, num_classes=1000):
    """Create a ResNet50 model."""
    if pretrained:
        weights = ResNet50_Weights.IMAGENET1K_V2
        model = resnet50(weights=weights)
    else:
        model = resnet50(weights=None)
    # Replace the classifier head if needed
    if num_classes != 1000:
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def print_model_info(model):
    """Print model statistics."""
    # Parameter counts
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params / 1e6:.2f}M")
    print(f"Trainable parameters: {trainable_params / 1e6:.2f}M")
    # FLOPs (requires thop)
    try:
        from thop import profile
        input_tensor = torch.randn(1, 3, 224, 224)
        macs, params = profile(model, inputs=(input_tensor,))
        print(f"MACs: {macs / 1e9:.2f}G")
        print(f"FLOPs: {2 * macs / 1e9:.2f}G")
    except ImportError:
        print("Install thop to compute FLOPs: pip install thop")

if __name__ == "__main__":
    model = create_resnet50(pretrained=False)
    print_model_info(model)
    # Test a forward pass
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    model.eval()
    input_tensor = torch.randn(1, 3, 224, 224, device=device)
    with torch.no_grad():
        output = model(input_tensor)
    print(f"Input shape: {input_tensor.shape}")
    print(f"Output shape: {output.shape}")
Training Environment Setup
Software Environment
#!/bin/bash
# setup_training_env.sh - set up the training environment
echo "=========================================="
echo " ResNet50 training environment setup"
echo "=========================================="
# 1. Create a virtual environment
echo ""
echo "[1/5] Creating a Python virtual environment..."
python3 -m venv /opt/resnet-training-env
source /opt/resnet-training-env/bin/activate
# 2. Install PyTorch (pick the index matching your CUDA version)
echo ""
echo "[2/5] Installing PyTorch..."
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# 3. Install training dependencies
echo ""
echo "[3/5] Installing training dependencies..."
pip install tensorboard wandb tqdm pandas matplotlib
# 4. Install profiling tools
echo ""
echo "[4/5] Installing profiling tools..."
pip install torch_tb_profiler nvidia-ml-py psutil
# 5. Verify the installation
echo ""
echo "[5/5] Verifying the installation..."
python3 -c "
import torch
import torchvision
print(f'PyTorch: {torch.__version__}')
print(f'Torchvision: {torchvision.__version__}')
print(f'CUDA: {torch.version.cuda}')
print(f'cuDNN: {torch.backends.cudnn.version()}')
print(f'GPU available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
if torch.cuda.is_available():
    print(f'GPU model: {torch.cuda.get_device_name(0)}')
"
echo ""
echo "=========================================="
echo " Environment setup complete"
echo "=========================================="
echo ""
echo "Activate with: source /opt/resnet-training-env/bin/activate"
Dataset Preparation
#!/bin/bash
# prepare_imagenet.sh - prepare the ImageNet dataset
echo "=========================================="
echo " ImageNet dataset preparation"
echo "=========================================="
DATA_DIR=${DATA_DIR:-/data/imagenet}
echo ""
echo "Dataset directory: $DATA_DIR"
echo ""
# 1. Create the directory layout
echo "[1/4] Creating directories..."
mkdir -p $DATA_DIR/train
mkdir -p $DATA_DIR/val
# 2. Download the dataset (requires an ImageNet account)
echo ""
echo "[2/4] Downloading the dataset..."
echo "Download the following from https://image-net.org:"
echo "  - ILSVRC2012_img_train.tar"
echo "  - ILSVRC2012_img_val.tar"
echo ""
echo "Extract with:"
echo "  tar xvf ILSVRC2012_img_train.tar -C $DATA_DIR/train"
echo "  tar xvf ILSVRC2012_img_val.tar -C $DATA_DIR/val"
# 3. Sort validation images into per-class folders
echo ""
echo "[3/4] Preparing validation labels..."
cat << 'EOF' > $DATA_DIR/prepare_val.py
import shutil
from pathlib import Path

val_dir = Path('/data/imagenet/val')

# Line i of the ground-truth file holds the integer class label of
# ILSVRC2012_val_{i:08d}.JPEG. Note: these numeric IDs follow the
# devkit ordering, not the sorted-synset ordering that torchvision's
# ImageFolder produces; map them via the devkit meta file if the
# exact class names matter.
with open('ILSVRC2012_validation_ground_truth.txt') as f:
    labels = [int(line.strip()) for line in f]

for i, label in enumerate(labels, 1):
    class_dir = val_dir / f'{label:04d}'
    class_dir.mkdir(exist_ok=True)
    img_name = f'ILSVRC2012_val_{i:08d}.JPEG'
    src = val_dir / img_name
    if src.exists():
        shutil.move(str(src), str(class_dir / img_name))

print(f"Validation set ready: {len(set(labels))} classes")
EOF
# 4. Optionally install NVIDIA DALI to speed up data loading
echo ""
echo "[4/4] Installing NVIDIA DALI (optional, accelerates data loading)..."
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda120
echo ""
echo "=========================================="
echo " Dataset preparation complete"
echo "=========================================="
Training Configuration
#!/usr/bin/env python3
# training_config.py - training configuration
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingConfig:
    """Training configuration."""
    # Data
    data_dir: str = '/data/imagenet'
    image_size: int = 224
    workers: int = 8              # DataLoader worker processes
    # Model
    model: str = 'resnet50'
    pretrained: bool = False
    num_classes: int = 1000
    # Training
    batch_size: int = 256         # per-GPU batch size
    epochs: int = 90
    lr: float = 0.1
    momentum: float = 0.9
    weight_decay: float = 1e-4
    lr_scheduler: str = 'cosine'  # step/cosine
    # Mixed precision
    amp: bool = True
    amp_dtype: str = 'float16'    # float16/bfloat16
    # Distributed training
    distributed: bool = True
    world_size: int = 8
    sync_bn: bool = True          # synchronized BatchNorm
    # Logging
    log_dir: str = 'logs'
    tensorboard: bool = True
    wandb: bool = False
    print_freq: int = 10
    # Checkpointing
    checkpoint_dir: str = 'checkpoints'
    save_freq: int = 10
    resume: Optional[str] = None

# Default configuration
default_config = TrainingConfig()

def print_config(config: TrainingConfig):
    """Print the configuration."""
    print("=" * 60)
    print("Training configuration")
    print("=" * 60)
    for key, value in config.__dict__.items():
        print(f"{key}: {value}")
    print("=" * 60)

if __name__ == "__main__":
    print_config(default_config)
Single-Node Multi-GPU Training Performance
Training Script
#!/usr/bin/env python3
# resnet50_training.py - ResNet50 distributed training script
import os
import time
import argparse
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms
from torchvision.models import resnet50, ResNet50_Weights
from torch.cuda.amp import autocast, GradScaler
from tqdm import tqdm

def setup_distributed(rank, world_size):
    """Initialize the distributed environment."""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    return dist.get_rank(), dist.get_world_size()

def cleanup_distributed():
    """Tear down the distributed environment."""
    dist.destroy_process_group()

def create_data_loader(rank, world_size, config):
    """Create the training DataLoader."""
    # Data augmentation
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(config.image_size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    # Dataset
    train_dataset = datasets.ImageFolder(
        os.path.join(config.data_dir, 'train'),
        transform=train_transform
    )
    # Distributed sampler
    sampler = DistributedSampler(train_dataset, rank=rank, num_replicas=world_size, shuffle=True)
    # DataLoader (the sampler handles shuffling, so shuffle=False)
    train_loader = DataLoader(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=False,
        num_workers=config.workers,
        pin_memory=True,
        sampler=sampler,
        persistent_workers=config.workers > 0
    )
    return train_loader

def train_epoch(model, train_loader, criterion, optimizer, scaler, config, rank, epoch):
    """Train for one epoch."""
    model.train()
    world_size = dist.get_world_size()
    train_loader.sampler.set_epoch(epoch)
    total_samples = 0
    correct_samples = 0
    total_loss = 0.0
    start_time = time.time()
    progress_bar = tqdm(train_loader, disable=(rank != 0))
    for batch_idx, (images, targets) in enumerate(progress_bar):
        images = images.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        optimizer.zero_grad()
        # Mixed-precision training
        if config.amp:
            with autocast(dtype=torch.float16 if config.amp_dtype == 'float16' else torch.bfloat16):
                outputs = model(images)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(images)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        # Statistics
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total_samples += targets.size(0)
        correct_samples += predicted.eq(targets).sum().item()
        # Progress display
        if rank == 0 and batch_idx > 0 and batch_idx % config.print_freq == 0:
            progress = batch_idx / len(train_loader)
            elapsed = time.time() - start_time
            images_per_sec = (batch_idx * config.batch_size * world_size) / elapsed
            progress_bar.set_description(
                f'Epoch {epoch} [{progress:.1%}] '
                f'Loss: {total_loss/(batch_idx+1):.4f} '
                f'Acc: {100.*correct_samples/total_samples:.2f}% '
                f'({images_per_sec:.0f} img/s)'
            )
    # Epoch statistics
    epoch_loss = total_loss / len(train_loader)
    epoch_acc = 100. * correct_samples / total_samples
    epoch_time = time.time() - start_time
    images_per_sec = (total_samples * world_size) / epoch_time
    return epoch_loss, epoch_acc, images_per_sec

def main_worker(rank, world_size, config):
    """Per-GPU training entry point."""
    # Initialize distributed training
    setup_distributed(rank, world_size)
    if rank == 0:
        print(f"Training on {world_size} GPUs")
        print(f"Per-GPU batch size: {config.batch_size}")
        print(f"Global batch size: {config.batch_size * world_size}")
    # Create the model
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2 if config.pretrained else None)
    # Synchronized BatchNorm
    if config.sync_bn:
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(rank)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    # Loss function
    criterion = nn.CrossEntropyLoss().cuda(rank)
    # Optimizer
    optimizer = optim.SGD(
        model.parameters(),
        lr=config.lr,
        momentum=config.momentum,
        weight_decay=config.weight_decay
    )
    # Learning-rate scheduler
    if config.lr_scheduler == 'cosine':
        scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=config.epochs)
    else:
        scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    # Mixed-precision scaler
    scaler = GradScaler() if config.amp else None
    # Data loader
    train_loader = create_data_loader(rank, world_size, config)
    # Training loop
    os.makedirs(config.checkpoint_dir, exist_ok=True)
    best_acc = 0.0
    for epoch in range(config.epochs):
        if rank == 0:
            print(f"\nEpoch {epoch+1}/{config.epochs}")
        # Train
        train_loss, train_acc, images_per_sec = train_epoch(
            model, train_loader, criterion, optimizer, scaler, config, rank, epoch
        )
        # Update the learning rate
        scheduler.step()
        if rank == 0:
            print(f"Train Loss: {train_loss:.4f}, Acc: {train_acc:.2f}%, Speed: {images_per_sec:.0f} img/s")
            # Save a checkpoint (rank 0 only)
            if (epoch + 1) % config.save_freq == 0 or train_acc > best_acc:
                checkpoint = {
                    'epoch': epoch,
                    'model_state_dict': model.module.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'scheduler_state_dict': scheduler.state_dict(),
                    'acc': train_acc,
                }
                torch.save(checkpoint, os.path.join(config.checkpoint_dir, f'checkpoint_epoch{epoch+1}.pt'))
                if train_acc > best_acc:
                    best_acc = train_acc
                    torch.save(checkpoint, os.path.join(config.checkpoint_dir, 'checkpoint_best.pt'))
    cleanup_distributed()

def main():
    parser = argparse.ArgumentParser(description='ResNet50 training')
    parser.add_argument('--data-dir', type=str, default='/data/imagenet')
    parser.add_argument('--batch-size', type=int, default=256)
    parser.add_argument('--epochs', type=int, default=90)
    parser.add_argument('--workers', type=int, default=8)
    parser.add_argument('--image-size', type=int, default=224)
    parser.add_argument('--lr', type=float, default=0.1)
    parser.add_argument('--momentum', type=float, default=0.9)
    parser.add_argument('--weight-decay', type=float, default=1e-4)
    parser.add_argument('--lr-scheduler', type=str, default='cosine', choices=['step', 'cosine'])
    parser.add_argument('--amp', action='store_true')
    parser.add_argument('--amp-dtype', type=str, default='float16', choices=['float16', 'bfloat16'])
    parser.add_argument('--pretrained', action='store_true')
    parser.add_argument('--sync-bn', action='store_true')
    parser.add_argument('--print-freq', type=int, default=10)
    parser.add_argument('--save-freq', type=int, default=10)
    parser.add_argument('--checkpoint-dir', type=str, default='checkpoints')
    config = parser.parse_args()
    world_size = torch.cuda.device_count()
    print(f"Detected {world_size} GPUs")
    # Launch one process per GPU
    mp.spawn(main_worker, args=(world_size, config), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()
Launching Training
#!/bin/bash
# run_resnet50_training.sh - launch ResNet50 training
echo "=========================================="
echo " ResNet50 distributed training"
echo "=========================================="
# Configuration
NUM_GPUS=${NUM_GPUS:-8}
BATCH_SIZE=${BATCH_SIZE:-256}
echo ""
echo "Training configuration:"
echo "  GPU count: $NUM_GPUS"
echo "  Per-GPU batch size: $BATCH_SIZE"
echo "  Mixed precision: enabled (--amp)"
echo ""
# Restrict the visible GPUs to NUM_GPUS. The training script spawns
# one process per visible GPU itself (mp.spawn), so it is launched
# with plain python3 rather than torchrun.
export CUDA_VISIBLE_DEVICES=$(seq -s, 0 $((NUM_GPUS - 1)))
python3 resnet50_training.py \
    --data-dir /data/imagenet \
    --batch-size $BATCH_SIZE \
    --epochs 90 \
    --workers 8 \
    --amp \
    --pretrained \
    --sync-bn
echo ""
echo "=========================================="
echo " Training complete"
echo "=========================================="
Mixed-Precision Training
How Mixed Precision Works
┌─────────────────────────────────────────────────────┐
│           Mixed-Precision Training                  │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Precision formats:                                 │
│  ├── FP32 (single): 32-bit, high precision, slower  │
│  ├── FP16 (half): 16-bit, lower precision, fast     │
│  └── BF16 (bfloat16): 16-bit, wide dynamic range    │
│                                                     │
│  Mixed-precision strategy:                          │
│  ├── Forward pass: computed in FP16                 │
│  ├── Loss computation: FP32 (avoids precision loss) │
│  ├── Backward: FP16 gradients, FP32 accumulation    │
│  └── Weight update: FP32 (master weight copy)       │
│                                                     │
│  Benefits:                                          │
│  ├── Speed: 2-3x faster (Tensor Cores)              │
│  ├── Memory: ~50% savings                           │
│  ├── Enables larger batch sizes                     │
│  └── Accuracy: on par with FP32 training            │
│                                                     │
│  Caveats:                                           │
│  ├── Loss scaling to prevent gradient underflow     │
│  ├── Master weights kept in FP32                    │
│  └── Gradient checks for Inf/NaN                    │
│                                                     │
└─────────────────────────────────────────────────────┘
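The loss-scaling and Inf/NaN-check steps above can be sketched in plain Python. This is a conceptual model of what a gradient scaler does, not PyTorch's actual `GradScaler` implementation; the function name and the halve-on-overflow policy are simplifications:

```python
import math

def scaled_step(scaled_grads, loss_scale, lr, weights):
    """One loss-scaled update (conceptual): the backward pass produced
    gradients multiplied by loss_scale; unscale them, check for
    Inf/NaN (FP16 overflow), then either apply the FP32 master-weight
    update or skip the step and lower the scale."""
    grads = [g / loss_scale for g in scaled_grads]
    if any(math.isinf(g) or math.isnan(g) for g in grads):
        # Overflow detected: skip this step and halve the scale
        return weights, loss_scale / 2.0, False
    new_weights = [w - lr * g for w, g in zip(weights, grads)]
    return new_weights, loss_scale, True
```

A healthy step applies the unscaled update unchanged; an overflowing step is skipped, which is why occasional skipped iterations in AMP logs are normal.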
Mixed-Precision Performance Comparison
#!/usr/bin/env python3
# amp_benchmark.py - compare FP32 vs mixed-precision training throughput
import torch
import torch.nn as nn
import time
from torchvision.models import resnet50

def benchmark_training(model, device, batch_size, iterations, use_amp=False):
    """Benchmark one training configuration."""
    model.train()
    model = model.to(device)
    # Synthetic data
    images = torch.randn(batch_size, 3, 224, 224, device=device)
    targets = torch.randint(0, 1000, (batch_size,), device=device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # Mixed-precision scaler
    scaler = torch.cuda.amp.GradScaler() if use_amp else None
    # Warm-up
    for _ in range(10):
        optimizer.zero_grad()
        if use_amp:
            with torch.cuda.amp.autocast():
                outputs = model(images)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(images)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
    torch.cuda.synchronize()
    # Timed run
    start_time = time.time()
    for _ in range(iterations):
        optimizer.zero_grad()
        if use_amp:
            with torch.cuda.amp.autocast():
                outputs = model(images)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(images)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
    torch.cuda.synchronize()
    elapsed = time.time() - start_time
    images_per_sec = (batch_size * iterations) / elapsed
    return elapsed, images_per_sec

def main():
    device = torch.device('cuda')
    print(f"GPU: {torch.cuda.get_device_name(device)}")
    print()
    batch_sizes = [64, 128, 256, 512]
    iterations = 100
    print("=" * 70)
    print("Mixed-Precision Performance Comparison (ResNet50)")
    print("=" * 70)
    print()
    for batch_size in batch_sizes:
        print(f"Batch Size: {batch_size}")
        print("-" * 70)
        # FP32
        model_fp32 = resnet50()
        time_fp32, ips_fp32 = benchmark_training(
            model_fp32, device, batch_size, iterations, use_amp=False
        )
        print(f"  FP32: {time_fp32:.2f}s, {ips_fp32:.0f} img/s")
        # AMP (FP16)
        model_amp = resnet50()
        time_amp, ips_amp = benchmark_training(
            model_amp, device, batch_size, iterations, use_amp=True
        )
        print(f"  AMP:  {time_amp:.2f}s, {ips_amp:.0f} img/s")
        # Speedup
        speedup = ips_amp / ips_fp32
        print(f"  Speedup: {speedup:.2f}x ({(speedup-1)*100:.0f}%)")
        print()
    print("=" * 70)

if __name__ == "__main__":
    main()
Running the Comparison
# Run the mixed-precision comparison
python3 amp_benchmark.py
# Example output:
# GPU: NVIDIA A100-SXM4-80GB
#
# ======================================================================
# Mixed-Precision Performance Comparison (ResNet50)
# ======================================================================
#
# Batch Size: 64
# ----------------------------------------------------------------------
#   FP32: 12.34s, 519 img/s
#   AMP:  6.78s, 944 img/s
#   Speedup: 1.82x (82%)
#
# Batch Size: 128
# ----------------------------------------------------------------------
#   FP32: 6.45s, 992 img/s
#   AMP:  3.52s, 1818 img/s
#   Speedup: 1.83x (83%)
#
# Batch Size: 256
# ----------------------------------------------------------------------
#   FP32: 3.34s, 1916 img/s
#   AMP:  1.82s, 3516 img/s
#   Speedup: 1.84x (84%)
#
# Batch Size: 512
# ----------------------------------------------------------------------
#   FP32: 1.72s, 3721 img/s
#   AMP:  0.95s, 6737 img/s
#   Speedup: 1.81x (81%)
#
# ======================================================================
Bottleneck Analysis
Using the PyTorch Profiler
#!/usr/bin/env python3
# training_profiler.py - profile training performance
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity
from torchvision.models import resnet50

def profile_training():
    """Profile the training loop."""
    device = torch.device('cuda')
    model = resnet50().to(device)
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    # Synthetic data
    images = torch.randn(256, 3, 224, 224, device=device)
    targets = torch.randint(0, 1000, (256,), device=device)
    # Profiling
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet50'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:
        for step in range(10):
            with record_function("model_forward"):
                outputs = model(images)
                loss = criterion(outputs, targets)
            with record_function("model_backward"):
                loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            prof.step()
    # Print a summary
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    # Export the trace
    prof.export_chrome_trace("trace.json")
    print("Trace exported to trace.json")
    print("View it in Chrome at chrome://tracing")

if __name__ == "__main__":
    profile_training()
Bottleneck Diagnosis
┌─────────────────────────────────────────────────────┐
│          Diagnosing Training Bottlenecks            │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Low GPU utilization (<80%):                        │
│  ├── Likely cause: CPU / data-loading bottleneck    │
│  ├── Diagnose: check CPU usage with top/htop        │
│  ├── Fix: more workers, use NVIDIA DALI             │
│  └── Tune: prefetch_factor, pin_memory              │
│                                                     │
│  Memory-bandwidth bound:                            │
│  ├── Likely cause: frequent device-memory traffic   │
│  ├── Diagnose: memory transactions via nsys profile │
│  ├── Fix: eliminate unnecessary copies              │
│  └── Tune: fuse ops, use in-place operations        │
│                                                     │
│  Communication bound (multi-GPU):                   │
│  ├── Likely cause: slow gradient synchronization    │
│  ├── Diagnose: NCCL_DEBUG to see comm time          │
│  ├── Fix: gradient accumulation, sync less often    │
│  └── Tune: gradient compression, overlap comm/compute │
│                                                     │
│  Compute bound:                                     │
│  ├── Likely cause: compute-intensive model          │
│  ├── Diagnose: kernel times via nsys                │
│  ├── Fix: mixed-precision training                  │
│  └── Tune: operator fusion, optimized libraries     │
│                                                     │
└─────────────────────────────────────────────────────┘
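One way to act on the first entry of the box above: time the data-wait and compute portions of each iteration separately and compare their averages. A framework-free sketch (the 25% threshold, the labels, and the function name are our choices, not a standard):

```python
def classify_bottleneck(data_times, compute_times, threshold=0.25):
    """Rough classification: if the average data-loading wait exceeds
    `threshold` of the total per-iteration time, the input pipeline is
    the likely bottleneck; otherwise GPU compute dominates."""
    data = sum(data_times) / len(data_times)
    comp = sum(compute_times) / len(compute_times)
    ratio = data / (data + comp)
    kind = 'input-pipeline-bound' if ratio > threshold else 'compute-bound'
    return kind, ratio

# e.g. waiting 50 ms for data vs 50 ms of compute -> input pipeline bound
print(classify_bottleneck([0.05] * 10, [0.05] * 10))
```

In a real loop, `data_times` would be measured around the `for images, targets in loader:` fetch and `compute_times` around the forward/backward step (with a `torch.cuda.synchronize()` before each timestamp so GPU work is actually counted).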
Hands-On: A Complete Training Benchmark
One-Shot Test Script
#!/bin/bash
# resnet50_benchmark.sh - complete ResNet50 training benchmark
set -e
echo "=========================================="
echo " ResNet50 training benchmark"
echo "=========================================="
RESULTS_DIR="results/resnet50_$(date +%Y%m%d_%H%M%S)"
mkdir -p $RESULTS_DIR
NUM_GPUS=$(nvidia-smi -L | wc -l)
BATCH_SIZE=256
echo ""
echo "Test configuration:"
echo "  GPU count: $NUM_GPUS"
echo "  Per-GPU batch size: $BATCH_SIZE"
echo "  Global batch size: $((BATCH_SIZE * NUM_GPUS))"
echo "  Results directory: $RESULTS_DIR"
echo ""
# 1. Environment check
echo "=========================================="
echo "[1/5] Environment check"
echo "=========================================="
python3 -c "
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPU count: {torch.cuda.device_count()}')
print(f'GPU model: {torch.cuda.get_device_name(0)}')
" | tee $RESULTS_DIR/env_info.txt
# 2. Model info
echo ""
echo "=========================================="
echo "[2/5] Model info"
echo "=========================================="
python3 -c "
from torchvision.models import resnet50
model = resnet50()
params = sum(p.numel() for p in model.parameters())
print(f'Parameters: {params/1e6:.2f}M')
" | tee $RESULTS_DIR/model_info.txt
# 3. FP32 benchmark
echo ""
echo "=========================================="
echo "[3/5] FP32 benchmark"
echo "=========================================="
python3 amp_benchmark.py 2>&1 | tee $RESULTS_DIR/fp32_benchmark.log
# 4. Mixed-precision benchmark
echo ""
echo "=========================================="
echo "[4/5] Mixed-precision benchmark"
echo "=========================================="
# (already covered by amp_benchmark.py above)
# 5. Generate the report
echo ""
echo "=========================================="
echo "[5/5] Generating the report"
echo "=========================================="
cat << EOF > $RESULTS_DIR/benchmark_report.md
# ResNet50 Training Benchmark Report
**Date:** $(date)
**GPU configuration:** $NUM_GPUS x $(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)
## Environment
\`\`\`
$(cat $RESULTS_DIR/env_info.txt)
\`\`\`
## Model
\`\`\`
$(cat $RESULTS_DIR/model_info.txt)
\`\`\`
## Results
See fp32_benchmark.log
## Recommendations
Adjust the training configuration based on these results
EOF
echo "Report written to $RESULTS_DIR/benchmark_report.md"
echo ""
echo "=========================================="
echo " Benchmark complete"
echo "=========================================="
echo ""
echo "Results directory: $RESULTS_DIR"
ls -la $RESULTS_DIR
Troubleshooting Common Problems
Slow Training
# Problem: training speed is far below expectations
# 1. Check GPU utilization
nvidia-smi dmon -s pucvmet -d 1 -c 10
# 2. Check CPU usage
top -bn1 | grep "Cpu(s)"
# 3. Check data loading
#    Increase the number of workers, e.g. --workers 16
# 4. Check storage I/O
iostat -x 1 5
# 5. Enable mixed precision
#    Add the --amp flag
# 6. Check NCCL
export NCCL_DEBUG=INFO
Out of GPU Memory (OOM)
# Problem: CUDA out of memory
# 1. Reduce the batch size
#    --batch-size 128
# 2. Use gradient accumulation
#    Emulates a larger batch size
# 3. Enable mixed precision
#    --amp (saves about 50% of GPU memory)
# 4. Check for memory leaks
#    Watch usage with nvidia-smi
# 5. Clear the cache (in Python)
#    torch.cuda.empty_cache()
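Gradient accumulation (item 2 above) trades iterations for effective batch size: accumulate normalized gradients over N micro-batches, then apply a single update, so peak memory stays at one micro-batch. A minimal scalar sketch, not tied to PyTorch (all names are illustrative):

```python
def train_with_accumulation(micro_batches, accum_steps, lr, w, grad_fn):
    """Accumulate gradients over accum_steps micro-batches, then apply
    one update - the optimizer sees an effective batch accum_steps
    times larger while memory holds only one micro-batch at a time."""
    acc = 0.0
    for i, xb in enumerate(micro_batches, 1):
        # Normalize each micro-batch gradient so the summed update
        # matches what one large batch would have produced
        acc += grad_fn(w, xb) / accum_steps
        if i % accum_steps == 0:
            w -= lr * acc
            acc = 0.0
    return w
```

In PyTorch the same pattern is `loss = loss / accum_steps; loss.backward()` each micro-batch, with `optimizer.step()` and `optimizer.zero_grad()` only every `accum_steps` iterations.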
Multi-GPU Training Does Not Scale
# Problem: poor multi-GPU scaling efficiency
# 1. Check the NCCL configuration
export NCCL_DEBUG=INFO
# 2. Check the GPU topology
nvidia-smi topo -m
# 3. Check the interconnect (InfiniBand, if present)
ibstat
# 4. Enable synchronized BatchNorm
#    --sync-bn
# 5. Reduce gradient-sync frequency
#    via gradient accumulation
Summary
What We Covered Today
- ✅ The ResNet50 model: architecture, parameter count, FLOPs
- ✅ Training environment setup: software, dataset, configuration
- ✅ Single-node multi-GPU training: distributed training script and launch
- ✅ Mixed-precision training: principles, benchmarks, 2-3x speedup
- ✅ Bottleneck analysis: PyTorch Profiler and diagnosis methods
- ✅ Hands-on: a complete benchmark script
Next Up
Tomorrow, in Day 12 - ResNet50 Inference Testing, we will cover:
- Inference engine comparison (TensorRT, ONNX Runtime)
- Batching and latency optimization
- Quantization tests (INT8/FP16)
- Throughput benchmarks