(Part 13) 32 Days of GPU Testing from Beginner to Expert — Day 11: ResNet50 Training Tests
Introduction
ResNet50 is one of the most classic convolutional neural networks in deep learning. Since its introduction in 2015, it has served as the benchmark model for image classification and a standard workload for GPU performance testing.
In GPU server testing, ResNet50 training tests answer several important questions:
- Why choose ResNet50 as the benchmark? The model is mature, the implementations are well optimized, and the results are comparable across systems
- How is training performance measured? images/sec is the core metric
- How much does mixed precision help? Typically a 2-3x speedup
- How is multi-GPU scaling evaluated? The linear-scaling ratio is the key
All of these questions point to one core topic: ResNet50 training testing.
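The core metric, images/sec, is just the total number of images processed divided by wall-clock time. A minimal helper makes the arithmetic explicit (the function name is ours, for illustration only):

```python
def images_per_second(batch_size: int, num_gpus: int,
                      iterations: int, elapsed_seconds: float) -> float:
    """Throughput in images/sec: total images processed / wall time."""
    return batch_size * num_gpus * iterations / elapsed_seconds

# e.g. 8 GPUs, batch 256 each, 100 iterations in 100 s -> 2048 img/s
print(images_per_second(256, 8, 100, 100.0))
```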
Why ResNet50 Makes a Good Benchmark
┌─────────────────────────────────────────────────────┐
│          ResNet50 as a Test Benchmark               │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Model properties:                                  │
│  ├── Stable architecture: 50 layers, 25.6M params   │
│  ├── Moderate compute: 4.1 GFLOPs (224x224 input)   │
│  ├── Memory footprint: ~2-4 GB per GPU (batch=256)  │
│  └── Representative: a typical CNN architecture     │
│                                                     │
│  Mature ecosystem:                                  │
│  ├── Framework support: native in PyTorch/TF/MXNet  │
│  ├── Well optimized: deep cuDNN/TensorRT tuning     │
│  ├── Data availability: standard ImageNet dataset   │
│  └── Comparable results: industry reference values  │
│                                                     │
│  What it exercises:                                 │
│  ├── Compute: convolutions, BN, activations         │
│  ├── Memory bandwidth: heavy feature-map traffic    │
│  ├── Multi-GPU comm: gradient sync, data parallel   │
│  └── End to end: the full training pipeline         │
│                                                     │
└─────────────────────────────────────────────────────┘
Performance Reference Values
┌────────────────────────────────────────────────────────────────────┐
│      ResNet50 Training Performance Reference (ImageNet, batch=256) │
├──────────────┬─────────────┬─────────────┬─────────────┬──────────┤
│ GPU config   │ FP32        │ AMP         │ Scaling     │ Notes    │
│              │ (img/s)     │ (img/s)     │ eff. (%)    │          │
├──────────────┼─────────────┼─────────────┼─────────────┼──────────┤
│ 1x H100      │ 550-600     │ 900-1000    │ -           │          │
│ 1x A100 SXM  │ 280-320     │ 500-550     │ -           │          │
│ 1x A100 PCIe │ 260-300     │ 450-500     │ -           │          │
│ 8x H100      │ 4000-4400   │ 6500-7200   │ 90-95       │          │
│ 8x A100 SXM  │ 2000-2300   │ 3600-4000   │ 88-93       │          │
│ 8x A100 PCIe │ 1800-2100   │ 3200-3600   │ 85-90       │          │
│ 8x V100 SXM  │ 800-950     │ 1400-1600   │ 85-90       │          │
└──────────────┴─────────────┴─────────────┴─────────────┴──────────┘
Note: these figures vary with CPU, memory, storage, network, and software versions; treat them as rough references only.
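The scaling-efficiency column can be reproduced from the throughput numbers: measured multi-GPU img/s divided by the ideal of N times the single-GPU figure. A small sketch (the function name is illustrative, not from any benchmark suite):

```python
def scaling_efficiency(single_gpu_ips: float, multi_gpu_ips: float,
                       num_gpus: int) -> float:
    """Linear-scaling efficiency (%): measured multi-GPU throughput
    relative to the ideal of num_gpus x single-GPU throughput."""
    ideal = single_gpu_ips * num_gpus
    return multi_gpu_ips / ideal * 100.0

# e.g. one GPU at 300 img/s, eight GPUs at 2160 img/s -> 90% efficiency
print(f"{scaling_efficiency(300.0, 2160.0, 8):.1f}%")
```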
The ResNet50 Model
Model Architecture
┌─────────────────────────────────────────────────┐
│              ResNet50 Architecture              │
├─────────────────────────────────────────────────┤
│                                                 │
│  Input: 224×224×3 RGB image                     │
│                                                 │
│  Conv1:   7×7, 64, stride=2  → 112×112×64       │
│  MaxPool: 3×3, stride=2      → 56×56×64         │
│                                                 │
│  Conv2_x: 56×56×64 → 56×56×256                  │
│  ├── Bottleneck ×3                              │
│  └── 1×1,64 → 3×3,64 → 1×1,256 (+ shortcut)     │
│                                                 │
│  Conv3_x: 56×56×256 → 28×28×512                 │
│  ├── Bottleneck ×4                              │
│  └── stride=2 downsampling                      │
│                                                 │
│  Conv4_x: 28×28×512 → 14×14×1024                │
│  ├── Bottleneck ×6                              │
│  └── stride=2 downsampling                      │
│                                                 │
│  Conv5_x: 14×14×1024 → 7×7×2048                 │
│  ├── Bottleneck ×3                              │
│  └── stride=2 downsampling                      │
│                                                 │
│  AvgPool: 7×7 → 1×1                             │
│  FC: 2048 → 1000 (ImageNet classes)             │
│                                                 │
│  Total parameters: 25.6M                        │
│  Total FLOPs: 4.1G (224×224 input)              │
│                                                 │
└─────────────────────────────────────────────────┘
PyTorch Implementation
#!/usr/bin/env python3
# resnet50_model.py - ResNet50 model definition
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

def create_resnet50(pretrained=False, num_classes=1000):
    """Create a ResNet50 model."""
    if pretrained:
        weights = ResNet50_Weights.IMAGENET1K_V2
        model = resnet50(weights=weights)
    else:
        model = resnet50(weights=None)
    # Replace the classifier head if needed
    if num_classes != 1000:
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def print_model_info(model):
    """Print model statistics."""
    # Parameter counts
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total_params / 1e6:.2f}M")
    print(f"Trainable parameters: {trainable_params / 1e6:.2f}M")
    # FLOPs (requires thop)
    try:
        from thop import profile
        input_tensor = torch.randn(1, 3, 224, 224)
        macs, params = profile(model, inputs=(input_tensor,))
        print(f"MACs: {macs / 1e9:.2f}G")
        print(f"FLOPs: {2 * macs / 1e9:.2f}G")
    except ImportError:
        print("Install thop to compute FLOPs: pip install thop")

if __name__ == "__main__":
    model = create_resnet50(pretrained=False)
    print_model_info(model)
    # Test a forward pass
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    model.eval()
    input_tensor = torch.randn(1, 3, 224, 224, device=device)
    with torch.no_grad():
        output = model(input_tensor)
    print(f"Input shape: {input_tensor.shape}")
    print(f"Output shape: {output.shape}")
Training Environment Setup
Software Environment
#!/bin/bash
# setup_training_env.sh - set up the training environment
echo "=========================================="
echo " ResNet50 training environment setup"
echo "=========================================="
# 1. Create a virtual environment
echo ""
echo "[1/5] Creating a Python virtual environment..."
python3 -m venv /opt/resnet-training-env
source /opt/resnet-training-env/bin/activate
# 2. Install PyTorch (pick the index matching your CUDA version)
echo ""
echo "[2/5] Installing PyTorch..."
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# 3. Install training dependencies
echo ""
echo "[3/5] Installing training dependencies..."
pip install tensorboard wandb tqdm pandas matplotlib
# 4. Install profiling tools
echo ""
echo "[4/5] Installing profiling tools..."
pip install torch_tb_profiler nvidia-ml-py psutil
# 5. Verify the installation
echo ""
echo "[5/5] Verifying the installation..."
python3 -c "
import torch
import torchvision
print(f'PyTorch: {torch.__version__}')
print(f'Torchvision: {torchvision.__version__}')
print(f'CUDA: {torch.version.cuda}')
print(f'cuDNN: {torch.backends.cudnn.version()}')
print(f'GPU available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
if torch.cuda.is_available():
    print(f'GPU model: {torch.cuda.get_device_name(0)}')
"
echo ""
echo "=========================================="
echo " Environment setup complete"
echo "=========================================="
echo ""
echo "Activate with: source /opt/resnet-training-env/bin/activate"
Dataset Preparation
#!/bin/bash
# prepare_imagenet.sh - prepare the ImageNet dataset
echo "=========================================="
echo " ImageNet dataset preparation"
echo "=========================================="
DATA_DIR=${DATA_DIR:-/data/imagenet}
echo ""
echo "Dataset directory: $DATA_DIR"
echo ""
# 1. Create the directory layout
echo "[1/4] Creating directories..."
mkdir -p $DATA_DIR/train
mkdir -p $DATA_DIR/val
# 2. Download the dataset (requires an ImageNet account)
echo ""
echo "[2/4] Downloading the dataset..."
echo "Download the following from https://image-net.org:"
echo "  - ILSVRC2012_img_train.tar"
echo "  - ILSVRC2012_img_val.tar"
echo ""
echo "Extract with:"
echo "  tar xvf ILSVRC2012_img_train.tar -C $DATA_DIR/train"
echo "  tar xvf ILSVRC2012_img_val.tar -C $DATA_DIR/val"
# 3. Sort validation images into per-class folders
echo ""
echo "[3/4] Preparing validation labels..."
cat << 'EOF' > $DATA_DIR/prepare_val.py
import shutil
from pathlib import Path

val_dir = Path('/data/imagenet/val')

# Line i of the ground-truth file holds the integer class label of
# ILSVRC2012_val_{i:08d}.JPEG. Note: these numeric IDs follow the
# devkit ordering, not the sorted-synset ordering that torchvision's
# ImageFolder produces; map them via the devkit meta file if the
# exact class names matter.
with open('ILSVRC2012_validation_ground_truth.txt') as f:
    labels = [int(line.strip()) for line in f]

for i, label in enumerate(labels, 1):
    class_dir = val_dir / f'{label:04d}'
    class_dir.mkdir(exist_ok=True)
    img_name = f'ILSVRC2012_val_{i:08d}.JPEG'
    src = val_dir / img_name
    if src.exists():
        shutil.move(str(src), str(class_dir / img_name))

print(f"Validation set ready: {len(set(labels))} classes")
EOF
# 4. Optionally install NVIDIA DALI to speed up data loading
echo ""
echo "[4/4] Installing NVIDIA DALI (optional, accelerates data loading)..."
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda120
echo ""
echo "=========================================="
echo " Dataset preparation complete"
echo "=========================================="
Training Configuration
#!/usr/bin/env python3
# training_config.py - training configuration
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingConfig:
    """Training configuration."""
    # Data
    data_dir: str = '/data/imagenet'
    image_size: int = 224
    workers: int = 8              # DataLoader worker processes
    # Model
    model: str = 'resnet50'
    pretrained: bool = False
    num_classes: int = 1000
    # Training
    batch_size: int = 256         # per-GPU batch size
    epochs: int = 90
    lr: float = 0.1
    momentum: float = 0.9
    weight_decay: float = 1e-4
    lr_scheduler: str = 'cosine'  # step/cosine
    # Mixed precision
    amp: bool = True
    amp_dtype: str = 'float16'    # float16/bfloat16
    # Distributed training
    distributed: bool = True
    world_size: int = 8
    sync_bn: bool = True          # synchronized BatchNorm
    # Logging
    log_dir: str = 'logs'
    tensorboard: bool = True
    wandb: bool = False
    print_freq: int = 10
    # Checkpointing
    checkpoint_dir: str = 'checkpoints'
    save_freq: int = 10
    resume: Optional[str] = None

# Default configuration
default_config = TrainingConfig()

def print_config(config: TrainingConfig):
    """Print the configuration."""
    print("=" * 60)
    print("Training configuration")
    print("=" * 60)
    for key, value in config.__dict__.items():
        print(f"{key}: {value}")
    print("=" * 60)

if __name__ == "__main__":
    print_config(default_config)
Single-Node Multi-GPU Training Performance
Training Script
#!/usr/bin/env python3
# resnet50_training.py - ResNet50 distributed training script
import os
import time
import argparse
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms
from torchvision.models import resnet50, ResNet50_Weights
from torch.cuda.amp import autocast, GradScaler
from tqdm import tqdm

def setup_distributed(rank, world_size):
    """Initialize the distributed environment."""
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    return dist.get_rank(), dist.get_world_size()

def cleanup_distributed():
    """Tear down the distributed environment."""
    dist.destroy_process_group()

def create_data_loader(rank, world_size, config):
    """Create the training DataLoader."""
    # Data augmentation
    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(config.image_size),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    # Dataset
    train_dataset = datasets.ImageFolder(
        os.path.join(config.data_dir, 'train'),
        transform=train_transform
    )
    # Distributed sampler
    sampler = DistributedSampler(train_dataset, rank=rank, num_replicas=world_size, shuffle=True)
    # DataLoader (the sampler handles shuffling, so shuffle=False)
    train_loader = DataLoader(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=False,
        num_workers=config.workers,
        pin_memory=True,
        sampler=sampler,
        persistent_workers=config.workers > 0
    )
    return train_loader

def train_epoch(model, train_loader, criterion, optimizer, scaler, config, rank, epoch):
    """Train for one epoch."""
    model.train()
    world_size = dist.get_world_size()
    train_loader.sampler.set_epoch(epoch)
    total_samples = 0
    correct_samples = 0
    total_loss = 0.0
    start_time = time.time()
    progress_bar = tqdm(train_loader, disable=(rank != 0))
    for batch_idx, (images, targets) in enumerate(progress_bar):
        images = images.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        optimizer.zero_grad()
        # Mixed-precision training
        if config.amp:
            with autocast(dtype=torch.float16 if config.amp_dtype == 'float16' else torch.bfloat16):
                outputs = model(images)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(images)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        # Statistics
        total_loss += loss.item()
        _, predicted = outputs.max(1)
        total_samples += targets.size(0)
        correct_samples += predicted.eq(targets).sum().item()
        # Progress display
        if rank == 0 and batch_idx > 0 and batch_idx % config.print_freq == 0:
            progress = batch_idx / len(train_loader)
            elapsed = time.time() - start_time
            images_per_sec = (batch_idx * config.batch_size * world_size) / elapsed
            progress_bar.set_description(
                f'Epoch {epoch} [{progress:.1%}] '
                f'Loss: {total_loss/(batch_idx+1):.4f} '
                f'Acc: {100.*correct_samples/total_samples:.2f}% '
                f'({images_per_sec:.0f} img/s)'
            )
    # Epoch statistics
    epoch_loss = total_loss / len(train_loader)
    epoch_acc = 100. * correct_samples / total_samples
    epoch_time = time.time() - start_time
    images_per_sec = (total_samples * world_size) / epoch_time
    return epoch_loss, epoch_acc, images_per_sec

def main_worker(rank, world_size, config):
    """Per-GPU training entry point."""
    # Initialize distributed training
    setup_distributed(rank, world_size)
    if rank == 0:
        print(f"Training on {world_size} GPUs")
        print(f"Per-GPU batch size: {config.batch_size}")
        print(f"Global batch size: {config.batch_size * world_size}")
    # Create the model
    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2 if config.pretrained else None)
    # Synchronized BatchNorm
    if config.sync_bn:
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(rank)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    # Loss function
    criterion = nn.CrossEntropyLoss().cuda(rank)
    # Optimizer
    optimizer = optim.SGD(
        model.parameters(),
        lr=config.lr,
        momentum=config.momentum,
        weight_decay=config.weight_decay
    )
    # Learning-rate scheduler
    if config.lr_scheduler == 'cosine':
        scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=config.epochs)
    else:
        scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    # Mixed-precision scaler
    scaler = GradScaler() if config.amp else None
    # Data loader
    train_loader = create_data_loader(rank, world_size, config)
    # Training loop
    os.makedirs(config.checkpoint_dir, exist_ok=True)
    best_acc = 0.0
    for epoch in range(config.epochs):
        if rank == 0:
            print(f"\nEpoch {epoch+1}/{config.epochs}")
        # Train
        train_loss, train_acc, images_per_sec = train_epoch(
            model, train_loader, criterion, optimizer, scaler, config, rank, epoch
        )
        # Update the learning rate
        scheduler.step()
        if rank == 0:
            print(f"Train Loss: {train_loss:.4f}, Acc: {train_acc:.2f}%, Speed: {images_per_sec:.0f} img/s")
            # Save a checkpoint (rank 0 only)
            if (epoch + 1) % config.save_freq == 0 or train_acc > best_acc:
                checkpoint = {
                    'epoch': epoch,
                    'model_state_dict': model.module.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'scheduler_state_dict': scheduler.state_dict(),
                    'acc': train_acc,
                }
                torch.save(checkpoint, os.path.join(config.checkpoint_dir, f'checkpoint_epoch{epoch+1}.pt'))
                if train_acc > best_acc:
                    best_acc = train_acc
                    torch.save(checkpoint, os.path.join(config.checkpoint_dir, 'checkpoint_best.pt'))
    cleanup_distributed()

def main():
    parser = argparse.ArgumentParser(description='ResNet50 training')
    parser.add_argument('--data-dir', type=str, default='/data/imagenet')
    parser.add_argument('--batch-size', type=int, default=256)
    parser.add_argument('--epochs', type=int, default=90)
    parser.add_argument('--workers', type=int, default=8)
    parser.add_argument('--image-size', type=int, default=224)
    parser.add_argument('--lr', type=float, default=0.1)
    parser.add_argument('--momentum', type=float, default=0.9)
    parser.add_argument('--weight-decay', type=float, default=1e-4)
    parser.add_argument('--lr-scheduler', type=str, default='cosine', choices=['step', 'cosine'])
    parser.add_argument('--amp', action='store_true')
    parser.add_argument('--amp-dtype', type=str, default='float16', choices=['float16', 'bfloat16'])
    parser.add_argument('--pretrained', action='store_true')
    parser.add_argument('--sync-bn', action='store_true')
    parser.add_argument('--print-freq', type=int, default=10)
    parser.add_argument('--save-freq', type=int, default=10)
    parser.add_argument('--checkpoint-dir', type=str, default='checkpoints')
    config = parser.parse_args()
    world_size = torch.cuda.device_count()
    print(f"Detected {world_size} GPUs")
    # Launch one process per GPU
    mp.spawn(main_worker, args=(world_size, config), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()
Launching Training
#!/bin/bash
# run_resnet50_training.sh - launch ResNet50 training
echo "=========================================="
echo " ResNet50 distributed training"
echo "=========================================="
# Configuration
NUM_GPUS=${NUM_GPUS:-8}
BATCH_SIZE=${BATCH_SIZE:-256}
echo ""
echo "Training configuration:"
echo "  GPU count: $NUM_GPUS"
echo "  Per-GPU batch size: $BATCH_SIZE"
echo "  Mixed precision: enabled (--amp)"
echo ""
# Restrict the visible GPUs to NUM_GPUS. The training script spawns
# one process per visible GPU itself (mp.spawn), so it is launched
# with plain python3 rather than torchrun.
export CUDA_VISIBLE_DEVICES=$(seq -s, 0 $((NUM_GPUS - 1)))
python3 resnet50_training.py \
    --data-dir /data/imagenet \
    --batch-size $BATCH_SIZE \
    --epochs 90 \
    --workers 8 \
    --amp \
    --pretrained \
    --sync-bn
echo ""
echo "=========================================="
echo " Training complete"
echo "=========================================="
Mixed-Precision Training
How Mixed Precision Works
┌─────────────────────────────────────────────────────┐
│           Mixed-Precision Training                  │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Precision formats:                                 │
│  ├── FP32 (single): 32-bit, high precision, slower  │
│  ├── FP16 (half): 16-bit, lower precision, fast     │
│  └── BF16 (bfloat16): 16-bit, wide dynamic range    │
│                                                     │
│  Mixed-precision strategy:                          │
│  ├── Forward pass: computed in FP16                 │
│  ├── Loss computation: FP32 (avoids precision loss) │
│  ├── Backward: FP16 gradients, FP32 accumulation    │
│  └── Weight update: FP32 (master weight copy)       │
│                                                     │
│  Benefits:                                          │
│  ├── Speed: 2-3x faster (Tensor Cores)              │
│  ├── Memory: ~50% savings                           │
│  ├── Enables larger batch sizes                     │
│  └── Accuracy: on par with FP32 training            │
│                                                     │
│  Caveats:                                           │
│  ├── Loss scaling to prevent gradient underflow     │
│  ├── Master weights kept in FP32                    │
│  └── Gradient checks for Inf/NaN                    │
│                                                     │
└─────────────────────────────────────────────────────┘
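The loss-scaling and Inf/NaN-check steps above can be sketched in plain Python. This is a conceptual model of what a gradient scaler does, not PyTorch's actual `GradScaler` implementation; the function name and the halve-on-overflow policy are simplifications:

```python
import math

def scaled_step(scaled_grads, loss_scale, lr, weights):
    """One loss-scaled update (conceptual): the backward pass produced
    gradients multiplied by loss_scale; unscale them, check for
    Inf/NaN (FP16 overflow), then either apply the FP32 master-weight
    update or skip the step and lower the scale."""
    grads = [g / loss_scale for g in scaled_grads]
    if any(math.isinf(g) or math.isnan(g) for g in grads):
        # Overflow detected: skip this step and halve the scale
        return weights, loss_scale / 2.0, False
    new_weights = [w - lr * g for w, g in zip(weights, grads)]
    return new_weights, loss_scale, True
```

A healthy step applies the unscaled update unchanged; an overflowing step is skipped, which is why occasional skipped iterations in AMP logs are normal.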
Mixed-Precision Performance Comparison
#!/usr/bin/env python3
# amp_benchmark.py - compare FP32 vs mixed-precision training throughput
import torch
import torch.nn as nn
import time
from torchvision.models import resnet50

def benchmark_training(model, device, batch_size, iterations, use_amp=False):
    """Benchmark one training configuration."""
    model.train()
    model = model.to(device)
    # Synthetic data
    images = torch.randn(batch_size, 3, 224, 224, device=device)
    targets = torch.randint(0, 1000, (batch_size,), device=device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # Mixed-precision scaler
    scaler = torch.cuda.amp.GradScaler() if use_amp else None
    # Warm-up
    for _ in range(10):
        optimizer.zero_grad()
        if use_amp:
            with torch.cuda.amp.autocast():
                outputs = model(images)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(images)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
    torch.cuda.synchronize()
    # Timed run
    start_time = time.time()
    for _ in range(iterations):
        optimizer.zero_grad()
        if use_amp:
            with torch.cuda.amp.autocast():
                outputs = model(images)
                loss = criterion(outputs, targets)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            outputs = model(images)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
    torch.cuda.synchronize()
    elapsed = time.time() - start_time
    images_per_sec = (batch_size * iterations) / elapsed
    return elapsed, images_per_sec

def main():
    device = torch.device('cuda')
    print(f"GPU: {torch.cuda.get_device_name(device)}")
    print()
    batch_sizes = [64, 128, 256, 512]
    iterations = 100
    print("=" * 70)
    print("Mixed-Precision Performance Comparison (ResNet50)")
    print("=" * 70)
    print()
    for batch_size in batch_sizes:
        print(f"Batch Size: {batch_size}")
        print("-" * 70)
        # FP32
        model_fp32 = resnet50()
        time_fp32, ips_fp32 = benchmark_training(
            model_fp32, device, batch_size, iterations, use_amp=False
        )
        print(f"  FP32: {time_fp32:.2f}s, {ips_fp32:.0f} img/s")
        # AMP (FP16)
        model_amp = resnet50()
        time_amp, ips_amp = benchmark_training(
            model_amp, device, batch_size, iterations, use_amp=True
        )
        print(f"  AMP:  {time_amp:.2f}s, {ips_amp:.0f} img/s")
        # Speedup
        speedup = ips_amp / ips_fp32
        print(f"  Speedup: {speedup:.2f}x ({(speedup-1)*100:.0f}%)")
        print()
    print("=" * 70)

if __name__ == "__main__":
    main()
Running the Comparison
# Run the mixed-precision comparison
python3 amp_benchmark.py
# Example output:
# GPU: NVIDIA A100-SXM4-80GB
#
# ======================================================================
# Mixed-Precision Performance Comparison (ResNet50)
# ======================================================================
#
# Batch Size: 64
# ----------------------------------------------------------------------
#   FP32: 12.34s, 519 img/s
#   AMP:  6.78s, 944 img/s
#   Speedup: 1.82x (82%)
#
# Batch Size: 128
# ----------------------------------------------------------------------
#   FP32: 6.45s, 992 img/s
#   AMP:  3.52s, 1818 img/s
#   Speedup: 1.83x (83%)
#
# Batch Size: 256
# ----------------------------------------------------------------------
#   FP32: 3.34s, 1916 img/s
#   AMP:  1.82s, 3516 img/s
#   Speedup: 1.84x (84%)
#
# Batch Size: 512
# ----------------------------------------------------------------------
#   FP32: 1.72s, 3721 img/s
#   AMP:  0.95s, 6737 img/s
#   Speedup: 1.81x (81%)
#
# ======================================================================
Bottleneck Analysis
Using the PyTorch Profiler
#!/usr/bin/env python3
# training_profiler.py - profile training performance
import torch
import torch.nn as nn
from torch.profiler import profile, record_function, ProfilerActivity
from torchvision.models import resnet50

def profile_training():
    """Profile the training loop."""
    device = torch.device('cuda')
    model = resnet50().to(device)
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    # Synthetic data
    images = torch.randn(256, 3, 224, 224, device=device)
    targets = torch.randint(0, 1000, (256,), device=device)
    # Profiling
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('./log/resnet50'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True
    ) as prof:
        for step in range(10):
            with record_function("model_forward"):
                outputs = model(images)
                loss = criterion(outputs, targets)
            with record_function("model_backward"):
                loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            prof.step()
    # Print a summary
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    # Export the trace
    prof.export_chrome_trace("trace.json")
    print("Trace exported to trace.json")
    print("View it in Chrome at chrome://tracing")

if __name__ == "__main__":
    profile_training()
Bottleneck Diagnosis
┌─────────────────────────────────────────────────────┐
│          Diagnosing Training Bottlenecks            │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Low GPU utilization (<80%):                        │
│  ├── Likely cause: CPU / data-loading bottleneck    │
│  ├── Diagnose: check CPU usage with top/htop        │
│  ├── Fix: more workers, use NVIDIA DALI             │
│  └── Tune: prefetch_factor, pin_memory              │
│                                                     │
│  Memory-bandwidth bound:                            │
│  ├── Likely cause: frequent device-memory traffic   │
│  ├── Diagnose: memory transactions via nsys profile │
│  ├── Fix: eliminate unnecessary copies              │
│  └── Tune: fuse ops, use in-place operations        │
│                                                     │
│  Communication bound (multi-GPU):                   │
│  ├── Likely cause: slow gradient synchronization    │
│  ├── Diagnose: NCCL_DEBUG to see comm time          │
│  ├── Fix: gradient accumulation, sync less often    │
│  └── Tune: gradient compression, overlap comm/compute │
│                                                     │
│  Compute bound:                                     │
│  ├── Likely cause: compute-intensive model          │
│  ├── Diagnose: kernel times via nsys                │
│  ├── Fix: mixed-precision training                  │
│  └── Tune: operator fusion, optimized libraries     │
│                                                     │
└─────────────────────────────────────────────────────┘
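One way to act on the first entry of the box above: time the data-wait and compute portions of each iteration separately and compare their averages. A framework-free sketch (the 25% threshold, the labels, and the function name are our choices, not a standard):

```python
def classify_bottleneck(data_times, compute_times, threshold=0.25):
    """Rough classification: if the average data-loading wait exceeds
    `threshold` of the total per-iteration time, the input pipeline is
    the likely bottleneck; otherwise GPU compute dominates."""
    data = sum(data_times) / len(data_times)
    comp = sum(compute_times) / len(compute_times)
    ratio = data / (data + comp)
    kind = 'input-pipeline-bound' if ratio > threshold else 'compute-bound'
    return kind, ratio

# e.g. waiting 50 ms for data vs 50 ms of compute -> input pipeline bound
print(classify_bottleneck([0.05] * 10, [0.05] * 10))
```

In a real loop, `data_times` would be measured around the `for images, targets in loader:` fetch and `compute_times` around the forward/backward step (with a `torch.cuda.synchronize()` before each timestamp so GPU work is actually counted).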
Hands-On: A Complete Training Benchmark
One-Shot Test Script
#!/bin/bash
# resnet50_benchmark.sh - complete ResNet50 training benchmark
set -e
echo "=========================================="
echo " ResNet50 training benchmark"
echo "=========================================="
RESULTS_DIR="results/resnet50_$(date +%Y%m%d_%H%M%S)"
mkdir -p $RESULTS_DIR
NUM_GPUS=$(nvidia-smi -L | wc -l)
BATCH_SIZE=256
echo ""
echo "Test configuration:"
echo "  GPU count: $NUM_GPUS"
echo "  Per-GPU batch size: $BATCH_SIZE"
echo "  Global batch size: $((BATCH_SIZE * NUM_GPUS))"
echo "  Results directory: $RESULTS_DIR"
echo ""
# 1. Environment check
echo "=========================================="
echo "[1/5] Environment check"
echo "=========================================="
python3 -c "
import torch
print(f'PyTorch: {torch.__version__}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPU count: {torch.cuda.device_count()}')
print(f'GPU model: {torch.cuda.get_device_name(0)}')
" | tee $RESULTS_DIR/env_info.txt
# 2. Model info
echo ""
echo "=========================================="
echo "[2/5] Model info"
echo "=========================================="
python3 -c "
from torchvision.models import resnet50
model = resnet50()
params = sum(p.numel() for p in model.parameters())
print(f'Parameters: {params/1e6:.2f}M')
" | tee $RESULTS_DIR/model_info.txt
# 3. FP32 benchmark
echo ""
echo "=========================================="
echo "[3/5] FP32 benchmark"
echo "=========================================="
python3 amp_benchmark.py 2>&1 | tee $RESULTS_DIR/fp32_benchmark.log
# 4. Mixed-precision benchmark
echo ""
echo "=========================================="
echo "[4/5] Mixed-precision benchmark"
echo "=========================================="
# (already covered by amp_benchmark.py above)
# 5. Generate the report
echo ""
echo "=========================================="
echo "[5/5] Generating the report"
echo "=========================================="
cat << EOF > $RESULTS_DIR/benchmark_report.md
# ResNet50 Training Benchmark Report
**Date:** $(date)
**GPU configuration:** $NUM_GPUS x $(nvidia-smi --query-gpu=name --format=csv,noheader | head -1)
## Environment
\`\`\`
$(cat $RESULTS_DIR/env_info.txt)
\`\`\`
## Model
\`\`\`
$(cat $RESULTS_DIR/model_info.txt)
\`\`\`
## Results
See fp32_benchmark.log
## Recommendations
Adjust the training configuration based on these results
EOF
echo "Report written to $RESULTS_DIR/benchmark_report.md"
echo ""
echo "=========================================="
echo " Benchmark complete"
echo "=========================================="
echo ""
echo "Results directory: $RESULTS_DIR"
ls -la $RESULTS_DIR
Troubleshooting Common Problems
Slow Training
# Problem: training speed is far below expectations
# 1. Check GPU utilization
nvidia-smi dmon -s pucvmet -d 1 -c 10
# 2. Check CPU usage
top -bn1 | grep "Cpu(s)"
# 3. Check data loading
#    Increase the number of workers, e.g. --workers 16
# 4. Check storage I/O
iostat -x 1 5
# 5. Enable mixed precision
#    Add the --amp flag
# 6. Check NCCL
export NCCL_DEBUG=INFO
Out of GPU Memory (OOM)
# Problem: CUDA out of memory
# 1. Reduce the batch size
#    --batch-size 128
# 2. Use gradient accumulation
#    Emulates a larger batch size
# 3. Enable mixed precision
#    --amp (saves about 50% of GPU memory)
# 4. Check for memory leaks
#    Watch usage with nvidia-smi
# 5. Clear the cache (in Python)
#    torch.cuda.empty_cache()
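Gradient accumulation (item 2 above) trades iterations for effective batch size: accumulate normalized gradients over N micro-batches, then apply a single update, so peak memory stays at one micro-batch. A minimal scalar sketch, not tied to PyTorch (all names are illustrative):

```python
def train_with_accumulation(micro_batches, accum_steps, lr, w, grad_fn):
    """Accumulate gradients over accum_steps micro-batches, then apply
    one update - the optimizer sees an effective batch accum_steps
    times larger while memory holds only one micro-batch at a time."""
    acc = 0.0
    for i, xb in enumerate(micro_batches, 1):
        # Normalize each micro-batch gradient so the summed update
        # matches what one large batch would have produced
        acc += grad_fn(w, xb) / accum_steps
        if i % accum_steps == 0:
            w -= lr * acc
            acc = 0.0
    return w
```

In PyTorch the same pattern is `loss = loss / accum_steps; loss.backward()` each micro-batch, with `optimizer.step()` and `optimizer.zero_grad()` only every `accum_steps` iterations.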
Multi-GPU Training Does Not Scale
# Problem: poor multi-GPU scaling efficiency
# 1. Check the NCCL configuration
export NCCL_DEBUG=INFO
# 2. Check the GPU topology
nvidia-smi topo -m
# 3. Check the interconnect (InfiniBand, if present)
ibstat
# 4. Enable synchronized BatchNorm
#    --sync-bn
# 5. Reduce gradient-sync frequency
#    via gradient accumulation
Summary
What We Covered Today
- ✅ The ResNet50 model: architecture, parameter count, FLOPs
- ✅ Training environment setup: software, dataset, configuration
- ✅ Single-node multi-GPU training: distributed training script and launch
- ✅ Mixed-precision training: principles, benchmarks, 2-3x speedup
- ✅ Bottleneck analysis: PyTorch Profiler and diagnosis methods
- ✅ Hands-on: a complete benchmark script
Next Up
Tomorrow, in Day 12 - ResNet50 Inference Testing, we will cover:
- Inference engine comparison (TensorRT, ONNX Runtime)
- Batching and latency optimization
- Quantization tests (INT8/FP16)
- Throughput benchmarks