Qwen-Image-2512-Pixel-Art-LoRA Deployment Guide: GPU Temperature Monitoring and Overheat Protection

1. Introduction

If you are generating pixel art with the Qwen-Image-2512-Pixel-Art-LoRA model, you have probably hit a common problem: during long, high-resolution generation sessions the GPU temperature spikes, the fans spin flat out, and thermal throttling kicks in, slowing generation down. It is like running a high-performance sports car at full speed through a summer heatwave with no coolant gauge and no radiator.

In this article I will share a complete GPU temperature monitoring and overheat protection setup. This is not abstract theory; it is a workflow I have validated in practice many times. With it you can:

  • Real-time monitoring: check GPU temperature, power draw, and VRAM usage at any time
  • Smart alerts: get notified before the temperature reaches a dangerous threshold
  • Automatic protection: reduce the load or pause tasks automatically when the GPU runs too hot
  • Performance tuning: find the best balance between temperature and throughput

Whether you are a solo creator or part of a development team, this setup will make your pixel art generation more stable and safer. Let's build this "GPU health guardian" step by step.

2. Why Monitor GPU Temperature?

2.1 The Risks of GPU Overheating

Before diving into configuration, it is worth understanding why GPU temperature deserves this much attention. Many people assume "graphics cards are built for heat, a few extra degrees won't hurt", but that is not how it plays out in practice.

Hardware damage risks

  • Sustained high temperature: operating above 80°C for long periods accelerates the aging of electronic components
  • Thermal stress: rapid temperature swings can crack solder joints and unseat chips
  • Fan wear: high temperatures force higher fan speeds, drastically shortening fan life

Performance impact

  • Automatic throttling: modern GPUs have thermal protection that lowers clock speeds past a threshold
  • Slower generation: after throttling, generation time can stretch by 30%-50%
  • System instability: in extreme cases this can lead to driver crashes or system reboots

A real-world case: I worked with a game developer who used an RTX 4090 to generate 1280×1280 pixel scenes continuously. After 3 hours the GPU hit 88°C, and generation time per image stretched from 15 seconds to 25 seconds. On inspection, the card had hit its thermal limit and dropped from its default 2520MHz down to 2100MHz.
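You can verify this kind of throttling directly. NVML exposes a bitmask of active throttle reasons, queryable live via `pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)`. The sketch below decodes that bitmask; the flag values are taken from NVIDIA's `nvml.h` and should be treated as an assumption to verify against your installed headers or the `pynvml` constants.

```python
# Throttle-reason bit flags as defined in NVIDIA's nvml.h (assumed values;
# verify against your driver's headers or pynvml's exported constants).
THROTTLE_REASONS = {
    0x01: "GpuIdle",
    0x02: "ApplicationsClocksSetting",
    0x04: "SwPowerCap",
    0x08: "HwSlowdown",
    0x10: "SyncBoost",
    0x20: "SwThermalSlowdown",   # driver-initiated thermal throttling
    0x40: "HwThermalSlowdown",   # hardware thermal limit (the "temperature wall")
    0x80: "HwPowerBrakeSlowdown",
}

def decode_throttle_reasons(mask: int) -> list[str]:
    """Turn an NVML throttle-reason bitmask into human-readable names."""
    return [name for bit, name in sorted(THROTTLE_REASONS.items()) if mask & bit]

# Live usage (requires an NVIDIA GPU and pynvml):
#   handle = pynvml.nvmlDeviceGetHandleByIndex(0)
#   mask = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)
#   print(decode_throttle_reasons(mask))
```

If the decoded list contains `SwThermalSlowdown` or `HwThermalSlowdown`, a clock drop like the 2520MHz to 2100MHz one above is thermal rather than power-related.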

2.2 What Makes Qwen-Image-2512-Pixel-Art-LoRA Different

This pixel art model puts more stress on the GPU than a typical image generation model, for three reasons:

  1. High resolution requirements: pixel art needs crisp, sharp edges, which often means 1024×1024 or higher
  2. LoRA weight loading: the extra 1.1GB of LoRA weights adds VRAM and compute pressure
  3. Long continuous sessions: the creative workflow is batch generation with repeated tweaking, keeping the GPU under sustained load

With these risks in mind, temperature monitoring is clearly not a "nice to have" but a necessary safeguard for both creative throughput and hardware safety.

3. Environment Setup and Basic Monitoring

3.1 Check Your GPU's Current State

Before configuring anything, let's see what state your GPU is in right now. Open a terminal and run:

# Install the required monitoring library
pip install pynvml

# Create a simple status script
cat > gpu_status.py << 'EOF'
import pynvml

def get_gpu_status():
    pynvml.nvmlInit()
    
    try:
        device_count = pynvml.nvmlDeviceGetCount()
        print(f"Detected {device_count} GPU device(s)")
        print("=" * 50)
        
        for i in range(device_count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            
            # Device name (bytes in older pynvml releases, str in newer ones)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):
                name = name.decode('utf-8')
            
            # Temperature
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            
            # Power draw
            power = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # convert to watts
            power_limit = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
            
            # VRAM usage
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            mem_used = mem_info.used / 1024**3  # convert to GB
            mem_total = mem_info.total / 1024**3
            
            # Utilization
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            
            print(f"GPU {i}: {name}")
            print(f"  Temperature: {temp}°C")
            print(f"  Power: {power:.1f}W / {power_limit:.1f}W")
            print(f"  VRAM: {mem_used:.1f}GB / {mem_total:.1f}GB")
            print(f"  Utilization: GPU {util.gpu}%, memory {util.memory}%")
            print("-" * 30)
            
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    get_gpu_status()
EOF

# Run the status script
python gpu_status.py

This script prints your GPU's current state. At idle, the temperature should sit between 40-60°C. If your idle temperature already exceeds 70°C, something is likely wrong with your cooling.
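It is also worth cross-checking pynvml readings against `nvidia-smi` itself. A minimal sketch: query `nvidia-smi` in CSV mode and parse the result. The query fields are standard `nvidia-smi` options; the parsing helper is my own.

```python
import subprocess

def parse_smi_line(line: str) -> dict:
    """Parse one CSV line of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`."""
    temp, power, mem = [field.strip() for field in line.split(",")]
    return {"temperature_c": int(temp), "power_w": float(power), "memory_used_mib": int(mem)}

def query_nvidia_smi() -> list[dict]:
    """Return temperature, power, and memory for each GPU, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=temperature.gpu,power.draw,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_smi_line(line) for line in out.strip().splitlines()]

# Example: parse_smi_line("46, 61.38, 3021")
# -> {"temperature_c": 46, "power_w": 61.38, "memory_used_mib": 3021}
```

If the two tools disagree by more than a degree or two, check that both are talking to the same device index.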

3.2 Install System-Level Monitoring Tools

Beyond Python scripts, we also want system-level monitoring tools, which are more stable and more full-featured.

On Ubuntu/Debian

# Install nvtop (an htop-style GPU monitor)
sudo apt update
sudo apt install nvtop -y

# Install gpustat (a lightweight monitor)
pip install gpustat

# Install temperature sensor tools
sudo apt install lm-sensors -y
sudo sensors-detect  # accept the default answers at the prompts
sudo service kmod start

On CentOS/RHEL

# Enable the EPEL repository
sudo yum install epel-release -y

# Install the monitoring tools
sudo yum install nvtop lm_sensors -y
sudo sensors-detect
sudo systemctl start lm_sensors

Once installed, you can monitor in real time with:

# nvtop (interactive UI, press q to quit)
nvtop

# gpustat (compact output)
gpustat -i 1  # refresh every second

# read sensor temperatures
sensors

The basic monitoring environment is now in place. But manual monitoring is tedious; we need automation.

4. Building a Real-Time Temperature Monitoring System

4.1 Create an Automated Monitoring Script

Checking temperatures by hand is not practical; we need a monitoring service that runs in the background. Create a full monitoring script:

# Create a directory for the monitor
mkdir -p ~/gpu_monitor
cd ~/gpu_monitor

# Create the main monitoring script
cat > monitor_gpu.py << 'EOF'
#!/usr/bin/env python3
"""
GPU temperature monitoring and alerting script.
Monitors GPU state in real time; raises alerts and takes protective
action when the temperature gets too high.
"""

import pynvml
import time
import logging
import json
from datetime import datetime
import smtplib
from email.mime.text import MIMEText
from threading import Thread
import subprocess
import os

class GPUMonitor:
    def __init__(self, config_file="config.json"):
        """Initialize the monitor."""
        pynvml.nvmlInit()
        
        # Load configuration
        self.config = self.load_config(config_file)
        
        # Set up logging
        self.setup_logging()
        
        # Count GPUs
        self.device_count = pynvml.nvmlDeviceGetCount()
        self.logger.info(f"GPU monitor initialized, detected {self.device_count} device(s)")
        
        # Monitoring state
        self.monitoring = True
        self.alert_sent = {i: False for i in range(self.device_count)}
        
    def load_config(self, config_file):
        """Load the configuration file."""
        default_config = {
            "check_interval": 2,  # polling interval (seconds)
            "temperature_thresholds": {
                "warning": 75,     # warning temperature (°C)
                "critical": 82,    # critical temperature (°C)
                "shutdown": 88     # shutdown-protection temperature (°C)
            },
            "power_threshold": 0.9,  # power threshold (fraction of limit)
            "memory_threshold": 0.9, # VRAM threshold (fraction of total)
            "log_file": "gpu_monitor.log",
            "data_file": "gpu_history.json",
            "enable_email_alert": False,
            "email_config": {
                "smtp_server": "smtp.gmail.com",
                "smtp_port": 587,
                "sender_email": "",
                "sender_password": "",
                "receiver_email": ""
            },
            "enable_auto_protection": True,
            "protection_actions": {
                "warning": "reduce_load",       # reduce load on warning
                "critical": "pause_generation", # pause generation on critical
                "shutdown": "kill_process"      # kill processes on overtemp
            }
        }
        
        # Load the config file if it exists
        if os.path.exists(config_file):
            with open(config_file, 'r') as f:
                user_config = json.load(f)
                default_config.update(user_config)
        
        # Write the merged config back (so it is easy to edit later)
        with open(config_file, 'w') as f:
            json.dump(default_config, f, indent=2)
            
        return default_config
    
    def setup_logging(self):
        """Configure logging."""
        log_format = '%(asctime)s - %(levelname)s - %(message)s'
        logging.basicConfig(
            level=logging.INFO,
            format=log_format,
            handlers=[
                logging.FileHandler(self.config["log_file"]),
                logging.StreamHandler()
            ]
        )
        self.logger = logging.getLogger(__name__)
    
    def get_gpu_status(self, device_index):
        """Fetch the status of a single GPU."""
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
            
            # Device name (bytes in older pynvml releases, str in newer ones)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):
                name = name.decode('utf-8')
            
            # Temperature
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            
            # Power
            power_usage = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            power_limit = pynvml.nvmlDeviceGetPowerManagementLimit(handle) / 1000.0
            power_percent = (power_usage / power_limit) * 100 if power_limit > 0 else 0
            
            # VRAM
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            mem_used = mem_info.used / 1024**3
            mem_total = mem_info.total / 1024**3
            mem_percent = (mem_used / mem_total) * 100
            
            # Utilization
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            
            # Fan speed
            fan_speed = pynvml.nvmlDeviceGetFanSpeed(handle)
            
            # Clock frequencies
            clock_graphics = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
            clock_memory = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_MEM)
            
            return {
                "name": name,
                "temperature": temp,
                "power_usage": power_usage,
                "power_limit": power_limit,
                "power_percent": power_percent,
                "memory_used": mem_used,
                "memory_total": mem_total,
                "memory_percent": mem_percent,
                "gpu_utilization": util.gpu,
                "memory_utilization": util.memory,
                "fan_speed": fan_speed,
                "clock_graphics": clock_graphics,
                "clock_memory": clock_memory,
                "timestamp": datetime.now().isoformat()
            }
            
        except Exception as e:
            self.logger.error(f"Failed to read status of GPU {device_index}: {e}")
            return None
    
    def check_thresholds(self, status):
        """Check whether any thresholds are exceeded."""
        alerts = []
        
        # Temperature
        temp = status["temperature"]
        thresholds = self.config["temperature_thresholds"]
        
        if temp >= thresholds["shutdown"]:
            alerts.append(("shutdown", f"Temperature critically high: {temp}°C ≥ {thresholds['shutdown']}°C"))
        elif temp >= thresholds["critical"]:
            alerts.append(("critical", f"Temperature too high: {temp}°C ≥ {thresholds['critical']}°C"))
        elif temp >= thresholds["warning"]:
            alerts.append(("warning", f"Temperature warning: {temp}°C ≥ {thresholds['warning']}°C"))
        
        # Power
        if status["power_percent"] >= self.config["power_threshold"] * 100:
            alerts.append(("warning", f"Power draw too high: {status['power_percent']:.1f}%"))
            
        # VRAM
        if status["memory_percent"] >= self.config["memory_threshold"] * 100:
            alerts.append(("warning", f"VRAM usage too high: {status['memory_percent']:.1f}%"))
            
        return alerts
    
    def take_protection_action(self, device_index, alert_level, message):
        """Take a protective action."""
        if not self.config["enable_auto_protection"]:
            return
            
        actions = self.config["protection_actions"]
        
        if alert_level in actions:
            action = actions[alert_level]
            self.logger.warning(f"GPU {device_index}: {message}, taking protective action: {action}")
            
            if action == "reduce_load":
                # Reduce the generation load (smaller batches or lower resolution)
                self.reduce_generation_load()
                
            elif action == "pause_generation":
                # Pause generation tasks
                self.pause_generation()
                
            elif action == "kill_process":
                # Kill the offending processes
                self.kill_related_processes()
    
    def reduce_generation_load(self):
        """Reduce the generation load."""
        # Hook this into your generation script, e.g. by lowering the batch
        # size, reducing resolution, or spacing generations further apart.
        self.logger.info("Reducing generation load...")
        # The concrete implementation depends on your application architecture
    
    def pause_generation(self):
        """Pause generation tasks."""
        self.logger.info("Pausing all generation tasks...")
        # Send a pause signal to the generation process.
        # The concrete implementation depends on your application architecture
    
    def kill_related_processes(self):
        """Kill related processes."""
        self.logger.info("Killing high-risk processes...")
        # Find and kill the processes occupying the GPU
        try:
            # Find the compute processes currently using the GPU
            result = subprocess.run(
                ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"],
                capture_output=True, text=True
            )
            
            if result.stdout:
                pids = set(result.stdout.strip().split('\n'))
                for pid in pids:
                    if pid:
                        self.logger.warning(f"Killing process PID: {pid}")
                        subprocess.run(["kill", "-9", pid])
        except Exception as e:
            self.logger.error(f"Failed to kill process: {e}")
    
    def send_email_alert(self, device_index, alerts):
        """Send an email alert."""
        if not self.config["enable_email_alert"]:
            return
            
        email_config = self.config["email_config"]
        if not all([email_config["sender_email"], email_config["sender_password"], email_config["receiver_email"]]):
            return
            
        try:
            subject = f"GPU monitor alert - device {device_index}"
            body = f"GPU {device_index} raised the following alerts:\n\n"
            for level, msg in alerts:
                body += f"- {level.upper()}: {msg}\n"
            
            body += f"\nTime: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"
            
            msg = MIMEText(body)
            msg['Subject'] = subject
            msg['From'] = email_config["sender_email"]
            msg['To'] = email_config["receiver_email"]
            
            with smtplib.SMTP(email_config["smtp_server"], email_config["smtp_port"]) as server:
                server.starttls()
                server.login(email_config["sender_email"], email_config["sender_password"])
                server.send_message(msg)
                
            self.logger.info(f"Email alert sent: {alerts}")
            
        except Exception as e:
            self.logger.error(f"Failed to send email: {e}")
    
    def save_history(self, status_list):
        """Persist history data."""
        try:
            history = []
            if os.path.exists(self.config["data_file"]):
                with open(self.config["data_file"], 'r') as f:
                    history = json.load(f)
            
            # Keep only the most recent 1000 records
            history.extend(status_list)
            if len(history) > 1000:
                history = history[-1000:]
                
            with open(self.config["data_file"], 'w') as f:
                json.dump(history, f, indent=2)
                
        except Exception as e:
            self.logger.error(f"Failed to save history data: {e}")
    
    def monitor_loop(self):
        """Main monitoring loop."""
        self.logger.info("Starting GPU monitoring...")
        
        while self.monitoring:
            try:
                status_list = []
                current_time = datetime.now()
                
                for i in range(self.device_count):
                    status = self.get_gpu_status(i)
                    if status:
                        status_list.append(status)
                        
                        # Check thresholds
                        alerts = self.check_thresholds(status)
                        
                        # Log the status
                        log_msg = (f"GPU {i}: {status['name']} | "
                                 f"temp: {status['temperature']}°C | "
                                 f"power: {status['power_usage']:.1f}W | "
                                 f"VRAM: {status['memory_used']:.1f}GB/{status['memory_total']:.1f}GB")
                        self.logger.info(log_msg)
                        
                        # Handle alerts
                        if alerts:
                            for alert_level, message in alerts:
                                self.logger.warning(f"GPU {i} alert: {message}")
                                
                                # Send an email alert (only on the first alert)
                                if not self.alert_sent[i]:
                                    self.send_email_alert(i, alerts)
                                    self.alert_sent[i] = True
                                
                                # Take protective action
                                self.take_protection_action(i, alert_level, message)
                        else:
                            # Back to normal, reset the alert flag
                            self.alert_sent[i] = False
                
                # Save history roughly once a minute
                if current_time.second < 2:  # save during the first 2 seconds of each minute
                    self.save_history(status_list)
                
                # Wait until the next check
                time.sleep(self.config["check_interval"])
                
            except KeyboardInterrupt:
                self.logger.info("Stop signal received, exiting...")
                self.monitoring = False
            except Exception as e:
                self.logger.error(f"Error in monitoring loop: {e}")
                time.sleep(5)
    
    def stop(self):
        """Stop monitoring."""
        self.monitoring = False
        pynvml.nvmlShutdown()
        self.logger.info("GPU monitoring stopped")

def main():
    """Entry point."""
    monitor = GPUMonitor()
    
    try:
        # Start the monitoring thread
        monitor_thread = Thread(target=monitor.monitor_loop)
        monitor_thread.daemon = True
        monitor_thread.start()
        
        # Keep the main thread alive
        while monitor.monitoring:
            time.sleep(1)
            
    except KeyboardInterrupt:
        monitor.stop()
    except Exception as e:
        monitor.logger.error(f"Main program error: {e}")
        monitor.stop()

if __name__ == "__main__":
    main()
EOF

# Create the configuration file
cat > config.json << 'EOF'
{
  "check_interval": 2,
  "temperature_thresholds": {
    "warning": 75,
    "critical": 82,
    "shutdown": 88
  },
  "power_threshold": 0.9,
  "memory_threshold": 0.9,
  "log_file": "gpu_monitor.log",
  "data_file": "gpu_history.json",
  "enable_email_alert": false,
  "email_config": {
    "smtp_server": "smtp.gmail.com",
    "smtp_port": 587,
    "sender_email": "",
    "sender_password": "",
    "receiver_email": ""
  },
  "enable_auto_protection": true,
  "protection_actions": {
    "warning": "reduce_load",
    "critical": "pause_generation",
    "shutdown": "kill_process"
  }
}
EOF

# Make the script executable
chmod +x monitor_gpu.py

This monitoring script covers a lot of ground:

  1. Real-time monitoring: checks GPU state every 2 seconds
  2. Tiered alerts: warning (75°C), critical (82°C), shutdown protection (88°C)
  3. Automatic protection: takes protective action when the temperature climbs too high
  4. History logging: saves monitoring data for later analysis
  5. Email notifications: supports email alerts (configuration required)
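The tiered-alert logic boils down to comparing the reading against the thresholds in descending order of severity. A standalone sketch of that classification, mirroring the `warning`/`critical`/`shutdown` keys from the script's config:

```python
# Default thresholds, matching config.json above
DEFAULT_THRESHOLDS = {"warning": 75, "critical": 82, "shutdown": 88}

def classify_temperature(temp: float, thresholds: dict = DEFAULT_THRESHOLDS) -> str:
    """Map a GPU temperature to an alert level, checking the most severe tier first."""
    for level in ("shutdown", "critical", "warning"):
        if temp >= thresholds[level]:
            return level
    return "ok"

# classify_temperature(60) -> "ok"
# classify_temperature(76) -> "warning"
# classify_temperature(90) -> "shutdown"
```

Checking from most to least severe matters: a 90°C reading exceeds all three thresholds, and it must be reported as "shutdown", not "warning".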

4.2 Start the Monitoring Service

Now let's start the monitor and register it as a system service so it runs in the background automatically:

# Create the systemd service unit
sudo tee /etc/systemd/system/gpu-monitor.service << 'EOF'
[Unit]
Description=GPU Temperature Monitor Service
After=network.target

[Service]
Type=simple
User=$USER
WorkingDirectory=/home/$USER/gpu_monitor
ExecStart=/usr/bin/python3 /home/$USER/gpu_monitor/monitor_gpu.py
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

# Substitute your username
sudo sed -i "s/\$USER/$USER/g" /etc/systemd/system/gpu-monitor.service

# Reload the systemd configuration
sudo systemctl daemon-reload

# Start the service
sudo systemctl start gpu-monitor

# Enable it at boot
sudo systemctl enable gpu-monitor

# Check the service status
sudo systemctl status gpu-monitor

# Follow the monitoring log
tail -f /home/$USER/gpu_monitor/gpu_monitor.log

Once the service is running, you can check on the GPU at any time:

# Check the live status
sudo systemctl status gpu-monitor

# Follow the monitoring log
tail -f ~/gpu_monitor/gpu_monitor.log

# Stop the service
sudo systemctl stop gpu-monitor

# Restart the service
sudo systemctl restart gpu-monitor

4.3 Create a Web Monitoring Dashboard

If you would rather watch GPU state through a web interface, you can build a simple Flask app:

# Install Flask
pip install flask

# Create the web app
cat > web_dashboard.py << 'EOF'
cat > web_dashboard.py << 'EOF'
from flask import Flask, render_template_string, jsonify
import json
from datetime import datetime, timedelta
import os

app = Flask(__name__)

# Path to the monitoring data file
DATA_FILE = "gpu_history.json"

def load_gpu_history():
    """Load the GPU history data."""
    if not os.path.exists(DATA_FILE):
        return []
    
    with open(DATA_FILE, 'r') as f:
        return json.load(f)

def get_recent_data(hours=1):
    """Return data from the last N hours."""
    history = load_gpu_history()
    if not history:
        return []
    
    cutoff_time = datetime.now() - timedelta(hours=hours)
    
    recent_data = []
    for entry in history[-500:]:  # consider at most 500 records
        entry_time = datetime.fromisoformat(entry['timestamp'].replace('Z', '+00:00'))
        if entry_time > cutoff_time:
            recent_data.append(entry)
    
    return recent_data

def calculate_statistics(data):
    """Compute summary statistics."""
    if not data:
        return {}
    
    temps = [d['temperature'] for d in data]
    powers = [d['power_usage'] for d in data]
    memories = [d['memory_used'] for d in data]
    
    return {
        'avg_temp': sum(temps) / len(temps),
        'max_temp': max(temps),
        'min_temp': min(temps),
        'avg_power': sum(powers) / len(powers),
        'avg_memory': sum(memories) / len(memories),
        'data_points': len(data)
    }

@app.route('/')
def dashboard():
    """Render the monitoring dashboard."""
    recent_data = get_recent_data(hours=1)
    stats = calculate_statistics(recent_data)
    
    # A simple inline HTML template
    html_template = '''
    <!DOCTYPE html>
    <html>
    <head>
        <title>GPU Monitoring Dashboard</title>
        <meta charset="utf-8">
        <meta name="viewport" content="width=device-width, initial-scale=1">
        <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
        <style>
            body { font-family: Arial, sans-serif; margin: 20px; background: #f5f5f5; }
            .container { max-width: 1200px; margin: 0 auto; }
            .header { background: #2c3e50; color: white; padding: 20px; border-radius: 5px; }
            .stats-grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 20px; margin: 20px 0; }
            .stat-card { background: white; padding: 20px; border-radius: 5px; box-shadow: 0 2px 5px rgba(0,0,0,0.1); }
            .stat-value { font-size: 24px; font-weight: bold; color: #2c3e50; }
            .stat-label { color: #7f8c8d; margin-top: 5px; }
            .chart-container { background: white; padding: 20px; border-radius: 5px; box-shadow: 0 2px 5px rgba(0,0,0,0.1); margin: 20px 0; }
            .warning { color: #e74c3c; }
            .normal { color: #27ae60; }
            .info { color: #3498db; }
        </style>
    </head>
    <body>
        <div class="container">
            <div class="header">
                <h1>🔍 GPU Temperature Monitoring Dashboard</h1>
                <p>Live GPU status monitoring to prevent overheating</p>
            </div>
            
            <div class="stats-grid">
                <div class="stat-card">
                    <div class="stat-value {{ 'warning' if stats.avg_temp > 75 else 'normal' }}">
                        {{ "%.1f"|format(stats.avg_temp) }}°C
                    </div>
                    <div class="stat-label">Average temperature</div>
                </div>
                
                <div class="stat-card">
                    <div class="stat-value {{ 'warning' if stats.max_temp > 82 else 'normal' }}">
                        {{ "%.1f"|format(stats.max_temp) }}°C
                    </div>
                    <div class="stat-label">Peak temperature</div>
                </div>
                
                <div class="stat-card">
                    <div class="stat-value info">
                        {{ "%.1f"|format(stats.avg_power) }}W
                    </div>
                    <div class="stat-label">Average power</div>
                </div>
                
                <div class="stat-card">
                    <div class="stat-value info">
                        {{ "%.1f"|format(stats.avg_memory) }}GB
                    </div>
                    <div class="stat-label">Average VRAM usage</div>
                </div>
            </div>
            
            <div class="chart-container">
                <h2>Temperature Trend</h2>
                <canvas id="tempChart" height="100"></canvas>
            </div>
            
            <div class="chart-container">
                <h2>Power Trend</h2>
                <canvas id="powerChart" height="100"></canvas>
            </div>
            
            <div style="text-align: center; margin-top: 30px; color: #7f8c8d;">
                <p>Last updated: {{ update_time }}</p>
                <p>Data points: {{ stats.data_points }} | window: last hour</p>
            </div>
        </div>
        
        <script>
            // Fetch data from the backend
            fetch('/api/data')
                .then(response => response.json())
                .then(data => {
                    // Temperature chart
                    const tempCtx = document.getElementById('tempChart').getContext('2d');
                    new Chart(tempCtx, {
                        type: 'line',
                        data: {
                            labels: data.labels,
                            datasets: [{
                                label: 'GPU Temperature (°C)',
                                data: data.temperatures,
                                borderColor: '#e74c3c',
                                backgroundColor: 'rgba(231, 76, 60, 0.1)',
                                tension: 0.4
                            }]
                        },
                        options: {
                            responsive: true,
                            scales: {
                                y: {
                                    beginAtZero: false,
                                    title: {
                                        display: true,
                                        text: 'Temperature (°C)'
                                    }
                                }
                            }
                        }
                    });
                    
                    // Power chart
                    const powerCtx = document.getElementById('powerChart').getContext('2d');
                    new Chart(powerCtx, {
                        type: 'line',
                        data: {
                            labels: data.labels,
                            datasets: [{
                                label: 'GPU Power (W)',
                                data: data.powers,
                                borderColor: '#3498db',
                                backgroundColor: 'rgba(52, 152, 219, 0.1)',
                                tension: 0.4
                            }]
                        },
                        options: {
                            responsive: true,
                            scales: {
                                y: {
                                    beginAtZero: true,
                                    title: {
                                        display: true,
                                        text: 'Power (W)'
                                    }
                                }
                            }
                        }
                    });
                });
        </script>
    </body>
    </html>
    '''
    
    return render_template_string(
        html_template,
        stats=stats,
        update_time=datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    )

@app.route('/api/data')
def api_data():
    """Serve chart data for the dashboard."""
    recent_data = get_recent_data(hours=1)
    
    # Extract the series
    labels = []
    temperatures = []
    powers = []
    
    for entry in recent_data[-50:]:  # plot at most 50 points
        time_str = datetime.fromisoformat(
            entry['timestamp'].replace('Z', '+00:00')
        ).strftime('%H:%M:%S')
        labels.append(time_str)
        temperatures.append(entry['temperature'])
        powers.append(entry['power_usage'])
    
    return jsonify({
        'labels': labels,
        'temperatures': temperatures,
        'powers': powers
    })

@app.route('/api/status')
def api_status():
    """Return the current status as JSON."""
    recent_data = get_recent_data(hours=1)
    if not recent_data:
        return jsonify({'status': 'no_data'})
    
    latest = recent_data[-1]
    stats = calculate_statistics(recent_data)
    
    return jsonify({
        'current': latest,
        'stats': stats,
        'timestamp': datetime.now().isoformat()
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)
EOF

# Create the web dashboard service unit
sudo tee /etc/systemd/system/gpu-web-dashboard.service << 'EOF'
[Unit]
Description=GPU Web Dashboard Service
After=network.target gpu-monitor.service
Requires=gpu-monitor.service

[Service]
Type=simple
User=$USER
WorkingDirectory=/home/$USER/gpu_monitor
ExecStart=/usr/bin/python3 /home/$USER/gpu_monitor/web_dashboard.py
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
EOF

# Substitute your username
sudo sed -i "s/\$USER/$USER/g" /etc/systemd/system/gpu-web-dashboard.service

# Start the web service
sudo systemctl daemon-reload
sudo systemctl start gpu-web-dashboard
sudo systemctl enable gpu-web-dashboard

# Check the web service status
sudo systemctl status gpu-web-dashboard

You can now open http://<your-server-ip>:5000 in a browser to view the GPU monitoring dashboard.
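If you want to consume the dashboard's `/api/status` endpoint from scripts or other tools, a minimal client sketch might look like this. The endpoint and JSON shape come from `web_dashboard.py` above; the `summarize_status` helper and its message format are my own additions.

```python
import json
import urllib.request

def summarize_status(payload: dict) -> str:
    """Turn one /api/status JSON payload into a one-line summary."""
    if payload.get("status") == "no_data":
        return "no monitoring data yet"
    current = payload["current"]
    stats = payload["stats"]
    return (f"{current['name']}: {current['temperature']}°C now, "
            f"avg {stats['avg_temp']:.1f}°C, peak {stats['max_temp']:.1f}°C")

def fetch_status(base_url: str = "http://localhost:5000") -> str:
    """Fetch /api/status from the running dashboard and summarize it."""
    with urllib.request.urlopen(f"{base_url}/api/status") as resp:
        return summarize_status(json.load(resp))

# Example payload shape, matching what /api/status returns:
sample = {
    "current": {"name": "RTX 4090", "temperature": 71},
    "stats": {"avg_temp": 68.4, "max_temp": 79.0},
}
```

A one-liner like this is handy in cron jobs or chat-bot integrations that do not need the full web UI.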

5. Integrating with Qwen-Image-2512-Pixel-Art-LoRA

5.1 Add Temperature Protection to the Generation Script

Now let's wire temperature monitoring into the actual pixel art generation workflow. Modify your generation script to check the GPU temperature before generating:

# Create a generation script with temperature protection
cat > generate_with_protection.py << 'EOF'
"""
Pixel art generation script with GPU temperature protection.
Checks the GPU temperature before generating; automatically adjusts
parameters or waits when the GPU is too hot.
"""

import torch
from diffusers import StableDiffusionPipeline
import pynvml
import time
import logging
from datetime import datetime

class SafePixelArtGenerator:
    def __init__(self, model_path, lora_path, device="cuda"):
        """Initialize the generator."""
        self.device = device
        self.logger = self.setup_logging()
        
        # Initialize GPU monitoring
        pynvml.nvmlInit()
        self.gpu_handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        
        # Temperature threshold configuration
        self.temp_config = {
            "safe_max": 75,           # upper bound of the safe range
            "warning_threshold": 80,  # warning temperature
            "critical_threshold": 85, # critical temperature
            "shutdown_threshold": 90  # shutdown temperature
        }
        
        # Load the model
        self.logger.info("Loading model...")
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            safety_checker=None
        ).to(device)
        
        # Load the LoRA weights
        self.pipe.load_lora_weights(lora_path)
        self.logger.info("Model loaded")
        
        # Generation statistics
        self.stats = {
            "total_generations": 0,
            "total_time": 0,
            "temp_checks": 0,
            "temp_warnings": 0,
            "delays": 0
        }
    
    def setup_logging(self):
        """Configure logging."""
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler('generation.log'),
                logging.StreamHandler()
            ]
        )
        return logging.getLogger(__name__)
    
    def check_gpu_temperature(self):
        """Check the GPU temperature."""
        try:
            temp = pynvml.nvmlDeviceGetTemperature(
                self.gpu_handle, 
                pynvml.NVML_TEMPERATURE_GPU
            )
            
            # Gather additional readings
            power_usage = pynvml.nvmlDeviceGetPowerUsage(self.gpu_handle) / 1000.0
            util = pynvml.nvmlDeviceGetUtilizationRates(self.gpu_handle)
            
            self.stats["temp_checks"] += 1
            
            return {
                "temperature": temp,
                "power": power_usage,
                "gpu_util": util.gpu,
                "memory_util": util.memory,
                "status": self.get_temperature_status(temp)
            }
            
        except Exception as e:
            self.logger.error(f"Failed to check GPU temperature: {e}")
            return None
    
    def get_temperature_status(self, temp):
        """Classify the temperature."""
        if temp >= self.temp_config["shutdown_threshold"]:
            return "SHUTDOWN"
        elif temp >= self.temp_config["critical_threshold"]:
            return "CRITICAL"
        elif temp >= self.temp_config["warning_threshold"]:
            return "WARNING"
        elif temp >= self.temp_config["safe_max"]:
            return "ELEVATED"
        else:
            return "SAFE"
    
    def wait_for_cooling(self, current_temp, target_temp=70):
        """Wait for the GPU to cool down."""
        self.logger.warning(f"GPU too hot: {current_temp}°C, waiting until it drops below {target_temp}°C")
        
        check_interval = 5   # check every 5 seconds
        max_wait_time = 300  # wait at most 5 minutes
        
        start_time = time.time()
        last_temp = current_temp
        while time.time() - start_time < max_wait_time:
            status = self.check_gpu_temperature()
            if status:
                last_temp = status["temperature"]
                if last_temp < target_temp:
                    self.logger.info(f"GPU cooled to {last_temp}°C, resuming generation")
                    return True
            
            # Report cooling progress (last_temp guards against a failed read)
            elapsed = int(time.time() - start_time)
            self.logger.info(f"Waiting for cooldown... {elapsed}s elapsed, current temperature: {last_temp}°C")
            time.sleep(check_interval)
        
        self.logger.error(f"Cooldown wait timed out, current temperature: {last_temp}°C")
        return False
    
    def adjust_generation_params(self, temp_status, original_params):
        """Adjust generation parameters based on the temperature status."""
        params = original_params.copy()
        
        if temp_status in ["WARNING", "CRITICAL"]:
            self.logger.warning(f"Temperature status: {temp_status}, auto-adjusting generation parameters")
            
            # Lower the resolution
            if params.get("height", 1024) > 512 or params.get("width", 1024) > 512:
                params["height"] = min(params.get("height", 1024), 512)
                params["width"] = min(params.get("width", 1024), 512)
                self.logger.info(f"Resolution adjusted to: {params['width']}x{params['height']}")
            
            # Reduce inference steps
            if params.get("num_inference_steps", 30) > 10:
                params["num_inference_steps"] = max(10, params["num_inference_steps"] // 2)
                self.logger.info(f"Inference steps adjusted to: {params['num_inference_steps']}")
            
            # Shrink the batch size
            if params.get("num_images_per_prompt", 1) > 1:
                params["num_images_per_prompt"] = 1
                self.logger.info("Batch size reduced to 1")
        
        return params
    
    def safe_generate(self, prompt, **kwargs):
        """安全的图像生成方法"""
        # 默认参数
        default_params = {
            "height": 1024,
            "width": 1024,
            "num_inference_steps": 30,
            "guidance_scale": 4.0,
            "num_images_per_prompt": 1,
            # diffusers管线一般不直接接受lora_scale参数,
            # LoRA强度需通过cross_attention_kwargs传入(以实际管线签名为准)
            "cross_attention_kwargs": {"scale": 1.0}
        }
        
        # 更新用户参数
        params = default_params.copy()
        params.update(kwargs)
        
        # 检查GPU温度
        gpu_status = self.check_gpu_temperature()
        if not gpu_status:
            raise RuntimeError("无法获取GPU状态")
        
        self.logger.info(f"生成前GPU状态: {gpu_status['temperature']}°C, {gpu_status['power']:.1f}W")
        
        # 如果温度过高,等待冷却
        if gpu_status["status"] in ["CRITICAL", "SHUTDOWN"]:
            if not self.wait_for_cooling(gpu_status["temperature"]):
                raise RuntimeError(f"GPU温度过高: {gpu_status['temperature']}°C,无法安全生成")
        
        # 根据温度调整参数
        adjusted_params = self.adjust_generation_params(gpu_status["status"], params)
        
        # 添加像素艺术触发词
        full_prompt = f"Pixel Art, {prompt}"
        
        # 记录开始时间
        start_time = time.time()
        
        try:
            # 生成图像
            self.logger.info(f"开始生成: {prompt}")
            self.logger.info(f"生成参数: {adjusted_params}")
            
            with torch.no_grad():
                images = self.pipe(
                    prompt=full_prompt,
                    negative_prompt="blurry, low quality, realistic, photograph",
                    **adjusted_params
                ).images
            
            # 计算生成时间
            generation_time = time.time() - start_time
            
            # 更新统计
            self.stats["total_generations"] += 1
            self.stats["total_time"] += generation_time
            
            # 生成后检查温度
            post_status = self.check_gpu_temperature()
            if post_status:
                temp_increase = post_status["temperature"] - gpu_status["temperature"]
                self.logger.info(f"生成完成,耗时: {generation_time:.1f}秒")
                self.logger.info(f"温度变化: +{temp_increase:.1f}°C")
                self.logger.info(f"当前温度: {post_status['temperature']}°C")
            
            return images
            
        except Exception as e:
            self.logger.error(f"生成失败: {e}")
            raise
    
    def batch_generate_with_cooling(self, prompts, batch_size=2, cooling_interval=3):
        """批量生成,带冷却间隔"""
        results = []
        
        for i, prompt in enumerate(prompts):
            self.logger.info(f"处理第 {i+1}/{len(prompts)} 个提示词: {prompt}")
            
            try:
                # 生成前检查温度
                gpu_status = self.check_gpu_temperature()
                if gpu_status and gpu_status["status"] in ["WARNING", "CRITICAL"]:
                    self.logger.warning(f"温度警告: {gpu_status['temperature']}°C")
                    
                    # 增加冷却时间
                    extra_cooling = (gpu_status["temperature"] - 70) * 2  # 每超过1°C增加2秒
                    cooling_time = cooling_interval + extra_cooling
                    self.logger.info(f"增加冷却时间至 {cooling_time:.1f} 秒")
                    time.sleep(cooling_time)
                
                # 生成图像
                images = self.safe_generate(prompt)
                results.append((prompt, images))
                
                # 批次间冷却
                if i < len(prompts) - 1:
                    self.logger.info(f"批次间冷却 {cooling_interval} 秒...")
                    time.sleep(cooling_interval)
                    
            except Exception as e:
                self.logger.error(f"批量生成失败: {e}")
                results.append((prompt, None))
        
        return results
    
    def print_stats(self):
        """打印统计信息"""
        avg_time = self.stats["total_time"] / max(self.stats["total_generations"], 1)
        
        print("\n" + "="*50)
        print("生成统计信息:")
        print("="*50)
        print(f"总生成次数: {self.stats['total_generations']}")
        print(f"总生成时间: {self.stats['total_time']:.1f}秒")
        print(f"平均生成时间: {avg_time:.1f}秒/次")
        print(f"温度检查次数: {self.stats['temp_checks']}")
        print(f"温度警告次数: {self.stats['temp_warnings']}")
        print(f"冷却延迟次数: {self.stats['delays']}")
        print("="*50)
    
    def cleanup(self):
        """清理资源"""
        pynvml.nvmlShutdown()
        self.logger.info("资源清理完成")

# 使用示例
if __name__ == "__main__":
    # 初始化生成器
    generator = SafePixelArtGenerator(
        model_path="Qwen/Qwen-Image-2512",
        lora_path="prithivMLmods/Qwen-Image-2512-Pixel-Art-LoRA",
        device="cuda"
    )
    
    try:
        # 单次生成示例
        print("示例1: 单次生成")
        images = generator.safe_generate(
            "a pixel art knight with sword and shield, 8-bit style",
            height=1024,
            width=1024,
            num_inference_steps=20
        )
        
        # 保存图像
        if images:
            images[0].save("knight_pixel_art.png")
            print("图像已保存: knight_pixel_art.png")
        
        # 批量生成示例
        print("\n示例2: 批量生成(带冷却间隔)")
        prompts = [
            "pixel art castle on a hill, retro game style",
            "pixel art dragon flying in the sky, 16-bit style",
            "pixel art treasure chest with gems, simple pixel art"
        ]
        
        results = generator.batch_generate_with_cooling(prompts, cooling_interval=5)
        
        # 保存批量生成的图像
        for i, (prompt, images) in enumerate(results):
            if images:
                filename = f"batch_{i+1}.png"
                images[0].save(filename)
                print(f"保存: {filename} - {prompt}")
        
        # 打印统计信息
        generator.print_stats()
        
    finally:
        generator.cleanup()
EOF

这个脚本的核心改进:

  1. 生成前温度检查:每次生成前检查GPU温度
  2. 自动参数调整:温度过高时自动降低分辨率、减少步数
  3. 智能冷却:温度过高时等待冷却,冷却时间随温度动态调整
  4. 批量生成保护:批次间自动添加冷却间隔
  5. 详细日志:记录每次生成的温度变化和性能数据
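
其中"智能冷却"的动态等待时间可以抽象成一个纯函数来理解(70°C阈值与"每超1°C加2秒"的系数沿用上文脚本中的取值,属于示意参数,可按自己的显卡调整):

```python
def cooling_time(temp_c, base_interval=3, threshold=70, per_degree=2):
    """计算批次间冷却时间:温度每超过threshold 1°C,额外增加per_degree秒"""
    extra = max(0, (temp_c - threshold) * per_degree)
    return base_interval + extra

# 68°C时只需基础间隔;78°C时为 3 + 8×2 = 19秒
print(cooling_time(68), cooling_time(78))
```

把冷却策略抽成独立函数的好处是与生成逻辑解耦,方便单独调参和写单元测试。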

5.2 创建温度感知的Gradio界面

如果你使用Gradio作为Web界面,可以集成温度监控功能:

# 创建带温度监控的Gradio界面
cat > gradio_app_with_monitor.py << 'EOF'
import gradio as gr
import torch
from diffusers import DiffusionPipeline
import pynvml
import time
from datetime import datetime
import json
import os

class TemperatureAwarePixelArtGenerator:
    def __init__(self):
        """初始化生成器"""
        self.device = "cuda"
        
        # 初始化GPU监控
        pynvml.nvmlInit()
        self.gpu_handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        
        # 温度配置
        self.temp_config = {
            "warning": 75,
            "critical": 82,
            "max_safe": 85
        }
        
        # 加载模型(DiffusionPipeline会根据模型结构自动选择具体管线类)
        print("正在加载模型...")
        self.pipe = DiffusionPipeline.from_pretrained(
            "Qwen/Qwen-Image-2512",
            torch_dtype=torch.float16
        ).to(self.device)
        
        # 加载LoRA
        self.pipe.load_lora_weights("prithivMLmods/Qwen-Image-2512-Pixel-Art-LoRA")
        print("模型加载完成")
        
        # 生成历史
        self.generation_history = []
        
    def get_gpu_status(self):
        """获取GPU状态"""
        try:
            temp = pynvml.nvmlDeviceGetTemperature(self.gpu_handle, pynvml.NVML_TEMPERATURE_GPU)
            power = pynvml.nvmlDeviceGetPowerUsage(self.gpu_handle) / 1000.0
            util = pynvml.nvmlDeviceGetUtilizationRates(self.gpu_handle)
            
            # 获取风扇速度
            try:
                fan_speed = pynvml.nvmlDeviceGetFanSpeed(self.gpu_handle)
            except:
                fan_speed = 0
            
            # 获取时钟频率
            try:
                clock_graphics = pynvml.nvmlDeviceGetClockInfo(self.gpu_handle, pynvml.NVML_CLOCK_GRAPHICS)
                clock_memory = pynvml.nvmlDeviceGetClockInfo(self.gpu_handle, pynvml.NVML_CLOCK_MEM)
            except pynvml.NVMLError:
                clock_graphics = 0
                clock_memory = 0
            
            return {
                "temperature": temp,
                "power_watts": power,
                "gpu_utilization": util.gpu,
                "memory_utilization": util.memory,
                "fan_speed": fan_speed,
                "clock_graphics": clock_graphics,
                "clock_memory": clock_memory,
                "timestamp": datetime.now().isoformat()
            }
        except Exception as e:
            print(f"获取GPU状态失败: {e}")
            return None
    
    def get_temperature_color(self, temp):
        """根据温度获取颜色标识"""
        if temp >= self.temp_config["critical"]:
            return "🔴"  # 红色
        elif temp >= self.temp_config["warning"]:
            return "🟡"  # 黄色
        else:
            return "🟢"  # 绿色
    
    def check_temperature_safe(self):
        """检查温度是否安全"""
        status = self.get_gpu_status()
        if not status:
            return True, "无法获取温度状态"
        
        temp = status["temperature"]
        
        if temp >= self.temp_config["max_safe"]:
            return False, f"温度过高: {temp}°C ≥ {self.temp_config['max_safe']}°C"
        elif temp >= self.temp_config["critical"]:
            return False, f"温度严重: {temp}°C ≥ {self.temp_config['critical']}°C"
        elif temp >= self.temp_config["warning"]:
            return True, f"温度警告: {temp}°C ≥ {self.temp_config['warning']}°C,建议等待冷却"
        else:
            return True, f"温度正常: {temp}°C"
    
    def adjust_params_by_temperature(self, params, temp):
        """根据温度调整参数"""
        adjusted = params.copy()
        
        if temp >= self.temp_config["critical"]:
            # 严重高温,大幅降低参数
            adjusted["num_inference_steps"] = max(10, adjusted.get("num_inference_steps", 30) // 2)
            if adjusted.get("height", 1024) > 512:
                adjusted["height"] = 512
                adjusted["width"] = 512
            adjusted["num_images_per_prompt"] = 1
            
        elif temp >= self.temp_config["warning"]:
            # 警告温度,适度降低参数
            adjusted["num_inference_steps"] = max(15, adjusted.get("num_inference_steps", 30) - 10)
            if adjusted.get("height", 1024) > 768:
                adjusted["height"] = 768
                adjusted["width"] = 768
        
        return adjusted
    
    def generate_image(self, prompt, height=1024, width=1024, num_steps=30, 
                      guidance_scale=4.0, lora_scale=1.0, num_images=1, seed=-1):
        """生成图像"""
        # 检查温度
        is_safe, temp_message = self.check_temperature_safe()
        
        if not is_safe:
            return None, f"❌ {temp_message},生成已取消。请等待GPU冷却后再试。"
        
        # 获取当前温度
        gpu_status = self.get_gpu_status()
        if not gpu_status:
            return None, "无法获取GPU状态"
        
        current_temp = gpu_status["temperature"]
        temp_color = self.get_temperature_color(current_temp)
        
        # 根据温度调整参数
        params = {
            "height": height,
            "width": width,
            "num_inference_steps": num_steps,
            "guidance_scale": guidance_scale,
            "num_images_per_prompt": num_images,
            # LoRA强度不能作为独立参数传给pipe(),
            # 需通过cross_attention_kwargs传入(以实际管线签名为准)
            "cross_attention_kwargs": {"scale": lora_scale}
        }
        
        if seed != -1:
            params["generator"] = torch.Generator(device=self.device).manual_seed(seed)
        
        # 温度过高时自动调整参数
        if current_temp >= self.temp_config["warning"]:
            params = self.adjust_params_by_temperature(params, current_temp)
            temp_message += f" 已自动调整参数: {params['height']}x{params['width']}, {params['num_inference_steps']}步"
        
        # 添加像素艺术触发词
        full_prompt = f"Pixel Art, {prompt}"
        
        # 记录开始时间
        start_time = time.time()
        
        try:
            # 生成图像
            with torch.no_grad():
                images = self.pipe(
                    prompt=full_prompt,
                    negative_prompt="blurry, low quality, realistic, photograph",
                    **params
                ).images
            
            # 计算生成时间
            generation_time = time.time() - start_time
            
            # 获取生成后温度
            post_status = self.get_gpu_status()
            post_temp = post_status["temperature"] if post_status else current_temp
            temp_increase = post_temp - current_temp
            
            # 保存生成记录
            record = {
                "prompt": prompt,
                "params": params,
                "temperature_before": current_temp,
                "temperature_after": post_temp,
                "temp_increase": temp_increase,
                "generation_time": generation_time,
                "timestamp": datetime.now().isoformat()
            }
            self.generation_history.append(record)
            
            # 保存历史记录到文件
            self.save_history()
            
            # 生成状态信息
            status_info = (
                f"{temp_color} 温度: {current_temp}°C → {post_temp}°C (Δ{temp_increase:+.1f}°C)\n"
                f"⏱️ 生成时间: {generation_time:.1f}秒\n"
                f"📏 分辨率: {params['height']}x{params['width']}\n"
                f"🔢 步数: {params['num_inference_steps']}\n"
                f"💾 显存利用率: {gpu_status['memory_utilization']}%\n"
                f"⚡ 功耗: {gpu_status['power_watts']:.1f}W"
            )
            
            if current_temp >= self.temp_config["warning"]:
                status_info += f"\n⚠️ 注意: GPU温度较高,建议增加冷却间隔"
            
            return images[0] if images else None, status_info
            
        except Exception as e:
            error_msg = f"生成失败: {str(e)}"
            print(error_msg)
            return None, f"❌ {error_msg}"
    
    def save_history(self):
        """保存生成历史"""
        try:
            with open("generation_history.json", "w") as f:
                json.dump(self.generation_history[-100:], f, indent=2)  # 只保存最近100条
        except Exception as e:
            print(f"保存历史失败: {e}")
    
    def get_temperature_history(self):
        """获取温度历史"""
        if not self.generation_history:
            return []
        
        # 提取温度数据
        history = []
        for record in self.generation_history[-20:]:  # 最近20次
            history.append({
                "time": record["timestamp"][11:19],  # 只取时间部分
                "temp_before": record["temperature_before"],
                "temp_after": record["temperature_after"]
            })
        
        return history
    
    def get_stats(self):
        """获取统计信息"""
        if not self.generation_history:
            return "暂无生成记录"
        
        total = len(self.generation_history)
        avg_time = sum(r["generation_time"] for r in self.generation_history) / total
        avg_temp_increase = sum(r["temp_increase"] for r in self.generation_history) / total
        max_temp = max(r["temperature_after"] for r in self.generation_history)
        
        stats = (
            f"📊 生成统计:\n"
            f"总生成次数: {total}\n"
            f"平均生成时间: {avg_time:.1f}秒\n"
            f"平均温升: {avg_temp_increase:.1f}°C\n"
            f"最高温度: {max_temp:.1f}°C\n"
            f"最近生成: {self.generation_history[-1]['timestamp'][:19]}"
        )
        
        return stats

# 创建Gradio界面
def create_interface():
    """创建Gradio界面"""
    generator = TemperatureAwarePixelArtGenerator()
    
    with gr.Blocks(title="温度感知的像素艺术生成器", theme=gr.themes.Soft()) as demo:
        gr.Markdown("# 🎨 温度感知的像素艺术生成器")
        gr.Markdown("实时监控GPU温度,智能调整生成参数,防止过热")
        
        with gr.Row():
            with gr.Column(scale=2):
                # 温度监控面板
                with gr.Group():
                    gr.Markdown("### 🔍 GPU状态监控")
                    
                    temp_display = gr.Textbox(
                        label="当前温度状态",
                        value="正在获取温度...",
                        interactive=False
                    )
                    
                    gpu_stats = gr.Textbox(
                        label="GPU详细信息",
                        value="正在获取GPU信息...",
                        interactive=False,
                        lines=4
                    )
                    
                    update_btn = gr.Button("🔄 刷新状态", variant="secondary", size="sm")
                    
                    def update_status():
                        status = generator.get_gpu_status()
                        if not status:
                            return "无法获取GPU状态", "请检查GPU驱动"
                        
                        temp = status["temperature"]
                        temp_color = generator.get_temperature_color(temp)
                        is_safe, temp_msg = generator.check_temperature_safe()
                        
                        temp_display = f"{temp_color} {temp_msg}"
                        
                        stats_text = (
                            f"🌡️ 温度: {temp}°C\n"
                            f"⚡ 功耗: {status['power_watts']:.1f}W\n"
                            f"📊 GPU利用率: {status['gpu_utilization']}%\n"
                            f"💾 显存利用率: {status['memory_utilization']}%\n"
                            f"🌀 风扇速度: {status['fan_speed']}%\n"
                            f"⏱️ 核心频率: {status['clock_graphics']}MHz\n"
                            f"💿 显存频率: {status['clock_memory']}MHz"
                        )
                        
                        return temp_display, stats_text
                    
                    update_btn.click(
                        update_status,
                        outputs=[temp_display, gpu_stats]
                    )
                
                # 生成参数
                with gr.Group():
                    gr.Markdown("### ⚙️ 生成参数")
                    
                    prompt = gr.Textbox(
                        label="提示词",
                        value="a pixel art knight with sword and shield, 8-bit retro game style",
                        placeholder="描述你想要生成的像素艺术..."
                    )
                    
                    with gr.Row():
                        height = gr.Slider(
                            label="高度",
                            minimum=512,
                            maximum=1280,
                            value=1024,
                            step=64
                        )
                        
                        width = gr.Slider(
                            label="宽度",
                            minimum=512,
                            maximum=1280,
                            value=1024,
                            step=64
                        )
                    
                    with gr.Row():
                        num_steps = gr.Slider(
                            label="生成步数",
                            minimum=10,
                            maximum=50,
                            value=30,
                            step=5
                        )
                        
                        guidance_scale = gr.Slider(
                            label="引导比例",
                            minimum=1.0,
                            maximum=10.0,
                            value=4.0,
                            step=0.5
                        )
                    
                    with gr.Row():
                        lora_scale = gr.Slider(
                            label="LoRA强度",
                            minimum=0.0,
                            maximum=2.0,
                            value=1.0,
                            step=0.1
                        )
                        
                        seed = gr.Number(
                            label="种子",
                            value=-1,
                            precision=0
                        )
                    
                    num_images = gr.Slider(
                        label="生成数量",
                        minimum=1,
                        maximum=4,
                        value=1,
                        step=1
                    )
                
                # 生成按钮
                generate_btn = gr.Button("🚀 生成像素艺术", variant="primary", size="lg")
                
                # 温度警告
                temp_warning = gr.Markdown(
                    "⚠️ **温度安全提示**: 当GPU温度超过75°C时,系统会自动降低生成参数以保证安全。"
                )
            
            with gr.Column(scale=1):
                # 生成结果
                gr.Markdown("### 🖼️ 生成结果")
                
                output_image = gr.Image(
                    label="生成的像素艺术",
                    type="pil",
                    height=400
                )
                
                status_info = gr.Textbox(
                    label="生成状态",
                    value="等待生成...",
                    interactive=False,
                    lines=6
                )
                
                # 统计信息
                with gr.Group():
                    gr.Markdown("### 📈 生成统计")
                    
                    stats_display = gr.Textbox(
                        label="统计信息",
                        value="暂无数据",
                        interactive=False,
                        lines=5
                    )
                    
                    stats_btn = gr.Button("📊 更新统计", variant="secondary", size="sm")
                    
                    def update_stats():
                        return generator.get_stats()
                    
                    stats_btn.click(
                        update_stats,
                        outputs=stats_display
                    )
                
                # 温度历史
                with gr.Group():
                    gr.Markdown("### 📊 温度历史")
                    
                    temp_history = gr.Dataframe(
                        headers=["时间", "生成前温度", "生成后温度"],
                        value=[],
                        interactive=False,
                        height=200
                    )
                    
                    history_btn = gr.Button("🔄 刷新历史", variant="secondary", size="sm")
                    
                    def update_history():
                        history = generator.get_temperature_history()
                        if history:
                            # 转换为DataFrame格式
                            data = [
                                [h["time"], h["temp_before"], h["temp_after"]]
                                for h in history
                            ]
                            return data
                        return []
                    
                    history_btn.click(
                        update_history,
                        outputs=temp_history
                    )
        
        # 生成函数
        def generate_with_status(prompt, height, width, num_steps, guidance_scale, lora_scale, num_images, seed):
            # 生成图像(温度检查与参数调整在generate_image内部完成)
            image, status = generator.generate_image(
                prompt, height, width, num_steps, 
                guidance_scale, lora_scale, num_images, seed
            )
            
            # 更新统计和历史(Dataframe组件需要行列表,不能直接传字典列表)
            stats = generator.get_stats()
            history = generator.get_temperature_history()
            history_rows = [
                [h["time"], h["temp_before"], h["temp_after"]] for h in history
            ]
            
            return image, status, stats, history_rows
        
        # 绑定事件
        generate_btn.click(
            generate_with_status,
            inputs=[prompt, height, width, num_steps, guidance_scale, lora_scale, num_images, seed],
            outputs=[output_image, status_info, stats_display, temp_history]
        )
        
        # 初始化显示
        demo.load(
            update_status,
            outputs=[temp_display, gpu_stats]
        )
        
        # 添加一些使用提示
        gr.Markdown("""
        ### 💡 使用提示
        
        1. **温度监控**: 系统会实时监控GPU温度,温度过高时自动调整生成参数
        2. **参数调整**: 温度超过75°C时,系统会自动降低分辨率和生成步数
        3. **冷却建议**: 连续生成时,建议在批次间添加冷却间隔
        4. **最佳实践**: 
           - 单次生成后等待30秒再生成下一张
           - 批量生成时使用较低的分辨率(如768×768)
           - 避免在环境温度过高时连续生成
        5. **安全阈值**:
           - 🟢 安全: <75°C
           - 🟡 警告: 75-82°C
           - 🔴 危险: >82°C
        """)
    
    return demo, generator

# 启动应用
if __name__ == "__main__":
    demo, generator = create_interface()
    
    try:
        print("启动温度感知像素艺术生成器...")
        print("访问地址: http://localhost:7860")
        print("按 Ctrl+C 停止服务")
        
        demo.launch(
            server_name="0.0.0.0",
            server_port=7860,
            share=False
        )
    except KeyboardInterrupt:
        print("\n正在停止服务...")
    finally:
        # 清理资源
        pynvml.nvmlShutdown()
        print("资源清理完成")
EOF

这个Gradio界面提供了:

  1. 实时温度监控:显示当前GPU温度、功耗、利用率
  2. 智能参数调整:温度过高时自动降低生成参数
  3. 温度历史记录:查看每次生成的温度变化
  4. 生成统计:记录生成次数、平均时间、温度变化
  5. 安全提示:根据温度显示不同的警告级别
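
界面会把最近100条生成记录写入 generation_history.json,你也可以用一小段脚本离线统计温升情况(字段名沿用上文record结构,示例中用内置数据代替真实文件):

```python
import statistics

def summarize(records):
    """统计平均温升与最高温度,records的字段结构同上文的生成记录"""
    rises = [r["temperature_after"] - r["temperature_before"] for r in records]
    return {
        "avg_rise": statistics.mean(rises),
        "max_temp": max(r["temperature_after"] for r in records),
    }

# 示例数据;实际使用时可改为 json.load(open("generation_history.json"))
records = [
    {"temperature_before": 62, "temperature_after": 71, "generation_time": 14.2},
    {"temperature_before": 71, "temperature_after": 78, "generation_time": 15.8},
]
print(summarize(records))
```

如果平均温升持续偏高,说明冷却间隔不够,应该加大批次间的等待时间或降低分辨率。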

6. 高级优化与最佳实践

6.1 温度优化配置

除了监控,我们还可以主动优化GPU的工作环境:

# 创建GPU优化脚本
cat > optimize_gpu.sh << 'EOF'
#!/bin/bash

# GPU优化脚本
# 优化GPU设置以降低温度和功耗

echo "开始优化GPU设置..."

# 1. 设置GPU功耗限制(需要nvidia-smi)
if command -v nvidia-smi &> /dev/null; then
    echo "设置GPU功耗限制..."
    
    # 获取当前GPU数量
    GPU_COUNT=$(nvidia-smi --query-gpu=count --format=csv,noheader | head -1)
    
    for ((i=0; i<GPU_COUNT; i++)); do
        # 设置功耗限制为最大功耗的80%
        MAX_POWER=$(nvidia-smi -i $i --query-gpu=power.limit --format=csv,noheader | awk '{print $1}')
        TARGET_POWER=$(echo "$MAX_POWER * 0.8" | bc | cut -d. -f1)
        
        echo "GPU $i: 设置功耗限制为 ${TARGET_POWER}W (原${MAX_POWER}W)"
        sudo nvidia-smi -i $i -pl $TARGET_POWER
        
        # 设置性能模式为自适应
        sudo nvidia-smi -i $i -pm 1
        sudo nvidia-smi -i $i -acp 0
    done
fi

# 2. 优化系统电源管理
echo "优化系统电源管理..."

# 设置CPU调度器为性能模式
if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ]; then
    echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
fi

# 禁用USB自动挂起
if [ -f /sys/module/usbcore/parameters/autosuspend ]; then
    echo "-1" | sudo tee /sys/module/usbcore/parameters/autosuspend
fi

# 3. 优化内存管理
echo "优化内存管理..."

# 清理页面缓存
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# 调整swappiness(减少交换)
if [ -f /proc/sys/vm/swappiness ]; then
    echo "10" | sudo tee /proc/sys/vm/swappiness
fi

# 4. 优化IO调度
echo "优化IO调度..."

# 设置IO调度器为mq-deadline(新内核已用多队列调度器取代deadline)
if command -v lsblk &> /dev/null; then
    for disk in $(lsblk -d -o name | grep -v NAME); do
        if [ -f "/sys/block/$disk/queue/scheduler" ]; then
            echo "mq-deadline" | sudo tee "/sys/block/$disk/queue/scheduler" > /dev/null 2>&1
        fi
    done
fi

# 5. 创建温度监控别名
echo "创建监控别名..."

cat >> ~/.bashrc << 'ALIASES'

# GPU监控别名
alias gputemp='nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader'
alias gpupower='nvidia-smi --query-gpu=power.draw --format=csv,noheader'
alias gpumem='nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader'
alias gpustat='watch -n 1 "nvidia-smi --query-gpu=name,temperature.gpu,power.draw,memory.used,memory.total,utilization.gpu --format=csv"'
alias gpuwatch='watch -n 1 nvidia-smi'

# 快速温度检查
alias tempcheck='echo "GPU温度:" $(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader)°C'

ALIASES

# 6. 创建自动清理脚本
cat > ~/cleanup_gpu.sh << 'CLEANUP'
#!/bin/bash

# GPU清理脚本
# 清理GPU内存和进程

echo "清理GPU内存..."

# 清理GPU进程
if command -v nvidia-smi &> /dev/null; then
    # 查找占用GPU的进程
    PIDS=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sort -u)
    
    if [ -n "$PIDS" ]; then
        echo "找到以下GPU进程:"
        nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
        
        read -p "是否终止这些进程? (y/n): " -n 1 -r
        echo
        if [[ $REPLY =~ ^[Yy]$ ]]; then
            for PID in $PIDS; do
                echo "终止进程 $PID"
                kill "$PID" 2>/dev/null
            done
        fi
    else
        echo "未发现占用GPU的进程"
    fi
fi

echo "GPU清理完成"
CLEANUP

chmod +x ~/cleanup_gpu.sh

echo "GPU优化完成!执行 source ~/.bashrc 使监控别名生效"
EOF

运行 bash optimize_gpu.sh 即可应用上述优化(其中多数命令需要sudo权限)。