Automatically selecting an available GPU with nvidia-smi


狂奔solar · Published 2024-11-10 14:14:29

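Why GPU selection code is needed

In deep learning and high-performance computing, GPUs (graphics processing units) are widely used to accelerate model training and inference. Modern machines often have several GPUs, so choosing and managing them sensibly matters. On a machine that runs several training jobs at once, a shared piece of management code has the following benefits:

Resource optimization: when several GPUs are available, picking the least busy one makes good use of the hardware and avoids slowdowns caused by contention.
Adapting to a changing environment: on a shared machine, other users' jobs may occupy GPUs. The code queries GPU state at call time, so the choice reflects the load at the moment the job starts.
Simpler management: choosing GPUs by hand is tedious and error-prone. The code automates the choice, so users can focus on model development instead of low-level resource management.
Avoiding errors: by checking that a GPU and the nvidia-smi tool are present, the code helps prevent runtime errors caused by picking an unavailable or heavily loaded GPU.
Flexibility: the selection strategy can be changed through a parameter, for example to favor free memory or low power draw, to suit different workloads and environments.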

Checking GPU availability

import os
import torch

def check_gpus():
    # CUDA must be usable from PyTorch.
    if not torch.cuda.is_available():
        print('This script can only manage NVIDIA GPUs, but no GPU was found on this device.')
        return False
    # The help text of nvidia-smi contains this phrase when the tool is installed.
    elif 'NVIDIA System Management' not in os.popen('nvidia-smi -h').read():
        print("'nvidia-smi' tool not found.")
        return False
    return True

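Purpose: check whether the current machine has a usable NVIDIA GPU.
Logic:
- Use torch.cuda.is_available() to check whether CUDA is available.
- Use os.popen('nvidia-smi -h') to check whether the nvidia-smi tool is available (this tool is used to query GPU state).
- Return False if no GPU is available or nvidia-smi is missing; otherwise return True.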
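
The nvidia-smi part of this check can also be done without going through a shell. The following is a minimal sketch that assumes only the Python standard library; the function name nvidia_smi_available is hypothetical and is not part of the original code.

```python
import shutil
import subprocess

def nvidia_smi_available():
    """Return True if the nvidia-smi executable is on PATH and runs successfully."""
    path = shutil.which('nvidia-smi')
    if path is None:
        return False
    try:
        # -L lists the GPUs; a non-zero exit code usually means no driver or no device.
        result = subprocess.run([path, '-L'], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

print(nvidia_smi_available())
```

Unlike the torch-based check, this sketch says nothing about whether CUDA is usable from PyTorch; it only tests the command-line tool.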

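Parsing nvidia-smi output

def parse(line, qargs):
    # Fields that nvidia-smi reports as a number followed by a unit.
    numeric_args = ['memory.free', 'memory.total', 'power.draw', 'power.limit']
    # Power fields read '[Not Supported]' (or '[N/A]' on some drivers) on GPUs without power management.
    power_manage_enable = lambda v: 'Not Support' not in v and 'N/A' not in v
    # Strip the unit ('MiB' or 'W') and convert to float.
    to_numeric = lambda v: float(v.upper().strip().replace('MIB', '').replace('W', ''))
    # Unsupported numeric fields fall back to the sentinel value 1.
    process = lambda k, v: ((int(to_numeric(v)) if power_manage_enable(v) else 1) if k in numeric_args else v.strip())
    return {k: process(k, v) for k, v in zip(qargs, line.strip().split(','))}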

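Purpose: parse the CSV-format GPU information returned by nvidia-smi.
Logic:
- Define the list of fields that hold numeric values.
- Use a few lambda functions to clean and convert the string values (strip units, convert to numbers, and so on).
- Return the parsed information as a dict.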
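
To make the parsing steps concrete, here is a small standalone sketch applied to one sample line. The function name parse_line and the sample values are made up for illustration; real output depends on the GPU and driver.

```python
def parse_line(line, fields):
    """Turn one CSV line from nvidia-smi into a dict keyed by field name."""
    numeric = {'memory.free', 'memory.total', 'power.draw', 'power.limit'}
    info = {}
    for key, raw in zip(fields, line.strip().split(',')):
        value = raw.strip()
        if key not in numeric:
            info[key] = value
        elif 'Not Support' in value or 'N/A' in value:
            info[key] = 1  # sentinel for an unsupported field, as in the article's code
        else:
            # Drop the unit suffix ('MiB' or 'W') and truncate to an integer.
            info[key] = int(float(value.upper().replace('MIB', '').replace('W', '')))
    return info

fields = ['index', 'gpu_name', 'memory.free', 'memory.total', 'power.draw', 'power.limit']
sample = '0, Tesla V100-SXM2-16GB, 15109 MiB, 16160 MiB, 43.21 W, 300.00 W'  # made-up values
print(parse_line(sample, fields))
```

Note that the index stays a string while the memory and power fields become integers, which is why the selection code later converts the index with int().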

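Querying GPU information

def query_gpu(qargs=[]):
    # Always query these base fields, followed by any extra fields the caller asks for.
    qargs = ['index', 'gpu_name', 'memory.free', 'memory.total', 'power.draw', 'power.limit'] + qargs
    # --format=csv,noheader prints one comma-separated line per GPU, without a header row.
    cmd = 'nvidia-smi --query-gpu={} --format=csv,noheader'.format(','.join(qargs))
    results = os.popen(cmd).readlines()
    return [parse(line, qargs) for line in results]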
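
As a variation on the query step, the sketch below uses subprocess instead of os.popen and adds the nounits option so that values come back without the MiB or W suffix. The name query_gpus and the reduced field list are assumptions made for this example; it returns an empty list when nvidia-smi cannot be run.

```python
import subprocess

FIELDS = ['index', 'gpu_name', 'memory.free', 'memory.total']

def query_gpus(fields=FIELDS):
    """Return one dict of raw strings per GPU, or [] if nvidia-smi cannot be run."""
    cmd = ['nvidia-smi',
           '--query-gpu=' + ','.join(fields),
           '--format=csv,noheader,nounits']  # nounits drops the ' MiB' / ' W' suffixes
    try:
        out = subprocess.run(cmd, capture_output=True, text=True,
                             check=True, timeout=10).stdout
    except (OSError, subprocess.SubprocessError):
        return []
    return [dict(zip(fields, (v.strip() for v in line.split(','))))
            for line in out.splitlines() if line.strip()]

print(query_gpus())
```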

Choosing the least busy GPU

def auto_choice(self, mode=0):
    for old_infos, new_infos in zip(self.gpus, query_gpu(self.qargs)):
        old_infos.update(new_infos)
    unspecified_gpus = [gpu for gpu in self.gpus if not gpu['specified']] or self.gpus

    if mode == 0:
        chosen_gpu = self._sort_by_memory(unspecified_gpus, True)[0]
    # other selection modes ...
    chosen_gpu['specified'] = True
    index = chosen_gpu['index']
    print('Using GPU {i}:\n{info}'.format(i=index, info='\n'.join([str(k)+':'+str(v) for k,v in chosen_gpu.items()])))
    return int(index)

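Purpose: automatically choose the least busy GPU.
Logic:
- Refresh the GPU information.
- Choose a GPU according to the mode (for example by free memory size or by the fraction of memory that is free).
- Print the chosen GPU's information and return its index.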
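
The selection logic can be tried out on plain dicts, without any GPU. This is a simplified sketch: the function choose_gpu and the sample values are hypothetical, and it covers only the two memory-based modes.

```python
def choose_gpu(gpus, mode=0):
    """Pick a GPU dict from `gpus`, preferring ones not yet marked as specified."""
    # Fall back to all GPUs once every one of them has been handed out.
    candidates = [g for g in gpus if not g.get('specified')] or gpus
    if mode == 1:
        key = lambda g: g['memory.free'] / g['memory.total']  # fraction of memory free
    else:
        key = lambda g: g['memory.free']  # absolute free memory
    chosen = max(candidates, key=key)
    chosen['specified'] = True  # later calls will prefer the other GPUs
    return int(chosen['index'])

# Made-up values for two GPUs, in MiB.
gpus = [
    {'index': '0', 'memory.free': 4000, 'memory.total': 8000},
    {'index': '1', 'memory.free': 6000, 'memory.total': 24000},
]
print(choose_gpu(gpus, mode=0))  # 1: GPU 1 has more free memory
print(choose_gpu(gpus, mode=0))  # 0: GPU 1 is already specified
```

Marking the chosen GPU as specified is what lets one process spread several jobs across different GPUs even if the free memory readings have not changed yet.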

Finally, we wrap the functions above into the GpuService class

import os
import torch

def check_gpus():
    '''
    GPU availability check
    http://pytorch-cn.readthedocs.io/zh/latest/package_references/torch-cuda/
    '''
    if not torch.cuda.is_available():
        print('This script can only manage NVIDIA GPUs, but no GPU was found on this device.')
        return False
    elif 'NVIDIA System Management' not in os.popen('nvidia-smi -h').read():
        print("'nvidia-smi' tool not found.")
        return False
    return True

if check_gpus():
    def parse(line, qargs):
        '''
        line:
            a line of text
        qargs:
            query arguments
        return:
            a dict of GPU info
        Parse one line of CSV-format text returned by nvidia-smi.
        '''
        numeric_args = ['memory.free', 'memory.total', 'power.draw', 'power.limit']  # fields with numeric values
        # Whether the field is supported: power management may be unavailable (e.g. on laptops),
        # in which case nvidia-smi prints '[Not Supported]' (or '[N/A]' on some drivers).
        power_manage_enable = lambda v: 'Not Support' not in v and 'N/A' not in v
        to_numeric = lambda v: float(v.upper().strip().replace('MIB', '').replace('W', ''))  # strip the unit from the string
        process = lambda k, v: ((int(to_numeric(v)) if power_manage_enable(v) else 1) if k in numeric_args else v.strip())
        return {k: process(k, v) for k, v in zip(qargs, line.strip().split(','))}

    def query_gpu(qargs=[]):
        '''
        qargs:
            query arguments
        return:
            a list of dicts, one per GPU
        Query GPU information.
        '''
        qargs = ['index', 'gpu_name', 'memory.free', 'memory.total', 'power.draw', 'power.limit'] + qargs
        cmd = 'nvidia-smi --query-gpu={} --format=csv,noheader'.format(','.join(qargs))
        results = os.popen(cmd).readlines()
        return [parse(line, qargs) for line in results]

    def by_power(d):
        '''
        Helper function for sorting GPUs by power usage (power.draw / power.limit).
        '''
        power_infos = (d['power.draw'], d['power.limit'])
        # parse() stores 1 for power fields the GPU does not support.
        if any(v == 1 for v in power_infos):
            print('Power management unavailable for GPU {}'.format(d['index']))
            return 1
        return float(d['power.draw']) / d['power.limit']

    class GpuService():
        '''
        qargs:
            query arguments
        A manager that lists all available GPU devices, sorts them,
        and automatically chooses the least busy one. Within one
        GpuService object, each GPU is tracked as specified or not,
        and unspecified GPUs are preferred.
        '''
        def __init__(self, qargs=[]):
            '''
            Query all GPUs once and mark each of them as not yet specified.
            '''
            self.qargs = qargs
            self.gpus = query_gpu(qargs)
            for gpu in self.gpus:
                gpu['specified'] = False
            self.gpu_num = len(self.gpus)

        def _sort_by_memory(self,gpus,by_size=False):
            if by_size:
                print('Sorted by free memory size')
                return sorted(gpus,key=lambda d:d['memory.free'],reverse=True)
            else:
                print('Sorted by free memory rate')
                return sorted(gpus,key=lambda d:float(d['memory.free'])/ d['memory.total'],reverse=True)

        def _sort_by_power(self,gpus):
            return sorted(gpus,key=by_power)

        def _sort_by_custom(self, gpus, key, reverse=False, qargs=[]):
            # key may be the name of a queried field ...
            if isinstance(key, str) and (key in qargs):
                return sorted(gpus, key=lambda d: d[key], reverse=reverse)
            # ... or any callable that maps a GPU dict to a sort key.
            if callable(key):
                return sorted(gpus, key=key, reverse=reverse)
            raise ValueError("The argument 'key' must be a function or a key in query args, please read the documentation of nvidia-smi")

        def auto_choice(self, mode=0):
            '''
            mode:
                0: (default) sort by free memory size
                1: sort by free memory rate
                2: sort by power usage (power.draw / power.limit)
            return:
                the index of the chosen GPU, as an int
            Automatically choose the least busy GPU, preferring
            GPUs that have not been specified yet.
            '''
            # Refresh the stored info with the current state of each GPU.
            for old_infos, new_infos in zip(self.gpus, query_gpu(self.qargs)):
                old_infos.update(new_infos)
            # Fall back to all GPUs once every one of them has been specified.
            unspecified_gpus = [gpu for gpu in self.gpus if not gpu['specified']] or self.gpus

            if mode == 0:
                print('Choosing the GPU device with the largest free memory...')
                chosen_gpu = self._sort_by_memory(unspecified_gpus, True)[0]
            elif mode == 1:
                print('Choosing the GPU device with the highest free memory rate...')
                chosen_gpu = self._sort_by_memory(unspecified_gpus, False)[0]
            elif mode == 2:
                print('Choosing the GPU device by power...')
                chosen_gpu = self._sort_by_power(unspecified_gpus)[0]
            else:
                print('Given an unavailable mode, will choose by memory')
                chosen_gpu = self._sort_by_memory(unspecified_gpus)[0]
            chosen_gpu['specified'] = True
            index = chosen_gpu['index']
            print('Using GPU {i}:\n{info}'.format(i=index, info='\n'.join([str(k) + ':' + str(v) for k, v in chosen_gpu.items()])))
            return int(index)
else:
    raise ImportError('GPU available check failed')

Usage is simple

gm = GpuService()
with torch.cuda.device(gm.auto_choice()):
    pass  # training code goes here

Or
gm=GpuService()
torch.cuda.set_device(gm.auto_choice())

Run the code above on a machine with GPUs.

Tencent Cloud Developer Community
