Using TensorRT INT8 and mixed-precision quantization for YOLOv8 on NVIDIA Jetson

For details, see this pull request: Tensorrt Mix Precision or INT8 conversion, mix precision almost same size and speed with INT8, but better precision, the converted model have good detection result with mix precision. by ZouJiu1 · Pull Request #9969 · ultralytics/ultralytics (github.com)

See also: Converting a YOLOv8 model to an engine and running inference with TensorRT 10 - Zhihu (zhihu.com)

Installing TensorRT 10.0.0.6 on Win10 and the API changes in that version - Zhihu (zhihu.com)

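The converted models run inference normally. Both the INT8 engine and the mixed-precision engine raise the FPS substantially. Judging by eye, accuracy drops slightly, but not by much.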

TensorRT was tried on a self-trained yolov8x.pt model, averaging the inference time over 10000 images. The FP16 engine ran at about 72 FPS, PyTorch at about 36 FPS, the INT8 engine at about 96 FPS, and the mixed-precision engine at about 106 FPS.
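To make these figures easier to compare, here is a small sketch that converts them into a speedup over PyTorch and a per-image latency. The FPS values are the approximate numbers reported above; the function name and table layout are only illustrative.

```python
# Approximate FPS figures reported above (yolov8x, averaged over 10000 images).
fps = {"pytorch": 36, "fp16": 72, "int8": 96, "mixed": 106}


def speedup_and_latency(fps_table, baseline="pytorch"):
    """Return {name: (speedup vs. baseline, latency in ms per image)}."""
    base = fps_table[baseline]
    return {
        name: (round(value / base, 2), round(1000.0 / value, 2))
        for name, value in fps_table.items()
    }


summary = speedup_and_latency(fps)
for name, (speedup, latency_ms) in summary.items():
    print(f"{name:8s} {speedup:5.2f}x  {latency_ms:6.2f} ms/image")
```

Because the source numbers are rounded ("about"), the derived ratios should be read as rough indications only.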

The official models were also tested here, and the results were quite good.

The INT8 model is compressed by a factor of 2x2 (about 4x). The mixed-precision model has a similar compression ratio and a similar speed.

INT8

INT8 means that both the weights and the inputs are represented as int8 numbers, that is, integer values in [-128, 127].
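As a rough illustration of what mapping values into [-128, 127] means, the sketch below applies a simple symmetric quantization with a fixed scale. The scale of 0.01 and the sample values are assumptions chosen for the example; in TensorRT the scales come out of calibration, and its internal scheme may differ in detail.

```python
def quantize_int8(values, scale):
    """Symmetric quantization: round(x / scale), clamped to the INT8 range [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize_int8(q_values, scale):
    """Map INT8 codes back to approximate real values."""
    return [q * scale for q in q_values]


weights = [-2.0, -1.0, -0.5, 0.0, 0.25, 1.0, 3.0]
scale = 0.01  # hypothetical scale; real scales are produced by calibration

q = quantize_int8(weights, scale)
restored = dequantize_int8(q, scale)
print(q)
```

Values outside the representable range (here -2.0 and 3.0) saturate at -128 and 127, which is why the choice of scale matters for accuracy.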

INT8 requires turning on one switch: half=False, int8=True

Mixed precision

Mixed precision means that some layers are forced to FP16 (float16) while other layers are INT8, so different layers use different compression schemes.

Generally, the first few layers and the last few layers should be kept in FP16, and the remaining layers can be INT8.

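Here the first two layers and the last layer are kept in FP16. Using float16 for these layers makes the accuracy drop much smaller, because the first and last few layers matter more. Several QAT papers also note that the quantization bit width of the first and last few layers should not be too low, and that float16 or float32 is preferable there.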

Mixed precision is configured by layer type and by the layer names in the ONNX graph. Typically, mixed precision keeps certain convolution layers in FP16.
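The selection rule can be sketched without TensorRT at all. In the snippet below, plain strings stand in for the TensorRT layer type, and the layer names are hypothetical examples of what an ONNX export might produce; the three name patterns are the ones enabled in the exporter code of this article.

```python
# Name patterns enabled in the exporter code of this article.
FP16_NAME_PATTERNS = ("/model.0/", "/model.1/", "/model.22/dfl/conv/Conv")


def should_pin_fp16(layer_type, layer_name, patterns=FP16_NAME_PATTERNS):
    """Return True for convolution layers whose name contains one of the patterns."""
    return layer_type == "CONVOLUTION" and any(p in layer_name for p in patterns)


# Hypothetical (type, name) pairs; a real network would be walked with the TensorRT API.
layers = [
    ("CONVOLUTION", "/model.0/conv/Conv"),
    ("ACTIVATION", "/model.0/act/Sigmoid"),
    ("CONVOLUTION", "/model.10/conv/Conv"),
    ("CONVOLUTION", "/model.22/dfl/conv/Conv"),
]

pinned = [name for kind, name in layers if should_pin_fp16(kind, name)]
print(pinned)
```

The trailing slash in patterns such as "/model.1/" matters: without it, "/model.1" would also match "/model.10/" and pin more layers than intended.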

Mixed precision requires turning on two switches: half=True, int8=True

Code

Before running, you also need to apply the corresponding changes from this pull request:

https://github.com/ultralytics/ultralytics/pull/9969

Below is the code actually used for the conversion. You can change yolov8x.pt to yolov8l.pt, yolov8n.pt, and so on.

Both the INT8 conversion and the mixed-precision conversion need calibration images; see https://github.com/NVIDIA/TensorRT/blob/main/samples/python/detectron2

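1. Download COCO val2017 and unzip it. This is the directory used below as calib_input = r'E:\work\codeRepo\deploy\jz\val2017'

2. Set up the code as shown below.
3. Before running, also comment out line 322 of ultralytics\cfg\__init__.py, raise SyntaxError(string + CLI_HELP_MSG) from e, to prevent an error.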

Converting the model

When exporting the model, set half=False, int8=True for INT8, and half=True, int8=True for mixed precision.

At inference time these two switches do not need to be set; both stay at half=False, int8=False.
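The switch combinations can be collected in one small lookup table. This is only a convenience sketch: the mode names and the helper function are made up for illustration, while the half/int8 values for INT8 and mixed precision are the ones described above, and the FP16 and FP32 rows follow how the exporter code in this article handles the two flags.

```python
# Mode names are hypothetical labels; the flag values follow the text above.
EXPORT_FLAGS = {
    "fp32": {"half": False, "int8": False},
    "fp16": {"half": True, "int8": False},
    "int8": {"half": False, "int8": True},
    "mixed": {"half": True, "int8": True},
}


def export_kwargs(mode):
    """Return a copy of the half/int8 keyword arguments for the given mode."""
    if mode not in EXPORT_FLAGS:
        raise ValueError(f"unknown precision mode: {mode!r}")
    return dict(EXPORT_FLAGS[mode])


print(export_kwargs("mixed"))
# Intended use with the modified exporter (not executed here):
#   model.export(format="engine", **export_kwargs("mixed"), ...)
```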

The engines converted from the official models can be downloaded here:

https://www.alipan.com/s/FdfFoPDGCWH

TensorRT/samples/python/detectron2 at main · NVIDIA/TensorRT

https://github.com/NVIDIA/TensorRT/blob/main/samples/python/efficientdet

Setting the batch size

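According to the English documentation, https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#enable_int8_c, a larger batch_size during export generally makes calibration more accurate. The calibration data should also resemble the training data and be shuffled as much as possible. The documentation states: "To avoid this issue, calibrate with as large a single batch as possible, and ensure that calibration batches are well randomized and have similar distribution."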
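A minimal sketch of that advice, assuming a flat list of image paths: shuffle with a fixed seed, then cut the list into equally sized batches and drop the incomplete tail. The function name and the file names are hypothetical; the image batcher later in this article does the same thing with NumPy's shuffle and its exact_batches option.

```python
import random


def make_calibration_batches(paths, batch_size, seed=0, exact_batches=True):
    """Shuffle image paths and split them into calibration batches."""
    paths = sorted(paths)  # deterministic starting order
    random.Random(seed).shuffle(paths)  # randomize which images share a batch
    if exact_batches:
        # Drop the last few images so every batch is full.
        usable = batch_size * (len(paths) // batch_size)
        paths = paths[:usable]
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


files = [f"img_{i:04d}.jpg" for i in range(50)]  # hypothetical file names
batches = make_calibration_batches(files, batch_size=20)
print(len(batches), [len(b) for b in batches])
```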

Inference

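For inference, point the model at the converted engine file and call model.predict as usual. No extra parameters or other steps are needed.
The input image should preferably be float32, and the output is float32 as well. These are already the defaults, so no extra configuration is needed; just keep the defaults shown below.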

import os
import gc
import sys
sys.path.append(r'E:\work\codeRepo\deploy\common\zj\ultralytics')
from ultralytics import YOLO # newest version from "git clone and git pull"
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

'''
Platform: Windows 11

Ultralytics YOLOv8.1.44   Python-3.9.18 torch-2.2.1+cu118 CUDA:0 (NVIDIA GeForce RTX 4070 Ti, 12282MiB)

onnx 1.16.0 opset 17

TensorRT 10.0.0b6:  
https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.0/zip/TensorRT-10.0.0.6.Windows10.win10.cuda-11.8.zip

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
'''

if __name__ == '__main__':
    file = r'yolov8x.pt'
    model = YOLO(file)  # load a pretrained model (recommended for training)
    calib_input = r'E:\work\codeRepo\deploy\jz\val2017'
    cache_file = r'E:\work\codeRepo\deploy\jz\calibration.cache'
    if os.path.exists(cache_file):
        os.remove(cache_file)
    '''
    https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#enable_int8_c
    To avoid this issue, calibrate with as large a single batch as possible, 
    and ensure that calibration batches are well randomized and have similar distribution.
    '''
    results0 = model.export(format='engine', simplify=True,
                            half=True,
                            int8=True,
                            calib_batch_size=20,
                            calib_num_images=len(os.listdir(calib_input)),
                            calib_input=calib_input,
                            cache_file=cache_file,
                            device='cuda:0')
    del model
    gc.collect()
    model = YOLO(r"E:\work\%s"%(file.replace(".pt", ".engine")))
    result = model.predict(
                           'https://ultralytics.com/images/bus.jpg', 
                           save_dir=r'.//', 
                           save=True)

Place the following file at ultralytics/engine/tensorrt_int8/calibrator.py

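It has been modified accordingly: mainly, the CUDA-specific part was removed and PyTorch is used directly to allocate device memory and arrays.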

According to the English documentation, Developer Guide :: NVIDIA Deep Learning TensorRT Documentation,

you can switch between different calibrator classes, such as trt.IInt8EntropyCalibrator2, trt.IInt8MinMaxCalibrator, trt.IInt8EntropyCalibrator, and trt.IInt8LegacyCalibrator.

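The code below uses trt.IInt8MinMaxCalibrator; you can also try trt.IInt8EntropyCalibrator2. According to the documentation, trt.IInt8EntropyCalibrator2 is better suited to CNN networks, while IInt8MinMaxCalibrator is better suited to BERT or other NLP networks.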

IInt8EntropyCalibrator2 Entropy calibration chooses the tensor’s scale factor to optimize the quantized tensor’s information-theoretic content, and usually suppresses outliers in the distribution. This is the current and recommended entropy calibrator and is required for DLA. Calibration happens before Layer fusion by default. Calibration batch size may impact the final result. It is recommended for CNN-based networks.

IInt8MinMaxCalibrator This calibrator uses the entire range of the activation distribution to determine the scale factor. It seems to work better for NLP tasks. Calibration happens before Layer fusion by default. This is recommended for networks such as NVIDIA BERT (an optimized version of Google’s official implementation).

IInt8EntropyCalibrator This is the original entropy calibrator. It is less complicated to use than the LegacyCalibrator and typically produces better results. Calibration batch size may impact the final result. Calibration happens after Layer fusion by default.

IInt8LegacyCalibrator This calibrator is for compatibility with TensorRT 2.0 EA. This calibrator requires user parameterization and is provided as a fallback option if the other calibrators yield poor results. Calibration happens after Layer fusion by default. You can customize this calibrator to implement percentile max, for example, 99.99% percentile max is observed to have best accuracy for NVIDIA BERT and NeMo ASR model QuartzNet
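To get a feeling for why the choice of calibrator matters, the sketch below compares a scale derived from the full activation range with one derived from a percentile of the absolute values. This is not TensorRT's implementation (the entropy calibrators in particular use an information-theoretic criterion instead); the function names, the 99% percentile, and the synthetic activations are assumptions made for the example.

```python
def minmax_scale(activations):
    """Scale from the full observed range, in the spirit of a min-max calibrator."""
    return max(abs(a) for a in activations) / 127.0


def percentile_scale(activations, percentile=99.0):
    """Scale from a percentile of |x|, which ignores the largest outliers."""
    ordered = sorted(abs(a) for a in activations)
    index = min(len(ordered) - 1, int(len(ordered) * percentile / 100.0))
    return ordered[index] / 127.0


# Synthetic activations: values in [-1, 1) plus a single large outlier.
acts = [i / 100.0 for i in range(-100, 100)] + [50.0]

print("min-max scale:   ", minmax_scale(acts))
print("percentile scale:", percentile_scale(acts))
```

With the outlier present, the min-max scale becomes much coarser, so most values near zero collapse onto only a few INT8 codes, whereas the percentile scale keeps the resolution and lets the outlier saturate.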

#
# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# https://github.com/NVIDIA/TensorRT/blob/release/10.0/samples/python/efficientdet/build_engine.py

import os
import sys
import tensorrt as trt
from ultralytics.utils import LOGGER

sys.path.insert(1, os.path.join(os.path.dirname(os.path.realpath(__file__)), os.pardir))

from ultralytics.engine.tensorrt_int8.image_batcher import ImageBatcher

log = LOGGER


# class EngineCalibrator(trt.IInt8EntropyCalibrator2):
class EngineCalibrator(trt.IInt8MinMaxCalibrator):
    # class EngineCalibrator(trt.IInt8EntropyCalibrator):
    # class EngineCalibrator(trt.IInt8LegacyCalibrator):
    """Implements the INT8 Entropy Calibrator 2 or IInt8MinMaxCalibrator."""

    def __init__(self, cache_file, device="cuda:0"):
        """
        :param cache_file: The location of the cache file.
        """
        super().__init__()
        self.cache_file = cache_file
        self.image_batcher = None
        self.batch_allocation = None
        self.batch_generator = None
        self.device = device

    def set_image_batcher(self, image_batcher: ImageBatcher):
        """
        Define the image batcher to use, if any.
        If using only the cache file, an image batcher doesn't need to be defined.
        :param image_batcher: The ImageBatcher object
        """
        self.image_batcher = image_batcher
        self.batch_generator = self.image_batcher.get_batch()

    def get_batch_size(self):
        """
        Overrides from trt.IInt8EntropyCalibrator2.
        Get the batch size to use for calibration.
        :return: Batch size.
        """
        if self.image_batcher:
            return self.image_batcher.batch_size
        return 1

    def get_batch(self, names):
        """
        Overrides from trt.IInt8EntropyCalibrator2.
        Get the next batch to use for calibration, as a list of device memory pointers.
        :param names: The names of the inputs, if useful to define the order of inputs.
        :return: A list of int-casted memory pointers.
        """
        if not self.image_batcher:
            return None
        try:
            batch = next(self.batch_generator)
            LOGGER.info(
                "Calibrating image {} / {}".format(self.image_batcher.image_index, self.image_batcher.num_images)
            )
            # Keep a reference to the batch so its device memory stays valid while TensorRT reads it.
            self.batch_allocation = batch
            return [int(batch.data_ptr())]
        except StopIteration:
            LOGGER.info("Finished calibration batches")
            return None

    def read_calibration_cache(self):
        """
        Overrides from trt.IInt8EntropyCalibrator2.
        Read the calibration cache file stored on disk, if it exists.
        :return: The contents of the cache file, if any.
        """
        if self.cache_file is not None and os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                LOGGER.info("Using calibration cache file: {}".format(self.cache_file))
                return f.read()

    def write_calibration_cache(self, cache):
        """
        Overrides from trt.IInt8EntropyCalibrator2.
        Store the calibration cache to a file on disk.
        :param cache: The contents of the calibration cache to store.
        """
        if self.cache_file is None:
            return
        with open(self.cache_file, "wb") as f:
            LOGGER.info("Writing calibration cache data to: {}".format(self.cache_file))
            f.write(cache)
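The way the builder drives these methods can be pictured with a toy stand-in that uses no TensorRT at all. The class and function below are hypothetical; they only mimic the call order: the cache is read first, batches are requested until get_batch returns None, and the cache is written at the end. The input name "images" is also just a placeholder.

```python
class ToyCalibrator:
    """Minimal stand-in with the same method names as the calibrator above."""

    def __init__(self, batches):
        self._batches = iter(batches)
        self.cache = None

    def get_batch(self, names):
        try:
            return [next(self._batches)]
        except StopIteration:
            return None  # signals that calibration data is exhausted

    def read_calibration_cache(self):
        return self.cache

    def write_calibration_cache(self, cache):
        self.cache = cache


def run_calibration(calibrator):
    """Toy driver: return how many batches were consumed."""
    if calibrator.read_calibration_cache() is not None:
        return 0  # an existing cache means no images are read at all
    consumed = 0
    while calibrator.get_batch(["images"]) is not None:
        consumed += 1
    calibrator.write_calibration_cache(b"toy-cache")
    return consumed


calibrator = ToyCalibrator([b"batch0", b"batch1", b"batch2"])
first = run_calibration(calibrator)
second = run_calibration(calibrator)
print(first, second)
```

This is also why the export script in this article deletes an existing calibration.cache before exporting: with a stale cache present, new calibration images would not be used.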

Place the following file at ultralytics/engine/tensorrt_int8/image_batcher.py

ultralytics/ultralytics/engine/tensorrt_int8/image_batcher.py at 2c160d03031ea3804b47566fdb769ba479669af7 · ultralytics/ultralytics (github.com)

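It has been modified accordingly: mainly, the CUDA-specific part was removed and PyTorch is used directly to allocate device memory and arrays.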

#
# SPDX-FileCopyrightText: Copyright (c) 1993-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# https://github.com/NVIDIA/TensorRT/blob/release/10.0/samples/python/efficientnet/build_engine.py
import torch
import cv2
import os
import sys
import numpy as np
from PIL import Image
from ultralytics.data.augment import LetterBox


class ImageBatcher:
    """Creates batches of pre-processed images."""

    def __init__(
        self,
        input,
        shape,
        dtype,
        max_num_images=None,
        exact_batches=False,
        config_file=None,
        shuffle_files=True,
        device="cuda:0",
    ):
        """
        :param input: The input directory to read images from.
        :param shape: The tensor shape of the batch to prepare, either in NCHW or NHWC format.
        :param dtype: The (numpy) datatype to cast the batched data to.
        :param max_num_images: The maximum number of images to read from the directory.
        :param exact_batches: This defines how to handle a number of images that is not an exact multiple of the batch
        size. If false, it will pad the final batch with zeros to reach the batch size. If true, it will *remove* the
        last few images in excess of a batch size multiple, to guarantee batches are exact (useful for calibration).
        :param config_file: The path pointing to the Detectron 2 yaml file which describes the model.
        """
        # Find images in the given input path.
        input = os.path.realpath(input)
        self.images = []

        extensions = [".jpg", ".jpeg", ".png", ".bmp", ".ppm"]

        def is_image(path):
            return os.path.isfile(path) and os.path.splitext(path)[1].lower() in extensions

        if os.path.isdir(input):
            self.images = [os.path.join(input, f) for f in os.listdir(input) if is_image(os.path.join(input, f))]
            self.images.sort()
            if shuffle_files:
                np.random.seed(999999999)
                np.random.shuffle(self.images)
        elif os.path.isfile(input):
            if is_image(input):
                self.images.append(input)
        self.num_images = len(self.images)
        if self.num_images < 1:
            print("No valid {} images found in {}".format("/".join(extensions), input))
            sys.exit(1)

        # Handle Tensor Shape.
        if dtype == np.float32:
            self.dtype = torch.float32
        elif dtype == np.float16:
            self.dtype = torch.float16
        elif dtype == np.int8:
            self.dtype = torch.int8
        self.shape = shape
        assert len(self.shape) == 4
        self.batch_size = shape[0]
        assert self.batch_size > 0
        self.format = None
        self.width = -1
        self.height = -1
        if self.shape[1] == 3:
            self.format = "NCHW"
            self.height = self.shape[2]
            self.width = self.shape[3]
        elif self.shape[3] == 3:
            self.format = "NHWC"
            self.height = self.shape[1]
            self.width = self.shape[2]
        assert all([self.format, self.width > 0, self.height > 0])

        # Adapt the number of images as needed.
        if max_num_images and 0 < max_num_images < len(self.images):
            self.num_images = max_num_images
        if exact_batches:
            self.num_images = self.batch_size * (self.num_images // self.batch_size)
        if self.num_images < 1:
            print("Not enough images to create batches")
            sys.exit(1)
        self.images = self.images[0 : self.num_images]

        # Subdivide the list of images into batches.
        self.num_batches = 1 + int((self.num_images - 1) / self.batch_size)
        self.batches = []
        for i in range(self.num_batches):
            start = i * self.batch_size
            end = min(start + self.batch_size, self.num_images)
            self.batches.append(self.images[start:end])

        # Indices.
        self.image_index = 0
        self.batch_index = 0
        self.newshape = [self.height, self.width]
        self.device = device
        self.LetterBox = LetterBox(self.newshape, scaleup=False)

    def preprocess_image(self, image_path):
        """
        The image preprocessor loads an image from disk and prepares it as needed for batching. This includes padding,
        resizing, normalization, data type casting, and transposing.

        This Image Batcher implements one algorithm for now:
        * Resizes and pads the image to fit the input size.
        :param image_path: The path to the image on disk to load.
        :return: A torch tensor in CHW layout holding the image sample, scaled to [0, 1] and ready to be copied into
        the batch.
        """
        image = Image.open(image_path)
        image = image.convert(mode="RGB")
        image = np.asarray(image, dtype=np.float32).copy()
        # Letterbox: resize while keeping the aspect ratio, then pad to the network input size.
        image = self.LetterBox(labels=None, image=image)
        # Change HWC -> CHW and scale pixel values to [0, 1].
        image = np.transpose(image, (2, 0, 1)) / 255.0
        # Convert to a torch tensor of the requested dtype.
        image = torch.from_numpy(np.ascontiguousarray(image)).to(self.dtype)
        return image

    def get_batch(self):
        """
        Retrieve the batches.

        This is a generator object, so you can use it within a loop as:
        for batch in batcher.get_batch():
           ...
        Or outside of a loop with the next() function.
        :return: A generator yielding one item per iteration: a torch tensor on self.device holding a batch of images.
        """
        for i, batch_images in enumerate(self.batches):
            batch_data = torch.zeros(tuple(self.shape), dtype=self.dtype, device=self.device)
            for j, image in enumerate(batch_images):
                self.image_index += 1
                batch_data[j] = self.preprocess_image(image)
            self.batch_index += 1
            yield batch_data
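The batcher relies on a letterbox transform with scaleup=False. As a rough sketch of the geometry such a transform usually computes (this is a simplified assumption, not code copied from the Ultralytics implementation), the image is shrunk to fit the target while keeping its aspect ratio, never enlarged, and the remainder is padded. The function name is hypothetical.

```python
def letterbox_geometry(width, height, target_w, target_h, scaleup=False):
    """Return (new_w, new_h, pad_w, pad_h) for an aspect-preserving resize plus padding."""
    ratio = min(target_w / width, target_h / height)
    if not scaleup:
        ratio = min(ratio, 1.0)  # only shrink, never enlarge
    new_w, new_h = round(width * ratio), round(height * ratio)
    # pad_w and pad_h are the total padding along each axis.
    return new_w, new_h, target_w - new_w, target_h - new_h


print(letterbox_geometry(1280, 720, 640, 640))
print(letterbox_geometry(320, 240, 640, 640))
```

Calibration images should go through the same preprocessing as inference images, otherwise the activation ranges seen during calibration may not match those at run time.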

Then you also need to modify the file ultralytics/engine/exporter.py

    @try_export
    def export_engine(self, prefix=colorstr("TensorRT:")):
        """YOLOv8 TensorRT export https://developer.nvidia.com/tensorrt."""
        assert self.im.device.type != "cpu", "export running on CPU but must be on GPU, i.e. use 'device=0'"
        self.args.simplify = True
        f_onnx, _ = self.export_onnx()  # run before trt import https://github.com/ultralytics/ultralytics/issues/7016

        try:
            import tensorrt as trt  # noqa
        except ImportError:
            if LINUX:
                check_requirements("nvidia-tensorrt", cmds="-U --index-url https://pypi.ngc.nvidia.com")
            import tensorrt as trt  # noqa
        check_version(trt.__version__, "7.0.0", hard=True)  # require tensorrt>=7.0.0

        LOGGER.info(f"\n{prefix} starting export with TensorRT {trt.__version__}...")
        is_trt10 = int(trt.__version__.split(".")[0]) >= 10  # is TensorRT >= 10
        assert Path(f_onnx).exists(), f"failed to export ONNX file: {f_onnx}"
        f = self.file.with_suffix(".engine")  # TensorRT engine file
        logger = trt.Logger(trt.Logger.INFO)
        if self.args.verbose:
            logger.min_severity = trt.Logger.Severity.VERBOSE

        builder = trt.Builder(logger)
        config = builder.create_builder_config()
        workspace = int(self.args.workspace * (1 << 30))
        if is_trt10:
            config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace)
        else:  # TensorRT versions 7, 8
            config.max_workspace_size = workspace
        flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        network = builder.create_network(flag)
        parser = trt.OnnxParser(network, logger)
        if not parser.parse_from_file(f_onnx):
            raise RuntimeError(f"failed to load ONNX file: {f_onnx}")

        inputs = [network.get_input(i) for i in range(network.num_inputs)]
        outputs = [network.get_output(i) for i in range(network.num_outputs)]
        for inp in inputs:
            LOGGER.info(f'{prefix} input "{inp.name}" with shape{inp.shape} {inp.dtype}')
        for out in outputs:
            LOGGER.info(f'{prefix} output "{out.name}" with shape{out.shape} {out.dtype}')

        if self.args.dynamic:
            shape = self.im.shape
            if shape[0] <= 1:
                LOGGER.warning(f"{prefix} WARNING ⚠️ 'dynamic=True' model requires max batch size, i.e. 'batch=16'")
            profile = builder.create_optimization_profile()
            min_shape = (1, shape[1], 32, 32)  # minimum input shape
            opt_shape = (max(1, shape[0] // 2), *shape[1:])  # optimal input shape
            max_shape = (*shape[:2], *(max(1, self.args.workspace) * d for d in shape[2:]))  # max input shape
            for inp in inputs:
                profile.set_shape(inp.name, min_shape, opt_shape, max_shape)
            config.add_optimization_profile(profile)

        half = builder.platform_has_fast_fp16 and self.args.half
        int8 = builder.platform_has_fast_int8 and self.args.int8
        mix_precision = half and int8
        if mix_precision:
            # https://github.com/NVIDIA/TensorRT/tree/main/samples/python/efficientdet
            """
            Experimental precision mode.

            Enable mixed-precision mode. When set, the layers defined here will be forced to FP16 to maximize INT8
            inference accuracy, while having minimal impact on latency.
            """
            config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)
            config.set_flag(trt.BuilderFlag.DIRECT_IO)
            config.set_flag(trt.BuilderFlag.REJECT_EMPTY_ALGORITHMS)

            # Convolutions in the first two blocks (/model.0/, /model.1/) and the final DFL convolution are pinned to FP16.
            # These layers have been manually chosen as they give a good middle-point between int8 and fp16
            # accuracy in COCO, while maintaining almost the same latency as a normal int8 engine.
            # To experiment with other datasets, or a different balance between accuracy/latency, you may
            # add or remove blocks.
            collect = []
            for i in range(network.num_layers):
                layer = network.get_layer(i)
                collect.append([layer.name, layer])
            for i in range(network.num_layers):
                layer = network.get_layer(i)
                if (
                    layer.type == trt.LayerType.CONVOLUTION
                    and any(
                        [
                            "/model.0/" in layer.name,
                            "/model.1/" in layer.name,
                            # "/model.2/" in layer.name,
                            # "/model.3/" in layer.name,
                            # "/model.4/m.0/" in layer.name,
                            # "/model.4/m.1/" in layer.name,
                            # "/model.22/cv2.0/" in layer.name,
                            # "/model.22/cv2.1/" in layer.name,
                            # "/model.22/cv2.2/" in layer.name,
                            # "/model.22/cv3.0/" in layer.name,
                            # "/model.22/cv3.1/" in layer.name,
                            # "/model.22/cv3.2/" in layer.name,
                            # "/model.22/cv2.0/cv2.0.2/Conv" in layer.name,
                            # "/model.22/cv3.0/cv3.0.2/Conv" in layer.name,
                            # "/model.22/cv2.1/cv2.1.2/Conv" in layer.name,
                            # "/model.22/cv3.1/cv3.1.2/Conv" in layer.name,
                            # "/model.22/cv2.2/cv2.2.2/Conv" in layer.name,
                            # "/model.22/cv3.2/cv3.2.2/Conv" in layer.name,
                            "/model.22/dfl/conv/Conv" in layer.name,
                        ]
                    )
                ) or (
                    any(
                        [
                            # "/model.22/Sigmoid" in layer.name,
                            # "/model.22/Mul_2" in layer.name,
                        ]
                    )
                ):
                    network.get_layer(i).precision = trt.DataType.HALF
                    LOGGER.info("Mixed-Precision Layer {} set to HALF STRICT data type".format(layer.name))

            LOGGER.info(f"{prefix} building a Mix Precision with FP16 and INT8 engine as {f}")
        if half:
            LOGGER.info(f"{prefix} building FP16 engine as {f}")
            config.set_flag(trt.BuilderFlag.FP16)
        if int8:
            # https://github.com/NVIDIA/TensorRT/tree/main/samples/python/efficientdet
            LOGGER.info(f"{prefix} building INT8 engine as {f}")
            from ultralytics.engine.tensorrt_int8.calibrator import EngineCalibrator
            from ultralytics.engine.tensorrt_int8.image_batcher import ImageBatcher

            """
            https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/#enable_int8_c
            To avoid this issue, calibrate with as large a single batch as possible, 
            and ensure that calibration batches are well randomized and have similar distribution.
            """
            # The batch size for the calibration process, default:
            calib_batch_size = self.args.calib_batch_size
            # The maximum number of images to use for calibration, default: len(os.listdir(calib_input))
            calib_num_images = self.args.calib_num_images
            # The directory holding images to use for calibration
            calib_input = self.args.calib_input
            cache_file = self.args.cache_file
            if calib_num_images is None:
                calib_num_images = len(os.listdir(calib_input))
            config.set_flag(trt.BuilderFlag.INT8)
            config.int8_calibrator = EngineCalibrator(cache_file)
            if cache_file is None or not os.path.exists(cache_file):
                calib_shape = [calib_batch_size] + list(inputs[0].shape[1:])
                calib_dtype = trt.nptype(inputs[0].dtype)
                imagebatcher = ImageBatcher(
                    calib_input,
                    calib_shape,
                    calib_dtype,
                    max_num_images=calib_num_images,
                    exact_batches=True,
                    shuffle_files=True,
                )
                imagebatcher.newshape = inputs[0].shape[2:]
                config.int8_calibrator.set_image_batcher(imagebatcher)

        # Free CUDA memory
        del self.model
        torch.cuda.empty_cache()

        # Write file
        build = builder.build_serialized_network if is_trt10 else builder.build_engine
        with build(network, config) as engine, open(f, "wb") as t:
            # Metadata
            meta = json.dumps(self.metadata)
            t.write(len(meta).to_bytes(4, byteorder="little", signed=True))
            t.write(meta.encode())
            # Model
            t.write(engine if is_trt10 else engine.serialize())

        return f, None
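The last step of the exporter writes a small header in front of the serialized engine: a 4-byte little-endian length, the JSON metadata, then the engine bytes. The sketch below reproduces that layout on an in-memory stream and reads it back. The helper names, the sample metadata, and the fake engine bytes are made up for illustration.

```python
import io
import json


def write_engine_file(stream, metadata, engine_bytes):
    """Write length-prefixed JSON metadata followed by the engine bytes."""
    meta = json.dumps(metadata)  # ASCII-only by default, so len(meta) equals the byte count
    stream.write(len(meta).to_bytes(4, byteorder="little", signed=True))
    stream.write(meta.encode())
    stream.write(engine_bytes)


def read_engine_file(stream):
    """Read the metadata header back and return (metadata, engine_bytes)."""
    meta_len = int.from_bytes(stream.read(4), byteorder="little", signed=True)
    metadata = json.loads(stream.read(meta_len).decode("utf-8"))
    return metadata, stream.read()


buffer = io.BytesIO()
write_engine_file(buffer, {"task": "detect", "batch": 1}, b"\x00fake-engine\x01")
buffer.seek(0)
metadata, engine = read_engine_file(buffer)
print(metadata, len(engine))
```

Because of this header, such a file is not a bare TensorRT engine: a loader that expects raw engine bytes has to skip the length prefix and the metadata first.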

https://zhuanlan.zhihu.com/p/692246336
