Project scenario:

Running inference on qwen3-8b with vllm fails with: torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
The full error output is shown below:

(EngineCore_DP0 pid=3675948) ERROR 01-22 13:05:52 [core.py:936] torch.AcceleratorError: CUDA error: the provided PTX was compiled with an unsupported toolchain.
(EngineCore_DP0 pid=3675948) ERROR 01-22 13:05:52 [core.py:936] Search for `cudaErrorUnsupportedPtxVersion' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=3675948) ERROR 01-22 13:05:52 [core.py:936] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=3675948) ERROR 01-22 13:05:52 [core.py:936] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=3675948) ERROR 01-22 13:05:52 [core.py:936] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=3675948) ERROR 01-22 13:05:52 [core.py:936]


Solution:

import os
# Must run before importing vllm
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
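Because the ordering matters, the assignment can be wrapped in a small guard that fails loudly if vllm has already been imported. This is only a minimal sketch: `select_attention_backend` is a hypothetical helper name, not part of vLLM, and it assumes (as the comment above states) that the variable has to be set before the import.

```python
import os
import sys


def select_attention_backend(backend: str = "FLASHINFER") -> str:
    """Set VLLM_ATTENTION_BACKEND, refusing if vllm is already imported.

    Hypothetical helper for illustration; not part of the vLLM API.
    """
    if "vllm" in sys.modules:
        # Setting the variable now would come after the import.
        raise RuntimeError("set VLLM_ATTENTION_BACKEND before importing vllm")
    os.environ["VLLM_ATTENTION_BACKEND"] = backend
    return os.environ["VLLM_ATTENTION_BACKEND"]


select_attention_backend()
# import vllm  # import only after the variable has been set
```

Calling the helper at the very top of the entry script keeps the workaround in one place and makes an accidental late call visible as an exception.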

Or run the following in the terminal:

export VLLM_ATTENTION_BACKEND=FLASHINFER
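If the variable should apply to a single launched process rather than the whole shell session, one option is to pass it through the child's environment. The sketch below assumes nothing about vLLM itself: `run_with_backend` is a hypothetical helper, and the child command is a stand-in that only echoes the variable, to be replaced with the real launch command.

```python
import os
import subprocess
import sys


def run_with_backend(cmd, backend="FLASHINFER"):
    """Run cmd in a child process with VLLM_ATTENTION_BACKEND set.

    Hypothetical helper for illustration; the parent environment is untouched.
    """
    env = dict(os.environ, VLLM_ATTENTION_BACKEND=backend)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in child that only prints the variable; swap in the actual vLLM command.
probe = [sys.executable, "-c",
         "import os; print(os.environ['VLLM_ATTENTION_BACKEND'])"]
result = run_with_backend(probe)
print(result.stdout.strip())
```

This keeps the setting scoped to the one process, which is convenient when other jobs on the same machine should keep the default backend.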
