Single-Node Deployment: A Guide to Deploying Qwen3-235B with SGLang on the Atlas 800I A3 Server
Author: 昇腾实战派
SGLang knowledge map:
https://blog.csdn.net/weixin_41406651/article/details/156754353?spm=1001.2014.3001.5502
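Background
Deploying large language models often raises challenges in hardware compatibility, environment configuration, and performance tuning. Qwen3-235B can be deployed on either the Atlas 800I A3 or the Atlas 800T A3; this guide uses the Atlas 800I A3 as the example. Drawing on hands-on project experience, it walks through the complete process of deploying Qwen3-235B with the SGLang framework on an Atlas 800I A3 server, covering environment preparation, weight quantization, service launch, and performance testing.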
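I. Versions and environment configuration
1. Hardware specifications
- Machine model: Atlas 800I A3 inference server
2. Docker image configuration
The following image sources are available: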
# docker.io
docker pull lmsysorg/sglang:main-cann8.3.rc1-910b # a2
docker pull lmsysorg/sglang:main-cann8.3.rc1-a3 # a3
# Mirror registry in mainland China
docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:main-cann8.3.rc1-910b # a2
docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:main-cann8.3.rc1-a3 # a3
# Release build (mirror registry in mainland China)
docker pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:v0.5.5-cann8.3.rc1-910b # a2
Container startup command (using the Atlas 800I A3 inference server as an example):
# A3 example
docker run -itd --privileged --name={container_name} --net=host --shm-size=500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /home:/home \
-v /root/.cache:/root/.cache \
swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:main-cann8.3.rc1-a3 bash
docker exec -it {container_name} bash
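The run of --device flags in the command above can be generated rather than typed by hand. The following is a minimal sketch, assuming the NPUs are exposed as /dev/davinci0, /dev/davinci1, and so on, as in that command; `device_flags` is a hypothetical helper name, not part of Docker or the Ascend tooling.

```python
def device_flags(num_npus: int) -> list[str]:
    """Build the --device flags for `docker run`: one per NPU plus the management devices."""
    flags = [f"--device=/dev/davinci{i}" for i in range(num_npus)]
    # Management devices copied from the docker run command above.
    flags += [
        "--device=/dev/davinci_manager",
        "--device=/dev/devmm_svm",
        "--device=/dev/hisi_hdc",
    ]
    return flags


if __name__ == "__main__":
    # 8 matches the davinci0..davinci7 entries in the command above.
    print(" \\\n".join(device_flags(8)))
```

The printed lines can be pasted into the docker run command in place of the hand-written flags.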
3. SGLang version
# SGLang path: /home/sglang_code/sglang
git clone https://github.com/ping1jing2/sglang.git
cd sglang
git checkout -b main_qwen origin/main_qwen
4. Update the triton-ascend package
If the CANN version is >= 8.3.RC1, updating the triton-ascend package is recommended:
pip install triton-ascend -i https://mirrors.huaweicloud.com/repository/pypi/simple
5. Update torch_npu
# Download the whl package that matches the host architecture
# arm
wget https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com:443/sglang/torch_npu/torch_npu-2.8.0.post2.dev20251113-cp311-cp311-manylinux_2_28_aarch64.whl --no-check-certificate
# x86
wget https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/torch_npu/torch_npu-2.8.0.post2.dev20251113-cp311-cp311-manylinux_2_28_x86_64.whl --no-check-certificate
pip install torch_npu*.whl
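Only one of the two wheels applies to a given host, so the choice can be automated. The sketch below is one possible way to do that, assuming the two download URLs listed above remain valid; `torch_npu_wheel_url` is a hypothetical helper, and the URL is built without the explicit :443 port, which is equivalent for HTTPS.

```python
import platform

# Base URL and version string taken from the wget commands above.
BASE_URL = "https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/torch_npu"
VERSION = "2.8.0.post2.dev20251113"

# Map the names reported by platform.machine() to the wheel's architecture tag.
ARCH_TAGS = {"aarch64": "aarch64", "arm64": "aarch64", "x86_64": "x86_64", "amd64": "x86_64"}


def torch_npu_wheel_url(machine: str) -> str:
    """Return the torch_npu wheel URL for the given machine architecture."""
    arch = ARCH_TAGS.get(machine.lower())
    if arch is None:
        raise ValueError(f"no torch_npu wheel listed for architecture {machine!r}")
    return f"{BASE_URL}/torch_npu-{VERSION}-cp311-cp311-manylinux_2_28_{arch}.whl"


if __name__ == "__main__":
    print(torch_npu_wheel_url(platform.machine()))
```

Downloading only the matching wheel also keeps `pip install torch_npu*.whl` from matching two files.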
6. Other environment information
python -m check_env
Python: 3.11.13 (main, Nov 2 2025, 10:27:27) [GCC 11.4.0]
NPU available: True
NPU 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: Ascend910_9362
CANN_HOME: /usr/local/Ascend/ascend-toolkit/latest
CANN: 8.3.0.1.200:8.3.RC1
BiSheng: 2025-10-24T18:53:37+08:00 clang version 15.0.5 (clang-5c68a1cb1231 flang-5c68a1cb1231)
Ascend Driver Version: 25.3.rc1
PyTorch: 2.8.0+cpu
sglang: 0.5.5.post3
sgl_kernel: Module Not Found
flashinfer_python: Module Not Found
flashinfer_cubin: Module Not Found
flashinfer_jit_cache: Module Not Found
triton: Module Not Found
transformers: 4.57.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.13.2
fastapi: 0.122.0
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.32.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 6.0.0
pydantic: 2.12.4
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 1.99.1
tiktoken: 0.12.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
torch_npu: 2.8.0
sgl-kernel-npu: 0.1.0
deep_ep: 1.0.0+a8e003ca
Ascend Topology:
Phy-ID0 Phy-ID1 Phy-ID2 Phy-ID3 Phy-ID4 Phy-ID5 Phy-ID6 Phy-ID7 Phy-ID8 Phy-ID9 Phy-ID10 Phy-ID11 Phy-ID12 Phy-ID13 Phy-ID14 Phy-ID15
Phy-ID0 X SIO HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID1 SIO X HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID2 HCCS_SW HCCS_SW X SIO HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID3 HCCS_SW HCCS_SW SIO X HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID4 HCCS_SW HCCS_SW HCCS_SW HCCS_SW X SIO HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID5 HCCS_SW HCCS_SW HCCS_SW HCCS_SW SIO X HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID6 HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW X SIO HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID7 HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW SIO X HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID8 HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW X SIO HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID9 HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW SIO X HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID10 HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW X SIO HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID11 HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW SIO X HCCS_SW HCCS_SW HCCS_SW HCCS_SW
Phy-ID12 HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW X SIO HCCS_SW HCCS_SW
Phy-ID13 HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW SIO X HCCS_SW HCCS_SW
Phy-ID14 HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW X SIO
Phy-ID15 HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW HCCS_SW SIO X
Legend:
X = Self
SYS = Path traversing PCIe and NUMA nodes. Nodes are connected through SMP, such as QPI, UPI.
PHB = Path traversing PCIe and the PCIe host bridge of a CPU.
PIX = Path traversing a single PCIe switch
PXB = Path traversing multipul PCIe switches
HCCS = Connection traversing HCCS.
SIO = Path traversing the SIO bus
HCCS_SW = Connection traversing HCCS through a switch
NA = Unknown relationship.
ulimit soft: 1073741816
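When comparing a new machine against the report above, it can help to turn the output into a dictionary. This is a rough sketch, assuming the report keeps the "key: value" line format shown above; `parse_env_report` and `missing_modules` are hypothetical helpers, and SAMPLE is a small excerpt of the report. Note that the report above lists several packages as Module Not Found, so a missing entry is not by itself a sign of a broken environment.

```python
def parse_env_report(text: str) -> dict[str, str]:
    """Parse 'key: value' lines; lines without ': ' (tables, legends) are ignored."""
    report = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep and key.strip():
            report[key.strip()] = value.strip()
    return report


def missing_modules(report: dict[str, str]) -> list[str]:
    """Return the names of entries reported as 'Module Not Found'."""
    return sorted(name for name, value in report.items() if value == "Module Not Found")


# Excerpt of the report above, used only for illustration.
SAMPLE = """\
NPU available: True
CANN: 8.3.0.1.200:8.3.RC1
sglang: 0.5.5.post3
triton: Module Not Found
torch_npu: 2.8.0
"""

if __name__ == "__main__":
    report = parse_env_report(SAMPLE)
    print(report["CANN"], missing_modules(report))
```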
II. Weight quantization
Quantize the Qwen3-235B weights; this requires CANN >= 8.3.RC1.
# Install msmodelslim
git clone https://gitcode.com/Ascend/msit.git
cd msit/msmodelslim
bash install.sh # Reinstall after every CANN update
# Quantize
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:False
cd example/Qwen3-MOE
python3 quant_qwen_moe_w8a8.py --model_path {path to the floating-point weights} \
--save_path {path for the W8A8 quantized weights} \
--anti_dataset ../common/qwen3-moe_anti_prompt_50.json \
--calib_dataset ../common/qwen3-moe_calib_prompt_50.json \
--trust_remote_code True
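A back-of-the-envelope estimate shows why W8A8 (8-bit weights, 8-bit activations) matters at this scale. The sketch below assumes a nominal 235 billion parameters, taken from the model name, and counts weights only; it ignores quantization scales, any layers left in higher precision, the KV cache, and activations, so the real footprint will differ.

```python
# Nominal parameter count taken from the model name; an approximation.
PARAMS = 235e9


def weight_gib(num_params: float, bytes_per_param: int) -> float:
    """Weights-only memory footprint in GiB."""
    return num_params * bytes_per_param / 2**30


bf16_gib = weight_gib(PARAMS, 2)  # 2 bytes per parameter in bfloat16
int8_gib = weight_gib(PARAMS, 1)  # 1 byte per parameter after INT8 weight quantization

if __name__ == "__main__":
    print(f"bf16: {bf16_gib:.0f} GiB, int8: {int8_gib:.0f} GiB")
```

Under these assumptions the weights shrink from roughly 438 GiB to roughly 219 GiB, which leaves more device memory for the KV cache and concurrent requests.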
III. Launch script
pkill -9 python; pkill -9 sglang
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=50000
export SGLANG_SET_CPU_AFFINITY=1
# Set PYTHONPATH
cd /home/sglang_code/sglang
export PYTHONPATH=${PWD}/python:$PYTHONPATH
unset https_proxy
unset http_proxy
unset HTTPS_PROXY
unset HTTP_PROXY
unset ASCEND_LAUNCH_BLOCKING
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=24
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
MODEL_PATH=/home/weight/Qwen3-235B-A22B-Instruct-2507-w8a8
export INF_NAN_MODE_FORCE_DISABLE=1
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
LOCAL_HOST1=`hostname -I|awk -F " " '{print$1}'`
LOCAL_HOST2=`hostname -I|awk -F " " '{print$2}'`
echo "${LOCAL_HOST1}"
echo "${LOCAL_HOST2}"
export HCCL_BUFFSIZE=2100
export HCCL_SOCKET_IFNAME=data0.3001
export GLOO_SOCKET_IFNAME=data0.3001
export HCCL_OP_EXPANSION_MODE="AIV"
export ENABLE_ASCEND_MOE_NZ=1
python -m sglang.launch_server --model-path $MODEL_PATH --served-model-name qwen3 \
--host 141.61.81.61 --port 8022 --trust-remote-code --nnodes 1 --node-rank 0 \
--attention-backend ascend --device npu --quantization w8a8_int8 \
--max-running-requests 576 --context-length 8192 --dtype bfloat16 \
--chunked-prefill-size 102400 --max-prefill-tokens 458880 \
--disable-radix-cache --moe-a2a-backend deepep --deepep-mode auto --watchdog-timeout 9000 \
--tp 16 --dp-size 16 --enable-dp-attention --enable-dp-lm-head --mem-fraction-static 0.8 --cuda-graph-bs {6,8,10,11,12,18,36}
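Once the server reports that it is ready, a single request makes a quick smoke test before running a full benchmark. The sketch below is a minimal example using only the Python standard library, assuming the server exposes the OpenAI-compatible /v1/completions endpoint that the benchmark section targets; `build_completion_request` and `send_completion` are hypothetical helpers, and the host must be replaced with the value passed to --host.

```python
import json
import urllib.request


def build_completion_request(host: str, port: int, prompt: str,
                             model: str = "qwen3", max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens, "temperature": 0}
    return urllib.request.Request(
        f"http://{host}:{port}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

For example, `send_completion(build_completion_request("x.x.x.x", 8022, "Hello"))` should return a short completion; the model name "qwen3" matches --served-model-name in the launch command.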
IV. Benchmark testing
1. vllm bench
bs=256
in=2048
out=2048
vllm bench serve --model "qwen3" \
--tokenizer "/home/weight/Qwen3-235B-A22B-Instruct-2507-w8a8/" \
--ignore-eos \
--dataset-name random \
--random-input-len $in \
--random-output-len $out \
--num-prompts $((bs * 5)) \
--max-concurrency $bs \
--request-rate inf \
--percentile-metrics ttft,tpot,itl,e2el \
--host x.x.x.x \
--port 8022 \
--backend openai \
--endpoint /v1/completions
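To compare several concurrency levels, the same command can be generated in a loop. The sketch below only reuses the flags shown above and keeps the convention of 5 prompts per concurrent request; `bench_command` is a hypothetical helper, and the concurrency values 64 and 128 in the usage example are illustrative assumptions, not settings from the original test.

```python
import shlex

TOKENIZER = "/home/weight/Qwen3-235B-A22B-Instruct-2507-w8a8/"


def bench_command(bs: int, input_len: int, output_len: int,
                  host: str, port: int = 8022) -> list[str]:
    """Build the `vllm bench serve` argument list for one concurrency level."""
    return [
        "vllm", "bench", "serve",
        "--model", "qwen3",
        "--tokenizer", TOKENIZER,
        "--ignore-eos",
        "--dataset-name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--num-prompts", str(bs * 5),  # 5 prompts per concurrent request, as above
        "--max-concurrency", str(bs),
        "--request-rate", "inf",
        "--percentile-metrics", "ttft,tpot,itl,e2el",
        "--host", host,
        "--port", str(port),
        "--backend", "openai",
        "--endpoint", "/v1/completions",
    ]


if __name__ == "__main__":
    for bs in (64, 128, 256):  # illustrative sweep; only 256 comes from the text above
        print(shlex.join(bench_command(bs, 2048, 2048, "x.x.x.x")))
```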
2. ais_bench
Install ais_bench
git clone https://github.com/AISBench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
Generate the GSM8K test data
import json
from transformers import AutoTokenizer
# Load the tokenizer from the quantized Qwen3 weight directory
tokenizer = AutoTokenizer.from_pretrained("/home/weight/Qwen3-235B-A22B-Instruct-2507-w8a8/")
batch_size = 4096  # number of prompts to generate
input_len = 2048   # target prompt length in tokens
dataset = []
dataset_path = "/home/benchmark/process_gsm8k/GSM8K.jsonl"
with open(dataset_path, 'r', encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        dataset.append(data['question'])
# Repeat each question's tokens, then truncate the token list to input_len
dataset_2k = []
for sentence in dataset:
    words = tokenizer.tokenize(sentence)
    if not words:
        continue  # skip empty questions to avoid dividing by zero
    print(len(words))
    multiplier = (input_len // len(words)) + 1
    words = (words * multiplier)[:input_len]
    decoded_text = tokenizer.convert_tokens_to_string(words)
    print(len(words))
    dataset_2k.append(decoded_text)
# Repeat the prompts, then truncate the list to batch_size entries
batch_num = len(dataset_2k) // batch_size
if batch_num == 0:
    multiplier = (batch_size // len(dataset_2k)) + 1
    repeated_batch = dataset_2k * multiplier
    dataset_2k = repeated_batch[:batch_size]
else:
    dataset_2k = dataset_2k[:batch_size]
print(len(dataset_2k))
# Write one JSON object per line; the answer field is a placeholder
with open(f'GSM8K-in{input_len}-bs{batch_size}.jsonl', 'w', encoding='utf-8') as f:
    for i in range(len(dataset_2k)):
        f.write(json.dumps({"question": dataset_2k[i], "answer": "none"}, ensure_ascii=False))
        f.write("\n")
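The script above uses the same padding idea twice: repeat a sequence until it is at least as long as the target, then cut it to the target length. A minimal, tokenizer-free sketch of that step, with `repeat_to_length` as a hypothetical helper name:

```python
def repeat_to_length(items: list, target_len: int) -> list:
    """Repeat `items` cyclically and truncate so the result has exactly `target_len` elements."""
    if not items:
        raise ValueError("cannot pad an empty sequence")
    repeats = -(-target_len // len(items))  # ceiling division
    return (items * repeats)[:target_len]


if __name__ == "__main__":
    print(repeat_to_length([1, 2, 3], 7))
```

The same function covers sequences that are already longer than the target, since the final slice simply truncates them.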
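After generating the dataset, copy it into the benchmark/ais_bench/datasets/gsm8k directory and rename it to test.jsonl.
Note: also create an empty train.jsonl in the same directory; otherwise the run fails with an error.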
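The copy-and-rename step can be scripted so that the empty train.jsonl is never forgotten. This is a small sketch using only the standard library; `install_dataset` is a hypothetical helper, and the paths in the usage comment assume the script is run from the directory that contains both the generated file and the benchmark checkout.

```python
import shutil
from pathlib import Path


def install_dataset(generated: Path, gsm8k_dir: Path) -> None:
    """Copy the generated file to test.jsonl and make sure train.jsonl exists."""
    gsm8k_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(generated, gsm8k_dir / "test.jsonl")
    # touch() creates an empty file if train.jsonl is missing and leaves an existing one untouched.
    (gsm8k_dir / "train.jsonl").touch()


# Example call (paths are assumptions based on the steps above):
# install_dataset(Path("GSM8K-in2048-bs4096.jsonl"),
#                 Path("benchmark/ais_bench/datasets/gsm8k"))
```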
Modify the configuration file
cd benchmark
vim ais_bench/benchmark/configs/models/vllm_api/vllm_api_stream_chat.py
from ais_bench.benchmark.models import VLLMCustomAPIChatStream
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content
models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChatStream,
        abbr='vllm-api-stream-chat',
        path="/home/weight/Qwen3-235B-A22B-Instruct-2507-w8a8/",
        model="qwen3",
        request_rate = 0,
        retry = 2,
        host_ip = "x.x.x.x",
        host_port = 8022,
        max_out_len = 2048, # output length
        batch_size=256, # maximum concurrency
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0,
            ignore_eos = True,
            #top_k = 10,
            #top_p = 0.95,
            #seed = None,
            #repetition_penalty = 1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]
Run the benchmark with --num-prompts set to 5 times the maximum concurrency (1280 for batch_size=256):
ais_bench --models vllm_api_stream_chat --datasets gsm8k_gen_0_shot_cot_str_perf --debug --summarizer default_perf --mode perf --num-prompts {5 * maximum concurrency}