Local deployment of Qwen3.5-27B with vLLM and Claude Code
This article shows how to deploy the Qwen3.5-27B large language model locally on Ubuntu 22.04 using vLLM (nightly build). It covers: 1) hardware requirements, with 64 GB of VRAM recommended; 2) downloading the Qwen3.5-27B model; 3) pulling the vLLM nightly Docker image; 4) writing a docker-compose.yml with the model path, port mapping, and runtime flags; 5) starting the service and verifying it is running. The deployment uses a single GPU and supports tool calling.
No mysticism, just shipping.
I am a hands-on practitioner of algorithm engineering, focused on turning new and key AI/ML techniques from papers and demos into production systems that scale. Here you won't find theory padded with formulas, only the pitfalls hit and paths taken in real projects; every article is distilled from hands-on experience. I believe the value of technology lies in solving real problems, not in manufacturing anxiety. If you're tired of "bookmarking equals learning" and want the hard skills to actually get algorithms running, this is your technical supply station.

Environment
Ubuntu 22.04 + vLLM >= 0.16 (nightly build)
64 GB of VRAM or more is recommended.
(vLLM now supports the Claude SDK protocol natively, so there is no need to put a LiteLLM proxy in front of it. Benchmarks around the web claim SGLang runs 20%-30% faster than vLLM thanks to its RadixAttention cache mechanism; worth testing next time.)
$ nvidia-smi
Tue Mar 3 21:07:51 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A800 80GB PCIe On | 00000000:31:00.0 Off | 0 |
| N/A 35C P0 68W / 300W | 1609MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A800 80GB PCIe On | 00000000:B1:00.0 Off | 0 |
| N/A 33C P0 66W / 300W | 34989MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 272582 C text-embeddings-router 1586MiB |
| 0 N/A N/A 1535563 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 221801 C VLLM::EngineCore 34966MiB |
| 1 N/A N/A 1535563 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------------------+
Download the Qwen3.5-27B model
# git-lfs must be installed first, otherwise the clone only fetches LFS pointer files
git lfs install
git clone https://huggingface.co/Qwen/Qwen3.5-27B
Pull the vLLM image
As of 2026-03-03 the latest tag does not yet include the Qwen3.5-27B configuration, so pull the daily-updated nightly build instead:
docker pull vllm/vllm-openai:nightly
docker-compose.yml configuration
services:
  # 1. vLLM inference engine container
  vllm-qwen3.5-agent:
    #image: vllm/vllm-openai:v0.16.0
    #image: vllm/vllm-openai:latest
    image: vllm/vllm-openai:nightly
    container_name: vllm-qwen
    volumes:
      - /bigdata/huggingface:/root/.cache/huggingface # map the model cache directory
    environment:
      - CUDA_VISIBLE_DEVICES=0
      - VLLM_USE_V1=0
    ports:
      - "8210:8000"
    ipc: host # share host memory; avoids multi-GPU communication errors
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    # Note: if VRAM is tight, add --quantization bitsandbytes or awq to the command
    command: >
      --model /root/.cache/huggingface/Qwen3.5-27B
      --host 0.0.0.0
      --port 8000
      --trust-remote-code
      --gpu_memory_utilization 0.8
      --max-model-len 200000
      --enforce-eager
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --served-model-name qwen3.5-27b
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
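A quick sanity check on --gpu_memory_utilization 0.8: vLLM pre-allocates that fraction of total VRAM, loads the weights into it, and uses whatever is left for KV cache. A back-of-the-envelope sketch (taking the 27B parameter count at face value and assuming BF16 weights at ~2 bytes per parameter):

```python
def vram_budget_gb(total_gb: float, utilization: float,
                   params_b: float, bytes_per_param: int = 2) -> float:
    """VRAM left for KV cache after vLLM claims its pool and loads the weights."""
    pool = total_gb * utilization          # what vLLM pre-allocates
    weights = params_b * bytes_per_param   # BF16 weights, ~2 bytes per parameter
    return pool - weights

# A800 80GB card, utilization 0.8, 27B parameters in BF16:
leftover = vram_budget_gb(80, 0.8, 27)
print(f"{leftover:.0f} GB left for KV cache")  # → 10 GB left for KV cache
```

With only ~10 GB of cache headroom, a 200k-token context leaves little room for concurrent requests, which is exactly where the quantization note in the compose file comes in.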
Start the service
$ docker compose up -d
WARN[0000] No services to build
[+] up 2/2
✔ Network vllm_claude_code_default Created 0.0s
✔ Container vllm-qwen Created
$ docker compose logs -f --tail 10
vllm-qwen | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
vllm-qwen | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /v1/completions, Methods: POST
vllm-qwen | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /v1/completions/render, Methods: POST
vllm-qwen | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /v1/messages, Methods: POST
vllm-qwen | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
vllm-qwen | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
vllm-qwen | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
vllm-qwen | (APIServer pid=1) INFO: Started server process [1]
vllm-qwen | (APIServer pid=1) INFO: Waiting for application startup.
vllm-qwen | (APIServer pid=1) INFO: Application startup complete.
Once you see "Application startup complete.", the server is up!
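Before pointing Claude Code at the server, you can poke it directly. The route list above shows an Anthropic-style /v1/messages endpoint; below is a minimal sketch of a request against it. The payload shape follows the standard Anthropic Messages format, and the network call is left commented out so you only fire it once the container is healthy:

```python
import json
import urllib.request

# Anthropic-style request body; /v1/messages appears in the startup log above.
payload = {
    "model": "qwen3.5-27b",  # must match --served-model-name in docker-compose.yml
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}
body = json.dumps(payload).encode()

req = urllib.request.Request(
    "http://localhost:8210/v1/messages",  # host port mapped in docker-compose.yml
    data=body,
    headers={"Content-Type": "application/json", "x-api-key": "dummy"},
)
# Uncomment once the container reports healthy:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```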
Install Claude Code
Reference: https://code.claude.com/docs/en/setup
curl -fsSL https://claude.ai/install.sh | bash
Configure the Claude environment variables
# add to ~/.bashrc or ~/.zshrc:
export ANTHROPIC_BASE_URL="http://localhost:8210"
export ANTHROPIC_API_KEY="dummy"
export ANTHROPIC_AUTH_TOKEN="dummy"
export ANTHROPIC_DEFAULT_OPUS_MODEL="qwen3.5-27b"
export ANTHROPIC_DEFAULT_SONNET_MODEL="qwen3.5-27b"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="qwen3.5-27b"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
Then reload the shell configuration:
source ~/.bashrc
Test
$ claude
╭─── Claude Code v2.1.63 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ │ Tips for getting started │
│ Welcome back! │ Run /init to create a CLAUDE.md file with instructions for Claude │
│ │ Note: You have launched claude in your home directory. For the best experience, launch it in a project directory instead. │
│ │ ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│ ▐▛███▜▌ │ Recent activity │
│ ▝▜█████▛▘ │ No recent activity │
│ ▘▘ ▝▝ │ │
│ qwen3.5-27b · API Usage Billing │ │
│ /home/algo │ │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
/model to try Opus 4.6
❯ Check how much GPU memory is in use
● The user is asking about GPU memory usage; I'll run nvidia-smi to inspect it.
Let me check the GPU memory usage for you.
● Bash(nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv)
⎿ memory.total [MiB], memory.used [MiB], memory.free [MiB]
81920 MiB, 66922 MiB, 14230 MiB
81920 MiB, 34989 MiB, 46164 MiB
● nvidia-smi returned memory info for two GPUs; I'll lay out each card's total, used, and free memory plus its utilization.
Your system has two GPUs; memory usage is as follows:
┌─────┬────────┬─────────┬─────────┬───────┐
│ GPU │ Total  │ Used    │ Free    │ Usage │
├─────┼────────┼─────────┼─────────┼───────┤
│ 0   │ 80 GB  │ 65.4 GB │ 13.9 GB │ 81.7% │
├─────┼────────┼─────────┼─────────┼───────┤
│ 1   │ 80 GB  │ 34.1 GB │ 45.1 GB │ 42.7% │
└─────┴────────┴─────────┴─────────┴───────┘
Summary:
- GPU 0 is heavily used (~82%), with little memory left
- GPU 1 is lightly used (~43%), with plenty of memory available
✻ Baked for 35s
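The --format=csv query Claude ran above returns plain CSV, so the same report is easy to reproduce without an LLM in the loop. A sketch that parses the exact output shown in the session (the sample is hard-coded, since nvidia-smi isn't available everywhere):

```python
import csv
import io

# Sample copied verbatim from the nvidia-smi query output in the session above.
raw = """memory.total [MiB], memory.used [MiB], memory.free [MiB]
81920 MiB, 66922 MiB, 14230 MiB
81920 MiB, 34989 MiB, 46164 MiB"""

rows = list(csv.reader(io.StringIO(raw)))
usage = {}
for gpu_id, row in enumerate(rows[1:]):  # skip the header row
    # each cell looks like "81920 MiB"; keep just the number
    total, used, _free = (int(cell.strip().split()[0]) for cell in row)
    usage[gpu_id] = used / total
    print(f"GPU {gpu_id}: {used} / {total} MiB ({used / total:.1%} used)")
```

Running it reproduces the 81.7% / 42.7% figures from the table above.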
All of this is groundwork for raising shrimp with openclaw.