Local deployment of Qwen3.5-27B with vLLM and Claude Code

No mysticism, just things that ship.
I'm a hands-on practitioner of algorithm engineering, focused on turning new and key AI/ML techniques from papers and demos into production systems that deploy at scale. You won't find formula-stuffed theorizing here, only the pitfalls hit and paths taken in real projects; every article is distilled from hands-on experience. I believe the value of technology lies in solving real problems, not manufacturing anxiety. If you're tired of "bookmarked means learned" and want the hard skills to actually get algorithms running, this is your technical supply depot.

Environment

Ubuntu 22.04 + vLLM >= 0.16 (nightly build)
Ideally 64 GB+ of GPU memory
(vLLM now supports the Claude SDK protocol, so there is no need to put a separate LiteLLM proxy in front of it. People online claim SGLang beats vLLM by 20-30% thanks to its RadixAttention cache; I'll benchmark that next time.)

$ nvidia-smi
Tue Mar  3 21:07:51 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A800 80GB PCIe          On  |   00000000:31:00.0 Off |                    0 |
| N/A   35C    P0             68W /  300W |    1609MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A800 80GB PCIe          On  |   00000000:B1:00.0 Off |                    0 |
| N/A   33C    P0             66W /  300W |   34989MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A          272582      C   text-embeddings-router                 1586MiB |
|    0   N/A  N/A         1535563      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A          221801      C   VLLM::EngineCore                      34966MiB |
|    1   N/A  N/A         1535563      G   /usr/lib/xorg/Xorg                        4MiB |
+-----------------------------------------------------------------------------------------+

Download the Qwen3.5-27B model

git clone https://huggingface.co/Qwen/Qwen3.5-27B
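
The checkpoint shards are Git LFS objects, so a bare clone may only fetch pointer files. A minimal sketch, assuming git-lfs is installed and cloning into the /bigdata/huggingface directory that the compose file below maps into the container:

# git-lfs is assumed to be available; the target path matches the volume in docker-compose.yml
sudo apt-get install -y git-lfs
git lfs install
git clone https://huggingface.co/Qwen/Qwen3.5-27B /bigdata/huggingface/Qwen3.5-27B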

Pull the vLLM image

As of 2026-03-03 the latest tag does not yet include the Qwen3.5-27B configuration, so pull the nightly build for now.

docker pull vllm/vllm-openai:nightly
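
To confirm the nightly actually ships a new enough vLLM before going further, one quick sanity check is to print the package version from inside the image:

# prints the vLLM version bundled in the nightly image
docker run --rm --entrypoint python3 vllm/vllm-openai:nightly -c "import vllm; print(vllm.__version__)"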

docker-compose.yml configuration

services:
  # 1. vLLM inference engine container
  vllm-qwen3.5-agent:
    #image: vllm/vllm-openai:v0.16.0
    #image: vllm/vllm-openai:latest
    image: vllm/vllm-openai:nightly
    container_name: vllm-qwen
    volumes:
      - /bigdata/huggingface:/root/.cache/huggingface # map the model cache directory
    environment:
      - CUDA_VISIBLE_DEVICES=0 
      - VLLM_USE_V1=0
    ports:
      - "8210:8000"
    ipc: host # shared memory; prevents multi-GPU communication errors
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    # Note: if GPU memory is tight, add --quantization bitsandbytes or awq to the command below
    command: >
      --model /root/.cache/huggingface/Qwen3.5-27B
      --host 0.0.0.0
      --port 8000
      --trust-remote-code
      --gpu_memory_utilization 0.8
      --max-model-len 200000
      --enforce-eager
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --served-model-name qwen3.5-27b
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
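
If the card cannot hold the BF16 weights, the comment in the compose file suggests quantization. A sketch of the two lines that change, assuming an AWQ checkpoint has been downloaded (the Qwen3.5-27B-AWQ repo name here is hypothetical; substitute whatever quantized variant you actually have, and keep the remaining flags as above):

    command: >
      --model /root/.cache/huggingface/Qwen3.5-27B-AWQ
      --quantization awq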

Start

$ docker compose up -d
WARN[0000] No services to build                         
[+] up 2/2
 ✔ Network vllm_claude_code_default Created                                                                                                                     0.0s 
 ✔ Container vllm-qwen              Created
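
The healthcheck in the compose file polls /health every 30s, so the container state shows when the engine is actually ready:

$ docker compose ps                                              # STATUS shows (healthy) once the model is loaded
$ docker inspect --format '{{.State.Health.Status}}' vllm-qwen   # starting -> healthy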


$ docker compose logs -f --tail 10
vllm-qwen  | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
vllm-qwen  | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /v1/completions, Methods: POST
vllm-qwen  | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /v1/completions/render, Methods: POST
vllm-qwen  | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /v1/messages, Methods: POST
vllm-qwen  | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
vllm-qwen  | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
vllm-qwen  | (APIServer pid=1) INFO 03-03 13:11:01 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
vllm-qwen  | (APIServer pid=1) INFO:     Started server process [1]
vllm-qwen  | (APIServer pid=1) INFO:     Waiting for application startup.
vllm-qwen  | (APIServer pid=1) INFO:     Application startup complete.

Seeing "Application startup complete." means the server is up!
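
Before pointing Claude Code at the server, you can smoke-test the Anthropic-style /v1/messages route listed in the startup log. A minimal sketch following the Anthropic Messages wire format (header and field names are from that spec; the key value is arbitrary since vLLM is not enforcing auth here):

$ curl -sf http://localhost:8210/health && echo OK   # liveness check
$ curl -s http://localhost:8210/v1/messages \
    -H "content-type: application/json" \
    -H "x-api-key: dummy" \
    -H "anthropic-version: 2023-06-01" \
    -d '{"model": "qwen3.5-27b", "max_tokens": 64, "messages": [{"role": "user", "content": "ping"}]}'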

Install Claude Code

Reference: https://code.claude.com/docs/en/setup

curl -fsSL https://claude.ai/install.sh | bash
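
A quick check that the CLI landed on your PATH (the version number will differ):

$ claude --version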

Configure the Claude environment variables

# add to ~/.bashrc or ~/.zshrc:
export ANTHROPIC_BASE_URL="http://localhost:8210"
export ANTHROPIC_API_KEY="dummy"
export ANTHROPIC_AUTH_TOKEN="dummy"
export ANTHROPIC_DEFAULT_OPUS_MODEL="qwen3.5-27b"
export ANTHROPIC_DEFAULT_SONNET_MODEL="qwen3.5-27b"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="qwen3.5-27b"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1

Then reload the environment variables:

source ~/.bashrc
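
The three ANTHROPIC_DEFAULT_*_MODEL variables route every tier Claude Code may request (Opus/Sonnet/Haiku) to the locally served qwen3.5-27b. To verify the variables are live in the current shell:

$ env | grep -E 'ANTHROPIC|CLAUDE_CODE'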

Test

$ claude

╭─── Claude Code v2.1.63 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                     │ Tips for getting started                                                                                                  │
│            Welcome back!            │ Run /init to create a CLAUDE.md file with instructions for Claude                                                         │
│                                     │ Note: You have launched claude in your home directory. For the best experience, launch it in a project directory instead. │
│                                     │ ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── │
│               ▐▛███▜▌               │ Recent activity                                                                                                           │
│              ▝▜█████▛▘              │ No recent activity                                                                                                        │
│                ▘▘ ▝▝                │                                                                                                                           │
│   qwen3.5-27b · API Usage Billing   │                                                                                                                           │
│             /home/algo              │                                                                                                                           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  /model to try Opus 4.6

❯ Check how much GPU memory is in use

● The user is asking about GPU memory usage; I should run nvidia-smi to get the GPU memory info.


  Let me check the GPU memory usage for you.

● Bash(nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv)
  ⎿  memory.total [MiB], memory.used [MiB], memory.free [MiB]
     81920 MiB, 66922 MiB, 14230 MiB
     81920 MiB, 34989 MiB, 46164 MiB

● nvidia-smi returned memory info for two GPUs; I need to organize it and present it clearly: each card's total, used, and free memory plus utilization percentage.


  Your system has two GPUs; memory usage is as follows:

  ┌─────┬────────┬─────────┬─────────┬───────┐
  │ GPU │ Total  │  Used   │  Free   │ Used% │
  ├─────┼────────┼─────────┼─────────┼───────┤
  │  0  │ 80 GB  │ 65.4 GB │ 13.9 GB │ 81.7% │
  ├─────┼────────┼─────────┼─────────┼───────┤
  │  1  │ 80 GB  │ 34.1 GB │ 45.1 GB │ 42.8% │
  └─────┴────────┴─────────┴─────────┴───────┘

  Summary:
  - GPU 0 is heavily used (~82%), with little memory left
  - GPU 1 is lightly used (~43%), with plenty of memory available

✻ Baked for 35s

Next up: getting ready to raise shrimp with OpenClaw.
