References:

[Bug]: ValueError: No available memory for the cache blocks on main branch after commit 46f98893 · Issue #14992 · vllm-project/vllm · GitHub

Support Needed: `ValueError: No available memory for the cache blocks` with Mistral-Nemo-12B-Instruct on NVIDIA GeForce RTX 4090 (16GB) in Docker - Models - NVIDIA Developer Forums

Workarounds:

1. Add a cap via max_seq_len (did not work) -> likely the 16 GB of VRAM is simply insufficient

from vllm import LLM, SamplingParams

model_name = "./dir"
# or switch to "mistralai/Mistral-Nemo-Instruct-2407"
# or "mistralai/Mistral-Large-Instruct-2407"
# or any other Mistral model with function-calling ability

sampling_params = SamplingParams(max_tokens=8192, temperature=0.0)

"""
LLM (user interface) -> LLMEngine (developer interface)
    An offline inference tool for text generation that bundles a tokenizer,
    a language model (with distributed GPU support), and KV cache management.
    Params:
        1. model: The name or path of a HuggingFace Transformers model
           (a local path also works).
        2. tokenizer: The name or path of a HuggingFace Transformers tokenizer.
        3. tokenizer_mode: The tokenizer mode. "auto" will use the fast tokenizer
           if available, and "slow" will always use the slow tokenizer.
        4. max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs.
           When a sequence has context length larger than this, we fall back
           to eager mode. Additionally, for encoder-decoder models, if the
           sequence length of the encoder input is larger than this, we fall
           back to eager mode.
"""
llm = LLM(model=model_name, max_seq_len_to_capture=256)  # added max_seq_len_to_capture
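A back-of-envelope estimate helps explain why this fix cannot work on a 16 GB card: the model weights alone already exceed the available VRAM, so no memory is left for KV cache blocks. This is only a rough sketch; the parameter count and architecture values below (40 layers, 8 GQA KV heads, head dim 128) are assumptions in line with a Mistral-Nemo-class model, not figures taken from the posts above.

```python
# Rough VRAM estimate for a ~12B-parameter model served in bf16/fp16.
# All architecture numbers here are illustrative assumptions.

GiB = 2**30

params = 12.2e9          # ~12B weights (assumed)
dtype_bytes = 2          # bf16/fp16
weight_gib = params * dtype_bytes / GiB
print(f"weights alone: {weight_gib:.1f} GiB")   # ~22.7 GiB, already > 16 GiB

# KV cache cost per token: K and V tensors per layer, GQA-style KV heads.
num_layers, num_kv_heads, head_dim = 40, 8, 128  # assumed config
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache for 8192 tokens: {kv_bytes_per_token * 8192 / GiB:.2f} GiB")
```

Since the weights by themselves do not fit, shrinking max_seq_len_to_capture cannot help: as the docstring above notes, that parameter only controls when vLLM falls back from CUDA graphs to eager mode, not how much KV cache is allocated. The knobs that actually trade memory in vLLM are max_model_len (caps KV cache per sequence) and gpu_memory_utilization (fraction of VRAM vLLM may claim), but on this GPU the only real options are a smaller model or a quantized checkpoint.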
