ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` (vllm)
Solution:
1. Limit max_seq_len (did not work) -> the 16 GB of GPU memory is probably just insufficient.
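The "16 GB is not enough" hypothesis can be sanity-checked with back-of-envelope arithmetic. The model dimensions below (a ~12B-parameter model with 40 layers, 8 KV heads, head dim 128, fp16 weights) are assumptions for illustration, roughly Mistral-Nemo-like; substitute the values from your model's config.json.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # One K and one V tensor per layer; fp16/bf16 = 2 bytes per element.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed, Mistral-Nemo-like dimensions (check config.json for real values).
per_token = kv_cache_bytes_per_token(num_layers=40, num_kv_heads=8, head_dim=128)
context_len = 8192  # matches the max_tokens used below
cache_gib = per_token * context_len / 1024**3

params_billion = 12  # assumed parameter count
weights_gib = params_billion * 1e9 * 2 / 1024**3  # fp16 weights

print(f"KV cache: {per_token} bytes/token, {cache_gib:.2f} GiB per {context_len}-token sequence")
print(f"Weights alone: {weights_gib:.2f} GiB")
```

Under these assumptions the fp16 weights alone (~22 GiB) already exceed 16 GB, so vLLM has nothing left to allocate for cache blocks regardless of the sequence-length cap, which matches the error.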
from vllm import LLM, SamplingParams

model_name = "./dir"  # local model directory
# or switch to "mistralai/Mistral-Nemo-Instruct-2407"
# or "mistralai/Mistral-Large-Instruct-2407"
# or any other Mistral model with function-calling ability
sampling_params = SamplingParams(max_tokens=8192, temperature=0.0)
"""
LLM (use Interface) -> LLMEngine(developer Interface)
一个用于文本生成的离线推理工具,集成了分词器、语言模型(支持分布式 GPU)和 KV 缓存管理
Parms:
1. model: The name or path of a HuggingFace Transformers model. (本地路径也行)
2. tokenizer: The name or path of a HuggingFace Transformers tokenizer.
3. tokenizer_mode: The tokenizer mode. "auto" will use the fast tokenizer
if available, and "slow" will always use the slow tokenizer.
4. max_seq_len_to_capture: Maximum sequence len covered by CUDA graphs.
When a sequence has context length larger than this, we fall back
to eager mode. Additionally for encoder-decoder models, if the
sequence length of the encoder input is larger than this, we fall
back to the eager mode.
"""
llm = LLM(model=model_name, max_seq_len_to_capture=256)  # add max_seq_len_to_capture
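Since capping max_seq_len_to_capture did not free any cache blocks, the knobs worth trying next are the ones the error message itself points at: gpu_memory_utilization (the fraction of GPU memory vLLM may claim, default 0.9) and max_model_len (caps the context length and therefore the KV cache that must be reserved). A minimal sketch, not a verified fix, reusing the model_name defined above:

```python
from vllm import LLM

llm = LLM(
    model=model_name,             # as defined above
    gpu_memory_utilization=0.95,  # default is 0.9
    max_model_len=4096,           # shrink the KV cache that must be reserved
)
```

Note that if the model weights themselves exceed 16 GB, no utilization setting will help; a quantized checkpoint or a smaller model is the only way out.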