For the "vllm RuntimeError: Failed to infer device type" error you are hitting, here are a few possible troubleshooting steps. Confirm the complete error message and its context: the output usually contains details that are critical for diagnosis, so make sure you have read the full error output and understand the context in which it occurs. If possible, provide the complete output, as that helps pinpoint the problem more precisely. Check in the code...
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 12345 --max-model-len 65536 --trust-remote-code --tensor-parallel-size 8 --quantization moe_wna16 --gpu-memory-utilization 0.97 --kv-cache-dtype fp8_e5m2 --calculate-kv-scales --served-model-name deepseek-reasoner -...
packages/vllm/config.py", line 1091, in __init__ raise RuntimeError("Failed to infer device type") RuntimeError: Failed to infer device type Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment...
PS: Also, vLLM has a bug in the CUDA >= 12.4 + SM < 90 combination; I fixed it in passing, PR: https://github.com/vllm-project/vllm/pull/14796. Many of the strange Triton MLA errors seen recently also appear to be caused by this bug. In any case, after the fix it runs happily on everything from CUDA 12.2 to CUDA 12.6.
# Using nvcr.io/nvidia/pytorch:24.05-py3
docker run...
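To tell whether a given machine falls into that CUDA >= 12.4 + SM < 90 combination, the toolkit version and compute capability can be read from PyTorch. A small sketch (the thresholds simply mirror the description above; note that torch.version.cuda is the toolkit PyTorch was built against, which can differ from the driver version):

```python
# Sketch: check for the CUDA >= 12.4 + SM < 90 combination described above.
# torch.version.cuda is the CUDA toolkit version of the installed PyTorch
# build (None for CPU-only builds).
import torch

assert torch.version.cuda is not None, "CPU-only PyTorch build"
cuda_ver = tuple(int(x) for x in torch.version.cuda.split("."))  # e.g. (12, 4)
sm_major, sm_minor = torch.cuda.get_device_capability(0)         # e.g. (8, 6) = SM86

if cuda_ver >= (12, 4) and sm_major < 9:
    print(f"Potentially affected: CUDA {torch.version.cuda}, SM{sm_major}{sm_minor}")
else:
    print("Outside the CUDA/SM range described above.")
```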
Read through the documentation in model.py to understand how to configure this sample for your use-case. (2) Step 2: Launch Triton Inference Server. Once you have the model repository set up, it is time to launch...
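Once the server process is up, a quick readiness check from Python confirms the launch succeeded before any real inference requests are sent. A minimal sketch using the Triton HTTP client (the tritonclient package and the default localhost:8000 HTTP port are assumptions about a standard local setup):

```python
# Sketch: verify a freshly launched Triton Inference Server is reachable.
# Assumes `pip install tritonclient[http]` and Triton's default HTTP port 8000.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
# Per-model readiness for the sample configured via model.py:
# print(client.is_model_ready("my_model"))  # "my_model" is a placeholder name
```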
[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. RuntimeError: Triton...
...16, false> &, const cutlass::Array<cutlass::float_e4m3_t, 8, false> &, const cutlass::Ar... (truncated CUTLASS FP8 template signature from the error output)
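AsyncEngineDeadError is the engine's generic "background loop crashed" wrapper; the RuntimeError underneath is the actual failure. Not a fix, but a common first triage step is to rerun in eager mode so that CUDA-graph capture can be ruled out as the trigger. A sketch with the offline API (the model name is a placeholder):

```python
# Triage sketch: enforce_eager=True skips CUDA graph capture, which helps
# separate graph-capture crashes from genuine kernel or compilation errors.
from vllm import LLM

llm = LLM(
    model="your-org/your-model",  # placeholder; substitute the failing model
    enforce_eager=True,           # same effect as --enforce-eager on the server CLI
)
print(llm.generate("Hello")[0].outputs[0].text)
```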
vllm [Usage]: v0.5.3.post1, ray, 2 hosts, 8x48G GPUs per host, Llama3.1-405B-FP8, fails with -tp 8 -...
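For a two-host Ray setup like this, the usual layout on recent vLLM versions is tensor parallelism within each node and pipeline parallelism across nodes. A hedged sketch of the offline equivalent (the model id and the tp/pp split are illustrative assumptions, not the reporter's exact configuration):

```python
# Sketch: 2 hosts x 8 GPUs. tensor_parallel_size shards each layer across
# the GPUs of one host; pipeline_parallel_size places one stage per host.
# Requires a Ray cluster already spanning both hosts (`ray start ...`).
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",  # assumed model id
    tensor_parallel_size=8,               # GPUs per host
    pipeline_parallel_size=2,             # one pipeline stage per host
    distributed_executor_backend="ray",   # multi-node execution via Ray
)
```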
Reviewing the output of the 'run e2e tests' steps, the test names are updated. Old: TestvLLM.test_vllm. New: TestvLLM_0_tests_e2e_vLLM_configs_fp8_dynamic_per_token_yaml.test_vllm. Note: the failures (RuntimeError: Failed to infer device type) are expected due to a recent change (see min...
--finetuning_type lora --infer_backend vllm --vllm_enforce_eager. The log is as follows: INFO 04-10 09:19:07 [__init__.py:239] Automatically detected platform cuda. [INFO|configuration_utils.py:691] 2025-04-10 09:19:10,612 >> loading configuration file /root/autodl-tmp/DeepSeek-R1-Distill-Qwen-32...