```yaml
services:
  vllm:
    container_name: vllm
    restart: always
    image: docker.unsee.tech/vllm/vllm-openai:v0.6.4
    ipc: host
    volumes:
      - ./Qwen2-VL-7B-Instruct-GPTQ-Int4:/models/Qwen2-VL-7B-Instruct-GPTQ-Int4
    command:
      - "--model"
      - "/models/Qwen2-VL-7B-Instruct-GPTQ-Int4"
      - "--served-model-name"
      - "Qwen2-VL-7B...
```
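For reference, a minimal smoke test against the endpoint this compose file starts might look like the sketch below. It assumes the compose file (truncated above) maps vLLM's default port 8000 to the host, and that the served model name is `Qwen2-VL-7B-Instruct-GPTQ-Int4`; substitute whatever `--served-model-name` value you actually pass.

```python
# Minimal request against the OpenAI-compatible server started by the
# compose file above. Port 8000 and the served model name are assumptions;
# adjust both to match your actual configuration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct-GPTQ-Int4",
    messages=[{"role": "user", "content": "Describe a sunset in one sentence."}],
)
print(response.choices[0].message.content)
```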
Also, I sometimes see hallucinations with qwen2-vl-72b-gptq-int4. Can you suggest the best sampling parameters? Right now I am using the same config mentioned in the Hugging Face conf.py.
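Not an official Qwen recommendation, but a common starting point for reducing hallucination is to lower `temperature` and `top_p` and add a mild `repetition_penalty`. The values below are assumptions to tune from; `repetition_penalty` and `top_k` are vLLM extensions to the OpenAI schema and go through `extra_body`:

```python
# Hedged starting-point sampling parameters; tune for your workload.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen2-VL-7B-Instruct-GPTQ-Int4",
    messages=[{"role": "user", "content": "What is shown in the image?"}],
    temperature=0.1,   # low temperature -> more deterministic output
    top_p=0.9,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,  # vLLM-specific extension parameter
        "top_k": 20,                 # vLLM-specific extension parameter
    },
)
print(response.choices[0].message.content)
```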
I'm running this on a server with 8 A100s, and even for a single question the model takes tens of minutes or even hours to respond. I checked GPU usage with nvidia-smi: the memory usage shows the model loaded successfully, but GPU core utilization fluctuates up and down. I suspected an inter-GPU communication problem, so I restricted the program to run on a single card, but it is still extremely slow.
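One way to narrow this down is to bypass the server entirely and time a single generation offline, pinned to one GPU, which rules out NCCL/inter-GPU communication as the cause. A rough sketch, assuming vLLM is installed locally; the model path is taken from the compose file above and is only illustrative:

```python
# Offline single-GPU sanity check. CUDA_VISIBLE_DEVICES must be set before
# vLLM (and torch) are imported for the pinning to take effect.
import os
import time

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # restrict to a single A100

from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/Qwen2-VL-7B-Instruct-GPTQ-Int4",  # path from the compose file
    tensor_parallel_size=1,  # no tensor parallelism -> no inter-GPU traffic
)

start = time.time()
outputs = llm.generate(
    ["Describe a sunset in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
print(f"elapsed: {time.time() - start:.1f}s")  # should be seconds, not minutes
```

If this runs in seconds, the problem is more likely in the serving setup (tensor-parallel configuration, NCCL transport, or request batching) than in the model itself.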