packages/vllm/engine/async_llm_engine.py", line 191, in step_async
    output = await self._run_workers_async(
  File "/home/user/projects/repos/transformers/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 227, in _run_workers_async
    assert output == other_output
...
Hello, I'm now using the base Docker image rocm/pytorch:rocm5.7_ubuntu20.04_py3.9_pytorch_2.0.1. It works fine when using TP=1 or when the number of prompts is small, but when I use 2 GPUs I get this error: File "/opt/conda/envs/vllm/lib/python3.10...
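For reference, a minimal sketch of the kind of 2-GPU tensor-parallel launch being described, using vLLM's offline LLM API; the model name and prompts below are placeholders, not taken from the report:

```python
from vllm import LLM, SamplingParams

# Sketch of the failing configuration described above: tensor parallelism
# across 2 GPUs with a larger prompt batch. Model and prompts are placeholders.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
sampling = SamplingParams(temperature=0.8, max_tokens=32)

outputs = llm.generate(["Hello, my name is"] * 64, sampling)
for out in outputs:
    print(out.outputs[0].text)
```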
As said in the first post, it will be very complicated to release a test module if other software parts are involved, but I'll try over the next few weeks. Collaborator esp-zhp commented May 7, 2024: 1. Is it possible to observe memory leaks from outside of the BT controller? Answer: Currentl...
vllm: 0.5.5
multipart: 0.0.9
openai: 1.43.0
anthropic: 0.34.1

NVIDIA Topology:
        GPU0    NIC0    NIC1    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0    X       SYS     SYS     0-23            N/A              N/A
NIC0    SYS     X       SYS
NIC1    SYS     SYS     X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconne...
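The topology matrix above matches the format printed by `nvidia-smi topo -m`; for anyone regenerating this part of an environment report, a small helper (assuming `nvidia-smi` is on the PATH):

```python
import subprocess

# Regenerate the GPU/NIC topology matrix and legend shown above.
result = subprocess.run(["nvidia-smi", "topo", "-m"],
                        capture_output=True, text=True)
print(result.stdout)
```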
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture        str = internlm2
llama_model_loader: - kv   1: general.name                str = InternLM
llama_model_loader: - kv   2: internlm2.context_length    u32 = 327...
Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

This is a compound and annoying bug, coupled with the PyTorch bug pytorch/pytorch#122815. Basically, the PyTorch torch.cuda.device_count function will cache the de...
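Assuming the caching behavior described here, a minimal repro sketch might look like the following; the device counts in the comments are illustrative, not measured:

```python
import os
import torch

# Illustrative repro of the described issue: once torch.cuda.device_count()
# has been called and cached its result, later changes to CUDA_VISIBLE_DEVICES
# are not reflected in the reported count.
print(torch.cuda.device_count())          # e.g. 8 on an 8-GPU node

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # restrict visibility afterwards
print(torch.cuda.device_count())          # still reports the cached 8
```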
Here's my current output for qwen2. llama.cpp seems to be utterly broken right now. Here is my run command:

It appears that the Qwen2-72B model stopped functioning correctly after ...
llm_load_vocab: Word '' could not be converted to UTF-8 codepoints and will be replaced with: ❌❌

Sadly this was not the only error once things got running; this also gets output:

llm_load_vocab: mismatch in special tokens definition ( 1087/120128 vs 55/120128 ).
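As a rough illustration of what that first warning is about (this is not llama.cpp code, and the byte sequence is hypothetical): a vocab entry stored as an invalid or truncated UTF-8 byte sequence cannot be decoded, so the loader falls back to a placeholder.

```python
# Hypothetical example: a truncated multi-byte UTF-8 sequence.
raw_token = b"\xe6\x88"
try:
    raw_token.decode("utf-8")
except UnicodeDecodeError:
    # Invalid bytes are swapped for replacement characters, analogous to the
    # ❌❌ substitution reported by llm_load_vocab above.
    print("replaced with:", raw_token.decode("utf-8", errors="replace"))
```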
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
    ^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING...
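Following the hint in the error message, one way to get an accurate stack trace is to force synchronous kernel launches before any CUDA work happens:

```python
import os

# Force synchronous kernel launches so the device-side assert is reported at
# the op that actually triggered it. Must be set before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the env var on purpose
```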
(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    222             raise ValueError(
    223                 f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
    224             )
    225         attn_weights = attn_weights + attention_mask
...
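To make the shape check concrete, here is a small sketch with hypothetical sizes: the additive attention mask must be (bsz, 1, q_len, kv_seq_len) so it broadcasts over the head dimension when added to the attention weights.

```python
import torch

# Hypothetical shapes illustrating the check above.
bsz, num_heads, q_len, kv_seq_len = 2, 8, 16, 16
attn_weights = torch.zeros(bsz, num_heads, q_len, kv_seq_len)
attention_mask = torch.zeros(bsz, 1, q_len, kv_seq_len)

# Broadcasts over the head dimension; a mask of any other shape would trip
# the ValueError shown in the traceback.
attn_weights = attn_weights + attention_mask
print(attn_weights.shape)  # torch.Size([2, 8, 16, 16])
```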