Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake v
@youkaichao

Your current environment

My environment:
Name: vllm
Version: 0.4.2+cu117

🐛 Describe the bug

I quantized the model (Qwen2_72B) with AWQ myself. When I try to start the API service using two GPUs it doesn't work, but using one ...
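For reference, a minimal sketch of the two-GPU setup being described, using the vLLM Python API; the model path below is a placeholder for the locally AWQ-quantized checkpoint, not a path from this report:

```python
# Minimal sketch of the failing configuration: an AWQ-quantized Qwen2-72B
# served across two GPUs with tensor parallelism (vLLM 0.4.x API).
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/Qwen2-72B-AWQ",  # placeholder: locally quantized checkpoint
    quantization="awq",              # weights were quantized with AWQ
    tensor_parallel_size=2,          # the two-GPU case that fails here
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```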
tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None,
enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None,
device_config=cuda,
decoding_config=DecodingConfig(guided_decoding_backend='outlines'),
seed=0, served_model_name=/scratch/pbs.5401450.kman.restech....
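A sketch of how the engine-args line above maps onto vLLM's EngineArgs dataclass; the model path is a placeholder standing in for the truncated scratch path, and the `guided_decoding_backend` field assumes a vLLM 0.4.x release that already has DecodingConfig:

```python
# Sketch: the startup log's arguments expressed as EngineArgs (vLLM 0.4.x).
from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="/scratch/...",               # placeholder for the truncated path
    tensor_parallel_size=2,
    disable_custom_all_reduce=False,
    quantization=None,
    enforce_eager=False,
    kv_cache_dtype="auto",
    seed=0,
    guided_decoding_backend="outlines",  # surfaces in the log as DecodingConfig(...)
)
print(engine_args)
```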
[rank0]:   File "/mnt/data/Pai-Megatron-Patch/PAI-Megatron-LM-240718/megatron/core/pipeline_parallel/schedules.py", line 1344, in forward_backward_pipelining_without_interleaving
[rank0]:     output_tensor, num_tokens = forward_step(
[rank0]:   File "/mnt/data/Pai-Megatron-Patch/PAI-Megatron-LM-...
pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None,
block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9,
max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256,
disable_log_stats=False, quantization='gptq', enforce_eager=False, max...
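And a corresponding sketch of the GPTQ configuration from this log, again with a placeholder model path; the keyword arguments simply mirror the logged engine args and are forwarded by `LLM` to the engine:

```python
# Sketch of the logged GPTQ setup: two-way tensor parallelism, GPTQ weights.
from vllm import LLM

llm = LLM(
    model="/path/to/gptq-checkpoint",  # placeholder for the actual model path
    quantization="gptq",
    tensor_parallel_size=2,
    block_size=16,
    seed=0,
    swap_space=4,
    gpu_memory_utilization=0.9,
    max_num_seqs=256,
    enforce_eager=False,
)
```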