tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=None, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20,...
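When comparing engine settings between two runs, it can help to turn a logged argument line like the one above into a dictionary. The following is only a minimal sketch using the Python standard library. The helper name `parse_engine_args` is hypothetical and not part of any library, and it assumes that no value contains a comma. The truncated tail of the line (`...`) is skipped rather than guessed.

```python
import ast


def parse_engine_args(line: str) -> dict:
    """Parse a 'key=value, key=value, ...' log line into a dict.

    Hypothetical helper for illustration; assumes values contain no commas.
    """
    parsed = {}
    for piece in line.split(","):
        piece = piece.strip()
        if "=" not in piece:
            # Skip fragments without a key, such as the trailing "..."
            continue
        key, _, raw = piece.partition("=")
        try:
            # Convert literals such as 1, 0.9, None, True, False
            parsed[key.strip()] = ast.literal_eval(raw.strip())
        except (ValueError, SyntaxError):
            # Keep anything that is not a Python literal as a plain string
            parsed[key.strip()] = raw.strip()
    return parsed


sample = (
    "tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=None, "
    "enable_prefix_caching=False, disable_sliding_window=False, "
    "use_v2_block_manager=True, swap_space=4, cpu_offload_gb=0, "
    "gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, "
    "max_logprobs=20,..."
)
args = parse_engine_args(sample)
```

Two such dictionaries can then be compared key by key to spot settings that differ between a working and a failing configuration.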
native_image.h native_interface_xcomponent.h native_vsync.h raw_dir.h raw_file_manager.h raw_file.h context.h data_type.h format.h model.h status.h tensor.h types.h neural_network_runtime_type.h neural_network_runtime.h native_avcodec_audiodecode...
even after handling tens of thousands of requests over several days. I use the qwen1.5-72b model with tensor parallelism (tp) of 4. It appears that the bug was introduced in the transition