Easy-to-use and powerful LLM and SLM library with awesome model zoo. - [Fix] enable_sp_async_reduce_scatter for qwen_72b && llama2_70b (#8897) · PaddlePaddle/PaddleNLP@6f5bb76
Model Input Dumps No response 🐛 Describe the bug I simply changed api_server.py a little to serve multiple prompts, using asyncio.gather to wait for all responses to be ready. The log shows that all requests finish successfully, but the response can't be returned fr...
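For reference, the asyncio.gather pattern described above can be sketched from the client side as follows; this is a minimal sketch, assuming an OpenAI-compatible vLLM completions endpoint on localhost:8000, and the fetch_completion helper, the model name, and the prompt list are illustrative, not taken from the original report.

import asyncio
import aiohttp

API_URL = "http://localhost:8000/v1/completions"

async def fetch_completion(session: aiohttp.ClientSession, prompt: str) -> str:
    payload = {"model": "Qwen1.5-72B-Chat", "prompt": prompt, "max_tokens": 128}
    async with session.post(API_URL, json=payload) as resp:
        data = await resp.json()
        return data["choices"][0]["text"]

async def main() -> None:
    prompts = ["Hello", "Summarize vLLM in one line", "What is tensor parallelism?"]
    async with aiohttp.ClientSession() as session:
        # gather returns only once every request has produced a response
        results = await asyncio.gather(*(fetch_completion(session, p) for p in prompts))
    for prompt, text in zip(prompts, results):
        print(prompt, "->", text)

if __name__ == "__main__":
    asyncio.run(main())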
max_model_len=None, worker_use_ray=False, distributed_executor_backend='ray', pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=None, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, swap_space=4, cpu_offload...
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --tensor-parallel-size 4 --served-model-name Qwen1.5-72B-Chat --model ../Qwen1.5-72B-Chat --port 8989 --max-model-len 14500 --gpu-memory-utilization 0.96 🐛 Describe the bug I query the OpenAI server with threa...
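For context, a minimal sketch of querying that server from multiple threads, assuming it listens on port 8989 and serves Qwen1.5-72B-Chat as launched above; the send_request helper and the prompt list are illustrative names, not from the original report.

import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8989/v1/chat/completions"

def send_request(prompt: str) -> str:
    payload = {
        "model": "Qwen1.5-72B-Chat",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompts = [f"Question {i}" for i in range(16)]
with ThreadPoolExecutor(max_workers=8) as pool:
    # each thread issues one blocking HTTP request; results come back in submit order
    answers = list(pool.map(send_request, prompts))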
Model: "HuggingFaceH4/zephyr-7b-beta" The pod running on my k8s runs the following command: python3 -m vllm.entrypoints.openai.api_server --model HuggingFaceH4/zephyr-7b-beta --disable-frontend-multiprocessing --disable-custom-all-reduce ...
🐛 Describe the bug
# init model weights
model.init_weights()
# parallelize the first embedding and the last linear out projection
model = parallelize_module(
    model, tp_mesh,
    {
        "tok_embeddings": RowwiseParallel(  # **Here's the problem**
            inp...
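For context, a minimal sketch of the tensor-parallel plan the snippet appears to be building, using the torch.distributed.tensor.parallel API on a recent PyTorch (2.5 or newer; older releases expose Replicate under torch.distributed._tensor). The ToyModel, the mesh size taken from WORLD_SIZE, and the layout choices are assumptions rather than the reporter's actual code, and the script is meant to be launched with torchrun.

import os
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Replicate
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

class ToyModel(torch.nn.Module):
    def __init__(self, vocab: int = 1024, dim: int = 64):
        super().__init__()
        self.tok_embeddings = torch.nn.Embedding(vocab, dim)
        self.output = torch.nn.Linear(dim, vocab, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        return self.output(self.tok_embeddings(idx))

tp_size = int(os.environ.get("WORLD_SIZE", "1"))
tp_mesh = init_device_mesh("cuda", (tp_size,))

model = ToyModel().cuda()
model = parallelize_module(
    model,
    tp_mesh,
    {
        # embedding table sharded along the vocab (row) dimension,
        # token ids replicated on every rank
        "tok_embeddings": RowwiseParallel(input_layouts=Replicate()),
        # final projection sharded column-wise, output gathered back
        "output": ColwiseParallel(output_layouts=Replicate()),
    },
)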
Description I have some confusion about the context.execute function. According to the TensorRT Python API documentation, there are execute and execute_async. However, according to here: | Inference time should be nearly identical when exec...
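For illustration, a minimal sketch contrasting the two calls under the TensorRT 8.x-style bindings API (execute_v2 is synchronous, execute_async_v2 enqueues work on a CUDA stream); the model.engine path is hypothetical, static input shapes are assumed, and buffer contents are left uninitialized.

import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# allocate one device buffer per binding (static shapes assumed)
bindings = []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    nbytes = trt.volume(engine.get_binding_shape(i)) * np.dtype(dtype).itemsize
    bindings.append(int(cuda.mem_alloc(nbytes)))

# synchronous path: returns only after inference has finished
context.execute_v2(bindings)

# asynchronous path: enqueues the work on a CUDA stream and returns immediately;
# synchronize the stream before reading the outputs
stream = cuda.Stream()
context.execute_async_v2(bindings, stream.handle)
stream.synchronize()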
tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(...
cache_config = kwargs["cache_config"]
parallel_config = kwargs["parallel_config"]
if parallel_config.tensor_parallel_size == 1:
    num_gpus = cache_config.gpu_memory_utilization
else:
    num_gpus = 1
engine_class = ray.remote(num_gpus=num_gpus)(
    self._engine_class).remote
ret...
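As a standalone illustration of the fractional-GPU trick in that snippet: ray.remote accepts a fractional num_gpus, so an actor can reserve only part of a device. The EngineActor class and the 0.9 fraction below are illustrative choices, and the sketch needs at least one GPU registered with Ray to schedule the actor.

import os
import ray

ray.init()

class EngineActor:
    def ping(self) -> str:
        # Ray sets CUDA_VISIBLE_DEVICES for the actor according to its GPU share
        return os.environ.get("CUDA_VISIBLE_DEVICES", "")

# with tensor_parallel_size == 1 the snippet reserves only a fraction of one GPU
# (equal to gpu_memory_utilization) so other actors can share the same device;
# otherwise it reserves a whole GPU for the engine actor
tensor_parallel_size = 1
gpu_memory_utilization = 0.9
num_gpus = gpu_memory_utilization if tensor_parallel_size == 1 else 1

engine_class = ray.remote(num_gpus=num_gpus)(EngineActor)
actor = engine_class.remote()
print(ray.get(actor.ping.remote()))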