For example, if you run two vLLM instances on the same GPU, you can set the GPU memory utilization to 0.5 for each instance.

--guided-decoding-backend {outlines,lm-format-enforcer,xgrammar}
Which engine will be used for guided decoding (JSON schema / regex, etc.) by default. Currently supports https://github.com/outlines-dev/outlines, https://github.com/mlc-ai/xgrammar, and https://...
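A minimal sketch of how these two settings map onto engine arguments when constructing the engine in Python (the model name, the xgrammar choice, and the offline LLM entry point are illustrative assumptions, not taken from the text above):

from vllm import LLM

# Sketch with assumed values: one of two instances sharing a single GPU,
# each capped at half of the GPU memory, with xgrammar as the default
# guided-decoding backend.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",      # illustrative model
    gpu_memory_utilization=0.5,          # 50% of the GPU for this instance
    guided_decoding_backend="xgrammar",  # default engine for guided decoding
)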
"include_stop_str_in_output": false, "guided_json": "string", "guided_regex": "string", "guided_choice": [ "string" ], "guided_grammar": "string", "guided_decoding_backend": "string", "guided_whitespace_pattern": "string" } 五、量化 这边以GPTQ为例,下载好模型Qwen2-7B-Instruct-G...
config_format=<ConfigFormat.AUTO: 'auto'>,dtype='auto',kv_cache_dtype='auto',max_model_len=None,guided_decoding_backend='xgrammar',logits_processor_pattern=None,model_impl='auto',distributed_executor_backend=None,pipeline_parallel_size=1,tensor_parallel_size=1,enable_expert_parallel=False,max_parallel_loading_workers...
Can be overridden per request via the guided_decoding_backend parameter.

--distributed-executor-backend {ray,mp}
Backend to use for distributed serving. When more than one GPU is used, this is automatically set to "ray" if Ray is installed, otherwise to "mp" (multiprocessing); see the sketch after this list.

--worker-use-ray
Deprecated; use --distributed-executor-backend=ray instead.

--pipeline-parallel-size PIPELINE_PARALLEL...
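An illustrative sketch only (model name and GPU count are assumptions) of how the distributed-serving flags above correspond to engine arguments in Python:

from vllm import LLM

# Sketch with assumed values: split an illustrative model across 2 GPUs with
# tensor parallelism, forcing the multiprocessing backend instead of Ray.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",
    tensor_parallel_size=2,
    distributed_executor_backend="mp",
)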
kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name...
load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=qwen)...
extra_body = { "guided_grammar": grammar, "guided_decoding_backend": "xgrammar", # optional } chat_completion = client.chat.completions.create( model=model, messages=messages, stream=True, temperature=0, max_tokens=1024, timeout=timeout, extra_body=extra_body, stream_options={"include_usag...
chat_template_text_format='string', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=120000, guided_decoding_backend='...
max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=true, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed...