When the LLMEngine is initialized, its _build_logits_processors() method calls get_local_guided_decoding_logits_processor() (located under vllm/model_executor/guided_decoding) to obtain the LogitsProcessor for the currently available backend. The Guided Decoding parameters must be passed in at this point.
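To make the flow concrete, here is a minimal, hypothetical sketch of what such a LogitsProcessor does: it masks every token the target format does not permit at the current step, so sampling can only pick valid continuations. The class name and the fixed allow-list are illustrative assumptions, not vLLM's real code; real backends such as outlines compute the allowed set from a regex/JSON-schema/grammar FSM.

```python
import torch
from typing import List

# Illustrative sketch only (not vLLM's actual implementation). In vLLM, a
# logits processor is a callable taking the token ids generated so far and
# the raw logits for the next step, and returning adjusted logits.
class ToyGuidedLogitsProcessor:
    def __init__(self, allowed_token_ids: List[int]):
        # Hypothetical fixed allow-list; real backends derive this per step
        # from a format FSM (regex, JSON schema, grammar).
        self.allowed = allowed_token_ids

    def __call__(self, token_ids: List[int],
                 logits: torch.Tensor) -> torch.Tensor:
        mask = torch.full_like(logits, float("-inf"))
        mask[self.allowed] = 0.0  # keep only format-permitted tokens
        return logits + mask
```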
1. Introduction — Guided Decoding, also known as Structured Output, is an important feature in LLM inference. It steers the model into producing output that conforms to a specific format (e.g., SQL or JSON), making it easier to apply LLMs in concrete application scenarios. In my…
If not specified, it is derived automatically from the model config. --guided-decoding-backend {outlines,lm-format-enforcer} Which engine is used for guided decoding (JSON schema / regex, etc.) by default. Currently supports https://github.com/outlines-dev/outlines and https://github.com/noamgat/lm-format-enforcer. Can be overridden per request via the guided_decoding_backend parameter. --distributed...
For example, if you run two vLLM instances on the same GPU, you can set the GPU memory utilization to 0.5 for each instance. --guided-decoding-backend {outlines,lm-format-enforcer,xgrammar} Which engine is used for guided decoding (JSON schema / regex, etc.) by default. Currently supports https://github.com/outlines-dev/outlines, https://github.com/mlc-ai/xgrammar, and http...
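The per-request override mentioned above can be exercised through vLLM's OpenAI-compatible extensions. A hedged example using the OpenAI Python client (the server URL and model name are placeholders; the guided_json and guided_decoding_backend fields follow vLLM's documented request extensions):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# JSON schema the output must conform to (illustrative).
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="Qwen/Qwen2-7B-Instruct",  # placeholder served model
    messages=[{"role": "user",
               "content": "Describe a person as JSON with name and age."}],
    extra_body={
        "guided_json": schema,                  # constrain output to schema
        "guided_decoding_backend": "xgrammar",  # per-request backend override
    },
)
print(resp.choices[0].message.content)
```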
🚀 The feature, motivation and pitch Currently we support guided decoding (JSON, Regex, Choice, Grammar, and arbitrary JSON) in the OpenAI inference server. It would be great to expose the same functionality in the offline interface a...
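Recent vLLM releases do expose guided decoding in the offline interface. A sketch of what that looks like, assuming a version that provides GuidedDecodingParams and the guided_decoding field of SamplingParams (API details vary across releases; the model name is a placeholder):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2-7B-Instruct")  # placeholder model

# Constrain generation to one of two fixed choices.
guided = GuidedDecodingParams(choice=["positive", "negative"])
params = SamplingParams(temperature=0.0, guided_decoding=guided)

outputs = llm.generate("Sentiment of 'What a great product!':", params)
print(outputs[0].outputs[0].text)  # one of the two allowed strings
```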
disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=qwen)
INFO: Started server process [614]
INFO: Waiting for...
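Once a server like the one above is running, the guided-decoding fields can also be sent directly in the request body. A sketch using plain HTTP (served_model_name=qwen comes from the log above; the endpoint and field names follow vLLM's OpenAI-compatible API, and the regex is only an illustration):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "qwen",  # served_model_name from the startup log
        "prompt": "The loopback IP address is ",
        "max_tokens": 20,
        # Force the completion to match an IPv4-shaped pattern.
        "guided_regex": r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",
    },
)
print(resp.json()["choices"][0]["text"])
```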
42 changes: 39 additions & 3 deletions in tests/entrypoints/test_guided_processors.py:

@@ -1,11 +1,14 @@
# This unit test should be moved to a new
# tests/test_guided_decoding directory.
import pytest
import torch
from transformer...
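In the spirit of that diff, a minimal pytest sketch for a guided logits processor might check that allowed tokens stay finite while everything else is masked to -inf. This is not the actual vLLM test; ToyGuidedLogitsProcessor is the illustrative class defined earlier in this article:

```python
import torch

def test_guided_processor_masks_disallowed_tokens():
    allowed = [1, 3]
    proc = ToyGuidedLogitsProcessor(allowed)
    logits = torch.zeros(8)  # toy vocab of 8 tokens
    out = proc(token_ids=[], logits=logits)
    # Allowed tokens keep finite scores; all others are driven to -inf.
    assert torch.isfinite(out[allowed]).all()
    blocked = [i for i in range(8) if i not in allowed]
    assert (out[blocked] == float("-inf")).all()
```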
guided_grammar: Optional[str] = Field(
    default=None,
    description=(
        "If specified, the output will follow the context free grammar."),
)
guided_decoding_backend: Optional[str] = Field(
    default=None,
    description=(
        "If specified, will override the default guided decoding backend "
        "of the server for this specific ...
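A hypothetical request exercising both of these fields together: a context-free grammar plus a per-request backend override. Grammar syntax depends on the chosen backend (the outlines backend accepts Lark-style EBNF), so treat this grammar as illustrative only:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Illustrative Lark-style grammar allowing only "yes" or "no".
yes_no_grammar = r"""
?start: "yes" | "no"
"""

resp = client.completions.create(
    model="qwen",  # placeholder served model name
    prompt="Is Python dynamically typed? Answer yes or no: ",
    max_tokens=5,
    extra_body={
        "guided_grammar": yes_no_grammar,
        "guided_decoding_backend": "outlines",
    },
)
print(resp.choices[0].text)
```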
bfloat16, max_seq_len=128, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend=...