For a Hugging Face model passed in directly, TGI infers the upper bound of this parameter automatically. If you load a 7B model onto a GPU with 24 GB of VRAM, you will see VRAM usage close to full rather than the ~13 GB a 7B model typically occupies: TGI plans and reserves VRAM ahead of time according to max-batch-total-tokens. For quantized models, however, this parameter must be set manually; when setting it, you can...
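The reservation above is dominated by the KV cache: each token in the batch budget costs a fixed number of bytes. A rough sketch of how a max-batch-total-tokens budget could be derived from free VRAM is below; all model dimensions are illustrative assumptions for a generic 7B-class decoder, not values read from TGI.

```python
# Estimate how many KV-cache tokens fit in leftover VRAM.
# Dimensions below (layers, heads, head_dim, fp16) are ASSUMED for illustration.

def kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128,
                             bytes_per_elem=2):
    # Each token stores one key and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def estimate_max_batch_total_tokens(free_vram_bytes, safety_fraction=0.9):
    # Spend only part of the free VRAM on KV cache, leaving headroom
    # for activations and fragmentation.
    budget = int(free_vram_bytes * safety_fraction)
    return budget // kv_cache_bytes_per_token()

# e.g. with ~10 GB of VRAM left after loading quantized weights:
tokens = estimate_max_batch_total_tokens(10 * 1024**3)
print(tokens)  # → 18432
```

This is only a starting point for the manual setting; real deployments should still leave margin for prefill activations.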
"url": "ghcr.io/huggingface/text-generation-inference:2.1.1", # This is the min version"env": {"LORA_ADAPTERS": "predibase/customer_support,predibase/magicoder", # Add adapters here"MAX_BATCH_PREFILL_TOKENS": "2048", # Set according to your needs...
"LORA_ADAPTERS": "predibase/customer_support,predibase/magicoder", # Add adapters here "MAX_BATCH_PREFILL_TOKENS": "2048", # Set according to your needs "MAX_INPUT_LENGTH": "1024", # Set according to your needs "MAX_TOTAL_TOKENS": "1512", # Set according to your needs "MODEL_ID...
"LORA_ADAPTERS":"predibase/customer_support,predibase/magicoder",# Add adapters here "MAX_BATCH_PREFILL_TOKENS":"2048",# Set according to your needs "MAX_INPUT_LENGTH":"1024",# Set according to your needs "MAX_TOTAL_TOKENS":"1512",# Set according to your needs "MODEL_ID":"/reposito...
/models \
  ghcr.io/huggingface/text-generation-inference:1.0.0 \
  --model-id /models/llama2-7b-chat-gptq-int4 \
  --hostname 0.0.0.0 \
  --port 5001 \
  --max-concurrent-requests 256 \
  --quantize gptq \
  --trust-remote-code \
  --max-batch-total-tokens 30000 \
  --sharded false \
  --max-input-length 1024 \
  --validation-...
{"LORA_ADAPTERS":"predibase/customer_support,predibase/magicoder",# Add adapters here"MAX_BATCH_PREFILL_TOKENS":"2048",# Set according to your needs"MAX_INPUT_LENGTH":"1024",# Set according to your needs"MAX_TOTAL_TOKENS":"1512",# Set according to your needs"MODEL_ID":"/repository"...
--num-shard 1 --port xxx --router-name=xx --max-top-n-tokens=1 --max-input-length=640 --max-total-tokens=960 --waiting-served-ratio=0.5 --max-batch-prefill-tokens=5120 --max-batch-total-tokens=16000 --max-waiting-tokens=16000 ...
Qwen2 supports long context lengths, so carefully choose the values for `--max-batch-prefill-tokens`, `--max-total-tokens`, and `--max-input-tokens` to avoid potential out-of-memory (OOM) issues. If an OOM occurs, you'll receive an error message upon startup.
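The startup error boils down to a few inequalities between these flags: a prompt must fit inside the per-request total, and a request inside the batch budget. A hedged sketch of such a pre-launch check (not TGI's actual validation code), exercised with the 1024/1512/2048 limits from the env block earlier and the 16000-token batch budget from the flags above:

```python
# Pre-launch sanity check for TGI token limits; the inequalities mirror the
# relationships between the flags, but this is an illustrative sketch only.

def check_token_limits(max_input_tokens, max_total_tokens,
                       max_batch_prefill_tokens, max_batch_total_tokens):
    errors = []
    if max_input_tokens >= max_total_tokens:
        errors.append("max-input-tokens must be < max-total-tokens")
    if max_batch_prefill_tokens < max_input_tokens:
        errors.append("max-batch-prefill-tokens should cover one full prompt")
    if max_batch_total_tokens < max_total_tokens:
        errors.append("max-batch-total-tokens must fit one full request")
    return errors

print(check_token_limits(1024, 1512, 2048, 16000))  # → []
```

Catching a violation this way is cheaper than waiting for the server to abort on startup.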
{ "LORA_ADAPTERS": "predibase/customer_support,predibase/magicoder", # Add adapters here "MAX_BATCH_PREFILL_TOKENS": "2048", # Set according to your needs "MAX_INPUT_LENGTH": "1024", # Set according to your needs "MAX_TOTAL_TOKENS": "1512", # Set according to your needs "MODEL...
docker run -d --name=tgi-mistral-7b \
  --env HF_HUB_OFFLINE=1 \
  --env HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  --env http_proxy=$http_proxy \
  --env https_proxy=$https_proxy \
  --env MAX_BATCH_TOTAL_TOKENS=32000 \
  --env MAX_BATCH_PREFILL_TOKENS=16000 \
  --env MAX_TOTAL_TOKENS=32000 \
  --gp...
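The docker run above configures limits via --env rather than trailing launcher flags; TGI's launcher reads each flag from an environment variable named after it (assuming the usual dashes-to-underscores, uppercase convention, as the MAX_BATCH_TOTAL_TOKENS example here suggests). The name mapping can be sketched as:

```python
# Map a TGI launcher flag to the environment variable name that sets it.
# The SCREAMING_SNAKE_CASE convention is assumed from the --env usage above.

def flag_to_env(flag):
    return flag.lstrip("-").replace("-", "_").upper()

print(flag_to_env("--max-batch-total-tokens"))  # → MAX_BATCH_TOTAL_TOKENS
```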