(xorbits) ailearn@gpts:~$ xinference launch -e http://gpts:9997 -n Qwen1.5-32B-Chat-AWQ -s 32 -f awq -q Int4 --gpu-idx 2,3 --enforce_eager True --max_num_seqs 16
Launch model name: Qwen1.5-32B-Chat-AWQ with kwargs: {'enforce_eager': True, 'max_num_seqs': 16} ...
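The same launch can be issued from Python. A minimal sketch, assuming the Xinference Python client mirrors the CLI flags (the gpu_idx keyword and the pass-through of enforce_eager/max_num_seqs to the engine are assumptions, not verified against a specific release):

    from xinference.client import Client

    client = Client("http://gpts:9997")
    # Extra keyword arguments are forwarded to the inference engine;
    # gpu_idx is assumed to mirror the CLI's --gpu-idx flag.
    model_uid = client.launch_model(
        model_name="Qwen1.5-32B-Chat-AWQ",
        model_size_in_billions=32,
        model_format="awq",
        quantization="Int4",
        gpu_idx=[2, 3],
        enforce_eager=True,
        max_num_seqs=16,
    )
    print(model_uid)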
INFO 04-15 21:07:05 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage. (RayWorkerVllmp...
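All three knobs named in that log line are constructor arguments of vLLM's LLM class. A minimal sketch with illustrative values (the Hugging Face repo name and the 0.85 utilization figure are assumptions):

    from vllm import LLM

    llm = LLM(
        model="Qwen/Qwen1.5-32B-Chat-AWQ",  # assumed HF repo for the model above
        quantization="awq",
        tensor_parallel_size=2,       # two GPUs, matching --gpu-idx 2,3
        gpu_memory_utilization=0.85,  # lower this first if you run out of memory
        enforce_eager=True,           # skip CUDA graph capture, saving 1~3 GiB per GPU
        max_num_seqs=16,              # fewer concurrent sequences -> smaller KV cache
    )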
For now, things like max_model_len=128, block_size=128, and os.environ['MASTER_PORT'] = '12355' are quite mysterious to me. (Review comment on vllm/model_executor/__init__.py and examples/offline_inference_neuron.py; the relevant excerpt from the example is reproduced below.)
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models.
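The "single line" is the API base URL: Xinference exposes an OpenAI-compatible endpoint under /v1, so an existing OpenAI client can simply be pointed at the server from the transcript above. A minimal sketch (the api_key value is a placeholder for servers without auth enabled, and the model UID is assumed to match the launched model name):

    from openai import OpenAI

    # Point the stock OpenAI client at the Xinference server instead of api.openai.com.
    client = OpenAI(base_url="http://gpts:9997/v1", api_key="not-used")

    resp = client.chat.completions.create(
        model="Qwen1.5-32B-Chat-AWQ",  # model UID from the launch above
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(resp.choices[0].message.content)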
    max_num_seqs=8,
    # The max_model_len and block_size arguments are required to be the same as the
    # max sequence length, when targeting neuron device. Currently, this is a known
    # limitation in continuous batching support in transformers-neuronx.
    # TODO(liangfu): Support paged-attention in transformers-neuronx.
    max_model_len=128,
    block_size=128,
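For context, a self-contained sketch of how those arguments fit into the full offline-inference example (the model name and prompt are illustrative, not taken from the discussion above):

    from vllm import LLM, SamplingParams

    prompts = ["Hello, my name is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative Neuron-supported model
        max_num_seqs=8,
        # On the neuron device, max_model_len and block_size must both equal the
        # max sequence length (a transformers-neuronx continuous-batching limitation).
        max_model_len=128,
        block_size=128,
        device="neuron",
    )

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.outputs[0].text)

With block_size equal to max_model_len, each sequence occupies exactly one KV-cache block, which is why paged attention effectively degenerates to contiguous allocation here until the TODO above is resolved.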