However, we do support either data replication or tensor/pipeline parallelism during evaluation, on one node. To enable data replication, set the `devices` key in `--model_args` to the number of data replicas to run. For example, the command to run 8 data replicas over 8 GPUs is:

```bash
torchrun --nproc-per-node=8 --no-python lm_eval \
    --model nemo_lm \
    --model_args path='<path_to_nemo_model>',devices=8 \
    --tasks lambada_openai \
    --batch_size 32
```
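For the tensor/pipeline parallelism mentioned above, a minimal sketch, assuming the `nemo_lm` backend accepts `tensor_model_parallel_size` and `pipeline_model_parallel_size` keys in `--model_args` (names mirror NeMo's own config; check the backend docs for the exact spelling), might look like:

```bash
# Hedged sketch: 2-way tensor parallelism x 2-way pipeline parallelism over 4 GPUs.
# tensor_model_parallel_size and pipeline_model_parallel_size are assumed model_args
# keys taken from NeMo's config names; devices should equal their product here.
torchrun --nproc-per-node=4 --no-python lm_eval \
    --model nemo_lm \
    --model_args path='<path_to_nemo_model>',devices=4,tensor_model_parallel_size=2,pipeline_model_parallel_size=2 \
    --tasks lambada_openai \
    --batch_size 32
```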
We also support vLLM for faster inference on [supported model types](https://docs.vllm.ai/en/latest/models/supported_models.html), which is especially faster when splitting a model across multiple GPUs. For single-GPU or multi-GPU inference (tensor parallel, data parallel, or a combination of both), for example:

```bash
lm_eval --model vllm \
    --model_args pretrained={model_name},tensor_parallel_size={GPUs_per_model},dtype=auto,gpu_memory_utilization=0.8,data_parallel_size={model_replicas} \
    --tasks lambada_openai \
    --batch_size auto
```
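As a concrete illustration of filling in those placeholders, a hedged invocation assuming 8 GPUs arranged as 4 data-parallel replicas of 2-way tensor parallelism (the checkpoint name is only an example):

```bash
# Hedged example: 4 replicas x 2 GPUs each = 8 GPUs total; the model name is illustrative.
lm_eval --model vllm \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=4 \
    --tasks lambada_openai \
    --batch_size auto
```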