The memory occupied by one block (in bytes) = number of tokens (block_size) × the KV cache memory occupied by a single token. So we only need to work out how much KV cache a single token takes. The block size calculation is implemented by the get_cache_block_size function of the CacheEngine class in vllm/vllm/worker/cache_engine.py; the code is quite simple, and in simplified form it looks like...
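To make the per-token figure concrete, here is a minimal sketch (not vLLM's code; the layer count, KV-head count, head size, and dtype size below are illustrative assumptions):

```python
# Illustrative per-token KV cache size (assumed model shape, not a real config).
num_layers = 32      # assumed number of attention layers
num_kv_heads = 32    # assumed number of KV heads (no GQA)
head_size = 128      # assumed head dimension
dtype_size = 2       # bytes per element for fp16/bf16

# key cache + value cache, across all layers, for a single token
per_token_kv_bytes = 2 * num_kv_heads * head_size * num_layers * dtype_size
print(per_token_kv_bytes)  # 524288 bytes = 512 KiB per token
```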
```python
        block_size: int,
    ) -> None:
        self.block_number = block_number
        self.block_size = block_size

        self.token_ids = [_BLANK_TOKEN_ID] * block_size
        self.num_tokens = 0

    def is_empty(self) -> bool:
        return self.num_tokens == 0

    def get_num_empty_slots(self) -> int:
        return self.block_size...
```
```
vllm/worker/worker.py", line 381, in _check_if_can_support_max_seq_len
    raise RuntimeError(
RuntimeError: vLLM cannot currently support max_model_len=65536 with block_size=16 on GPU with compute capability (8, 9) (required shared memory 264252.0 > available shared memory 101376). This ...
```
```python
            Device.GPU, block_size, num_gpu_blocks)
        self.cpu_allocator: BlockAllocatorBase = CachedBlockAllocator(
            Device.CPU, block_size, num_cpu_blocks)
    else:
        # Regular case: use UncachedBlockAllocator
        self.gpu_allocator = UncachedBlockAllocator(
            Device.GPU, block_size, num_gpu_blocks)
        self.cpu_allocator = ...
```
The code above first obtains the values of num_heads and head_size; num_heads * head_size is the number of parameters a single token needs in one layer of multi-head attention, but that covers only the key cache or the value cache, not both. Memory occupied by one block = number of tokens (block_size) × number of layers (num_layers) × memory occupied by one layer's KV cache (2...
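Plugging illustrative numbers into this per-block formula gives a sense of scale (a rough sketch; the model shape is assumed, and only block_size=16 reflects vLLM's default):

```python
# bytes per block = block_size * num_layers * 2 (K and V) * num_kv_heads * head_size * dtype_size
block_size = 16       # tokens per block (vLLM's default)
num_layers = 32       # assumed number of attention layers
num_kv_heads = 32     # assumed number of KV heads
head_size = 128       # assumed head dimension
dtype_size = 2        # bytes per element for fp16/bf16

block_bytes = block_size * num_layers * 2 * num_kv_heads * head_size * dtype_size
print(block_bytes)    # 8388608 bytes = 8 MiB per block
```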
[V1] Fix when max_model_len is not divisible by block_size #10903 (merged by WoosukKwon into main from v1-max-model-len, Dec 5, 2024; +10 −3, 1 file changed).
```
lm_eval --model vllm --model_args pretrained=/home/vllm-dev/DeepSeek-R1,tensor_parallel_size=8 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=/home/vllm-dev/DeepSeek-R1,tensor_parallel_size=8,trust_remote_code=True), gen_kwargs: (Non...
```
```diff
        i += block_size) {
      expert_ids[i / block_size] = threadIdx.x;
    }
-   local_offsets[threadIdx.x] = cumsum[threadIdx.x];
    }
+   }

-   __syncthreads();
-
-   for (int i = start_idx; i < numel && i < start_idx + tokens_per_thread; ++i) {
...
```
```python
def _get_cache_block_size(
    cache_config: CacheConfig,
    model_config: ModelConfig,
    parallel_config: ParallelConfig,
) -> int:
    head_size = model_config.get_head_size()
    num_heads = model_config.get_num_kv_heads(parallel_config)
    num_attention_layers = model_config.get_num_layers_by_block_type(...)
```
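The rest of the function is truncated above. Based on the formula already described, a self-contained sketch of the same computation might look like this (the helper name, its signature, and the dtype handling are assumptions for illustration, not vLLM's actual API):

```python
import torch

def estimate_cache_block_size_bytes(block_size: int,
                                    num_attention_layers: int,
                                    num_kv_heads: int,
                                    head_size: int,
                                    dtype: torch.dtype = torch.float16) -> int:
    """Hypothetical helper mirroring the per-block formula (not vLLM's API)."""
    key_cache_entries = block_size * num_kv_heads * head_size   # per layer, per block
    value_cache_entries = key_cache_entries                     # value cache has the same shape
    total_entries = num_attention_layers * (key_cache_entries + value_cache_entries)
    dtype_size = torch.tensor([], dtype=dtype).element_size()   # bytes per element
    return dtype_size * total_entries

# Example: block_size=16, 32 layers, 32 KV heads, head_size=128, fp16 -> 8 MiB per block
print(estimate_cache_block_size_bytes(16, 32, 32, 128))  # 8388608
```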
```python
(model: str, block_size: int, max_num_seqs: int,
 concurrent_lora_int_ids: List[Optional[int]]):
    tokenizer = TokenizerGroup(
        tokenizer_id="facebook/opt-125m",
        enable_lora=False,
        max_num_seqs=max_num_seqs,
        max_input_length=None,
    )

    hashes = []
    for prefix in prefixes:
        for lora_...
```