Same question: why is there no bos_token_id? And why are eos_token_id and pad_token_id equal (both 2)? There is a bos_id, but I couldn't find a corresponding special token, so I changed the code to the following:

```python
tokens = prompt_tokens + src_tokens + ["[gMASK]", "sop"] + tgt_tokens + ["eop"]
input_ids = tokenizer.convert_tokens_to_ids(tokens)
...
```
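To see what the tokenizer actually registers, a minimal sketch; the THUDM/chatglm-6b checkpoint name is an assumption inferred from the [gMASK]/sop/eop markers above, and the attributes used are the standard Hugging Face ones:

```python
from transformers import AutoTokenizer

# Assumed checkpoint (inferred from the [gMASK]/sop/eop markers above);
# its custom tokenizer class requires trust_remote_code.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

print(tokenizer.bos_token_id)  # None if no BOS token is registered
print(tokenizer.eos_token_id)  # the post reports 2
print(tokenizer.pad_token_id)  # the post reports 2 as well

# Map the markers used in the snippet above to their ids directly.
print(tokenizer.convert_tokens_to_ids(["[gMASK]", "sop", "eop"]))
```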
```python
if tokenizer.bos_token != args.bos_token:
    tokenizer.bos_token = args.bos_token
    model.config.bos_token_id = tokenizer.bos_token_id
    if model.generation_config:
        model.generation_config.bos_token_id = tokenizer.bos_token_id
if tokenizer.pad_token != args.pad_token:
    tokenizer.pad_token = args.pad_token
    model.config.pad_token_id = tokenizer.pad_token_id
    if model.generation_config:
        model.generation_config.pad_token_id = tokenizer.pad_token_id
```
I've been using the Trainer functionality for a while, but when trying it with Hugging Face's new SmolLM 135M model, no matter what the dataset, I'd end up with EOS-token warnings (see below). It's possible this is just a new-model quirk...
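A common cause of this class of warning is a checkpoint that ships without a dedicated pad token. A minimal sketch of the usual workaround, assuming the HuggingFaceTB/SmolLM-135M checkpoint and a standard Trainer setup; this is a generic fix, not the poster's confirmed solution:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; adjust to the model actually being trained.
checkpoint = "HuggingFaceTB/SmolLM-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Many small causal-LM checkpoints register no pad token; reusing EOS for
# padding is a common way to silence pad/EOS warnings from the Trainer.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id
```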
```cpp
// Handle add_bos_token and add_eos_token: read the optional
// add-BOS flag from the GGUF metadata, if present.
std::string key = kv(LLM_KV_TOKENIZER_ADD_BOS);
int kid = gguf_find_key(ctx, key.c_str());
// GGUF_TYPE_COUNT serves as a sentinel for "key not present".
enum gguf_type ktype = kid < 0 ? GGUF_TYPE_COUNT : gguf_get_kv_type(ctx, kid);
...
```
```cpp
// prompt: [BOS]query[EOS][SEP]doc[EOS]
prompt_tokens.clear();

// BOS, then the query part of the rerank prompt
prompt_tokens.push_back(llama_token_bos(model));
{
    const auto part = tokenize(slot.prompt[0], false);
    prompt_tokens.insert(prompt_tokens.end(), part.begin(), part.end());
}
prompt_tokens.push_back(llama_token_eos(...
```