I have been using the trainer functionality for a while, but when trying it with Hugging Face's new SmolLM 135M model, I end up with EOS token warnings (see below) no matter which dataset I use. It's possible this is just a new model quir...
Same question: why is there no bos_token_id? Also, why are eos_token_id and pad_token_id equal, both 2? There is a bos_id, but I couldn't find a corresponding special token, so I changed the code to the following: tokens = prompt_tokens + src_tokens + ["[gMASK]", "sop"] + tgt_tokens + ["eop"] input_ids = tokenizer.convert_tokens_to_ids(tokens) ...
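In case it helps anyone debugging the same thing, here is a rough sketch of how the two symptoms mentioned above (no bos_token_id, and pad_token_id equal to eos_token_id) could be checked in one place. The helper name describe_special_tokens is made up for illustration, and the tokenizer here is a plain stand-in object rather than a real model tokenizer. The only assumption is that the object exposes bos_token_id, eos_token_id and pad_token_id attributes, as Hugging Face tokenizers do.

```python
from types import SimpleNamespace


def describe_special_tokens(tokenizer):
    """Collect BOS/EOS/PAD ids and flag the two conditions discussed above.

    Hypothetical helper: it only reads attributes and never modifies the tokenizer.
    """
    ids = {
        name: getattr(tokenizer, name, None)
        for name in ("bos_token_id", "eos_token_id", "pad_token_id")
    }
    issues = []
    if ids["bos_token_id"] is None:
        issues.append("bos_token_id is not set")
    if ids["pad_token_id"] is not None and ids["pad_token_id"] == ids["eos_token_id"]:
        issues.append("pad_token_id equals eos_token_id")
    return ids, issues


# Stand-in object (an assumption, not a real tokenizer) that mimics the
# situation described in the post: no BOS id, and EOS and PAD both equal to 2.
fake_tokenizer = SimpleNamespace(bos_token_id=None, eos_token_id=2, pad_token_id=2)

ids, issues = describe_special_tokens(fake_tokenizer)
print(ids)
print(issues)
```

With a real tokenizer you would pass the loaded tokenizer object instead of fake_tokenizer. Whether a shared EOS/PAD id is actually a problem depends on the model and on how the attention mask and labels are built, so treat the flags as things to look into rather than as errors.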
I am unable to set bos_token_id=0 for a new SentencePiece tokenizer (MBART). Here is what I'm doing: wget https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/sentence.bpe.model from transformers import T5Tokeni...