```python
get_special_tokens_mask(inputs["input_ids"], already_has_special_tokens=True)
```

Output:

```
tokens              : ['foo', '[UNK]', 'bar']
mask                : [1, 0, 0, 0, 1]  # [UNK] is ignored!
mask from input ids : [1, 0, 1, 0, 1]
```

**Expected behavior**

`[UNK]` is a special token, so the `special_tokens_mask` returned by the tokenizer should mark its position with `1`, consistent with what `get_special_tokens_mask` returns when called on the input ids.
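For context, a minimal reproduction of the comparison above might look like the following sketch. `bert-base-uncased` and the input text are assumptions, since the issue does not show how `inputs` was built; the emoji is only assumed to fall outside the vocabulary so that it maps to `[UNK]`:

```python
from transformers import AutoTokenizer

# Assumed checkpoint; the issue does not name the tokenizer it used.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "foo 🤗 bar"  # the emoji is assumed to be out-of-vocabulary -> [UNK]
inputs = tokenizer(text, return_special_tokens_mask=True)

# Tokens without the added [CLS]/[SEP], matching the issue's output
print("tokens              :", tokenizer.tokenize(text))
# Mask produced during encoding: only tokens added by the tokenizer are 1
print("mask                :", inputs["special_tokens_mask"])
# Mask recomputed from the ids: membership in all_special_ids, so [UNK] is 1
print("mask from input ids :", tokenizer.get_special_tokens_mask(
    inputs["input_ids"], already_has_special_tokens=True))
```

The discrepancy comes from the two code paths: the mask built during encoding only flags tokens the tokenizer itself inserted, while the recomputed mask checks each id against the full special-token list, which includes the unknown token.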
```python
tokenizer_slow = AutoTokenizer.from_pretrained("./tok", use_fast=False)
tokenizer_slow(tokenizer_slow.mask_token, add_special_tokens=False)
```

evaluates to `{'input_ids': [4], 'attention_mask': [1]}` (as expected). Note that in either case, `mask_token` is `<mask>` and corresponds to `mask_token_id` 4.
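A self-contained version of the fast/slow comparison could look like the sketch below; `roberta-base` stands in for the issue's local `./tok` directory (an assumption, chosen because its mask token is also `<mask>`):

```python
from transformers import AutoTokenizer

# "roberta-base" is an assumed stand-in for the local "./tok" tokenizer;
# its mask token is also "<mask>".
for use_fast in (True, False):
    tok = AutoTokenizer.from_pretrained("roberta-base", use_fast=use_fast)
    enc = tok(tok.mask_token, add_special_tokens=False)
    # Both variants should encode the mask token to a single id
    # equal to mask_token_id.
    print("fast" if use_fast else "slow", enc["input_ids"],
          "mask_token_id =", tok.mask_token_id)
```

If the fast and slow tokenizers disagree here, one of them is not treating `<mask>` as an atomic added token during encoding.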