We used Swiss-Prot (2021_04), containing more than 0.5 million sequences, to train our tokenizer. Following the training strategy of GPT2 (ref. 17), our final vocabulary contained 50,256 tokens that correspond to the most widely reused oligomers in protein space, with an average size of four amino acids per token.
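A byte-level BPE tokenizer of this kind can be trained with, for example, the Hugging Face `tokenizers` library. The sketch below is illustrative rather than our exact pipeline: the input file name `swissprot_2021_04.txt` is a placeholder for the Swiss-Prot sequences written one per line, and the special token follows the GPT2 convention.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE, as in GPT2: frequent substrings of the training corpus
# (here, amino-acid oligomers) are iteratively merged into single tokens.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_256,                  # final vocabulary size reported above
    special_tokens=["<|endoftext|>"],   # GPT2-style sequence separator
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# Placeholder input: one protein sequence per line in a plain-text file.
tokenizer.train(files=["swissprot_2021_04.txt"], trainer=trainer)
tokenizer.save("protein_bpe_tokenizer.json")

# Learned tokens span ~4 residues on average; inspect a sample encoding:
encoding = tokenizer.encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(encoding.tokens)
```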