The above loss plot is from the first training attempt, using the independent-heads branch of this repo and my other repo, https://github.com/mikayahlevi/transformer-train-script. Moving forward: I have limited compute and limited experience with data science, so I haven't been able to test the LM on much...