You're also using the default `--init-method-std`, which is usually set to `0.02` in Megatron-LM. That is too big for your model size. Based on your hidden size, you want something like 0.01, from `sqrt(2/(3072*5))` - see: https://github.com/bigscience-workshop/bigscience/blob/master/train/lessons...
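As a quick sanity check on that number, here is a minimal sketch that evaluates the formula quoted above. The helper name `small_init_std` and the `factor` parameter are made up for illustration and are not part of Megatron-LM; only the `--init-method-std` flag and the values 3072 and 5 come from the comment.

```python
import math


def small_init_std(hidden_size: int, factor: float = 5.0) -> float:
    """Evaluate sqrt(2 / (hidden_size * factor)), the formula quoted above.

    `factor` is a hypothetical parameter name for the constant 5 in that formula.
    """
    return math.sqrt(2.0 / (hidden_size * factor))


hidden_size = 3072  # hidden size taken from the formula in the comment
init_std = small_init_std(hidden_size)

# Print the value in the form you would pass on the command line.
print(f"--init-method-std {init_std:.4f}")  # prints: --init-method-std 0.0114
```

The result, about 0.0114, is roughly half the `0.02` default, which is where the "something like 0.01" suggestion comes from.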
		; Doesn't happen when using CBIOS, which is
		; good, since some other systems get timeout failures
		; waiting for the floppy disk to spin up.

		pushad				; Try resetting the device
		xor ax,ax
		mov dl,[DriveNumber]
		int 13h
		popad
		loop .retry			; CX-- and jump if not zero

		;shr word [...