for this memory-constrained context, we probably would start applying other memory saving techniques like (Q)LoRA, which changes the formulas. If we use LoRA, then there is lower optimizer state memory, which means that Adafactor gives less absolute savings. ...