The following diagram, from this blog post, illustrates how this works. ZeRO's ingenious approach is to partition the parameters, gradients, and optimizer states equally across all GPUs and give each GPU just a single partition (also referred to as a shard). This leads to zero over...
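To make the partitioning idea concrete, here is a minimal sketch in plain Python. It is not how DeepSpeed or any other library implements ZeRO. It only assumes that the parameters (or gradients, or optimizer states) can be viewed as one flat buffer that is cut into contiguous, near-equal slices, one per GPU. The names `shard_bounds` and `partition` are hypothetical helpers introduced for illustration.

```python
import math
from typing import List, Tuple


def shard_bounds(numel: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Return the [start, end) slice of a flat buffer owned by `rank`.

    Assumption for this sketch: shards are contiguous and of equal size,
    except that trailing shards may be shorter (or empty) when `numel`
    is not divisible by `world_size`.
    """
    shard_size = math.ceil(numel / world_size)
    start = min(rank * shard_size, numel)
    end = min(start + shard_size, numel)
    return start, end


def partition(flat: List[float], world_size: int) -> List[List[float]]:
    """Split a flat buffer into one shard per rank (GPU)."""
    shards = []
    for rank in range(world_size):
        start, end = shard_bounds(len(flat), world_size, rank)
        shards.append(flat[start:end])
    return shards


if __name__ == "__main__":
    # Ten stand-in "parameters" spread over four stand-in "GPUs".
    params = [float(i) for i in range(10)]
    for rank, shard in enumerate(partition(params, world_size=4)):
        print(f"rank {rank} holds {len(shard)} values: {shard}")
```

Each rank stores only its own slice, so per-GPU memory for the partitioned state shrinks roughly in proportion to the number of GPUs. The same slicing would apply to gradients and optimizer states under the same flat-buffer assumption.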