| Params | n_layers | d_model |
|--------|----------|---------|
| 1.4B   | 48       | 2048    |
| 2.8B   | 64       | 2560    |

(The layer count of Mamba is double that of a similarly sized Transformer, as two Mamba blocks are needed for each "layer" (MHA block + MLP block) of a Transformer.) Note: these are base models trained for only 300B tokens, without any form of downstream modification...
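To make the doubling concrete, here is a minimal back-of-the-envelope sketch (not from the original text) under the usual rough parameter accounting: a Mamba block with expansion factor E = 2 uses on the order of 3·E·d_model² parameters, while one Transformer layer uses roughly 4·d_model² (MHA) + 8·d_model² (MLP with 4x expansion), so two Mamba blocks budget-match one Transformer layer. The function names are illustrative, embedding parameters and the smaller SSM-specific tensors are ignored, and the configs plugged in are the two rows from the table above.

```python
# Rough, illustrative parameter accounting (assumptions noted above):
# - Mamba block: ~3 * E * d_model^2 params, with expansion factor E = 2
# - Transformer layer: ~4 * d_model^2 (MHA) + 8 * d_model^2 (MLP), i.e. 12 * d_model^2
# Embeddings and small per-block tensors are deliberately left out.

def approx_mamba_params(n_layers: int, d_model: int, expand: int = 2) -> int:
    """Very rough non-embedding parameter estimate for a stack of Mamba blocks."""
    return n_layers * 3 * expand * d_model ** 2


def approx_transformer_params(n_layers: int, d_model: int) -> int:
    """Very rough non-embedding estimate: 12 * d_model^2 per Transformer layer."""
    return n_layers * 12 * d_model ** 2


# The two configurations from the table above.
for name, n_layers, d_model in [("1.4B", 48, 2048), ("2.8B", 64, 2560)]:
    mamba = approx_mamba_params(n_layers, d_model)
    # A Transformer of "similar size" uses half as many layers,
    # since each of its layers corresponds to two Mamba blocks.
    transformer = approx_transformer_params(n_layers // 2, d_model)
    print(f"{name}: Mamba ~{mamba / 1e9:.2f}B vs Transformer ~{transformer / 1e9:.2f}B")
```

Under these assumptions the two estimates come out identical for each row (about 1.2B and 2.5B non-embedding parameters respectively), which is the point of the parenthetical: doubling the block count is what keeps a Mamba stack parameter-matched to a Transformer of the stated size.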