| Params | n_layers | d_model |
|--------|----------|---------|
| 1.4B   | 48       | 2048    |
| 2.8B   | 64       | 2560    |

(The layer count of Mamba is double that of a similarly sized Transformer, as two Mamba blocks are needed for each "layer" (MHA block + MLP block) of a Transformer.) Note: these are base models trained for only 300B tokens, without any form of downstream modification...
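To make the doubling concrete, here is a minimal back-of-the-envelope sketch (not from the original text) under the usual rough parameter accounting: a Mamba block with expansion factor E = 2 uses on the order of 3·E·d_model² parameters, while one Transformer layer uses roughly 4·d_model² (MHA) + 8·d_model² (MLP with 4x expansion), so two Mamba blocks budget-match one Transformer layer. The function names are illustrative, embedding parameters and the smaller SSM-specific tensors are ignored, and the configs plugged in are the two rows from the table above.

```python
# Rough, illustrative parameter accounting (assumptions noted above):
# - Mamba block: ~3 * E * d_model^2 params, with expansion factor E = 2
# - Transformer layer: ~4 * d_model^2 (MHA) + 8 * d_model^2 (MLP), i.e. 12 * d_model^2
# Embeddings and small per-block tensors are deliberately left out.

def approx_mamba_params(n_layers: int, d_model: int, expand: int = 2) -> int:
    """Very rough non-embedding parameter estimate for a stack of Mamba blocks."""
    return n_layers * 3 * expand * d_model ** 2


def approx_transformer_params(n_layers: int, d_model: int) -> int:
    """Very rough non-embedding estimate: 12 * d_model^2 per Transformer layer."""
    return n_layers * 12 * d_model ** 2


# The two configurations from the table above.
for name, n_layers, d_model in [("1.4B", 48, 2048), ("2.8B", 64, 2560)]:
    mamba = approx_mamba_params(n_layers, d_model)
    # A Transformer of "similar size" uses half as many layers,
    # since each of its layers corresponds to two Mamba blocks.
    transformer = approx_transformer_params(n_layers // 2, d_model)
    print(f"{name}: Mamba ~{mamba / 1e9:.2f}B vs Transformer ~{transformer / 1e9:.2f}B")
```

Under these assumptions the two estimates come out identical for each row (about 1.2B and 2.5B non-embedding parameters respectively), which is the point of the parenthetical: doubling the block count is what keeps a Mamba stack parameter-matched to a Transformer of the stated size.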