```python
import torch

finetune = torch.load(finetune_path, map_location=torch.device('cpu'))
official = torch.load(official_path, map_location=torch.device('cpu'))
print('finetune keys:', finetune.keys(), 'official keys:', official.keys())
# the args in
```
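If both files hold plain state dicts (tensor values, not a wrapper with a 'state_dict' or 'model' key), a quick set difference over the keys makes renamed or missing parameters easier to spot than reading the two full key lists. A small follow-up sketch using the `finetune` and `official` objects loaded above:

```python
finetune_keys = set(finetune.keys())
official_keys = set(official.keys())

# Parameters present in only one of the two checkpoints.
print('only in finetune:', sorted(finetune_keys - official_keys))
print('only in official:', sorted(official_keys - finetune_keys))

# For shared parameters, also compare shapes to catch resized layers.
for key in sorted(finetune_keys & official_keys):
    if finetune[key].shape != official[key].shape:
        print('shape mismatch:', key, finetune[key].shape, official[key].shape)
```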
optim="adamw_torch", per_device_train_batch_size=1, evaluation_strategy="steps", save_strategy="steps", eval_steps=10, save_steps=10, output_dir=tmpdir, save_total_limit=2, load_best_model_at_end=True, save_safetensors=False, ) config = LlamaConfig( hidden_size=16, num_attention...
```python
from accelerate.utils import load_and_quantize_model

quantized_model = load_and_quantize_model(
    empty_model,
    weights_location=weights_location,
    bnb_quantization_config=bnb_quantization_config,
    device_map="auto",
)
```

The concrete implementation of the quantization operation is integrated in the bitsandbytes library's Linear8bitLt module, which is a subclass of torch.nn.Module...
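To make that concrete, a minimal sketch of using Linear8bitLt directly, following the usual bitsandbytes 8-bit pattern (layer sizes and the file name are placeholders, and a CUDA GPU is required, since the quantization happens when the module is moved onto the device):

```python
import torch
import torch.nn as nn
from bitsandbytes.nn import Linear8bitLt

# Build and save an ordinary fp16 model (placeholder sizes).
fp16_model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64)).half()
torch.save(fp16_model.state_dict(), "model_states.pth")

# Rebuild the same architecture with Linear8bitLt layers.
int8_model = nn.Sequential(
    Linear8bitLt(64, 64, has_fp16_weights=False),
    Linear8bitLt(64, 64, has_fp16_weights=False),
)
int8_model.load_state_dict(torch.load("model_states.pth"))

# The int8 quantization of the fp16 weights happens on this .to(GPU) call.
int8_model = int8_model.to(0)
```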
I found that either deliberately loading the optimizer states into cuda from the Trainer, or modifying the torch.optim.AdamW code to shift everything to cuda, did the trick, though I feel like the fix on HF's end is a bit more elegant. Perhaps there's an argument for changing the map_location ...
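A hedged sketch of what that first workaround can look like in isolation, outside the Trainer (the helper name and file path are mine, not from the thread; note that recent PyTorch releases already move most floating-point optimizer state to the parameters' device inside load_state_dict):

```python
import torch


def optimizer_state_to_cuda(optimizer, device="cuda"):
    """Move loaded optimizer state tensors onto the GPU (hypothetical helper)."""
    for state in optimizer.state.values():
        for key, value in state.items():
            # Leave the 'step' counter on the CPU: some PyTorch versions assert
            # it must not be a CUDA tensor unless the optimizer is capturable.
            if key != "step" and torch.is_tensor(value):
                state[key] = value.to(device)


model = torch.nn.Linear(8, 8).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model(torch.randn(2, 8, device="cuda")).sum().backward()
optimizer.step()  # populate exp_avg / exp_avg_sq

# Simulate resuming: save, load back onto the CPU, then shift the state to CUDA.
torch.save(optimizer.state_dict(), "optimizer.pt")
checkpoint = torch.load("optimizer.pt", map_location=torch.device("cpu"))
optimizer.load_state_dict(checkpoint)
optimizer_state_to_cuda(optimizer)
```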
the Megatron-Deepspeed checkpoints since we need it for manipulating the 176B checkpoint, which is much bigger than the 6B of GPT-J-6B. If all goes well this work will eventually end up in the normal ZeRO stages as well. The current torch.load() to cpu is simply not an option we can continue ...