(wq): QuantizedLinear(input_dims=5120, output_dims=5120, bias=False, group_size=64, bits=4)
(wk): QuantizedLinear(input_dims=5120, output_dims=5120, bias=False, group_size=64, bits=4)
(wv): QuantizedLinear(input_dims=5120, output_dims=5120, bias=False, group_size=64, bits=4)
(wo): ...
But how do you know whether it's a quantized model or not? Presumably there are some lines of code somewhere that quantize the model based on the config (prior to loading the safetensors)?
davidkoski commented on Apr 26, 2024
awni commented on Apr 26, 2024 ...
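A quantized export records its settings in config.json under a "quantization" key (the same block the conversion script writes further below), so you can tell before any safetensors are read. A minimal sketch of that check, assuming the mlx-lm checkpoint layout; the helper name and the ./mlx_model path are illustrative:

```python
import json
from pathlib import Path

def quantization_config(model_dir: str):
    """Return the quantization settings from config.json, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("quantization")  # e.g. {"group_size": 64, "bits": 4}

info = quantization_config("./mlx_model")  # hypothetical local checkpoint dir
print("quantized:", info is not None, info)
```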
If it's a 2-bit quantized model, it may work on the latest iPad and iPhone. I hope it will be compatible with Phi-4 too! Is there anything currently being done to support it?
DePasqualeOrg (Contributor) commented on Jan 30, 2025: It's already ...
The model can be quantized with mlx_lm.convert; the default quantization is INT4. This example quantizes Phi-3-mini to INT4. After quantization, the model is stored in the default directory ./mlx_model. We can then test the quantized model with MLX from the terminal ...
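As a sketch of the same flow from Python, assuming mlx_lm's convert/load/generate entry points; the Hugging Face repo id below is an assumption, substitute the Phi-3-mini repo you actually use:

```python
from mlx_lm import convert, load, generate

# Quantize to the default INT4 (4 bits, group_size 64) and write ./mlx_model.
# Roughly equivalent to the terminal invocation:
#   python -m mlx_lm.convert --hf-path <repo> -q
convert(
    hf_path="microsoft/Phi-3-mini-4k-instruct",  # assumed repo id
    mlx_path="./mlx_model",
    quantize=True,
)

# Quick smoke test of the quantized model.
model, tokenizer = load("./mlx_model")
print(generate(model, tokenizer, prompt="Hello, who are you?", max_tokens=64))
```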
     # Quantize the model:
-    nn.QuantizedLinear.quantize_module(model, args.q_group_size, args.q_bits)
+    nn.quantize(model, args.q_group_size, args.q_bits)

     # Update the config:
     quantized_config["quantization"] = {
         "group_size": args.q_group_size,
         "bits": args.q_bits,
     }
     quantized_weights = dict(tree_flatten(model.parameters()))
     return quantized_weights, quantized_config


 def make_...

2 changes: 1 addition & 1 deletion in llms/llama/llama.py
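The change above swaps the deprecated classmethod for the newer nn.quantize, which walks a module tree and replaces compatible nn.Linear layers with nn.QuantizedLinear in place; that is what produces module printouts like the one at the top of this section. A tiny self-contained sketch (the TinyMLP model is illustrative only):

```python
import mlx.nn as nn

class TinyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(512, 512, bias=False)
        self.fc2 = nn.Linear(512, 512, bias=False)

    def __call__(self, x):
        return self.fc2(nn.relu(self.fc1(x)))

model = TinyMLP()
nn.quantize(model, group_size=64, bits=4)  # replaces compatible Linear layers in place
print(type(model.fc1).__name__)            # -> QuantizedLinear
```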
python lora.py --model <path_to_model> \
    --train \
    --iters 600

If --model points to a quantized model, then training will use QLoRA; otherwise it will use regular LoRA. By default, the adapter weights are saved in adapters.npz. You can specify the output location with --adap...
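Why the QLoRA/LoRA switch is automatic: the LoRA adapter wraps whatever linear layer is already in the model, whether a full-precision nn.Linear or an nn.QuantizedLinear, freezes it, and trains only the low-rank update. This is a simplified sketch of that idea, not the actual lora.py implementation (the class name and initialization details are illustrative):

```python
import math

import mlx.core as mx
import mlx.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen base linear (quantized or not) with a trainable low-rank update."""

    def __init__(self, base: nn.Module, in_dims: int, out_dims: int, rank: int = 8):
        super().__init__()
        self.base = base    # nn.Linear or nn.QuantizedLinear
        self.base.freeze()  # base weights stay fixed; only the LoRA params train
        scale = 1.0 / math.sqrt(in_dims)
        self.lora_a = mx.random.uniform(low=-scale, high=scale, shape=(in_dims, rank))
        self.lora_b = mx.zeros((rank, out_dims))

    def __call__(self, x):
        return self.base(x) + (x @ self.lora_a) @ self.lora_b
```

If the wrapped layer came from a quantized checkpoint this is QLoRA; if it is a plain Linear it is regular LoRA, with no change to the training loop.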
-    nn.QuantizedLinear.quantize_module(model, args.q_group_size, args.q_bits)
+    nn.quantize(model, args.q_group_size, args.q_bits)

     # Update the config:
     quantized_config["quantization"] = {
         "group_size": args.q_group_size,
         "bits": args.q_bits,
     }
     quantized_weights = dict(tree_flatten(model.parameters()))
     return quantized_weights, quantized_con...

2 changes: 1 addition & 1 deletion in llms/llama/llama.py

@@ -339,7 +339,7 @@ def load_model(model_path):
     quan...
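The truncated hunk touches load_model in llms/llama/llama.py, which is where the earlier question gets answered: the loader reads the "quantization" block from config.json, quantizes the freshly built model, and only then loads the safetensors weights, so the arrays line up with QuantizedLinear parameters. A hedged sketch of that pattern, not the repo's exact code (load_model_sketch and model_class are placeholders):

```python
import glob
import json
from pathlib import Path

import mlx.core as mx
import mlx.nn as nn
from mlx.utils import tree_unflatten

def load_model_sketch(model_dir: str, model_class):
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text())
    quantization = config.pop("quantization", None)

    model = model_class(**config)  # build the full-precision skeleton
    if quantization is not None:
        # Swap Linear layers for QuantizedLinear before loading any weights
        nn.quantize(model, **quantization)

    weights = {}
    for wf in glob.glob(str(model_dir / "*.safetensors")):
        weights.update(mx.load(wf))
    model.update(tree_unflatten(list(weights.items())))
    return model
```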