- from deepspeed.utils import set_z3_leaf_modules  # type: ignore
+ if getattr(model.config, "model_type", None) == "mixtral":
      from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

- set_z3_leaf_modules(model, [MixtralSparse...
Uses (1) semi-structured 2:4 sparsity (SparseGPT), where, for every four contiguous weights in a tensor, two are set to zero, and (2) channel-wise quantization to compress weights to 8 bits plus dynamic per-token quantization to compress activations to 8 bits. Useful for better inference than...
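The two compression steps above can be sketched in a few lines of NumPy. This is an illustrative, magnitude-based sketch only: SparseGPT itself chooses which two weights to drop per group with a Hessian-based reconstruction criterion, not raw magnitude, and the function names here (`prune_2_of_4`, `quantize_per_token`) are hypothetical, not part of any library API.

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """2:4 semi-structured sparsity: in every contiguous group of
    four weights, zero the two smallest in magnitude (a stand-in for
    SparseGPT's Hessian-based selection)."""
    flat = weights.reshape(-1, 4).copy()
    # indices of the two smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    np.put_along_axis(flat, drop, 0.0, axis=1)
    return flat.reshape(weights.shape)

def quantize_per_token(acts: np.ndarray):
    """Dynamic per-token int8 quantization: one scale per row (token),
    computed at runtime from that token's max absolute value."""
    scales = np.maximum(np.abs(acts).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(acts / scales), -127, 127).astype(np.int8)
    return q, scales  # dequantize with q * scales

w = np.array([[0.9, -0.1, 0.05, -0.7],
              [0.2, 0.8, -0.3, 0.01]])
print(prune_2_of_4(w))  # two zeros in every group of four
```

Weight quantization would analogously use one scale per output channel (column) instead of per row, which is what "channel-wise" refers to above.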
...sparse_feature_cross_op_kernel",
"//tensorflow/contrib/nearest_neighbor:nearest_neighbor_ops_kernels",
"//tensorflow/contrib/rnn:all_kernels",
"//tensorflow/contrib/seq2seq:beam_search_ops_kernels",
"//tensorflow/contrib/tensor_forest:model_ops_kernels",
"//tensorflow/contrib/tensor_forest:...
Fast Llama 2 on CPUs With Sparse Fine-Tuning and DeepSparse (https://neuralmagic.com/blog/fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse/)
LLM Distillation Playbook (by Predibase) - Practical best practices for distilling large language models (https://github.com/predibase/llm...