This may be because you installed auto_gptq using a pre-built wheel on Windows, in which exllama_kernels are not compiled. To use exllama_kernels to further speed up inference, you can re-install auto_gptq from source. CUDA kernels for auto_gptq are not installed, this will result in very ...
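Before re-installing, it can help to confirm which compiled extensions are actually importable. A minimal sketch in Python, assuming the extension module names used by recent auto_gptq releases (these names are assumptions and vary between versions):

    # Probe for auto_gptq's compiled CUDA / exllama extension modules.
    # The module names below are assumptions; older releases use different names.
    import importlib.util

    for ext in ("autogptq_cuda_64", "autogptq_cuda_256", "exllama_kernels"):
        present = importlib.util.find_spec(ext) is not None
        print(f"{ext}: {'installed' if present else 'missing'}")

If these still report missing after a from-source install, the build step most likely skipped the CUDA extensions.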
model = AutoGPTQForCausalLM.from_quantized(
  File "C:\Users\wuyux\anaconda3\envs\localgpt\lib\site-packages\auto_gptq\modeling\auto.py", line 94, in from_quantized
    return quant_func(
  File "C:\Users\wuyux\anaconda3\envs\localgpt\lib\site-packages\auto_gptq\modeling\_base.py", line 74...
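For context, the call that enters this code path usually looks roughly like the sketch below; the model id and keyword arguments are placeholders, and the exact signature differs between auto_gptq versions:

    # Rough sketch of a from_quantized call; values are hypothetical placeholders.
    from auto_gptq import AutoGPTQForCausalLM

    model = AutoGPTQForCausalLM.from_quantized(
        "TheBloke/some-model-GPTQ",  # hypothetical model id or local directory
        use_safetensors=True,
        device="cuda:0",
    )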
  done
  Created wheel for auto-gptq: filename=auto_gptq-0.2.0+cu1162-cp310-cp310-linux_x86_64.whl size=3637006 sha256=84f5263e347cc5199923597b654f994ea35f1f0ea586ae81f5be94984c892b3f
  Stored in directory: /tmp/pip-ephem-wheel-cache-q1oqlde6/wheels/24/88/75/0af9bf8f82c28467ed0e61...
GPTQ kernels are fp16 by default. (commit c420804; Qubitium closed this on Jun 17, 2024.)
set(LLAMA_CUDA_DMMV_X "32" CACHE STRING "llama: x stride for dmmv CUDA kernels")
set(LLAMA_CUDA_DMMV_Y "1" CACHE STRING "llama: y block size for dmmv CUDA kernels")
if (GGML_CUBLAS_USE)
    target_compile_definitions(ggml${SUFFIX} PRIVATE GGML_USE_CUBLAS GGML_CUDA_DMMV_X=${...
Describe the bug: I have been heavily investigating GPT-Neo for our company. Most of our models run directly on GPU with ONNX as the backend. The problem is as follows: running the gpt-neo-1.3B model in a custom-built onnxruntime instance on ...
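For reference, GPU inference with ONNX Runtime in Python is typically set up along the lines sketched below; the model file name and input feed are placeholders, since the actual input names depend on how the model was exported:

    # Sketch of running an exported GPT-Neo ONNX model on GPU with ONNX Runtime.
    # The file name and input name are hypothetical; adjust to the actual export.
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(
        "gpt_neo_1.3B.onnx",
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    input_ids = np.array([[50256, 15496, 995]], dtype=np.int64)  # example token ids
    outputs = sess.run(None, {"input_ids": input_ids})
    print(outputs[0].shape)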
(out var ctx, CUctx_flags.CU_CTX_SCHED_AUTO, dev));
checkCudaErrors(cuCtxSetCurrent(ctx));
cuPrintCurrentContextInfo();
#endif
#if USE_CUDA
gpt2_load_kernels(model);
#endif
// read in model from a checkpoint file
using (SafeFileHandle model_file = new SafeFileHandle(fopen(checkpoint_...