#if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
#ifdef GGML_CUDA_FORCE_MMQ
    return MMQ_DP4A_MAX_BATCH_SIZE;
#else // GGML_CUDA_FORCE_MMQ
    return 128;
#endif // GGML_CUDA_FORCE_MMQ
#else // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
    return MMQ_DP4A_MAX_BATCH_SIZE;
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
...
std::initializer_list<uint32_t> warptile_mmq_l = { 128, 128, 128, 32, device->subgroup_size * 2, 64, 2, 4, 4, device->subgroup_...

// Emulate behavior of CUDA_VISIBLE_DEVICES for Vulkan
char * devices_env = getenv("GGML_VK_VISIBLE_DEVICES");
...
What is the issue? For reference: #3938. The issue might actually be the result of disabling the following mode:
Older versions (0.1.31): ggml_cuda_init: GGML_CUDA_FORCE_MMQ: YES
New versions (after 0.1.31): ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ...
Name and Version
./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: ye...
ggml-cuda.cu
#ifdef GGML_CUDA_FORCE_MMQ
#define MUL_MAT_SRC1_COL_STRIDE 128
#else
// with tensor cores, we copy the entire hidden state to the devices in one go
#define MUL_MAT_SRC1_COL_STRIDE

ggerganov commented Oct 28, 2023: The reason to do it like this is because on the main device...
This PR refactors and optimizes the IQ MMVQ CUDA code. Notably, as part of these changes I'm changing some values in ggml-common.h. The "qr" values are meant to represent how many low-bit data value...
Name and Version
$ ./llama-server
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
version: 0 (unk...