Q5_1 77 69 67 72 101

Note: since cuBLAS is supported only for ggml_mul_mat(), a few CPU resources are still needed to execute the remaining operations.

With hipBLAS

Measurements were made on CPU AMD Ryzen 9 5900X & GPU AMD Radeon RX 7900 XTX.
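The model is RWKV-novel-4-World-7B-20230810-ctx128k; 32 layers were offloaded to the GPU.

Latency per token in ms shown.

| Format | 1 thread | 2 threads | 4 threads | 8 threads | 24 threads |
|--------|----------|-----------|-----------|-----------|------------|
| f16    | 94       | 91        | 94        | 106       | 944        |
| Q4_0   | 83       | 77        | 75        | 110       | 1692       |
| Q4_1   | 85       | 80        | 85        | 93        | 1691       |
| Q5_1   | 83       | 78        | 83        | 90        | 1115       |

Note: same ...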
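To make the hipBLAS numbers easier to compare, the following minimal sketch converts latency per token into tokens per second and picks the fastest thread count for each format. The values are copied from the table above; the names `THREADS`, `LATENCY_MS`, `tokens_per_second`, `best_setting` and `summarize` are hypothetical helpers introduced only for this illustration and are not part of rwkv.cpp.

```python
from typing import Dict, List, Tuple

# Thread counts and latencies (ms per token) copied from the hipBLAS table.
THREADS: List[int] = [1, 2, 4, 8, 24]
LATENCY_MS: Dict[str, List[int]] = {
    "f16":  [94, 91, 94, 106, 944],
    "Q4_0": [83, 77, 75, 110, 1692],
    "Q4_1": [85, 80, 85, 93, 1691],
    "Q5_1": [83, 78, 83, 90, 1115],
}


def tokens_per_second(latency_ms: float) -> float:
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / latency_ms


def best_setting(latencies: List[int]) -> Tuple[int, int]:
    """Return (thread_count, latency_ms) for the lowest latency in a row."""
    index = min(range(len(latencies)), key=lambda i: latencies[i])
    return THREADS[index], latencies[index]


def summarize() -> Dict[str, Tuple[int, int, float]]:
    """Map each format to (best thread count, latency in ms, tokens/s)."""
    summary = {}
    for fmt, row in LATENCY_MS.items():
        threads, latency = best_setting(row)
        summary[fmt] = (threads, latency, tokens_per_second(latency))
    return summary


if __name__ == "__main__":
    for fmt, (threads, latency, tps) in summarize().items():
        print(f"{fmt}: {threads} threads, {latency} ms/token, {tps:.1f} tokens/s")
```

On these measurements the lowest latency for every format is at 2 or 4 threads, and 24 threads is slower by roughly an order of magnitude, so a small thread count looks like the better starting point when most layers are offloaded.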
import argparse
import time
import sampling
from rwkv_cpp import rwkv_cpp_shared_library, rwkv_cpp_model
from tokenizer_util import add_tokenizer_argument, get_tokenizer
from rwkv_cpp import rwkv_world_tokenizer
from typing import List

model_path = "./rwkv-5-h-world-7B-Q5_1.b...
{
+ "version": "7.21.5",
+ "resolved": "https://registry.npmmirror.com/@babel/helper-string-parser/-/helper-string-parser-7.21.5.tgz",
+ "integrity": "sha512-5pTUx3hAJaZIdW99sJ6ZUUgWq/Y+Hja7TowEnLNMm1VivRgZQL3vpBY3qUACVsvw+yQU6+YgfBVmcbLaZtrA1w==",
+ "dev": true,
+ "...