To allow keeping N tokens when shuffling caption tokens

Owner bmaltais commented Jan 27, 2023

I will add it to the next release. It is now in the dev branch. To test it you can:

```
git checkout dev
git pull
```

To go back to the master branch:

```
git checkout master
```
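For readers unfamiliar with the feature: keeping N tokens here means the first N comma-separated tags of a caption stay in place while the rest are shuffled each step, so an identity/trigger tag is never moved. A minimal sketch of that behaviour, assuming comma-separated tags (the function below is my own illustration, not the sd-scripts implementation):

```python
import random

def shuffle_caption(caption: str, keep_tokens: int = 1) -> str:
    """Shuffle comma-separated caption tags, keeping the first `keep_tokens` fixed."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    head, tail = tags[:keep_tokens], tags[keep_tokens:]
    random.shuffle(tail)              # only the tail is shuffled
    return ", ".join(head + tail)

# Example: the trigger tag "sks woman" always stays first
print(shuffle_caption("sks woman, red dress, smiling, outdoors", keep_tokens=1))
```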
So, building on my example above, we can LoRA only a sparse subset of tokens, like:

```python
import torch

N = 1000  # this is the number of tokens in the vocabulary
K = 512   # this is the dimension of the embeddings
T = 10    # number of tokens we want to let be trainable
R = 64    # lora ran...
```
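The snippet is cut off above, so here is a self-contained sketch of what that setup could look like: a frozen embedding table with a low-rank update applied only to T selected token ids. The constants follow the excerpt; `token_ids`, `lora_A`, `lora_B` and the lookup function are my own illustration, not the original code:

```python
import torch

N, K, T, R = 1000, 512, 10, 64        # vocab size, embedding dim, trainable tokens, LoRA rank

emb = torch.nn.Embedding(N, K)
emb.weight.requires_grad_(False)      # the base embedding table stays frozen

token_ids = torch.randperm(N)[:T]     # the sparse subset of token ids we adapt
lora_A = torch.nn.Parameter(torch.randn(T, R) * 0.01)
lora_B = torch.nn.Parameter(torch.zeros(R, K))

def lookup(ids: torch.Tensor) -> torch.Tensor:
    """Embed `ids`, adding a low-rank delta only for the T selected tokens."""
    hits = ids.unsqueeze(-1) == token_ids    # (..., T) membership matrix
    mask = hits.any(-1, keepdim=True)        # which ids are adapted at all
    pos = hits.float().argmax(-1)            # row in lora_A (0 if not adapted, masked out below)
    delta = lora_A[pos] @ lora_B             # (..., K) low-rank update
    return emb(ids) + mask * delta

print(lookup(torch.tensor([0, 1, int(token_ids[0])])).shape)   # torch.Size([3, 512])
```

Only `lora_A` and `lora_B` carry gradients, so the trainable parameter count is T*R + R*K instead of N*K.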
```diff
         struct ggml_tensor * Qcur = llm_build_lora_mm(lctx, ctx0, model.layers[il].wq, cur);
         cb(Qcur, "Qcur", il);
@@ -8837,14 +8847,14 @@ struct llm_build_context {
         }

         Qcur = ggml_rope_ext(
-            ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_...
```
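The hunk is cut off above; the interesting part is `llm_build_lora_mm`, which, as I understand it, wraps the base `ggml_mul_mat` against the frozen weight and adds the contribution of any loaded LoRA adapters on top. A rough Python sketch of that y = W·x + Σ scale·B·(A·x) pattern (names, shapes, and the adapter list are illustrative, not the actual llama.cpp API):

```python
import torch

def lora_mm(W: torch.Tensor, x: torch.Tensor, adapters) -> torch.Tensor:
    """Base projection W @ x plus each adapter's scaled low-rank update B @ (A @ x).

    `adapters` is a list of (A, B, scale) tuples; this mirrors the idea behind
    llm_build_lora_mm, not its signature.
    """
    y = W @ x
    for A, B, scale in adapters:
        y = y + scale * (B @ (A @ x))
    return y

# toy shapes: W is (out, in), A is (r, in), B is (out, r)
W = torch.randn(8, 16)
A, B = torch.randn(4, 16), torch.randn(8, 4)
x = torch.randn(16, 3)                      # a batch of 3 column vectors
print(lora_mm(W, x, [(A, B, 0.5)]).shape)   # torch.Size([8, 3])
```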
Basically the same idea as LM-Infinite, but with more experiments verifying the Attention Sink phenomenon (the initial several tokens).
Earlier layers lean more toward local attention (the diagonal band), while later layers attend more to the initial tokens.
At inference time, the KV cache in the middle is discarded under a sliding-window scheme (the KV cache of the gray blocks is simply no longer kept); see the sketch of this eviction rule below.
LLM Maybe LongLM: Self-Extend LLM...
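Coming back to the sliding-window eviction described above, here is a toy sketch of which KV-cache positions survive under such a scheme: the first few "attention sink" tokens plus a recent window, with everything in between dropped. The function and the default values 4/1024 are my own illustration, not the paper's code:

```python
def kept_positions(cache_len: int, n_sink: int = 4, window: int = 1024) -> list[int]:
    """Token positions whose KV entries are kept: the first `n_sink` sink tokens
    plus the most recent `window` tokens; the middle is evicted."""
    sink = list(range(min(n_sink, cache_len)))
    recent = list(range(max(n_sink, cache_len - window), cache_len))
    return sink + recent

# With 3000 cached tokens, positions 0-3 and 1976-2999 are kept; 1972 middle entries are evicted.
kept = kept_positions(3000)
print(len(kept), kept[:6], kept[-2:])   # 1028 [0, 1, 2, 3, 1976, 1977] [2998, 2999]
```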