I started paying attention to using Sparse Autoencoders (SAEs) to interpret LLMs in October 2023. Over the year and a half up to March 2025, I: (1) trained a series of SAE models on Mistral-7b-inst; (2) explored how to use SAE-based explanations to improve LLM safety on generation tasks and generalization on classification tasks (e.g., Reward Modeling); (3) contributed to a survey on SAEs + LLMs. Some people …
There is another layer of mismatch: our subjective interpretability evaluations are only a proxy for the real goal, "how does this model work?" Some important concepts inside LLMs may not be easy to interpret, and if we blindly optimize for interpretability we may overlook them. For a more detailed discussion of SAE evaluation methods, and of evaluation using SAEs trained on board-game models, see my blog post "Evaluating Sparse Autoencoders with B...".
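One objective proxy that is often reported alongside subjective interpretability judgments is the fraction of the LLM's loss that is recovered when the activations at the hooked layer are replaced by the SAE's reconstruction, measured against a zero-ablation baseline. A minimal sketch of that metric (the function and argument names are illustrative, not from the blog post above); note it only measures reconstruction fidelity, not whether the features are actually human-interpretable:

```python
def fraction_of_loss_recovered(loss_clean: float,
                               loss_with_sae: float,
                               loss_zero_ablation: float) -> float:
    """Proxy metric for SAE reconstruction fidelity.

    loss_clean:         LM loss with the original activations
    loss_with_sae:      LM loss when activations are replaced by SAE reconstructions
    loss_zero_ablation: LM loss when activations are zero-ablated (worst-case baseline)

    Returns 1.0 if splicing in the SAE reconstruction preserves the LM loss
    perfectly, and 0.0 if it is no better than zeroing the activations out.
    """
    return (loss_zero_ablation - loss_with_sae) / (loss_zero_ablation - loss_clean)
```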
10:59 [Hands-on Neural Networks] PyTorch high-dimensional Tensor dimension operations, einops
23:03 [Hands-on Transformer] Implementing a Transformer Decoder by hand (cross-attention, encoder-decoder cross attention)
14:43 [Hands-on Neural Networks] kSparse AutoEncoder: an explicit implementation of sparse activations (SAE on LLM)
16:22 [...
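For reference, a minimal sketch of the kSparse (top-k) activation mentioned in the playlist item above, assuming a plain PyTorch setting; the names and shapes are illustrative, not taken from the video:

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest pre-activations per example; zero out the rest.

    pre_acts: [batch, num_features] encoder pre-activations
    Returns a tensor of the same shape with at most k non-zeros per row.
    """
    values, indices = torch.topk(pre_acts, k=k, dim=-1)
    sparse = torch.zeros_like(pre_acts)
    sparse.scatter_(-1, indices, torch.relu(values))  # ReLU keeps the surviving activations non-negative
    return sparse
```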
Since comparing features across LLMs is challenging due to polysemanticity, in which LLM neurons often correspond to multiple unrelated features rather than to distinct concepts, sparse autoencoders (SAEs) have been employed to disentangle LLM neurons into SAE features corresponding to distinct concepts.
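As a rough sketch of what "comparing features across LLMs" can look like in practice, one simple approach is to run the same inputs through both models, encode each model's activations with its own SAE, and match features by the correlation of their activation patterns over tokens. Everything below (shapes, names, the random stand-in activations) is an illustrative assumption, not the method of the paper referenced above:

```python
import torch

def match_sae_features(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    """Correlate SAE feature activations from two models over the same token batch.

    acts_a: [num_tokens, num_features_a] feature activations from model A's SAE
    acts_b: [num_tokens, num_features_b] feature activations from model B's SAE
    Returns: [num_features_a, num_features_b] correlation matrix; for each
             A-feature, the argmax over its row gives the closest B-feature.
    """
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-6)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-6)
    return (a.T @ b) / acts_a.shape[0]

# Example with random stand-in activations:
corr = match_sae_features(torch.rand(1024, 4096), torch.rand(1024, 4096))
best_match = corr.argmax(dim=1)  # closest model-B feature for each model-A feature
```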
Are Sparse Autoencoders Useful? A Case Study in Sparse Probing
KAN-LLaMA: An Interpretable Large Language Model With KAN-based Sparse Autoencoders (topics: sparse-autoencoders, kolmogorov-arnold-networks, llm-interpretability)
Sparse Autoencoder (SAE) TL;DR: it is just a very wide linear projection + an activation function + another linear projection (possibly with an added threshold, i.e. JumpReLU), with the loss designed so that the activations become sparse. According to transformer-circuits.pub, the LLM's own latent space is highly polysemantic, e.g. a single high-dimensional vector carries information for several distinct human-level semantic concepts ...
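A minimal PyTorch sketch of that architecture, using a plain ReLU encoder with an L1 sparsity penalty in place of the JumpReLU threshold; the widths, coefficient, and names here are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Wide linear encoder + activation + linear decoder, trained to reconstruct
    LLM activations while keeping the hidden feature code sparse."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # d_hidden >> d_model ("very wide")
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        feats = F.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(feats)       # reconstruction of the LLM activation
        return recon, feats

def sae_loss(x, recon, feats, l1_coeff: float = 1e-3):
    # reconstruction error + L1 penalty pushing feature activations toward zero
    return F.mse_loss(recon, x) + l1_coeff * feats.abs().sum(dim=-1).mean()

# Toy usage with random stand-in activations (d_model=4096 as in Mistral-7B, 8x wider hidden layer):
sae = SparseAutoencoder(d_model=4096, d_hidden=32768)
x = torch.randn(16, 4096)
recon, feats = sae(x)
loss = sae_loss(x, recon, feats)
loss.backward()
```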
Preface: this article summarizes sparse networks (Sparse) in Transformers/LLMs, covering: LLMs / language models, VLMs / vision-language models, Prompts, Agents, CoT / chain of thought, MoE / mixture of experts, CLIP / image-language models, RAG / retrieval augmentation, SSM / state space models, M…
Sparse Autoencoder Features for Classifications and Transferability