First they give a comparison, presumably to keep the three from being confused, between an autoencoder (i.e. the usual per-layer SAE), a transcoder (given layer L, predict layer L+1), and a crosscoder (predict a whole group of layers at once). I won't go through the basic setup here; the loss is essentially the same as a standard SAE loss. Related: Residual Stream Analysis with Multi-Layer SAEs (GitHub - tim-lawson/mlsae: Multi-Layer Sparse Autoencoders ...)
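As a minimal sketch of that shared setup (my own illustration, not code from the paper; all names below are made up), the three variants differ only in which activations the decoder is asked to reconstruct, while the loss in every case is reconstruction error plus a sparsity penalty on the latent code:

import torch.nn as nn
import torch.nn.functional as F

class SparseCoder(nn.Module):
    # Shared skeleton: encode source activations into a sparse latent code,
    # then decode toward one or more target activation vectors.
    def __init__(self, d_model: int, d_hidden: int, n_targets: int = 1):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        # One decoder head per target layer: 1 for an SAE or a transcoder,
        # several for a crosscoder.
        self.decoders = nn.ModuleList(nn.Linear(d_hidden, d_model) for _ in range(n_targets))

    def forward(self, x_source):
        z = F.relu(self.encoder(x_source))            # sparse latent features
        recons = [dec(z) for dec in self.decoders]    # one reconstruction per target
        return z, recons

def sparse_coding_loss(z, recons, targets, l1_coeff=1e-3):
    # SAE:        targets = [acts at layer L]        (reconstruct the input itself)
    # transcoder: targets = [acts at layer L+1]      (predict the next layer)
    # crosscoder: targets = [acts at several layers] (predict a group at once)
    mse = sum(F.mse_loss(r, t) for r, t in zip(recons, targets))
    sparsity = l1_coeff * z.abs().sum(dim=-1).mean()
    return mse + sparsity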
Understanding Sparse Autoencoders, an article shared from towardsdatascience.com/: the article mainly discusses how Anthropic constructs and interprets sparse autoencoders by hand in order to improve the interpretability of large language models. It opens with a fable about Zephyra guarding a codex of truth, a metaphor for Anthropic AI's journey toward extracting meaningful features from its models. The author...
Paper: Sparse Autoencoder Features for Classifications and Transferability (tables with annotated results).
This codebase was designed to replicate Anthropic's sparse autoencoder visualisations, which you can see here. The codebase provides 2 different views: a feature-centric view (which is like the one in the link, i.e. we look at one particular feature and see things like which tokens fire...
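As a rough illustration of what the feature-centric view boils down to computationally (my own sketch under assumed tensor shapes, not this codebase's actual API), you take the SAE feature activations over a batch of tokens and rank the tokens by how strongly one chosen feature fires on them:

import torch

def top_activating_tokens(feature_acts: torch.Tensor, tokens: list[str],
                          feature_id: int, k: int = 10):
    # feature_acts: [n_tokens, n_features] SAE feature activations for a batch of tokens
    # Returns the k (token, activation) pairs on which the chosen feature fires hardest.
    scores = feature_acts[:, feature_id]
    top = torch.topk(scores, k=min(k, scores.numel()))
    return [(tokens[i], scores[i].item()) for i in top.indices.tolist()]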
...Until I learn / find out what I am actually doing here (with regard to Sparse Autoencoders), at least. Sparse Autoencoder inspiration: Anthropic research on "Golden Gate Claude" plus their SAE details; OpenAI's Top-K activation function (a replacement for ReLU in Sparse Autoencoders), on arXiv...
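A minimal sketch of the Top-K idea (my own illustration of the trick, not OpenAI's code): rather than applying ReLU and relying on an L1 penalty, keep only the k largest pre-activations per token and zero out everything else, so sparsity is enforced directly:

import torch
import torch.nn as nn

class TopKEncoder(nn.Module):
    # SAE encoder that keeps only the k largest activations per example.
    def __init__(self, d_model: int, d_hidden: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.enc(x)                          # [batch, d_hidden] pre-activations
        topk = torch.topk(pre, k=self.k, dim=-1)   # k largest entries per row
        z = torch.zeros_like(pre)
        z.scatter_(-1, topk.indices, topk.values)  # keep top-k, zero the rest
        return z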
Figure 2: the Sparse Autoencoder's interpretation results. SAEs can be scaled up. The core reason SAEs have caught the attention of the LLM interpretability community is that OpenAI and Anthropic each trained SAEs with tens of millions of features on their most capable models (ref: the GPT-4 SAE and the Claude 3 SAE). In Figure 3 we can see that SAEs have progressed from toy models with only two layers, to early LLMs like GPT-2 with a dozen or so layers and a few hundred million parameters, and on to Claude...
They also represent an early milestone toward what companies like Anthropic are aiming for: an "MRI for machine learning models". They do not yet provide perfect understanding, but they may help detect undesirable behavior. The challenges around SAEs and their evaluation are not insurmountable, and they are the subject of a great deal of ongoing research. [1] An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability...
REPO_ID = "jbloom/GPT2-Small-SAEs" FILENAME = f"final_sparse_autoencoder_gpt2-small_blocks.{layer}.hook_resid_pre_24576.pt" path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME) model, sparse_autoencoder, activation_store = LMSparseAutoencoderSessionloader.load_session_from_pretra...