Fantastic curation. The progression from SHAP/LIME to SAEs and mechanistic interpretability maps exactly how the field shifted from "explain this prediction" to "understand this circuit." That Anthropic Golden Gate Bridge feature paper is a watershed moment; it's like going from trying to explain individual neurons to actually reading the feature manifold. One gap I see, though, is production deployment, where you need real-time explainability at scale. Most of these methods don't address the latency vs. fidelity tradeoff when explaining millions of inferences daily.