1 Comment
The AI Architect

Fantastic curation. The progression from SHAP/LIME to SAEs and mechanistic interpretability maps exactly how the field shifted from "explain this prediction" to "understand this circuit." That Anthropic Golden Gate Bridge feature paper is a watershed moment: it's like going from trying to explain individual neurons to actually reading the feature manifold. One gap I see, though, is production deployment, where you need real-time explainability at scale. Most of these methods don't address the latency vs. fidelity tradeoff when explaining millions of inferences daily.