Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability