
Research Engineer, Interpretability

San Francisco, CA | Full Time | $280K – $480K/yr | 🛡️ AI Safety

Mechanistic Interpretability | PyTorch | AI Safety | Research | Python

Job Description

Join Anthropic's Interpretability team to understand what's happening inside large language models. You'll work on mechanistic interpretability — reverse-engineering the algorithms learned by neural networks — to make AI systems more transparent, predictable, and safe.

You'll design experiments to probe model internals, build tools for visualizing and analyzing neural network activations, and contribute to research that directly informs how we build safer AI systems.

Requirements

• Strong Python and PyTorch skills
• Experience with neural network analysis or visualization
• Curiosity about how neural networks work internally
• Ability to design and execute rigorous experiments
• Research background (publications a plus)

Benefits

• Competitive salary and equity
• Comprehensive health benefits
• Flexible work arrangements
• Annual conference budget
• Mission-driven work on AI safety
