
Research Fellowship - Mechanistic Interpretability
About Vmax
Vmax is an applied research lab developing AI capable of open-ended learning. We are building systems to exceed humans in all capacities by optimizing beyond the local maxima of learning from human expertise.
About the role
LLMs are fantastically powerful and there is a rapidly growing corpus of work devoted to understanding their internal representations and computations. We use the tools of mechanistic interpretability to enhance reinforcement learning by generating intrinsic rewards as a supplement or alternative to downstream human-generated verifiers.
This 3-to-6-month fellowship is for PhD students or equivalent early-career researchers who want to work at the intersection of mechanistic interpretability and reinforcement learning. You will own a focused research project, work closely with Vmax technical staff, and contribute to research publications.
Responsibilities
- Develop mechanistic interpretability methods for understanding internal representations, features, circuits, and computations in language models and agents.
- Investigate how model internals can be used to generate intrinsic rewards, auxiliary objectives, diagnostics, or training signals for reinforcement learning.
- Design and run experiments that test whether interpretability-derived signals improve learning, exploration, generalization, robustness, or sample efficiency.
- Compare internally derived rewards against baselines such as human-generated verifiers, reward models, task-level outcome rewards, and standard intrinsic motivation methods.
- Use techniques such as probing, activation analysis, sparse autoencoders, causal interventions, feature attribution, or representation analysis to study model behavior.
- Analyze failure modes, including reward hacking, spurious features, non-causal correlations, objective misspecification, and overfitting to narrow evaluation distributions.
- Build research code, evaluation harnesses, and experimental infrastructure that make results reproducible and useful to the broader team.
- Communicate research progress clearly through written updates, internal presentations, and final project outputs.
Role Requirements
- Currently enrolled in a PhD program in machine learning, computer science, artificial intelligence, computational neuroscience, mathematics, or a related technical field. Exceptional candidates with equivalent research experience may also be considered.
- Track record of research excellence or strong research promise, demonstrated through publications, preprints, open-source work, technical projects, competitions, or publicly available artifacts.
- Working understanding of reinforcement learning.
- Familiarity with mechanistic interpretability, representation analysis, or empirical methods for understanding neural networks.
- Strong programming ability in Python and experience with at least one major ML framework such as PyTorch or JAX.
- Clear written and verbal communication of technical ideas.
Nice to have
- Experience with LLM post-training methods.
- Familiarity with intrinsic motivation, unsupervised RL, auxiliary objectives, representation learning for RL, or curiosity-driven learning.
- Experience with scalable ML experimentation, distributed training, experiment tracking, or reproducible research infrastructure.
- Interest in turning mechanistic understanding into practical training methods, rather than only analyzing models after training.
Role-specific location policy
- This role is based in our San Francisco office. For exceptional candidates, we are willing to consider a hybrid arrangement.