
Member of Technical Staff, Infrastructure & Training Systems
Location: SF Bay Area or Tokyo, Japan
Type: Full-time
About Radical Numerics
Radical Numerics is an AI lab bringing the rigor of distributed systems, model architecture, and numerics research to the challenges of biology. We are building the infrastructure needed to unlock scaling on vast biological sequence, structure, and image datasets so that biological world models become a reality. Our team introduced hybrid architectures that unlocked million-token context windows, enabling work toward AI-designed whole genomes and real gene-editing tools.
We believe biological world models will require not only strong research ideas but also exceptional training and inference systems: infrastructure that makes large-scale experimentation efficient, reproducible, and robust enough to support rapid scientific iteration. This role is focused on building that foundation.
About the Role
As a Member of Technical Staff, Infrastructure & Training Systems at Radical Numerics, you will design and build the systems that make large-scale model training possible across research and deployment workflows. You will work on distributed training, performance optimization, reusable internal frameworks, and the tooling that helps researchers move quickly without sacrificing reliability.
This role is ideal for someone who combines deep systems instincts with an interest in modern machine learning. You should care about how every layer of the stack affects research velocity: kernel performance, communication overhead, fault tolerance, observability, reproducibility, and the ergonomics of the training loop itself.
What You’ll Do
- Design and scale distributed training systems. Build and optimize training infrastructure for large-scale biological world models across large compute clusters, with a focus on performance, stability, and scalability.
- Maximize throughput and hardware efficiency. Develop performance optimizations across the stack, including communication patterns, memory efficiency, custom kernels, compilation paths, and systems instrumentation, to ensure training compute is used effectively.
- Build reusable training frameworks. Develop internal libraries, abstractions, and workflows that improve reproducibility, reliability, and scalability across new model architectures and training recipes.
- Improve reliability under rapid iteration. Establish standards and mechanisms for robustness, maintainability, debugging, and safe deployment of fast-moving research infrastructure. This includes fault tolerance, checkpointing, monitoring, experiment hygiene, and incident analysis.
- Collaborate across research and engineering. Partner closely with model researchers, training scientists, and data/infrastructure engineers to identify bottlenecks, unblock experiments, and design systems that support new scientific directions rather than constrain them.
- Support new architectures and training paradigms. Adapt infrastructure to the needs of multimodal models, long-context training, and evolving model architectures, so the systems stack remains a research multiplier as model requirements change.
What We’re Looking For
- Strong engineering track record in distributed systems, high-performance ML infrastructure, training systems, or closely related areas.
- Proficiency in building performant, maintainable software in Python, PyTorch, Triton, CUDA, and C++.
- Strong understanding of modern deep learning frameworks and their systems internals.
- Ability to debug complex, multi-layered systems involving distributed training, memory/performance regressions, and reliability issues in large codebases.
- Comfort working in a highly collaborative environment with researchers, engineers, and domain experts, with a bias toward initiative and execution.
- Excellent written and verbal communication skills bridging technical and scientific domains.
Nice to Have
- Experience with large-scale distributed training for frontier or foundation models.
- Contributions to open-source ML systems or infrastructure such as PyTorch, torchtitan, or Megatron-LM.
- Familiarity with ML runtimes, compilers, numerics, communication libraries, and custom kernel development.
- Experience improving researcher productivity through infrastructure design, developer tooling, or workflow improvements.
- Background in applied math, systems, computational biology, or related quantitative sciences.
Why Radical Numerics
- Help build the computational foundation for multimodal biological world models aimed at rapid detection, response, and countermeasures across global health.
- Work on systems problems at the frontier of distributed training, architecture, and numerics, in service of real biological applications.
- Join a collaborative culture that values rigor, creativity, and cross-disciplinary partnership across AI labs, biotechs, hospital systems, and research institutes.
- Competitive compensation, comprehensive benefits, and support for continual learning.