Software Engineer, Infrastructure Generalist
Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals.
We are a small team of scientists, engineers, and builders who've created some of the most widely used AI products, like ChatGPT, Character.ai, Mistral, PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
About This Role
We're looking for a Staff Software Engineer—a generalist across the backend—to help build the systems that power our foundation models.
You'll join a small, high-impact team responsible for architecting and scaling the core infrastructure behind everything we do. You’ll work across the full technical stack, solving complex distributed systems problems and building robust, scalable platforms.
Infrastructure is critical to us: it's the bedrock that enables every breakthrough. You'll work directly with researchers to accelerate experiments, improve infrastructure efficiency, and enable key insights across our models, products, and data assets.
What You’ll Do
- Design, build, and operate scalable, fault-tolerant infrastructure for LLM research: distributed compute, data orchestration, and storage across modalities.
- Develop high-throughput systems for data ingestion, processing, and transformation — including training data catalogs, deduplication, quality checks, and search.
- Build systems for traceability, reproducibility, and robust quality control at every stage of the data lifecycle.
- Implement and maintain monitoring and alerting to support platform reliability and performance.
- Collaborate with research teams to unlock new features, improve system efficiency, and accelerate training cycles.
Required Qualifications
- Technical expertise:
  - 5+ years of experience building distributed systems, ideally supporting high-scale applications or research platforms.
  - Fluent in containerization, orchestration, and distributed compute frameworks.
  - Hands-on experience with Kubernetes, Terraform, service discovery, and workflow orchestration tools.
  - Experience with network programming, load balancing, or distributed consensus systems.
  - Extensive experience with performance optimization, caching strategies, and system scalability patterns.
  - Deeply familiar with cloud infrastructure, microservices architectures, and both synchronous and asynchronous processing.
  - Strong knowledge of databases, storage systems, and how architecture choices impact performance at scale.
  - Proactive about automation, testing, and building tools that empower engineering teams.
- System Design & Performance:
  - Strong proficiency in systems programming languages (Rust) and scripting (Python)
  - Familiarity with performance profiling and optimization in high-throughput distributed environments
  - Track record of architecting resilient systems and debugging complex production issues
- Excellent communication and collaboration skills
Strong Candidates May Also Have
- Experience supporting machine learning training infrastructure or GPU clusters
- Background at AI research labs, high-performance computing centers, or ML-focused companies
- Published work on distributed systems, infrastructure, or performance optimization
- Open-source contributions to infrastructure projects, orchestration tools, or distributed computing frameworks
- Experience with specialized hardware (GPUs, TPUs) and their integrations into distributed training systems
Logistics
- Location: This role is based in San Francisco, California.
- Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
- Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.
- Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $300,000-$350,000 USD.
- We encourage you to apply even if you do not believe you meet every single qualification.
- As set forth in Thinking Machines' Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law.