Cluster & Infrastructure Engineer
Palo Alto, CA
About the Role
RadixArk is looking for a Cluster & Infrastructure Engineer to build and operate large-scale AI clusters that power frontier-level training and inference workloads. You'll design reliable infrastructure for multi-node, multi-rack GPU and TPU systems, optimize cluster utilization and scheduling efficiency, and ensure fault tolerance at scale for SGLang and our production systems.
Requirements
- 4+ years experience building and operating large-scale distributed systems or AI clusters
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or equivalent industry experience
- Strong experience with cluster management systems: Kubernetes, Slurm, or custom schedulers
- Hands-on experience running GPU or TPU clusters at scale
- Solid understanding of networking, storage, and distributed systems fundamentals
- Proficiency in Python, Go, or Bash with production-quality infrastructure-as-code practices
- Production experience operating large clusters (1000+ GPUs/TPUs) is a big plus
Responsibilities
- Build and operate large-scale AI clusters:
  - Kubernetes, Slurm, schedulers, and resource management
  - GPU / TPU clusters, multi-node, multi-rack systems
- Design reliable infrastructure for large-scale training and inference workloads
- Improve cluster utilization, scheduling efficiency, and fault tolerance
- Partner with systems and ML engineers to support frontier-scale workloads
- Monitor, debug, and resolve infrastructure issues affecting training and serving reliability
- Automate deployment, scaling, and maintenance of cluster infrastructure
- Implement observability and alerting systems for cluster health and performance
- Document infrastructure architecture, runbooks, and operational best practices
About RadixArk
RadixArk is an infrastructure-first company built by engineers who've shipped production AI systems, created SGLang (20K+ GitHub stars, the fastest open LLM serving engine), and developed Miles (our large-scale RL framework). We're on a mission to democratize frontier-level AI infrastructure by building world-class open systems for inference and training. Our team has optimized kernels serving billions of tokens daily, designed distributed training systems coordinating 10,000+ GPUs, and contributed to infrastructure that powers leading AI companies and research labs. We're backed by well-known investors in the infrastructure field and partner with Google, AWS, and frontier AI labs. Join us in building infrastructure that gives real leverage back to the AI community.
Compensation
We offer competitive compensation with equity, comprehensive health benefits, and flexible work arrangements. Compensation is determined by location, level, and experience.
Equal Opportunity
RadixArk is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.