Member of Technical Staff — Supercomputing
About the Role
RadixArk is hiring a Member of Technical Staff — Supercomputing to help build, deploy, and operate production-grade AI infrastructure for frontier-scale inference and training workloads.
This role sits at the intersection of engineering, deployment, reliability, and customer infrastructure. You will bring up SGLang, Miles, and the RadixArk infrastructure stack across cloud GPUs, customer VPCs, dedicated clusters, and partner environments. You will help ensure that our systems are not only fast in benchmarks but also reliable, observable, and operationally robust in real production settings.
This is not a traditional DevOps or SRE role. You will need to understand the model, the serving engine, the cluster, the workload, and the production constraints. One day you may be helping a customer bring up a new model on a GPU cluster; another day you may be debugging P99 latency, GPU utilization, autoscaling, networking behavior, deployment failures, or reliability regressions.
We are looking for someone hands-on, technically strong, calm under pressure, and excited to build the supercomputing foundation for a new category of AI infrastructure company.
What You’ll Do
- Deploy SGLang, Miles, and RadixArk infrastructure across customer, cloud, VPC, and dedicated cluster environments.
- Bring up production inference and training workloads for open-weight and customer-specific models.
- Own deployment reliability, environment management, rollout processes, and production validation.
- Debug issues across LLM serving, Kubernetes, networking, GPU infrastructure, storage, cloud capacity, and customer systems.
- Build and improve observability for latency, throughput, uptime, error rates, GPU utilization, memory usage, capacity, and workload health.
- Improve monitoring, alerting, incident response, runbooks, postmortems, and operational processes.
- Help design capacity planning, autoscaling, and reliability strategies for GPU-intensive workloads.
- Work closely with engineering teams to improve deployment tooling, automation, CI/CD, and production readiness.
- Partner with customer engineering teams during POCs, production launches, and ongoing operations.
- Feed deployment and reliability pain points back into the product and engineering roadmap.
- Help build the foundation for a world-class supercomputing deployment and reliability organization.
What We’re Looking For
- Strong hands-on experience operating production systems at meaningful scale.
- Experience with Linux, containers, Kubernetes, networking, cloud infrastructure, and distributed systems.
- Familiarity with GPU infrastructure, LLM inference serving, ML systems, or large-scale training workloads.
- Strong debugging skills across infrastructure, orchestration, serving, and performance layers.
- Experience with Python, Bash, Docker, Kubernetes, Terraform, and cloud environments.
- Familiarity with observability tools such as Prometheus, Grafana, Datadog, OpenTelemetry, ELK, or similar systems.
- Understanding of production metrics such as latency, throughput, uptime, GPU utilization, memory usage, capacity, and error rates.
- Ability to communicate clearly with both internal engineering teams and external customer teams.
- Strong ownership mindset and good judgment under pressure.
- Prior experience in production engineering, SRE, platform engineering, deployment engineering, ML infrastructure, or cloud operations is a strong plus.
- Experience with AWS, GCP, Azure, OCI, CoreWeave, Lambda, Crusoe, or similar infrastructure providers is a plus.
About RadixArk
RadixArk is an infrastructure-first AI company built by engineers who have shipped production AI systems, created SGLang, and developed Miles, our large-scale RL framework.
We are building world-class open systems for inference and training, with the mission of democratizing frontier-level AI infrastructure. Our team has optimized kernels serving billions of tokens daily, designed distributed training systems coordinating 10,000+ GPUs, and contributed to infrastructure used by leading AI companies and research labs.
We partner with major cloud providers, hardware companies, and frontier AI labs to build the next generation of scalable, reliable, and open AI infrastructure.
Compensation
We offer competitive compensation with meaningful equity, comprehensive health benefits, and flexible work arrangements. Compensation is determined by location, level, and experience.
Equal Opportunity
RadixArk is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.