Site Reliability Engineer
RunPod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full-stack AI applications. Founded in 2022, we are a rapidly growing, well-funded company with a remote-first organization spread globally. Our mission is to empower innovators and enterprises to unlock AI's true potential, driving technology and transforming industries. Join us as we shape the future of AI.
As our organization continues its rapid expansion in managing large-scale, distributed systems, we are seeking a full-time, remote Site Reliability Engineer to join our team. This technical position will be pivotal in designing, implementing, and maintaining our robust infrastructure across multiple data centers. The ideal candidate will have deep knowledge of Linux systems, containerization, and virtualization technologies, coupled with strong experience in managing large bare-metal fleets and implementing security best practices. This role offers the opportunity to work with cutting-edge GPU/AI technologies, solve complex problems at scale, and contribute to the reliability and performance of our critical systems. We provide competitive compensation, including stock options, and the flexibility of remote work within a culture that values innovation, continuous learning, and technical excellence.
Key aspects of our SRE approach include:
- Automation First: We write software to manage, scale, and optimize our infrastructure, moving beyond manual operations to enable rapid, consistent, and reliable system scaling.
- Systems Thinking: Our SREs approach problems with a holistic view, considering how changes and improvements in one area can positively impact the entire system.
- Continuous Improvement: We constantly iterate on our processes and tooling, using data-driven decisions to enhance system reliability and performance.
- Proactive Problem Solving: Rather than reactively addressing issues, we build systems and tools that anticipate and mitigate potential problems before they occur.
- Scalability Through Code: We believe in managing infrastructure as code, allowing us to version, test, and deploy our infrastructure configurations with the same rigor as application code.
As an SRE on our team, you'll be at the forefront of this approach, using your software engineering skills to build robust, scalable systems that support our rapidly growing infrastructure. You'll work on challenging projects that require innovative solutions, always with an eye towards automation, reliability, and performance at scale.
If you are passionate about building and maintaining highly reliable, scalable systems and have the skills to match, we want to hear from you. Join our team and help shape the future of AI compute infrastructure!
Responsibilities:
- Design, implement, and maintain robust, scalable, and highly available systems
- Troubleshoot and resolve complex issues in distributed environments
- Develop and implement SLIs and SLOs to ensure system reliability and performance
- Manage and optimize large-scale bare-metal fleets across multiple data centers
- Implement and maintain secure practices for foundational systems
- Collaborate with cross-functional teams to improve system design and operation
- Automate processes to increase efficiency and reduce human error
- Participate in on-call rotations to provide 24/7 support for critical systems
Requirements:
- Deep knowledge of Linux kernel internals, containerization (Docker), virtualization (Kata/QEMU), and networking components
- Extensive experience with distributed system troubleshooting and design
- Proficiency in at least one programming language, preferably Python or Golang
- Proven experience implementing and managing SLIs and SLOs
- Experience with pull-based configuration management tools such as Chef or Puppet
- Demonstrated ability to manage large-scale bare-metal fleets (5,000+ machines) across multiple data centers
- Strong background in implementing security best practices for foundational systems, including secret management, AWS IAM permissions, and key distribution systems
- Comprehensive understanding of OSI model Layers 3, 4, and 7
- Successful completion of a background check
Preferred:
- Bachelor's degree in Computer Science, Engineering, or a related field
- Relevant industry certifications (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator)
- Experience with cloud platforms (AWS, GCP, Azure)
- Familiarity with monitoring and observability tools (e.g., StatsD, Grafana, Datadog, OpenTelemetry, VictoriaMetrics)
- Experience with managing fleets of GPU compute resources at scale
- Strong communication skills and ability to work effectively in a team environment
What You’ll Receive:
- The competitive base pay for this position ranges from $152,000 to $175,000. Factors used to determine your actual pay may include your specific job-related knowledge, skills, and experience.
- Stock options
- The flexibility of remote work with an inclusive, collaborative team.
- An opportunity to grow with a company that values innovation and user-centric design.
- Generous vacation policy to ensure work-life harmony and well-being.
- The chance to contribute to a company with a global impact, based in the US, Canada, and Europe.
RunPod is committed to maintaining a workplace free from discrimination and upholding the principles of equality and respect for all individuals. We believe that diversity in all its forms enhances our team. As an equal opportunity employer, RunPod is committed to creating an inclusive workforce at every level. We evaluate qualified applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, marital status, protected veteran status, disability status, or any other characteristic protected by law.