Site Reliability Engineer
As our organization continues its rapid expansion in managing large-scale distributed systems, we are seeking a full-time, remote Site Reliability Engineer to join our team. This technical position will be pivotal in designing, implementing, and maintaining our infrastructure across multiple data centers. The ideal candidate will have deep knowledge of Linux systems, containerization, and virtualization technologies, coupled with strong experience in managing large bare-metal fleets and implementing security best practices. This role offers the opportunity to work with cutting-edge GPU/AI technologies, solve complex problems at scale, and contribute to the reliability and performance of our critical systems. We provide competitive compensation, including stock options, and the flexibility of remote work within a culture that values innovation, continuous learning, and technical excellence.
Description
- Job Title: Site Reliability Engineer
- Full-time, Remote
- Reports to: Head of Infrastructure
- Salary Range: $152,000 - $175,000
We are seeking a highly skilled and experienced Site Reliability Engineer to join our team. The ideal candidate will have a deep understanding of complex distributed systems and a passion for maintaining and improving large-scale infrastructure.
Key aspects of our SRE approach include:
- Automation First: We write software to manage, scale, and optimize our infrastructure, moving beyond manual operations to enable rapid, consistent, and reliable system scaling.
- Systems Thinking: Our SREs approach problems with a holistic view, considering how changes and improvements in one area can positively impact the entire system.
- Continuous Improvement: We constantly iterate on our processes and tooling, using data-driven decisions to enhance system reliability and performance.
- Proactive Problem Solving: Rather than reactively addressing issues, we build systems and tools that anticipate and mitigate potential problems before they occur.
- Scalability Through Code: We believe in managing infrastructure as code, allowing us to version, test, and deploy our infrastructure configurations with the same rigor as application code.
As an SRE on our team, you'll be at the forefront of this approach, using your software engineering skills to build robust, scalable systems that support our rapidly growing infrastructure. You'll work on challenging projects that require innovative solutions, always with an eye toward automation, reliability, and performance at scale.
Responsibilities:
- Design, implement, and maintain robust, scalable, and highly available systems
- Troubleshoot and resolve complex issues in distributed environments
- Develop and implement SLIs and SLOs to ensure system reliability and performance
- Manage and optimize large-scale bare-metal fleets across multiple data centers
- Implement and maintain security practices for foundational systems
- Collaborate with cross-functional teams to improve system design and operation
- Automate processes to increase efficiency and reduce human error
- Participate in on-call rotations to provide 24/7 support for critical systems
Required Qualifications:
- Deep knowledge of Linux kernel internals, containerization (Docker), virtualization (Kata/QEMU), and networking components
- Extensive experience with distributed system troubleshooting and design
- Proficiency in at least one programming language, preferably Python or Golang
- Proven experience implementing and managing SLIs and SLOs
- Experience with pull-based configuration management tools such as Chef or Puppet
- Demonstrated ability to manage large-scale bare-metal fleets (5,000+ machines) across multiple data centers
- Strong background in implementing security best practices for foundational systems, including secret management, AWS IAM permissions, and key distribution systems
- Comprehensive understanding of OSI model Layers 3, 4, and 7
Preferred Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field
- Relevant industry certifications (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator)
- Experience with cloud platforms (AWS, GCP, Azure)
- Familiarity with monitoring and observability tools (e.g., StatsD, Grafana, Datadog, OpenTelemetry, VictoriaMetrics)
- Experience with managing fleets of GPU compute resources at scale
- Strong communication skills and ability to work effectively in a team environment
If you are passionate about building and maintaining highly reliable, scalable systems and have the skills to match, we want to hear from you. Join our team and help shape the future of AI compute infrastructure!