
Sr. SRE Manager, Site Reliability
As a Sr. SRE Manager, you will be responsible for the overall reliability of our cloud infrastructure, leading the development and execution of strategies that promote high availability, scalability, and performance. You will mentor and grow an SRE team, partner with development and operations teams, and leverage automation and DevOps practices to streamline workflows. The ideal candidate will have significant experience working with AWS, Kubernetes, Grafana, and CI/CD tools such as GitLab and Argo CD.
This is a leadership role requiring both technical expertise and strong management skills. You’ll be instrumental in fostering a culture of operational excellence, driving continuous improvement, and ensuring system reliability at scale.
You Will:
- Lead and mentor a team of SREs, providing guidance, technical support, and career development.
- Build and scale the SRE team, hiring top talent and fostering a collaborative, results-driven environment.
- Own the strategy for ensuring service reliability, availability, and scalability in production environments.
- Work with leadership and cross-functional teams to set performance goals and track progress towards reliability metrics (SLOs, SLIs, SLAs).
- Champion a DevOps culture by promoting collaboration between engineering, operations, and product teams.
- Design and implement scalable, resilient, and secure cloud infrastructure on AWS, with a focus on high availability and fault tolerance.
- Lead automation initiatives for infrastructure provisioning, application deployment, and monitoring using tools such as Terraform, Ansible, and CloudFormation.
- Manage and optimize Kubernetes clusters (EKS or OKE) for container orchestration and ensure smooth operations of microservices.
- Implement and continuously improve CI/CD pipelines using tools like GitLab, Argo CD, and others, enabling seamless, automated deployments.
- Develop and enforce best practices for monitoring, alerting, and incident management using tools such as Grafana, Prometheus, and CloudWatch.
- Drive the definition and implementation of reliability best practices, focusing on minimizing downtime.
- Lead incident response, managing post-mortem analysis and implementing improvements to avoid recurrence.
- Ensure effective disaster recovery plans are in place, are regularly tested, and meet business continuity objectives.
- Develop and track operational metrics that reflect the health and reliability of production systems.
- Implement and enforce service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs) for cloud systems.
Cloud Infrastructure & DevOps:
- Manage AWS cloud infrastructure to ensure scalability, cost optimization, and performance.
- Support the adoption and implementation of DevOps principles across teams, enabling faster, more reliable software delivery.
- Drive the use of Infrastructure as Code (IaC) to automate cloud provisioning, configuration management, and system monitoring.
- Foster an automation-first mindset, identifying opportunities to streamline and improve operational processes.
Optional Technologies & Knowledge:
- OCI (Oracle Cloud Infrastructure): Experience with managing infrastructure and services in OCI to extend multi-cloud strategies.
- OpenVPN: Knowledge of OpenVPN or other VPN solutions for secure remote access and infrastructure connectivity.
- EMQX: Familiarity with EMQX (MQTT broker) for managing high-throughput message queues in real-time systems.
- Argo CD: Hands-on experience with Argo CD for managing GitOps-based CI/CD pipelines.
- Data Engineering: Familiarity with data engineering concepts such as ETL pipelines, data lakes, and analytics systems.
You Bring:
- 10+ years of experience in Site Reliability Engineering or a related field, with 3+ years in a leadership role.
- Deep expertise in managing AWS infrastructure, including EC2, S3, RDS, Lambda, CloudWatch, and others.
- Strong experience with Kubernetes (EKS or OKE) for orchestrating containerized applications at scale.
- Proven expertise with CI/CD tools, especially GitLab, Argo CD, or similar.
- Experience with Grafana, Prometheus, or other monitoring tools to provide real-time observability and performance insights.
- Solid understanding of Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible.
- Strong programming or scripting skills (e.g., Python, Bash, Go).
- Excellent problem-solving skills with a focus on root cause analysis and continuous improvement.
- Ability to collaborate and communicate effectively across technical and non-technical teams.
Preferred Qualifications:
- Experience with Oracle Cloud Infrastructure (OCI), understanding multi-cloud strategies and hybrid cloud environments.
- Knowledge of OpenVPN or other VPN solutions for secure network access and infrastructure management.
- Hands-on experience with EMQX, especially for real-time data streaming or messaging in IoT or event-driven applications.
- Knowledge of Argo CD for continuous delivery and GitOps-based operations.
- Experience in Data Engineering practices, including data pipelines, ETL processes, and integration with big data or analytics platforms.
Soft Skills:
- Strong leadership, mentorship, and team-building abilities.
- Excellent communication skills, both written and verbal.
- A collaborative, empathetic approach to managing teams and driving cross-functional alignment.
- Proven ability to work under pressure and manage high-priority incidents effectively.
- A passion for reliability engineering and ensuring systems perform at their best.
Base Pay Range (Annual)
$173,700 - $254,760 USD
By submitting your application, you understand and agree that your personal data will be processed in accordance with our Candidate Privacy Notice. If you are a California resident, please refer to our California Candidate Privacy Notice.