
SRE
Reliability & Performance
· Design and implement monitoring, alerting, and reliability tooling using CloudWatch, Grafana, Prometheus, Datadog, or ELK.
· Analyze production performance, capacity, and error budgets to maintain agreed SLIs/SLOs.
· Implement automated health checks, scaling rules, and self-recovery mechanisms to minimize manual intervention.
· Drive root cause analysis (RCA) and post-incident reviews, ensuring permanent fixes and documentation.
Automation & Operations
· Build automation for deployment, configuration, and infrastructure management using Terraform, Ansible, or CloudFormation.
· Develop and maintain CI/CD pipelines with GitHub Actions, GitLab CI, or Jenkins.
· Manage and optimize containerized and serverless workloads (Kubernetes, ECS, EKS, Lambda).
· Implement automated rollbacks, blue/green deployments, and canary releases.
Incident Response & On-Call
· Participate in a 24/7 on-call rotation for critical systems and lead incident management for your domain.
· Reduce mean time to detection (MTTD) and mean time to recovery (MTTR) through proactive automation and observability.
· Develop runbooks and operational playbooks for global SRE teams.
Security & Compliance
· Embed security practices into automation and deployment processes.
· Ensure systems adhere to ISO 27001 and SOC 2 requirements through continuous compliance monitoring.
· Manage IAM policies, secrets, and network configurations securely and efficiently.
Collaboration & Continuous Improvement
· Partner with developers to design for operability, scalability, and resilience from day one.
· Contribute to cross-team reliability reviews and platform improvement initiatives.
· Champion DevOps and reliability culture across Amtech’s engineering organization.
Qualifications
· 4+ years of experience in Site Reliability, DevOps, or Infrastructure Engineering roles.
· Strong background in AWS (EC2, ECS/EKS, RDS, Lambda, S3, IAM, VPC).
· Proficiency with Infrastructure-as-Code and automation (Terraform, Ansible, CloudFormation).
· Experience with observability tools (Prometheus, Grafana, CloudWatch, ELK, or Datadog).
· Scripting and automation skills (Python, Bash, Go, or PowerShell).
· Solid understanding of networking, DNS, and load balancing.
· Strong troubleshooting, incident management, and root cause analysis skills.
· Excellent communication and collaboration abilities in a cross-functional, distributed environment.
· AWS Solutions Architect Professional certification preferred.
