Senior Site Reliability Engineer
Coimbatore, Tamil Nadu, India; Hyderabad, Telangana, India
Job Overview:
Drive reliability and operational maturity for Kubernetes workloads on GKE through safe rollout patterns, high-signal observability, resilient IaC, and effective incident response. Collaborate with developers to harden CI/CD pipelines and address infrastructure concerns within application code.
Key responsibilities:
- Design and maintain resilient deployment patterns (blue-green, canary, GitOps syncs) across services.
- Instrument and optimize logs, metrics, traces, and alerts to reduce noise and improve signal.
- Review backend code (e.g., Django, Node.js, Go, Java) with a focus on infra touchpoints like database usage, timeouts, error handling, and memory consumption.
- Tune and troubleshoot GKE workloads, HPA configs, network policies, and node pool strategies.
- Improve or author Terraform modules for infrastructure resources (e.g., VPC, CloudSQL, Secrets, Pub/Sub).
- Diagnose production issues using logs, traces, and dashboards, and lead or support incident response.
- Reduce config drift across environments and standardize secrets, naming, and resource tagging.
- Collaborate with developers to harden delivery pipelines, standardize rollout readiness, and clean up infra smells in code.
Key skills:
- Have 4–6+ years of experience in backend or infra-focused engineering roles (e.g., SRE, platform, DevOps, or fullstack).
- Can confidently write or review production-grade code and infra-as-code (Terraform, Helm, GitHub Actions, etc.).
- Have deep hands-on experience with Kubernetes in production, ideally on GKE, including workload autoscaling and ingress strategies.
- Understand cloud concepts like IAM, VPCs, secret storage, workload identity, and CloudSQL performance characteristics.
- Think in systems: you understand cascading failure, timeout boundaries, dependency health, and blast radius.
- Regularly contribute to incident mitigation or long-term fixes (not just closing alerts).
- Can influence through well-written PRs, documentation, and thoughtful design reviews.
Good to have:
- Exposure to GitOps tooling such as ArgoCD or FluxCD.
- Experience developing or integrating Kubernetes operators.
- Familiarity with service-level indicators (SLIs), service-level objectives (SLOs), and structured alerting.
Tools and Expectations:
- Datadog - Monitor infrastructure health, capture service-level metrics, and reduce alert fatigue through high-signal thresholds.
- PagerDuty - Own the incident management pipeline. Route alerts by severity and align with business SLAs.
- GKE / Kubernetes - Improve cluster stability and workload isolation. Define autoscaling configurations and tune them for efficiency.
- Helm / GitOps (ArgoCD/Flux) - Validate release consistency across clusters. Monitor sync status and rollout safety.
- Terraform Cloud - Support DR planning and detect infrastructure changes through state comparisons.
- CloudSQL / Cloudflare - Diagnose DB and networking issues. Monitor latency, enforce access patterns, and validate WAF usage.
- Secret Management - Audit access to secrets, apply short-lived credentials, and define alerts for abnormal usage.