Staff Site Reliability Engineer

Chennai, Tamil Nadu, India

Who we are:

Arcadia is the AI-powered energy intelligence platform for businesses. We replace fragmented tools and manual workflows with one platform to pay utility bills, buy energy, and advance sustainability — across every location, at enterprise scale.

Trusted by Fortune 2000 companies, Arcadia combines unified data, AI-powered analytics, and expert advisory to help enterprise teams save money, mitigate risk, and cut carbon.

We deliver this through three comprehensive solutions:

  • Utility Bill Management: Automating the entire utility bill lifecycle — from data capture and validation to payment processing and auditing.
  • Energy Procurement Advisory: Bringing together comprehensive data, AI-powered analytics, market expertise, and a strong partner network to make sophisticated procurement options accessible to all.
  • Sustainability Reporting: Verified emissions data with seamless integration into leading sustainability platforms.

Tackling the world's most complex energy challenges requires diverse thinking. We're building teams of people from different backgrounds, industries, and disciplines — united by a belief that energy management should be simple, intelligent, and a genuine driver of business value.

What we’re looking for:

We are seeking a Staff Site Reliability Engineer (L4) to join our SRE/Platform Engineering team in India. This is a senior technical leadership role — not people management, but engineering leadership through execution, mentorship, and architectural ownership.

Our India SRE team is growing, and this role is central to that growth. As we scale, we need a technical anchor in the India timezone who can independently own multi-week SRE projects from problem statement to production, make sound architectural decisions under ambiguity, and elevate the team around them. You will be the person engineers lean on for design reviews, debugging escalations, and “how should we approach this?” conversations. You’ll bring the depth and experience to drive execution autonomously in the India timezone while collaborating closely with US-based SRE leadership on roadmap priorities, incident response, and platform strategy.

This is a role for someone who doesn’t wait for direction — you identify reliability gaps, propose solutions, build consensus, and ship.

Our infrastructure is primarily AWS-based, managed by Terraform and CloudFormation, and deployed using CI/CD best practices. In your application, please include a link to GitHub or another place where your code is published, though we understand that not everyone has public code online.

What you’ll do:

  • Own and deliver SRE projects end-to-end — from scoping and design through implementation, testing, rollout, and documentation
  • Serve as a technical anchor for the India SRE team — conduct design reviews, pair on complex debugging, and mentor engineers to develop the judgment to work through ambiguous problems independently
  • Design and implement infrastructure solutions across AWS (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, S3, CloudFront, Lambda, SQS) using Terraform and CloudFormation, with an emphasis on making the right tradeoffs between speed, reliability, and cost
  • Lead Kubernetes operations including cluster upgrades, capacity planning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments — and build the runbooks and automation so these become repeatable rather than one-off heroics
  • Evolve CI/CD pipelines across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD — with an emphasis on reducing manual deployment steps and improving rollback safety 
  • Drive observability stack enhancements — deliver the infrastructure and architectural direction necessary for engineering teams to leverage Prometheus, Grafana, and CloudWatch effectively
  • Identify and execute FinOps initiatives — find zombie resources, right-size instances, enforce tagging standards, and present cost-reduction recommendations with data to back them up
  • Manage database reliability across MySQL and PostgreSQL including backup validation, performance tuning, replication health, failover testing, and operational runbooks
  • Strengthen security posture through IAM least-privilege enforcement, CSPM reviews, GuardDuty/CloudTrail monitoring, secrets management (Vault, AWS Secrets Manager, Parameter Store), and audit readiness
  • Troubleshoot complex cross-cutting production issues spanning networking, Kubernetes, compute, databases, and CI/CD — and then turn the fix into a runbook or automation so the same issue doesn’t require the same person next time
  • Write the documentation the team actually needs — architecture decision records, operational runbooks, troubleshooting guides, and post-incident action items that get closed, not just filed
  • Collaborate daily with US-based SRE leadership on incident reviews, migration planning, roadmap execution, and platform strategy — bringing context and recommendations, not just status updates
  • Participate in on-call rotations and drive post-incident analysis with a focus on systemic fixes over individual blame

 

What will help you succeed:

Must-haves:

  • 10–14 years of experience in SRE/DevOps/Cloud Engineering, with a demonstrated progression from task execution to project ownership — we’re looking for evidence that you have independently scoped, designed, and delivered infrastructure projects end-to-end
  • Deep, hands-on expertise with AWS — EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty, Lambda, SQS. You should be able to architect a multi-AZ, multi-account solution and explain why you made the choices you made
  • Strong Terraform skills with experience managing complex, multi-environment state, writing reusable modules, and reviewing others’ IaC for correctness and maintainability
  • Advanced Kubernetes knowledge — you don’t just deploy to K8s, you troubleshoot networking issues at the CNI level, tune resource requests and limits based on actual usage data, and can plan and execute cluster upgrades with minimal downtime
  • CI/CD pipeline design and ownership across Jenkins (Groovy), GitHub Actions, ArgoCD, or FluxCD — with a track record of improving deployment reliability and reducing manual steps
  • Observability stack experience with Prometheus, Grafana, Datadog, or equivalent — including defining SLOs/SLIs, building meaningful dashboards, and tuning alerting to reduce noise
  • Proven mentorship ability — you have helped less experienced engineers grow. This could be formal (tech lead role, code review ownership) or informal (the person everyone goes to when they’re stuck). We will ask you about this in interviews
  • Strong written and verbal communication skills — you will interact with US-based teams daily, present proposals asynchronously, and write documentation that others can actually follow
  • Automation-first mindset — your instinct when you do something manually is to immediately think about how to script it. You have a track record of reducing operational toil through scripting and tooling
  • Incident management experience — you have led or significantly contributed to incident response and post-incident reviews in production environments, and you understand the difference between fixing the symptom and fixing the system
  • Ability to operate with autonomy — you don’t need daily direction. Given a problem space and constraints, you can propose an approach, pressure-test it with peers, and execute

Nice-to-haves:

  • Experience with FinOps practices — cloud cost analysis, rightsizing, tagging governance, reserved instance planning
  • Exposure to secrets management platforms (HashiCorp Vault, AWS Secrets Manager)
  • Experience with event-driven architectures using AWS Lambda, CloudWatch Events, SQS, and SNS
  • Exposure to AI-enabled tooling (automation assistants, MCP, RAG pipelines, LLM-based debugging)
  • Experience with data warehouses (Snowflake) and their operational requirements
  • Experience with n8n or similar workflow automation platforms
  • Industry certifications — AWS Solutions Architect Professional, CNCF CKA/CKS, HashiCorp Terraform Associate, or equivalent
  • Experience working in a company that has grown through acquisitions, with exposure to consolidating disparate infrastructure environments

Benefits:

  • Competitive compensation based on market standards 
  • Hybrid working model with a remote-first policy
  • In addition to a fixed base salary, candidates are eligible for the following benefits:
    • Flexible Leave Policy
    • Office located in the heart of the city, available whenever you need to come in
  • Comprehensive coverage, including accident and life insurance
  • Medical Insurance (1+5 Family Members)
  • Flexible Benefit Plan
  • Awards and Bonus
  • Annual performance cycle
  • Quarterly engagement activities

A supportive engineering culture that values diversity, empathy, teamwork, trust, and efficiency

Eliminating carbon footprints, eliminating carbon copies.

Here at Arcadia, we cultivate diversity, celebrate individuality, and believe unique perspectives are key to our collective success in creating a clean energy future. Arcadia is committed to equal employment opportunities regardless of race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, disability, genetic information, protected veteran status, or any status protected by applicable federal, state, or local law. 



Arcadia Self-Identification Questions

For government reporting purposes, we ask candidates to respond to the below self-identification survey. Whatever your decision, it will not be considered in the hiring process or thereafter. Any information that you do provide will be recorded and maintained in a confidential file.

As set forth in Arcadia’s Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law.

---

If you believe you belong to any of the categories of protected veterans listed below, please indicate by making the appropriate selection. As a government contractor subject to the Vietnam Era Veterans Readjustment Assistance Act (VEVRAA), we request this information in order to measure the effectiveness of the outreach and positive recruitment efforts we undertake pursuant to VEVRAA. Classification of protected categories is as follows:

A "disabled veteran" is one of the following: a veteran of the U.S. military, ground, naval or air service who is entitled to compensation (or who but for the receipt of military retired pay would be entitled to compensation) under laws administered by the Secretary of Veterans Affairs; or a person who was discharged or released from active duty because of a service-connected disability.

A "recently separated veteran" means any veteran during the three-year period beginning on the date of such veteran's discharge or release from active duty in the U.S. military, ground, naval, or air service.

An "active duty wartime or campaign badge veteran" means a veteran who served on active duty in the U.S. military, ground, naval or air service during a war, or in a campaign or expedition for which a campaign badge has been authorized under the laws administered by the Department of Defense.

An "Armed forces service medal veteran" means a veteran who, while serving on active duty in the U.S. military, ground, naval or air service, participated in a United States military operation for which an Armed Forces service medal was awarded pursuant to Executive Order 12985.

---

Voluntary Self-Identification of Disability

Why are you being asked to complete this form?

We are required to measure our progress toward having at least 7% of our workforce be individuals with disabilities. To do this, we must ask applicants and employees if they have a disability or have ever had a disability. Because a person may become disabled at any time, we ask all of our employees to update their information at least every five years.

Identifying yourself as an individual with a disability is voluntary, and we hope that you will choose to do so. Your answer will be maintained confidentially and not be seen by selecting officials or anyone else involved in making personnel decisions. Completing the form will not negatively impact you in any way, regardless of whether you have self-identified in the past. For more information about this form or the equal employment obligations of federal contractors under Section 503 of the Rehabilitation Act, visit the U.S. Department of Labor’s Office of Federal Contract Compliance Programs (OFCCP) website at www.dol.gov/ofccp.

How do you know if you have a disability?

You are considered to have a disability if you have a physical or mental impairment or medical condition that substantially limits a major life activity, or if you have a history or record of such an impairment or medical condition.

Disabilities include, but are not limited to:

  • Autism
  • Autoimmune disorder, for example, lupus, fibromyalgia, rheumatoid arthritis, or HIV/AIDS
  • Blind or low vision
  • Cancer
  • Cardiovascular or heart disease
  • Celiac disease
  • Cerebral palsy
  • Deaf or hard of hearing
  • Depression or anxiety
  • Diabetes
  • Epilepsy
  • Gastrointestinal disorders, for example, Crohn's disease or irritable bowel syndrome
  • Intellectual disability
  • Missing limbs or partially missing limbs
  • Nervous system condition, for example, migraine headaches, Parkinson’s disease, or multiple sclerosis (MS)
  • Psychiatric condition, for example, bipolar disorder, schizophrenia, PTSD, or major depression