
Site Reliability Engineer - Big Data (5 to 7 years)
About PhonePe Limited:
PhonePe is headquartered in India. Its flagship product, the PhonePe digital payments app, was launched in August 2016. As of April 2025, PhonePe has over 60 Crore (600 Million) registered users and a digital payments acceptance network spread across over 4 Crore (40+ Million) merchants. PhonePe also processes over 33 Crore (330+ Million) transactions daily with an Annualized Total Payment Value (TPV) of over INR 150 lakh crore.
PhonePe’s portfolio of businesses includes the distribution of financial products (Insurance, Lending, and Wealth) as well as new consumer tech businesses (Pincode - hyperlocal e-commerce, and Indus AppStore - a localized app store for the Android ecosystem) in India. These are aligned with the company’s vision to offer every Indian an equal opportunity to accelerate their progress by unlocking the flow of money and access to services.
Culture:
At PhonePe, we go the extra mile to make sure you can bring your best self to work every day! And that starts with creating the right environment for you. We empower people and trust them to do the right thing. Here, you own your work from start to finish, right from day one. PhonePe-rs solve complex problems and execute quickly, often building frameworks from scratch. If you’re excited by the idea of building platforms that touch millions, ideating with some of the best minds in the country, and executing on your dreams with purpose and speed, join us!
About the Role
As a Site Reliability Engineer (Big Data) with 5 to 7 years of experience, you will be responsible for ensuring the stability, scalability, and performance of PhonePe's distributed systems operating at scale. You will collaborate with development, infrastructure, and data teams to automate operations, reduce manual effort, handle incidents, and continuously improve system reliability. This role requires strong problem-solving skills, operational ownership, and a proactive approach to mentoring and driving engineering excellence.
Roles and Responsibilities
- Ensure the ongoing stability, scalability, and performance of PhonePe’s Hadoop ecosystem and associated services.
- Manage and administer Hadoop infrastructure including HDFS, HBase, Hive, Pig, Airflow, YARN, Ranger, Kafka, Pinot, and Druid.
- Automate business-as-usual (BAU) operations through scripting and tool development.
- Perform capacity planning, system tuning, and performance optimization.
- Set up, configure, and manage Nginx in high-traffic environments.
- Administer and troubleshoot Linux and Big Data systems, including networking (IP, iptables, IPsec).
- Handle on-call responsibilities, investigate incidents, perform root cause analysis, and implement mitigation strategies.
- Collaborate with infrastructure, network, database, and BI teams to ensure data availability and quality.
- Apply system updates, patches, and manage version upgrades in coordination with security teams.
- Build tools and services to improve observability, debuggability, and supportability.
- Participate in Kerberos and LDAP administration.
- Carry out capacity planning and performance tuning of Hadoop clusters.
- Work with configuration management and deployment tools like Puppet, Chef, Salt, or Ansible.
Skills Required
- Minimum 1 year of Linux/Unix system administration experience.
- Over 4 years of hands-on experience in Hadoop administration.
- Minimum 1 year of experience managing infrastructure on public cloud platforms like AWS, Azure, or GCP (optional).
- Strong understanding of networking, open-source tools, and IT operations.
- Proficient in scripting and programming (Perl, Golang, or Python).
- Hands-on experience maintaining and managing Hadoop ecosystem components like HDFS, YARN, HBase, and Kafka.
- Strong operational knowledge in systems (CPU, memory, storage, OS-level troubleshooting).
- Experience in administering and tuning relational and NoSQL databases.
- Experience in configuring and managing Nginx in production environments.
- Excellent communication and collaboration skills.
Good to Have
- Experience designing and maintaining Airflow DAGs to automate scalable and efficient workflows.
- Experience in ELK stack administration.
- Familiarity with monitoring tools like Grafana, Loki, Prometheus, and OpenTSDB.
- Exposure to security protocols and tools (Kerberos, LDAP).
- Familiarity with distributed systems like Elasticsearch or similar high-scale environments.
PhonePe Full Time Employee Benefits (Not applicable for Intern or Contract Roles)
- Insurance Benefits - Medical Insurance, Critical Illness Insurance, Accidental Insurance, Life Insurance
- Wellness Program - Employee Assistance Program, Onsite Medical Center, Emergency Support System
- Parental Support - Maternity Benefit, Paternity Benefit Program, Adoption Assistance Program, Day-care Support Program
- Mobility Benefits - Relocation benefits, Transfer Support Policy, Travel Policy
- Retirement Benefits - Employee PF Contribution, Flexible PF Contribution, Gratuity, NPS, Leave Encashment
- Other Benefits - Higher Education Assistance, Car Lease, Salary Advance Policy
Our inclusive culture promotes individual expression, creativity, innovation, and achievement, and in turn helps us better understand and serve our customers. We see ourselves as a place for intellectual curiosity, ideas, and debates, where diverse perspectives lead to deeper understanding and better-quality results. PhonePe is an equal opportunity employer and is committed to treating all its employees and job applicants equally, regardless of gender, sexual preference, religion, race, color, or disability. If you have a disability or special need that requires assistance or reasonable accommodation during the application and hiring process, including support for the interview or onboarding process, please fill out this form.
Read more about PhonePe on our blog.