Site Reliability Engineer
Who We Are:
Resident is an industry leader in the Direct-to-Consumer (e-commerce) space. While our customers are primarily based in the US, our R&D, Product, and Data teams have been operating out of Tel Aviv since our founding. Our mission is simple: we are building a best-in-class e-commerce platform that leverages data and technology to create a competitive advantage for our brands. Starting from the marketing acquisition funnel and continuing through each customer’s journey, our tools and technology enable us to go the extra step to deliver a world-class customer experience.
Our company is built around continuously improving our ability to introduce new customers to our products and wow them with exceptional experiences through the shopping and post-purchase journey. We love to use data and metrics to drive our decisions while keeping in mind that customers don’t speak in numbers and that each one should be treated as a member of our family. Oh, and by the way, you’ll get to work with a diverse group of experts around the globe. You can expect a hard-working team of people who understand how to create meaningful connections and get great work done virtually - it’s in our nature!
What We Do:
Our department is responsible for the backbone of the entire development process and infrastructure. We design, implement, and maintain the processes, methodologies, and technologies that enable and support the development of Resident’s platform, upholding high standards of quality, performance, security, availability, and agility.
What You’ll Be Doing:
We are looking for a Site Reliability Engineer to join our DevOps team. You will ensure the reliability, performance, and scalability of our back-office solutions, which serve as the foundation for the entire purchasing process. In this role, you will lead the development of SRE capabilities, meet SLI/SLO/SLA targets, and establish effective monitoring systems. You will enhance our Software Development Lifecycle by building in reliability and scalability, working with cross-functional teams, and supporting production environments. Additionally, you will implement incident management processes and conduct post-mortem analyses to drive continuous improvement. If you have a strong engineering and automation background and are passionate about e-commerce, we would love to hear from you.
Roles and Responsibilities:
- Develop and implement SRE capabilities to enhance the reliability, availability, and performance of Admin solutions.
- Design and maintain proactive monitoring and alerting systems that provide deep visibility into critical business flows and identify functional issues, going beyond simple status checks.
- Drive improvements in the Software Development Lifecycle (SDLC) for reliability and scalability from design to deployment.
- Collaborate with development and operations teams to troubleshoot production incidents affecting the purchase flow, using root cause analysis.
- Lead SRE initiatives to boost system resilience and operational efficiency.
- Implement best practices for incident management, conduct blameless post-mortems, and contribute to capacity planning and performance testing to ensure scalability.
Qualifications:
- 5+ years of experience as a Site Reliability/DevOps Engineer
- Deep understanding of e-commerce flows, specifically back-office operations and order processing - must
- Experience as an Automation/Software Engineer with a strong understanding of software development principles and in building, testing, and deploying distributed systems - must
- Experience in designing, implementing, and utilizing monitoring and observability platforms such as DataDog, NewRelic, Prometheus/Grafana, or ELK stack - must
- Proficiency in scripting and automation using languages such as Python or Java - must
- Ability to create dashboards, alerts, and insightful queries - must
- Experience with AWS services to build and operate scalable and resilient applications (e.g., EC2, ECS/EKS, RDS, S3, Lambda, CloudWatch) - plus
- Experience in automating infrastructure provisioning, application deployments, and repetitive operational tasks - plus
- Proactive approach with excellent problem-solving skills
- Strong collaborator, with an ability to work with cross-functional teams
- Proficient in English