
Senior Staff Site Reliability Engineer, Monitoring & Observability

Dublin, Ireland

About Us

At Udemy, we’re on a mission to improve lives through the power of learning. We’re a leading global learning company and one of the world’s largest education platforms, with more than 67 million learners. Our goal is to provide flexible, effective skill development to empower organizations and individuals. 

Talented people are everywhere, and the right opportunity can be hard to come by. That’s why we’re focused on revolutionizing learning, using our skills and expertise to help others develop theirs and reach their full potential. Individually, we bring our unique perspective to reimagine the way we share knowledge. Together, we can improve lives by making learning more accessible for our learners, our instructors, and businesses around the world.

Hybrid work

Udemy is headquartered in San Francisco, with offices in Australia, India, Ireland, Türkiye, and elsewhere in the US. Our robust hybrid work model spans San Francisco, Denver, Ankara, Dublin, and Melbourne. This hybrid position requires two days per week in the office at the nearest hub. Learn more about us on our company page.

About you 

You are a motivated, meticulous engineer with a team-oriented approach and exceptional problem-solving skills. You are organized and proactive, and you take the initiative to prioritize your own work and projects effectively.

You thrive in a collaborative environment and are eager to work with and learn alongside the best in Product, Design, and Engineering.

At Udemy, we value individuals who thrive in the face of complexity and love to turn challenges into solutions. As a Monitoring & Observability Engineer, you'll be a key player in building and evolving our systems. You know that complex systems are hard to measure and monitor, but you're driven to tackle these challenges head-on.

You have deep expertise in microservices and are passionate about optimizing the way we monitor, measure, and instrument them. User experience is at the heart of your work, and you're always thinking about how our metrics impact the way people interact with our systems. Linux is your natural environment, and you aren't afraid to dive deep into troubleshooting application, system, and network issues. You've worked with industry-leading monitoring tools like Datadog, New Relic, and Honeycomb, and you're always eager to refine your skills and learn new ones.

Above all, you're a strong communicator in English and excel at collaborating with engineers and teams across the organization.

We care less about your formal education or mathematical expertise and more about your hands-on experience and your passion for monitoring. If you're obsessed with building observability systems, automating repetitive tasks, and driving improvements across the board, we want you on our team.

Here’s what you will be doing:

  • Leading the evolution of our monitoring and observability strategy, making it a core pillar of how we work
  • Partnering with engineering teams to enhance the visibility and reliability of our systems, ensuring that we build for long-term success
  • Driving the standardization of SLIs and SLOs across all engineering teams, aligning on best practices
  • Owning and optimizing our current monitoring systems, including Datadog, Sentry, and other key tools
  • Collaborating with teams to proactively improve site availability, ensuring a seamless user experience
  • Leading incident analysis while fostering a blameless culture, ensuring that we learn from challenges and improve
  • Promoting best practices for on-call and incident management, ensuring teams are always prepared and resilient
  • Continuously improving developer happiness and productivity by automating manual tasks and creating processes that prevent surprises

About your skills:

  • 3+ years of experience managing complex monitoring systems like Datadog, Honeycomb, or New Relic
  • Proficiency in programming languages such as Go (preferred), Python, Bash, or Java
  • Experience with incident management tools and processes, with at least 3 years of on-call experience
  • Hands-on experience with paging tools and incident response frameworks
  • Solid understanding of Terraform, Kubernetes (K8s), and AWS for deployment and management
  • A knack for problem-solving, with the ability to think creatively and work collaboratively with peers
  • Excellent communication skills and a desire to continuously learn and grow within a fast-paced environment

Demographic Questions

Voluntary Self-Identification

To support our inclusive recruiting process and for reporting purposes, we welcome you to participate in the self-identification survey. This survey is confidential, voluntary and anonymous. 

We believe everyone has something special to give – their authenticity, empathy, unique backgrounds. At Udemy, we make a promise to each other to respect that and be kind. And because we believe the best ideas are born as a result of people from all walks of life coming together, we work hard to create an inclusive space for all.

As part of Udemy’s Equal Employment Opportunity policy, we don’t discriminate based on any protected group status under any applicable law. So rest assured, whatever your decision, the survey will not be considered in the hiring process or thereafter.

Information regarding data privacy is available within the Udemy Careers Privacy Notice.
