Senior Site Reliability Engineer - Caffeine.ai

Zürich, Switzerland

At Caffeine.ai, we are building the world's first platform to create full-stack, on-chain applications through natural language. Our mission is to make building software as simple as a conversation, transforming ideas into live applications instantly. We are a cross-functional team of engineers and researchers building the AI that will power this new paradigm. To do this, we need to ensure the platform that performs this magic is exceptionally reliable, fast, and scalable.

About the Role

As a Senior Site Reliability Engineer, you will be the guardian of the Caffeine.ai user experience. You are not just keeping servers online; you are ensuring the end-to-end reliability of the core "idea-to-application" journey. Your focus will be on the availability, reliability, and scalability of our user-facing products and the complex AI-driven microservices that power them. You will be deeply embedded with our product and engineering teams, acting as the critical bridge between our ambitious AI vision and a rock-solid production reality.

This is a hands-on role for an engineer who thinks about reliability from the user's perspective, wants to provide the best developer experience for fellow engineers, and wants to solve novel challenges in a rapidly evolving AI/ML environment.

What You’ll Do

  • Own Product Reliability: Take ownership of the availability and reliability of the Caffeine.ai platform. You'll define our Service Level Objectives (SLOs), work across teams to meet and exceed them, and provide a reliable Continuous Delivery (CD) platform.
  • Build Deep Product Insight: Design, implement, and manage our observability stack (Datadog, OpenTelemetry, distributed tracing, logs, metrics) to provide high-fidelity signals into the health of our services and, most importantly, the user experience.
  • Engineer Scalable Solutions: Dive deep into our architecture to identify and eliminate performance bottlenecks, single points of failure, and sources of toil. You'll write code, primarily in Rust, Go, and TypeScript (we use Pulumi), to automate operations and build robust, self-healing systems. You will set up routing and service mesh configurations (e.g., Istio).
  • Champion Reliability from Day One: Partner with software engineers during design and code reviews to proactively bake in reliability, scalability, and operability. You will be the expert voice that helps the team build for production from the start.
  • Lead and Learn from Incidents: Coordinate the incident response process for our production services. You'll lead blameless post-mortems that drive meaningful improvements across our systems and processes.
  • Participate in an On-Call Rotation: As a key member of the team, you will be part of a compensated on-call rotation focused on coordinating incident response and ensuring platform stability.

Who You Are

  • You are a product-minded engineer with proven experience as a Site Reliability Engineer, with a strong focus on user-facing applications and distributed service architectures.
  • You have deep expertise in building and running modern observability stacks (e.g., Datadog, OpenTelemetry) and believe in data-driven decision-making.
  • You are a proficient software developer. You have experience designing and writing production-grade applications and automation, ideally in a systems and infrastructure language like Rust or Go, and are open to using Python, TypeScript, or Bash.
  • You are a methodical troubleshooter, capable of systematically diagnosing complex issues across the entire stack, from networking protocols (TCP/IP, DNS, TLS) up to the application layer.
  • You understand the complexities of modern CI/CD pipelines and have experience building and maintaining them.
  • You thrive in a collaborative environment and possess excellent communication skills, capable of explaining complex technical concepts to a diverse audience.

Bonus

  • You have experience with the reliability and performance challenges of AI/ML-powered systems or large-scale data processing pipelines.

*This is a hybrid role based in our Zürich office, with a requirement of 3+ days in the office per week.

About DFINITY and the Internet Computer:

DFINITY is a leading contributor to the Internet Computer Protocol (ICP), with a mission to bring the world's compute onto the secure ICP network. Built on its unique third-generation blockchain technology, ICP enables the development and operation of a new generation of unstoppable, tamper-proof, fully decentralized web applications. Its powerful technology can run entire AI models within smart contracts, representing a major advancement for secure AI. Through seamless integration with Bitcoin, Ethereum, and other networks, ICP facilitates multi-chain operations for digital assets and web3.

Join our team of over 250 talented individuals, including world-renowned cryptographers, distributed systems engineers, programming language experts, and industry leaders, who are shaping the future of the internet and web3.
 
DFINITY was founded in 2016 by entrepreneur and crypto theoretician Dominic Williams.

All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status.
