Job Application for Senior Site Reliability Engineer (SRE) at Stacklok

At Stacklok, we’re an AI-first company led by Kubernetes co-founder Craig McLuckie, helping enterprise developers connect the data, systems, and services that power their businesses today with the agentic and assistive AI systems they’re building for tomorrow. We believe the shift from applications to agents is the next major evolution in software, and we’re building the foundation that helps teams make that leap with confidence.

Our open source platform, ToolHive, provides developers with a powerful yet simple way to securely connect AI systems to real-world environments, delivering the right context at the right time. It solves tough challenges like security, access control, and observability without adding friction to the developer experience. By using open protocols like MCP (Model Context Protocol) and a highly pluggable architecture, supported by a community-first development approach, ToolHive allows enterprises to run AI agents safely behind firewalls, with full control over data flow, context, and decision-making.

Location

This is a hybrid role that requires in-person work three days a week: Tuesday, Wednesday, and Thursday. We believe this approach balances flexibility with the value of in-person collaboration and community.

Our current office is located at:
3120 139th Avenue SE, Suite 500
Bellevue, WA 98005

Please note: we are planning to relocate to a more central location in the near future.

Elevator Pitch

Stacklok is seeking a Senior Site Reliability Engineer to design, build, and operate the infrastructure that powers our products and services. In this role, you’ll own key production systems, lead reliability-focused engineering efforts, and help deliver secure, scalable infrastructure for real-world AI use cases.

You’ll work hands-on with technologies like Kubernetes, Terraform, and ArgoCD to evolve cloud-native systems. You’ll automate deployments and incident response, enhance service health through telemetry and SLOs, and ensure our infrastructure can scale with product adoption.

We’re looking for an engineer who thrives in high-change environments, builds reliable and maintainable infrastructure, and applies strong technical judgment to complex operational challenges. You should be comfortable collaborating across teams, developing automation and internal tooling, and mentoring less experienced engineers.

If you're excited about reducing toil, scaling infrastructure through code, and making AI-powered systems dependable in production, we’d love to hear from you.

Success In The Role: 6-12 Month Expectations

Embedded in Team and Culture: Built strong, trust-based relationships across engineering, product, and design. Adapted quickly to team workflows, values, and collaboration norms. Contributed effectively to team goals with minimal oversight.
Product and Platform Fluency Demonstrated: Developed a deep understanding of Stacklok’s products, architecture, and strategy. Used this fluency to inform infrastructure decisions, collaborate effectively with product and engineering teams, and align platform work with near- and long-term goals.
Infrastructure Ecosystem Designed and Implemented: Led the design and setup of a scalable deployment ecosystem using Terraform and Kubernetes. Selected and configured tooling for observability, monitoring, and delivery. Embedded infrastructure security and operational best practices from the outset.
Automation and Reliability Improved: Delivered automation across provisioning, deployment, recovery, and operational workflows that significantly reduced manual effort and operational risk. Improved consistency, accelerated engineering velocity, and helped eliminate recurring sources of toil. Drove optimizations, including cloud cost reduction.
Operational Excellence Established: Defined and implemented meaningful SLOs and KPIs tied to service health and business goals. Designed and rolled out the team’s initial on-call and incident response processes. Contributed to shaping a strong culture of operational readiness and shared accountability.
Team Clarity and Production Knowledge Scaled: Produced high-quality documentation, system diagrams, and runbooks that improved team preparedness and visibility. Mentored peers in production ownership, tooling usage, and operational best practices. Helped foster a culture of shared responsibility and engineering excellence.

In This Role You Will:

Design and Operate Reliable Infrastructure: Contribute to the evolution of our infrastructure by designing and managing production systems that support multiple engineering teams. Continuously improve platform performance, availability, and operational robustness through well-engineered solutions.
Automate Operational Workflows: Apply an automation-first mindset to reduce manual processes in areas like provisioning, deployment, and incident response. Deliver resilient tooling and workflows that enable faster delivery and improve reliability.
Monitor and Improve Service Health: Define and maintain key metrics that reflect system performance and reliability. Use telemetry and observability tooling to proactively detect issues and drive systemic improvements.
Champion Operational Excellence: Establish and iterate on SLOs, incident response, and on-call practices that ensure reliable service delivery. Promote a culture of accountability, preparedness, and continuous improvement.
Mentor and Enable Engineering Teams: Share production knowledge, write and maintain high-quality runbooks and system documentation, and support engineers in adopting sound operational practices. Contribute to a healthy, inclusive engineering culture through mentorship and collaboration.

We Understand

We understand that not everyone will meet every requirement listed, and that’s perfectly okay! We encourage you to apply regardless of your self-assessment. We value a diverse range of skills and experiences and believe that your unique attributes can make a significant impact. We want to hear from you!

Desired Skills & Experience

Site Reliability Engineering: Strong foundation in SRE, with experience designing, operating, and scaling reliable production systems in fast-paced environments.
Programming: Proficient in applying fundamental programming principles to build reliable, maintainable automation, scripting, and internal tools. Experienced with languages such as Python, Go, Bash, or similar, with an emphasis on clear structure, testing, and operational reliability.
Infrastructure as Code (IaC): Deep experience with Terraform or similar tooling to provision, configure, and manage cloud environments using code-driven workflows.
Cloud-Native Operations: Hands-on experience with Kubernetes and Docker in production environments. Familiarity with autoscaling, recovery strategies, and cloud-native architecture patterns.
Cloud Provider Experience: Proficient with at least one major cloud provider (e.g., AWS, Azure, GCP). Experience with AWS is preferred.
GitOps and Deployment Tooling: Experience deploying to Kubernetes using GitOps practices. Familiarity with ArgoCD (preferred) or similar tools like Flux.
Incident Response Automation: Experience automating incident response workflows using tools such as PagerDuty to improve response times and operational consistency.
Observability and Monitoring: Proficient with log aggregation and telemetry tools such as AWS CloudWatch, Prometheus, Grafana, or similar, to support monitoring, performance tuning, and proactive issue detection.
Service Quality and Metrics: Experienced in defining and using SLOs and KPIs to guide reliability goals, improve service quality, and drive operational focus.
Operational Security Awareness: Familiar with operational and infrastructure security best practices, including secure software supply chain considerations.
Business-Aligned Impact: Track record of delivering technical solutions that drive measurable business outcomes. Applies engineering judgment with product and customer context in mind.
Collaboration and Communication: Strong written and verbal communication skills. Comfortable collaborating across technical and non-technical audiences, mentoring peers, and contributing to inclusive team culture.
Startup Agility and Versatility: Thrives in fast-moving, high-growth environments. Adaptable across responsibilities, self-directed, and proactive in driving clarity and execution.

Base Salary Range: $156,000 - $198,000

Why Join Us?

At Stacklok, we believe great technology is built by teams that support, challenge, and inspire one another. We are AI maximalists, confident in its potential and committed to ensuring it is used in ways that are safe and sustainable.

You will join a highly motivated, collaborative team with deep experience building some of the world’s most impactful technologies. We work in the open, side by side with the community, with strong roots in open source, cloud-native technologies, security, and developer tools.

We offer competitive compensation, equity, comprehensive healthcare, and a flexible work environment, including adaptable work hours and flexible PTO to support your success.

If you're excited about the future of AI, and want to build alongside people who care deeply about their craft, their community, and each other, we would love to hear from you.

Stacklok Inc. is proud to be an equal opportunity employer. We are committed to providing equal employment opportunities for all people and place great value on both diversity and inclusiveness. All qualified applicants will be considered for employment without regard to their, or any other person's, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law.
