Principal Platform Engineer
Who we are
Endor Labs is building the Application Security platform for the software development revolution. Modern software is complex and dependency-rich, making it increasingly difficult to pinpoint the risks that truly matter. Endor Labs solves this challenge by building a call graph of your entire software estate—enabling teams to clearly identify, prioritize, and fix critical risks faster.
Trusted by companies that are one or one hundred years old, Endor Labs secures code whether it was written by humans or AI, and whether it's 40-year-old C++ code or cutting-edge Bazel monorepos. Endor Labs was founded by serial entrepreneurs Varun Badhwar and Dimitri Stiliadis, and is backed by leading VC firms such as Dell Technologies Capital, Lightspeed, and Sierra Ventures.
How You'll Make an Impact
- Build Cloud Infrastructure at Scale: Design, deploy, and operate highly available infrastructure on Azure, Google Cloud, and AWS that processes billions of security events daily for Fortune 500 customers and startups alike. Own the systems that power agentless security scanning and real-time analysis at unprecedented scale.
- Master Kubernetes & CNCF Ecosystem: Leverage your deep expertise with Kubernetes, ArgoCD, and the CNCF landscape to build resilient, self-healing infrastructure. Architect multi-tenant clusters that serve diverse customer workloads while maintaining strict isolation and security boundaries.
- Scale Our Observability Platform: Lead the evolution of our open source observability stack built on Prometheus, Grafana, Mimir, and Pyroscope as we handle exponential growth. Design solutions that process billions of events monthly while maintaining query performance and managing costs efficiently.
- Transform Developer Experience: Build self-service tools and automation that make onboarding new services effortless. Create patterns and abstractions that let engineers ship features faster while maintaining infrastructure best practices—think golden paths, not golden cages.
- Drive Infrastructure as Code: Own large-scale Terraform/OpenTofu deployments that manage our entire cloud footprint. Build reusable modules, establish best practices, and enable other teams to provision infrastructure safely and efficiently through GitOps workflows.
- Solve Complex Technical Challenges: Tackle problems like zero-downtime migrations across cloud providers, cost optimization at scale (we've grown 8x while reducing per-transaction costs), and building ephemeral environments that accelerate development velocity.
- Collaborate Across Teams: Partner closely with Security, Backend, and Product Engineering teams to ensure our platform meets the demanding requirements of enterprise security workloads. Your infrastructure decisions directly impact our customers' ability to secure their software supply chains.
- Iterate and Innovate: Work in a fast-paced environment where you'll ship infrastructure improvements weekly. Balance speed with reliability, experiment with cutting-edge technologies, and continuously improve our platform's performance and cost efficiency.
What You Bring to the Table
- Battle-Tested SRE Experience: Bring 12+ years of Site Reliability Engineering or Platform Engineering experience, with a proven track record of building and operating production infrastructure at scale. You've been on call, resolved critical incidents, and know how to build systems that don't wake you up at 3 AM.
- Kubernetes Expert: Demonstrate deep, hands-on expertise with Kubernetes and the CNCF ecosystem in production environments. You understand controllers, operators, custom resources, and can debug complex networking issues across multi-cluster deployments. You've managed clusters serving hundreds of workloads and know the operational challenges intimately.
- Cloud Native Engineer: Show significant experience with at least one major cloud provider (Azure, Google Cloud, or AWS—in that order of preference). You understand cloud-native patterns, can design for high availability across regions, and know how to optimize for both performance and cost.
- Infrastructure as Code Practitioner: Possess strong experience managing large infrastructure deployments using Terraform, OpenTofu, or Terragrunt. You've built reusable modules, implemented state management at scale, and established patterns that enable teams to move fast without breaking things.
- Observability Advocate: Bring hands-on experience with open source observability tools, particularly Prometheus, Grafana, Mimir, and Pyroscope. You understand cardinality challenges, know how to design effective SLIs/SLOs, and can build dashboards that help teams operate complex distributed systems.
- Self-Driven Problem Solver: Demonstrate initiative in identifying problems before they become incidents. You don't wait for perfect requirements—you gather context, propose solutions, and drive projects to completion with minimal oversight.
- Customer-Focused Engineer: Understand that platform engineering is a service organization. You build tools and infrastructure that enable other teams to move faster, and you measure success by their productivity and satisfaction.
- Clear Communicator: Articulate complex technical concepts clearly to both technical and non-technical audiences. Document your decisions, share knowledge generously, and collaborate effectively across time zones.
Preferred Skills (These Will Make You Stand Out)
- Monorepo Experience: Familiarity with managing infrastructure for large monorepos, including build optimization, caching strategies, and developer experience improvements. Bonus points if you've worked with Bazel, Nx, or similar tooling.
- GitOps Practitioner: Hands-on experience implementing GitOps workflows with ArgoCD or Flux. You understand declarative infrastructure, sync waves, progressive delivery, and how to troubleshoot synchronization issues.
- Security Mindset: Experience building secure infrastructure, understanding compliance requirements (SOC 2, GDPR), and implementing security controls without sacrificing developer velocity.
- Performance Optimization: Track record of optimizing infrastructure costs and performance. You've reduced cloud bills by meaningful percentages while maintaining or improving service quality.
At Endor Labs, we:
- Go to extraordinary lengths to distinguish ourselves through world-class work.
- Prioritize quality over speed, and speed over scope.
- Value working with deeply kind, mission-driven people.
- Strive to make the complex simple.
- Use first principles to debate ideas, test assumptions, and make decisions.
- Seek the truth by putting data above opinions.
- Assume good intent and give tactical feedback to help each other get better.
- Hold no ego—when our customers win, we all win.