Senior/Staff DevOps Engineer

Remote

 

About Ethos

Ethos is on a mission to bridge the human readiness gap by transforming how training is developed, consumed, and aligned with strategic business outcomes. As a well-funded Series A startup ($40M+ raised), we are a trusted partner to over 150 enterprise customers across sectors including the U.S. military, life sciences, manufacturing, supply chain, and professional sports. We are expanding our engineering team to evolve a best-in-class platform that makes learning smarter, faster, and more effective.

 

About the Role 

You’ll play a critical role in optimizing the efficiency and effectiveness of our engineering organization. Reporting directly to the CTO, you will lead engineering operations, facilitate cross-functional collaboration, and ensure smooth execution of engineering initiatives in a high-growth startup environment. This is an opportunity to bring order and excellence to complex engineering processes, accelerate velocity and quality, and enable our engineers to do their best work.

What You’ll Do

  • Design & Operate the Platform: Run scrums, standups, retrospectives, and sprint planning in coordination with Product. Drive predictability, accountability, and throughput across teams.
  • CI/CD & Release Engineering: Oversee planning and tracking of engineering initiatives, ensuring roadmap alignment, clear ownership, and on-time delivery.
  • Observability & Reliability: Improve workflows, SDLC, and tooling to increase velocity, reduce cycle time, and raise quality.
  • Security & Compliance by Design: Own the release calendar, change management, and incident response (severity matrix, on-call, postmortems/PIRs).
  • Cost & Performance: Define and track engineering KPIs (e.g., DORA, SPACE, defect escape rate, roadmap hit rate). Build and maintain operational dashboards.
  • Technical Leadership: Act as the connective tissue between Engineering, Product, Security, Compliance, and customer-facing teams.
  • Gov/Constrained Deployments: Support recruiting, onboarding, team administration, and coordination across distributed teams.
  • (Staff) Strategy & Standards: Partner with Security to support SOC 2/ISO 27001 audit readiness, evidence collection, and SDLC controls.

Measures of Success (First 6–12 Months)

  • Availability & Reliability: Meet or exceed service SLOs; reduce MTTR by ≥30%.
  • Delivery Velocity: Increase deployment frequency by ≥2× while keeping change failure rate ≤15%.
  • Pipeline Efficiency: Cut CI pipeline duration by ≥25% and reduce flaky tests significantly.
  • Security Posture: Achieve ≥95% pass rate for supply-chain/security gates (image signing, SBOM scans, vulnerability thresholds); reduce mean time to remediate high-severity CVEs to ≤14 days.
  • Cost & Drift: Deliver ≥15% infra cost savings without performance regressions; keep infra drift near zero via GitOps and policy as code.
  • Gov/Offline Readiness: Stand up an artifact promotion flow (build → scan → sign → export) suitable for disconnected deployments with documented runbooks. 

30/60/90 Day Plan

First 30 Days — Learn & Map

  • Deep-dive on current cloud topology, CI/CD, observability, security controls, and on-call.
  • Inventory build and runtime artifacts; document deployment environments and promotion paths.
  • Baseline reliability and delivery metrics (SLOs, MTTR, deploy frequency, change failure rate, pipeline timing).

60 Days — Stabilize & Ship

  • Harden CI/CD: add SBOM generation, signing (e.g., Cosign/Sigstore), and policy gates.
  • Implement or refine infrastructure modules (Terraform) and Helm/Kustomize charts with GitOps flows.
  • Establish service SLOs and golden signals; wire alerts and dashboards for top services.
  • Pilot artifact export/import flow for air-gapped/disconnected deployments; write runbooks. 

90 Days — Scale & Standardize

  • Standardize CI/CD pipelines and infrastructure modules across existing services.
  • Migrate priority services to hardened delivery paths; deprecate legacy workflows.
  • Land cost/performance wins (e.g., autoscaling policies, instance/storage class right-sizing).
  • (Staff) Publish the Platform Strategy and roadmap for 12–18 months (tenancy model, environments, multi-region, AI workload guidance). 

Basic Qualifications 

  • 5+ years building and operating cloud platforms; 3+ years deploying SaaS in production.
  • Strong with Terraform, Helm/Kustomize, and containers (Docker, Kubernetes).
  • Deep AWS experience (e.g., VPC, EKS, EC2, S3, RDS, ECR, IAM/KMS, Route 53; CloudFront desirable).
  • CI/CD expertise (e.g., GitHub Actions, CircleCI, or Argo Workflows) and GitOps (Argo CD or Flux).
  • Observability across metrics, logs, and traces (e.g., Prometheus/Grafana, OpenTelemetry, ELK).
  • Proven track record in IaC, scalable system design, and quality tooling (automated tests, canaries/blue-green, feature flags).
  • Excellent communication; comfortable partnering with Product, Security, and Customer teams.
  • Thrives in a startup environment—ownership, autonomy, and pragmatic delivery.

Preferred Qualifications

  • Supply-chain security (SBOMs, SLSA concepts, image signing, provenance) and vulnerability management (e.g., Trivy/Grype, Snyk; Chainguard experience a plus).
  • Experience identifying/mitigating CVEs and setting policy thresholds.
  • Background with DoD/regulated customers; familiarity with IL-4/IL-5, Platform One patterns, and RMF documentation workflows.
  • Knowledge of STIG/CIS hardening, air-gapped architectures, and offline update mechanisms.
  • Experience with data/AI platforms (GPU scheduling, model artifact management, queuing/streaming, MCP tool integration) is a plus.
  • (Staff) Demonstrated influence at org level: setting standards, leading cross-team initiatives, and mentoring at scale. 

Tooling You Might Touch

We use some of these technologies, or close equivalents, to build our products:

Terraform modules; Helm/Kustomize; Kubernetes (EKS); GitHub Actions/Workflows; Argo CD/Flux; Docker/OCI; Prometheus/Grafana, Datadog, OpenTelemetry; Loki/ELK; LaunchDarkly/Flagsmith; Cosign/Sigstore, Trivy/Grype/Snyk; AWS (VPC, EKS, EC2, S3, RDS, ECR, IAM/KMS, Route 53, CloudFront); HashiCorp Vault/Parameter Store/Secrets Manager. 

Compensation & Benefits

  • Competitive base salary (Senior: $150k–$170k; Staff: $170k–$195k), based on location and experience, plus significant equity upside.
  • Subsidized health insurance, 401(k), life insurance, and cell phone stipend.
  • Remote-first culture with up to 10% travel for offsites.
  • Work eligibility: Applicants must be authorized to work in the U.S.

One Final Note

We’re committed to building a diverse, inclusive, and authentic workplace. If you’re excited about this role but your experience doesn’t perfectly align with every qualification, please apply—you may be just the right candidate.

EEO & accommodations: Ethos is an Equal Opportunity Employer. We welcome applicants of all backgrounds and provide reasonable accommodations throughout the hiring process. 
