Member of Technical Staff - Platform (Deployment Infrastructure)
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All employees are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
ABOUT THE ROLE:
You will build the tooling that turns a hardware listing and a deployment profile into a complete, self-contained software bundle capable of standing up xAI's full AI inference platform — from bare metal provisioning through GPU workloads — at any site, in any environment, with no internet access required.
xAI operates GPU infrastructure across public cloud, on-premise, and classified environments. Today, these targets are served by separate codebases that drift with every release. You will build the unified deployment platform that eliminates this divergence: a single generator that reads a thin profile (site topology, compliance requirements, connectivity model) and produces everything needed to deploy — Kubernetes manifests, switch configurations, OS provisioning configs, monitoring stacks, signed container image bundles, and acceptance tests. One source, every target.
You work on the unclassified (low) side. You build the tooling; cleared engineers at classified sites execute it. The quality of what you build directly determines how effectively those engineers can operate in environments where they cannot call you for help. Your tooling must be deterministic, complete, well-tested, and foolproof.
RESPONSIBILITIES:
- Design and build the deployment generator: a Go CLI that reads a YAML profile (6 deployment axes + site topology) and produces a fully-resolved deployment manifest with pinned image digests, rendered Helm values, switch configs, OSP inventory, network telemetry configuration, and AlertManager grouping/inhibition rules computed from the site topology.
- Build the bundle pipeline: collect all referenced container images, Helm charts, OS boot images, NVIDIA drivers, and model weights into a signed, self-contained tarball with CycloneDX SBOM and cosign signatures. Build the update bundle pipeline for delta-only updates: diff against the previously shipped baseline manifest, package only changed artifacts, sign, and include apply-update scripts and machine-readable changelogs.
- Implement profile-driven rendering: the same model deployment YAML, operator charts, and monitoring stack must produce correct output for public cloud (ArgoCD), enterprise on-prem (Pulumi), and classified air-gap (static manifests) targets, depending on the selected profile.
- Build and maintain the cross-profile CI matrix: every PR touching shared platform code is validated against all active deployment profiles before merge, catching cross-profile breakage at PR time.
- Build the testing and validation framework: manifest validation against CRD schemas (kubeconform), profile-specific constraint checks (no external dependencies in air-gap profiles, FIPS requirements for gov profiles), acceptance test generation, and shadow cluster pre-transfer testing.
- Develop the actuator: the high-side executor that receives a signed bundle, verifies signatures, loads images into a local registry, and converges the cluster to the manifest state with zero-downtime updates and automatic rollback on failure.
- Build the CDS send-side pipeline: stage signed bundles for transfer through a one-way data diode or physical media, with machine-readable changelogs and verification tooling.
- Own the bare-metal provisioning pipeline: OSP integration, squashfs boot image builds, cloud-init template generation, and PXE server configuration, all computed from the profile's site topology.
- Own the monitoring stack migration: transition the on-prem K8s monitoring from Prometheus (kube-prometheus-stack) to VictoriaMetrics (VM Operator + VMSingle + VMAgent + VMAlert) to align with the public baseline and network telemetry stack. Ensure dashboards, alert rules, and ServiceMonitors work unchanged after migration.
- Integrate with the existing Supercompute team's operators, Helm charts, and CRDs — consuming their work as inputs to the generator without modifying production infrastructure.
- Maintain and extend the profile schema: validation rules, constraint cascades (e.g., IL6 classification automatically implies FIPS, HSM, Chainguard images), and extensibility for new deployment targets.
- Create and maintain platform documentation: generator usage guides, profile authoring guides, bundle pipeline documentation, and troubleshooting guides for the cleared engineers who run the tooling at sites you cannot access.
BASIC QUALIFICATIONS:
- 5+ years of experience in platform engineering, infrastructure tooling, or developer platforms, with a focus on building deployment systems rather than operating them.
- Strong proficiency in Go — the generator, actuator, and bundle tooling are Go.
- Deep experience with Kubernetes manifest generation and management: Helm, Kustomize, CRDs, and admission webhooks.
- Experience building CI/CD pipelines and deployment automation (Buildkite, GitHub Actions, or equivalent).
- Familiarity with container image management: OCI image format, registries, image signing (cosign/Sigstore), and SBOM generation (syft, CycloneDX).
- Experience with Infrastructure-as-Code (Pulumi or Terraform) and understanding of state management tradeoffs.
- Strong understanding of Linux systems: boot process, systemd, networking, disk partitioning.
- Excellent communication skills — you will work closely with cleared engineers who execute your tooling in environments you cannot access. Your documentation, error messages, and runbooks must be clear enough to debug remotely.
PREFERRED SKILLS AND EXPERIENCE:
- Experience building air-gapped or disconnected deployment tooling where all dependencies must be pre-staged.
- Familiarity with GPU infrastructure: NVIDIA drivers, CUDA, NCCL, InfiniBand/RoCE networking.
- Experience with PXE boot, cloud-init, squashfs, or other bare metal provisioning technologies.
- Familiarity with ArgoCD, Flux, or other GitOps systems — understanding how they work so the generator can produce compatible output.
- Experience with CUE, KCL, or other typed configuration languages for validation.
- Understanding of federal compliance requirements (FIPS 140-3, DISA STIGs) — not to implement them directly, but to build tooling that produces compliant output.
- Experience with cross-domain solutions (CDS) or secure artifact transfer pipelines.
COMPENSATION AND BENEFITS:
$180,000 - $440,000 USD
Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.
xAI is an equal opportunity employer. For details on data processing, view our Recruitment Privacy Notice.