
Infrastructure, Large-scale Training
About Hark
Hark is an artificial intelligence company building advanced, personalized intelligence: one that is proactive, multimodal, and capable of interacting with the world through speech, text, vision, and persistent memory.
We're pairing that intelligence with next-generation hardware to create a universal interface between humans and machines. While today's AI largely operates through chat boxes and decade-old devices, Hark is focused on what comes next: agentic systems that interact naturally with people and the real world.
To get there, we're developing multimodal models and next-generation AI hardware together, designed from the ground up as a single, unified interface for a new era of intelligent systems.
About the Role
We are looking for a Member of Technical Staff, Infrastructure Compute to lead and manage large-scale GPU computing clusters powering our AI training and deployment workloads. You'll work at the intersection of systems engineering and machine learning infrastructure, owning the reliability, scalability, and efficiency of the compute platform that our research and engineering teams depend on. This is a high-impact, highly technical role suited for someone who thrives in complex distributed systems environments and cares deeply about infrastructure as a product.
Responsibilities
- Design, implement, and maintain Infrastructure as Code (IaC) best practices to enable repeatable, auditable, and scalable cluster provisioning.
- Enhance and harden CI/CD deployment pipelines to ensure robust, secure, and low-latency model service delivery across production environments.
- Own and evolve stable training infrastructure operating at the scale of 10,000+ GPUs, including job scheduling, fault tolerance, and network fabric optimization.
- Partner closely with ML researchers and engineers to understand compute bottlenecks and translate them into infrastructure improvements.
- Monitor system health, define SLOs, and lead incident response for critical training and inference workloads.
- Drive capacity planning, cost efficiency initiatives, and hardware lifecycle management across the GPU fleet.
- Contribute to internal tooling and platform abstractions that improve developer experience for teams consuming compute resources.
Requirements
- 5+ years of experience in infrastructure, systems, or platform engineering, with at least 2 years working in ML or HPC environments.
- Demonstrated experience managing GPU clusters or large-scale distributed compute infrastructure.
- Strong proficiency in at least one systems or infrastructure programming language.
- Deep understanding of networking fundamentals relevant to high-throughput training workloads (RDMA, InfiniBand, or RoCE experience a plus).
- Experience with container orchestration, job scheduling, and multi-tenant resource management.
- Proven track record owning production systems with high reliability requirements.
- Strong debugging and observability skills across the full infrastructure stack.
Bonus Qualifications
- Kubernetes (K8s) — particularly experience operating large, GPU-aware clusters.
- Pulumi or similar modern IaC tooling.
- Rust and/or Go for systems-level tooling and performance-critical services.
- Familiarity with PyTorch and Ray for understanding workload patterns and integration requirements.
Compensation
The pay offered for this position may vary based on several individual factors, including job-related knowledge, skills, and experience. The total compensation package may also include additional components and benefits depending on the specific role. This information will be shared if an employment offer is extended.