Senior Infrastructure Engineer
Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.
Your Role:
We are seeking a Senior Infrastructure Engineer to design, deploy, and manage the physical infrastructure powering Moonlite's GPU clusters and high-performance computing environments. You will be responsible for building and operating scalable, reliable compute, storage, and networking infrastructure for AI training, inference, and research workloads. This role focuses on the hardware and provisioning layer (servers, GPUs, networking equipment, firmware, and bare-metal provisioning systems), ensuring our infrastructure is tuned for performance and reliability. You will partner closely with network engineers, systems engineers, and SREs to deliver robust infrastructure at scale.
Job Responsibilities
- Infrastructure Design & Deployment: Architect and implement GPU and compute infrastructure at the server, rack, and system level for AI workloads across co-located data center environments.
- Bare-Metal Provisioning: Deploy and manage bare-metal servers using provisioning tools like Canonical MAAS, building automated workflows for server lifecycle management from installation through decommissioning.
- Hardware & Firmware Management: Develop and maintain systems to monitor hardware health, manage firmware updates across compute/storage/network equipment, and automate recovery processes.
- GPU Operations: Troubleshoot GPU-related performance issues at the driver, kernel, or firmware level and optimize configurations for training and inference workloads.
- Infrastructure Automation: Build automation using Ansible, Terraform, and Python to eliminate manual provisioning, streamline patching processes, and enable scalable infrastructure operations.
- Performance Monitoring: Monitor system performance, identify bottlenecks in compute/storage/networking layers, and proactively address reliability or capacity issues.
- Cross-Team Collaboration: Work closely with network engineers, systems engineers, and SREs to ensure cohesive infrastructure operations and seamless integration with Kubernetes and platform orchestration layers.
- Vendor Management: Serve as the primary point of contact for hardware escalations, RMAs (Return Material Authorization), and vendor relationships for compute/storage/networking equipment.
Requirements
- Experience: 5+ years in infrastructure engineering, systems engineering, or hardware-focused roles, preferably with AI/HPC workloads.
- Linux Expertise: Strong background in Linux systems administration, performance tuning, and troubleshooting at the system level.
- Bare-Metal Provisioning: Hands-on experience with bare-metal provisioning tools (MAAS or similar) and automated deployment workflows.
- DCIM & Documentation: Familiarity with data center infrastructure management tools (NetBox, Device42, or similar) for asset tracking, network documentation, and maintaining infrastructure source of truth.
- Hardware & GPU Systems: Familiarity with server hardware, GPU configurations, drivers, and system-level performance optimization.
- Automation Skills: Proficiency with Ansible, Terraform, and scripting (Python, Bash) for infrastructure automation and operational efficiency.
- Infrastructure Operations: Experience deploying and maintaining physical infrastructure in production data center environments.
- Problem-Solving: Ability to troubleshoot complex hardware, firmware, and system issues under pressure.
- Collaboration: Comfortable working with cross-functional teams including network engineers, systems engineers, and platform developers to resolve infrastructure challenges.
Preferred Qualifications
- Experience with GPU workload orchestration platforms (Kubernetes, SLURM) and their infrastructure requirements.
- Familiarity with high-performance networking (InfiniBand, RDMA, RoCE) and spine-leaf network architectures.
- Experience with monitoring and observability tools (Prometheus, Grafana).
- Understanding of Kubernetes infrastructure requirements (compute, storage, and networking layers).
- Exposure to co-located data center operations or building infrastructure for regulated environments.
- Background supporting research institutions, HPC facilities, or enterprise AI infrastructure.
Key Technologies
- Linux, Canonical MAAS, NetBox, Terraform, Ansible, Python, NVIDIA GPU Drivers/Tools, High-Performance Networking, Enterprise Storage Systems, Prometheus, Grafana
Why Moonlite
- Build the Future of AI Infrastructure: Join a pioneering team shaping scalable solutions for the enterprise. Your work will directly impact the deployment and usability of AI at scale.
- Hands-On Ownership: As an early engineer, you’ll have end-to-end ownership of projects and the autonomy to influence our product and technology direction.
- Collaborate with Experts: Work alongside seasoned engineers and industry professionals passionate about high-performance computing, innovation, and problem-solving.
- Start-Up Agility with Industry Impact: Enjoy the dynamic, fast-paced environment of a startup while making an immediate impact in an evolving and critical technology space.
We offer a competitive total compensation package combining base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. We provide generous benefits, including a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.