Senior Infrastructure Engineer

Chicago, IL or Remote

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.

Your Role:

We are seeking a Senior Infrastructure Engineer to design, deploy, and manage the physical infrastructure behind Moonlite's GPU clusters and high-performance computing environments. You will be responsible for building and operating scalable, reliable compute, storage, and networking infrastructure that powers AI training, inference, and research workloads. This role focuses on the hardware and provisioning layer (servers, GPUs, networking equipment, firmware, and bare-metal provisioning systems), ensuring our infrastructure is tuned for performance and reliability. You will partner closely with network engineers, systems engineers, and SREs to deliver robust infrastructure at scale.

Job Responsibilities

  • Infrastructure Design & Deployment: Architect and implement GPU and compute infrastructure at the server, rack, and system level for AI workloads across co-located data center environments.
  • Bare-Metal Provisioning: Deploy and manage bare-metal servers using provisioning tools like Canonical MAAS, building automated workflows for server lifecycle management from installation through decommissioning.
  • Hardware & Firmware Management: Develop and maintain systems to monitor hardware health, manage firmware updates across compute/storage/network equipment, and automate recovery processes.
  • GPU Operations: Troubleshoot GPU-related performance issues at the driver, kernel, or firmware level and optimize configurations for training and inference workloads.
  • Infrastructure Automation: Build automation using Ansible, Terraform, and Python to eliminate manual provisioning, streamline patching processes, and enable scalable infrastructure operations.
  • Performance Monitoring: Monitor system performance, identify bottlenecks in compute/storage/networking layers, and proactively address reliability or capacity issues.
  • Cross-Team Collaboration: Work closely with network engineers, systems engineers, and SREs to ensure cohesive infrastructure operations and seamless integration with Kubernetes and platform orchestration layers.
  • Vendor Management: Serve as the primary point of contact for hardware escalations, RMAs (Return Material Authorizations), and vendor relationships for compute/storage/networking equipment.

Requirements

  • Experience: 5+ years in infrastructure engineering, systems engineering, or hardware-focused roles, preferably with AI/HPC workloads.
  • Linux Expertise: Strong background in Linux systems administration, performance tuning, and troubleshooting at the system level.
  • Bare-Metal Provisioning: Hands-on experience with bare-metal provisioning tools (MAAS or similar) and automated deployment workflows.
  • DCIM & Documentation: Familiarity with data center infrastructure management tools (NetBox, Device42, or similar) for asset tracking, network documentation, and maintaining infrastructure source of truth.
  • Hardware & GPU Systems: Familiarity with server hardware, GPU configurations, drivers, and system-level performance optimization.
  • Automation Skills: Proficiency with Ansible, Terraform, and scripting (Python, Bash) for infrastructure automation and operational efficiency.
  • Infrastructure Operations: Experience deploying and maintaining physical infrastructure in production data center environments.
  • Problem-Solving: Ability to troubleshoot complex hardware, firmware, and system issues under pressure.
  • Collaboration: Comfortable working with cross-functional teams including network engineers, systems engineers, and platform developers to resolve infrastructure challenges.

Preferred Qualifications

  • Experience with GPU workload orchestration platforms (Kubernetes, SLURM) and their infrastructure requirements.
  • Familiarity with high-performance networking (InfiniBand, RDMA, RoCE) and spine-leaf network architectures.
  • Experience with monitoring and observability tools (Prometheus, Grafana).
  • Understanding of Kubernetes infrastructure requirements (compute, storage, networking layer).
  • Exposure to co-located data center operations or building infrastructure for regulated environments.
  • Background supporting research institutions, HPC facilities, or enterprise AI infrastructure.

Key Technologies

  • Linux, Canonical MAAS, NetBox, Terraform, Ansible, Python, NVIDIA GPU Drivers/Tools, High-Performance Networking, Enterprise Storage Systems, Prometheus, Grafana

Why Moonlite

  • Build the Future of AI Infrastructure: Join a pioneering team shaping scalable solutions for the enterprise. Your work will directly impact the deployment and usability of AI at scale.
  • Hands-On Ownership: As an early engineer, you’ll have end-to-end ownership of projects and the autonomy to influence our product and technology direction.
  • Collaborate with Experts: Work alongside seasoned engineers and industry professionals passionate about high-performance computing, innovation, and problem-solving.
  • Start-Up Agility with Industry Impact: Enjoy the dynamic, fast-paced environment of a startup while making an immediate impact in an evolving and critical technology space.

We offer a competitive total compensation package that combines base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. We provide generous benefits, including a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.
