Senior Software Engineer, Compute Platform
Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.
Your Role:
You will be instrumental in building out our GPU-accelerated compute platform that powers distributed AI training and inference, large-scale simulations, and computational research workloads. Working closely with product, your platform team members, and infrastructure specialists, you’ll design and implement the compute orchestration layer that manages GPU clusters, bare-metal provisioning, and resource scheduling, enabling researchers and engineers to programmatically access high-performance compute resources with cloud-like simplicity.
Job Responsibilities
- Compute Orchestration Systems: Design and build scalable compute orchestration platforms that manage GPU clusters, bare-metal server provisioning, and resource allocation across co-located infrastructure environments.
- Resource Management & Scheduling: Implement intelligent workload scheduling, resource allocation, and optimization algorithms that maximize GPU utilization while maintaining performance guarantees for research and training workloads.
- GPU Platform Engineering: Develop platform capabilities for managing latest-generation NVIDIA GPU configurations (H100, H200, B200, B300), including GPU resource management, multi-tenant isolation, and integration with compute orchestration systems.
- Bare-Metal Lifecycle Management: Build automation and tooling for complete bare-metal server lifecycle management – from initial provisioning and configuration through ongoing operations, updates, and resource reallocation.
- Performance-Critical Systems: Optimize compute platform components for high-throughput and low-latency performance, ensuring research workloads achieve near-bare-metal efficiency in virtualized or containerized environments.
- Platform APIs & Integration: Develop robust APIs and SDKs that enable researchers to programmatically provision and manage compute resources, integrating seamlessly with existing workflows and research infrastructure.
- Observability & Monitoring: Implement comprehensive monitoring and telemetry systems for compute resources, providing visibility into GPU utilization, workload performance, and infrastructure health.
- Multi-Tenancy and Isolation: Build enterprise-grade multi-tenant compute isolation, security boundaries, and resource quotas that enable safe sharing of GPU infrastructure across teams and organizations.
Requirements
- Experience: 5+ years in software engineering with proven experience building compute platforms, container orchestration systems, or distributed compute infrastructure for production environments.
- Compute Platform Engineering: Strong background in building compute orchestration, resource scheduling, or workload management systems at scale.
- Programming Skills: Expert-level Python proficiency. Experience with C/C++, Go, or Rust for performance-critical components is highly valued.
- Linux & Systems Programming: Strong experience with Linux in production environments, including systems programming, performance optimization, and low-level resource management.
- Virtualization & Containers: Deep knowledge of virtualization technologies (KVM, Xen), container runtimes, and orchestration platforms.
- GPU Computing Fundamentals: Understanding of GPU architectures, CUDA programming (where/when needed), and GPU resource management – or a strong ability to learn quickly.
- Bare-Metal Infrastructure: Experience with bare-metal provisioning, out-of-band management systems, and hardware abstraction layers.
- Problem-Solving & Architecture: Demonstrated ability to solve complex performance and scalability challenges while balancing pragmatic shipping with good long-term architecture.
- Autonomy & Communication: Comfortable navigating ambiguity, defining requirements collaboratively, and communicating technical decisions through clear documentation.
- Commitment to Growth: Growth mindset with continuous focus on learning and professional development.
Preferred Qualifications
- Experience with GPU virtualization technologies (SR-IOV, NVIDIA vGPU) and multi-tenant GPU sharing
- Background in container orchestration platforms with custom scheduling or resource management
- Knowledge of high-performance networking for GPU communication (InfiniBand, RDMA, NVLink, NVSwitch)
- Familiarity with AI/ML training frameworks (PyTorch, TensorFlow) and their infrastructure requirements
- Understanding of distributed training patterns and multi-node GPU coordination
- Experience building infrastructure for research institutions, labs, or technical computing environments
- Background in financial services or other regulated industry infrastructure is a plus
Key Technologies
- Python, C/C++, Go, KVM, Docker, Kubernetes, NVIDIA GPUDirect, SR-IOV, NVIDIA vGPU, CUDA, InfiniBand, RDMA, Terraform, FastAPI, gRPC, Linux systems programming
Why Moonlite
- Build Next-Generation Infrastructure: Your work will create the platform foundation that enables financial institutions to harness AI capabilities previously impossible with traditional infrastructure.
- Hands-On Ownership: As an early engineer, you’ll have end-to-end ownership of projects and the autonomy to influence our product and technology direction.
- Shape Industry Standards: Contribute to defining how enterprise AI infrastructure should work for the most demanding regulated environments.
- Collaborate with Experts: Work alongside seasoned engineers and industry professionals passionate about high-performance computing, innovation, and problem-solving.
- Start-Up Agility with Industry Impact: Enjoy the dynamic, fast-paced environment of a startup while making an immediate impact in an evolving and critical technology space.
We offer a competitive total compensation package combining base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. We provide generous benefits, including a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.