Senior HPC Infrastructure Engineer
Role Summary
Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.
You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.
Key Responsibilities
- Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
- Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
- Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
- Implement Kubernetes primitives for GPU scheduling, isolation, and resource management.
- Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configurations.
- Develop and execute performance benchmarking workloads, including MLPerf, NCCL tests, microbenchmarks, and throughput/latency validation.
- Establish observability across GPU, InfiniBand fabric, storage, and provisioning components.
- Document architecture designs, operational procedures, and performance results.
- Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
- Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
- Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
- Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise AI workload performance for large-scale GPU cluster commissioning.
Skills & Experience
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Experience with bare-metal cluster provisioning using tools such as Metal3, OpenStack Ironic, MaaS, xCAT, or similar.
- Deep knowledge of Kubernetes internals, including CRDs, controllers, operators, and cluster lifecycle management.
- Strong understanding of Slurm configuration and of compiling AI and HPC applications.
- Strong understanding of GPU systems (NVIDIA H100/H200 SXM platforms), CUDA/NCCL, and GPU topology (NVLink, NVSwitch, PCIe).
- Familiarity with container runtimes for compute workloads, including Docker, Enroot, Singularity, and Podman.
- Experience with benchmarking and performance validation for AI, HPC, or distributed training workloads.
- Practical Linux systems engineering experience, including kernel, cgroups, system services, networking, and drivers.
- Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.
- Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.
- Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.
- Excellent documentation skills with strong attention to detail.
- Experience participating in an on-call rotation supporting production services.
- Proactive self-starter with a drive for continuous technical improvement.
Key Competencies
- Systems Architecture: Ability to design and integrate bare-metal, GPU, RDMA, and Kubernetes/Slurm platforms.
- Infrastructure Automation: Skilled in automated provisioning and lifecycle management of hardware and clusters.
- GPU and HPC Performance: Understanding of GPU systems, RDMA fabrics, and distributed AI workload performance.
- Technical Communication: Ability to communicate technical concepts effectively across diverse engineering and operations teams.
- Continuous Improvement: Demonstrates curiosity, proactive learning, and innovation in AI and HPC infrastructure.
Success Metrics
- Reliable provisioning of Kubernetes and Slurm AI clusters.
- Performance validation and optimisation.
- Improved operational efficiency.
- High-quality documentation and effective knowledge transfer.
Location & Reporting
- Singapore or Australia (Melbourne, VIC; Sydney, NSW; or Launceston, TAS)
- Reporting to Senior Manager, Software Defined Infrastructure
Employment Basis
Full-time
Diversity
At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.
Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.