Senior HPC Infrastructure Engineer

Australia or Singapore

Role Summary

Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.

You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.

Key Responsibilities

  • Design and implement bare-metal provisioning workflows using Ironic and Kubernetes CRDs.
  • Deploy and manage GPU-enabled AI compute nodes with RDMA, InfiniBand, and RoCE networking.
  • Optimise Kubernetes and Slurm platforms for multi-node AI training performance, including NCCL, UCX, GPUDirect, and fabric tuning.
  • Implement Kubernetes primitives for GPU scheduling, isolation, and resource management models.
  • Design, deploy, and fine-tune Slurm GPU clusters with topology-aware configurations.
  • Develop and execute performance benchmarking workloads, including MLPerf, NCCL tests, microbenchmarks, and throughput/latency validation.
  • Establish observability across GPU, InfiniBand fabric, storage, and provisioning components.
  • Document architecture designs, operational procedures, and performance results.
  • Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
  • Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
  • Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
  • Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise AI workload performance for large-scale GPU cluster commissioning.

Skills & Experience

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • Experience with bare-metal cluster provisioning using tools such as Metal3, OpenStack Ironic, MaaS, xCAT, or similar.
  • Deep knowledge of Kubernetes internals, including CRDs, controllers, operators, and cluster lifecycle management.
  • Strong understanding of Slurm configuration, and experience compiling AI and HPC applications.
  • Strong understanding of GPU systems (NVIDIA H100/H200 SXM platforms), CUDA/NCCL, and GPU topology (NVLink, NVSwitch, PCIe).
  • Familiarity with container runtimes for compute workloads, including Docker, Enroot, Singularity, and Podman.
  • Experience with benchmarking and performance validation for AI, HPC, or distributed training workloads.
  • Practical Linux systems engineering experience, including kernel, cgroups, system services, networking, and drivers.
  • Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.
  • Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.
  • Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.
  • Excellent documentation skills with strong attention to detail.
  • Experience participating in an on-call rotation supporting production services.
  • Proactive self-starter with a drive for continuous technical improvement.

Key Competencies

  • Systems Architecture: Ability to design and integrate bare-metal, GPU, RDMA, and Kubernetes/Slurm platforms.
  • Infrastructure Automation: Skilled in automated provisioning and lifecycle management of hardware and clusters.
  • GPU and HPC Performance: Understanding of GPU systems, RDMA fabrics, and distributed AI workload performance.
  • Technical Communication: Ability to communicate technical concepts effectively across diverse engineering and operations teams.
  • Continuous Improvement: Demonstrates curiosity, proactive learning, and innovation in AI and HPC infrastructure.

Success Metrics

  • Reliable provisioning of Kubernetes and Slurm AI clusters.
  • Performance validation and optimisation.
  • Improved operational efficiency.
  • High-quality documentation and effective knowledge transfer.

Location & Reporting

  • Singapore or Australia (Melbourne, VIC; Sydney, NSW; or Launceston, TAS)
  • Reporting to Senior Manager, Software Defined Infrastructure

Employment Basis

Full-time

Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.

Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.