Senior HPC Infrastructure Engineer (Compute System)

Australia or Singapore

Role Summary

Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.

You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.

 

Key Responsibilities

  • Own the end-to-end lifecycle of AI compute systems, including GPU compute, NVSwitch, and platform firmware (BIOS, GPU, NIC, and storage devices).
  • Define, maintain, and enforce supported firmware and driver compatibility matrices across hardware generations, operating systems, kernels, and AI software stacks.
  • Lead firmware qualification and regression testing to ensure updates do not introduce performance degradation, instability, or compatibility issues.
  • Investigate and remediate performance regressions caused by firmware, driver, or system-level changes, working closely with networking, storage, and HPC engineers.
  • Collaborate to integrate firmware and performance checks into SDI tooling, enabling automated validation during provisioning, upgrades, and cluster bring-ups.
  • Produce clear technical documentation, including firmware standards, validation reports, and benchmarking results, to support operational consistency and informed decision-making.
  • Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
  • Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
  • Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
  • Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that streamline the commissioning of large-scale GPU clusters for AI workloads.

 

Skills & Experience

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • Experience with bare-metal cluster provisioning using tools such as Metal3, OpenStack Ironic, MaaS, xCAT, or similar.
  • Hands-on expertise with platform firmware and low-level system components, including BIOS, BMC, GPU firmware, NIC firmware, and storage devices.
  • Proven experience managing firmware and driver compatibility across operating systems, Linux kernels, and AI software stacks, with a disciplined approach to version control and validation.
  • Solid understanding of GPU architecture and interconnects, including PCIe, NVLink, and GPU-to-GPU communication patterns.
  • Demonstrated experience in performance benchmarking and validation using industry-standard and custom tools to measure GPU, compute, storage, and interconnect performance.
  • Strong Linux systems knowledge, including kernel behaviour, driver management, performance tuning, and troubleshooting at the OS and hardware boundary.
  • Experience diagnosing and resolving performance regressions related to firmware, drivers, or system-level changes in production or pre-production environments.
  • Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.
  • Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.
  • Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.
  • Excellent documentation skills with a high level of attention to detail.
  • Experience participating in an on-call rotation supporting production services.
  • Proactive self-starter with a drive for continuous technical improvement.

 

Key Competencies

  • Ability to understand AI compute platforms as end-to-end systems spanning hardware, firmware, operating systems, drivers, and workloads.
  • Ability to anticipate cross-layer impacts of changes and design solutions that optimise overall system performance and reliability.
  • Ability to proactively identify risks related to firmware upgrades and ensure compatibility through structured validation and rollback strategies.
  • Experience operating AI infrastructure at medium to large scale, with a focus on reliability, repeatability, and performance consistency.
  • Strong sense of ownership and accountability for system performance and reliability.
  • Comfortable operating in ambiguous, fast-evolving environments while driving continuous improvement.

 

Success Metrics

  • Reliable, automated firmware validation and upgrade systems and processes.
  • Performance validation and optimisation.
  • Improved operational efficiency.
  • High-quality documentation and effective knowledge transfer.

 

Location & Reporting

  • Singapore or Australia (Melbourne, VIC; Sydney, NSW; or Launceston, TAS)
  • Reporting to Senior Manager, Software Defined Infrastructure

 

Employment Basis

Full-time

 

Diversity

At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.

Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.
