Senior HPC Infrastructure Engineer (Compute System)
Role Summary
Firmus is seeking a highly skilled and driven Kubernetes HPC Engineer to join our Software Defined Infrastructure team. In this role, you will build high-performance, fault-tolerant, and reliable infrastructure to support bare-metal provisioning, performance benchmarking, and platform validation.
You will be instrumental in ensuring the stability, performance, and continuous improvement of our complex and mission-critical bare-metal HPC GPU clusters.
Key Responsibilities
- Own the end-to-end lifecycle of AI compute systems, including GPU compute, NVSwitch, and platform firmware (BIOS, GPU, NIC, and storage devices).
- Define, maintain, and enforce supported firmware and driver compatibility matrices across hardware generations, operating systems, kernels, and AI software stacks.
- Lead firmware qualification and regression testing to ensure updates do not introduce performance degradation, instability, or compatibility issues.
- Investigate and remediate performance regressions caused by firmware, driver, or system-level changes, working closely with networking, storage, and HPC engineers.
- Collaborate to integrate firmware and performance checks into SDI tooling, enabling automated validation during provisioning, upgrades, and cluster bring-ups.
- Produce clear technical documentation, including firmware standards, validation reports, and benchmarking results, to support operational consistency and informed decision-making.
- Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
- Support hardware bring-up activities, including BIOS tuning, GPU topology verification, NUMA alignment, and PCIe/NVLink checks.
- Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
- Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks that optimise the commissioning of large-scale AI GPU clusters.
Skills & Experience
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Experience with bare-metal cluster provisioning using tools such as Metal3, OpenStack Ironic, MaaS, xCAT, or similar.
- Hands-on expertise with platform firmware and low-level system components, including BIOS, BMC, GPU firmware, NIC firmware, and storage devices.
- Proven experience managing firmware and driver compatibility across operating systems, Linux kernels, and AI software stacks, with a disciplined approach to version control and validation.
- Solid understanding of GPU architecture and interconnects, including PCIe, NVLink, and GPU-to-GPU communication patterns.
- Demonstrated experience in performance benchmarking and validation using industry-standard and custom tools to measure GPU, compute, storage, and interconnect performance.
- Strong Linux systems knowledge, including kernel behaviour, driver management, performance tuning, and troubleshooting at the OS and hardware boundary.
- Experience diagnosing and resolving performance regressions related to firmware, drivers, or system-level changes in production or pre-production environments.
- Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.
- Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.
- Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.
- Excellent documentation skills with a high level of attention to detail.
- Experience participating in an on-call rotation supporting production services.
- Proactive self-starter with a drive for continuous technical improvement.
Key Competencies
- Ability to understand AI compute platforms as end-to-end systems spanning hardware, firmware, operating systems, drivers, and workloads.
- Ability to anticipate cross-layer impacts of changes and design solutions that optimise overall system performance and reliability.
- Ability to proactively identify risks related to firmware upgrades and ensure compatibility through structured validation and rollback strategies.
- Experience operating AI infrastructure at medium to large scale, with a focus on reliability, repeatability, and performance consistency.
- Strong sense of ownership and accountability for system performance and reliability.
- Comfortable operating in ambiguous, fast-evolving environments while driving continuous improvement.
Success Metrics
- Reliable, automated firmware validation and upgrade systems and processes.
- Performance validation and optimisation.
- Improved operational efficiency.
- High-quality documentation and effective knowledge transfer.
Location & Reporting
- Singapore or Australia (Melbourne, VIC; Sydney, NSW; or Launceston, TAS)
- Reporting to the Senior Manager, Software Defined Infrastructure
Employment Basis
Full-time
Diversity
At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.
Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.