Senior AI Infrastructure Engineer
Role Summary
Firmus is seeking a highly skilled and driven Senior Engineer to play a key role in designing, building, and operating software-defined infrastructure, including high-performance AI storage platforms. You will help evolve our Software Defined Infrastructure by building reliable, scalable solutions that power some of the world’s largest and most innovative AI workloads.
You will be instrumental in ensuring the stability, performance, and continuous improvement of our mission-critical control plane and storage infrastructure.
Key Responsibilities
- Design and implement a highly scalable, multi-tenant control plane that supports Firmus’ growing AI and infrastructure needs.
- Contribute to the development of exabyte-scale, S3-compatible object storage and distributed, high-performance file systems.
- Work with bare-metal provisioning tools such as Base Command Manager, Warewulf, Ironic, MaaS, and similar platforms.
- Apply a deep understanding of operating systems, computer networks, software-defined storage, and high-performance applications.
- Work with technologies including RDMA, GPU Direct Storage, RoCE, InfiniBand, DPDK, Ceph, Weka, DAOS, and others.
- Collaborate with operations teams to monitor, analyse, and optimise internal clusters and storage platforms.
- Document architecture designs, operational procedures, and performance results.
- Collaborate with L2 SRE engineers, site operations, and networking teams to ensure platform reliability, reproducibility, and performance.
- Contribute to continuous improvement in cluster validation, CI/CD automation, and provisioning and testing frameworks.
- Apply knowledge of Kubernetes and composable storage clusters.
- Contribute to the development of custom Kubernetes operators and intelligent orchestration frameworks to optimise AI workload performance for large-scale GPU cluster commissioning.
Skills & Experience
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 6–10 years of experience in infrastructure engineering and/or storage engineering.
- Hands-on experience with bare-metal provisioning.
- Ability to operate software-defined storage platforms such as Ceph, Weka, Vast Data, DAOS, or Lustre.
- Solid understanding of cloud-native infrastructure, Kubernetes, and scalable system architectures.
- Strong debugging and problem-solving skills in distributed, high-performance environments.
- Practical Linux systems engineering experience (kernel, cgroups, system services, networking, drivers).
- Strong automation mindset using tools such as Ansible, Helm, Terraform/OpenTofu, or equivalent.
- Understanding of firmware, BIOS, BMC/IPMI/Redfish, and low-level system tuning.
- Proficiency in one or more programming languages such as Go, Bash, Rust, or Python.
- Excellent documentation skills with strong attention to detail.
- Experience participating in an on-call rotation supporting production services.
- Proactive self-starter with a drive for continuous technical improvement.
Key Competencies
- Systems Architecture: Ability to design and integrate virtualisation, bare-metal, GPU, storage, and Kubernetes/Slurm platforms.
- Infrastructure Automation: Expertise in automated provisioning and lifecycle management of hardware and clusters.
- GPU and HPC Performance: Strong understanding of GPU systems, RDMA fabrics, and distributed AI workload performance.
- Technical Communication: Ability to communicate complex technical concepts effectively across engineering and operations teams.
- Continuous Improvement: Demonstrates curiosity, proactive learning, and innovation in AI and HPC infrastructure.
Success Metrics
- Reliable provisioning and benchmarking of scalable, high-performance storage systems.
- Performance validation and optimisation.
- Operational efficiency improvements.
- High-quality documentation and effective knowledge transfer.
Location & Reporting
- Singapore or Australia (Melbourne, VIC; Sydney, NSW; or Launceston, TAS)
- Reporting to Senior Manager, Software Defined Infrastructure
Employment Basis
Full-time
Diversity
At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.
Join us in our mission to revolutionise the AI industry through sustainable practices and cutting-edge engineering. Apply now to help shape the future of sustainable AI infrastructure.