Site Reliability Engineer
Firmus Technologies
Firmus Technologies is a global leader pioneering the development and operation of efficient AI infrastructure across Asia Pacific.
Founded in Australia in 2019, our mission is to create the most efficient AI infrastructure by combining cutting-edge technology with a steadfast commitment to sustainability.
At Firmus, we are unique in our approach. We design, build, and operate a new class of digital infrastructure – the AI Factory. Through our model-to-grid technology approach, we have pushed the boundaries of multi-generational liquid cooling systems, energy management, AI software orchestration, and construction. For our customers, this approach allows us to make every watt count and deliver low-cost AI tokens globally.
Firmus AI Cloud
Our large-scale GPU cloud platform, Firmus AI Cloud, is purpose-built to deliver energy-efficient AI compute at scale to customers.
It empowers developers, enterprises, educational institutions, and government users to train and deploy AI models with unmatched efficiency and cost savings. With an ever-growing suite of services and applications, we are committed to delivering a cloud experience that is market-leading, proprietary, and built to scale.
Why you’ll love working here
Firmus is an NVIDIA Cloud and Engineering partner in Asia Pacific, so you will gain skills, experience, and exposure across the AI industry and help shape what this industry looks like for decades to come.
We are founder-led, not a big corporate. Decisions happen fast, our leaders are accessible, and there is minimal bureaucracy between you and the work.
Ownership comes early. Whatever your role, you will have a direct line to outcomes, helping shape how the business grows as we scale nationally across a long-term, large-scale roadmap.
Work alongside founders and experts in AI infrastructure, energy systems and next-generation compute.
What we build here has impact beyond the business. Our AI Factories are designed to operate as assets to the energy grid, actively strengthening the communities and regions they operate in rather than drawing from them.
Considering applying? You don't need a perfect background to join our team. If you're driven and curious, there's a path for you. We back our people to grow into new domains and take on challenges beyond their previous experience.
ROLE SUMMARY
Firmus Technologies is seeking a skilled Site Reliability Engineer to join our Operations team, supporting the daily operations and maintenance of our AI-accelerated high-performance computing (HPC) infrastructure. This role will work closely with Field Service Engineers, HPC and Network Engineering teams, and assist the Global Operations Centre (GOC). This is a unique opportunity to contribute directly to the stability and growth of cutting-edge AI infrastructure.
KEY RESPONSIBILITIES
- Support the deployment, configuration, and maintenance of high-end GPU servers, storage servers, networking equipment, and software components in highly secure environments.
- Perform hardware diagnostics, system functionality checks, and firmware updates as required.
- Collaborate with engineering teams to deploy tailored customer environments (e.g. bare-metal systems, HPC clusters, Kubernetes, Slurm).
- Serve as the first line of engineering support for onsite operational issues, including troubleshooting hardware, network, and software problems, and maintaining firmware compliance.
- Troubleshoot incidents, escalate critical issues and provide feedback to appropriate teams for improvements.
- Participate in an on-call rotation to ensure 24/7 availability and responsiveness to critical issues.
- Provide technical support to the GOC Support Specialist team in troubleshooting problems related to compute infrastructure.
- Document incident details, resolutions, and lessons learned to enhance future problem-solving.
- Maintain clear, accurate, and up-to-date documentation to promote effective knowledge sharing across the team.
- Communicate effectively with GOC, HPC Engineers, internal teams, stakeholders, and end-users to ensure alignment on issue resolution.
- Take part in team meetings and knowledge-sharing sessions to foster collaboration and continuous learning.
SKILLS AND EXPERIENCE
- Bachelor’s degree in computer engineering, computer science, or a related technical field.
- 5+ years of experience in technical field service roles.
- Strong understanding of server hardware technology, firmware lifecycle, Linux environments and troubleshooting hardware problems, with adherence to physical and system-level security standards.
- Experience with scripting languages (e.g. Bash, Python).
- Familiarity with configuration management and CI/CD tools, workload managers and cluster software (e.g. Slurm, Kubernetes, NVIDIA BCM), and observability tools (e.g. Prometheus, Grafana, ELK).
- Excellent problem-solving and analytical skills.
- Ability to work independently and as part of a team.
- Strong communication skills, both written and verbal.
Location
This role is based in Brooklyn, Melbourne, Australia.
Employment Basis
Full-time