Site Reliability Engineer - Automation Specialist
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
About the Role
As an SRE - Automation Specialist, you will focus on automating firmware upgrades, scripting solutions for hardware from key vendors like NVIDIA, Dell, Supermicro, and HP, and proactively identifying issues to implement automated fixes. Leveraging skills in Python, Bash, Linux, and Kubernetes, you will enhance datacenter efficiency, reduce manual interventions, and support scalable AI infrastructure at xAI.
Responsibilities
- Develop and maintain scripts in Python and Bash for handling firmware packages, performing upgrades, and automating the entire process across Linux and Kubernetes environments.
- Work with hardware from vendors such as NVIDIA, Dell, Supermicro, and HP to ensure seamless firmware integration, testing, and deployment in the datacenter.
- Identify operational problems in real-time, design automated fixes or workflows to resolve them, and implement scalable solutions to prevent recurrence.
- Collaborate with Datacenter Operations Technicians to deploy automation tools, troubleshoot firmware-related issues, and optimize processes for high-availability systems.
- Integrate automation scripts into CI/CD pipelines or orchestration tools like Kubernetes for efficient scaling and management.
- Monitor and refine automated processes, ensuring they align with datacenter reliability goals and minimize downtime.
- Document automation scripts, firmware upgrade procedures, and problem-solving approaches to build a reusable knowledge base for the team.
- Participate in on-call rotations and incident response, applying automation to accelerate resolutions in the Memphis datacenter.
Required Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
- 5+ years of experience in site reliability engineering or automation roles, preferably in datacenter or cloud environments.
- Proficiency in Python, Bash, Linux, and Kubernetes for scripting, automation, and orchestration.
- Hands-on experience with firmware packages, including writing scripts for upgrades and automating deployment processes.
- Familiarity with hardware from vendors like NVIDIA, Dell, Supermicro, and HP, including integration and troubleshooting in production settings.
- Strong problem-solving skills with a proven ability to identify issues and automate fixes to improve system efficiency.
- Experience in high-performance computing or AI infrastructure environments.
- Excellent collaboration skills for working with cross-functional teams in fast-paced settings.
Preferred Qualifications
- Experience automating firmware management in large-scale datacenters or supercomputing clusters.
- Knowledge of additional tools like Ansible, Terraform, ArgoCD or additional containerization tools for enhanced automation.
- Prior work in a startup or tech company like xAI, with contributions to scalable automation systems.
xAI is an equal opportunity employer.
Create a Job Alert
Interested in building your career at xAI? Get future opportunities sent straight to your email.
Apply for this job
*
indicates a required field
