Network Development Engineer (Ops&Deploy)
About xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
About the Role
xAI is building at a furious pace with the latest hardware to help people understand the universe. We are looking for Network Development Engineers (NDEs) with at least 3 years of experience deploying or operating large-scale production Data Center or Backbone networks.
You will own the availability and/or the deployment of production networks for 𝕏 and xAI, including Data Center and Backbone networks and the primary front-end and back-end networks that train Grok and that our customers use for inference. Deployment Engineers will own all aspects of planning and building greenfield and brownfield network deployments. Operations Engineers will own timely mitigation of network impairments across all layers of our network and the return to service of network hardware and capacity. You will be expected to participate in a team on-call rotation and to contribute to scaling and maintenance efforts.
Responsibilities
- Deploying or operating scalable network architectures for AI/HPC workloads, inter-DC and Backbone network fabrics.
- Being a power user of, and iterating on, the software and tooling for network operations, network deployment and monitoring.
- Collaborating with cross-functional teams on data center & backbone buildouts and optimizations.
- Analyzing performance and availability metrics to identify and resolve bottlenecks, availability impairments or inefficient build processes.
- Ensuring high availability, fast deployability and high security of production networks.
Required Qualifications
- A minimum of 3 years deploying or operating hyperscale networks.
- Hands-on experience with networking protocols and tools (e.g., BGP, OSPF, ZTP).
- Experience with Python scripting for automating tasks, acquiring metrics, and analyzing large data sets.
- Strong problem-solving skills and ability to thrive in a fast-paced, ambiguous setting.
- Bachelor's degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience).
Preferred Qualifications
- Experience designing hyperscale network infrastructure or large-scale GPU clusters and automating their entire deployment process.
- Proven track record in leading on-call rotations, incident response, and team development in high-stakes environments.
- A working understanding of RoCEv2.
Interview Process
After you submit your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to an initial interview (45 minutes to 1 hour) during which a member of our team will ask some basic questions. If you clear the initial interview, you will enter the main process, which consists of four interviews:
- Coding interview.
- Network engineering technologies interview.
- Manager interview.
- Meet and greet with the team, including a presentation of a large-scale solution or problem you owned from start to finish.
Our goal is to finish the main process within one week. We don’t rely on recruiters for assessments. Every application is reviewed by a member of our technical team. All interviews will be conducted via Google Meet or in person.
xAI is an equal opportunity employer.
