Data Center Systems Engineer
Join Sustainable Talent as a Senior Engineering Technician (Data Center Systems Engineer) supporting NVIDIA and their Colossus quality assurance labs! This is a W-2 full-time contract with openings in Santa Clara, CA. We offer competitive pay of $40-60/hour, based on factors such as experience, education, and location, and provide full benefits, PTO, and an amazing company culture!
Join an innovative R&D lab where you'll be at the forefront of system-level deployment and maintenance for cutting-edge NVIDIA hardware and infrastructure. This role provides a dynamic and fast-paced environment where you’ll collaborate with cross-functional engineering teams to support the full lifecycle of enterprise-level hardware, including system setup, configuration, troubleshooting, and documentation. This is an ideal position for someone with hands-on experience in data center environments who enjoys problem-solving, maintaining high system uptime, and working with state-of-the-art tools and platforms.
Key Responsibilities:
- Set up, configure, and maintain hardware systems, focusing on clusters, fiber, switches, and configurations within data center environments. Ensure that all systems are correctly configured and optimized for performance.
- Analyze and resolve system-level issues, utilizing logs to understand and address underlying problems. Work with command-line tools to perform root cause analysis and modify existing scripts (primarily Bash and Python) to facilitate troubleshooting.
- Partner closely with infrastructure engineers, rack and stack teams, and other internal stakeholders to ensure smooth deployment, execution, and feedback loops. Support materials management and cluster builds as part of your daily tasks.
- Use tools like Ansible, BCM (Board Command Management), and Jenkins to manage configurations, automate routine tasks, and ensure system stability. Leverage templates and existing scripts to modify and enhance current workflows without extensive new scripting.
- Utilize DCIM tools, specifically Netbox, to manage and track lab resources, maintain accurate documentation, and ensure efficient asset management.
- Provide real-time feedback on system health, deployment status, and lab performance metrics to relevant stakeholders. Ensure that potential risks and technical issues are promptly communicated to maintain high operational standards.
Required Qualifications:
- Experience: 4+ years in a lab or data center environment, with a strong focus on enterprise systems (clusters, fiber, switches, configurations) and system-level troubleshooting.
- Technical Proficiency:
  - Command Line: Proficient with command-line interfaces and able to execute troubleshooting commands as needed.
  - Scripting: Able to read and modify Bash and Python scripts; no need to develop from scratch, as most workflows utilize templates.
  - Tools: Experience with Ansible, BCM (or similar command management tools), and Jenkins. Proficiency with DCIM tools like Netbox is essential.
- Familiar with Unix/Windows environments, enterprise networking, and concepts related to hardware/software layers.
- Strong ability to analyze issues, read system logs, and perform triage. Demonstrates persistence and creativity in resolving complex issues.
- Excellent communication skills and a proven track record of working effectively in cross-functional teams. Ability to work independently while knowing when to seek input from colleagues.
Ways to stand out:
- Familiarity with water-cooling systems and BCM (Board Command Management).
- Experience with rack and stack processes, particularly handling PDUs and power in data centers.
- Basic knowledge of Git/Perforce for version control.
Sustainable Talent is an M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.