Infrastructure Solutions Engineer
RunPod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full-stack AI applications. Founded in 2022, we are a rapidly growing, well-funded company with a remote-first, globally distributed team. Our mission is to empower innovators and enterprises to unlock AI's true potential, driving technology forward and transforming industries. Join us as we shape the future of AI.
As our organization continues its rapid expansion in managing large-scale, distributed systems, we are looking for an Infrastructure Solutions Engineer to join our team. This is a unique opportunity to work at the forefront of AI infrastructure, helping customers design and implement GPU-accelerated solutions that power the next wave of innovation. If you thrive on technical problem-solving, love working with cutting-edge infrastructure, and want to shape the future of cloud-native AI, this role is for you.
As an Infrastructure Solutions Engineer, you’ll play a critical role in helping our customers unlock the full potential of GPU-powered infrastructure. From architecting high-performance solutions for AI/ML workflows to optimizing large-scale deployments in AI datacenters and edge environments, you’ll be at the heart of the action. This is a hybrid role that blends customer interaction, technical consulting, and hands-on engineering. You’ll work with enterprise customers, research teams, and internal stakeholders to deliver infrastructure solutions that are fast, reliable, and scalable. If you’re looking for a role where you can have a tangible impact on groundbreaking AI applications, this is it.
Responsibilities:
- Design, build, and deploy GPU-centric infrastructure solutions that enable customers to accelerate their AI/ML workloads at scale.
- Act as a trusted advisor, helping customers architect high-performance compute environments in GPU datacenters, edge environments, and hybrid cloud scenarios.
- Lead technical onboarding for new customers, guiding them through best practices for managing and scaling GPU infrastructure.
- Develop scripts, tools, and automations that improve efficiency and simplify large-scale GPU deployments.
- Analyze and optimize infrastructure for performance, cost, and reliability — whether it’s for multi-cloud, on-premises AI datacenters, or hybrid models.
- Troubleshoot customer issues related to GPU provisioning, container orchestration, and AI workload performance.
- Work with Product, Sales, and Engineering teams to align customer needs with the development of new features and services.
Requirements:
- 2-4 years of experience with GPU cloud platforms or in infrastructure engineering, DevOps, or SRE roles.
- Experience with monitoring tools like DataDog, Grafana, or ELK (Elasticsearch, Logstash, Kibana) to support large-scale infrastructure environments.
- Hands-on experience with NVIDIA GPUs (A100, H100, or similar) and a strong understanding of how GPUs accelerate AI/ML workloads.
- Expertise with Kubernetes (K8s) and Docker for orchestrating containerized AI/ML workloads.
- Proficiency in Python, Bash, or Go for automation, tooling, and infrastructure management.
- Strong communication and interpersonal skills, with experience delivering technical solutions to both technical and non-technical stakeholders.
Preferred:
- Experience supporting AI/ML frameworks like TensorFlow, PyTorch, or JAX in a production environment.
- Familiarity with data center operations, including power, cooling, and rack deployment for GPU-heavy workloads.
- Proficiency with cloud platforms like AWS, GCP, or Azure, with an emphasis on GPU instances, hybrid/multi-cloud deployments, and AI datacenter operations.
What You’ll Receive:
- The competitive base pay for this position ranges from $100,000 to $160,000. Factors that may be used to determine your actual pay include your specific job-related knowledge, skills, and experience.
- Stock options
- The flexibility of remote work with an inclusive, collaborative team.
- An opportunity to grow with a company that values innovation and user-centric design.
- Generous vacation policy to ensure work-life harmony and well-being.
- The chance to contribute to a company with global impact, with teams based in the US, Canada, and Europe.
RunPod is committed to maintaining a workplace free from discrimination and upholding the principles of equality and respect for all individuals. We believe that diversity in all its forms enhances our team. As an equal opportunity employer, RunPod is committed to creating an inclusive workforce at every level. We evaluate qualified applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, marital status, protected veteran status, disability status, or any other characteristic protected by law.