
HPC Specialist

Montreal

DRW is a diversified trading firm with over three decades of experience bringing sophisticated technology and exceptional people together to operate in markets around the world. We value autonomy and the ability to quickly pivot to capture opportunities, so we operate using our own capital, trading at our own risk.

Headquartered in Chicago with offices throughout the U.S., Canada, Europe, and Asia, we trade a variety of asset classes including Fixed Income, ETFs, Equities, FX, Commodities and Energy across all major global markets. We have also leveraged our expertise and technology to expand into three non-traditional strategies: real estate, venture capital and cryptoassets.

We operate with respect, curiosity and open minds. The people who thrive here share our belief that it's not just what we do that matters, it's how we do it. DRW is a place of high expectations, integrity, innovation and a willingness to challenge consensus.

We are looking for an HPC Specialist to join our AI and Multi Asset Systematic Strategies team. This team builds and operates GPU infrastructure that powers AI and ML workloads. You'll work on the infrastructure stack from bare metal to model serving, combining systems engineering, performance optimization, and infrastructure automation to solve complex problems at the intersection of hardware, networking, and distributed systems.

Responsibilities:

  • Deploy, maintain, and optimize GPU infrastructure for large-scale LLM inference workloads, including provisioning, configuration, and deployment of GPU server fleets.
  • Architect and implement distributed serving solutions for multi-node, multi-GPU model deployments.
  • Manage GPU-enabled Kubernetes clusters for LLM and ML workloads.
  • Configure network infrastructure including load balancers, firewalls, and inter-node communication for GPU clusters.
  • Implement and optimize storage solutions for model weights and inference caches.
  • Troubleshoot performance bottlenecks across the stack: hardware, drivers, networking, and application layer.
  • Research and evaluate emerging GPU technologies, model serving frameworks, and infrastructure optimizations.
  • Collaborate with ML engineers to profile model performance and implement inference acceleration techniques.
  • Drive reliability improvements through monitoring, alerting, capacity planning, and incident response.

Requirements:

  • Bachelor's or Master's degree in Computer Science, Systems Engineering, or related field.
  • 5+ years in DevOps, SRE, or infrastructure engineering roles.
  • Strong experience with GPU infrastructure, model serving frameworks (vLLM, SGLang), and GPU driver management.
  • Hands-on experience optimizing deep learning workloads (inference or training) on GPU clusters.
  • Deep Linux systems knowledge including network configuration, storage optimization, and Kubernetes orchestration.
  • Experience with infrastructure as code tools (Ansible, Terraform, or similar).
  • Strong understanding of distributed systems, networking protocols (TCP/IP, HTTP/2), and load balancing.
  • Proficiency in Python and Bash scripting for automation.
  • Experience with monitoring and observability tools (Prometheus, Grafana, or similar).

For more information about DRW's processing activities and our use of job applicants' data, please view our Privacy Notice at https://drw.com/privacy-notice.

California residents, please review the California Privacy Notice for information about certain legal rights at https://drw.com/california-privacy-notice.

