
Member of Technical Staff - Training Cluster Engineer

Freiburg (Germany), San Francisco (USA)

What if the difference between a research breakthrough and a failed experiment is whether your GPUs are actually doing what you think they're doing?

We're the ~50-person team behind Stable Diffusion, Stable Video Diffusion, and FLUX.1—models with 400M+ downloads. But here's the reality: those models only exist because someone kept thousands of GPUs running smoothly for weeks at a time. Training runs fail. Nodes go dark. Networks saturate. Your job is to make sure that doesn't stop us from pushing the frontier.

What You'll Pioneer

You'll build and maintain the computational infrastructure that makes frontier AI research possible. This isn't abstract systems work—every decision you make directly impacts whether a multi-week training run succeeds or fails, whether researchers iterate quickly or wait hours for resources, whether we can scale to the next generation of models or hit a wall.

You'll be the person who:

  • Designs, deploys, and maintains large-scale ML training clusters running SLURM for distributed workload orchestration—the backbone of everything we train
  • Implements comprehensive node health monitoring with automated failure detection and recovery workflows, because at scale, something is always breaking (the sketch after this list gives a flavor of that automation)
  • Partners with cloud and colocation providers to ensure cluster availability and performance—translating between their abstractions and our requirements
  • Establishes and enforces security best practices across the entire ML infrastructure stack (network, storage, compute) without creating friction for researchers
  • Builds developer-facing tools and APIs that streamline ML workflows and improve researcher productivity—because infrastructure that's hard to use doesn't get used
  • Collaborates directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning decisions

Questions We're Wrestling With

  • How do you detect and recover from GPU failures in multi-week training runs without losing days of progress? (A back-of-the-envelope sketch follows this list.)
  • What's the right balance between cluster utilization and researcher flexibility—and how do you enforce it without becoming a bottleneck?
  • When a training run is using 1000+ GPUs, which failure modes matter and which can you safely ignore?
  • How do you optimize NCCL and interconnect settings for models that don't fit established patterns?
  • What does "high availability" actually mean for ML infrastructure, where some downtime is acceptable but data loss never is?
  • How do you provide researchers with enough visibility to debug their jobs without overwhelming them with infrastructure complexity?

We're figuring these out in production, where the cost of being wrong is measured in GPU-hours.

Who Thrives Here

You've managed large-scale compute infrastructure and understand that ML training clusters are their own special kind of challenging. You've been paged at 2am because a training run failed. You've debugged why 512 GPUs are running fine but 1024 aren't. You know the difference between infrastructure that works in theory and infrastructure that works when researchers depend on it.

You likely have:

  • Production experience managing SLURM clusters at scale—not just deploying them, but tuning job scheduling policies, resource allocation strategies, and federation setups
  • Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments where performance actually matters
  • A proven track record managing GPU clusters, including the unglamorous work of driver management and DCGM monitoring

We'd be especially excited if you:

  • Understand distributed training patterns, checkpointing strategies, and data pipeline optimization well enough to help researchers debug performance issues
  • Have experience with Kubernetes for containerized workloads, particularly in inference or mixed compute environments
  • Know your way around high-performance interconnects (InfiniBand, RoCE) and have tuned NCCL for multi-node training (the sketch after this list shows the kind of knobs involved)
  • Have managed 1000+ GPU training runs and developed deep intuition for failure modes and recovery patterns
  • Are familiar with high-performance storage solutions (VAST, blob storage) and understand their performance characteristics for ML workloads
  • Have run hybrid training/inference infrastructure with appropriate resource isolation
  • Bring strong scripting skills (Python, Bash) and infrastructure-as-code experience

What We're Building Toward

We're not just maintaining infrastructure—we're building the computational foundation that determines what research is possible. Every hour of cluster downtime prevented is a research experiment that happens faster. Every monitoring system improved is a failure caught before it costs days of training. If that sounds more compelling than merely keeping existing systems running, we should talk.

Base Annual Salary: $180,000–$300,000 USD


We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.
