Member of Technical Staff - Training Cluster Engineer

Freiburg (Germany), San Francisco (USA)

At Black Forest Labs, we’re on a mission to advance the state of the art in generative deep learning for media, building powerful, creative, and open models that push what’s possible.

Born from foundational research, we continuously create advanced infrastructure to transform ideas into images and videos.

Our team pioneered Latent Diffusion, Stable Diffusion, and FLUX.1 – milestones in the evolution of generative AI. Today, these foundations power millions of creations worldwide, from individual artists to enterprise applications.

We are looking for a Training Cluster Engineer to join our fast-growing team.

Role and Responsibilities

  • Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration
  • Implement comprehensive node health monitoring systems with automated failure detection and recovery workflows (see the sketch after this list)
  • Partner with cloud and colocation providers to ensure cluster availability and performance
  • Establish and enforce security best practices across the ML infrastructure stack (network, storage, compute)
  • Build and maintain developer-facing tools and APIs that streamline ML workflows and improve researcher productivity
  • Collaborate directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning
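
To make the monitoring responsibility above concrete, here is a minimal sketch of the kind of automated node health check this role might own. It assumes pynvml and scontrol are available on each compute node; the temperature threshold, the ECC check, and the drain reason are illustrative placeholders, not a description of our production tooling.

    # Minimal node health check sketch: inspect local GPUs and drain the
    # SLURM node if a fault is found. Thresholds are illustrative only.
    import socket
    import subprocess

    import pynvml

    TEMP_LIMIT_C = 85  # placeholder threshold, not a vendor recommendation

    def gpu_faults():
        """Return human-readable fault descriptions for the local GPUs."""
        faults = []
        pynvml.nvmlInit()
        try:
            for i in range(pynvml.nvmlDeviceGetCount()):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                temp = pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU
                )
                if temp > TEMP_LIMIT_C:
                    faults.append(f"GPU {i}: temperature {temp} C")
                try:
                    ecc = pynvml.nvmlDeviceGetTotalEccErrors(
                        handle,
                        pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                        pynvml.NVML_VOLATILE_ECC,
                    )
                    if ecc > 0:
                        faults.append(f"GPU {i}: {ecc} uncorrected ECC errors")
                except pynvml.NVMLError:
                    pass  # ECC counters are not exposed on every SKU
        finally:
            pynvml.nvmlShutdown()
        return faults

    def drain_node(reason):
        """Take the node out of the pool so SLURM stops scheduling onto it."""
        node = socket.gethostname()
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}",
             "State=DRAIN", f"Reason={reason}"],
            check=True,
        )

    if __name__ == "__main__":
        problems = gpu_faults()
        if problems:
            drain_node("; ".join(problems))

In practice a check like this would run periodically on every node and feed alerting and automated recovery workflows, rather than acting in isolation.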

What we look for

  • Production experience managing SLURM clusters at scale, including job scheduling policies, resource allocation, and federation
  • Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments
  • Proven track record managing GPU clusters, including driver management and DCGM monitoring

Preferred Qualifications

  • Understanding of distributed training patterns, checkpointing strategies, and data pipeline optimization
  • Experience with Kubernetes for containerized workloads, particularly for inference or mixed compute environments
  • Experience with high-performance interconnects (InfiniBand, RoCE) and NCCL optimization for multi-node training
  • Track record of managing 1000+ GPU training runs, with deep understanding of failure modes and recovery patterns
  • Familiarity with high-performance storage solutions (VAST, blob storage) and their performance characteristics for ML workloads
  • Experience running hybrid training/inference infrastructure with appropriate resource isolation
  • Strong scripting skills (Python, Bash) and infrastructure-as-code experience (see the sketch after this list)
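
As an illustration of the scripting side of the role, below is a minimal sketch of a Python wrapper that renders and submits a multi-node SLURM batch job. The partition name, network interface, NCCL settings, and the train.py entry point are placeholders chosen for the example, not a description of our actual stack.

    # Minimal job-submission sketch: render an sbatch script for a multi-node
    # training run and return the SLURM job id. All names are placeholders.
    import re
    import subprocess
    import tempfile

    BATCH_TEMPLATE = """#!/bin/bash
    #SBATCH --job-name={name}
    #SBATCH --partition=train
    #SBATCH --nodes={nodes}
    #SBATCH --ntasks-per-node={gpus_per_node}
    #SBATCH --gpus-per-node={gpus_per_node}
    #SBATCH --time=48:00:00
    #SBATCH --output={name}-%j.log

    export NCCL_DEBUG=INFO            # surface interconnect issues in the job log
    export NCCL_SOCKET_IFNAME=eth0    # placeholder interface name

    srun python train.py
    """

    def submit(name: str, nodes: int, gpus_per_node: int = 8) -> str:
        """Render the batch script, submit it with sbatch, return the job id."""
        script = BATCH_TEMPLATE.format(
            name=name, nodes=nodes, gpus_per_node=gpus_per_node
        )
        with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
            f.write(script)
            path = f.name
        out = subprocess.run(
            ["sbatch", path], capture_output=True, text=True, check=True
        )
        # sbatch prints "Submitted batch job <id>" on success
        match = re.search(r"Submitted batch job (\d+)", out.stdout)
        return match.group(1) if match else out.stdout.strip()

    if __name__ == "__main__":
        print(submit("example-train", nodes=16))

A wrapper like this is the starting point for the developer-facing tooling mentioned above: researchers describe the run, and the infrastructure layer handles scheduling, logging, and restart-on-failure policy.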

Base Annual Salary: $180,000 - $300,000 USD 
