Member of Technical Staff - ML Infra
Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is looking for a strong candidate to join us in developing and maintaining our ML infrastructure, including large GPU training and inference clusters.
Role:
- Design, deploy, and maintain cloud-based ML training (Slurm) and inference (Kubernetes) clusters
- Implement and manage network-based cloud file systems and blob/S3 storage solutions
- Develop and maintain Infrastructure as Code (IaC) for resource provisioning
- Implement and optimize CI/CD pipelines for ML workflows
- Design and implement custom autoscaling solutions for ML workloads
- Ensure security best practices across the ML infrastructure
- Provide developer-friendly tools and practices for efficient ML operations
Ideal Experience:
- Strong proficiency in cloud platforms (AWS, Azure, or GCP) with a focus on ML/AI services
- Extensive experience with Kubernetes and Slurm cluster management
- Expertise in Infrastructure as Code tools (e.g., Terraform, Ansible)
- Proven track record in managing and optimizing network-based cloud file systems and object storage
- Experience with CI/CD tools and practices (e.g., CircleCI, GitHub Actions, ArgoCD)
- Strong understanding of security principles and best practices in cloud environments
- Experience with monitoring and observability tools (e.g., Prometheus, Grafana, Loki)
- Familiarity with ML workflows and GPU infrastructure management
- Demonstrated ability to handle complex migrations and breaking changes in production environments
Nice to have:
- Experience with custom autoscaling solutions for ML workloads
- Knowledge of cost optimization strategies for cloud-based ML infrastructure
- Familiarity with MLOps practices and tools
- Experience with high-performance computing (HPC) environments
- Understanding of data versioning and experiment tracking for ML
- Knowledge of network optimization for distributed ML training
- Experience with multi-cloud or hybrid cloud architectures
- Familiarity with container security and vulnerability scanning tools