Member of Technical Staff - Large Model Data
What if the bottleneck to better generative models isn't architecture or compute, but the quality and scale of the data we train on?
We're the ~50-person team behind Stable Diffusion, Stable Video Diffusion, and FLUX.1—models with 400M+ downloads. But here's what we've learned: breakthrough models require breakthrough datasets. Not just big datasets—carefully curated, properly processed, deeply understood datasets that push models toward capabilities they couldn't achieve otherwise. That's the infrastructure you'll build.
What You'll Pioneer
You'll create the data systems that make frontier research possible. This isn't traditional data engineering—it's building infrastructure at a scale where billion-image datasets are normal, where video processing pipelines need to run across thousands of GPUs, and where understanding what's in your data is as important as collecting it.
You'll be the person who:
- Develops and maintains scalable infrastructure for acquiring massive-scale image and video datasets—the kind where "large" means billions of assets, not millions
- Manages and coordinates data transfers from licensing partners, turning heterogeneous sources into training-ready data
- Implements and deploys state-of-the-art ML models for data cleaning, processing, and preparation—because at our scale, manual curation isn't an option
- Builds scalable tools to visualize, cluster, and deeply understand what's actually in our datasets (because you can't fix what you can't see)
- Optimizes and parallelizes data processing workflows to handle billion-scale datasets efficiently across both CPUs and GPUs
- Ensures data quality, diversity, and proper annotation—including captioning systems that make training datasets actually useful
- Transforms user preference data and alternative sources into formats that models can learn from
- Works directly in the model development loop, updating datasets as training trajectories reveal what we're missing
Questions We're Wrestling With
- How do you deduplicate billions of images without accidentally removing the edge cases that make models interesting?
- What does "data quality" actually mean when you're training generative models—and how do you measure it at scale?
- How do you caption video data in ways that capture temporal dynamics, not just individual frames?
- Where are the hidden biases in our datasets, and how do we surface them before they become model biases?
- When does adding more data help, and when does it just add noise?
- How do we build data pipelines that adapt as model requirements change mid-training?
These questions don't have textbook answers—we're figuring them out as we go.
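To make the first of these concrete, one common baseline is embedding-based near-duplicate detection: embed every image, find close neighbors, and collapse clusters while keeping a representative. The sketch below is purely illustrative, not our pipeline; it assumes precomputed CLIP-style embeddings, uses the FAISS library for neighbor search, and the 0.97 threshold is a placeholder, not a tuned value.

```python
# Illustrative sketch only: embedding-based near-duplicate grouping.
# Assumes (N, D) float32 embeddings whose rows are L2-normalized,
# so inner product equals cosine similarity.
import numpy as np
import faiss  # nearest-neighbor search library


def near_duplicate_groups(embeddings: np.ndarray, threshold: float = 0.97, k: int = 16):
    """Return groups of row indices whose embeddings are nearly identical."""
    n, d = embeddings.shape
    index = faiss.IndexFlatIP(d)      # exact inner-product search; at billion scale
    index.add(embeddings)             # you would swap in an IVF/HNSW index instead
    sims, nbrs = index.search(embeddings, k)

    # Union-find so chains of near-duplicates end up in one group.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for sim, j in zip(sims[i], nbrs[i]):
            if j != i and sim >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    # Keeping one representative per group, rather than dropping whole clusters,
    # is one way to shed redundancy without losing rare edge cases.
    return [g for g in groups.values() if len(g) > 1]
```

The interesting work is everything around a sketch like this: choosing embeddings that separate true duplicates from stylistic variants, picking thresholds per domain, and running it across billions of assets rather than in a single process.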
Who Thrives Here
You understand that data engineering at research scale is fundamentally different from traditional data engineering. You've built pipelines that broke, debugged them at scale, and emerged with opinions about what works. You know the difference between data that looks good and data that actually trains well.
You likely have:
- Strong proficiency in Python and experience with various file systems for data-intensive manipulation and analysis
- Hands-on familiarity with cloud platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing
- Experience with image and video processing libraries (OpenCV, FFmpeg, etc.) and an understanding of their performance characteristics
- Demonstrated ability to optimize and parallelize data workflows across both CPUs and GPUs—because at our scale, inefficient code is unusable code
- Familiarity with data annotation and captioning processes for ML training datasets
- Knowledge of machine learning techniques for data cleaning and preprocessing (because heuristics only get you so far)
We'd be especially excited if you:
- Have built or contributed to large-scale data acquisition systems and understand the operational challenges
- Bring experience with NLP techniques for image/video captioning
- Have implemented data deduplication at billion-record scale and understand the tradeoffs
- Know your way around big data frameworks like Apache Spark or Hadoop
- Have been part of shipping a state-of-the-art model and understand how data decisions impact training outcomes
- Think deeply about ethical considerations in data collection and usage
What We're Building Toward
We're not just processing data—we're building the foundation that determines what our models can learn. Every pipeline optimization makes training faster. Every data quality improvement makes models better. Every new data source opens new possibilities. If that sounds more compelling than maintaining existing systems, we should talk.
Base Annual Salary: $180,000–$300,000 USD
We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.