Member of Technical Staff - Large Model Data

Freiburg (Germany), San Francisco (USA)

What if the bottleneck to better generative models isn't architecture or compute, but the quality and scale of the data we train on?

We're the ~50-person team behind Stable Diffusion, Stable Video Diffusion, and FLUX.1—models with 400M+ downloads. But here's what we've learned: breakthrough models require breakthrough datasets. Not just big datasets—carefully curated, properly processed, deeply understood datasets that push models toward capabilities they couldn't achieve otherwise. That's the infrastructure you'll build.

What You'll Pioneer

You'll create the data systems that make frontier research possible. This isn't traditional data engineering—it's building infrastructure at a scale where billion-image datasets are normal, where video processing pipelines need to run across thousands of GPUs, and where understanding what's in your data is as important as collecting it.

You'll be the person who:

  • Develops and maintains scalable infrastructure for acquiring massive-scale image and video datasets—the kind where "large" means billions of assets, not millions
  • Manages and coordinates data transfers from licensing partners, turning heterogeneous sources into training-ready pipelines
  • Implements and deploys state-of-the-art ML models for data cleaning, processing, and preparation—because at our scale, manual curation isn't an option
  • Builds scalable tools to visualize, cluster, and deeply understand what's actually in our datasets (because you can't fix what you can't see)
  • Optimizes and parallelizes data processing workflows to handle billion-scale datasets efficiently across both CPUs and GPUs
  • Ensures data quality, diversity, and proper annotation—including captioning systems that make training datasets actually useful
  • Transforms user preference data and alternative sources into formats that models can learn from
  • Works directly in the model development loop, updating datasets as training trajectories reveal what we're missing

Questions We're Wrestling With

  • How do you deduplicate billions of images without accidentally removing the edge cases that make models interesting?
  • What does "data quality" actually mean when you're training generative models—and how do you measure it at scale?
  • How do you caption video data in ways that capture temporal dynamics, not just individual frames?
  • Where are the hidden biases in our datasets, and how do we surface them before they become model biases?
  • When does adding more data help, and when does it just add noise?
  • How do we build data pipelines that adapt as model requirements change mid-training?

These questions don't have textbook answers—we're figuring them out as we go.

Who Thrives Here

You understand that data engineering at research scale is fundamentally different from traditional data engineering. You've built pipelines that broke, debugged them at scale, and emerged with opinions about what works. You know the difference between data that looks good and data that actually trains well.

You likely have:

  • Strong proficiency in Python, plus experience working across different file systems for data-intensive manipulation and analysis
  • Hands-on familiarity with cloud platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing
  • Experience with image and video processing libraries (OpenCV, FFmpeg, etc.) and an understanding of their performance characteristics
  • Demonstrated ability to optimize and parallelize data workflows across both CPUs and GPUs—because at our scale, inefficient code is unusable code
  • Familiarity with data annotation and captioning processes for ML training datasets
  • Knowledge of machine learning techniques for data cleaning and preprocessing (because heuristics only get you so far)

We'd be especially excited if you:

  • Have built or contributed to large-scale data acquisition systems and understand the operational challenges
  • Bring experience with NLP techniques for image/video captioning
  • Have implemented data deduplication at billion-record scale and understand the tradeoffs
  • Know your way around big data frameworks like Apache Spark or Hadoop
  • Have been part of shipping a state-of-the-art model and understand how data decisions impact training outcomes
  • Think deeply about ethical considerations in data collection and usage

What We're Building Toward

We're not just processing data—we're building the foundation that determines what our models can learn. Every pipeline optimization makes training faster. Every data quality improvement makes models better. Every new data source opens new possibilities. If that sounds more compelling than maintaining existing systems, we should talk.

Base Annual Salary: $180,000–$300,000 USD


We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.
