Member of Technical Staff - Large Model Data
What if the bottleneck to better generative models isn't architecture or compute, but the quality and scale of the data we train on?
We're the ~50-person team behind Stable Diffusion, Stable Video Diffusion, and FLUX.1—models with 400M+ downloads. But here's what we've learned: breakthrough models require breakthrough datasets. Not just big datasets—carefully curated, properly processed, deeply understood datasets that push models toward capabilities they couldn't achieve otherwise. That's the infrastructure you'll build.
What You'll Pioneer
You'll create the data systems that make frontier research possible. This isn't traditional data engineering—it's building infrastructure at a scale where billion-image datasets are normal, where video processing pipelines need to run across thousands of GPUs, and where understanding what's in your data is as important as collecting it.
You'll be the person who:
- Develops and maintains scalable infrastructure for acquiring massive-scale image and video datasets—the kind where "large" means billions of assets, not millions
- Manages and coordinates data transfers from licensing partners, turning heterogeneous sources into training-ready data
- Implements and deploys state-of-the-art ML models for data cleaning, processing, and preparation—because at our scale, manual curation isn't an option
- Builds scalable tools to visualize, cluster, and deeply understand what's actually in our datasets (because you can't fix what you can't see)
- Optimizes and parallelizes data processing workflows to handle billion-scale datasets efficiently across both CPUs and GPUs
- Ensures data quality, diversity, and proper annotation—including captioning systems that make training datasets actually useful
- Transforms user preference data and alternative sources into formats that models can learn from
- Works directly in the model development loop, updating datasets as training trajectories reveal what we're missing
Questions We're Wrestling With
- How do you deduplicate billions of images without accidentally removing the edge cases that make models interesting?
- What does "data quality" actually mean when you're training generative models—and how do you measure it at scale?
- How do you caption video data in ways that capture temporal dynamics, not just individual frames?
- Where are the hidden biases in our datasets, and how do we surface them before they become model biases?
- When does adding more data help, and when does it just add noise?
- How do we build data pipelines that adapt as model requirements change mid-training?
These questions don't have textbook answers—we're figuring them out as we go.
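To make the first of these concrete, one common baseline is embedding-based near-duplicate detection: embed every image, find close neighbors, and collapse clusters while keeping a representative. The sketch below is purely illustrative, not our pipeline; it assumes precomputed CLIP-style embeddings, uses the FAISS library for neighbor search, and the 0.97 threshold is a placeholder, not a tuned value.

```python
# Illustrative sketch only: embedding-based near-duplicate grouping.
# Assumes (N, D) float32 embeddings whose rows are L2-normalized,
# so inner product equals cosine similarity.
import numpy as np
import faiss  # nearest-neighbor search library


def near_duplicate_groups(embeddings: np.ndarray, threshold: float = 0.97, k: int = 16):
    """Return groups of row indices whose embeddings are nearly identical."""
    n, d = embeddings.shape
    index = faiss.IndexFlatIP(d)      # exact inner-product search; at billion scale
    index.add(embeddings)             # you would swap in an IVF/HNSW index instead
    sims, nbrs = index.search(embeddings, k)

    # Union-find so chains of near-duplicates end up in one group.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for sim, j in zip(sims[i], nbrs[i]):
            if j != i and sim >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    # Keeping one representative per group, rather than dropping whole clusters,
    # is one way to shed redundancy without losing rare edge cases.
    return [g for g in groups.values() if len(g) > 1]
```

The interesting work is everything around a sketch like this: choosing embeddings that separate true duplicates from stylistic variants, picking thresholds per domain, and running it across billions of assets rather than in a single process.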
Who Thrives Here
You understand that data engineering at research scale is fundamentally different from traditional data engineering. You've built pipelines that broke, debugged them at scale, and emerged with opinions about what works. You know the difference between data that looks good and data that actually trains well.
You likely have:
- Strong proficiency in Python and experience with various file systems for data-intensive manipulation and analysis
- Hands-on familiarity with cloud platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing
- Experience with image and video processing libraries (OpenCV, FFmpeg, etc.) and an understanding of their performance characteristics
- Demonstrated ability to optimize and parallelize data workflows across both CPUs and GPUs—because at our scale, inefficient code is unusable code
- Familiarity with data annotation and captioning processes for ML training datasets
- Knowledge of machine learning techniques for data cleaning and preprocessing (because heuristics only get you so far)
We'd be especially excited if you:
- Have built or contributed to large-scale data acquisition systems and understand the operational challenges
- Bring experience with NLP techniques for image/video captioning
- Have implemented data deduplication at billion-record scale and understand the tradeoffs
- Know your way around big data frameworks like Apache Spark or Hadoop
- Have been part of shipping a state-of-the-art model and understand how data decisions impact training outcomes
- Think deeply about ethical considerations in data collection and usage
What We're Building Toward
We're not just processing data—we're building the foundation that determines what our models can learn. Every pipeline optimization makes training faster. Every data quality improvement makes models better. Every new data source opens new possibilities. If that sounds more compelling than maintaining existing systems, we should talk.
Base Annual Salary: $180,000–$300,000 USD
We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.