Software Engineer, Data Infrastructure
Thinking Machines Lab's mission is to empower humanity through advancing collaborative general intelligence. We're building a future where everyone has access to the knowledge and tools to make AI work for their unique needs and goals.
We are a small team of scientists, engineers, and builders who've created some of the most widely used AI products, like ChatGPT, Character.ai, Mistral, PyTorch, OpenAI Gym, Fairseq, and Segment Anything.
About The Role
We're looking for a Software Engineer with deep expertise in Data Infrastructure to help build the systems that power our foundation models.
You'll join a small, high-impact team responsible for architecting and scaling the core infrastructure behind distributed training pipelines, multimodal data catalogs, and intelligent processing systems that operate over petabytes of data.
Infrastructure is critical to us: it's the bedrock that enables every breakthrough. You'll work directly with researchers to accelerate experiments, develop new datasets, improve infrastructure efficiency, and enable key insights across our data assets.
If you're excited by distributed systems, large-scale data mining, open-source tools like Spark, Kafka, Beam, Ray, and Delta Lake, and enjoy building from the ground up, we'd love to hear from you.
What You’ll Do
- Design, build, and operate scalable, fault-tolerant infrastructure for LLM research: distributed compute, data orchestration, and storage across modalities.
- Develop high-throughput systems for data ingestion, processing, and transformation, including training data catalogs, deduplication, quality checks, and search.
- Build systems for traceability, reproducibility, and robust quality control at every stage of the data lifecycle.
- Implement and maintain monitoring and alerting to support platform reliability and performance.
- Collaborate with research teams to unlock new features, improve data quality, and accelerate training cycles.
Required Qualifications
- 5+ years of experience in data infrastructure, ideally supporting ML or research use cases.
- Fluency in distributed compute frameworks such as Apache Spark and Ray.
- Hands-on experience with Kafka, dbt, Terraform, and Airflow.
- Experience building a web crawler.
- Extensive experience studying and scaling deduplication, data mining, and search.
- Deep familiarity with cloud infrastructure, data lake architectures, and batch and streaming pipelines.
- Strong knowledge of file formats and storage systems (e.g., Parquet, Delta Lake) and how they impact performance and scalability.
- A proactive approach to documentation, testing, and empowering your teammates with good tooling.
Strong Candidates May Also Have
- 5+ years of industry experience building large-scale distributed systems.
- Strong proficiency in Python and SQL; experience with Rust is a bonus.
- Familiarity with performance tuning and memory management in high-volume data systems.
- A track record of scaling infrastructure and debugging complex systems in production.
- Excellent communication and collaboration skills.
Logistics
- Location: This role is based in San Francisco, California.
- Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
- Benefits: Thinking Machines offers competitive health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.
- Compensation: Depending on background, skills, and experience, the expected annual salary range for this position is $300,000-$350,000 USD.
- We encourage you to apply even if you do not believe you meet every single qualification.
- As set forth in Thinking Machines' Equal Employment Opportunity policy, we do not discriminate on the basis of any protected group status under any applicable law.