Data Manager — Multimodal Medical Foundation Models
About the Role
You will lead data operations for a cutting-edge research group developing 3D medical multimodal foundation models and agentic clinical AI systems. These models rely on extremely high-quality, well-structured, and compliant datasets, including 3D medical imaging volumes (MRI, CT, PET), clinical text corpora, annotations, and multimodal metadata.
Your job is to own the end-to-end data lifecycle: acquisition, ingestion, cleaning, versioning, labeling, quality control, governance, and delivery to researchers. You are the central node ensuring our foundation model teams and medical agent teams have clean, scalable, well-documented data pipelines.
This is a pivotal, foundational role: without great data, large models cannot be great.
What You Will Work On
Multimodal Medical Data Ops
- Oversee ingestion and processing of 3D medical volumes (DICOM, NIfTI, MHA) and associated clinical texts.
- Build automated pipelines for metadata extraction, de-identification, slice/series validation, and cohort structuring.
- Manage large-scale internal datasets and external research datasets (BraTS, LiTS, MIMIC-CXR, CheXpert, MosMed, etc.).
Data Infrastructure & Versioning
- Implement scalable data storage, cataloging, and retrieval systems for multimodal training data.
- Own dataset version control, lineage tracking, reproducibility, and dataset documentation.
- Collaborate with ML systems engineers on high-throughput data loaders, sharding strategies, and caching mechanisms.
Annotation & Labeling Programs
- Lead medical annotation workflows with radiologists, medical students, and labeling vendors.
- Create guidelines for ROI labeling, segmentation, captioning, report alignment, and case-level curation.
- Build semi-automated labeling pipelines using model-assisted tools.
Data Quality, Compliance & Governance
- Enforce strict standards on data quality, completeness, consistency, and bias control.
- Ensure adherence to medical data privacy, HIPAA-equivalent frameworks, and institutional data-sharing rules.
- Manage PHI de-identification, audit logs, access control, and compliance approvals.
Collaboration with Research & Engineering
- Work closely with foundation-model researchers to understand data needs for model training.
- Partner with agentic system designers to supply structured datasets for clinical reasoning tasks.
- Collaborate with foundational engineers on data access layers, performance bottlenecks, and dataset optimization.
Why This Role Is Critical
- The foundation model relies on high-quality 3D and textual data at scale.
- You shape the data pipelines enabling next-generation medical AI agents.
- You ensure clinical-grade governance, safety, reproducibility, and trust.
- Your systems become the backbone for research, experiments, and deployments.
For candidates motivated by the intersection of data, healthcare, and machine learning, this is a high-impact opportunity.
What We’re Looking For
- Strong experience managing large multimodal or imaging datasets, ideally medical imaging.
- Proficiency with DICOM/DICOMweb, NIfTI, PACS, and medical imaging toolkits (dicompyler, pydicom, MONAI, ITK).
- Experience with ETL pipelines, distributed data systems, and cloud/on-prem storage.
- Knowledge of metadata standards, ontologies, and text–image linking strategies.
- Comfort working with Python, SQL, and data tooling (Airflow, Prefect, Dagster, dbt, Delta Lake, etc.).
- Understanding of data privacy, de-identification, and compliance requirements in healthcare.
- Strong communication skills and the ability to coordinate between engineers, researchers, clinicians, and data partners.
Nice to Have
- Experience with vector databases, multimodal retrieval, or embedding store design.
- Familiarity with annotation tools (Labelbox, CVAT, iMerit, custom MONAI Label pipelines).
- Prior work with clinical NLP datasets or multilingual Indian medical corpora.
- Experience conducting bias audits, dataset characterization, or quality scoring at scale.
- Contributions to open datasets, benchmarks, or data documentation frameworks.
What We Offer
- Competitive compensation.
- Access to one of the most ambitious medical multimodal datasets in the region.
- Collaboration with scientists building India’s first 3D multimodal medical foundation model.
- Autonomy to design data systems from the ground up.
- A mission-driven team working to transform clinical care with agentic AI.