Data Engineer - Healthcare and AI/LLM
About the Team
RA Capital’s Data Engineering team is responsible for ensuring high-quality, reliable, and accessible data throughout the organization. We emphasize data integrity, compliance, and usability to support strategic decision-making. Our team oversees the complete data lifecycle, partnering with internal stakeholders and external vendors to build scalable data infrastructure that fuels a data-driven culture.
About the Role
We are seeking a skilled Data Engineer with healthcare data experience and a strong interest in AI/LLM-powered data access to join our Data Engineering team. This role is pivotal in designing and maintaining robust data pipelines, with a focus on healthcare datasets such as claims, provider data, and patient records, and in extending access to that data through AI-driven solutions.
The ideal candidate will possess deep technical knowledge in data engineering, experience with healthcare data standards, and a working understanding of large language model (LLM) systems and the Model Context Protocol (MCP). You’ll help bridge structured enterprise data with AI interfaces that power self-service and natural language query workflows.
Responsibilities
- Design, build, and optimize end-to-end enterprise data pipelines for ingesting and integrating healthcare vendor data, especially claims data.
- Develop and maintain robust ETL processes and data integrations between data warehouses (e.g., Databricks) and downstream applications.
- Write production-level Python and SQL code to standardize, reconcile, and match healthcare data, applying NLP and ML techniques when needed.
- Develop scalable data models in Databricks to support efficient reporting and analytics across clinical, financial, and operational datasets.
- Implement rigorous data quality controls and validation checks to ensure data accuracy and compliance with healthcare standards (e.g., HIPAA).
- Collaborate with external healthcare data vendors to define delivery specifications and transformation logic.
- Partner with internal IT, analytics, and business stakeholders to align data efforts with organizational objectives.
- Work closely with AI/ML engineers and product teams to support LLM-based data access layers built on top of Hasura or similar GraphQL engines.
- Contribute to the integration and evaluation of Model Context Protocol (MCP) in real-world applications, enabling scalable, secure, and interpretable LLM usage.
- Document data architectures, pipelines, workflows, and processes for both technical and non-technical audiences.
- Provide Tier 1 support for monitoring data flows and resolving pipeline or integration issues.
- Ensure ongoing compliance with data governance and security standards.
Key Skills & Experience
- 1–2+ years working with healthcare data, including claims, structured and unstructured EMR/EHR, provider, and payer data. Familiarity with healthcare code sets and identifiers (ICD, CPT, NPI, etc.) is strongly preferred.
- Expertise in building scalable ETL/ELT pipelines and data integration workflows.
- Strong skills in Python, SQL, and Spark. Experience with Java is a plus.
- Hands-on experience with Databricks; familiarity with AWS (S3, EC2, EBS) preferred.
- Strong understanding of data validation, quality assurance, and compliance practices in a healthcare setting.
- Exposure to LLM applications and AI-driven data interfaces, particularly in structured enterprise data environments.
- Familiarity with Model Context Protocol (MCP) and how it supports contextual integrity, auditability, and chain-of-thought in AI/LLM-based data access.
- Proven ability to manage external data vendors and collaborate on schema, format, and delivery improvements.
- Ability to clearly convey technical details to non-technical stakeholders and align data projects with business needs.
Key Requirements
- Master’s degree or higher from a top Computer Science or Data Science program.
- 1–2+ years of experience in data engineering, software development, and managing production-grade pipelines, preferably in a healthcare environment.
- Must be based in the Boston area.
- Ability to work a hybrid schedule in our Boston office.
- Must be authorized to work in the United States.