Job Application for Product Data Lead at impact.com

About impact.com

impact.com is the world’s leading commerce partnership marketing platform, transforming the way businesses grow by enabling them to discover, manage, and scale partnerships across the entire customer journey. From affiliates and influencers to content publishers, brand ambassadors, and customer advocates, impact.com empowers brands to drive trusted, performance-based growth through authentic relationships. Its award-winning products—Performance (affiliate), Creator (influencer), and Advocate (customer referral)—unify every type of partner into one integrated platform. As consumers increasingly rely on recommendations from people and communities they trust, impact.com helps brands show up where it matters most. Today, over 5,000 global brands, including Walmart, Uber, Shopify, Lenovo, L’Oréal, and Fanatics, rely on impact.com to power more than 225,000 partnerships that deliver measurable business results.

About the Role

We're seeking a Lead Data Scientist specializing in Product Data Quality to join our Cape Town Data Science team. This role combines deep individual contribution with meaningful technical leadership across the team.

You'll own the most complex, highest-leverage work in product data quality, spanning catalog hygiene, transaction matching, classification modeling, deduplication, and global product identity. You'll work across both the structured catalog universe and the messier, larger-scale sales transaction universe, building models and infrastructure that power search, recommendations, and business intelligence. Beyond your own delivery, you'll set technical standards, mentor senior scientists, and drive cross-functional alignment in ways that multiply the team's output.

This role demands both analytical depth and strong engineering capability. You take models from research to production, build scalable data pipelines, and create monitoring infrastructure that makes the product data foundation trustworthy and continuously improving. You also bring a systems perspective — identifying the architectural and process changes that prevent whole classes of data quality problems from recurring. Your work directly influences search relevance, recommendation quality, match rates, and reporting accuracy across the business.

Core Responsibilities

Product Classification & Taxonomy Modeling

Develop, deploy, and maintain ML models for automated product categorization and taxonomy assignment across hierarchical category structures — owning the most architecturally complex modeling challenges in the domain.
Drive step-change improvements in classification accuracy through advanced feature engineering (text, attributes, embeddings), model architecture decisions, and rigorous evaluation on both catalog and sales transaction data.
Define production monitoring standards for classification models; establish drift detection patterns, retraining triggers, and quality SLOs that others on the team adopt.
Act as the technical authority on taxonomy edge cases, new product types, and evolving category structures; collaborate with category experts and Product teams to shape taxonomy definitions.

Catalog & Sales Universe Data Quality

Lead deep-dive analyses into catalog completeness, consistency, and correctness across retailers, categories, and product attributes.
Own data quality analytics for the sales transaction universe — a larger, messier dataset than catalog — measuring match rates, diagnosing systemic gaps, and identifying root causes of unmatched transactions and misattributed products.
Define and evolve the canonical KPI framework for catalog and transaction health (attribute coverage, schema compliance, match rates, GPID coverage, freshness); build and maintain the monitoring systems that make quality trends visible and actionable.
Drive remediation of systemic quality failures; translate findings into cross-functional recommendations that produce durable process and data improvements.

Global Product ID (GPID) Coverage & Matching

Assess and own GPID coverage and accuracy across both catalog and sales transaction data; develop the analytical view of gaps by category, retailer, and brand.
Architect and improve matching algorithms to link sales transactions to catalog products, handling missing GPIDs, naming inconsistencies, and category misclassification at scale — combining rule-based, probabilistic, and learned approaches.
Quantify the downstream impact of GPID enrichment and matching improvements on search, deduplication, and reporting; use this to drive partner and brand engagement via the Tiger Team.

Deduplication & Entity Resolution

Lead the design and implementation of deduplication pipelines that handle catalog and transaction data at scale; define the architectural patterns, heuristics, and ML-based approaches for variant grouping and entity resolution.
Set the quality bar for precision/recall tradeoffs in duplicate detection; establish evaluation frameworks the broader team builds against.
Measure the impact of deduplication on search quality, recommendation accuracy, and reporting; iterate on models to reduce false positives and improve precision.
Drive productionization of deduplication and entity linking infrastructure in partnership with Data Engineering and Platform.

Manufacturer Data Quality & Brand Engagement

Own evaluation of manufacturer-level attribute consistency (brand name, MPN, manufacturer identifiers) across catalogs and transactions.
Detect and quantify systemic issues at the brand and retailer level; build scorecards and partner with the Tiger Team to drive data quality improvements at source.
Create feedback loops to measure progress on remediation initiatives; track and communicate impact over time.

Product Search & Retrieval Infrastructure

Research and prototype improvements to product search and retrieval pipelines — including vector search, semantic similarity, and embedding-based matching — and own the path from prototype to production.
Lead infrastructure decisions around vector databases (FAISS, Pinecone, Weaviate) and design retrieval pipelines that combine text, structured attributes, and embeddings at scale.
Evaluate search relevance and ranking quality; drive iteration on indexing strategies, query preprocessing, and re-ranking models.

Product Graph & Relational Modeling

Build and maintain product graph infrastructure capturing relationships between products, variants, brands, categories, retailers, and transactions.
Apply graph-based techniques (community detection, link analysis, centrality) to identify product families, detect duplicates, and surface hierarchy insights.
Partner with Data Platform teams on scalable graph storage and query design (Neo4j, graph extensions in BigQuery).

Insights, Monitoring & Reporting

Systematically identify, classify, and prioritize product data quality issues; produce clear summaries, visualizations, and actionable recommendations for stakeholders at all levels.
Build and maintain dashboards and recurring reports for key product data KPIs; establish alerting and anomaly detection systems that proactively surface degradation and model performance issues.

Engineering & Production Deployment

Take models and analytics prototypes from POC to production independently — owning deployment, testing, monitoring, and iteration without requiring engineering partnership.
Build robust, scalable data pipelines and ML workflows using production-grade practices: versioning, CI/CD, testing, observability.
Collaborate with MLOps and Data Engineering to ensure production readiness: reliability, latency, drift monitoring, and SLOs.

Technical Leadership & Mentorship

Serve as a senior technical voice in the product data quality domain: conduct design and code reviews, establish coding and modeling standards, and ensure the team's output meets a high bar for production readiness.
Actively mentor Senior and mid-level Data Scientists — through pairing, reviews, feedback, and structured guidance — helping them grow their modeling depth, engineering skills, and stakeholder communication.
Contribute to hiring: help define the technical bar, conduct interviews, and provide calibrated assessments of candidates.
Represent the team in cross-functional technical discussions; be a credible voice on data quality in planning forums with Product and Engineering.

Qualifications

Required

Experience: 7+ years in data science, ML engineering, or analytics engineering, with at least 3 years focused on product data, catalog quality, entity resolution, search/retrieval, or e-commerce/marketplace analytics. Clear progression in scope and complexity over time.
Technical leadership: Demonstrated experience operating above the Senior level — setting technical standards, leading complex initiatives end-to-end, and raising the bar for those around you, without requiring a management title to do so.
Engineering strength: Proven ability to build production-grade data pipelines and deploy ML models independently; strong software engineering fundamentals (code quality, testing, version control, CI/CD).
Data quality expertise: Deep, firsthand experience analyzing and improving large-scale structured data quality problems — completeness, consistency, accuracy, deduplication, entity resolution — at the scale and messiness of real transaction data.
ML & classification depth: Extensive track record building and deploying classification models, ranking systems, or search/retrieval pipelines in production, with ownership of the full lifecycle.
Technical skills:

Expert-level Python and SQL; advanced proficiency with ML libraries (scikit-learn, XGBoost, LightGBM, PyTorch/TensorFlow) and large-scale data tools (pandas, PySpark).
Deep experience with entity resolution, fuzzy matching, clustering, embeddings, and similarity-based techniques at scale.
Strong production ML fundamentals: model versioning, monitoring, evaluation, drift detection, retraining, A/B testing.
Experience designing data quality monitoring systems and anomaly detection infrastructure.

Analytical rigor: Strong foundation in statistics and ML; ability to design experiments, validate models, interpret results, and communicate findings with business context and clarity.
Stakeholder collaboration: Track record of working cross-functionally with Product, Engineering, and business teams; ability to translate technical complexity into actionable recommendations and drive alignment without authority.
Mentorship: Evidence of coaching or developing more junior team members technically.
Education: Bachelor's in a quantitative field (CS, Statistics, Math, Engineering, or similar); Master's/PhD preferred.

Preferred / Nice to Have

Experience with vector search and embeddings (sentence transformers, OpenAI embeddings, BERT-based models) and vector databases.
Familiarity with search and retrieval systems (Elasticsearch, Solr, semantic search, BM25, hybrid ranking) and deep understanding of how data quality cascades into relevance.
Experience with graph databases and graph analytics applied to product or entity data.
Advanced NLP for product data: text classification, NER, attribute extraction, title/description parsing, semantic similarity.
Experience with multimodal modeling (combining text, images, and structured attributes for classification or retrieval).
Familiarity with global product identifier standards (GTIN/UPC/EAN, MPN, SKU hierarchies, GS1/GDSN).
Experience designing record linkage systems at scale: blocking strategies, probabilistic matching, hierarchical clustering.
Proficiency with GCP tools (BigQuery, Vertex AI, Dataflow, Cloud Run, Looker) and/or Databricks/Spark for large-scale processing and deployment.
Exposure to master data management or data governance practices in product or catalog contexts.
Experience with recommendation systems and understanding of how product data quality propagates into personalization and ranking quality.

What Sets You Apart

Product data obsession. You care deeply about data quality and understand exactly how poor catalog hygiene cascades into broken user experiences, reporting errors, and operational drag.
Engineering mindset. You don't just build prototypes — you ship them. You write clean, tested, production-ready code and own the full lifecycle from research to deployment.
Detective instincts. You love digging into messy data, finding patterns, and uncovering root causes — whether it's a systematic retailer issue, a subtle duplicate cluster, or a classification edge case hiding in the long tail.
Multiplier instinct. You make the people around you better. You invest in others' growth through reviews, mentorship, and building systems that are understandable and extensible — not just effective.
Systems perspective. You see beyond the immediate problem to the architectural and process changes that prevent whole classes of issues from recurring. You build for the team, not just for yourself.
Stakeholder fluency. You translate messy data findings into clear, actionable recommendations and build genuine trust with brands, retailers, Product, and Engineering teams.
Pragmatic prioritization. You balance comprehensiveness with impact, consistently focusing on the 20% of issues that drive 80% of quality problems and business value.
Comfort with ambiguity. You thrive in evolving data ecosystems, defining your own quality metrics and technical approaches when the problem space is still being shaped.

Benefits and Perks

At impact.com, we believe that when you’re happy and fulfilled, you do your best work. That’s why we’ve built a benefits package that supports your well-being, growth, and work-life balance.

Flexible Working: Our Responsible PTO policy means you can take the time off you need to rest and recharge. We're committed to a positive work-life balance and provide a flexible environment that allows you to be happy and fulfilled in both your career and your personal life.
Health and Wellness: Your well-being is a priority. Our mental health and wellness benefit includes up to 12 fully covered therapy/coaching sessions per year, with additional dependent coverage. We also offer a monthly gym reimbursement policy to support your physical health.
A Stake in Our Growth: We offer Restricted Stock Units (RSUs) as part of our total compensation, pending Board approval. RSUs vest over a 3-year schedule and give you a stake in the company's growth.
Investing in Your Growth: We’re committed to your continuous learning. Take advantage of our free Coursera subscription and our PXA courses.
Parental Support: We offer a generous parental leave policy: 26 weeks of fully paid leave for the primary caregiver and 13 weeks of fully paid leave for the secondary caregiver.
Technology Financial Support: We provide a technology stipend to help you set up your home office and a monthly allowance to cover your internet expenses.

impact.com is proud to be an equal opportunity workplace. All employees and applicants for employment shall be given fair treatment and equal employment opportunity regardless of their race, ethnicity or ancestry, color or caste, religion or belief, age, sex (including gender identity, gender reassignment, sexual orientation, pregnancy/maternity), national origin, weight, neurodivergence, disability, marital and civil partnership status, caregiving status, veteran status, genetic information, political affiliation, or other prohibited non-merit factors.
