Senior Data Scientist, Product Data
About impact.com
impact.com is the world’s leading commerce partnership marketing platform, transforming the way businesses grow by enabling them to discover, manage, and scale partnerships across the entire customer journey. From affiliates and influencers to content publishers, brand ambassadors, and customer advocates, impact.com empowers brands to drive trusted, performance-based growth through authentic relationships. Its award-winning products—Performance (affiliate), Creator (influencer), and Advocate (customer referral)—unify every type of partner into one integrated platform. As consumers increasingly rely on recommendations from people and communities they trust, impact.com helps brands show up where it matters most. Today, over 5,000 global brands, including Walmart, Uber, Shopify, Lenovo, L’Oréal, and Fanatics, rely on impact.com to power more than 225,000 partnerships that deliver measurable business results.
About the Role
We're seeking a Senior Data Scientist specializing in Product Data Quality to join our Cape Town Data Science team. In this role, you'll own the analytical and technical foundation of product data quality across our ecosystem—spanning catalog hygiene, transaction matching, classification modeling, deduplication, and global product identity. You'll work across both the structured catalog universe and the messier, larger-scale sales transaction universe, building models and infrastructure that power search, recommendations, and business intelligence. This is a high-impact role that demands both analytical depth and strong engineering capabilities: you'll take models from research to production, build scalable data pipelines, and create the monitoring infrastructure that makes our product data foundation trustworthy and continuously improving. Your work will directly influence search relevance, recommendation quality, match rates, and reporting accuracy across the business.
Core Responsibilities
Product classification & taxonomy modeling
- Develop, deploy, and maintain ML models for automated product categorization and taxonomy assignment across hierarchical category structures.
- Improve classification accuracy through feature engineering (text, attributes, embeddings), model iteration, and robust evaluation on both catalog and sales transaction data.
- Monitor production model performance; identify and remediate misclassification patterns that impact search, recommendations, and reporting.
- Collaborate with category experts and Product teams to refine taxonomy definitions, handle edge cases, and adapt to new product types.
Catalog & sales universe data quality
- Conduct deep-dive analyses into catalog completeness, consistency, and correctness across retailers, categories, and product attributes.
- Own data quality analytics for the sales transaction universe—a larger, messier dataset than catalog—measuring match rates, diagnosing gaps (unmatched transactions, misattributed products), and identifying systematic failures.
- Define and track catalog and transaction health KPIs (attribute coverage, schema compliance, match rates, GPID coverage, freshness); identify root causes and drive remediation.
- Build monitoring systems and dashboards to track data quality trends across retailers, categories, and time periods.
Global Product ID (GPID) coverage & matching
- Assess GPID (GTIN/UPC/EAN) coverage and accuracy across both catalog and sales transaction data; identify gaps by category, retailer, and brand.
- Build and improve matching algorithms to link sales transactions to catalog products, handling missing GPIDs, naming inconsistencies, and category misclassification.
- Quantify the impact of GPID enrichment and matching improvements on search, deduplication, and reporting accuracy.
- Partner with external data providers and brands to improve GPID coverage and resolve identifier conflicts.
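The matching work above typically layers a deterministic identifier join over a fuzzy fallback. A minimal sketch, in Python, of that tiered approach: exact GPID match first, then a normalized-title similarity match. The records, threshold, and use of stdlib `difflib` are illustrative assumptions; a production pipeline would add blocking, richer features, and learned scoring.

```python
from difflib import SequenceMatcher

# Hypothetical catalog records; real schemas will differ.
catalog = [
    {"gpid": "00012345678905", "title": "Acme Running Shoe Blue 10"},
    {"gpid": None, "title": "Acme Trail Jacket Large"},
]

def normalize(title):
    """Lowercase and collapse whitespace so cosmetic differences don't block a match."""
    return " ".join(title.lower().split())

def match_transaction(txn, catalog, threshold=0.8):
    """Tier 1: exact GPID join. Tier 2: fuzzy title match above a threshold."""
    if txn.get("gpid"):
        for product in catalog:
            if product["gpid"] == txn["gpid"]:
                return product, 1.0
    best, best_score = None, 0.0
    for product in catalog:
        score = SequenceMatcher(
            None, normalize(txn["title"]), normalize(product["title"])
        ).ratio()
        if score > best_score:
            best, best_score = product, score
    return (best, best_score) if best_score >= threshold else (None, best_score)
```

Returning the score alongside the match makes it easy to measure match rates and diagnose unmatched transactions, the KPIs this role owns.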
Deduplication & entity resolution
- Identify product variants (size, color, packaging) and duplicates within and across retailer catalogs using clustering, entity resolution, embeddings, and similarity-based techniques.
- Build deduplication pipelines that handle catalog and transaction data at scale; define patterns, heuristics, and ML-based approaches for variant grouping.
- Measure the impact of deduplication on search quality, recommendation accuracy, and reporting; iterate on models to reduce false positives and improve precision.
- Support Data Engineering and Platform teams in productionizing deduplication and entity linking infrastructure.
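One common shape for the variant-grouping work above is pairwise similarity plus transitive clustering. A hedged sketch using token-Jaccard similarity and union-find; the threshold and the O(n^2) comparison loop are simplifications (real pipelines use blocking to avoid comparing every pair):

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two titles."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def cluster_duplicates(titles, threshold=0.6):
    """Group titles into candidate-duplicate clusters: union-find over
    pairs whose similarity clears the threshold."""
    parent = list(range(len(titles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if jaccard(titles[i], titles[j]) >= threshold:
                union(i, j)

    clusters = {}
    for i in range(len(titles)):
        clusters.setdefault(find(i), []).append(titles[i])
    return list(clusters.values())
```

Tuning the threshold is exactly the precision/false-positive trade-off the bullets above describe measuring.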
Manufacturer data quality & brand engagement
- Evaluate the consistency and accuracy of manufacturer-level attributes (brand name, MPN, manufacturer identifiers) across catalogs and transactions.
- Detect systemic issues at the brand and retailer level; build scorecards and engage brands (via the Tiger Team) to drive data quality improvements.
- Create feedback loops to measure manufacturer data quality and track progress on remediation initiatives.
Product search & retrieval infrastructure
- Research and prototype improvements to product search and retrieval pipelines, including vector search, semantic similarity, and embedding-based matching.
- Explore and implement vector database infrastructure (e.g., FAISS, Pinecone, Weaviate) to support fast, scalable product retrieval and similarity search.
- Contribute to the design and optimization of retrieval pipelines that combine text, attributes, and embeddings for search and recommendations.
- Evaluate search relevance and ranking quality; iterate on indexing strategies, query preprocessing, and re-ranking models.
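At its core, the embedding-based retrieval described above is nearest-neighbor search over product vectors. A minimal brute-force sketch for intuition; vector databases such as FAISS, Pinecone, or Weaviate replace this linear scan with approximate indexes (e.g. HNSW, IVF) to stay fast at catalog scale. The index and vectors here are made up:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query, index, k=3):
    """Score every product vector against the query and return the k best.
    This O(n) scan is what an ANN index approximates."""
    scored = [(pid, cosine(query, vec)) for pid, vec in index.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

In a hybrid pipeline, these similarity scores would be combined with text and attribute signals before re-ranking.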
Product graph & relational modeling
- Build and maintain product graph infrastructure that captures relationships between products, variants, brands, categories, retailers, and transactions.
- Use graph-based techniques (community detection, link analysis, centrality) to identify product families, detect duplicates, and surface insights on product hierarchies.
- Partner with Data Platform teams to design scalable graph storage and query patterns (e.g., Neo4j, graph extensions in BigQuery).
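A small illustration of the graph idea above: if "variant-of" links between products are stored as edges, product families fall out as connected components. The edge list is hypothetical, and a real deployment would run equivalent queries in a graph store such as Neo4j:

```python
from collections import defaultdict, deque

def product_families(edges):
    """Treat variant links as undirected edges and return connected
    components via BFS, i.e. candidate product families."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen, families = set(), []
    for node in graph:
        if node in seen:
            continue
        component, queue = [], deque([node])
        seen.add(node)
        while queue:
            current = queue.popleft()
            component.append(current)
            for neighbor in graph[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        families.append(sorted(component))
    return sorted(families)
```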
Insights, monitoring & reporting
- Systematically identify, classify, and prioritize product data quality issues; create clear summaries, visualizations, and actionable recommendations for stakeholders.
- Build and maintain dashboards and recurring reports for key product data KPIs (match rates, GPID coverage, duplicate rates, classification accuracy, attribute completeness).
- Establish alerting and anomaly detection systems to proactively surface data quality degradation and model performance issues.
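For the alerting described above, a trailing-window z-score is one simple way to flag a sudden drop in a KPI such as match rate. The window, threshold, and series below are illustrative; production systems would layer in seasonality handling and per-retailer baselines:

```python
import statistics

def kpi_alerts(series, window=7, z_threshold=3.0):
    """Flag points whose deviation from the trailing-window mean exceeds
    z_threshold standard deviations: a basic drop/spike detector for a
    daily match-rate or coverage time series."""
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            continue  # flat history: no meaningful z-score
        z = (series[i] - mean) / stdev
        if abs(z) >= z_threshold:
            alerts.append((i, round(z, 2)))
    return alerts
```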
Engineering & production deployment
- Take models and analytics prototypes from POC to production, independently or in partnership with Engineering, owning deployment, testing, monitoring, and iteration.
- Build robust, scalable data pipelines and ML workflows using production-grade tools and best practices (versioning, CI/CD, testing, observability).
- Collaborate with MLOps and Data Engineering teams to ensure production readiness: reliability, latency, drift monitoring, and SLOs.
Qualifications
Required
- Experience: 5+ years in data science, ML engineering, or analytics engineering, with at least two years focused on product data, catalog quality, entity resolution, search/retrieval, or e-commerce/marketplace analytics.
- Engineering strength: Proven ability to build production-grade data pipelines and deploy ML models independently; strong software engineering fundamentals (code quality, testing, version control, CI/CD).
- Data quality expertise: Demonstrated experience analyzing and improving large-scale structured data quality (completeness, consistency, accuracy, deduplication, entity resolution).
- ML & classification experience: Track record building and deploying classification models, ranking systems, or search/retrieval pipelines in production.
- Technical skills:
- Strong Python and SQL; proficiency with ML libraries (scikit-learn, XGBoost, LightGBM, PyTorch/TensorFlow) and data manipulation tools (pandas, PySpark).
- Experience with entity resolution, fuzzy matching, clustering, embeddings, and similarity-based techniques (Levenshtein distance, cosine similarity, nearest-neighbor search).
- Familiarity with production ML workflows (model versioning, monitoring, evaluation, retraining, A/B testing).
- Experience with data profiling, anomaly detection, and exploratory analysis at scale.
- Analytical rigor: Strong foundation in statistics and ML; ability to design experiments, validate models, interpret results, and communicate insights with business context.
- Stakeholder collaboration: Experience working cross-functionally with Product, Engineering, and business teams; ability to translate technical work into actionable recommendations.
- Education: Bachelor's in a quantitative field (CS, Statistics, Math, Engineering, or similar); Master's/PhD preferred.
Preferred / Nice to have
- Experience with vector search and embeddings (sentence transformers, OpenAI embeddings, BERT-based models) and vector databases (FAISS, Pinecone, Weaviate, Milvus, pgvector).
- Familiarity with search and retrieval systems (Elasticsearch, Solr, semantic search, BM25, hybrid ranking) and understanding how data quality impacts relevance.
- Experience with graph databases and graph analytics (Neo4j, NetworkX, graph algorithms for clustering and link prediction).
- Knowledge of NLP techniques for product data (text classification, named entity recognition, attribute extraction, title/description parsing, semantic similarity).
- Experience with multimodal modeling (combining text, images, and structured attributes for classification or retrieval).
- Familiarity with global product identifiers (GTIN/UPC/EAN, MPN, SKU hierarchies) and standards organizations (GS1, GDSN).
- Experience with deduplication and record linkage at scale (blocking strategies, probabilistic matching, hierarchical clustering).
- Familiarity with GCP tools (BigQuery, Vertex AI, Dataflow, Cloud Run, Looker) and/or Databricks/Spark for large-scale processing and deployment.
- Exposure to master data management (MDM) or data governance practices in product or catalog contexts.
- Experience with recommendation systems or understanding how product data quality impacts personalization and ranking.
What sets you apart
- Product data obsession: You care deeply about data quality and understand how poor catalog hygiene cascades into user experience, business reporting, and operational inefficiencies.
- Engineering mindset: You don't just build prototypes—you ship them. You write clean, tested, production-ready code and can own the full lifecycle from research to deployment.
- Detective instincts: You love digging into messy data, finding patterns, and uncovering root causes—whether it's a systematic retailer issue, a subtle duplicate cluster, or a classification edge case.
- Pragmatic prioritization: You balance comprehensiveness with impact, focusing on the 20% of issues that drive 80% of quality problems and business value.
- Search & retrieval intuition: You understand how product data powers search and recommendations, and you know how to build infrastructure (embeddings, vector DBs, graphs) that makes these systems work at scale.
- Stakeholder fluency: You translate messy data findings into clear, actionable recommendations and build trust with brands, retailers, Product, and Engineering teams.
- Comfort with ambiguity: You thrive in evolving data ecosystems, defining your own quality metrics and technical roadmaps when the problem space is still being shaped.
Benefits and Perks:
At impact.com, we believe that when you’re happy and fulfilled, you do your best work. That’s why we’ve built a benefits package that supports your well-being, growth, and work-life balance.
- Flexible Working: Our Responsible PTO policy means you can take the time off you need to rest and recharge. We're committed to a positive work-life balance and provide a flexible environment that allows you to be happy and fulfilled in both your career and your personal life.
- Health and Wellness: Your well-being is a priority. Our mental health and wellness benefit includes up to 12 fully covered therapy/coaching sessions per year, with additional dependent coverage. We also offer a monthly gym reimbursement policy to support your physical health.
- A Stake in Our Growth: We offer Restricted Stock Units (RSUs) as part of our total compensation, giving you a stake in the company's growth with a 3-year vesting schedule, pending Board approval.
- Investing in Your Growth: We’re committed to your continuous learning. Take advantage of our free Coursera subscription and our PXA courses.
- Parental Support: We offer a generous parental leave policy: 26 weeks of fully paid leave for the primary caregiver and 13 weeks of fully paid leave for the secondary caregiver.
- Technology Financial Support: We provide a technology stipend to help you set up your home office and a monthly allowance to cover your internet expenses.
impact.com is proud to be an equal opportunity workplace. All employees and applicants for employment shall be given fair treatment and equal employment opportunity regardless of their race, ethnicity or ancestry, color or caste, religion or belief, age, sex (including gender identity, gender reassignment, sexual orientation, pregnancy/maternity), national origin, weight, neurodivergence, disability, marital and civil partnership status, caregiving status, veteran status, genetic information, political affiliation, or other prohibited non-merit factors.