Director of Platform Engineering
About Upshop
Upshop is the market leader in Total Store Operation solutions for the Grocery and C-Store markets. We offer an AI-powered SaaS platform connecting Fresh, Center, eCommerce, and DSD department operations to deliver a simplified, smarter, more connected store experience. Customers running Upshop realize significant improvements in sales, shrink, food safety, and sustainability across the entire store. 450+ retail chain accounts trust our software in 50k+ stores across 35 countries and 3 continents.
Overview of the role
We are seeking an experienced and strategic Director of Platform Engineering to lead the team responsible for the infrastructure, tools, and processes that power our mission-critical platform. This system is at the heart of food retail operations and plays a vital role in ensuring the seamless operation of the global food supply chain, especially in the US. As the Director of Platform Engineering, you will own and evolve the platform's operational excellence, ensuring it is scalable, reliable, and cost-efficient, while enabling an exceptional developer experience across the organization.
Requirements
Technical Expertise:
- Extensive experience with cloud hosting platforms (Azure and GCP preferred).
- Proven expertise in building and managing CI/CD pipelines and developer experience tools.
- Strong background in observability, monitoring, tracing, and alerting technologies.
- Deep understanding of SRE practices and deployment strategies (multi-region, rolling, and canary).
- Proficiency in infrastructure as code (e.g., Terraform, Ansible) and high availability systems.
- Experience with cost management and optimization in cloud environments.
- Experience monitoring and configuring Azure App Service plans, Azure Functions, Cosmos DB, Service Bus, Event Grid, SignalR, Azure Table Storage, and other Azure technologies.
- Strong experience in networking, network security, and security operations.
Leadership Skills:
- Proven ability to build and lead high-performing technical teams.
- Exceptional communication and collaboration skills across technical and non-technical audiences.
- Experience managing incident response processes and fostering a culture of operational excellence.
Strategic Thinking:
- Visionary leadership with the ability to balance long-term strategy with day-to-day operations.
- Commitment to proactively detecting and resolving system issues before they impact customers.
- Passion for building robust systems that are critical to societal infrastructure, such as the global food supply chain, especially in the US.
Preferred Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- Minimum of 10 years of experience in platform engineering, site reliability engineering, or related technical roles, with at least 5 years in leadership positions.
- Strong experience in Microsoft Azure, serverless, containers, event-driven systems (Kafka, Service Bus), and relational and document data stores.
- Experience in mission-critical systems, especially in industries like retail, logistics, or supply chain.
- Familiarity with synthetic testing and application performance monitoring tools (e.g., Datadog, New Relic, Prometheus).
- Track record of successful disaster recovery planning and execution.
Key Responsibilities
Infrastructure & Deployment:
- Lead the design, implementation, and optimization of tools and processes that accelerate development and deployment while enhancing the developer experience.
- Drive multi-region deployments, rolling updates, and canary deployments, ensuring minimal disruption to production systems.
Observability & Monitoring:
- Establish and maintain comprehensive observability and monitoring systems to ensure complete visibility into platform health.
- Develop a “one pane of glass” solution for system-wide health monitoring, enabling proactive issue detection and resolution.
Site Reliability Engineering:
- Champion best practices in SRE to deliver high availability and optimal performance of mission-critical systems.
Incident Management:
- Develop and oversee incident management and escalation processes, coordinating across Platform Engineering and other teams to swiftly address and resolve critical issues.
Cloud Hosting & Cost Management:
- Manage cloud hosting platforms (primarily Azure, with some GCP), foster vendor relationships, and optimize cloud usage and costs.
- Monitor and manage cloud expenditures, aligning costs with organizational goals while maintaining platform performance and scalability.
Continuous Monitoring:
- Collaborate with QA and Engineering to implement synthetic tests and monitor application metrics for production deployments, ensuring consistent reliability.
Team Leadership & Security:
- Build, mentor, and retain a high-performing Platform Engineering team, fostering a culture of collaboration and continuous improvement.
- Partner with Security to address code dependency vulnerabilities and with Engineering to manage end-of-life dependencies effectively.
Disaster Recovery & Best Practices:
- Own and continuously improve disaster recovery (DR) plans and processes, ensuring regular testing and validation.
- Ensure adherence to best practices such as infrastructure as code, high availability configurations, and operational resilience.
Architecture & Scaling:
- Work closely with Engineering teams to design platform architectures that are scalable, reliable, and efficient, meeting current and future needs.
What We Offer
- Competitive salary and benefits package.
- Opportunity to lead a team responsible for a platform critical to the global food supply chain, especially in the US.
- A collaborative, innovative environment where your impact will be both strategic and hands-on.
- The ability to shape and influence the technical direction of a mission-critical system.
Join Us
If you are ready to lead a world-class Platform Engineering team and take ownership of a system critical to the global food supply chain, we encourage you to apply and help us drive the future of food supply technology.