Principal Debug and SRE Lead

Gdańsk, Pomeranian Voivodeship, Poland; Warszawa, Masovian Voivodeship, Poland

Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. With AI redefining the computing paradigm, solutions must evolve to unify innovations in software models, compilers, platforms, networking, and semiconductors. Our diverse team of technologists has developed a high-performance RISC-V CPU from scratch and shares a passion for AI and a deep desire to build the best AI platform possible. We value collaboration, curiosity, and a commitment to solving hard problems. We are growing our team and looking for contributors of all seniorities.

Tenstorrent is building next-generation AI systems powered by custom silicon, large-scale distributed infrastructure, and advanced software platforms. The Debug & Site Reliability Engineering team is responsible for ensuring the reliability, observability, and operational excellence of the environments that enable AI hardware and software development. This team partners closely with silicon, firmware, software, validation, and infrastructure engineers to diagnose complex issues, improve engineering workflows, and keep critical systems operating at scale.

As the Principal Debug & Site Reliability Engineering Lead, you will combine deep technical expertise with engineering leadership to guide a team responsible for debugging complex hardware and software interactions, improving operational efficiency, and driving long-term reliability across development infrastructure. Your work will help accelerate engineering productivity while shaping the processes, tooling, and technical direction that support Tenstorrent's next-generation AI platforms.

This role is hybrid, based out of Kraków, Poland.

We welcome candidates at various experience levels for this role. During the interview process, candidates will be assessed for the appropriate level, and offers will align with that level, which may differ from the one in this posting.

Who You Are

  • Experienced technical leader with 10+ years building and operating complex software, infrastructure, site reliability, or systems engineering environments.
  • Strong Linux systems expert with deep experience debugging issues across operating systems, networking, distributed services, hardware, and firmware.
  • Proficient in automation and software development using Python, C++, Go, Bash, or similar languages, with experience building scalable engineering tools.
  • Familiar with observability platforms such as Prometheus, Grafana, OpenTelemetry, ELK, or similar technologies used to monitor large-scale production systems.
  • Passionate about mentoring engineers, driving technical execution, and improving reliability through automation, operational excellence, and cross-functional collaboration.

What We Need

  • Lead a team responsible for the reliability, observability, and operational health of engineering infrastructure supporting AI hardware and software development.
  • Drive root-cause analysis and resolution of complex issues spanning silicon, firmware, operating systems, networking, distributed software, and development infrastructure.
  • Build and improve debugging methodologies, monitoring systems, automation, and engineering workflows that increase productivity and reduce operational overhead.
  • Partner closely with silicon, firmware, software, validation, and infrastructure teams to prioritize work, resolve critical issues, and improve platform reliability.
  • Mentor engineers, establish technical direction, and drive execution across key initiatives that support long-term engineering success.

What You Will Learn

  • How next-generation AI hardware and software platforms are developed, validated, and deployed at scale.
  • Advanced debugging techniques across silicon, firmware, operating systems, networking, infrastructure, and distributed software.
  • How custom RISC-V processors, AI accelerators, and large-scale AI compute clusters are monitored, operated, and optimized.
  • How engineering organizations coordinate across hardware and software disciplines to deliver highly reliable AI infrastructure.
  • How technical leadership influences the architecture, reliability, and operational strategy behind one of the industry's most ambitious AI computing platforms.

Tenstorrent offers a highly competitive compensation package and benefits, and we are an equal opportunity employer.

This offer of employment is contingent upon the applicant being eligible to access U.S. export-controlled technology. Due to U.S. export laws, including those codified in the U.S. Export Administration Regulations (EAR), the Company is required to ensure compliance with these laws when transferring technology to nationals of certain countries (such as EAR Country Groups D:1, E:1, and E:2). These requirements apply to persons located in the U.S. and all countries outside the U.S. As the position offered will have direct and/or indirect access to information, systems, or technologies subject to these laws, the offer may be contingent upon your citizenship/permanent residency status or ability to obtain prior license approval from the U.S. Commerce Department or applicable federal agency. If employment is not possible due to U.S. export laws, any offer of employment will be rescinded.
