
Distinguished Engineer - Inference Serving Network and Storage

Austin, Texas, United States

About us

Graphcore is a globally recognized leader in Artificial Intelligence computing systems. The company designs advanced semiconductors and data center hardware that provide the specialized processing power needed to drive AI innovation, while delivering the efficiency required to support its broader adoption.

As part of the SoftBank Group, Graphcore is a member of an elite family of companies responsible for some of the world’s most transformative technologies.

Job Summary

We are seeking a Distinguished Engineer to lead the networking and storage architecture for a new inference serving initiative. This is a chief technologist role for the serving fabric and data path, responsible for defining and driving the end-to-end strategy for networking, storage, observability, provisioning, and automation in support of large-scale AI inference services.

You will shape core technical decisions that directly influence product capability, service differentiation, and competitive advantage. On the networking side, you will lead the design of the serving fabric, inter-partition latency path, management network, QoS and transport tuning, segmentation, observability, and automation. On the storage side, you will define the architecture for model artifact storage, checkpoint distribution, KV and session tiering and restore, telemetry and log storage, and backup and disaster recovery.

Storage is expected to be a critical component of inference serving at scale, particularly for KV cache management, state movement, and service resiliency. You will therefore set technical direction across both networking and storage domains as first-class pillars of the platform.

This is a Grade 7 role for a recognized expert and thought leader who can convert strategic thinking into tangible group-level impact, lead a small team, and exert influence across functions and external partners.

The Team

You will sit within the System Engineering group and work across organizational boundaries with ML software, applied AI, hardware and systems, inference service teams, and other platform and infrastructure groups. You will also engage closely with external partners responsible for key elements of the inference service stack, as well as business counterparts who depend on differentiated service capabilities, reliability, and scale.

This role requires strong technical leadership without relying solely on formal authority. You will be expected to align stakeholders, make architectural trade-offs clear, and drive execution across multiple teams while raising the technical bar for the broader organization.

Responsibilities and Duties

  • Define and coordinate the networking architecture for inference serving, including serving fabric build, inter-partition latency path optimization, and management network architecture.  
  • Lead the strategy for QoS, transport tuning, traffic isolation, segmentation, and service differentiation to support multiple inference SLAs and workload classes.
  • Drive the build of monitoring, resource prioritization, and automated management frameworks for network and storage systems at production scale.  
  • Define the storage architecture for model artifact repositories, checkpoint distribution, session state, telemetry and log storage, backup, and disaster recovery.
  • Lead the design of KV cache storage, tiering, restore, and movement mechanisms as a core platform capability for large-scale inference serving.
  • Optimize network and storage subsystems for demanding AI and HPC workloads, balancing throughput, latency, resiliency, cost, and operational simplicity.
  • Work with ML software and inference service teams to develop infrastructure that supports current methods for deploying large language models, including disaggregated prefill/decode paths, continuous batching, and large-model scaling techniques.
  • Guide architecture for scaling models that use tensor, pipeline, expert, and other parallelism strategies, ensuring the serving infrastructure supports efficient execution and state movement.
  • Establish performance models, benchmarks, and tuning methodologies for end-to-end serving behavior, including tail latency, throughput stability, and recovery characteristics.
  • Lead a small cross-functional team while providing technical direction and architectural oversight across a wider matrixed organization.
  • Influence roadmap, standards, and implementation choices across internal teams and external partners.
  • Act as the senior technical authority for this domain, identifying risks early, resolving complex trade-offs, and ensuring the platform evolves in line with business and product needs.

Candidate Profile

Essentials

  • MS or PhD in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent practical experience.
  • Significant industry experience, typically 15+ years, in large-scale systems, distributed infrastructure, or platform architecture.
  • Deep expertise in networking and storage software at scale, including architecture, implementation, configuration, and performance optimization.
  • Proven experience designing and operating networking and storage systems for demanding applications in AI, HPC, or large-scale cloud environments.
  • Strong understanding of high-performance transport, congestion and flow control, QoS, segmentation, telemetry, and production observability.
  • Strong understanding of distributed storage architectures, artifact distribution, checkpointing, caching, replication, backup, disaster recovery, and operational resilience.
  • Demonstrated ability to architect low-latency, high-throughput systems where network and storage behavior materially affect application performance.
  • Experience leading highly ambiguous, cross-functional technical initiatives with impact across multiple teams or product areas.
  • Strong communication and influencing skills, with the ability to align senior technical and business stakeholders.
  • Track record as a recognized expert who drives strategy, shapes technical direction, and delivers solutions beyond existing precedents.

Desirable

  • Familiarity with emerging LLM serving techniques and their infrastructure requirements.
  • Experience with prefill/decode disaggregated inference, continuous batching, and differentiated inference services with multiple SLA and QoS tiers.
  • Understanding of model scaling and serving approaches involving tensor, pipeline, expert, and related parallelism techniques.
  • Experience with KV cache management, tiering, restore, and memory/storage trade-offs in inference systems.
  • Knowledge of modern inference serving algorithms, schedulers, and system-level optimization techniques.
  • Experience working with external technology partners, suppliers, or ecosystem collaborators in the delivery of complex infrastructure platforms.
  • Background in production-grade automation and provisioning systems for large infrastructure estates.

Benefits

In addition to a competitive salary, Graphcore offers a comprehensive benefits package. We welcome people of different backgrounds and experiences; we're committed to building an inclusive work environment that makes Graphcore a great home for everyone. We offer an equal opportunity process and understand that there are visible and invisible differences in all of us. We can provide a flexible approach to interviews and encourage you to talk to us if you require any reasonable adjustments.

