
Staff Technical Lead for Inference & ML Performance
fal is pioneering the next generation of generative-media infrastructure. We're pushing the boundaries of model inference performance to power seamless creative experiences at unprecedented scale. We're looking for a Staff Technical Lead for Inference & ML Performance: someone who blends deep technical expertise with strategic vision and can guide a team building and optimizing state-of-the-art inference systems. This role is intense yet deeply impactful. Apply if you're ready to lead the future of inference performance at a fast-paced, high-growth company on the frontier of generative media.
Why this role matters
You’ll shape the future of fal’s inference engine and ensure our generative models achieve best-in-class performance. Your work directly impacts our ability to rapidly deliver cutting-edge creative solutions to users, from individual creators to global brands.
What you'll do
| Day-to-day | What success looks like |
|---|---|
| Set technical direction. Guide your team (kernels, applied performance, ML compilers, distributed inference) to build high-performance inference solutions. | fal’s inference engine consistently outperforms industry benchmarks in throughput, latency, and efficiency. |
| Hands-on IC leadership. Personally contribute to critical inference performance enhancements and optimizations. | You regularly ship code that significantly improves model serving performance. |
| Collaborate closely with research & applied ML teams. Influence model inference strategies and deployment techniques. | Inference innovations move rapidly and seamlessly from research to production deployment. |
| Drive advanced performance optimizations. Implement model parallelism, kernel optimization, and compiler strategies. | Performance bottlenecks are quickly identified and eliminated, dramatically improving inference speed and scalability. |
| Mentor and scale your team. Coach and grow your team of performance-focused engineers. | Your team independently innovates, proactively solves complex performance challenges, and consistently levels up its skills. |
You might be a fit if you
- Are deeply experienced in ML performance optimization. You've optimized inference for large-scale generative models in production environments.
- Understand the full ML performance stack. From PyTorch, TensorRT, TransformerEngine, and Triton down to CUTLASS kernels, you’ve navigated and optimized them all.
- Know inference inside-out. Expert-level familiarity with advanced inference techniques: quantization, kernel authoring, compilation, model parallelism (tensor, context/sequence, and expert parallel), distributed serving, and profiling.
- Lead from the front. You're a respected IC who enjoys getting hands-on with the toughest problems, demonstrating excellence to inspire your team.
- Thrive in cross-functional collaboration. Comfortable interfacing closely with applied ML teams, researchers, and stakeholders.
Nice-to-haves
- Experience building inference engines specifically for diffusion and generative media models
- Track record of industry-leading performance improvements (papers, open-source contributions, benchmarks)
- Leadership experience in scaling technical teams
What you'll get
One of the highest-impact roles at one of the fastest-growing companies (revenue is growing 40% month over month, our revenue run rate is up 60x+ year over year, and we raised our Series A, B, and C within the last 12 months), with a world-changing vision: hyperscaling human creativity.
Sound like your calling? Share your proudest optimization breakthrough, open-source contribution, or performance milestone with us. Let's set new standards for inference performance, together.
Apply for this job