Senior Software Engineer, AI & Applications
Role Summary
As a Senior Software Engineer on the AI and Applications team, you'll own the control plane that powers AI workload submission across Firmus AI Platforms. You'll design and build unified job submission APIs, CLI, and web interfaces for training, inference, and fine-tuning workloads on Kubernetes and Slurm, implementing RBAC, multi-tenant isolation, resource quotas, and intelligent scheduling policies (priority classes, preemption, fairness). You'll create a template catalog of pre-built training and inference recipes, wire observability pipelines for per-job GPU metrics and cost tracking, and expose telemetry APIs for platform monitoring. This role requires deep Kubernetes and Slurm expertise, strong distributed systems knowledge, and close collaboration with infra, platform, and LLM engineering teams to deliver a seamless, production-grade job orchestration experience for hyperscaler customers.
Key Responsibilities
- Design and build unified job submission APIs, CLI, and web UI for all AI workload types (training, inference, fine-tuning) on Kubernetes and Slurm with Firmus AI Factory context (tenant isolation, resource requests, metadata tagging, observability hooks).
- Implement comprehensive job metadata models and schemas: track job ID, job type, tenant, user, resource requirements, priority class, timestamps, lineage, and execution status.
- Integrate authentication/authorization (RBAC) and resource quotas; enforce multi-tenant isolation at submission time across all job types.
- Build AI job scheduling and orchestration layer: priority classes, preemption policies, fairness algorithms, resource quota enforcement, and intelligent job routing.
- Build the AI Factory template catalog: discovery, parameter validation, and manifest generation for training templates, inference serving templates, and fine-tuning recipes.
- Wire job submissions to the observability pipeline: inject labels/annotations (job_id, tenant, user, model_name, job_type) so metrics are tagged per job.
- Expose job-level telemetry APIs (GPU metrics, cost accrual, MFU progression for training; latency, throughput, tokenomics for inference) for platform telemetry and monitoring.
- Extend job submission to handle inference workloads: design inference job specifications (model, batch size, latency SLA, cost constraints); integrate with inference serving APIs.
- Coordinate with platform team on observability dashboard integration, with LLM engineers on template design, and with ModelOps on reliability standards.
Skills & Experience
- 5–7 years of backend engineering experience building production APIs and distributed systems (Python, Go, or Java).
- Deep Kubernetes expertise: Job controllers, Pod specs, resource requests/limits, RBAC, network policies, and debugging.
- Hands-on Slurm experience: job submission, resource allocation, job queues, sbatch scripting.
- Strong distributed systems knowledge: scheduling algorithms, fairness, preemption, and resource management.
- Strong data modelling: can design clear schemas for job metadata, handle versioning and migrations, ensure backward compatibility.
- DevOps mindset: comfortable with observability, logging, tracing, and production troubleshooting.
- Experience with streaming APIs and real-time webhooks, and system-level integration patterns.
Key Competencies
- Job Orchestration & Scheduling: shipped job scheduling or workflow systems at scale; understands job lifecycle, failure modes, and scheduling policies.
- Multi-Tenancy Design: can architect fair resource allocation, quota enforcement, preemption, and data isolation across job types.
- API Design: RESTful or gRPC APIs that are intuitive and extensible; handles versioning gracefully.
- Systems Architecture: understands how job submission connects to training, inference, observability, cost tracking, and incident response.
- Cross-Domain Partnership: works closely with infra team, platform team, LLM engineers; clear handoff points and API contracts.
Success Metrics
- Unified orchestration adoption increases: teams use the standard job interface rather than bespoke/manual pathways.
- Scheduling effectiveness & fairness improves: predictable scheduling under contention with reduced noisy-neighbor impact.
- Orchestration reliability stays high: jobs reliably start, run, and complete across K8s/Slurm/inference integrations.
- End-to-end workflow automation increases: higher share of workflows complete without human intervention (e.g., train→register→serve).
- Interface stability & compatibility remains strong: the orchestration API evolves without breaking users.
Location & Reporting
- Singapore or Australia (Launceston, TAS or Sydney, NSW)
- Reporting to Head of AI & Applications
Employment Basis
Full-time
Diversity
At Firmus, we are committed to building a diverse and inclusive workplace. We encourage applications from candidates of all backgrounds who are passionate about creating a more sustainable future through innovative engineering solutions.
Join us in our mission to revolutionize the AI industry through sustainable practices and cutting-edge engineering. Apply now to be part of shaping the future of sustainable AI infrastructure.