Lead Site Reliability Engineer (SRE)
Position: Lead Site Reliability Engineer (SRE)
Company: Optimal Market Technologies, LLC
Location: Chicago or NYC, hybrid
Salary: $150,000 - $250,000 base salary + discretionary bonus (commensurate with experience)
Position Overview:
We're seeking a Lead Site Reliability Engineer to oversee our production systems administration. We are growing from hands-on, individual-knowledge work to an engineering-run discipline: automated, reliable, and built on published standards. You will build and lead our systems administration function, professionalize how we run our infrastructure, and reduce key-person risk, partnering with the development team to keep the firm running reliably and moving fast. This is a hands-on role; you will build and operate what you put in place. You will report to the CTO.
Our environment:
We run an automated trading system with single-digit-millisecond latency requirements on bare-metal Linux. We use Azure for development environments, storage, and offline studies, not for production execution. Systems are written in C++, Python, and SQL. We are actively modernizing and upgrading technologies (e.g. CentOS 7 to RHEL 9), with legacy and new systems running in parallel. You will lead the rollout of a stream of technology changes.
Primary Responsibilities:
Infrastructure and systems administration:
- Own how production runs across colocation and the cloud: deployment, capacity, and failover.
- Build and lead the systems administration function: mentor existing staff, set how the function works, and hire as we grow.
- Set and publish the engineering standards and strategy for how we run production.
- Perform hands-on Linux and network administration; automate routine work through Infrastructure as Code.
- Manage vendors and service agreements; advise on build-vs-contract-out.
- Own infrastructure security: hardening, access control, recoverable backups, and security incident response.
Production support, incident response, resilience, and performance:
- Assist with first-line production support, reducing reliance on the development team.
- Be accountable for production stability: track what breaks and why, and turn repeat firefighting into automation that prevents it.
- Own incident response, on-call, and post-incident review; coverage is market-hours plus a support rotation.
- Own recovery runbooks and recovery drills.
- Automate client self-service for common issues and access to their own data, reducing manual support work.
- Partner with the development team on deployments, performance tracking, and capacity planning.
What We're Looking For:
- Strong scripting and automation skills.
- Strong hands-on Linux and network administration.
- Expertise with Infrastructure as Code (we are open on which tools).
- Experience using AI tools, ideally Claude Code.
- A track record of owning production support and incident response.
- Experience managing and developing technical staff.
- The ability to bring structure, standards, and strategy to a function as it grows and matures.
- Strong communication; effective with senior stakeholders and a small team.
Preferred Qualifications:
- Experience at a start-up, or building a new line or function inside a larger firm; comfortable under resource constraints and automation-first by instinct.
- Experience in real-time critical systems.
