
Senior Site Reliability Engineer (Crypto Exchange)
We are working with a decentralised exchange that aims to combine the best of centralised exchanges (CEXs) and decentralised exchanges (DEXs), with a focus on building a safe, simple and scalable platform for trading. They differentiate themselves by offering institutional-level systems and support whilst remaining on-chain and decentralised.
We are seeking a Senior Site Reliability Engineer to join the team and help ensure the stability, scalability, and performance of a cutting-edge platform. You will balance production reliability with engineering-driven automation, reducing manual processes through innovative tooling and process improvements. This role requires a strong commitment to on-call ownership and a passion for building resilient, observable, and self-healing infrastructure.
Key Responsibilities
- Design, implement, and maintain scalable infrastructure for a high-performance, low-latency trading platform.
- Operate and enhance Kubernetes and Nomad-based environments to ensure system stability, scalability, and security.
- Develop infrastructure automation and deployment pipelines using Terraform, Ansible, ArgoCD, and GitHub Actions.
- Collaborate with engineering teams to streamline service onboarding, automate repetitive tasks, and improve deployment efficiency.
- Enhance observability and reliability through improved logging, metrics, tracing, and alerting using the Grafana ecosystem.
- Perform root cause analysis and postmortems for production incidents, driving continuous improvements in system resilience and incident response.
- Work with security and compliance teams to ensure infrastructure meets regulatory and organizational standards.
- Support multi-environment deployments (dev, staging, testnet, mainnet) with a focus on safe rollouts, rollbacks, and configuration management.
- Contribute to capacity planning, cost optimization, and infrastructure scaling strategies to support platform growth.
Experience & Skills Requirements
- 5+ years of relevant experience as a DevOps engineer or SRE.
- Proven ability to participate in an on-call rotation, demonstrating ownership in incident response and a focus on long-term system stability.
- Extensive experience operating and maintaining low-latency, distributed systems in production environments.
- Proficiency with cloud-native platforms and container orchestration tools, including AWS, GCP, Kubernetes, and Nomad.
- Strong knowledge of Linux/Unix internals and the TCP/IP networking stack.
- Proficiency in one or more of: Bash, Go, or Python.
- Expertise in root cause analysis, performance tuning, and system-level debugging in complex service architectures.
- Experience building and managing end-to-end infrastructure, including infrastructure as code, CI/CD pipelines, and monitoring systems.
- Familiarity with modern GitOps workflows and tools such as GitHub Actions, ArgoCD, Argo Workflows, and Argo Events.
- Ability to own production systems end-to-end, from infrastructure as code to automated monitoring and deployment workflows.
- Pragmatic approach with a focus on depth, ownership, and a bias for action over broad familiarity.
- Bonus: Experience with the Aeron messaging system is a strong advantage.