Senior/Staff Site Reliability Engineer - Infrastructure
OKX will be prioritising applicants who have a current right to work in Singapore, and do not require OKX's sponsorship of a visa.
Who We Are
At OKX, we believe that the future will be reshaped by crypto, ultimately contributing to every individual's freedom. OKX began as a crypto exchange giving millions of people access to crypto trading, and over time became one of the largest platforms in the world. In recent years, we have developed one of the most connected Web3 wallets, used by millions to access decentralized crypto applications (dApps). OKX is a brand trusted by hundreds of large institutions seeking access to crypto markets on a reliable platform that seamlessly connects with global banking and payments. In the last year, OKX has expanded into new markets including Australia, Brazil, the Netherlands, Singapore and Turkey, with plans to launch in the US, Belgium and the UAE.
We are deeply committed to shaping a fairer, more transparent and more accessible society through blockchain technology. This is why we publish proof of reserves monthly and continue to ship innovative new security features.
About the Team
The Service Reliability Engineering team envisions service stability as one of the company's core competitive advantages. By building end-to-end, chain-level risk management capabilities, we aim to identify and analyze stability risks in a sustainable, automated way, moving from "reactive governance" to "proactive governance". This approach allows us to address more stability issues before they reach users, improving the user experience.
What You’ll Be Doing
- Ensure the stability of, and optimize, big data platforms (e.g., Alibaba Cloud DataWorks, AWS EMR, AWS Databricks, Spark, Flink) and data warehouses (e.g., MaxCompute, Hologres, Hive, ClickHouse, StarRocks).
- Develop a deep understanding of middleware architecture and principles (e.g., Kafka, Spring Cloud, Nacos, Apollo, Kong Gateway) to ensure high performance and usability.
- Optimize existing runtime environments (e.g., KVM, Docker, Kubernetes, JVM) to maximize resource utilization and maintain stable service operation.
- Understand network architecture and security, and provide guidance on infrastructure stability at both layers, ensuring secure, stable, and efficient network communications.
- Lead chaos engineering exercises, coordinating with business units to validate system robustness and recovery capabilities through simulated failure scenarios.
- Respond rapidly to system failures and troubleshoot them, continuously optimising monitoring strategies to minimize downtime and ensure service continuity and stability.
- Drive infrastructure automation and intelligence to enhance SRE work efficiency and quality.
- Collaborate closely with development teams, providing technical support and advice on infrastructure to foster continuous product improvement and innovation.
What We Look For In You
- Bachelor’s degree or higher in Computer Science or a related field, with over 8 years of experience in large-scale internet or cloud computing platform development, SRE, or operations.
- In-depth knowledge of big data platforms, data warehouses, middleware, runtime environments, and network technology principles and architectures, with practical experience and troubleshooting skills.
- Proficient in Linux system management and optimization; skilled in scripting languages such as Shell and Python, with the ability to develop automation tools and scripts.
- Experienced with virtualization, container, and cloud-native technologies such as KVM, Docker, and Kubernetes, including their architectures and principles, with extensive experience in addressing common issues and failures.
- Knowledgeable in network protocols such as TCP, UDP, and QUIC; proficient with network diagnostic tools such as tcpdump, traceroute, netstat, and Wireshark, with significant practical experience in resolving network issues.
- Extensive experience with Alibaba Cloud and AWS products, covering architecture, usage, and the handling of common issues and failures.
- Experience in building service governance systems, architecture optimisation, building stability assurance, capacity management, activity support, and chaos engineering is preferred.
- Strong sense of responsibility and team spirit, with excellent problem-solving and analytical skills.
Perks & Benefits
- Competitive total compensation package
- L&D programs and Education subsidy for employees' growth and development
- Various team building programs and company events
- Wellness and meal allowances
- Comprehensive healthcare schemes for employees and dependants
- More that we would love to tell you about along the way!