Senior Site Reliability Engineer - SRE - 12 months rolling contract
We want to make work and study more efficient and enjoyable, by providing the best digital paper solution possible. We plan to be the go-to tool for all forms of notes. Our digital paper and learning ecosystem inspires anyone to take notes, share what they know, collaborate with others, and learn as a community.
Our Values:
Dream big
—Be visionary, strategic, and open to innovation
Build great things
—Work in service of our users, always improving and pushing higher
Take ownership
—Take responsibility with bold decision-making and bias for action
Win like a sports team
—Be trusting and collaborative while empowering others
Learn and grow fast
—Never stop learning and iterate fast
Share our passion
—Share ideas and practice enthusiasm and joy
Be user obsessed
—Empathetic, inquisitive, practical
About the team:
Our engineering teams are mainly distributed across Europe and Asia. You will be among the first SREs based in the Americas.
You will be working with the Platform Team, supporting the various product teams.
- Monitoring and Logging: we are currently using Datadog for monitoring, APM, logging, CI/CD optimization, and budget and cost management. Metrics are collected by our agents, derived from the logs using metric filters, and submitted directly from Lambda functions or the application.
- Programming Languages: we have multiple microservices written in TypeScript, Go, and Kotlin.
- Databases: CockroachDB, MongoDB, Redshift, Postgres
- Infrastructure-as-Code: most of our infrastructure is defined in Terraform, and we are currently exploring CDK for self-service infrastructure.
- CI/CD: we are currently using GitHub Actions for our backend applications, and CircleCI for our iOS applications.
- Deployments: we have multiple EKS clusters set up either for Blue/Green rollouts or dedicated feature sets. We manage the workload configurations using ArgoCD and Helm.
- We are currently running dedicated stateful clusters for our CockroachDB deployments.
About the role:
Although we currently meet very high SLAs, we’d like to invest more in our service reliability by having a globally distributed set of engineers to cater to a globally distributed set of users.
We believe in SRE best practices and this includes implementing a follow-the-sun model for our on-call rotation.
This is the role for you, if you’re excited to work on the things listed below:
- Design, build, and maintain the Goodnotes infrastructure, and ensure it adheres to Dickerson’s Hierarchy of Reliability.
- Design, refine, and execute new and existing playbooks.
- Educate the various teams in SRE best practices. Support them from design and capacity planning through to rolling out new features.
- Be the go-to person for higher-level escalation for applications.
- Improve existing SLAs, and optimise latency and error rates.
- Improve the system monitoring, health reporting, and logging.
- Design and implement security measures, and assist in maintaining information security practices and procedures.
- Participate in the on-call rotation covering the Americas time zones (UTC-8 to UTC-5).
- Be open to working 5 shifts a week, which may include weekends.
The skills you will need to be successful in the above:
- Strong experience working in AWS-hosted environments.
- Strong experience in supporting production workloads and firefighting.
- Strong knowledge of SRE best practices and common issues.
- Strong experience working with system monitoring tools.
- Strong understanding and experience with distributed databases.
- Solid understanding of Linux and Networking fundamentals.
- Solid background in back-end development, including API usage and creation.
- Solid knowledge of network and container security.
- Solid understanding of container orchestration, with a particular emphasis on Kubernetes.
- Solid experience in managing relational and non-relational databases, including backup and restore operations.
- Familiarity with automation/configuration management tools, preferably CDK and/or Terraform.
The interview process:
- An introductory call with someone from our talent acquisition team. They want to hear more about your background, what you are looking for, and why you’d like to join Goodnotes.
- A hands-on take-home challenge to verify fundamental infrastructure-management skills.
- A 2-hour technical interview call with one of our engineers, covering low-level questions and some short practical exercises. This is where you get to see what it would be like working at Goodnotes, and it is your chance to ask any engineering questions you may have.
- A call with your hiring manager. This is the person who will be managing you day to day, working with you on your growth and development, and supporting you throughout your career at Goodnotes.
- A values interview with another member of the leadership team.
What’s in it for you:
- Full-time remote work
- Budget for things like noise-cancelling headphones, setting up your home office, personal development, professional training, and health & wellness
- Sponsored visits to our Hong Kong or London office every 2 years
- Company-wide annual offsite
- Medical insurance for you and your dependents
- This is a 12-month renewable fixed term contract
- We expect 40 hours of work per week (adjusted in line with local laws) across 5 days per week, covering daytime hours in Americas time zones on weekends and 3 weekdays.
Note: Employment is contingent upon successful completion of background checks, including verification of employment, education, and criminal records.