Software Engineer, Reliability & Availability
About NewsBreak
NewsBreak is redefining the way users interact with local news and their communities. By bridging local users, local content creators, and local businesses, our mission is to foster safer, more vibrant, and authentically connected lives. Through robust collaborations with thousands of local publishers and businesses across the nation, NewsBreak is revolutionizing how a new wave of readers access and engage with essential, locally sourced content & information.
Since our inception in 2015, our trajectory has been nothing short of remarkable. We proudly stand as the nation’s premier local news app.
We are a Series-C unicorn startup headquartered in the tech hub of Mountain View, California, with other offices in New York City and Seattle. For more information, visit www.newsbreak.com/about
About the role
As a Software Engineer in Reliability & Availability, you will be responsible for ensuring the stability, scalability, and resiliency of our cloud infrastructure and services. Working at the core of SRE, system performance, and availability management, you will design robust solutions to minimize downtime, optimize performance, and enhance system reliability. Your focus will be on AWS cloud infrastructure, Kubernetes (EKS), and big data processing (EMR), implementing high availability, fault tolerance, and self-healing mechanisms for distributed systems. Through automation, proactive monitoring, and incident response, you will help maintain seamless operations across our cloud-native platforms.
Responsibilities
- Ensure service reliability and availability by designing and implementing fault-tolerant architectures leveraging AWS, EKS (Elastic Kubernetes Service), and EMR (Elastic MapReduce).
- Build, automate, and optimize infrastructure for high-performance, scalable, and resilient cloud services.
- Develop monitoring, observability, and alerting solutions to proactively detect and mitigate service degradation and performance bottlenecks.
- Improve service lifecycle management, from capacity planning and launch reviews to post-incident analysis and continuous optimization.
- Enhance auto-scaling mechanisms to dynamically adjust resources and maintain system stability under varying workloads.
- Drive automation using Infrastructure-as-Code (IaC) and CI/CD pipelines to minimize manual intervention and improve service resilience.
- Engage in on-call rotations, manage incidents, conduct blameless postmortems, and drive long-term reliability improvements.
Requirements
- BS or MS in Computer Science, Engineering, or a related field, with at least 2 years of experience in SRE, DevOps, or Infrastructure Engineering roles.
- Strong programming experience in at least one of the following: C, C++, Java, Python, or Go.
- Hands-on experience with cloud platforms (AWS, GCP, or Azure), with a strong emphasis on AWS services (EKS, EMR, EC2, RDS, S3).
- Deep understanding of Kubernetes (EKS) and containerized workloads, including scaling, monitoring, and failure recovery strategies.
- Strong experience with monitoring tools (Prometheus, Grafana), log management (ELK, CloudWatch, Splunk), and distributed tracing and profiling solutions.
- Extensive experience supporting production Internet services, troubleshooting performance issues, and implementing high-availability strategies.
- Strong problem-solving and debugging skills, with a systematic approach to incident response, root cause analysis, and continuous improvement.
Annual Base Pay Range
$130,000 - $260,000 USD