
Senior Site Reliability Engineer
Barbaricum is a rapidly growing government contractor providing leading-edge support to federal customers, with a particular focus on Defense and National Security mission sets. We leverage more than 17 years of support to stakeholders across the federal government, with established and growing capabilities across Intelligence, Analytics, Engineering, Mission Support, and Communications disciplines. Founded in 2008, our mission is to transform the way our customers approach constantly changing and complex problem sets by bringing to bear the latest in technology and the highest caliber of talent.
Headquartered in Washington, DC's historic Dupont Circle neighborhood, Barbaricum also has a corporate presence in Tampa, FL, Bedford, IN, and Dayton, OH, with team members across the United States and around the world. As a leader in our space, we partner with firms in the private sector, academic institutions, and industry associations with a goal of continually building our expertise and capabilities for the benefit of our employees and the customers we support. Through all of this, we have built a vibrant corporate culture diverse in expertise and perspectives with a focus on collaboration and innovation. Our teams are at the frontier of the Nation's most complex and rewarding challenges. Join our team.
Barbaricum is seeking an experienced Senior Site Reliability Engineer to support the reliability, availability, automation, and operational performance of IT and cloud systems under the Military Community and Family Policy (MC&FP) Outreach and Digital Enterprise Services (MODES) contract. You will help ensure MC&FP systems are reliable, scalable, resilient, and efficiently managed through proactive monitoring, automated incident response, performance optimization, and operational dashboards that support rapid decision-making.
Responsibilities:
- Monitor and maintain system reliability, availability, and performance across on-premises, cloud, and hybrid IT environments supporting MC&FP mission requirements.
- Implement proactive performance monitoring, automated alerting, incident response workflows, and resilience engineering practices to reduce downtime and improve operational visibility.
- Develop, maintain, and improve scalable automated infrastructure solutions that support reliable system operations and repeatable service delivery.
- Implement rollback strategies, recovery approaches, and chaos engineering practices to validate resilience, reduce operational risk, and improve system stability.
- Analyze usage patterns, capacity trends, and performance indicators to support dynamic scaling, resource optimization, and system improvement decisions.
- Develop and maintain real-time operational dashboards, reports, and metrics that enable rapid decision-making, leadership awareness, and system optimization.
- Respond to and resolve system outages, impairments, and service disruptions while coordinating with technical teams to minimize mission impact.
- Conduct post-incident reviews to identify root causes, document lessons learned, and implement preventative measures that reduce recurrence.
- Collaborate with software developers, cloud engineers, cybersecurity personnel, and operations teams to improve services, reliability patterns, deployment practices, and operational standards.
- Create and maintain system documentation, configuration standards, operational runbooks, monitoring procedures, and service reliability guidance.
- Automate common operations tasks to reduce manual workloads, improve consistency, and increase system efficiency.
- Implement security best practices across operational activities, infrastructure automation, monitoring, incident response, and system administration functions.
Required Skills:
- Expert knowledge of site reliability engineering practices, system monitoring, incident management, automation, performance tuning, and operational resilience.
- Strong understanding of Windows and Linux administration, infrastructure operations, system configuration, service management, and troubleshooting practices.
- Experience with automation platforms and configuration management tools such as Ansible, Puppet, Chef, or similar technologies.
- Proficiency with scripting languages such as Python, Shell, PowerShell, or similar tools used to automate operational and infrastructure tasks.
- Knowledge of cloud services and infrastructure across AWS, Microsoft Azure, Google Cloud, or comparable cloud environments.
- Strong understanding of network troubleshooting, configuration, connectivity analysis, system dependencies, and performance bottleneck identification.
- Ability to design, interpret, and maintain dashboards, alerts, metrics, logs, and operational reporting that support service health and decision-making.
- Ability to conduct root cause analysis, post-incident reviews, and corrective action planning in complex technical environments.
- Strong problem-solving skills and the ability to work under pressure during outages, impairments, and time-sensitive operational issues.
- Excellent written and verbal communication skills, with the ability to explain technical findings, incident impacts, and reliability recommendations to technical and non-technical stakeholders.
Required Qualifications:
- Bachelor's degree in Computer Science, Information Technology, Systems Engineering, Cybersecurity, or a related field; Master's degree preferred.
- Certifications related to cloud computing, system administration, site reliability engineering, DevSecOps, or automation are beneficial.
- 10+ years of experience in site reliability engineering, systems administration, infrastructure operations, cloud operations, DevSecOps, or a similar technical role, preferably in a government, federal, defense, or secure IT setting.
- Demonstrated experience maintaining reliable, scalable, and efficiently managed IT systems across on-premises, cloud, or hybrid environments.
- Experience developing automated infrastructure, operational scripts, monitoring solutions, dashboards, runbooks, and configuration standards.
- Experience supporting incident response, system outage resolution, post-incident reviews, root cause analysis, and operational improvement initiatives.
- Experience collaborating with development, infrastructure, cloud, cybersecurity, and program teams to improve reliability, security, and service performance.
- DoD Secret Security Clearance.
EEO Commitment
All qualified applicants will receive consideration for employment without regard to sex, race, ethnicity, age, national origin, citizenship, religion, physical or mental disability, medical condition, genetic information, pregnancy, family structure, marital status, ancestry, domestic partner status, sexual orientation, gender identity or expression, veteran or military status, or any other basis prohibited by law.