Senior Site Reliability Engineer

GovX

full-time

Posted on: 10/29/2025

Location Type: Remote

Location: Remote • California, Colorado, Florida, New York, Tennessee, Texas, Washington • 🇺🇸 United States

Salary

💰 $165,000 - $175,000 per year

Job Level

Senior

Tech Stack

Azure, Cloud, Distributed Systems, Grafana, JavaScript, Kubernetes, Linux, Microservices, .NET, Node.js, Prometheus

About the role

Maintain scalable, secure, and reliable cloud services that operate within their Service Level Objectives.
Implement and manage monitoring, alerting, and observability systems using Prometheus, Grafana, and Azure Monitor to proactively identify and resolve issues.
Develop and maintain automation scripts and tools in PowerShell, Bash, and C# to improve deployment efficiency, system reliability, and developer productivity.
Create, refine, and maintain detailed runbooks for production systems to ensure consistent operational procedures and effective incident response.
Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to measure and maintain system reliability.
Collaborate with software engineers and automation engineers to integrate reliability practices into CI/CD pipelines using Azure DevOps.
Design and implement intelligent alerting strategies that ensure high signal-to-noise ratios and enable rapid triage of critical issues.
Participate in incident response, post-incident reviews, and blameless root cause analysis to drive continuous improvement of system reliability and uptime.
Contribute to deployment strategy evolution, including blue-green and canary deployments, to minimize downtime and release risk.
Collaborate closely with Automation Engineers to enhance automated validation and testing of production environments.
Monitor system health, capacity, and performance, providing data-driven insights and recommendations for optimization.
Conduct chaos engineering experiments and resilience testing to proactively identify and address system weaknesses.
Develop and maintain disaster recovery and business continuity plans, including regular failover testing.
Participate in the on-call rotation for platform services, ensuring high availability and rapid incident resolution.
Proactively monitor and respond to production support tickets and alerts within established SLA timeframes, delivering first-level diagnosis, troubleshooting, and escalation as needed to maintain system reliability.
Continuously improve incident response playbooks and reduce Mean Time to Recovery (MTTR).
Participate in sprint planning, stand-ups, and retrospectives to ensure alignment with development and operational objectives.
Identify opportunities to improve resiliency, reduce toil, and strengthen the reliability culture across the engineering organization.
Collaborate with security and compliance teams to ensure infrastructure meets regulatory and security standards.
Support cost optimization efforts by monitoring cloud resource usage and recommending efficiency improvements.
Explore and integrate AI/ML-based observability tools for predictive monitoring and anomaly detection.

Requirements

8+ years of professional experience in site reliability, infrastructure, or systems engineering roles.
Proficiency with Azure cloud infrastructure, services, and resource management.
Experience with Microsoft and Linux operating systems, Active Directory, and network concepts, protocols, and architecture, including the OSI model.
Technical ability in Node.js and .NET/C#, and knowledge of both current and legacy architecture, software development practices, and conventions.
Strong experience with REST APIs.
Hands-on experience with containerization and orchestration using Kubernetes and microservices architecture.
Strong automation and scripting skills in PowerShell and Bash.
Experience with Infrastructure as Code tools for provisioning and configuration management.
Deep understanding of CI/CD processes and tools, preferably using Azure DevOps.
Experience implementing and managing observability solutions, including Azure Monitor, Application Insights, Log Analytics workspaces, Prometheus, and Grafana.
Strong problem-solving, analytical, and troubleshooting abilities in distributed systems and cloud environments.
Ability to write, maintain, and execute operational runbooks and automation for incident management and recovery.
Ability to work in a self-directed manner and to plan and execute projects involving multiple technical resources and stakeholders.
Excellent communication and collaboration skills, with the ability to work across software development, infrastructure, and operations teams.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

Azure cloud infrastructure, PowerShell, Bash, C#, Node JS, .NET, Kubernetes, Rest APIs, Infrastructure as Code, CI/CD

Soft skills

problem-solving, analytical, troubleshooting, communication, collaboration, self-directed, project planning, execution, incident management, continuous improvement