
Senior Site Reliability Engineer
GovX
full-time
Location Type: Remote
Location: Remote • California, Colorado, Florida, New York, Tennessee, Texas, Washington • 🇺🇸 United States
Salary
💰 $165,000 - $175,000 per year
Job Level
Senior
Tech Stack
Azure, Cloud, Distributed Systems, Grafana, JavaScript, Kubernetes, Linux, Microservices, .NET, Node.js, Prometheus
About the role
- Maintain scalable, secure, and reliable cloud services that operate within their Service Level Objectives.
- Implement and manage monitoring, alerting, and observability systems using Prometheus, Grafana, and Azure Monitor to proactively identify and resolve issues.
- Develop and maintain automation scripts and tools in PowerShell, Bash, and C# to improve deployment efficiency, system reliability, and developer productivity.
- Create, refine, and maintain detailed runbooks for production systems to ensure consistent operational procedures and effective incident response.
- Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to measure and maintain system reliability.
- Collaborate with software engineers and automation engineers to integrate reliability practices into CI/CD pipelines using Azure DevOps.
- Design and implement intelligent alerting strategies that ensure high signal-to-noise ratios and enable rapid triage of critical issues.
- Participate in incident response, post-incident reviews, and blameless root cause analysis to drive continuous improvement of system reliability and uptime.
- Contribute to deployment strategy evolution, including blue-green and canary deployments, to minimize downtime and release risk.
- Collaborate closely with automation engineers to enhance automated validation and testing of production environments.
- Monitor system health, capacity, and performance, providing data-driven insights and recommendations for optimization.
- Conduct chaos engineering experiments and resilience testing to proactively identify and address system weaknesses.
- Develop and maintain disaster recovery and business continuity plans, including regular failover testing.
- Participate in the on-call rotation for platform services, ensuring high availability and rapid incident resolution.
- Proactively monitor and respond to production support tickets and alerts within established SLA timeframes, delivering first-level diagnosis, troubleshooting, and escalation as needed to maintain system reliability.
- Continuously improve incident response playbooks and reduce Mean Time to Recovery (MTTR).
- Participate in sprint planning, stand-ups, and retrospectives to ensure alignment with development and operational objectives.
- Identify opportunities to improve resiliency, reduce toil, and strengthen the reliability culture across the engineering organization.
- Collaborate with security and compliance teams to ensure infrastructure meets regulatory and security standards.
- Support cost optimization efforts by monitoring cloud resource usage and recommending efficiency improvements.
- Explore and integrate AI/ML-based observability tools for predictive monitoring and anomaly detection.
Requirements
- 8+ years of professional experience in site reliability, infrastructure, or systems engineering roles.
- Proficiency with Azure cloud infrastructure, services, and resource management.
- Experience with operating systems, networking concepts, protocols, and architecture, including Microsoft and Linux operating systems, Active Directory, and the OSI model.
- Technical ability in Node.js and .NET/C#, and knowledge of both current and legacy architecture, software development practices, and conventions.
- Strong experience with REST APIs.
- Hands-on experience with containerization and orchestration using Kubernetes and microservices architecture.
- Strong automation and scripting skills in PowerShell and Bash.
- Experience with Infrastructure as Code tools for provisioning and configuration management.
- Deep understanding of CI/CD processes and tools, preferably using Azure DevOps.
- Experience implementing and managing observability solutions, including Azure Monitor, Application Insights, Log Analytics workspaces, Prometheus, and Grafana.
- Strong problem-solving, analytical, and troubleshooting abilities in distributed systems and cloud environments.
- Ability to write, maintain, and execute operational runbooks and automation for incident management and recovery.
- Ability to work in a self-directed way and to plan and execute projects involving multiple technical resources and stakeholders.
- Excellent communication and collaboration skills, with the ability to work across software development, infrastructure, and operations teams.
Applicant Tracking System Keywords
Hard skills
Azure cloud infrastructure, PowerShell, Bash, C#, Node.js, .NET, Kubernetes, REST APIs, Infrastructure as Code, CI/CD
Soft skills
problem-solving, analytical, troubleshooting, communication, collaboration, self-directed, project planning, execution, incident management, continuous improvement