GovX

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: Remote • California, Colorado, Florida, New York, Tennessee, Texas, Washington • 🇺🇸 United States

Salary

💰 $165,000 - $175,000 per year

Job Level

Senior

Tech Stack

Azure • Cloud • Distributed Systems • Grafana • JavaScript • Kubernetes • Linux • Microservices • .NET • Node.js • Prometheus

About the role

  • Maintain scalable, secure, and reliable cloud services that operate within Service Level Objectives.
  • Implement and manage monitoring, alerting, and observability systems using Prometheus, Grafana, and Azure Monitor to proactively identify and resolve issues.
  • Develop and maintain automation scripts and tools in PowerShell, Bash, and C# to improve deployment efficiency, system reliability, and developer productivity.
  • Create, refine, and maintain detailed runbooks for production systems to ensure consistent operational procedures and effective incident response.
  • Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets to measure and maintain system reliability.
  • Collaborate with software engineers and automation engineers to integrate reliability practices into CI/CD pipelines using Azure DevOps.
  • Design and implement intelligent alerting strategies that ensure high signal-to-noise ratios and enable rapid triage of critical issues.
  • Participate in incident response, post-incident reviews, and blameless root cause analysis to drive continuous improvement of system reliability and uptime.
  • Contribute to deployment strategy evolution, including blue-green and canary deployments, to minimize downtime and release risk.
  • Collaborate closely with automation engineers to enhance automated validation and testing of production environments.
  • Monitor system health, capacity, and performance, providing data-driven insights and recommendations for optimization.
  • Conduct chaos engineering experiments and resilience testing to proactively identify and address system weaknesses.
  • Develop and maintain disaster recovery and business continuity plans, including regular failover testing.
  • Participate in the on-call rotation for platform services, ensuring high availability and rapid incident resolution.
  • Proactively monitor and respond to production support tickets and alerts within established SLA timeframes, providing first-level diagnosis and troubleshooting, and escalating as needed to maintain system reliability.
  • Continuously improve incident response playbooks and reduce Mean Time to Recovery (MTTR).
  • Participate in sprint planning, stand-ups, and retrospectives to ensure alignment with development and operational objectives.
  • Identify opportunities to improve resiliency, reduce toil, and strengthen the reliability culture across the engineering organization.
  • Collaborate with security and compliance teams to ensure infrastructure meets regulatory and security standards.
  • Support cost optimization efforts by monitoring cloud resource usage and recommending efficiency improvements.
  • Explore and integrate AI/ML-based observability tools for predictive monitoring and anomaly detection.

Requirements

  • 8+ years of professional experience in site reliability, infrastructure, or systems engineering roles.
  • Proficiency with Azure cloud infrastructure, services, and resource management.
  • Experience with operating systems, networking concepts, protocols, and architecture, including Microsoft and Linux operating systems, Active Directory, and the OSI model.
  • Technical ability in Node.js and .NET/C#, and knowledge of both current and legacy architecture, software development practices, and conventions.
  • Strong experience with REST APIs.
  • Hands-on experience with containerization and orchestration using Kubernetes and microservices architecture.
  • Strong automation and scripting skills in PowerShell and Bash.
  • Experience with Infrastructure as Code tools for provisioning and configuration management.
  • Deep understanding of CI/CD processes and tools, preferably using Azure DevOps.
  • Experience implementing and managing observability solutions, including Azure Monitor, Application Insights, Log Analytics Workspaces, Prometheus, and Grafana.
  • Strong problem-solving, analytical, and troubleshooting abilities in distributed systems and cloud environments.
  • Ability to write, maintain, and execute operational runbooks and automation for incident management and recovery.
  • Ability to work in a self-directed manner and to plan and execute projects involving multiple technical resources and stakeholders.
  • Excellent communication and collaboration skills, with the ability to work across software development, infrastructure, and operations teams.

Applicant Tracking System Keywords

Hard skills
Azure cloud infrastructure • PowerShell • Bash • C# • Node.js • .NET • Kubernetes • REST APIs • Infrastructure as Code • CI/CD
Soft skills
problem-solving • analytical • troubleshooting • communication • collaboration • self-directed • project planning • execution • incident management • continuous improvement