Salary
💰 $135,000 - $145,000 per year
Tech Stack
Azure, Cloud, Grafana, Prometheus, Python, Terraform
About the role
- Ensure availability, performance, scalability, and operational efficiency of the Informatix cloud platform.
- Reduce manual operational toil through automation and engineering solutions.
- Serve as a primary contributor to the on-call rotation to maintain 24/7 uptime for production systems.
- Proactively monitor and continuously improve SLAs, SLOs, and SLIs across critical services.
- Develop and maintain observability tooling including logging, metrics, and tracing (e.g., Azure Monitor, OpenTelemetry, Prometheus).
- Conduct postmortems and root cause analysis; implement fixes to prevent repeat incidents.
- Design and maintain automated incident detection and response systems; establish runbooks and escalation protocols.
- Identify and eliminate manual operational toil through scripting and automation.
- Contribute to chaos testing and failure injection to proactively uncover weaknesses.
- Promote a culture of operational excellence through data-driven reliability practices.
- Proactively communicate system and incident status.
Requirements
- 5+ years of experience in Site Reliability Engineering, systems engineering, or DevOps roles.
- Expertise in monitoring and observability platforms (e.g., Grafana, Prometheus, ELK, Azure Monitor).
- Solid background in incident response, root cause analysis, and on-call rotations.
- Deep knowledge of Microsoft Azure, including containerized services (AKS), networking, and storage.
- Strong automation and scripting experience (e.g., Python, Bash, PowerShell).
- Familiarity with IaC tools such as Terraform, Bicep, or ARM templates.
- Experience implementing SLIs/SLOs, operational dashboards, and error budgets.
- Comfortable designing for resiliency, failover, and graceful degradation.
- Knowledge of compliance frameworks (e.g., SOC 2, HITRUST, IEC 62304) is a plus.
- Strong written and verbal communication with a focus on transparency and learning.
- BS/MS in Computer Science, Engineering, or related technical field preferred.
- 5+ years in production engineering roles with direct ownership of critical systems.
- Microsoft certifications a plus.
- For US roles requiring hospital access: must be eligible for and maintain hospital credentials, and must meet applicable vaccination requirements.