Solutions Engineer, Software – Site Reliability Engineer

Liberty Mutual Insurance

Full-time

Posted on: 9/10/2025

Location: 🇺🇸 United States

Salary

💰 $134,000 - $254,000 per year

Job Level

Senior, Lead

Tech Stack

Android, AWS, Azure, Cloud, Docker, Go, Java, Jenkins, JMeter, Kubernetes, Microservices, Python, ServiceNow, Splunk, Terraform, TypeScript, WordPress

About the role

Lead the end-to-end delivery of reliability solutions that meet customer needs while aligning with technology guardrails and strategic roadmaps.
Define and implement SLOs, SLIs, and error-budget policies; integrate them with CI/CD pipelines and automated quality gates.
Design and build cloud-native reliability tooling—auto-scaling, self-healing, blue/green and canary release frameworks—leveraging AWS services (EKS, Lambda, Fargate, Auto Scaling, Route 53, CloudWatch).
Implement and extend observability platforms (metrics, logs, traces, events) using Datadog, Splunk, and AWS-native services.
Drive Gen-AI/ML experimentation for anomaly detection, predictive scaling, and automated incident triage; transition validated prototypes into production platforms.
Champion infrastructure-as-code (Terraform, CloudFormation, CDK) and GitOps workflows to ensure repeatable, auditable changes.
Embed chaos engineering and resilience testing (Gremlin, Litmus, Chaos Mesh, AWS Fault Injection Simulator) into release pipelines.
Optimize incident management processes: blameless post-mortems, rapid root-cause analysis, actionable runbooks, and continuous learning loops.
Collaborate with Quality Engineering, Security, Architecture, and Delivery teams to create an end-to-end DevTestOps ecosystem.
Mentor and coach engineers, fostering a culture of reliability, automation, and customer-centric thinking.
Stay current on emerging technologies—container orchestration, service mesh, serverless, edge computing, Gen-AI for ops—and apply relevant innovations to ongoing work.
Document architectures, reliability standards, and operational playbooks for maintainability and knowledge transfer.

Requirements

Bachelor’s or master’s degree in Computer Science, Engineering, or a related discipline (or equivalent experience).
10+ years of hands-on engineering experience, with at least 5 years focused on SRE, DevOps, or large-scale cloud operations.
Deep knowledge of containerization (Docker, Kubernetes/EKS), service mesh (Istio, Linkerd), and microservice architectures.
Practical experience with observability stacks (Datadog, Splunk).
Proficiency in at least one programming language (Python, Go, Java, TypeScript, or similar).
Familiarity with CI/CD systems (GitHub Actions, Azure DevOps, Jenkins) and release strategies (blue/green, canary, feature flags).
Hands-on exposure to chaos engineering and resilience testing tools (Gremlin, Chaos Mesh) and load/performance testing tools (k6, JMeter, LoadRunner).
Experience with incident management platforms (ServiceNow) and running blameless post-mortems.
Strong communication, facilitation, consensus-building, and stakeholder-management skills.
Relevant certifications (AWS DevOps, Kubernetes, observability platforms) are a plus.