Liberty Mutual Insurance

Solutions Engineer, Software – Site Reliability Engineer

full-time

Origin: 🇺🇸 United States

Salary

💰 $134,000 - $254,000 per year

Job Level

Senior, Lead

Tech Stack

Android, AWS, Azure, Cloud, Docker, Go, Java, Jenkins, JMeter, Kubernetes, Microservices, Python, ServiceNow, Splunk, Terraform, TypeScript, WordPress

About the role

  • Lead the end-to-end delivery of reliability solutions that meet customer needs while aligning with technology guardrails and strategic roadmaps.
  • Define and implement SLOs, SLIs, and error-budget policies; integrate them with CI/CD pipelines and automated quality gates.
  • Design and build cloud-native reliability tooling—auto-scaling, self-healing, blue/green and canary release frameworks—leveraging AWS services (EKS, Lambda, Fargate, Auto Scaling, Route 53, CloudWatch).
  • Implement and extend observability platforms (metrics, logs, traces, events) using Datadog, Splunk, and AWS-native services.
  • Drive Gen-AI/ML experimentation for anomaly detection, predictive scaling, and automated incident triage; transition validated prototypes into production platforms.
  • Champion infrastructure-as-code (Terraform, CloudFormation, CDK) and GitOps workflows to ensure repeatable, auditable changes.
  • Embed chaos engineering and resilience testing (Gremlin, Litmus, ChaosMesh, Fault Injection Simulator) into release pipelines.
  • Optimize incident management processes: blameless post-mortems, rapid root-cause analysis, actionable runbooks, and continuous learning loops.
  • Collaborate with Quality Engineering, Security, Architecture, and Delivery teams to create an end-to-end DevTestOps ecosystem.
  • Mentor and coach engineers, fostering a culture of reliability, automation, and customer-centric thinking.
  • Stay current on emerging technologies—container orchestration, service mesh, serverless, edge computing, Gen-AI for ops—and apply relevant innovations to ongoing work.
  • Document architectures, reliability standards, and operational playbooks for maintainability and knowledge transfer.

Requirements

  • Bachelor’s or master’s degree in Computer Science, Engineering, or a related discipline (or equivalent experience).
  • 10+ years of hands-on engineering experience, with at least 5 years focused on SRE, DevOps, or large-scale cloud operations.
  • Deep knowledge of containerization (Docker, Kubernetes/EKS), service mesh (Istio, Linkerd), and microservice architectures.
  • Practical experience with observability stacks (Datadog, Splunk).
  • Proficiency in at least one programming language (Python, Go, Java, TypeScript, or similar).
  • Familiarity with CI/CD systems (GitHub Actions, Azure DevOps, Jenkins) and release strategies (blue/green, canary, feature flags).
  • Hands-on exposure to chaos-engineering and resilience testing tools (Gremlin, ChaosMesh) and load/performance tools (k6, JMeter, LoadRunner).
  • Experience with incident management platforms (ServiceNow) and running blameless post-mortems.
  • Strong communication, facilitation, consensus-building, and stakeholder-management skills.
  • Relevant certifications (AWS DevOps, Kubernetes, observability platforms) are a plus.