Reliability Engineer - Production Support

The Hartford

full-time

Posted on: 8/21/2025

Location: 🇺🇸 United States

Salary

💰 $90,320 - $135,480 per year

Job Level

Mid-Level, Senior

Tech Stack

Assembly, AWS, Azure, Cloud, Docker, Go, Java, Kubernetes, Linux, Python, Ray, Scala, SDLC, Splunk, TCP/IP, Terraform

About the role

Operations Focus: Assist with instrumenting the code and application technology stack to generate relevant metrics on overall technology health: availability, performance, quality, currency, and resiliency.
Contribute to the architecture and software engineering teams to influence the organization's technical strategy, keeping in mind its cross-functional impacts, integration across the organization, and architecture rationalization.
DevSecOps Solution Responsibilities: Build the necessary tooling, alerts, and response mechanisms to identify and address reliability risks leveraging automation to support problem prevention, detection, mitigation, and resolution.
Enhance the delivery flow by building the appropriate solutions to increase delivery speed while adhering to technology standards for sustained reliability.
Progressively implement preventative controls and build increased automation and self-healing capabilities.
Continue to improve cost efficiency baselines.
IT / Data Engineering Responsibilities: Participate in the elimination of toil by creating automation or engineering autonomous solutions that require minimal manual effort (e.g., covering OS patching, CI/CD, and infrastructure configuration management).
Build reliable and performant data systems to support data delivery.
Build scalable SDLC environments using COTS, SaaS, and PaaS products to support data pipeline needs.
IT Ops Responsibilities: Promote operational excellence. Participate in the triage and service restoration of all high-impact incidents to minimize mean time to service restoration and impact to the business.
Demonstrate end-to-end ownership.
Partner with infrastructure Product teams to design and implement intelligent incident routing, enhanced monitoring/alerting capabilities and automated service restoration processes.
Take proactive measures to prevent high-impact incidents.
Achieve and maintain the continuity of Hartford and third-party assets that support a business function.
Be accountable for keeping the IT application and infrastructure metadata repositories current.
Application Focus: Promote the reliability (such as availability, capacity, performance) of the solution. Participate in on-call activities to mitigate incidents as quickly as possible.
Participate in the development of effective tooling, alerts, and response to both identify and address reliability risks including automatic problem detection and mitigation.
Build and operate reliable and performant data systems and services that enable the business to make data-driven decisions.
Engage with the service consumers to define functional and non-functional requirements for the solutions.
Provide training, best practices, and sample code to help consumers take full advantage of the solution.
Partner with the RE and Software Engineering teams to collect ongoing feedback and improvement backlog items for the Infrastructure Product teams.
Leverage analyst reviews, vendor offerings, and client success stories to evolve the portfolio.
Participate in relevant vendor / community / industry conferences.
Build and maintain governance policies, especially for data masking (PII management) and data lifecycle management.

Requirements

DevOps Mindset
Enjoy solving difficult engineering problems and don’t mind getting your hands dirty
Maintains personal responsibility and commitment to respond to and address incidents quickly
Good software engineering skills, ideally with experience in Java, Python, .NET, and/or Go
Understanding of Linux system internals and familiarity with the TCP/IP stack, network routing, and load balancing
Approach troubleshooting systematically and have a deep sense of ownership for whatever you work on
Ability to identify the root causes of instability in a high-traffic, distributed system
Experience with configuration and troubleshooting of Linux, Java/Scala, Docker / Kubernetes systems
Understanding of large-scale complex systems from a reliability perspective
Passion for resolving reliability issues and identifying strategies to mitigate them going forward
Knowledge of Performance and Observability tools such as Dynatrace, SumoLogic, TrueSight, CloudWatch, CloudTrail, AWS X-Ray, Splunk, and related tools
Willingness to work in an ever-changing environment
Passion for automation and innovations that improve productivity
Experience with IaC tools such as Terraform, CloudFormation, etc.
Degree in Computer Science or related discipline with a minimum of 3-5 years of work experience in IT systems operations and/or application development.
Some experience in an RE role.
Experience building and supporting enterprise contact center platforms, systems, and omnichannel applications (voice, chat, SMS, email, social, etc.)
