The Hartford

Reliability Engineer - Production Support

full-time

Origin: 🇺🇸 United States

Salary

💰 $90,320 - $135,480 per year

Job Level

Mid-Level, Senior

Tech Stack

Assembly, AWS, Azure, Cloud, Docker, Go, Java, Kubernetes, Linux, Python, Ray, Scala, SDLC, Splunk, TCP/IP, Terraform

About the role

  • Operations Focus: Assists with instrumenting code/application technology stack to enable the generation of relevant metrics on overall technology health - availability, performance, quality, currency, and resiliency.
  • Contributes to the architecture and software engineering teams to influence the technical strategy for the organization, keeping in mind its cross-functional impacts, integration across the organization, and architecture rationalization.
  • DevSecOps Solution Responsibilities: Build the necessary tooling, alerts, and response mechanisms to identify and address reliability risks leveraging automation to support problem prevention, detection, mitigation, and resolution.
  • Enhance the delivery flow by building the appropriate solutions to increase delivery speed while adhering to technology standards for sustained reliability.
  • Progressively implement preventative controls and build increased automation and self-healing capabilities.
  • Continue to improve cost efficiency baselines.
  • IT / Data Engineering Responsibilities: Participate in the elimination of toil by creating automation or engineering autonomous solutions requiring minimal manual effort (e.g., covering OS patching, CI/CD, and infrastructure configuration management).
  • Ability to build reliable and performant data systems to support data delivery.
  • Ability to build scalable SDLC environments using COTS, SaaS, and PaaS products to support data pipeline needs.
  • IT Ops Responsibilities: Promote operational excellence. Participate in the triage and service restoration of all high-impact incidents to minimize the mean time to service restoration and the impact to the business.
  • Demonstrate end-to-end ownership.
  • Partner with infrastructure Product teams to design and implement intelligent incident routing, enhanced monitoring/alerting capabilities and automated service restoration processes.
  • Take proactive measures to prevent high-impact incidents.
  • Achieve and maintain the continuity of Hartford and third-party assets that support a business function.
  • Accountable for keeping the IT application and infrastructure metadata repositories current.
  • Application Focus: Promote the reliability (such as availability, capacity, performance) of the solution. Participate in on-call activities to mitigate incidents as quickly as possible.
  • Participate in the development of effective tooling, alerts, and responses to identify and address reliability risks, including automatic problem detection and mitigation.
  • Build and operate reliable and performant data systems and services that enable the business to make data-driven decisions.
  • Engage with the service consumers to define functional and non-functional requirements for the solutions.
  • Contribute training, best practices, and sample code to help consumers take full advantage of the solution.
  • Partner with the RE and Software Engineering teams to collect ongoing feedback and improvement backlog items for the Infrastructure Product teams.
  • Leverage analyst reviews, vendor offerings, and client success stories to evolve the portfolio.
  • Participate in relevant vendor / community / industry conferences.
  • Build and maintain governance policies, especially for data masking (PII management) and data lifecycle management.

Requirements

  • DevOps Mindset
  • Enjoy solving difficult engineering problems and don’t mind getting your hands dirty
  • Maintains personal responsibility and commitment to respond to and address incidents quickly
  • Good software engineering skills, ideally with experience in Java, Python, .NET, and/or Go
  • Understanding of Linux system internals and familiarity with the TCP/IP stack, network routing, and load balancing
  • Approach troubleshooting systematically and have a deep sense of ownership for whatever you work on
  • Ability to identify the root causes of instability in a high-traffic, distributed system
  • Experience configuring and troubleshooting Linux, Java/Scala, and Docker/Kubernetes systems
  • Understanding of large-scale complex systems from a reliability perspective
  • Passion for resolving reliability issues and identifying strategies to mitigate them going forward
  • Knowledge of Performance and Observability tools such as Dynatrace, SumoLogic, TrueSight, CloudWatch, CloudTrail, AWS X-Ray, Splunk, and related tools
  • Willingness to work in an ever-changing environment
  • Passion for automation and innovations that improve productivity
  • Experience with IaC tools such as Terraform and CloudFormation
  • Degree in Computer Science or related discipline with a minimum of 3-5 years of work experience in IT systems operations and/or application development.
  • Some experience in an RE role.
  • Experience building and supporting enterprise contact center platforms, systems, and omnichannel applications (voice, chat, SMS, email, social, etc.)