Salary
💰 $90,320 - $135,480 per year
Tech Stack
Assembly, AWS, Azure, Cloud, Docker, Go, Java, Kubernetes, Linux, Python, Ray, Scala, SDLC, Splunk, TCP/IP, Terraform
About the role
- Operations Focus: Assist with instrumenting the code and application technology stack to generate relevant metrics on overall technology health: availability, performance, quality, currency, and resiliency.
- Contribute to the architecture and software engineering teams to influence the technical strategy for the organization, keeping in mind its cross-functional impacts, integration across the organization, and architecture rationalization.
- DevSecOps Solution Responsibilities: Build the necessary tooling, alerts, and response mechanisms to identify and address reliability risks, leveraging automation to support problem prevention, detection, mitigation, and resolution.
- Enhance the delivery flow by building the appropriate solutions to increase delivery speed while adhering to technology standards for sustained reliability.
- Progressively implement preventative controls and build increased automation and self-healing capabilities.
- Continue to improve cost efficiency baselines.
- IT / Data Engineering Responsibilities: Participate in the elimination of toil by creating automation or engineering autonomous solutions requiring minimal manual effort (e.g., from OS patching to CI/CD to infrastructure configuration management).
- Build reliable and performant data systems to support data delivery.
- Build scalable SDLC environments using COTS, SaaS, and PaaS products to support data pipeline needs.
- IT Ops Responsibilities: Promote operational excellence. Participate in the triaging and service restoration of all high-impact incidents to minimize the mean time to service restoration and the impact to the business.
- Demonstrate end-to-end ownership.
- Partner with Infrastructure Product teams to design and implement intelligent incident routing, enhanced monitoring and alerting capabilities, and automated service restoration processes.
- Take proactive measures to prevent high-impact incidents.
- Achieve and maintain the continuity of Hartford and third-party assets that support a business function.
- Be accountable for keeping the IT application and infrastructure metadata repositories current.
- Application Focus: Promote the reliability (such as availability, capacity, performance) of the solution. Participate in on-call activities to mitigate incidents as quickly as possible.
- Participate in the development of effective tooling, alerts, and response mechanisms to identify and address reliability risks, including automatic problem detection and mitigation.
- Build and operate reliable and performant data systems and services that enable the business to make data-driven decisions.
- Engage with the service consumers to define functional and non-functional requirements for the solutions.
- Contribute training, best practices, and sample code so that consumers can take full advantage of the solution.
- Partner with the RE and Software Engineering teams to collect ongoing feedback and improvement backlog items for the Infrastructure Product teams.
- Leverage analyst reviews, vendor offerings, and client success stories to evolve the portfolio.
- Participate in relevant vendor / community / industry conferences.
- Build and maintain governance policies, especially for data masking (PII management) and data lifecycle management.
Requirements
- DevOps Mindset
- Enjoy solving difficult engineering problems and don’t mind getting your hands dirty
- Maintains personal responsibility and commitment to respond to and address incidents quickly
- Good software engineering skills, ideally with experience in Java, Python, .NET, and/or Go
- Understanding of Linux system internals and familiarity with the TCP/IP stack, network routing, and load balancing
- Approach troubleshooting systematically and have a deep sense of ownership for whatever you work on
- Ability to identify the root causes of instability in a high-traffic, distributed system
- Experience with configuration and troubleshooting of Linux, Java/Scala, Docker / Kubernetes systems
- Understanding of large-scale complex systems from a reliability perspective
- Passion for resolving reliability issues and identifying strategies to mitigate them going forward
- Knowledge of Performance and Observability tools such as Dynatrace, SumoLogic, TrueSight, CloudWatch, CloudTrail, AWS X-Ray, Splunk, and related tools
- Willingness to work in an ever-changing environment
- Passion for automation and innovations that improve productivity
- Experience with IaC tools such as Terraform and CloudFormation
- Degree in Computer Science or related discipline with a minimum of 3-5 years of work experience in IT systems operations and/or application development.
- Some experience in an RE role.
- Experience building and supporting enterprise contact center platforms, systems, and omnichannel applications (voice, chat, SMS, email, social, etc.)