UFG Insurance

Site Reliability Engineer

full-time

Location Type: Hybrid

Location: Cedar Rapids • Iowa • 🇺🇸 United States

Salary

💰 $123,865 - $163,368 per year

Job Level

Senior, Lead

Tech Stack

ITSM, Java, Linux, Python, SQL, TCP/IP, VMware

About the role

  • Implement tooling to monitor system health, capacity, and performance at all levels, from hardware through the VMs and all the way to the end-user interface.
  • Work with the production management team to troubleshoot incidents, restore service, and identify root causes.
  • Recommend architectural and implementation changes to products delivered by development teams, based on their performance in test, performance, and production environments.
  • Support continuous improvement of ITIL processes through automation, data-driven insights, and proactive problem identification.
  • Document and integrate SRE practices into the ITIL framework, including incident, change, and problem management workflows.
  • Develop automation for system provisioning, monitoring, deployment, and recovery to reduce manual effort and human error.
  • Develop and maintain comprehensive runbooks, standard operating procedures (SOPs), and knowledge base articles for recurring operational tasks and incident response actions.
  • Collaborate with development teams to design resilient architecture and implement best practices for reliability and observability.
  • Enhance observability by developing and maintaining dashboards, alerts, and performance analytics.
  • Contribute to capacity planning, performance tuning, and resilience testing to ensure system health.
  • Develop and update problem management documentation, ensuring known errors and workarounds are captured within the ITSM system.
  • Manage incident response and participate in on-call rotations to ensure service reliability.
  • Define, document and track key reliability metrics (SLIs, SLOs, SLAs) and implement continuous improvement initiatives.
  • Drive post-incident reviews (PIRs) and develop actionable insights to prevent future occurrences.
  • Partner with security teams to ensure systems meet compliance, security, and governance standards.
  • Evaluate and recommend new tools, technologies, and frameworks to improve operational efficiency.
  • Monitor network systems, servers, and applications.
  • Use all necessary tools to investigate performance and reliability of systems in testing environments.
  • Provide detailed and specific guidance on ways to eliminate bottlenecks, improve resilience, and optimize speed and reliability.
  • Provide mentorship and technical support to other members of Production Management.

Requirements

  • Bachelor’s degree in Information Technology, Computer Science, or a related field, or equivalent experience
  • 10+ years of experience in progressively more demanding enterprise-scale technology roles
  • 3+ years of experience as a Site Reliability Engineer or Senior DevOps Engineer
  • 3+ years in software development, architecture, or related engineering discipline
  • Advanced experience with multiple enterprise monitoring and observability tools, including Dynatrace, PRTG, DTrace, SolarWinds, and similar
  • Complete Windows fluency is mandatory; similar strength in Linux and Unisys Mainframe environments is helpful
  • Excellent problem-solving and communication skills, with the ability to collaborate across cross-functional teams.
  • Unparalleled understanding of advanced networking concepts and complete expertise in the entire TCP/IP stack
  • VM (VMware and Hyper-V) and physical compute performance and tuning, including networking and storage performance
  • VM (Java, Python, Browser, and similar VM environments) threading, garbage collection, and general performance
  • SQL Server expertise, including troubleshooting queries, indexes, and general performance
  • Experience with unstructured database performance
  • General understanding of LLM/SLM implementations and GPU implementations
  • Proficiency in automation and scripting languages
  • Good understanding of ITIL processes (Incident, Change, Problem, and Service Level Management).

Benefits

  • Annual incentive compensation
  • Medical, dental, vision & life insurance
  • Accident, critical illness & short-term disability insurance
  • Retirement plans with employer contributions
  • Generous time-off program
  • Programs designed to support employee well-being and financial security

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
Site Reliability Engineering, DevOps Engineering, Software Development, Architecture, Performance Tuning, Automation, Scripting Languages, SQL Server, Networking Concepts, Observability
Soft skills
Problem-solving, Communication, Collaboration, Mentorship