
Site Reliability Engineer
UFG Insurance
full-time
Location Type: Hybrid
Location: Cedar Rapids • Iowa • 🇺🇸 United States
Salary
💰 $123,865 - $163,368 per year
Job Level
Senior, Lead
Tech Stack
ITSM, Java, Linux, Python, SQL, TCP/IP, VMware
About the role
- Implement tooling to monitor system health, capacity, and performance at all levels, from hardware through the VMs and all the way to the end-user interface.
- Work with the production management team to troubleshoot incidents, restore service, and identify root causes.
- Recommend architectural and implementation changes to products delivered by development teams, based on their performance in test, performance, and production environments.
- Support continuous improvement of ITIL processes through automation, data-driven insights, and proactive problem identification.
- Document and integrate SRE practices into the ITIL framework, including incident, change, and problem management workflows.
- Develop automation for system provisioning, monitoring, deployment, and recovery to reduce manual effort and human error.
- Develop and maintain comprehensive runbooks, standard operating procedures (SOPs), and knowledge base articles for recurring operational tasks and incident response actions.
- Collaborate with development teams to design resilient architecture and implement best practices for reliability and observability.
- Enhance observability by developing and maintaining dashboards, alerts, and performance analytics.
- Contribute to capacity planning, performance tuning, and resilience testing to ensure system health.
- Develop and update problem management documentation, ensuring known errors and workarounds are captured within the ITSM system.
- Manage incident response and participate in on-call rotations to ensure service reliability.
- Define, document and track key reliability metrics (SLIs, SLOs, SLAs) and implement continuous improvement initiatives.
- Drive post-incident reviews (PIRs) and develop actionable insights to prevent future occurrences.
- Partner with security teams to ensure systems meet compliance, security, and governance standards.
- Evaluate and recommend new tools, technologies, and frameworks to improve operational efficiency.
- Monitor network systems, servers, and applications.
- Use all necessary tools to investigate performance and reliability of systems in testing environments.
- Provide detailed and specific guidance on ways to eliminate bottlenecks, improve resilience, and optimize speed and reliability.
- Provide mentorship and technical support to other members of Production Management.
Requirements
- Bachelor’s degree in Information Technology, Computer Science, or a related field, or equivalent experience
- 10+ years of experience in progressively more demanding enterprise-scale technology roles
- 3+ years of experience as a Site Reliability Engineer or Senior DevOps Engineer
- 3+ years in software development, architecture, or related engineering discipline
- Advanced experience with multiple enterprise monitoring and observability tools, including Dynatrace, PRTG, DTrace, SolarWinds, and similar
- Complete Windows fluency is mandatory; similar strengths in Linux and Unisys Mainframe environments are helpful
- Excellent problem-solving and communication skills, with the ability to collaborate across cross-functional teams.
- Unparalleled understanding of: advanced networking concepts and complete expertise in the entire TCP/IP stack
- VM (VMware and Hyper-V) and physical compute performance and tuning, including networking and storage performance
- VM (Java, Python, Browser, and similar VM environments) threading, garbage collection, and general performance
- SQL Server expertise, including troubleshooting queries, indexes, and general performance
- Experience with unstructured database performance
- General understanding of LLM/SLM implementations and GPU implementations
- Proficiency in automation and scripting languages
- Good understanding of ITIL processes (Incident, Change, Problem, and Service Level Management).
Benefits
- Annual incentive compensation
- Medical, dental, vision & life insurance
- Accident, critical illness & short-term disability insurance
- Retirement plans with employer contributions
- Generous time-off program
- Programs designed to support employee well-being and financial security
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Site Reliability Engineering, DevOps Engineering, Software Development, Architecture, Performance Tuning, Automation, Scripting Languages, SQL Server, Networking Concepts, Observability
Soft skills
Problem-solving, Communication, Collaboration, Mentorship