Berkshire Hathaway Homestate Companies - Workers Compensation Division

Service Reliability Engineer

full-time

Location Type: Hybrid

Location: Plano, Texas, United States

Salary

💰 $130,930 - $166,690 per year

About the role

  • Ensure the reliability, availability, and performance of key Business Applications, IT services and infrastructure by monitoring system health and identifying potential risks.
  • Implement proactive measures, such as performance tuning and capacity planning, to avoid service disruptions.
  • Maintain and improve service-level objectives (SLOs) and service-level agreements (SLAs) across systems and services.
  • Monitor and respond to incidents, troubleshooting issues across the entire stack (network, systems, software, applications).
  • Conduct root cause analysis (RCA) of system failures and recommend or implement long-term solutions to prevent recurrence.
  • Participate in on-call rotations, ensuring timely resolution of incidents and minimizing downtime.
  • Collaborate with development and operations teams to improve system observability and alerting through monitoring tools.
  • Contribute to the design and implementation of scalable, highly available, and fault-tolerant architectures for distributed systems.
  • Manage cloud infrastructure and services (Azure, AWS, or Google Cloud) to ensure efficient resource utilization and scalability.
  • Work with security teams to implement best practices for security hardening, encryption, and compliance in production environments.
  • Implement comprehensive monitoring, logging, and alerting systems using tools such as Dynatrace, Prometheus, Grafana, Datadog, or Splunk.
  • Create and maintain dashboards that provide real-time insights into system performance, availability, and key reliability metrics.
  • Periodically perform manual and automated SRE operations to ensure proper performance of corporate enterprise applications and systems.

Requirements

  • EDUCATION: Bachelor's degree in Computer Science, Information Technology, or related field, required.
  • CERTIFICATIONS: Certification in cloud platforms (e.g., Microsoft Azure Administrator), preferred.
  • A minimum of 7 years of experience as a Service Reliability Engineer, DevOps Engineer, or Systems Engineer, with hands-on exposure to networking, systems, and software development, required.
  • Strong experience with cloud platforms (e.g., Azure, AWS, or Google Cloud), including experience managing cloud-based services and infrastructure, required.
  • Experience with scripting or programming languages (e.g., PowerShell, Python, Go, Bash, Java) for automation and tooling required.
  • Experience with monitoring and observability tools (e.g., Dynatrace, Prometheus, Grafana, Datadog, Splunk) required.
  • Hands-on experience with CI/CD pipelines and version control (e.g., ADO, Git, GitLab CI) required.
  • Experience with database management and optimization, including relational databases (SQL) and NoSQL databases (e.g., Cosmos DB, MongoDB) preferred.
  • Experience working in Agile development environments, with familiarity in tools like Microsoft ADO & Teams or Confluence preferred.
  • Proficiency with automation and configuration management tools (e.g., Ansible, Terraform, Puppet, Chef) required.
  • Solid understanding of networking protocols (TCP/IP, DNS, HTTP/HTTPS, SSL/TLS) and troubleshooting network-related issues required.
  • Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes required.
  • Knowledge of distributed systems, microservices architecture, and challenges related to these, such as consistency, fault tolerance, and distributed logging, preferred.
  • Exposure to site reliability engineering best practices, such as chaos engineering and blameless post-mortems, preferred.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
cloud platforms, scripting languages, monitoring tools, CI/CD pipelines, database management, networking protocols, containerization technologies, automation tools, performance tuning, capacity planning
Soft Skills
collaboration, troubleshooting, incident response, root cause analysis, problem-solving, communication, proactive measures, teamwork, time management, adaptability
Certifications
Microsoft Azure Administrator