Berkshire Hathaway Homestate Companies - Workers Compensation Division

Service Reliability Engineer – Internal

Full-time

Location Type: Hybrid

Location: Sacramento, California, United States

Salary

💰 $130,930 - $166,090 per year

About the role

  • Ensure the reliability, availability, and performance of key business applications, IT services, and infrastructure by monitoring system health and identifying potential risks.
  • Implement proactive measures, such as performance tuning and capacity planning, to avoid service disruptions.
  • Maintain and improve service-level objectives (SLOs) and service-level agreements (SLAs) across systems and services.
  • Monitor and respond to incidents, troubleshooting issues across the entire stack (network, systems, software, applications).
  • Conduct root cause analysis (RCA) of system failures and recommend or implement long-term solutions to prevent recurrence.
  • Participate in on-call rotations, ensuring timely resolution of incidents and minimizing downtime.
  • Collaborate with development and operations teams to improve system observability and alerting through monitoring tools.
  • Contribute to the design and implementation of scalable, highly available, and fault-tolerant architectures for distributed systems.
  • Collaborate with software engineers and architects to optimize system architecture for high reliability and performance.
  • Manage cloud infrastructure and services (Azure, AWS, or Google Cloud) to ensure efficient resource utilization and scalability.
  • Design and manage networking components, including load balancers, firewalls, VPNs, and DNS, ensuring secure, scalable, and resilient network infrastructure.
  • Troubleshoot network performance issues, including latency, packet loss, and bandwidth bottlenecks.
  • Work with security teams to implement best practices for security hardening, encryption, and compliance in production environments.
  • Implement comprehensive monitoring, logging, and alerting systems using tools such as Dynatrace, Prometheus, Grafana, Datadog, or Splunk.
  • Create and maintain dashboards that provide real-time insights into system performance, availability, and key reliability metrics.
  • Set up monitoring for key infrastructure components (e.g., servers, databases, microservices) and define actionable alerts.
  • Conduct capacity and performance testing to ensure the systems can handle increasing traffic and workloads.
  • Perform periodic (e.g., daily, monthly, annual) manual and automated SRE operations to ensure proper performance of corporate enterprise applications and systems.
  • Work with business applications across various environments, including on-premises, hybrid, and cloud systems.
  • Work with the infrastructure and cloud teams to ensure that application environments are stable, secure, and meet business performance expectations.
  • Support the transition of applications from on-premises environments to cloud or hybrid architectures, working closely with senior IT leadership on cloud migration strategies.
  • Ensure proper governance and performance monitoring for applications in all environments, proactively identifying areas for optimization.
  • Develop and implement procedures for regular audits, risk assessments, and disaster recovery plans for critical applications.
  • Ensure that QA processes adhere to relevant industry standards and regulatory requirements (e.g., ISO, GDPR, HIPAA).
  • Develop and maintain test documentation, including test plans, test cases, test scripts, and test data management.

Requirements

  • EDUCATION: Bachelor's degree in Computer Science, Information Technology, or related field, required.
  • CERTIFICATIONS: Certification in cloud platforms (e.g., Microsoft Azure Administrator), preferred.
  • A minimum of 7 years of experience as a Service Reliability Engineer, DevOps Engineer, or Systems Engineer, with hands-on exposure to networking, systems, and software development, required.
  • Strong experience with cloud platforms (e.g., Azure, AWS, or Google Cloud), including experience managing cloud-based services and infrastructure, required.
  • Experience with scripting or programming languages (e.g., PowerShell, Python, Go, Bash, Java) for automation and tooling, required.
  • Experience with monitoring and observability tools (e.g., Dynatrace, Prometheus, Grafana, Datadog, Splunk) required.
  • Hands-on experience with CI/CD pipelines and version control (e.g., Azure DevOps (ADO), Git, GitLab CI), required.
  • Experience with database management and optimization, including relational (SQL) and NoSQL databases (e.g., Azure Cosmos DB, MongoDB), preferred.
  • Experience working in Agile development environments and familiarity with tools such as Microsoft ADO, Teams, or Confluence, preferred.
  • Proficiency with automation and configuration management tools (e.g., Ansible, Terraform, Puppet, Chef) required.
  • Solid understanding of networking protocols (TCP/IP, DNS, HTTP/HTTPS, SSL/TLS) and troubleshooting network-related issues required.
  • Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes required.
  • Knowledge of distributed systems, microservices architecture, and challenges related to these, such as consistency, fault tolerance, and distributed logging, preferred.
  • Exposure to site reliability engineering best practices, such as chaos engineering and blameless post-mortems preferred.

Applicant Tracking System Keywords

Hard Skills & Tools
cloud platforms, scripting languages, networking, database management, monitoring tools, CI/CD pipelines, automation tools, performance tuning, capacity planning, root cause analysis
Soft Skills
collaboration, problem-solving, communication, incident response, analytical thinking, time management, adaptability, leadership, organizational skills, attention to detail
Certifications
Microsoft Azure Administrator