
Service Reliability Engineer – Internal
Berkshire Hathaway Homestate Companies - Workers Compensation Division
full-time
Location Type: Hybrid
Location: Sacramento • California • United States
Salary
💰 $130,930 - $166,090 per year
About the role
- Ensure the reliability, availability, and performance of key Business Applications, IT services and infrastructure by monitoring system health and identifying potential risks.
- Implement proactive measures, such as performance tuning and capacity planning, to avoid service disruptions.
- Maintain and improve service-level objectives (SLOs) and service-level agreements (SLAs) across systems and services.
- Monitor and respond to incidents, troubleshooting issues across the entire stack (network, systems, software, applications).
- Conduct root cause analysis (RCA) of system failures and recommend or implement long-term solutions to prevent recurrence.
- Participate in on-call rotations, ensuring timely resolution of incidents and minimizing downtime.
- Collaborate with development and operations teams to improve system observability and alerting through monitoring tools.
- Contribute to the design and implementation of scalable, highly available, and fault-tolerant architectures for distributed systems.
- Collaborate with software engineers and architects to optimize system architecture for high reliability and performance.
- Manage cloud infrastructure and services (Azure, AWS, or Google Cloud) to ensure efficient resource utilization and scalability.
- Design and manage networking components, including load balancers, firewalls, VPNs, and DNS, ensuring secure, scalable, and resilient network infrastructure.
- Troubleshoot network performance issues, including latency, packet loss, and bandwidth bottlenecks.
- Work with security teams to implement best practices for security hardening, encryption, and compliance in production environments.
- Implement comprehensive monitoring, logging, and alerting systems using tools such as Dynatrace, Prometheus, Grafana, Datadog, or Splunk.
- Create and maintain dashboards that provide real-time insights into system performance, availability, and key reliability metrics.
- Set up monitoring for key infrastructure components (e.g., servers, databases, microservices) and define actionable alerts.
- Conduct capacity and performance testing to ensure the systems can handle increasing traffic and workloads.
- Perform periodic (e.g., daily, monthly, annual) manual and automated SRE operations to keep corporate enterprise applications and systems performing properly.
- Work with business applications across various environments, including on-premises, hybrid, and cloud systems.
- Work with the infrastructure and cloud teams to ensure that application environments are stable, secure, and meet business performance expectations.
- Support the transition of applications from on-premises environments to cloud or hybrid architectures, working closely with senior IT leadership on cloud migration strategies.
- Ensure proper governance and performance monitoring for applications in all environments, proactively identifying areas for optimization.
- Develop and implement procedures for regular audits, risk assessments, and disaster recovery plans for critical applications.
- Ensure that QA processes adhere to relevant industry standards and regulatory requirements (e.g., ISO, GDPR, HIPAA).
- Develop and maintain test documentation, including test plans, test cases, test scripts, and test data management.
Requirements
- EDUCATION: Bachelor's degree in Computer Science, Information Technology, or a related field, required.
- CERTIFICATIONS: Certification in cloud platforms (e.g., Microsoft Azure Administrator), preferred.
- A minimum of 7 years of experience as a Service Reliability Engineer, DevOps Engineer, or Systems Engineer, with hands-on exposure to networking, systems, and software development, required.
- Strong experience with cloud platforms (e.g., Azure, AWS, or Google Cloud), including experience managing cloud-based services and infrastructure, required.
- Experience with scripting or programming languages (e.g., PowerShell, Python, Go, Bash, Java) for automation and tooling required.
- Experience with monitoring and observability tools (e.g., Dynatrace, Prometheus, Grafana, Datadog, Splunk) required.
- Hands-on experience with CI/CD pipelines and version control (e.g., ADO, Git, GitLab CI) required.
- Experience with database management and optimization, including relational databases (SQL) and NoSQL databases (e.g., Cosmos DB, MongoDB) preferred.
- Experience working in Agile development environments and familiarity with tools such as Microsoft ADO, Teams, or Confluence preferred.
- Proficiency with automation and configuration management tools (e.g., Ansible, Terraform, Puppet, Chef) required.
- Solid understanding of networking protocols (TCP/IP, DNS, HTTP/HTTPS, SSL/TLS) and troubleshooting network-related issues required.
- Familiarity with containerization technologies like Docker and orchestration tools like Kubernetes required.
- Knowledge of distributed systems, microservices architecture, and challenges related to these, such as consistency, fault tolerance, and distributed logging, preferred.
- Exposure to site reliability engineering best practices, such as chaos engineering and blameless post-mortems preferred.
Applicant Tracking System Keywords
Hard Skills & Tools
cloud platforms, scripting languages, networking, database management, monitoring tools, CI/CD pipelines, automation tools, performance tuning, capacity planning, root cause analysis
Soft Skills
collaboration, problem-solving, communication, incident response, analytical thinking, time management, adaptability, leadership, organizational skills, attention to detail
Certifications
Microsoft Azure Administrator