
Service Reliability Engineer
Berkshire Hathaway Homestate Companies - Workers Compensation Division
full-time
Location Type: Hybrid
Location: Plano • Texas • United States
Salary
💰 $130,930 - $166,690 per year
About the role
- Ensure the reliability, availability, and performance of key business applications, IT services, and infrastructure by monitoring system health and identifying potential risks.
- Implement proactive measures, such as performance tuning and capacity planning, to avoid service disruptions.
- Maintain and improve service-level objectives (SLOs) and service-level agreements (SLAs) across systems and services.
- Monitor and respond to incidents, troubleshooting issues across the entire stack (network, systems, software, applications).
- Conduct root cause analysis (RCA) of system failures and recommend or implement long-term solutions to prevent recurrence.
- Participate in on-call rotations, ensuring timely resolution of incidents and minimizing downtime.
- Collaborate with development and operations teams to improve system observability and alerting through monitoring tools.
- Contribute to the design and implementation of scalable, highly available, and fault-tolerant architectures for distributed systems.
- Manage cloud infrastructure and services (Azure, AWS, or Google Cloud) to ensure efficient resource utilization and scalability.
- Work with security teams to implement best practices for security hardening, encryption, and compliance in production environments.
- Implement comprehensive monitoring, logging, and alerting systems using tools such as Dynatrace, Prometheus, Grafana, Datadog, or Splunk.
- Create and maintain dashboards that provide real-time insights into system performance, availability, and key reliability metrics.
- Periodically perform manual and automated SRE operations to ensure that corporate enterprise applications and systems perform properly.
Requirements
- EDUCATION: Bachelor's degree in Computer Science, Information Technology, or a related field, required.
- CERTIFICATIONS: Certification in cloud platforms (e.g., Microsoft Azure Administrator), preferred.
- A minimum of 7 years of experience as a Service Reliability Engineer, DevOps Engineer, or Systems Engineer, with hands-on exposure to networking, systems, and software development, required.
- Strong experience with cloud platforms (e.g., Azure, AWS, or Google Cloud), including experience managing cloud-based services and infrastructure, required.
- Experience with scripting or programming languages (e.g., PowerShell, Python, Go, Bash, Java) for automation and tooling, required.
- Experience with monitoring and observability tools (e.g., Dynatrace, Prometheus, Grafana, Datadog, Splunk), required.
- Hands-on experience with CI/CD pipelines and version control (e.g., ADO, Git, GitLab CI), required.
- Experience with database management and optimization, including relational (SQL) and NoSQL databases (e.g., Cosmos DB, MongoDB), preferred.
- Experience working in Agile development environments and familiarity with tools such as Microsoft ADO and Teams or Confluence, preferred.
- Proficiency with automation and configuration management tools (e.g., Ansible, Terraform, Puppet, Chef), required.
- Solid understanding of networking protocols (TCP/IP, DNS, HTTP/HTTPS, SSL/TLS) and experience troubleshooting network-related issues, required.
- Familiarity with containerization technologies such as Docker and orchestration tools such as Kubernetes, required.
- Knowledge of distributed systems, microservices architecture, and challenges related to these, such as consistency, fault tolerance, and distributed logging, preferred.
- Exposure to site reliability engineering best practices, such as chaos engineering and blameless post-mortems, preferred.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
cloud platforms, scripting languages, monitoring tools, CI/CD pipelines, database management, networking protocols, containerization technologies, automation tools, performance tuning, capacity planning
Soft Skills
collaboration, troubleshooting, incident response, root cause analysis, problem-solving, communication, proactive measures, teamwork, time management, adaptability
Certifications
Microsoft Azure Administrator