
Principal Site Reliability Engineer – Automotive
Red Hat
full-time
Location Type: Hybrid
Location: Raleigh • North Carolina • United States
Salary
💰 $151,510 - $249,950 per year
About the role
- Architect, design and lead the implementation of the RHIVOS product SRE initiative.
- Instrument metrics to support Service Level Objectives (SLOs), Service Level Indicators (SLIs) and Service Level Agreements (SLAs) for critical services.
- Use the metrics designed and built into the software to analyze system performance, identify performance bottlenecks and underutilized hardware, and scale the infrastructure design.
- Review team contributions to the software, correct errors, and provide constructive feedback.
- Lead and participate in incident response and postmortems, and help identify steps to minimize Mean Time To Resolution (MTTR).
- Regularly contribute to internal workshops and training to upskill the team as the product architecture evolves.
- Configure and maintain software production infrastructure and tooling.
- Serve as an internal expert on infrastructure and tooling, including software production pipelines, providing guidance to engineering teams and making high-level recommendations to improve efficiency, reliability, and stability.
- Create and maintain service monitoring, improve automation, uphold security best practices, and respond to service situations affecting the software production infrastructure.
- Resolve service incidents using existing operating procedures, investigate outage causes, and coordinate incident resolution across service teams.
- Act as a leader and mentor to less experienced colleagues, propose and drive continuous improvement ideas, and help the team benefit from evolving technology, such as the use of AI tools.
- Collaborate on incident retrospective reviews and the implementation of corrective items.
- Proactively identify and eliminate toil by automating manual, repetitive, and error-prone processes.
- Coordinate your actions with other Red Hat teams such as IT and Product Security to ensure our infrastructure meets quality expectations.
- Implement monitoring, alerting and escalation plans in the event of an infrastructure outage or performance problem.
- Work with service owners to co-define and implement SLIs and SLOs for the services you’ll support, ensure they are met, and execute remediation plans if they are not.
- Help out and back up the RHIVOS Raleigh lab SRE when needed.
Requirements
- 8+ years of software reliability engineering experience with deep expertise in Linux systems, infrastructure-as-code, and complex, distributed enterprise environments.
- Linux administration expertise
- Advanced experience with Kubernetes/OpenShift administration and application development
- Advanced experience with automation services like Ansible or Terraform
- Advanced experience with CI/CD platforms like GitLab CI, Tekton and Pipelines as Code (optionally GitHub Actions, etc.)
- Advanced experience with monitoring platforms and technologies
- Advanced experience with AWS technologies
- Experience with open source monitoring technologies (Grafana, Prometheus, OpenTelemetry)
- Excellent written and verbal communication skills in English, as you'll be working in a globally distributed team
- Proven track record of leading, and implementing hands-on, the program- or product-wide adoption of a data-driven reliability framework by architecting complex, multi-service SLO/SLI standards and institutionalizing error budget policies that effectively balance rapid feature velocity with global system stability
- Previous experience with the Site Reliability Engineer (SRE) model and software development using Python or GoLang.
- Ability to work in the Raleigh office when needed
Benefits
- Comprehensive medical, dental, and vision coverage
- Flexible Spending Account - healthcare and dependent care
- Health Savings Account - high deductible medical plan
- Retirement 401(k) with employer match
- Paid time off and holidays
- Paid parental leave plans for all new parents
- Leave benefits including disability, paid family medical leave, and paid military leave
- Additional benefits including employee stock purchase plan, family planning reimbursement, tuition reimbursement, transportation expense account, employee assistance program, and more!
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Linux systems, infrastructure-as-code, Kubernetes, OpenShift, Ansible, Terraform, CI/CD, GitLab CI, AWS, Python
Soft Skills
communication, leadership, mentoring, collaboration, problem-solving, continuous improvement, feedback, training, incident response, data-driven decision making