Red Hat

Principal Site Reliability Engineer – Automotive

full-time

Location Type: Hybrid

Location: Raleigh, North Carolina, United States

Salary

💰 $151,510 - $249,950 per year

About the role

  • Architect, design and lead the implementation of the RHIVOS product SRE initiative.
  • Instrument metrics to support Service Level Objectives (SLO), Service Level Indicators (SLI) and Service Level Agreements (SLA) for critical services.
  • Use metrics designed and built into the software to analyze system performance, identify performance bottlenecks and underutilized hardware, and scale the infrastructure design.
  • Review team contributions to the software, correct errors, and provide constructive feedback.
  • Lead and participate in incident response and postmortems, and help identify steps to minimize Mean Time To Resolution (MTTR).
  • Regularly contribute to internal workshops and training to upskill the team as the product architecture evolves.
  • Configure and maintain software production infrastructure and tooling.
  • Serve as an internal expert on infrastructure and tooling, including software production pipelines, providing guidance to engineering teams and making high-level recommendations to improve efficiency, reliability, and stability.
  • Create and maintain service monitoring, improve automation, uphold security best practices, and respond to various service situations for the software production infrastructure.
  • Resolve service incidents using existing operating procedures, investigate outage causes, and coordinate incident resolution across various service teams.
  • Act as a leader and mentor to your less experienced colleagues, propose and drive continuous improvement ideas, and help the team benefit from evolving technology, such as the use of AI tools.
  • Collaborate on incident retrospective reviews and the implementation of corrective items.
  • Proactively identify and eliminate toil by automating manual, repetitive, and error-prone processes.
  • Coordinate your actions with other Red Hat teams such as IT and Product Security to ensure our infrastructure meets quality expectations.
  • Implement monitoring, alerting and escalation plans in the event of an infrastructure outage or performance problem.
  • Work with service owners to co-define and implement SLIs and SLOs for the services you’ll support, ensure they are met, and execute remediation plans if they are not.
  • Help out and back up the RHIVOS Raleigh lab SRE when needed.

Requirements

  • 8+ years of software reliability engineering experience with deep expertise in Linux systems, infrastructure-as-code, and complex, distributed enterprise environments.
  • Linux administration expertise
  • Advanced experience with Kubernetes/OpenShift administration and application development
  • Advanced experience with automation services like Ansible or Terraform
  • Advanced experience with CI/CD platforms like GitLab CI, Tekton and Pipelines as Code (optionally GitHub Actions, etc.)
  • Advanced experience with monitoring platforms and technologies
  • Advanced experience with AWS technologies
  • Experience with open source monitoring technologies (Grafana, Prometheus, OpenTelemetry)
  • Excellent written and verbal communication skills in English, as you'll be working in a globally distributed team
  • Proven track record of leading, and implementing hands-on, the program- or product-wide adoption of a data-driven reliability framework: architecting complex, multi-service SLO/SLI standards and institutionalizing error budget policies that effectively balance rapid feature velocity with global system stability
  • Previous experience with the Site Reliability Engineer (SRE) model and software development using Python or GoLang.
  • Ability to work in the Raleigh office when needed

Benefits

  • Comprehensive medical, dental, and vision coverage
  • Flexible Spending Account - healthcare and dependent care
  • Health Savings Account - high deductible medical plan
  • Retirement 401(k) with employer match
  • Paid time off and holidays
  • Paid parental leave plans for all new parents
  • Leave benefits including disability, paid family medical leave, and paid military leave
  • Additional benefits including employee stock purchase plan, family planning reimbursement, tuition reimbursement, transportation expense account, employee assistance program, and more!

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Linux systems, infrastructure-as-code, Kubernetes, OpenShift, Ansible, Terraform, CI/CD, GitLab CI, AWS, Python
Soft Skills
communication, leadership, mentoring, collaboration, problem-solving, continuous improvement, feedback, training, incident response, data-driven decision making