Onebrief

Senior Site Reliability Engineer

full-time

Location Type: Hybrid

Location: Tacoma • Washington • 🇺🇸 United States

Salary

💰 $180,000 - $220,000 per year

Job Level

Senior

Tech Stack

Ansible, AWS, Cloud, Go, Grafana, Jenkins, Kubernetes, Prometheus, Python, Terraform

About the role

  • You'll own the reliability, scalability, and security of the production application and/or platform.
  • Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana).
  • Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
  • Leading Incident Response: Act as the incident responder and potentially incident commander during critical incidents.
  • Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments using Infrastructure-as-Code.
  • Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation.

Requirements

  • 5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus.
  • Proven partner to DevOps/Platform and application teams; collaborates well across functions and shares context openly.
  • A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.
  • Infrastructure as Code: technical expertise with Terraform (or CloudFormation) and Ansible.
  • Containers and orchestration: Kubernetes design, deployment, and operations.
  • CI/CD: experience building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions).
  • Scripting: proficiency with at least one of Python, Go, or Bash.
  • Cloud: familiarity with AWS or AWS GovCloud.
  • Observability: Grafana stack, ELK stack, or Datadog.
  • Networking fundamentals: core protocols and secure configurations.
  • Active Security+ or another DoD 8570.01-approved security credential, or the ability to obtain a valid credential within 3 months of employment.
  • Active Secret Clearance required.

Benefits

  • Relocation assistance