Senior Site Reliability Engineer

Site Reliability Engineer at Onebrief focusing on the reliability and scalability of mission-critical applications in DoD environments and the AWS cloud.

Posted 5/19/2026 | full-time | Colorado Springs • Colorado • 🇺🇸 United States | Senior | 💰 $180,000 - $220,000 per year

Tools & technologies

Ansible, AWS, Cloud, Go, Grafana, Jenkins, Kubernetes, Prometheus, Python, Terraform

Key responsibilities & impact

You'll own the reliability, scalability, and security of the production application and/or platform.
Implementing a World-Class Observability Platform: Design, implement, and manage our monitoring, logging, and alerting stack (e.g., Prometheus, Loki, Alloy, and Grafana).
Defining and Upholding Reliability: Define, measure, and own alerting that feeds into our Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Leading Incident Response: Act as an incident responder, and potentially as incident commander, during critical incidents.
Automating for Scale and Security: Partner with platform engineers to design, build, and manage secure, resilient Kubernetes clusters and cloud/on-prem environments.
Eliminating Toil and Scaling the Team: Proactively identify and eliminate operational toil by building automation.

What you’ll need

An active Top Secret clearance.
5+ years in Platform, DevOps, or Site Reliability Engineering with an infrastructure and operations focus.
Proven partner to DevOps/Platform and application teams; collaborates well across functions and shares context openly.
A deep understanding of incident response processes, with experience conducting thorough root cause analyses and driving continuous improvement.
Infrastructure as Code: Terraform (or CloudFormation), Ansible.
Containers and orchestration: Kubernetes design, deployment, and operations.
CI/CD: experience building and maintaining pipelines (GitLab CI/CD, Jenkins, GitHub Actions).
Scripting: proficiency with at least one of Python, Go, or Bash.
Cloud: Familiarity with AWS or AWS GovCloud.
Observability: Grafana stack, ELK stack, or Datadog.
Networking fundamentals: core protocols and secure configurations.

Comp & perks

Equity: Share in the company's success.
Flexible Work Environment: Remote-first organization with flexible work hours and unlimited PTO. (Note that some roles are in-person, on-site with customers.)
Comprehensive Health Coverage: Health, dental, vision, and life insurance.
Retirement Plan: 401(k) plan with company match to secure your future.
Parental Leave: 8 weeks at 100% regardless of state.
Company Retreats: Annual company summit trips.
Home Office Budget: $1,000 per year for home office improvements.

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Kubernetes, Terraform, Ansible, Python, Go, Bash, GitLab CI/CD, Jenkins, GitHub Actions, AWS

Soft Skills

collaboration, incident response, root cause analysis, continuous improvement

Certifications

Top Secret clearance