Senior Site Reliability Engineer

BrightHire

Full-time

Posted on: 12/9/2025

Location Type: Remote

Location: Remote • 🇺🇸 United States

Job Level

Senior

Tech Stack

Elasticsearch, Grafana, Kubernetes, Prometheus, Python, SQL

About the role

You will own the end-to-end reliability and performance of many of our most critical systems.
Working in lockstep with Product and Engineering, you will design, build, and refine the platform that our application and AI features run on, from Kubernetes and databases through CI/CD and observability.
You will focus on keeping our systems fast, reliable, and easy for developers to work with.
You will work on real infrastructure that supports features people use every day, including:
Continuing to improve and iterate on our observability stack, which includes Kibana, Grafana, OpenTelemetry (OTel), and Elastic.
Improving database performance by analyzing slow and high-volume queries, tuning indexes, optimizing query patterns and timing, and recommending schema and code changes to keep QPS and latency low.
Improving and upgrading Kubernetes, including deploying new services, improving resource utilization, tightening security, and standardizing deployment patterns across teams.
Improving CI/CD pipelines for both backend and frontend services so engineers can ship quickly and safely, with clear feedback loops, fast build times, and reliable rollbacks.
Enhancing the local developer experience so that running and debugging the app locally feels fast, consistent, and representative of production.
Helping improve CI/CD and observability for our ML pipeline and models, bringing MLOps best practices into our existing infrastructure.

Requirements

You have real-world experience running production systems and doing SRE, Platform, or DevOps work for web applications or APIs.
You are comfortable working across Kubernetes, CI/CD, databases, and backend services, and you enjoy owning problems end to end.
You have strong experience with Kubernetes in production environments, including cluster upgrades, workload deployments, scaling, and debugging.
You have experience with observability stacks (such as Elasticsearch and Kibana, Prometheus, Grafana, or similar) and can lead efforts like upgrading Kibana to new major versions and improving logs, metrics, and dashboards.
You have worked deeply with relational databases and SQL: you know how to profile slow queries, design and tune indexes, and work with engineers to adjust query patterns, timing, and frequency to improve performance.
You are comfortable in at least one backend language (e.g., Python) and can read and modify application code to support infra and performance improvements.
You have experience improving CI/CD pipelines, including build and test speed, deployment workflows, and release strategies (such as blue/green or canary).
You have worked with infrastructure-as-code tools or similar patterns to manage environments in a repeatable way.
You think deeply about developer experience and reliability and use both metrics and empathy to guide your decisions.
You care about security, resiliency, and cost as integral aspects of the systems you build and manage.
You move fast and independently, but you know when to pull in teammates for pairing, reviews, or cross-team alignment.

Benefits

Flexible working hours
Professional development opportunities
Remote work options
Strong observability

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

Kubernetes, CI/CD, SQL, Python, observability, infrastructure-as-code, database performance tuning, MLOps, backend services, performance optimization

Soft skills

problem ownership, developer experience, reliability, empathy, independence, collaboration, communication, leadership, critical thinking, adaptability