Site Reliability Engineer – SRE

Bright Vision Technologies

Site Reliability Engineer responsible for the operational excellence of large-scale systems, collaborating with development and operations teams to improve infrastructure reliability and performance.

Posted 5/17/2026 • Full-time • Remote • 🇺🇸 United States • Mid-Level, Senior

Tech Stack

Tools & technologies

Distributed Systems, Go, Grafana, Java, Kubernetes, Linux, Prometheus, Python

About the role

Key responsibilities & impact

Ensure the availability, performance, and operational excellence of large-scale distributed systems in production.
Work at the boundary between development and operations, applying strong software engineering principles to infrastructure and operations problems.
Continuously push the platform toward higher reliability with lower operational toil.
Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services.
Lead incident response and resolution for production issues.
Design and implement comprehensive monitoring, logging, and tracing strategies.
Build and maintain robust on-call processes, runbooks, and escalation paths.
Automate operational toil aggressively by writing production-grade tooling.
Architect and operate large-scale Kubernetes clusters and container-based workloads.
Design CI/CD pipelines that promote safe, frequent, and observable releases.
Lead capacity planning and performance engineering activities.
Partner closely with application development teams to embed reliability practices early in design.
Drive continuous improvement of security posture in collaboration with security teams.
Mentor engineers across the organization on SRE practices.

Requirements

What you’ll need

Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
Strong programming skills in at least one of Python, Go, or Java, with the ability to build robust automation and tooling.
Deep, hands-on experience operating Linux at scale, including networking, performance tuning, and systems-level troubleshooting.
Production experience operating Kubernetes and container-based workloads.
Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
Hands-on experience designing and operating CI/CD pipelines for both infrastructure and applications.
Solid understanding of distributed system design, including consistency models, partitioning, and failure semantics.
Demonstrated experience leading incident response and conducting effective post-incident reviews.
Excellent communication and documentation skills.

Benefits

Comp & perks

Comprehensive benefits
Competitive compensation packages
Supportive work-life balance

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Python, Go, Java, Kubernetes, CI/CD, Linux, observability tooling, performance tuning, distributed system design, automation

Soft Skills

communication, documentation, mentoring, incident response, collaboration, leadership, continuous improvement, problem-solving, organizational skills, capacity planning