Bright Vision Technologies

Site Reliability Engineer – SRE

The Site Reliability Engineer ensures operational excellence for large-scale systems, collaborating with development and operations teams to improve infrastructure reliability and performance.

Posted 5/17/2026 • full-time • Remote • 🇺🇸 United States • Mid-Level, Senior

Tech Stack

Tools & technologies
Distributed Systems, Go, Grafana, Java, Kubernetes, Linux, Prometheus, Python

About the role

Key responsibilities & impact
  • Ensure the availability, performance, and operational excellence of large-scale distributed systems in production.
  • Live at the boundary between development and operations, applying strong software engineering principles to infrastructure and operations problems.
  • Continuously push the platform toward higher reliability with lower operational toil.
  • Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services.
  • Lead incident response and resolution for production issues.
  • Design and implement comprehensive monitoring, logging, and tracing strategies.
  • Build and maintain robust on-call processes, runbooks, and escalation paths.
  • Automate operational toil aggressively by writing production-grade tooling.
  • Architect and operate large-scale Kubernetes clusters and container-based workloads.
  • Design CI/CD pipelines that promote safe, frequent, and observable releases.
  • Lead capacity planning and performance engineering activities.
  • Partner closely with application development teams to embed reliability practices early in design.
  • Drive continuous improvement of security posture in collaboration with security teams.
  • Mentor engineers across the organization on SRE practices.

Requirements

What you’ll need
  • Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
  • Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
  • Strong programming skills in at least one of Python, Go, or Java, with the ability to build robust automation and tooling.
  • Deep, hands-on experience operating Linux at scale, including networking, performance tuning, and systems-level troubleshooting.
  • Production experience operating Kubernetes and container-based workloads.
  • Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
  • Hands-on experience designing and operating CI/CD pipelines for both infrastructure and applications.
  • Solid understanding of distributed system design, including consistency models, partitioning, and failure semantics.
  • Demonstrated experience leading incident response and conducting effective post-incident reviews.
  • Excellent communication and documentation skills.

Benefits

Comp & perks
  • Comprehensive benefits
  • Competitive compensation packages
  • Supportive work-life balance

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, Go, Java, Kubernetes, CI/CD, Linux, observability tooling, performance tuning, distributed system design, automation
Soft Skills
communication, documentation, mentoring, incident response, collaboration, leadership, continuous improvement, problem-solving, organizational skills, capacity planning