Site Reliability Engineer – SRE
Bright Vision Technologies
Site Reliability Engineer ensuring operational excellence for large-scale systems. Collaborating with development and operations teams to enhance infrastructure reliability and performance.
Tech Stack
Tools & technologies
Distributed Systems, Go, Grafana, Java, Kubernetes, Linux, Prometheus, Python
About the role
Key responsibilities & impact
- Ensure the availability, performance, and operational excellence of large-scale distributed systems in production.
- Live at the boundary between development and operations, applying strong software engineering principles to infrastructure and operations problems.
- Continuously push the platform toward higher reliability with lower operational toil.
- Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services.
- Lead incident response and resolution for production issues.
- Design and implement comprehensive monitoring, logging, and tracing strategies.
- Build and maintain robust on-call processes, runbooks, and escalation paths.
- Automate operational toil aggressively by writing production-grade tooling.
- Architect and operate large-scale Kubernetes clusters and container-based workloads.
- Design CI/CD pipelines that promote safe, frequent, and observable releases.
- Lead capacity planning and performance engineering activities.
- Partner closely with application development teams to embed reliability practices early in design.
- Drive continuous improvement of security posture in collaboration with security teams.
- Mentor engineers across the organization on SRE practices.
Requirements
What you’ll need
- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
- Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
- Strong programming skills in at least one of Python, Go, or Java, with the ability to build robust automation and tooling.
- Deep, hands-on experience operating Linux at scale, including networking, performance tuning, and systems-level troubleshooting.
- Production experience operating Kubernetes and container-based workloads.
- Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
- Hands-on experience designing and operating CI/CD pipelines for both infrastructure and applications.
- Solid understanding of distributed system design, including consistency models, partitioning, and failure semantics.
- Demonstrated experience leading incident response and conducting effective post-incident reviews.
- Excellent communication and documentation skills.
Benefits
Comp & perks
- Comprehensive benefits
- Competitive compensation packages
- Supportive work-life balance