Tech Stack
Tools & technologies: AWS, Cloud, Distributed Systems, Docker, DynamoDB, Grafana, Kubernetes, MongoDB, Postgres, Prometheus, Python, Redis, Terraform
About the role
Key responsibilities & impact
- Own the reliability, scalability, and operational excellence of our cloud-based services.
- Define and enforce reliability standards.
- Drive the adoption of SRE practices across engineering teams.
- Build the systems and tooling that keep our production infrastructure healthy.
- Define, publish, and continuously refine Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for all critical services, partnering with product and engineering leadership.
- Own the error budget framework: track consumption, enforce error budget policies, and drive reliability investments when budgets are at risk.
- Lead the design and implementation of comprehensive observability platforms — metrics, structured logging, and distributed tracing — to ensure full visibility into production systems.
- Drive toil reduction initiatives by identifying and automating repetitive, manual operational work, targeting measurable reduction in operational burden across teams.
- Design and execute chaos engineering programs to proactively uncover reliability weaknesses in our infrastructure and services before they impact customers.
- Lead blameless postmortem culture: facilitate incident retrospectives, extract systemic learnings, and track corrective action items to completion.
- Build and improve on-call incident response processes, runbooks, and escalation paths; manage and optimize on-call rotation health to prevent burnout.
- Help design, build, and support infrastructure and security technologies within the cloud that offer resiliency, observability, and optimized cost.
- Develop solutions for automated deployment of software and services on our production infrastructure hosted on AWS, applying reliability engineering principles throughout.
- Shape how mission-critical enterprise software solutions are developed and deployed using optimized CI/CD pipelines that embed reliability and quality gates.
- Develop management solutions for services across multiple cloud platforms and data centers, with a focus on fault tolerance and graceful degradation.
- Collaborate with developers to bring new features and services into production using production-readiness reviews and launch checklists.
- Champion reliability engineering best practices across the organization, embedding SRE principles into the software development lifecycle.
- Mentor team members on SRE philosophy, technical decision-making, code reviews, and cloud engineering best practices.
- Participate in roadmap planning, identify areas of improvement, and perform technology evaluation and selection.
Requirements
What you’ll need
- 7+ years of experience in scalable, distributed systems architecture.
- 3+ years of hands-on Site Reliability Engineering experience, including ownership of SLOs and error budget management.
- 4+ years of experience with Cloud Platforms, including AWS.
- 4+ years of experience in infrastructure as code (Terraform, AWS CDK).
- 5+ years of experience in scripting using Python, Shell, or a similar language.
- 3+ years of experience with containerization technologies, including Docker.
- 4+ years of experience with orchestration technologies, including Kubernetes.
- Demonstrated experience designing and operating observability stacks (e.g., Prometheus, Grafana, Datadog, OpenTelemetry, Jaeger, or equivalent).
- Experience with incident management platforms and on-call tooling (e.g., PagerDuty, OpsGenie).
- Experience defining and implementing automated service deployments, including provisions for networking, security, reliability, management, reporting, and configuration management.
- Experience with chaos engineering principles and tools (e.g., Chaos Monkey, LitmusChaos, Gremlin, or equivalent).
- Experience managing databases — PostgreSQL, Redis, DynamoDB, MongoDB.
- In-depth understanding of best practices for deployment automation and production-readiness reviews.
- Experience using Git in a team environment (merge requests, branching, pushes, and pulls).
- CS Degree or equivalent experience.
Benefits
Comp & perks
- Health insurance
- Professional development opportunities
