Tech Stack
Tools & technologies
- AWS
- Cloud
- JavaScript
- Python
- Terraform
- TypeScript
About the role
Key responsibilities & impact
- Lead reliability-focused design and readiness reviews for new and existing services, ensuring production readiness, clear rollout and rollback strategies, and strong observability for every launch.
- Build, operate, and continuously improve our observability stack (e.g., logging, metrics, tracing) to provide meaningful dashboards, alerts, and runbooks that enable fast, high-quality incident response across engineering teams.
- Own and evolve incident management practices, including on-call participation, incident response processes, and post-incident reviews that drive long-term remediation and learning across teams.
- Plan and execute disaster recovery exercises and game days to validate our resilience posture, test failover and backup strategies, and systematically reduce single points of failure.
- Perform capacity planning and cost optimization for our cloud infrastructure, helping ensure we run a cost-effective environment that meets performance and availability goals as usage grows.
- Identify and drive down systemic reliability risks across application, infrastructure, and process layers—owning cross-team projects that significantly reduce incident frequency and severity over time.
- Collaborate closely with Developer Experience, Security, and product engineering to embed reliability best practices—testing, rollout patterns, guardrails, and “golden paths”—into shared tools and CI/CD pipelines.
- Participate in and help continuously improve the on-call rotation, using real incidents and near-misses to prioritize automation, better alerting, and clearer documentation.
Requirements
What you’ll need
- 5+ years of experience in Site Reliability Engineering, Production Engineering, Infrastructure Engineering, or a closely related role, including hands-on ownership of production systems.
- Strong experience operating modern cloud infrastructure, ideally on AWS, including core services for compute, networking, storage, and security primitives.
- Proficiency with at least one programming language used at Transcend (e.g., JavaScript, TypeScript, or Python), and comfort reading and reviewing application code for reliability and performance concerns.
- Hands-on experience with infrastructure-as-code and CI/CD tooling (e.g., Terraform, CloudFormation, or similar; modern build/deploy pipelines) to reliably provision and change infrastructure.
- Deep familiarity with observability and monitoring systems (e.g., Datadog or equivalent), including designing alerts that balance coverage and noise to avoid alert fatigue while protecting customer experience.
- Proven track record running incident response and post-incident analysis, including root cause identification, clear documentation, and driving follow-through on remediation work.
- Excellent communication and collaboration skills, with experience working across multiple engineering teams to align on reliability goals, share context, and influence technical direction without formal authority.
- Comfort participating in an on-call rotation, and experience helping to design or improve on-call processes, runbooks, and escalation paths.
- Minimum level of education: Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related technical field, or equivalent practical experience.
- Demonstrated ability to thrive in a remote-first, high-autonomy environment, managing priorities, communicating asynchronously, and driving projects to completion with limited oversight.
Benefits
Comp & perks
- Flexible PTO
- Parental leave
- 401(k) match
- Competitive compensation packages that include employee equity
