Twilio

Software Architect, Reliability Engineering

full-time

Location Type: Remote

Location: California, Colorado, United States

Salary

💰 $227,840 - $335,000 per year

About the role

  • Partner with senior technical leaders across Twilio to set and communicate the reliability strategy, translating business goals into measurable outcomes.
  • Influence company-wide architectural decisions while balancing long-term vision with near-term and compliance needs.
  • Lead the design, implementation, and operation of scalable solutions and paved roads that enable reliable, high-traffic services.
  • Influence company-wide architectural decisions to focus on availability, performance, resilience, and cost efficiency using Kubernetes, AWS, Terraform, and modern observability.
  • Ensure integrity and quality across the service lifecycle; design fault-tolerant architectures, incident response, disaster recovery, and capacity/cost management.
  • Collaborate with product and cross-functional teams to identify reliability risks and convert them into actionable designs, programs, and tooling.
  • Establish and champion reliability practices and drive systemic improvements.
  • Mentor and grow engineers and technical leaders.
  • Track and apply emerging SRE, cloud, and large-scale systems best practices; introduce pragmatic innovations that improve reliability at scale.

Requirements

  • 15+ years of experience in Reliability Engineering, Software Engineering, or DevOps roles focused on infrastructure, backend systems, and reliability, including experience at the principal or architect level.
  • Strong experience in driving strategic technical decisions and defining long-term technical vision.
  • In-depth understanding of the role of Reliability Engineering in a large and diverse SaaS organization.
  • Experience driving cross-org technical architecture outcomes.
  • Knowledge of cloud architecture, DevOps practices, and large-scale systems design with microservices.
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability in high-scale environments.
  • Hands-on experience with Kubernetes (e.g., EKS), deploying and managing stateful services, and cloud services like AWS.
  • Proficiency in infrastructure-as-code tools such as Terraform or CloudFormation for automating infrastructure.
  • Expertise in observability tools (e.g., Prometheus, Grafana, Datadog) for monitoring distributed systems and setting up alerting.
  • Proficiency in at least one programming language (e.g., Go, Python, Java) for building automation and tooling.
  • Experience designing incident response processes, SLOs/SLIs, runbooks, and participating in on-call rotations.
  • Experience running cross-functional post-incident reviews and driving improvements.
  • Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs.
  • Proven track record of leading reliability improvements in data-intensive or mission-critical systems and collaborating with engineering teams.
  • Excellent problem-solving, analytical, verbal, and written communication skills, with the ability to work in cross-functional and distributed environments.
  • Demonstrated leadership in mentoring teams, influencing decisions, and balancing long-term objectives with short-term needs.
  • Ability to influence and build effective working relationships with all levels of the organization.

Benefits

  • Health care insurance
  • 401(k) retirement account
  • Paid sick time
  • Paid personal time off
  • Paid parental leave