Site Reliability Engineer

Twilio

full-time

Posted on: 7/31/2025

Location: California • 🇺🇸 United States

Salary

💰 $152,500 - $224,200 per year

Job Level

Senior, Lead

Tech Stack

Airflow, Apache, AWS, Cloud, Distributed Systems, EC2, Go, Grafana, Java, Kafka, Kubernetes, Prometheus, Python, Terraform

About the role

Design, build, and maintain infrastructure and scalable frameworks to support data ingestion, processing, and analysis.
Collaborate with stakeholders, analysts, and product teams to understand business requirements and translate them into technical solutions.
Architect and implement data streaming solutions using modern data technologies such as Kafka, AWS MSK, Terraform, Hive, Hudi, Presto, and Airflow, and cloud-based services such as AWS EKS, Lake Formation, Glue, and Athena.
Design and implement frameworks and solutions for performance, reliability, and cost-efficiency.
Ensure data quality, integrity, and security throughout the data lifecycle.
Stay current with emerging big data technologies and best practices.
Mentor early-career engineers and contribute to a culture of continuous learning and improvement.

Requirements

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
8+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles with a focus on infrastructure or backend systems.
Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability.
Hands-on experience with Kubernetes (preferably EKS), including deploying and managing stateful services and operators in Kubernetes environments.
Deep understanding of AWS cloud services, particularly those relevant to data infrastructure (e.g., EC2, EBS, S3, IAM, MSK, CloudWatch, VPC, ALB/NLB).
Proficiency in infrastructure-as-code tools, such as Terraform or CloudFormation, for managing and automating infrastructure.
Expertise in observability tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog) to monitor distributed systems and set up alerting for reliability and latency.
Proficient in at least one programming language (e.g., Go, Python, Java, or similar) for building automation, tooling, and contributing to platform services.
Experience designing and implementing incident response processes, SLOs/SLIs, runbooks, and participating in on-call rotations.
Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs.
Proven track record of driving reliability improvements in high-scale, data-intensive systems and collaborating with platform and data engineering teams.
Excellent problem-solving and analytical skills.
Strong verbal and written communication skills, with the ability to work effectively in a cross-functional team environment.
