Twilio

Site Reliability Engineer

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $152,500 - $224,200 per year

Job Level

Senior, Lead

Tech Stack

Airflow, Apache, AWS, Cloud, Distributed Systems, EC2, Go, Grafana, Java, Kafka, Kubernetes, Prometheus, Python, Terraform

About the role

  • Design, build, and maintain infrastructure and scalable frameworks to support data ingestion, processing, and analysis.
  • Collaborate with stakeholders, analysts, and product teams to understand business requirements and translate them into technical solutions.
  • Architect and implement data streaming solutions using modern data technologies such as Kafka, AWS MSK, Terraform, Hive, Hudi, Presto, and Airflow, and cloud-based services like AWS EKS, Lake Formation, Glue, and Athena.
  • Design and implement frameworks and solutions for performance, reliability, and cost-efficiency.
  • Ensure data quality, integrity, and security throughout the data lifecycle.
  • Stay current with emerging big data technologies and best practices.
  • Mentor early-career engineers and contribute to a culture of continuous learning and improvement.

Requirements

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • 8+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering roles with a focus on infrastructure or backend systems.
  • Strong production experience, including operational management, scaling, partitioning strategies, and tuning for performance and reliability.
  • Hands-on experience with Kubernetes (preferably EKS), including deploying and managing stateful services and operators in Kubernetes environments.
  • Deep understanding of AWS cloud services, particularly those relevant to data infrastructure (e.g., EC2, EBS, S3, IAM, MSK, CloudWatch, VPC, ALB/NLB).
  • Proficiency in infrastructure-as-code tools, such as Terraform or CloudFormation, for managing and automating infrastructure.
  • Expertise in observability tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog) to monitor distributed systems and set up alerting for reliability and latency.
  • Proficient in at least one programming language (e.g., Go, Python, Java, or similar) for building automation, tooling, and contributing to platform services.
  • Experience designing and implementing incident response processes, SLOs/SLIs, and runbooks, and participating in on-call rotations.
  • Strong understanding of distributed systems principles, including consensus, durability, throughput, and availability tradeoffs.
  • Proven track record of driving reliability improvements in high-scale, data-intensive systems and collaborating with platform and data engineering teams.
  • Excellent problem-solving and analytical skills.
  • Strong verbal and written communication skills, with the ability to work effectively in a cross-functional team environment.