Salary
💰 $255,000 - $405,000 per year
Tech Stack
AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Grafana, Kubernetes, Linux, Microservices, Prometheus, Splunk, Terraform
About the role
- Design, build, and operate reliable and performant systems used across engineering
- Scale and harden infrastructure that powers AI systems, ensuring systems are highly reliable, observable, performant, and secure
- Identify and fix performance bottlenecks and inefficiencies to support growth to the next order of magnitude
- Dig deep to resolve complex issues and contribute to incident response and postmortems
- Continuously improve automation to reduce manual work and improve internal tooling and developer experience
- Contribute to development of best practices around system reliability and scalability
- Shape technical direction, proactively improve system resilience, and collaborate closely with infra, product, and research teams to support cutting-edge research and global deployments
- Own problems end-to-end and operate across the stack
Requirements
- 4+ years of relevant industry experience, with 2+ years leading large-scale, complex projects or teams as an engineer or tech lead
- Proven experience as a reliability engineer, production engineer, or in a similar role at a fast-paced, rapidly scaling company
- A passion for distributed systems at scale with a focus on reliability, scalability, security, and continuous improvement
- Strong proficiency in cloud infrastructure (such as AWS, GCP, or Azure) and IaC tools such as Terraform
- Proficiency in programming / scripting languages
- Experience with containerization technologies and container orchestration platforms like Kubernetes
- Comfort working in Linux environments and with tools like Kubernetes, Terraform, CI/CD pipelines, and modern observability stacks
- Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
- Experience with microservices architecture and service mesh technologies
- Knowledge of security best practices in cloud environments
- Strong understanding of distributed systems, networking, and database technologies
- Excellent problem-solving skills and ability to work in a fast-paced environment