Software Engineer, Infrastructure Reliability

OpenAI

full-time

Posted on: 9/25/2025

Origin: 🇺🇸 United States • California

💰 $255,000 - $405,000 per year

Mid-Level, Senior

AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Grafana, Kubernetes, Linux, Microservices, Prometheus, Splunk, Terraform

About the role

Design, build, and operate reliable and performant systems used across engineering
Scale and harden the infrastructure that powers AI systems, ensuring it is highly reliable, observable, performant, and secure
Identify and fix performance bottlenecks and inefficiencies to support growth to the next order of magnitude
Dig deep to resolve complex issues and contribute to incident response and postmortems
Continuously improve automation to reduce manual work and improve internal tooling and developer experience
Contribute to development of best practices around system reliability and scalability
Shape technical direction, proactively improve system resilience, and collaborate closely with infra, product, and research teams to support cutting-edge research and global deployments
Own problems end-to-end and operate across the stack

4+ years of relevant industry experience, with 2+ years leading large-scale, complex projects or teams as an engineer or tech lead
Proven experience as a reliability engineer, production engineer, or a similar role in a fast-paced, rapidly scaling company
A passion for distributed systems at scale with a focus on reliability, scalability, security, and continuous improvement
Strong proficiency in cloud infrastructure (such as AWS, GCP, or Azure) and IaC tools such as Terraform
Proficiency in programming / scripting languages
Experience with containerization technologies and container orchestration platforms like Kubernetes
Comfort working in Linux environments and with tools like Kubernetes, Terraform, CI/CD pipelines, and modern observability stacks
Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
Experience with microservices architecture and service mesh technologies
Knowledge of security best practices in cloud environments
Strong understanding of distributed systems, networking, and database technologies
Excellent problem-solving skills and ability to work in a fast-paced environment