OpenAI

Software Engineer, Infrastructure Reliability

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $255,000 - $405,000 per year

Job Level

Mid-Level, Senior

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Google Cloud Platform, Grafana, Kubernetes, Linux, Microservices, Prometheus, Splunk, Terraform

About the role

  • Design, build, and operate reliable and performant systems used across engineering
  • Scale and harden the infrastructure that powers AI systems, ensuring it is highly reliable, observable, performant, and secure
  • Identify and fix performance bottlenecks and inefficiencies to support growth to the next order of magnitude
  • Dig deep to resolve complex issues and contribute to incident response and postmortems
  • Continuously improve automation to reduce manual work and improve internal tooling and developer experience
  • Contribute to development of best practices around system reliability and scalability
  • Shape technical direction, proactively improve system resilience, and collaborate closely with infra, product, and research teams to support cutting-edge research and global deployments
  • Own problems end-to-end and operate across the stack

Requirements

  • 4+ years of relevant industry experience, with 2+ years leading large-scale, complex projects or teams as an engineer or tech lead
  • Proven experience as a reliability engineer, production engineer, or in a similar role at a fast-paced, rapidly scaling company
  • A passion for distributed systems at scale with a focus on reliability, scalability, security, and continuous improvement
  • Strong proficiency in cloud infrastructure (like AWS, GCP, Azure) and IaC tools such as Terraform
  • Proficiency in programming / scripting languages
  • Experience with containerization technologies and container orchestration platforms like Kubernetes
  • Comfort working in Linux environments and with tools like Kubernetes, Terraform, CI/CD pipelines, and modern observability stacks
  • Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
  • Experience with microservices architecture and service mesh technologies
  • Knowledge of security best practices in cloud environments
  • Strong understanding of distributed systems, networking, and database technologies
  • Excellent problem-solving skills and ability to work in a fast-paced environment