ONE

Site Reliability Engineer

full-time

Origin: 🇺🇸 United States

Salary

💰 $140,000 - $180,000 per year

Job Level

Mid-Level, Senior

Tech Stack

AWS, Cloud, Distributed Systems, Go, Grafana, Java, JavaScript, Kubernetes, Node.js, Prometheus, Python, Terraform, TypeScript

About the role

  • Ensure stability, scalability, and security of systems powering OnePay's financial products for millions of customers
  • Design, build, and maintain scalable infrastructure and tooling to improve reliability, performance, and availability across the platform
  • Contribute to the evolution of the observability stack, platform libraries, cloud architecture, and CI/CD pipelines
  • Develop automation and monitoring systems to detect, prevent, and remediate incidents before they impact customers
  • Partner closely with product and platform engineering teams to embed reliability best practices in design, development, and deployment
  • Lead root cause analysis and postmortems, driving long-term improvements in resiliency and fault tolerance

Requirements

  • 5+ years of experience as a Software Engineer focused on building and running reliable, large-scale, distributed systems in production
  • 5+ years of operational experience with observability tooling and libraries (metrics, logging, tracing), including Datadog or similar tools (Prometheus, Grafana)
  • Proficiency in at least one programming language (Python, Go, Java, or Node.js preferred) for automation and tooling
  • Proficiency in incident management, including going on-call and writing postmortem reports
  • Excellent collaboration skills with the ability to influence and educate product engineering teams on reliability and observability best practices
  • Hands-on experience with cloud platforms (AWS preferred), container orchestration (Kubernetes), and IaC tools (Terraform, Pulumi)
  • Drive and proactivity; builder and executor mindset
  • Familiarity with functional programming concepts and fp-ts/TypeScript is a plus
  • Authorization to work in the United States (application asks about work authorization and sponsorship)
Pythian

Site Reliability Engineer

Mid · Senior · full-time · 🇮🇳 India
Posted: 15 days ago · Source: jobs.lever.co
AWS, Cloud, Distributed Systems, Docker, Go, Grafana, Kubernetes, Linux, Microservices, Oracle, Prometheus, Python, +2 more
Veeam Software

Staff Site Reliability Engineer

Lead · full-time · 🇺🇸 United States
Posted: 16 days ago · Source: boards.greenhouse.io
Azure, Cloud, Distributed Systems, Go, Grafana, Java, JavaScript, Kubernetes, Prometheus, Terraform, TypeScript
Yuxi Global powered by Veritas Automata

Senior Manager, Application Development

Senior · full-time · 🇺🇸 United States
Posted: 7 days ago · Source: jobs.smartrecruiters.com
AWS, Azure, Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Java, JavaScript, Jenkins, Kubernetes, +9 more
Vimeo

Site Reliability Engineer III

Mid · Senior · full-time · $117k–$179k / year · New York · 🇺🇸 United States
Posted: 23 days ago · Source: boards.greenhouse.io
AWS, Chef, Cloud, Distributed Systems, Go, Grafana, Graphite, Java, Kubernetes, Linux, MySQL, PHP, +4 more
CrowdStrike

Senior Full Stack Software Engineer, Infrastructure

Senior · full-time · $120k–$180k / year · 🇺🇸 United States
Posted: 7 days ago · Source: crowdstrike.wd5.myworkdayjobs.com
AWS, Cloud, Cyber Security, Distributed Systems, Go, Grafana, GRPC, JavaScript, Kubernetes, Linux, Microservices, Prometheus, +4 more