SearchStax

Staff Site Reliability Engineer, AWS

full-time

Origin: 🇺🇸 United States

Salary

💰 $170,000 - $240,000 per year

Job Level

Lead

Tech Stack

Apache, AWS, Cloud, Distributed Systems, Docker, EC2, ElasticSearch, Go, Grafana, Jenkins, Kubernetes, Open Source, Prometheus, Python, Terraform

About the role

  • Lead and own scaling AWS infrastructure to support thousands of servers and high-growth workloads.
  • Design and implement automation frameworks for provisioning, monitoring/logging, scaling, and recovery to minimize manual operations.
  • Continuously evaluate and tune systems for latency, throughput, and cost efficiency.
  • Build resilient, self-healing, and observable systems using SLOs, error budgets, and reliability best practices.
  • Partner closely with Development, QA, and Product Engineering teams to deliver highly available and performant systems.
  • Own on-call processes, lead incident management and root-cause analysis, and implement preventive measures.
  • Mentor engineers, act as a technical leader, and set standards for best practices.

Requirements

  • 7+ years in Site Reliability, DevOps, or Infrastructure Engineering roles.
  • Startup experience and a track record of scaling infrastructure to thousands of servers.
  • Hands-on mastery of AWS services (EC2, EKS, RDS, S3, CloudFront, VPC, IAM).
  • Proficiency in Infrastructure as Code (Terraform, CloudFormation, or similar tools).
  • Strong automation and scripting skills in Python, Go, or similar languages (beyond basic scripting).
  • Expertise with monitoring & observability tools (Prometheus, Grafana, Loki, ELK/EFK, Datadog).
  • Experience with CI/CD and containers (Docker, Kubernetes, Jenkins or GitHub Actions).
  • Performance engineering experience: identify bottlenecks and optimize systems for scalability and efficiency.
  • Proven problem-solving ability in diagnosing complex production issues at scale.
  • Experience designing, deploying, and managing multi-region, highly available AWS architectures in production.
  • Experience owning end-to-end observability and leading production incident response/root-cause analysis.
  • Legal authorization to work in the United States (the E-Verify notice and application questions indicate this is required).