Deel

Site Reliability Engineer

full-time

Origin: 🇺🇸 United States

Job Level

Mid-Level, Senior

Tech Stack

AWS, Cloud, Docker, Grafana, JavaScript, Kafka, Kubernetes, Node.js, Prometheus, RabbitMQ, Terraform

About the role

  • Maintain uptime and reliability across critical systems, with a focus on scalability, observability, and incident prevention.
  • Design and manage cloud infrastructure using tools like Terraform, Kubernetes, and CI/CD pipelines.
  • Automate operations for routine tasks, monitoring, deployment, and disaster recovery.
  • Support and improve on-call processes, including incident response, retrospectives, and tooling.
  • Collaborate cross-functionally with platform, security, and product teams to implement best practices and ship reliable software.
  • Build systems for visibility: develop dashboards, alerts, and documentation to monitor and report on system health.
  • Contribute to infrastructure projects that improve security, performance, and developer velocity.

Requirements

  • Hands-on experience operating cloud-based systems (AWS preferred)
  • Proficiency with Kubernetes, Helm, Docker
  • Familiarity with CI/CD tooling and deployment pipelines
  • Strong understanding of observability tools (Datadog, Grafana, Prometheus, etc.)
  • Ability to troubleshoot issues quickly and communicate clearly
  • Solid scripting or programming fundamentals (Node.js experience is a plus)
  • Good instincts around systems design, incident management, and reliability practices
  • Comfortable working in high-speed, high-scale environments
  • Nice to have: Experience with messaging systems like RabbitMQ, Kafka, or NATS
  • Nice to have: Exposure to internal developer platforms or tooling
  • Nice to have: Prior experience in platform, DevOps, or infrastructure teams
  • Nice to have: Previous experience supporting sandbox, staging, or demo environments