Site Reliability Engineer

Deel

full-time

Posted on: 8/28/2025

Origin: 🇺🇸 United States

Mid-Level, Senior

AWS, Cloud, Docker, Grafana, JavaScript, Kafka, Kubernetes, Node.js, Prometheus, RabbitMQ, Terraform

About the role

Maintain uptime and reliability across critical systems, with a focus on scalability, observability, and incident prevention.
Design and manage cloud infrastructure using tools like Terraform, Kubernetes, and CI/CD pipelines.
Automate operations for routine tasks, monitoring, deployment, and disaster recovery.
Support and improve on-call processes, including incident response, retrospectives, and tooling.
Collaborate cross-functionally with platform, security, and product teams to implement best practices and ship reliable software.
Build systems for visibility: develop dashboards, alerts, and documentation to monitor and report on system health.
Contribute to infrastructure projects that improve security, performance, and developer velocity.

Requirements

Hands-on experience operating cloud-based systems (AWS preferred)
Proficiency with Kubernetes, Helm, Docker
Familiarity with CI/CD tooling and deployment pipelines
Strong understanding of observability tools (Datadog, Grafana, Prometheus, etc.)
Ability to troubleshoot issues quickly and communicate clearly
Solid scripting or programming fundamentals (Node.js experience is a plus)
Good instincts around systems design, incident management, and reliability practices
Comfortable working in high-speed, high-scale environments
Nice to have: Experience with messaging systems like RabbitMQ, Kafka, or NATS
Nice to have: Exposure to internal developer platforms or tooling
Nice to have: Prior experience in platform, DevOps, or infrastructure teams
Nice to have: Previous experience supporting sandbox, staging, or demo environments