Tech Stack
AWS, Cloud, Docker, Grafana, JavaScript, Kafka, Kubernetes, Node.js, Prometheus, RabbitMQ, Terraform
About the role
- Maintain uptime and reliability across critical systems, with a focus on scalability, observability, and incident prevention.
- Design and manage cloud infrastructure using tools like Terraform, Kubernetes, and CI/CD pipelines.
- Automate operations for routine tasks, monitoring, deployment, and disaster recovery.
- Support and improve on-call processes, including incident response, retrospectives, and tooling.
- Collaborate cross-functionally with platform, security, and product teams to implement best practices and ship reliable software.
- Build systems for visibility: develop dashboards, alerts, and documentation to monitor and report on system health.
- Contribute to infrastructure projects that improve security, performance, and developer velocity.
Requirements
- Hands-on experience operating cloud-based systems (AWS preferred)
- Proficiency with Kubernetes, Helm, and Docker
- Familiarity with CI/CD tooling and deployment pipelines
- Strong understanding of observability tools (Datadog, Grafana, Prometheus, etc.)
- Ability to troubleshoot issues quickly and communicate clearly
- Solid scripting or programming fundamentals (Node.js experience is a plus)
- Good instincts around systems design, incident management, and reliability practices
- Comfort working in high-speed, high-scale environments
- Nice to have: Experience with messaging systems like RabbitMQ, Kafka, or NATS
- Nice to have: Exposure to internal developer platforms or tooling
- Nice to have: Prior experience in platform, DevOps, or infrastructure teams
- Nice to have: Previous experience supporting sandbox, staging, or demo environments