Tech Stack
AWS, Cloud, Distributed Systems, Go, Java, Kafka, Kubernetes, Microservices, Terraform
About the role
- Take ownership of reliability across the multi-region, cloud-native infrastructure powering Oscilar's AI Risk Decisioning™ platform.
- Architect and operate resilient cloud infrastructure (AWS, Pulumi, Kubernetes).
- Lead initiatives to improve availability, latency, and performance at scale.
- Design and evolve CI/CD pipelines for speed, safety, and repeatability.
- Define the metrics, alerts, and runbooks that form the platform's observability backbone.
- Run chaos experiments and failure simulations to harden the platform.
- Mentor engineers and set SRE best practices across the company.
Requirements
- Proven track record as a senior SRE, DevOps, or infrastructure engineer in high-scale environments.
- Expert-level skills in AWS and Infrastructure as Code (Pulumi, Terraform).
- Strong programming ability in Go and Java.
- Deep understanding of distributed systems (Kafka, ClickHouse) and microservices architecture.
- Mastery of container orchestration (Kubernetes) and production debugging.
- Strong sense of ownership and judgment to balance velocity with reliability.