Salary
💰 $83,000 - $155,000 per year
Tech Stack
AWS, Azure, Cloud, Docker, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Splunk, Terraform
About the role
- Drive reliability and performance by defining SLOs/SLIs, improving observability, and addressing system bottlenecks across cloud environments
- Automate infrastructure and operations using Terraform, Kubernetes, and CI/CD tools to eliminate toil and enable scalable, fault-tolerant deployments
- Collaborate cross-functionally with product, infrastructure, and DevOps teams to reduce incidents and ensure architectural clarity
- Lead incident management by participating in on-call rotations, conducting postmortems, and implementing automated recovery
- Build and maintain monitoring systems using Prometheus, Grafana, AppDynamics, and Splunk for real-time alerting and root cause analysis
- Develop platform tooling and pipelines for container orchestration, third-party integrations, and cloud-native operations
- Maintain and improve live services by measuring and monitoring latency and overall system health
- Define and use KPIs to understand service performance and identify corrective actions
- Create, manage, and use dashboards for continuous monitoring and health checks of applications and infrastructure
- Design and implement solutions to customer friction points and improve the service lifecycle from inception through sustainment
- Assist in creating and maintaining automation to improve reliability and velocity during maintenance tasks
- Mentor engineers and champion SRE best practices, embedding a reliability-first culture and ensuring technical excellence
Requirements
- Bachelor’s degree in computer science or a related discipline, or equivalent work experience required; advanced degree preferred
- 5 to 8 years of related experience
- Experience in the securities or financial services industry is a plus
- Strong expertise in cloud infrastructure (Azure, AWS, or GCP)
- Experience with containerization (Docker, Kubernetes)
- Infrastructure as Code experience (Terraform, Helm)
- Proficiency with observability and monitoring tools: Prometheus, Grafana, AppDynamics, Datadog, Splunk
- Experience with incident response and on-call support
- Programming/scripting skills in Python, Go, or Java
- Deep understanding of SRE principles: SLAs, SLOs, error budgets, postmortems, reliability-focused system design
- Familiarity with automated testing, DevSecOps practices, CI/CD, performance engineering, and security controls
- Strong collaboration and communication skills; experience in Agile environments
- Demonstrated success in software engineering and coding beyond simple scripts