BNY

Vice President, Site Reliability Engineer

Full-time

Origin: 🇺🇸 United States

Salary

💰 $83,000 - $155,000 per year

Job Level

Lead

Tech Stack

AWS, Azure, Cloud, Docker, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Splunk, Terraform

About the role

  • Drive reliability and performance by defining SLOs/SLIs, improving observability, and addressing system bottlenecks across cloud environments
  • Automate infrastructure and operations using Terraform, Kubernetes, and CI/CD tools to eliminate toil and enable scalable, fault-tolerant deployments
  • Collaborate cross-functionally with product, infrastructure, and DevOps teams to reduce incidents and ensure architectural clarity
  • Lead incident management by participating in on-call rotations, conducting postmortems, and implementing automated recovery
  • Build and maintain monitoring systems using Prometheus, Grafana, AppDynamics, and Splunk for real-time alerting and root cause analysis
  • Develop platform tooling and pipelines for container orchestration, third-party integrations, and cloud-native operations
  • Maintain and improve live services by measuring and monitoring latency and overall system health
  • Define and use KPIs to understand service performance and identify corrective actions
  • Create, manage, and use dashboards for continuous monitoring and health checks of applications and infrastructure
  • Design and implement solutions to customer friction points and improve the service lifecycle from inception through sustainment
  • Assist in creating and maintaining automation to improve reliability and velocity during maintenance tasks
  • Mentor engineers and champion SRE best practices, embedding a reliability-first culture and ensuring technical excellence

Requirements

  • Bachelor’s degree in computer science or a related discipline, or equivalent work experience required; advanced degree preferred
  • 5 to 8 years of related experience
  • Experience in the securities or financial services industry is a plus
  • Strong expertise in cloud infrastructure (Azure, AWS, or GCP)
  • Experience with containerization (Docker, Kubernetes)
  • Infrastructure as Code experience (Terraform, Helm)
  • Proficiency with observability and monitoring tools: Prometheus, Grafana, AppDynamics, Datadog, Splunk
  • Experience with incident response and on-call support
  • Programming/scripting skills in Python, Go, or Java
  • Deep understanding of SRE principles: SLAs, SLOs, error budgets, postmortems, reliability-focused system design
  • Familiarity with automated testing, DevSecOps practices, CI/CD, performance engineering, and security controls
  • Strong collaboration and communication skills; experience in Agile environments
  • Demonstrated success in technical engineering and coding beyond simple scripting