
Senior Site Reliability Engineer – Cloud and Data Center Services
Bank of America
full-time
Location Type: Office
Location: Jersey City • New Jersey • 🇺🇸 United States
Salary
💰 $152,600 - $197,900 per year
Job Level
Senior
Tech Stack
Ansible, AWS, Azure, Cloud, Consul, Distributed Systems, DNS, Go, Google Cloud Platform, Grafana, Java, Jenkins, Linux, OpenShift, Prometheus, Python, Shell Scripting, Terraform
About the role
- Responsible for the reliability and support of Foundational Services Platforms and Tools across both on-premises and external cloud environments (Azure, AWS, GCP)
- Design and build solutions for the platforms' non-functional requirements, including monitoring and resiliency
- Proactively monitor and troubleshoot environment performance issues, connectivity issues, security issues, etc.
- Perform deep dives into systemic and latent reliability issues, and support incident and problem management
- Identify, analyze, and resolve infrastructure vulnerabilities and application deployment issues
- Perform blameless root cause analysis (RCA) and partner with product engineering and operations teams across the organization to establish sustainable fixes
- Own application onboarding and provide troubleshooting support throughout the lifecycle of the tools and platforms
- Identify and drive opportunities to improve automation, reduce toil, and improve operational excellence
- Partner with risk and compliance teams to bring visibility, implement the right controls, and remediate vulnerabilities
- Be a key stakeholder in the design of cloud services and collaborate with architecture, engineering, operations and product teams
- Participate in 24x7 on-call coverage providing L3 platform support, including maintaining the schedule for other personnel
Requirements
- BS/MS degree in Computer Science or a related technical field involving systems, or equivalent practical experience
- 5+ years of hands-on experience in Site Reliability Engineering, DevOps, or Infrastructure roles
- Experience with Python, Ansible, Golang, Java and shell scripting
- Certification or expertise in OpenShift architecture, operations, and container orchestration
- Certification in, or deep experience with, Terraform and Terraform Enterprise (TFE), including writing Infrastructure as Code
- Certification in, or solid understanding of, Consul for service discovery and key-value configuration
- Proven track record of building automation in complex environments
- Familiarity with monitoring/observability tools (Prometheus, Grafana, ELK/EFK stacks, etc.)
- Experience in performance, integration, and chaos testing of distributed systems
- Solid knowledge of networking, security, and Linux internals
- Strong understanding of, and background working with, complex IAM infrastructure, including Active Directory, Azure AD Connect, Azure AD, and Ping Identity or other SSO solutions
- Advanced knowledge of Linux OS, DNS, DHCP, Kerberos and Windows Authentication
- Experience with CI/CD tools (Git, Jenkins) and the GitOps model
- Excellent understanding of Linux/Windows operating system administration
- Experience in vulnerability remediation
- Systematic problem-solving approach, sense of ownership and drive
- Ability to juggle competing priorities and adapt to changes in project scope
- Excellent interpersonal, organizational, and communication (written, verbal, and presentation) skills are a must
- Proven ability to work independently with minimal supervision and as part of a team with direct responsibilities
Benefits
- Access to paid time off
- Resources and support to contribute to sustainable growth of business and communities