Design and implement highly available, fault-tolerant systems supporting critical financial transactions.
Architect infrastructure solutions using AWS best practices, optimizing for cost, performance, and reliability.
Lead complex incident response efforts, coordinating across teams to restore service rapidly.
Drive postmortem processes for high-severity incidents, ensuring action items are identified and completed.
Establish and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for key services.
Design and implement disaster recovery strategies and business continuity plans.
Build advanced Infrastructure as Code solutions using Terraform, including modules, workspaces, and state management.
Architect and optimize multi-cluster EKS environments, including pod autoscaling, cluster autoscaling, and resource optimization.
Design observability strategies using Datadog and Splunk, including metrics, dashboards, and alerting that support proactive detection.
Implement progressive delivery mechanisms (canary and blue-green deployments) within GitOps workflows.
Build automation frameworks that reduce operational toil and improve team efficiency.
Partner with development teams to improve application reliability, including design reviews and architectural guidance.
Mentor junior and intermediate SREs through coaching and code reviews.
Contribute to architectural decisions that impact platform reliability and scalability.
Evangelize SRE best practices across the engineering organization.
Participate in on-call rotations and drive improvements to reduce on-call burden.
Implement and maintain zero-trust security controls across infrastructure.
Ensure systems meet financial services regulatory requirements and internal compliance standards.
Conduct security reviews of infrastructure changes and deployment processes.
Participate in audit preparations and respond to compliance-related inquiries.

Requirements

Bachelor’s degree in Computer Science, Information Systems, or a related field, or equivalent experience.
4 to 7 years of Site Reliability Engineering experience (or equivalent), with a track record operating large-scale production systems.
Deep, hands-on expertise in AWS across a broad range of services and architectural patterns.
Advanced Kubernetes knowledge, including custom resources, operators, and cluster federation concepts.
Expert proficiency in Terraform, including module development, state management, and complex workflow orchestration.
Strong programming skills in Python and/or Go, with ability to develop production-quality tools and services.
Production experience implementing observability at scale using Datadog, Splunk, or similar platforms.
Demonstrated experience establishing and maintaining CI/CD pipelines at enterprise scale.
Deep understanding of GitOps principles and experience with tools such as ArgoCD or Flux.
Proven ability to lead complex incident response and conduct thorough postmortems.
Strong understanding of networking, security, and infrastructure design patterns.
Experience mentoring engineers and conducting technical training.

Benefits

Medical, dental, vision and life insurance
Retirement savings – 401(k) plan with generous company matching contributions (up to 6%), financial advisory services, potential company discretionary contribution, and a broad investment lineup
Tuition reimbursement up to $5,250/year
Business-casual environment that includes the option to wear jeans
Generous paid time off upon hire – a paid time off program plus ten paid company holidays and three floating holidays each calendar year
Paid volunteer time – 16 hours per calendar year
Leave of absence programs – including paid parental leave, paid short- and long-term disability, and leave under the Family and Medical Leave Act (FMLA)
Business Resource Groups (BRGs) – BRGs facilitate inclusion and collaboration across our business internally and throughout the communities where we live, work and play. BRGs are open to all.

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

AWS, Kubernetes, Terraform, Python, Go, Datadog, Splunk, CI/CD, GitOps, zero-trust security

Soft Skills

leadership, mentoring, incident response, communication, collaboration, coaching, problem-solving, organizational skills, proactive detection, postmortem analysis

Certifications

Bachelor’s degree in Computer Science, Bachelor’s degree in Information Systems