Define and execute the reliability engineering roadmap, aligning infrastructure and AI-native architecture with Filevine’s enterprise growth and platform modernization.
Balance centralized platform capabilities with distributed ownership, ensuring the reliability model scales across a diversifying technology portfolio.
Establish and manage SLO/SLI/error budget frameworks to create a shared language for balancing feature velocity with system stability.
Lead infrastructure cost management (optimization and forecasting), capacity planning, and disaster recovery to meet rigorous enterprise contractual commitments.
Lead and scale a multi-disciplinary organization (DevOps, SRE, DBRE, Tooling), fostering a culture of ownership, high craftsmanship, and clear career growth.
Drive continuous improvement through DORA metrics, incident trend analysis, and systematic toil reduction to enhance service availability and deployment health.
Deliver self-service tooling, guardrails, and documentation that allow feature teams to operate their own services effectively without bottlenecks.
Act as the primary engineering interface for the CISO to advance compliance posture (FedRAMP, SOC 2, CJIS, ISO) and translate security needs into pragmatic action.
Collaborate with the CTO, CPO, and Architect to communicate risks and investment needs, positioning reliability as a key enabler for enterprise go-to-market success.

Requirements

15+ years of engineering experience, with 7+ years specifically leading infrastructure, reliability, or platform teams at scale in product-driven companies.
Proven track record managing organizations of 40+ engineers across SRE, DevOps, and Tooling, including developing multiple layers of management.
Demonstrated experience evolving reliability operating models to meet the shifting needs of a scaling business.
Deep expertise operating in regulated sectors (Legal Tech, Fintech, Gov, or Healthcare) where compliance and data sensitivity are primary constraints.
Practical, production-hardened understanding of SRE principles, including SLOs, error budgets, toil reduction, and incident management.
Strong technical command of AWS, container orchestration, Terraform (IaC), CI/CD, and modern observability stacks.
Direct experience owning cloud infrastructure budgets and successfully driving meaningful cost optimization and forecasting.
Familiarity with the reliability requirements for modern AI workloads, such as model serving, vector search, and data pipeline integrity.
Ability to engage the C-suite on risk trade-offs and transformation progress, paired with a "builder mentality" and a drive to solve complex, high-stakes problems.

Benefits

Medical, Dental, & Vision Insurance (for full-time employees)
Competitive & Fair Pay
Maternity & paternity leave (for full-time employees)
Short- & long-term disability
Opportunity to learn from a dedicated leadership team
Top-of-the-line company swag

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

reliability engineering, SLO, SLI, error budget, cost management, capacity planning, disaster recovery, DORA metrics, incident management, toil reduction

Soft Skills

leadership, collaboration, communication, ownership, high craftsmanship, continuous improvement, risk management, problem-solving, organizational development, career growth

Certifications

FedRAMP, SOC 2, CJIS, ISO