Lead global operations for core infrastructure platforms, including Kubernetes (EKS), developer environments, CI/CD systems, and enterprise integrations, ensuring performance, reliability, and security.
Define and execute the roadmap for secure, multi-tenant infrastructure that powers engineering and AI/ML workloads across the company.
Drive automation for build and release processes, incident response, and compliance readiness while maintaining strong reliability standards.
Oversee platform-wide observability, including logging, monitoring, distributed tracing, and SLO instrumentation.
Partner with AI platform teams to operationalize next-gen infrastructure such as GPU provisioning, MCP/Agent Marketplace, and model-serving environments.
Recruit, mentor, and scale a high-performing, globally distributed team that fosters innovation and technical excellence.
Represent infrastructure operations in strategic planning and architectural reviews, influencing company-wide platform investments and direction.
Requirements
10+ years of proven software engineering experience, with 3+ years leading high-performing infrastructure or platform teams.
Deep technical expertise in cloud-native systems, Kubernetes, CI/CD platforms, and distributed systems operations.
Experience designing and scaling AI/ML platforms for model training, deployment, and monitoring.
Demonstrated success in building reusable, self-service platforms that accelerate developer productivity.
Growth-minded leadership, with exceptional communication and collaboration skills and the ability to thrive in cross-functional environments.
Benefits
Comprehensive medical coverage for you and your family
Unlimited PTO
A 401(k) plan with matching
12 weeks of paid parental leave
Employee Stock Purchase Plan
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Kubernetes, CI/CD, cloud-native systems, distributed systems operations, AI/ML platforms, model training, model deployment, observability, automation, incident response