Salary
💰 $121,500 - $202,500 per year
Tech Stack
Ansible · AWS · Cloud · Distributed Systems · Go · Grafana · Kubernetes · Prometheus · Python · Splunk · Terraform
About the role
- Build self-service and automated components of the machine learning platform to enable the development, deployment, and monitoring of machine learning models.
- Design, monitor, and improve cloud infrastructure solutions that support applications executing at scale.
- Optimize infrastructure spend by conducting utilization reviews, forecasting capacity, and driving cost/performance trade-offs for training and inference.
- Architect multi-cluster/multi-region topologies (e.g., high availability (HA), disaster recovery (DR), failover/federation, blue/green) for ML workloads and lead progressive-delivery patterns (canary, auto-rollback) in CI/CD.
- Ensure a rigorous deployment process using DevOps (GitOps) standards and mentor users in software development best practices.
- Evolve CI/CD from repo-local workflows to reusable pipeline templates with quality/performance gates; standardize GitOps objects/guardrails (e.g., Argo CD Applications/Projects, policy-as-code).
- Define org-wide observability standards (logs/metrics/traces schemas, retention) for ML system and model reliability; drive adoption across teams and integrate with enterprise tools (Prometheus/Grafana + Splunk/Datadog).
- Collaborate with the SRE team to define and drive SRE standards for ML systems by setting and reviewing SLOs/error budgets, partnering on org-wide reliability scorecards and improvement plans, and scaling blameless RCA rituals.
- Institute compatibility and deprecation/versioning policies for clusters and runtimes; integrate enterprise SSO (Okta/AD) and define RBAC scopes across clusters and pipelines.
- Own multi-component roadmap initiatives that measurably move platform & reliability OKRs; communicate major changes and incidents to org-wide forums and host cross-team design sessions.
- Partner with teams across the business to enable reliable adoption of ML by hosting internal workshops, publishing playbooks/templates, and advising teams on adopting platform patterns safely.
Requirements
- Bachelor’s degree
- 7+ years’ relevant work experience or equivalent staff-level impact in platform / infrastructure roles
- Strong software engineering fundamentals and experience developing production-grade software; experience with Python, Go, or a similar language preferred
- Experience leading org-wide platform initiatives (e.g., multi-cluster K8s, CI/CD platform evolution, observability standards) and mentoring senior engineers
- Strong working knowledge of cloud services, their capabilities, and their usage patterns; AWS preferred
- Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments (e.g., Terraform, Ansible, Helm)
- Deep expertise with GitOps practices and tools (Argo CD app-of-apps, RBAC, sync policies) as well as policy-as-code (OPA/Kyverno) for safe rollouts
- Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, Datadog, ELK)
- Deep, hands-on experience with containers and Kubernetes (cluster operations/upgrades, HA/DR patterns)
- Bonus: Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (GPUs)