Grainger

Staff Software Engineer, Machine Learning Operations

Full-time

Location: 🇺🇸 United States • Illinois

Salary

💰 $121,500 - $202,500 per year

Job Level

Lead

Tech Stack

Ansible • AWS • Cloud • Distributed Systems • Go • Grafana • Kubernetes • Prometheus • Python • Splunk • Terraform

About the role

  • Build self-service and automated components of the machine learning platform to enable the development, deployment, and monitoring of machine learning models.
  • Design, monitor, and improve cloud infrastructure solutions that support applications executing at scale.
  • Optimize infrastructure spend by conducting utilization reviews, forecasting capacity, and driving cost/performance trade-offs for training and inference.
  • Architect multi-cluster/multi-region topologies for ML workloads, including high availability (HA), disaster recovery (DR), failover/federation, and blue/green deployments, and lead progressive delivery patterns (canary, auto-rollback) in CI/CD.
  • Ensure a rigorous deployment process using DevOps (GitOps) standards and mentor users in software development best practices.
  • Evolve CI/CD from repo-local workflows to reusable pipeline templates with quality/performance gates; standardize GitOps objects/guardrails (e.g., Argo CD Applications/Projects, policy-as-code).
  • Define org-wide observability standards (logs/metrics/traces schemas, retention) for ML system and model reliability; drive adoption across teams and integrate with enterprise tools (Prometheus/Grafana + Splunk/Datadog).
  • Collaborate with the SRE team to define and drive SRE standards for ML systems by setting and reviewing SLOs/error budgets, partnering on org-wide reliability scorecards and improvement plans, and scaling blameless RCA rituals.
  • Institute compatibility, deprecation, and versioning policies for clusters and runtimes; integrate enterprise SSO (Okta/AD) and define RBAC scopes across clusters and pipelines.
  • Own multi-component roadmap initiatives that measurably move platform & reliability OKRs; communicate major changes and incidents to org-wide forums and host cross-team design sessions.
  • Partner with teams across the business to enable reliable adoption of ML by hosting internal workshops, publishing playbooks/templates, and advising teams on adopting platform patterns safely.

Requirements

  • Bachelor’s degree
  • 7+ years’ relevant work experience or equivalent staff-level impact in platform / infrastructure roles
  • Strong software engineering fundamentals and experience developing production-grade software; experience with Python, Go, or a similar language preferred
  • Experience leading org-wide platform initiatives (e.g., multi-cluster K8s, CI/CD platform evolution, observability standards) and mentoring senior engineers
  • Strong working knowledge of cloud services and their capabilities; AWS preferred
  • Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments (e.g., Terraform, Ansible, Helm)
  • Deep expertise with GitOps practices and tools (Argo CD app-of-apps, RBAC, sync policies) as well as policy-as-code (OPA/Kyverno) for safe rollouts
  • Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, Datadog, ELK)
  • Deep, hands-on experience with containers and Kubernetes (cluster operations/upgrades, HA/DR patterns)
  • Bonus: Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (GPUs)