Salary
💰 $121,500 - $202,500 per year
Tech Stack
Ansible · AWS · Cloud · Distributed Systems · Go · Grafana · Kubernetes · Prometheus · Python · Splunk · Terraform
About the role
- Build self-service and automated components of the machine learning platform to enable the development, deployment, and monitoring of machine learning models.
- Design, monitor, and improve cloud infrastructure solutions that support applications executing at scale.
- Optimize infrastructure spend by conducting utilization reviews, forecasting capacity, and driving cost/performance trade-offs for training and inference.
- Architect multi-cluster/multi-region topologies (e.g., high availability (HA), disaster recovery (DR), failover/federation, blue/green) for ML workloads and lead progressive-delivery patterns (canary, auto-rollback) in CI/CD.
- Ensure a rigorous deployment process using DevOps (GitOps) standards and mentor users in software development best practices.
- Evolve CI/CD from repo-local workflows to reusable pipeline templates with quality/performance gates; standardize GitOps objects/guardrails (e.g., Argo CD Applications/Projects, policy-as-code).
- Define org-wide observability standards (logs/metrics/traces schemas, retention) for ML system and model reliability; drive adoption across teams and integrate with enterprise tools (Prometheus/Grafana + Splunk/Datadog).
- Collaborate with the SRE team to define and drive SRE standards for ML systems by setting and reviewing SLOs/error budgets, partnering on org-wide reliability scorecards and improvement plans, and scaling blameless RCA rituals.
- Institute compatibility and deprecation/versioning policies for clusters and runtimes; integrate enterprise SSO (Okta/AD) and define RBAC scopes across clusters and pipelines.
- Own multi-component roadmap initiatives that measurably move platform & reliability OKRs; communicate major changes and incidents to org-wide forums and host cross-team design sessions.
- Partner with teams across the business to enable reliable adoption of ML by hosting internal workshops, publishing playbooks/templates, and advising teams on adopting platform patterns safely.
Requirements
- Bachelor’s degree
- 7+ years’ relevant work experience or equivalent staff-level impact in platform / infrastructure roles
- Strong software engineering fundamentals and experience developing production-grade software; experience with Python, Go, or a similar language preferred
- Experience leading org-wide platform initiatives (e.g., multi-cluster K8s, CI/CD platform evolution, observability standards) and mentoring senior engineers
- Strong working knowledge of cloud services, their capabilities, and their usage patterns; AWS preferred
- Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments (e.g., Terraform, Ansible, Helm)
- Deep expertise with GitOps practices and tools (Argo CD app-of-apps, RBAC, sync policies) as well as policy-as-code (OPA/Kyverno) for safe rollouts
- Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, Datadog, ELK)
- Deep, hands-on experience with containers and Kubernetes (cluster operations/upgrades, HA/DR patterns)
- Bonus: Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (GPUs)