Salary
💰 $121,500 - $202,500 per year
Tech Stack
AWS, Cloud, Distributed Systems, Docker, Go, Grafana, Kafka, Kubernetes, Prometheus, Python, Spark, Splunk, Terraform
About the role
- Build self-service and automated components of the machine learning platform to enable the development, deployment, scaling, and monitoring of machine learning models
- Ship production platform components end-to-end across multiple modules; own reliability, performance, security, and cost from design through operation
- Design Helm releases and author GitOps objects (ArgoCD Applications/Projects) with RBAC/sync policies; keep deployments predictable and auditable
- Collaborate with machine learning, network, security, infrastructure, and platform engineers to ensure performant access to data, compute, and networked services
- Ensure a rigorous deployment process using DevOps standards and mentor users in software development best practices
- Partner with teams across the business to drive broader adoption of ML, enabling teams to improve the pace and quality of ML system development
- Develop tools and services that form the backbone of Grainger’s AI-driven features leveraging Deep Learning, Natural Language Processing / Generative AI, Computer Vision, and beyond
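The GitOps responsibility above (Helm releases driven by ArgoCD Applications with sync policies) might look something like the following minimal sketch. All names, namespaces, and the repository URL are hypothetical placeholders, not Grainger systems:

```yaml
# Hypothetical ArgoCD Application deploying a Helm-based model-serving
# component. Automated sync with pruning and self-heal keeps the cluster
# state matching Git, which is what makes deployments predictable/auditable.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: model-serving            # placeholder component name
  namespace: argocd
spec:
  project: ml-platform           # AppProject scoping RBAC and destinations
  source:
    repoURL: https://example.com/org/ml-platform-charts.git  # placeholder repo
    targetRevision: main
    path: charts/model-serving
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: model-serving
  syncPolicy:
    automated:
      prune: true                # delete resources removed from Git
      selfHeal: true             # revert out-of-band cluster drift
    syncOptions:
      - CreateNamespace=true
```

Access control would typically be scoped through a matching ArgoCD AppProject, which restricts source repos, destination clusters/namespaces, and team roles.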
Requirements
- Bachelor’s degree and 5+ years’ relevant work experience or an equivalent combination of education and experience
- Track record building and operating production-grade, cloud-deployed systems (AWS preferred) with strong software engineering fundamentals (Python/Go or similar)
- Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments using DevOps or GitOps best practices (e.g., Terraform/Helm + GitHub Actions/ArgoCD)
- Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, DataDog, ELK)
- Familiarity with containerization as well as container management and orchestration technologies (e.g., Docker, Kubernetes)
- Ability to work collaboratively in a team environment
- Bonus: Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (e.g., GPUs)
- Bonus: Working knowledge of the machine learning lifecycle and experience working with machine learning systems and associated frameworks/tools, particularly for monitoring and observability
- Bonus: Experience with big data technologies, distributed computing frameworks, and/or streaming data processing tools (e.g., Spark, Kafka, Presto, Flink)
- Bonus: Experience deploying, evaluating, testing, or otherwise supporting GenAI applications and their components (e.g., LLMs, vector databases)
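For the monitoring and observability requirement, a Prometheus alerting rule is one common integration pattern. This is an illustrative sketch only; the job name, metric, and threshold are assumptions, not details from the posting:

```yaml
# Hypothetical Prometheus alerting rule for a model-serving service.
# Fires when p99 request latency stays above 500ms for 10 minutes.
groups:
  - name: ml-platform-alerts
    rules:
      - alert: ModelServingHighLatency
        expr: >
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="model-serving"}[5m]))
            by (le)) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency for model-serving above 500ms for 10m"
```

Rules like this would be loaded via Prometheus's `rule_files` configuration (or a Kubernetes PrometheusRule resource when using the Prometheus Operator) and routed through Alertmanager.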