Salary
💰 $121,500 - $202,500 per year
Tech Stack
AWS, Cloud, Distributed Systems, Docker, Go, Grafana, Kafka, Kubernetes, Prometheus, Python, Spark, Splunk, Terraform
About the role
- Build self-service and automated components of the machine learning platform to enable the development, deployment, scaling, and monitoring of machine learning models
- Ship production platform components end-to-end across multiple modules; own reliability, performance, security, and cost from design through operation
- Design Helm releases and author GitOps objects (ArgoCD Applications/Projects) with RBAC/sync policies; keep deployments predictable and auditable
- Collaborate with machine learning, network, security, infrastructure, and platform engineers to ensure performant access to data, compute, and networked services
- Ensure a rigorous deployment process using DevOps standards and mentor users in software development best practices
- Partner with teams across the business to drive broader adoption of ML, enabling teams to improve the pace and quality of ML system development
- Develop tools and services that form the backbone of Grainger’s AI-driven features leveraging Deep Learning, Natural Language Processing / Generative AI, Computer Vision, and beyond
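The GitOps responsibility above (Helm releases driven by ArgoCD Applications with sync policies) might look something like the following minimal sketch. All names, namespaces, and the repository URL are hypothetical placeholders, not Grainger systems:

```yaml
# Hypothetical ArgoCD Application deploying a Helm-based model-serving
# component. Automated sync with pruning and self-heal keeps the cluster
# state matching Git, which is what makes deployments predictable/auditable.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: model-serving            # placeholder component name
  namespace: argocd
spec:
  project: ml-platform           # AppProject scoping RBAC and destinations
  source:
    repoURL: https://example.com/org/ml-platform-charts.git  # placeholder repo
    targetRevision: main
    path: charts/model-serving
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: model-serving
  syncPolicy:
    automated:
      prune: true                # delete resources removed from Git
      selfHeal: true             # revert out-of-band cluster drift
    syncOptions:
      - CreateNamespace=true
```

Access control would typically be scoped through a matching ArgoCD AppProject, which restricts source repos, destination clusters/namespaces, and team roles.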
Requirements
- Bachelor’s degree and 5+ years’ relevant work experience or an equivalent combination of education and experience
- Track record building and operating production-grade, cloud-deployed systems (AWS preferred) with strong software engineering fundamentals (Python/Go or similar)
- Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments using DevOps or GitOps best practices (e.g., Terraform/Helm + GitHub Actions/ArgoCD)
- Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, DataDog, ELK)
- Familiarity with containerization as well as container management and orchestration technologies (e.g., Docker, Kubernetes)
- Ability to work collaboratively in a team environment
- Bonus: Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (e.g., GPUs)
- Bonus: Working knowledge of the machine learning lifecycle and experience working with machine learning systems and associated frameworks/tools, particularly for monitoring and observability
- Bonus: Experience with big data technologies, distributed computing frameworks, and/or streaming data processing tools (e.g., Spark, Kafka, Presto, Flink)
- Bonus: Experience deploying, evaluating, testing, or otherwise supporting GenAI applications and their components (e.g., LLMs, vector databases)
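For the monitoring and observability requirement, a Prometheus alerting rule is one common integration pattern. This is an illustrative sketch only; the job name, metric, and threshold are assumptions, not details from the posting:

```yaml
# Hypothetical Prometheus alerting rule for a model-serving service.
# Fires when p99 request latency stays above 500ms for 10 minutes.
groups:
  - name: ml-platform-alerts
    rules:
      - alert: ModelServingHighLatency
        expr: >
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{job="model-serving"}[5m]))
            by (le)) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency for model-serving above 500ms for 10m"
```

Rules like this would be loaded via Prometheus's `rule_files` configuration (or a Kubernetes PrometheusRule resource when using the Prometheus Operator) and routed through Alertmanager.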