Grainger

Senior/Staff Software Engineer – Machine Learning Platform and Operations

full-time

Origin: 🇺🇸 United States • Illinois

Salary

💰 $121,500 - $202,500 per year

Job Level

Senior

Tech Stack

AWS, Cloud, Distributed Systems, Docker, Go, Grafana, Kafka, Kubernetes, Prometheus, Python, Spark, Splunk, Terraform

About the role

  • Build self-service and automated components of the machine learning platform to enable the development, deployment, scaling, and monitoring of machine learning models
  • Ship production platform components end-to-end across multiple modules; own reliability, performance, security, and cost from design through operation
  • Design Helm releases and author GitOps objects (ArgoCD Applications/Projects) with RBAC/sync policies; keep deployments predictable and auditable
  • Collaborate with machine learning, network, security, infrastructure, and platform engineers to ensure performant access to data, compute, and networked services
  • Ensure a rigorous deployment process using DevOps standards and mentor users in software development best practices
  • Partner with teams across the business to drive broader adoption of ML, enabling teams to improve the pace and quality of ML system development
  • Develop tools and services that form the backbone of Grainger’s AI-driven features leveraging Deep Learning, Natural Language Processing / Generative AI, Computer Vision, and beyond

Requirements

  • Bachelor’s degree and 5+ years’ relevant work experience or an equivalent combination of education and experience
  • Track record building and operating production-grade, cloud-deployed systems (AWS preferred) with strong software engineering fundamentals (Python/Go or similar)
  • Expertise with IaC tools and patterns to provision, manage, and deploy applications to multiple environments using DevOps or GitOps best practices (e.g., Terraform/Helm + GitHub Actions/ArgoCD)
  • Familiarity with application monitoring and observability tools and integration patterns (e.g., Prometheus/Grafana, Splunk, Datadog, ELK)
  • Familiarity with containerization as well as container management and orchestration technologies (e.g., Docker, Kubernetes)
  • Ability to work collaboratively in a team environment
  • Bonus: Expertise in designing, analyzing, and troubleshooting large-scale distributed systems and/or working with accelerated compute (e.g., GPUs)
  • Bonus: Working knowledge of the machine learning lifecycle and experience working with machine learning systems and associated frameworks/tools, particularly for monitoring and observability
  • Bonus: Experience with big data technologies, distributed computing frameworks, and/or streaming data processing tools (e.g., Spark, Kafka, Presto, Flink)
  • Bonus: Experience deploying, evaluating, and testing, or otherwise supporting, GenAI applications and their components (e.g., LLMs, Vector DBs, etc.)