Senior Systems Engineer, Storage – DGX Cloud

NVIDIA

Senior Systems Engineer who designs and operates large-scale Kubernetes storage platforms, collaborating with teams to ensure reliability, observability, and performance in production systems.

Posted 6/8/2026 • Full-time • Remote (California, Colorado, Illinois, North Carolina, Oregon) • 🇺🇸 United States • Senior • 💰 $208,000 - $414,000 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Kubernetes, Python, Go, Java, infrastructure-as-code, telemetry, observability, software design fundamentals, Linux, troubleshooting

Soft Skills

analytical skills, problem-solving, collaboration, communication, systematic approach

Tools & Technologies

Prometheus, InfluxDB, Grafana, Elastic stack, Ansible, Chef, Puppet, ArgoCD, Git Pipelines, Terraform

Industry Keywords

data platforms, storage systems, deployment automation, capacity planning, incident response, postmortems, CI/CD, large-scale systems, distributed systems, containerized infrastructure

Tech Stack

Tools & technologies

Ansible, Chef, Go, Grafana, Java, Kubernetes, Linux, Prometheus, Puppet, Python, Terraform

About the role

Key responsibilities & impact

Design, deploy, and operate solutions on Kubernetes for large-scale storage and data platforms, including the manifests, Helm charts, and operators that run them.
Build tools, services, and automation that improve the lifecycle of storage and data systems – from provisioning and configuration through deployment, scaling, and day-2 operations.
Develop and operate telemetry and observability for production systems – metrics, logging, tracing, dashboards, and alerting – so that system health, availability, and latency are measurable and actionable.
Apply strong analytical troubleshooting skills to diagnose and resolve complex issues across distributed, containerized infrastructure.
Work closely with peers and partner teams to improve the lifecycle of services, from inception and design through deployment, operation, and refinement.
Scale systems sustainably through automation, infrastructure-as-code, and CI/CD, and evolve systems by pushing for changes that improve reliability and velocity.
Support services before they go live through activities such as deployment automation, capacity planning, and launch and readiness reviews.
Practice sustainable incident response and postmortems, and participate in an on-call rotation to support production systems.

Requirements

What you’ll need

BS degree (or equivalent experience) in Computer Science or related technical field involving coding.
12+ years of practical experience.
Hands-on experience with Kubernetes – deploying, configuring, and operating workloads and solutions on Kubernetes in production.
Experience building tools and services for storage, data, or platform infrastructure, with solid software design fundamentals (algorithms, data structures, complexity analysis) on large-scale Linux-based systems.
Experience building and operating telemetry and observability using tools such as Prometheus, InfluxDB, Grafana, and the Elastic stack.
Strong analytical troubleshooting skills with a systematic, root-cause-driven approach to identifying and resolving complex problems.
Proficiency in one or more of Python, Go, or Java.
Good knowledge of infrastructure configuration management and infrastructure-as-code tools such as Ansible, Chef, Puppet, ArgoCD, Git Pipelines, and Terraform.

Benefits

Comp & perks

Equity
Health insurance
Retirement plans
Paid time off
Professional development opportunities