NVIDIA

Senior Storage Production Engineer – DGX Cloud

full-time

Location Type: Hybrid

Location: Santa Clara • California • 🇺🇸 United States


Salary

💰 $168,000 - $270,250 per year

Job Level

Senior

Tech Stack

Ansible, Chef, Go, Grafana, Java, Linux, NFS, Node.js, Prometheus, Puppet, Python, Terraform

About the role

  • Design, implement, and support large-scale storage clusters, ensuring scalability, high availability, and data integrity.
  • Develop and maintain storage monitoring, logging, and alerting systems to ensure proactive detection and resolution of performance issues.
  • Work with AI/ML workloads to optimize storage architectures for low-latency access, efficient caching, and high-throughput performance.
  • Improve the lifecycle of storage services – from inception and design to deployment, operation, and continuous optimization.
  • Support storage services before they go live through activities such as system design consulting, developing automation frameworks, capacity management, and launch reviews.
  • Maintain production storage infrastructure by monitoring availability, latency, and system health, leveraging predictive analytics and AI-driven automation.
  • Optimize storage efficiency through compression, deduplication, tiering strategies, and intelligent workload placement.
  • Scale storage systems sustainably using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
  • Ensure data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
  • Practice sustainable incident response and blameless root cause analysis.
  • Be part of an on-call rotation to support storage and production systems.

Requirements

  • BS degree in Computer Science, Storage Systems, or a related technical field (or equivalent experience), with 8+ years of practical experience.
  • Experience with distributed and high-performance storage solutions, including clustered and parallel file systems, distributed object storage, and enterprise-grade storage systems.
  • Solid understanding of block, file, and object storage technologies, including their scalability, reliability, and performance characteristics, as well as standard processes.
  • Experience with storage networking protocols such as NFS, SMB, iSCSI, S3, Fibre Channel, RDMA, and NVMe over Fabrics.
  • Expertise in algorithms, data structures, complexity analysis, software design, and automating maintenance of large-scale Linux-based storage systems.
  • Experience in one or more of the following: C/C++, Java, Python, Go, Node.js, or Bash for storage automation, monitoring, and performance tuning.
  • Hands-on experience with infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform for automating storage deployments.
  • Experience with observability and tracing tools like InfluxDB, Prometheus, Grafana, and the Elastic stack for monitoring storage system health.

Benefits

  • equity
  • benefits
