
Senior Storage Production Engineer – DGX Cloud
NVIDIA
full-time
Location Type: Hybrid
Location: Santa Clara • California • 🇺🇸 United States
Salary
💰 $168,000 - $270,250 per year
Job Level
Senior
Tech Stack
Ansible, Chef, Go, Grafana, Java, Linux, NFS, Node.js, Prometheus, Puppet, Python, Terraform
About the role
- Design, implement, and support large-scale storage clusters, ensuring scalability, high availability, and data integrity.
- Develop and maintain storage monitoring, logging, and alerting systems to ensure proactive detection and resolution of performance issues.
- Work with AI/ML workloads to optimize storage architectures for low-latency access, efficient caching, and high-throughput performance.
- Improve the lifecycle of storage services – from inception and design to deployment, operation, and continuous optimization.
- Support storage services before they go live through activities such as system design consulting, developing automation frameworks, capacity management, and launch reviews.
- Maintain production storage infrastructure by monitoring availability, latency, and system health, leveraging predictive analytics and AI-driven automation.
- Optimize storage efficiency through compression, deduplication, tiering strategies, and intelligent workload placement.
- Scale storage systems sustainably using AI/ML-driven automation, policy-based tiering, and dynamic data migration techniques.
- Ensure data security and compliance by implementing encryption, access controls, and auditing mechanisms for storage systems.
- Practice sustainable incident response and blameless root cause analysis.
- Be part of an on-call rotation to support storage and production systems.
Requirements
- BS degree or equivalent experience in Computer Science, Storage Systems, or a related technical field with 8+ years of practical experience.
- Experience with distributed and high-performance storage solutions, including clustered and parallel file systems, distributed object storage, and enterprise-grade storage systems.
- Solid understanding of block, file, and object storage technologies, including their scalability, reliability, and performance characteristics, and the standard processes for operating them.
- Experience with storage networking protocols such as NFS, SMB, iSCSI, S3, Fibre Channel, RDMA, and NVMe over Fabrics.
- Expertise in algorithms, data structures, complexity analysis, software design, and automating maintenance of large-scale Linux-based storage systems.
- Experience in one or more of the following: C/C++, Java, Python, Go, Node.js, or Bash for storage automation, monitoring, and performance tuning.
- Hands-on experience with infrastructure configuration management tools like Ansible, Chef, Puppet, and Terraform for automating storage deployments.
- Experience with observability and tracing tools like InfluxDB, Prometheus, Grafana, and the Elastic stack for monitoring storage system health.
Benefits
- equity
- benefits
Applicant Tracking System Keywords
Hard skills
storage clusters, storage monitoring, AI/ML workloads, compression, deduplication, tiering strategies, encryption, NFS, SMB, C/C++
Soft skills
problem-solving, communication, collaboration, incident response, root cause analysis