Staff Infrastructure Engineer – Storage Platform
TensorWave
Staff Infrastructure Engineer on the Storage Platform, responsible for the design, operation, and evolution of storage systems. Collaborates with cross-functional partners to support business objectives.
Tech Stack
Tools & technologies: Ansible, Distributed Systems, Grafana, Kubernetes, Linux, Prometheus, Terraform
About the role
Key responsibilities & impact
- Design and evolve storage architectures supporting Kubernetes (block, file, and object storage), AI/ML, and high-performance compute workloads
- Evaluate and select storage technologies based on performance (IOPS, throughput, latency), scalability and fault tolerance, and operational complexity and maintainability
- Define storage standards, best practices, and reference architectures
- Design for resilience over traditional HA, including failure-domain awareness
- Own production storage platforms, including Ceph (RBD, CephFS, RGW) and high-performance NAS (Weka, VAST, or similar)
- Lead lifecycle operations: cluster design and deployment, expansion and scaling, and upgrades and migrations
- Perform and guide capacity planning, performance tuning, and failure analysis
- Analyze storage performance across IOPS, throughput, latency, and tail latency
- Identify and resolve bottlenecks across disk subsystems, network paths (including RDMA), and client access patterns
- Lead root cause analysis for storage-related incidents
- Ensure storage platforms meet the demands of GPU and Kubernetes workloads
- Define and implement Kubernetes storage patterns: CSI drivers, StorageClasses, and persistent storage design
- Troubleshoot complex Kubernetes storage issues involving stateful workloads, provisioning failures, and performance anomalies
- Partner with platform teams to align storage with workload requirements
- Design and implement automation for storage deployment and configuration and for cluster lifecycle management
- Use tools such as Ansible, Terraform, and Kubernetes manifests / Helm
- Integrate storage platforms into observability stacks (Prometheus, Grafana, etc.)
- Serve as the technical authority for storage across the organization
- Mentor engineers on storage systems, performance, and troubleshooting
- Establish operational standards and best practices
- Drive continuous improvement of storage reliability and performance
Requirements
What you’ll need
- 7+ years of experience in infrastructure, storage, or distributed systems
- Deep hands-on experience with distributed storage systems in production
- Strong experience with Ceph (RBD, CephFS, and/or RGW)
- Strong Linux systems expertise
- Experience with high-performance storage platforms such as Weka, VAST Data, or similar
- Strong understanding of storage performance characteristics, data replication and failure domains, and distributed system design principles
- Ability to troubleshoot across storage, network, and compute layers
- Experience supporting AI/ML or HPC workloads
- Familiarity with NVMe-based architectures and RDMA or high-throughput Ethernet
- Experience integrating storage with Kubernetes at scale
- Experience operating across multiple data centers
- Exposure to object storage and S3-compatible APIs
Benefits
Comp & perks
- Stock Options
- 100% paid Medical, Dental, and Vision insurance for Employees
- Company Health Savings Account Contributions
- 100% paid Short Term and Long Term Disability Insurance for Employees
- Life and Voluntary Supplemental Insurance Options
- Other Insurance Options, such as Pet & Legal Insurance
- Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support
- Flexible Spending Account
- 401(k)
- Employee Assistance Program
- Flexible PTO
- Paid Holidays
- Parental Leave
- Other In-Office Perks