Tech Stack
CloudDistributed SystemsGoGrafanaKubernetesLinuxNFSPrometheusPythonTCP/IP
About the role
- Serve as final escalation point for complex and critical issues in enterprise and hyperscale environments
- Lead technical direction of the In-Market Engineering team and drive technical decisions
- Build and implement tools platform strategy for Infinia and a reporting platform using dial-home data
- Define and implement training for the support organization and author runbooks, performance tuning guides, and RCA documentation
- Deliver log analytics strategy and design framework for Infinia; utilize AI-powered debugging, log analysis, and system pattern recognition tools
- Own critical customer case escalations end-to-end, including deep root cause analysis and mitigation strategies
- Reproduce complex customer issues, propose product improvements or workarounds, and feed real-world support insights back into development
- Partner with Field CTOs, Solutions Architects, and Sales Engineers to ensure customer success and translate technical issues into executive-ready summaries
- Participate in post-mortems and executive briefings for strategic accounts; drive adoption of observability, automation, and self-healing support mechanisms
Requirements
- 12+ years in enterprise storage, distributed systems, or cloud infrastructure
- Deep understanding of file systems (POSIX, NFS, S3), storage performance, and Linux kernel internals
- Proven debugging skills at system/protocol/app levels (e.g., strace, tcpdump, perf)
- Hands-on experience with AI/ML data pipelines, container orchestration (Kubernetes), and GPU-based architectures
- TCP/IP / Network top expert
- Exposure to RDMA, NVMe-oF, or high-performance networking stacks
- Exceptional communication and executive reporting skills
- Experience using AI tools (e.g., log pattern analysis, LLM-based summarization, automated RCA tooling) to accelerate diagnostics and reduce MTTR
- Participation in an on-call rotation to provide after-hours support as needed
- Preferred: Experience with DDN, VAST, Weka, or similar scale-out file systems
- Preferred: Expert scripting/coding ability in Python, Bash, or Go
- Preferred: Familiarity with observability platforms: Prometheus, Grafana, ELK, OpenTelemetry
- Preferred: Knowledge of replication, consistency models, and data integrity mechanisms
- Preferred: Exposure to Sovereign AI, LLM model training environments, or autonomous system data architectures