Tech Stack
Cloud, Distributed Systems, Go, Grafana, Kubernetes, Linux, NFS, Prometheus, Python, TCP/IP
About the role
- Be the final escalation point for the most complex and critical issues affecting enterprise and hyperscale environments.
- Serve as the technical leader of the In-Market Engineering team, driving its technical decisions.
- Build and implement a tools platform strategy for Infinia.
- Build a reporting platform for Infinia using dial-home data.
- Define and implement training for the support organisation.
- Deliver log analytics strategy and design framework for Infinia.
- Own critical customer case escalations end-to-end, including deep root cause analysis and mitigation strategies.
- Utilize AI-powered debugging, log analysis, and system pattern recognition tools to accelerate resolution.
- Be the subject-matter expert on Infinia internals: metadata handling, storage fabric interfaces, performance tuning, AI integration, etc.
- Reproduce complex customer issues and propose product improvements or workarounds.
- Author and maintain detailed runbooks, performance tuning guides, and RCA documentation.
- Feed real-world support insights back into the development cycle to improve reliability and diagnostics.
- Partner with Field CTOs, Solutions Architects, and Sales Engineers to ensure customer success.
- Translate technical issues into executive-ready summaries and business impact statements.
- Participate in post-mortems and executive briefings for strategic accounts.
- Drive adoption of observability, automation, and self-healing support mechanisms using AI/ML tools.
Requirements
- 12+ years in enterprise storage, distributed systems, or cloud infrastructure.
- Deep understanding of file systems (POSIX, NFS, S3), storage performance, and Linux kernel internals.
- Proven debugging skills at system/protocol/app levels (e.g., strace, tcpdump, perf).
- Hands-on experience with AI/ML data pipelines, container orchestration (Kubernetes), and GPU-based architectures.
- Expert-level knowledge of TCP/IP and networking.
- Exposure to RDMA, NVMe-oF, or high-performance networking stacks.
- Exceptional communication and executive reporting skills.
- Experience using AI tools (e.g., log pattern analysis, LLM-based summarization, automated RCA tooling) to accelerate diagnostics and reduce MTTR.
- Experience with DDN, VAST, Weka, or similar scale-out file systems.
- Expert scripting/coding ability in Python, Bash, or Go.
- Familiarity with observability platforms: Prometheus, Grafana, ELK, OpenTelemetry.
- Knowledge of replication, consistency models, and data integrity mechanisms.
- Exposure to Sovereign AI, LLM training environments, or autonomous system data architectures.
- Participation in an on-call rotation to provide after-hours support as needed.