Tech Stack
Cloud, Distributed Systems, Go, Grafana, Kubernetes, Linux, NFS, Prometheus, Python, TCP/IP
About the role
- Be the final escalation point for the most complex and critical issues affecting enterprise and hyperscale environments.
- Serve as the technical leader of the In-Market Engineering team, driving its technical decisions.
- Build and implement a tools platform strategy for Infinia.
- Build a reporting platform for Infinia using dial-home data.
- Define and implement training for the support organisation.
- Deliver log analytics strategy and design framework for Infinia.
- Own critical customer case escalations end-to-end, including deep root cause analysis and mitigation strategies.
- Utilize AI-powered debugging, log analysis, and system pattern recognition tools to accelerate resolution.
- Be the subject-matter expert on Infinia internals: metadata handling, storage fabric interfaces, performance tuning, AI integration, etc.
- Reproduce complex customer issues and propose product improvements or workarounds.
- Author and maintain detailed runbooks, performance tuning guides, and RCA documentation.
- Feed real-world support insights back into the development cycle to improve reliability and diagnostics.
- Partner with Field CTOs, Solutions Architects, and Sales Engineers to ensure customer success.
- Translate technical issues into executive-ready summaries and business impact statements.
- Participate in post-mortems and executive briefings for strategic accounts.
- Drive adoption of observability, automation, and self-healing support mechanisms using AI/ML tools.
Requirements
- 12+ years in enterprise storage, distributed systems, or cloud infrastructure.
- Deep understanding of file systems (POSIX, NFS, S3), storage performance, and Linux kernel internals.
- Proven debugging skills at system/protocol/app levels (e.g., strace, tcpdump, perf).
- Hands-on experience with AI/ML data pipelines, container orchestration (Kubernetes), and GPU-based architectures.
- Expert-level knowledge of TCP/IP and networking.
- Exposure to RDMA, NVMe-oF, or high-performance networking stacks.
- Exceptional communication and executive reporting skills.
- Experience using AI tools (e.g., log pattern analysis, LLM-based summarization, automated RCA tooling) to accelerate diagnostics and reduce MTTR.
- Experience with DDN, VAST, Weka, or similar scale-out file systems.
- Expert scripting/coding ability in Python, Bash, or Go.
- Familiarity with observability platforms: Prometheus, Grafana, ELK, OpenTelemetry.
- Knowledge of replication, consistency models, and data integrity mechanisms.
- Exposure to Sovereign AI, LLM training environments, or autonomous system data architectures.
- Participation in an on-call rotation to provide after-hours support as needed.