Salary
💰 $208,000 - $327,750 per year
Tech Stack
Cloud, Distributed Systems, Docker, Grafana, Kubernetes, Prometheus, Splunk
About the role
- Be a subject‑matter expert on resiliency and observability.
- Deeply understand failure modes across the GPU hardware, network, and software stack, along with the telemetry signals that reveal them, and how they correlate to workload health and SLOs.
- Master modern reliability architectures. Keep up to date with industry trends. Build solutions for everyone who wants to use them.
- Drive joint project planning. Define concrete milestones, tasks, and work items for resiliency and observability initiatives with external partners.
- Fuel innovation in reliability tooling. Lead ideation sessions to propose novel approaches and shape new proof‑of‑concepts.
- Bridge development, SRE, and partner teams. Facilitate clear communication, triage emergent issues rapidly, and ensure feedback loops between engineering and customer operations remain tight.
- Coordinate execution across different functions. Work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into every product launch, capacity expansion, and lifecycle transition.
Requirements
- BS or MS in Computer Science, Computer Engineering, or a related field (or equivalent experience) and 12+ years of product‑management experience in enterprise technology
- Experience with GPU observability (DCGM, NVML, etc.) and integration into large‑scale telemetry systems
- Deep knowledge of AI/ML infrastructure, HPC, networking, and cloud technologies (IaaS, PaaS) including containerization, Kubernetes, and automation tools
- Familiarity with modern observability stacks: metrics, logs, traces, OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch
- Experience building, and preferably a deep understanding of, secure, compliance‑focused telemetry pipelines (SOC2, FedRAMP)
- Ability to articulate trade‑offs among latency, throughput, cost, and reliability to both engineering and executive audiences
- Data-driven approach: defines SLIs/SLOs, manages error budgets, and develops value models
- Strong cross‑functional execution: writes clear specs and PRDs, produces GTM collateral, and leads agile processes
- Master's/PhD or expertise in distributed systems, performance modeling, or fault‑tolerant computing
- Experience with MLOps and LLMOps ecosystems, including integration with enterprise platforms and deployments at modern data‑center scale; a track record of delivering ML/AI observability solutions for LLMOps, predictive incident detection, or anomaly classification
- Familiarity with MLOps toolchains and integrations with monitoring platforms such as Splunk, Datadog, and Grafana Cloud
- Expertise with containerization technologies like Docker and Kubernetes, plus virtualization
- Proficiency in network architecture and high‑performance interconnects (InfiniBand, Ethernet, RoCE)