Senior Platform Telemetry Engineer

NVIDIA

full-time

Posted on: 9/1/2025

Origin: 🇺🇸 United States • California

💰 $148,000 - $287,500 per year

Senior

Grafana, Prometheus, Python

About the role

Drive next-generation fleet management solutions for scaling AI infrastructure built on NVIDIA GPUs and Grace solutions
Work with customers, product management and architects to narrow requirements for implementation
Design architecture for fleet health monitoring and fault-remediation at scale
Understand customers' health monitoring requirements and address them using in-band and out-of-band capabilities
Produce detailed architecture and run POCs to validate designs
Educate customers about product architecture and incorporate feedback
Write architecture specs and design documents; own end-to-end product delivery
Review code produced from architecture specs and ensure code quality
Ensure product is properly tested; enhance unit testing and test plans
Drive product life cycles with QA teams and act as product owner
Articulate requirements in Jira and manage end-to-end execution plans
Contribute across all phases: product definition, architecture, design, implementation, debugging, testing, and early customer support

Requirements

BS, MS, or PhD in EE/CS or related field (or equivalent experience)
5+ years hands-on coding experience
Strong knowledge of time series databases like InfluxDB and Prometheus
Strong knowledge of building and consuming REST APIs (Redfish is a big plus)
Strong knowledge of telemetry visualization solutions like Grafana & Influx
Strong knowledge of firmware architecture, including optimizing firmware for low-latency APIs
Strong skills in analyzing algorithms for time and space complexity and in projecting system resource requirements
Proven record of delivering scalable solutions
Strong and demonstrable skill in C/C++ and Python
Experience programming and debugging on server platforms
Experience in SCM (e.g., Git, Perforce) and project management tools like Jira
Excellent written and oral communication skills; strong teamwork and work ethic
Self-starter, hands-on with coding and committed to delivering quality work
(Nice-to-have) Experience building telemetry collection & analysis engines
(Nice-to-have) Experience with Redfish
(Nice-to-have) Experience with notification systems like PagerDuty
(Nice-to-have) Active Open Compute (OCP) and DMTF contributor in relevant areas
(Nice-to-have) Hands-on experience with x86 or ARM system architecture
(Nice-to-have) Familiarity with Confidential Compute
(Nice-to-have) Experience with ML and multi-variable optimization techniques