Senior Platform Telemetry Engineer

NVIDIA

full-time

Posted on: 9/21/2025

Origin: 🇺🇸 United States • California

💰 $148,000 - $287,500 per year

Senior

Grafana • Prometheus • Python

About the role

Drive next-generation fleet management solutions for scaling AI infrastructure using GPUs and Grace solutions from NVIDIA
Work with customers, product management, and other architects to narrow down requirements for implementation
Design the architecture for a fleet health monitoring and fault-remediation solution at scale
Work with customers and other architects to understand health monitoring requirements and leverage in-band and out-of-band capabilities
Create detailed architecture and run proofs of concept (POCs) to validate it
Educate customers about product architecture and incorporate feedback
Write architecture specs and design documents; own end-to-end delivery across teams
Perform code reviews for code produced from architecture specs
Ensure product is properly tested; enhance unit testing and establish proper test plans
Drive product life cycles with QA teams to productize code and act as product owner
Articulate requirements in Jira and bug management tools and coordinate execution plans with managers
Contribute to all phases of product development: definition, architecture, design, implementation, debugging, testing, and early customer support

Requirements

BS, MS, or PhD in EE/CS or a related field (or equivalent experience)
5+ years of hands-on coding experience
Strong knowledge of time series databases like InfluxDB and Prometheus
Strong knowledge of building and consuming REST APIs (Redfish is a big plus)
Strong knowledge of telemetry visualization solutions like Grafana & Influx
Strong knowledge of firmware architecture, including optimizing firmware for low-latency APIs
Strong knowledge of analyzing algorithms for time and space complexity and projecting system resource requirements
Proven record of delivering scalable solutions
Strong and demonstrable skill in C/C++ and Python
Experience programming and debugging on server platforms
Experience in SCM (e.g., Git, Perforce) and project management tools like Jira
Excellent written and oral communication skills
Excellent work ethic, teamwork, and commitment to finishing tasks
Self-starter with hands-on coding ability

Ways to stand out

Experience building telemetry collection and analysis engines
Experience with Redfish
Experience with notification systems like PagerDuty
Active OCP and DMTF contributions
Hands-on experience with x86 or ARM system architecture
Familiarity with Confidential Compute
Experience with ML and multi-variable optimization techniques