NVIDIA

Senior Platform Telemetry Engineer

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $148,000 - $287,500 per year

Job Level

Senior

Tech Stack

Grafana, Prometheus, Python

About the role

  • Drive next-generation fleet management solutions for scaling AI infrastructure built on NVIDIA GPUs and Grace solutions
  • Work with customers, product management, and architects to narrow down requirements for implementation
  • Design architecture for fleet health monitoring and fault-remediation at scale
  • Understand customers' health monitoring requirements; use in-band and out-of-band capabilities
  • Produce detailed architecture and run POCs to validate designs
  • Educate customers about product architecture and incorporate feedback
  • Write architecture specs and design documents; own end-to-end product delivery
  • Review code produced from architecture specs and ensure code quality
  • Ensure product is properly tested; enhance unit testing and test plans
  • Drive product life cycles with QA teams and act as product owner
  • Articulate requirements in Jira and manage end-to-end execution plans
  • Contribute across all phases: product definition, architecture, design, implementation, debugging, testing, and early customer support

Requirements

  • BS, MS, or PhD in EE/CS or related field (or equivalent experience)
  • 5+ years hands-on coding experience
  • Strong knowledge of time series databases like InfluxDB & Prometheus
  • Strong knowledge of building and consuming REST APIs (Redfish is a big plus)
  • Strong knowledge of telemetry visualization solutions like Grafana & Influx
  • Strong knowledge of firmware architecture, including optimizing firmware for low-latency APIs
  • Strong knowledge of analyzing algorithms for time & space complexity and projecting system resource requirements
  • Proven record of delivering scalable solutions
  • Strong and demonstrable skill in C/C++ and Python
  • Programming and debugging experience on server platforms
  • Experience in SCM (e.g., Git, Perforce) and project management tools like Jira
  • Excellent written and oral communication skills; strong teamwork and work ethic
  • Self-starter, hands-on with coding and committed to delivering quality work
  • (Nice-to-have) Experience building telemetry collection & analysis engines
  • (Nice-to-have) Experience with Redfish
  • (Nice-to-have) Experience with notification systems like PagerDuty
  • (Nice-to-have) Active Open Compute (OCP) and DMTF contributor in relevant areas
  • (Nice-to-have) Hands-on experience with x86 or ARM system architecture
  • (Nice-to-have) Familiarity with Confidential Compute
  • (Nice-to-have) Experience with ML and multi-variable optimization techniques