NVIDIA

Senior Platform Telemetry Engineer

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $148,000 - $287,500 per year

Job Level

Senior

Tech Stack

Grafana, Prometheus, Python

About the role

  • Drive next-generation fleet management solutions for scaling AI infrastructure using GPUs and the Grace solution from NVIDIA
  • Work with customers, product management, and other architects to narrow down requirements for implementation
  • Design the architecture for a fleet health monitoring and fault-remediation solution at scale
  • Work with customers and other architects to understand health monitoring requirements and leverage in-band and out-of-band capabilities
  • Create detailed architectures and run proofs of concept (POCs) to validate them
  • Educate customers about product architecture and incorporate feedback
  • Write architecture specs and design documents; own end-to-end delivery across teams
  • Perform code reviews for code produced from architecture specs
  • Ensure product is properly tested; enhance unit testing and establish proper test plans
  • Drive product life cycles with QA teams to productize code and act as product owner
  • Articulate requirements in Jira and bug management tools and coordinate execution plans with managers
  • Contribute to all phases of product development: definition, architecture, design, implementation, debugging, testing, and early customer support

Requirements

  • BS, MS, or PhD in EE/CS or related field of education (or equivalent experience)
  • 5+ years hands-on coding experience
  • Strong knowledge of time series databases like InfluxDB & Prometheus
  • Strong knowledge of building and consuming REST APIs (Redfish is a big plus)
  • Strong knowledge of telemetry visualization solutions like Grafana & Influx
  • Strong knowledge of firmware architecture, including optimizing firmware for low-latency APIs
  • Strong knowledge of analyzing algorithms for time & space complexity and projecting system resource requirements
  • Proven record of solutions for scalability
  • Strong and demonstrable skill in C/C++ and Python
  • Programming and debugging experience on server platforms
  • Experience in SCM (e.g., Git, Perforce) and project management tools like Jira
  • Excellent written and oral communication skills
  • Excellent work ethic, teamwork, and commitment to finishing tasks
  • Self-starter with hands-on coding ability
  • Ways to stand out: Experience building telemetry collection & analysis engines; Experience with Redfish; Experience with notification systems like PagerDuty; Active OCP and DMTF contributions; Hands-on experience with x86 or ARM system architecture; Familiarity with Confidential Compute; Experience with ML and multi-variable optimization techniques
Posted: 20 days ago · Source: nvidia.wd5.myworkdayjobs.com