NVIDIA

Senior Engineering Manager – Data Center Telemetry, RAS

full-time

Location Type: Remote

Location: Remote • California • 🇺🇸 United States

Salary

💰 $272,000 - $425,500 per year

Job Level

Senior

Tech Stack

Prometheus, Python

About the role

  • Lead Data Center Compute Telemetry & RAS: Own the end-to-end architecture and delivery for telemetry solutions, including fleet health monitoring, fault remediation, and data visualization at scale.
  • OOB Telemetry Ownership: Own the out-of-band (OOB) telemetry solution and validate the telemetry data from each underlying device.
  • Build and Mentor a World-Class Team: Recruit, develop, and motivate a high-performing engineering team focused on platform telemetry, RAS and observability.
  • Process Optimization: Continuously improve software development processes for optimal productivity and quality.
  • Cross-Functional Collaboration: Work across teams to ensure seamless integration of telemetry solutions with platform firmware, server architecture, and data center management.
  • Product Ownership: Drive product life cycles with QA teams, ensuring robust testing, productization, and delivery.
  • Performance Management: Conduct performance reviews, foster a culture of excellence, and ensure high productivity.

Requirements

  • 12+ years of relevant experience overall and 5 years managing systems/platform software teams, ideally in server RAS, firmware, telemetry, or data center solutions.
  • BS, MS, or PhD in EE/CS or related field (or equivalent experience).
  • Strong knowledge of DMTF/PLDM for OOB telemetry collection, time series databases (e.g., InfluxDB, Prometheus) and REST APIs (Redfish).
  • Deep understanding of server and firmware architecture, including optimization for low-latency APIs.
  • Proven track record of delivering scalable server products and telemetry solutions.
  • Experience with SCM (Git, Perforce) and project management tools (Jira).
  • Excellent written and oral communication skills, strong work ethic, and commitment to teamwork.
  • Hands-on experience with x86/ARM system architecture and coding (C/C++, Python).
  • Familiarity with Confidential Compute and notification systems.
  • Demonstrated ability to analyze algorithms for time/space complexity and system resource requirements.

Benefits

  • Equity
  • Benefits
