
Senior Engineering Manager – Data Center Telemetry, RAS
NVIDIA
full-time
Location Type: Remote
Location: Remote • California • 🇺🇸 United States
Salary
💰 $272,000 - $425,500 per year
Job Level
Senior
Tech Stack
Prometheus, Python
About the role
- Lead Data Center Compute Telemetry & RAS: Own the end-to-end architecture and delivery for telemetry solutions, including fleet health monitoring, fault remediation, and data visualization at scale.
- Out-of-Band (OOB) Telemetry: Own the OOB telemetry solution and validate the telemetry data reported by each underlying device.
- Build and Mentor a World-Class Team: Recruit, develop, and motivate a high-performing engineering team focused on platform telemetry, RAS and observability.
- Process Optimization: Continuously improve software development processes for optimal productivity and quality.
- Cross-Functional Collaboration: Work across teams to ensure seamless integration of telemetry solutions with platform firmware, server architecture, and data center management.
- Product Ownership: Drive product life cycles with QA teams, ensuring robust testing, productization, and delivery.
- Performance Management: Conduct performance reviews, foster a culture of excellence, and ensure high productivity.
Requirements
- 12+ years of relevant experience overall, including 5 years managing systems/platform software teams, ideally in server RAS, firmware, telemetry, or data center solutions.
- BS, MS, or PhD in EE/CS or related field (or equivalent experience).
- Strong knowledge of DMTF/PLDM for OOB telemetry collection, time series databases (e.g., InfluxDB, Prometheus) and REST APIs (Redfish).
- Deep understanding of server and firmware architecture, and of optimizing APIs for low latency.
- Proven track record of delivering scalable server products and telemetry solutions.
- Experience with SCM (Git, Perforce) and project management tools (Jira).
- Excellent written and oral communication skills, strong work ethic, and commitment to teamwork.
- Hands-on experience with x86/ARM system architecture and coding (C/C++, Python).
- Familiarity with Confidential Compute and notification systems.
- Demonstrated ability to analyze algorithms for time/space complexity and system resource requirements.
Benefits
- Equity
- Benefits
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
DMTF, PLDM, OOB telemetry, time series databases, InfluxDB, Prometheus, REST APIs, C, C++, Python
Soft skills
leadership, team development, communication, process optimization, cross-functional collaboration, product ownership, performance management, motivation, commitment to teamwork, strong work ethic
Certifications
BS in EE, MS in EE, PhD in EE, BS in CS, MS in CS, PhD in CS