Senior Engineer – AI, HPC Observability

NVIDIA

full-time

Posted on: 10/21/2025

Location Type: Office

Location: Santa Clara • California, Texas, Washington • 🇺🇸 United States

💰 $184,000 - $287,500 per year

Senior

Cloud, Distributed Systems, Go, Grafana, Java, Kafka, NoSQL, Prometheus, Python, Spark, SQL

About the role

Design and implement full-stack observability systems covering metrics, logs, traces, and events for GPU-powered AI and HPC workloads.
Build large-scale telemetry data pipelines leveraging OpenTelemetry, Kafka, Prometheus, and other distributed systems to ingest, process, and analyze massive data streams.
Develop analytics and anomaly detection frameworks to enable real-time visibility, performance optimization, and predictive insights across multi-tenant environments.
Architect and tune high-throughput data stores (e.g., TSDBs, columnar databases, OLAP systems) for large-scale observability data.
Drive self-service analytics capabilities through APIs, dashboards, and recommendation engines that empower developers and operators with actionable insights.
Collaborate with AI platform, GPU, and cloud infrastructure teams to optimize observability for model training, inference workloads, and HPC performance.
Leverage machine learning and statistical techniques for correlation, anomaly detection, and intelligent alerting.
Contribute to performance tuning, scalability, and reliability of observability services across on-prem and cloud environments.

Requirements

BS or equivalent experience in Computer Science, Computer Engineering, or a related technical field.
8+ years of experience in large-scale observability, data engineering, or performance monitoring systems.
Proven expertise in building and scaling observability stacks (metrics, logs, traces, events) using OpenTelemetry, Prometheus, Grafana, or Thanos.
Deep understanding of data collection, transformation, and storage at scale; experience with streaming frameworks (Kafka, Flink, Spark) preferred.
Hands-on experience with Python, Go, and/or Java for backend development and automation.
Strong knowledge of API design, data modeling, SQL/NoSQL, and data pipeline architecture.
Experience working with PromQL, time-series databases, and large-scale monitoring systems.
Familiarity with AI/ML pipelines, GPU-based workloads, and HPC environments.
Experience with anomaly detection, log analytics, and recommendation systems using ML or statistical techniques.
Excellent problem-solving, debugging, and performance-tuning skills in distributed systems.

Benefits

equity
benefits