NVIDIA

Senior Engineer – AI, HPC Observability

full-time

Location Type: Office

Location: Santa Clara • California, Texas, Washington • 🇺🇸 United States


Salary

💰 $184,000 - $287,500 per year

Job Level

Senior

Tech Stack

Cloud, Distributed Systems, Go, Grafana, Java, Kafka, NoSQL, Prometheus, Python, Spark, SQL

About the role

  • Design and implement full-stack observability systems covering metrics, logs, traces, and events for GPU-powered AI and HPC workloads.
  • Build large-scale telemetry data pipelines leveraging OpenTelemetry, Kafka, Prometheus, and other distributed systems to ingest, process, and analyze massive data streams.
  • Develop analytics and anomaly detection frameworks to enable real-time visibility, performance optimization, and predictive insights across multi-tenant environments.
  • Architect and tune high-throughput data stores (e.g., TSDBs, columnar databases, OLAP systems) for large-scale observability data.
  • Drive self-service analytics capabilities through APIs, dashboards, and recommendation engines that empower developers and operators with actionable insights.
  • Collaborate with AI platform, GPU, and cloud infrastructure teams to optimize observability for model training, inference workloads, and HPC performance.
  • Leverage machine learning and statistical techniques for correlation, anomaly detection, and intelligent alerting.
  • Contribute to performance tuning, scalability, and reliability of observability services across on-prem and cloud environments.

Requirements

  • BS or equivalent experience in Computer Science, Computer Engineering, or a related technical field.
  • 8+ years of experience in large-scale observability, data engineering, or performance monitoring systems.
  • Proven expertise in building and scaling observability stacks (metrics, logs, traces, events) using OpenTelemetry, Prometheus, Grafana, or Thanos.
  • Deep understanding of data collection, transformation, and storage at scale; experience with streaming frameworks (Kafka, Flink, Spark) preferred.
  • Hands-on experience with Python, Go, and/or Java for backend development and automation.
  • Strong knowledge of API design, data modeling, SQL/NoSQL, and data pipeline architecture.
  • Experience working with PromQL, time-series databases, and large-scale monitoring systems.
  • Familiarity with AI/ML pipelines, GPU-based workloads, and HPC environments.
  • Experience with anomaly detection, log analytics, and recommendation systems using ML or statistical techniques.
  • Excellent problem-solving, debugging, and performance-tuning skills in distributed systems.

Benefits

  • equity
  • benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
full-stack observability systems, telemetry data pipelines, OpenTelemetry, Kafka, Prometheus, data stores, anomaly detection frameworks, Python, Go, Java
Soft skills
problem-solving, debugging, performance-tuning