NVIDIA

Senior AI and HPC Observability Engineer

full-time

Location Type: Hybrid

Location: Santa Clara, California, Washington, United States

Salary

💰 $152,000 - $241,500 per year

About the role

  • Design and scale observability platforms handling high-volume metrics, logs, and traces across distributed environments
  • Build high-performance backend services for telemetry ingestion, processing, and routing
  • Develop and extend OpenTelemetry collectors, processors, exporters, and instrumentation libraries
  • Build and optimize metrics pipelines using large-scale time-series storage systems
  • Design and operate real-time and batch telemetry pipelines using streaming and distributed data technologies
  • Improve platform reliability, performance, and cost efficiency through tuning, capacity planning, and system optimization
  • Develop monitoring, alerting, and service reliability frameworks to ensure platform health and performance
  • Collaborate with platform engineering, infrastructure, and site reliability teams to deliver production-grade observability solutions

Requirements

  • Bachelor’s degree in Computer Science, Computer Engineering, or a related field, or equivalent experience
  • 5+ years of experience building backend or distributed systems in production environments
  • Strong programming skills in Python, Go, or Java, with experience developing production-quality software
  • Hands-on experience with modern observability architectures, including metrics, logs, and traces
  • Solid experience with PromQL and time-series data systems
  • Experience building or operating distributed data pipelines using technologies such as Kafka, Spark, or Flink
  • Experience working with Kubernetes and cloud-native infrastructure
  • Strong understanding of distributed systems, concurrency, and fault-tolerant system design
  • Strong debugging, performance tuning, and production operations skills

Benefits

  • equity
  • benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Python, Go, Java, OpenTelemetry, PromQL, Kafka, Spark, Flink, Kubernetes, distributed systems
Soft Skills
collaboration, debugging, performance tuning, system optimization, capacity planning
Certifications
Bachelor’s degree in Computer Science, Bachelor’s degree in Computer Engineering