
Senior AI and HPC Observability Engineer
NVIDIA
full-time
Location Type: Hybrid
Location: Santa Clara • California • Washington • United States
Salary
$152,000 - $241,500 per year
About the role
- Design and scale observability platforms handling high-volume metrics, logs, and traces across distributed environments
- Build high-performance backend services for telemetry ingestion, processing, and routing
- Develop and extend OpenTelemetry collectors, processors, exporters, and instrumentation libraries
- Build and optimize metrics pipelines using large-scale time-series storage systems
- Design and operate real-time and batch telemetry pipelines using streaming and distributed data technologies
- Improve platform reliability, performance, and cost efficiency through tuning, capacity planning, and system optimization
- Develop monitoring, alerting, and service reliability frameworks to ensure platform health and performance
- Collaborate with platform engineering, infrastructure, and site reliability teams to deliver production-grade observability solutions
Requirements
- Bachelor’s degree in Computer Science, Computer Engineering, or a related field, or equivalent experience
- 5+ years of experience building backend or distributed systems in production environments
- Strong programming skills in Python, Go, or Java, with experience developing production-quality software
- Hands-on experience with modern observability architectures, including metrics, logs, and traces
- Solid experience with PromQL and time-series data systems
- Experience building or operating distributed data pipelines using technologies such as Kafka, Spark, or Flink
- Experience working with Kubernetes and cloud-native infrastructure
- Strong understanding of distributed systems, concurrency, and fault-tolerant system design
- Strong debugging, performance tuning, and production operations skills
Benefits
- equity
- benefits