Senior Site Reliability Engineer, GeForce NOW

NVIDIA

Senior Site Reliability Engineer improving service reliability and observability for NVIDIA's GeForce NOW gaming platform. Responsibilities include automation, Kubernetes management, and incident response.

Posted 6/9/2026 • Full-time • Santa Clara • California • 🇺🇸 United States • Senior • 💰 $168,000 - $270,250 per year

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Kubernetes, automation, scripting, microservices, Go, Python, Bash scripting, deployment pipelines, change management, problem-solving

Soft Skills

ownership, optimization, efficiency, leadership, communication

Tools & Technologies

Datadog, Prometheus, Alertmanager, AWS, GCP, Azure, GitHub Actions, GitLab CI, ArgoCD, VMI

Industry Keywords

Site Reliability Engineering, observability, production systems, system design consulting, capacity management, on-call rotation, high-severity alerts, service degradation, post-mortem reviews, workflow processes

Tech Stack

Tools & technologies

AWS, Azure, Cloud, Go, Google Cloud Platform, Kubernetes, Prometheus, Python

About the role

Key responsibilities & impact

Build tools to improve SRE observability.
Take part in the Kubernetes migration, including VMI setup and problem-solving.
Rapidly debug and triage incidents and user-reported issues.
Take ownership of automation, scripting, and tooling for new and existing scripts to help the team achieve 100% automation of daily tasks.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management, and launch reviews.
Be part of an on-call rotation to support production systems.

Requirements

What you’ll need

MS or BS in Computer Science/Engineering or a related field or equivalent experience.
8+ years of site reliability engineering experience working on large-scale distributed microservices in a production environment, with a real passion for automation and tooling.
Very strong Kubernetes background, including the ability to understand complex, highly available VMI setups on K8s.
Experience leading significant production improvements, including change management, post-mortem reviews, and workflow processes, and designing and delivering software automation in various languages.
Proven strengths in problem-solving and root-causing issues, while continuously seeking ways to drive optimization, efficiency, and the bottom line.
Previous experience with Datadog, Prometheus, Alertmanager, or similar monitoring systems.
Experience managing multi-region cloud deployments on hyperscalers like AWS, GCP, or Azure.
Experience designing and managing deployment pipelines using tools such as GitHub Actions, GitLab CI, or ArgoCD.
Production-grade coding proficiency in languages like Go, Python, or robust Bash scripting.
Production on-call experience is a must.
Candidates should have served in a primary production on-call rotation, responding to and mitigating high-severity infrastructure alerts and service degradations.

Benefits

Comp & perks

equity
benefits