NVIDIA

Manager, Site Reliability Engineer – DGX Cloud

full-time

Location Type: Remote

Location: India

About the role

  • Recruit, develop, and inspire a team of Site Reliability Engineers, fostering a strong culture of collaboration, ownership, and technical excellence.
  • Provide mentorship, guidance, and career development opportunities to help your team grow.
  • Establish and enforce SRE standard practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, and robust incident management processes.
  • Drive continuous improvement in system reliability, availability, and performance.
  • Collaborate closely with engineering and product teams to design, build, and deploy highly scalable, fault-tolerant, and performant cloud services.
  • Champion architecture reviews and ensure operational considerations are embedded from inception.
  • Lead initiatives to eliminate toil by driving automation across the entire service lifecycle, including provisioning, deployment, monitoring, incident response, and capacity management.
  • Implement comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance.
  • Champion the use of data-driven approaches to identify potential issues before they impact users.
  • Oversee incident response, ensuring rapid identification, mitigation, and resolution of production issues.
  • Lead blameless post-mortems, fostering a culture of learning from incidents and continuous improvement.
  • Develop and implement strategies to improve platform scalability and performance so the platform can accommodate growing demand efficiently.
  • Partner closely with various engineering teams (development, infrastructure, security), product management, and customer success to align on priorities, communicate service health, and drive successful product launches.
  • Ensure the implementation of security standards and compliance requirements across all operational aspects of the cloud platforms.
  • Provide leadership and support for on-call rotations, ensuring effective incident response and knowledge sharing.

Requirements

  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent practical experience.
  • 10+ years of overall experience in Site Reliability Engineering, DevOps, or a similar role, with at least 5 years in a leadership/management capacity.
  • Proven experience in operating and owning end-to-end availability of critical, large-scale distributed systems in a cloud environment (e.g., AWS, GCP, Azure).
  • Deep expertise in Kubernetes administration, containerization, and microservices architecture.
  • Strong understanding of SRE principles, including SLOs, SLIs, error budgets, and incident management.
  • Extensive experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet).
  • Proficiency in at least one high-level programming language (e.g., Python, Go).
  • In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards.
  • Experience building and operating comprehensive observability platforms (monitoring, logging, tracing) using tools such as Prometheus, Grafana, the ELK Stack, Splunk, and Jaeger.
  • Demonstrated ability to lead and mentor engineering teams, fostering a collaborative and innovative environment.
  • Excellent communication, interpersonal, and problem-solving skills, with the ability to articulate complex technical concepts to both technical and non-technical audiences.

Benefits

  • Competitive salary
  • Health insurance
  • Professional development opportunities
  • Flexible working hours
  • Wellness programs

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills

  • Site Reliability Engineering
  • DevOps
  • Kubernetes administration
  • Containerization
  • Microservices architecture
  • Infrastructure automation
  • Monitoring
  • Logging
  • Tracing
  • Cloud security

Soft skills

  • Leadership
  • Mentorship
  • Collaboration
  • Problem-solving
  • Communication
  • Interpersonal skills
  • Continuous improvement
  • Strategic thinking
  • Team development
  • Incident management