
Manager, Site Reliability Engineer – DGX Cloud
NVIDIA
full-time
Location Type: Remote
Location: India
About the role
- Recruit, develop, and inspire a team of Site Reliability Engineers, fostering a strong culture of collaboration, ownership, and technical excellence.
- Provide mentorship, guidance, and career development opportunities to help your team grow.
- Establish and enforce SRE standard practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, and robust incident management processes.
- Drive continuous improvement in system reliability, availability, and performance.
- Collaborate closely with engineering and product teams to design, build, and deploy highly scalable, fault-tolerant, and performant cloud services.
- Champion architecture reviews and ensure operational considerations are embedded from inception.
- Lead initiatives to eliminate toil by driving automation across the entire service lifecycle, including provisioning, deployment, monitoring, incident response, and capacity management.
- Implement comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance.
- Champion the use of data-driven approaches to identify potential issues before they impact users.
- Oversee incident response, ensuring rapid identification, mitigation, and resolution of production issues.
- Lead blameless post-mortems, fostering a culture of learning and continuous improvement based on incidents.
- Develop and implement strategies to improve platform scalability and performance so that the platform accommodates growing demand efficiently.
- Partner closely with various engineering teams (development, infrastructure, security), product management, and customer success to align on priorities, communicate service health, and drive successful product launches.
- Ensure the implementation of security standards and compliance requirements across all operational aspects of the cloud platforms.
- Provide leadership and support for on-call rotations, ensuring effective incident response and knowledge sharing.
Requirements
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field, or equivalent practical experience.
- 10+ years of overall experience in Site Reliability Engineering, DevOps, or a similar role, with at least 5 years in a leadership/management capacity.
- Proven experience in operating and owning end-to-end availability of critical, large-scale distributed systems in a cloud environment (e.g., AWS, GCP, Azure).
- Deep expertise in Kubernetes administration, containerization, and microservices architecture.
- Strong understanding of SRE principles, including SLOs, SLIs, error budgets, and incident management.
- Extensive experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet).
- Proficiency in at least one high-level programming language (e.g., Python, Go).
- In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards.
- Experience building and operating comprehensive observability platforms (monitoring, logging, tracing) using tools such as Prometheus, Grafana, ELK Stack, Splunk, and Jaeger.
- Demonstrated ability to lead and mentor engineering teams, fostering a collaborative and innovative environment.
- Excellent communication, interpersonal, and problem-solving skills, with the ability to articulate complex technical concepts to both technical and non-technical audiences.
Benefits
- Competitive salary
- Health insurance
- Professional development opportunities
- Flexible working hours
- Wellness programs
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Site Reliability Engineering, DevOps, Kubernetes administration, containerization, microservices architecture, infrastructure automation, monitoring, logging, tracing, cloud security
Soft skills
leadership, mentorship, collaboration, problem-solving, communication, interpersonal skills, continuous improvement, strategic thinking, team development, incident management