Salary
💰 $148,000 - $287,500 per year
Tech Stack
Ansible, AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Python, Terraform
About the role
We are now seeking a Senior DevOps Engineer for the NVIDIA AI Inference Operations Team.
This is a unique opportunity to be the cornerstone of our DevOps practice, taking full ownership of the critical systems that power our engineering innovation.
You will be responsible for the entire DevOps landscape, from our CI/CD pipelines to our kernel build systems, driving efficiency and reliability across the organization.
You will work with autonomy to design and implement the best solutions and collaborate with external partners to achieve our goals.
If you're passionate about infrastructure, Kubernetes, automation, and observability, we want you with us at one of the most innovative companies in the world.
- Build and maintain, from first principles, the infrastructure needed to deliver our growing family of AI inferencing products, including Dynamo and NIXL.
- Maintain CI/CD pipelines that automate the build, test, and deployment process, and resolve their bottlenecks.
- Manage tooling and automate redundant manual workflows using GitHub Actions, GitLab, Terraform, and similar tools.
- Enable security scanning and the handling of CVEs for infrastructure components.
- Collaborate extensively with cross-functional teams to integrate pipelines from deep learning frameworks and components, ensuring seamless deployment and inference of deep learning models on our platform.
Requirements
- Master's degree or equivalent experience.
- 3+ years of experience in computer science, computer architecture, or a related field.
- Ability to work in a fast-paced, agile team environment.
- Excellent Bash, CI/CD, and Python programming skills, along with software design skills including debugging, performance analysis, and test design.
- Experience administering, monitoring, and deploying systems and services on GitHub and cloud platforms.
- Ability to support other technical teams in monitoring the platform's operating efficiency and to respond as needs arise.
- Highly skilled in Kubernetes and Docker/containerd.
- Automation expert with hands-on skills in frameworks such as Ansible and Terraform.
- Experience with AWS, Azure, or GCP.
- Knowledge of distributed systems programming.