
Applied Research Engineer – Training Infra
Snorkel AI
full-time
Location Type: Hybrid
Location: Redwood City • California • United States
Salary
💰 $150,000 - $180,000 per year
About the role
- Own the infrastructure that powers model training and evaluation work
- Build and operate GPU cluster infrastructure and training pipelines
- Translate training requirements into robust, reproducible systems
- Monitor and optimize cluster health and inter-node communication
- Work closely with research scientists and ML engineers
Requirements
- Hands-on experience managing GPU clusters on major cloud providers
- Experience with distributed compute orchestration tools such as Kubernetes, Slurm, or equivalent
- Working knowledge of distributed training concepts
- Experience with setting up, managing, and integrating ML experiment tracking
- Strong Python proficiency and solid software engineering fundamentals
- Ability to work in a fast-moving, iterative environment
- Hands-on experience with post-training workflows is a plus
Benefits
- Global team events
- Professional development opportunities
- Health insurance
- Flexible working hours
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
GPU clusters, cloud providers, Kubernetes, Slurm, distributed training, ML experiment tracking, Python, software engineering
Soft Skills
ability to work in fast-moving environment, iterative work