Salary
💰 $100,000 - $175,000 per year
Tech Stack
Cloud, Go, Kubernetes, Node.js
About the role
- Develop and manage scalable infrastructure to support research workloads
- Own existing Kubernetes cluster on bare-metal H100 cloud instances and enhance it to support new workloads, users, and features
- Deliver a scalable and easy-to-use compute cluster to support impactful research
- Empower research team to solve day-to-day compute problems and streamline recurring tasks
- Maintain and develop in-cluster services such as backups, experiment tracking, and in-house LLM-based cluster support bot
- Maintain cluster stability (>95% uptime outside planned maintenance windows)
- Maintain situational awareness of cloud GPU market and assist with vendor comparisons
- Implement security measures to protect the cluster against insider and external threats
- Streamline secure workflows (e.g., OAuth reverse proxy for internal dashboards)
- Champion security; maintain and extend the MDM system
- Architect Kubernetes cluster to support novel ML workloads and assist projects with bespoke requirements
- Improve observability over cluster resources and GPU utilization
Requirements
- Have Kubernetes or other system administration experience
- Have curiosity and a willingness to rapidly learn the needs of a new space
- Are self-directed and comfortable with ambiguous or rapidly evolving requirements
- Are willing to be on-call during waking hours for cluster issues ahead of major deadlines (for a few weeks a quarter)
- Are interested in improving our security posture by identifying, implementing, and administering security policies
- Preferable: experience supporting ML/AI workloads
- Preferable: previously worked in research environments or startups
- Preferable: experienced in administering compute or GPU clusters
- Preferable: able to adopt a security mindset
- Preferable: willing to be part of an eventual on-call rotation, if required