Salary
💰 $184,000 - $356,500 per year
Tech Stack
Cloud, Distributed Systems, Go, Java, Kubernetes, Linux, Node.js, Open Source, OpenStack, Python, Terraform
About the role
- Design, build, deploy, and run internal tooling for a large-scale AI training and inference platform built on top of cloud infrastructure.
- Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
- Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement.
- Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Participate in an on-call rotation to support production systems.
Requirements
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
- 6+ years of experience.
- A track record of balancing initiating your own projects, convincing others to collaborate with you, and collaborating well on projects initiated by others.
- Experience with infrastructure automation and distributed systems design, including developing tools for running large-scale private or public cloud systems in production.
- Experience in one or more of the following: Python, Go, C/C++, Java.
- In-depth knowledge of one or more of the following: Linux, networking, storage, and container technologies.
- Experience with public cloud, Infrastructure as Code (IaC), and Terraform.
- Experience with distributed systems.