Tech Stack
AWS, Cloud, Distributed Systems, Go, Kubernetes, Python, PyTorch, Ray
About the role
- Initially focus on re-architecting and enhancing the dynamic provisioning platform for remote ML experiments
- Lead collaboration with ML engineers and data scientists to understand needs and design robust tools and processes
- Drive design, development, and optimization of core ML infrastructure leveraging Kubernetes
- Champion best practices in software engineering, including code quality, testing, and system reliability
- Take ownership of key infrastructure components and initiatives
- Mentor junior engineers through high-level system design and code reviews
- Contribute to broader ML infrastructure enhancements to keep the training ecosystem cutting-edge and robust
Requirements
- 5+ years of professional experience in software engineering
- Strong knowledge of software engineering principles and distributed systems
- Expertise with Python or Go
- Experience with AWS services or other cloud platforms
- Experience with Kubernetes
- Strong written and oral communication skills
- Demonstrated ability to mentor and guide junior engineers
- Experience with the various stages of the ML development lifecycle is a plus
- Experience with ML frameworks such as PyTorch or Ray is a plus
ATS Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Python, Go, Kubernetes, AWS, ML frameworks, PyTorch, Ray, software engineering principles, distributed systems, ML development lifecycle
Soft skills
communication skills, mentoring, collaboration, leadership, code reviews, system design, ownership, best practices, code quality, testing