
Research Engineer – Distributed Training
Prime Intellect
full-time
Location Type: Hybrid
Location: San Francisco • California • United States
About the role
- Lead and participate in novel research to build a massive-scale, highly reliable, and secure decentralized training orchestration solution.
- Optimize the performance, cost, and resource utilization of AI workloads by applying the most recent advances in compute and memory optimization.
- Contribute to the development of our open-source libraries and frameworks for distributed model training.
- Publish research in top-tier AI conferences such as ICML & NeurIPS.
- Distill highly technical project outcomes into approachable technical blog posts for our customers and developers.
- Stay up to date with the latest advancements in AI/ML infrastructure, tools, and decentralized training research, and proactively identify opportunities to enhance our platform's capabilities and user experience.
Requirements
- Strong background in AI/ML engineering, with extensive experience in designing and implementing end-to-end pipelines for training and deploying large-scale AI models.
- Deep expertise in distributed training techniques, frameworks (e.g., PyTorch Distributed, DeepSpeed, MosaicML’s LLM Foundry), and tools (e.g., Ray) for optimizing the performance and scalability of AI workloads.
- Experience in large-scale model training, including distributed training techniques such as data, tensor, and pipeline parallelism.
- Solid understanding of MLOps best practices, including model versioning, experiment tracking, and continuous integration/deployment (CI/CD) pipelines.
- Passion for advancing the state-of-the-art in decentralized AI model training and democratizing access to AI capabilities for researchers, developers, and businesses worldwide.
- If you're not familiar with these but feel that you can contribute to our mission, and you're a high-energy person, get familiar with these resources (here, here and here) and please reach out!
Benefits
- Competitive compensation, including equity incentives, aligning your success with the growth and impact of Prime Intellect.
- Flexible work arrangements, with the option to work remotely or in-person at our offices in San Francisco.
- Visa sponsorship and relocation assistance for international candidates.
- Quarterly team off-sites, hackathons, conferences and learning opportunities.
- Opportunity to work with a talented, hard-working and mission-driven team, united by a shared passion for leveraging technology to accelerate science and AI.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
AI engineering, ML engineering, end-to-end pipelines, distributed training techniques, data parallelism, tensor parallelism, pipeline parallelism, MLOps best practices, model versioning, continuous integration/deployment
Soft Skills
leadership, communication, technical writing, collaboration, problem-solving, adaptability, passion for innovation, customer engagement, research publication, proactive identification of opportunities