
ML Infrastructure Engineer
Restorative Neurotechnologies
Full-time
Location Type: Hybrid
Location: San Francisco • California • United States
Salary
💰 $180,000 - $230,000 per year
About the role
- Create flexible and performant ML infrastructure
- Design and build cloud infrastructure for ML systems to enable massive-scale modeling and analytics
- Support diverse model exploration, hyperparameter optimization, pretraining, fine-tuning, and evaluation processes
- Design and optimize scalable distributed training pipelines, with support for features such as model sharding, cross-GPU communication, and real-time training monitoring
- Create, operate, and maintain robust ML platforms and services across the model lifecycle
- Make informed architecture decisions that balance performance, cost, reliability, and scalability
- Build diverse and scalable data platforms
- Design, build, and optimize massive-scale databases and data pipelines for scalable, flexible, and reliable data access
- Explore research-driven, tailored data solutions using existing and simulated data, comparing performance and efficiency across solutions for typical data-access patterns
- Create infrastructure and pipelines for ingesting internal and external datasets with varied shapes, formats, and associated metadata
- Design and assess custom data formats for efficient storage and slicing of high-dimensional time-series data
- Enable efficient data movement, preprocessing, and artifact management for data lineage and modeling reproducibility
- Meet company standards for delivered solutions
- Establish best practices for reliability, observability, reproducibility, and operational excellence across the ML ecosystem
- Make informed and collaborative decisions with domain experts across the software & ML teams
- Foster visibility and reproducibility within the company by maintaining extensive documentation, including design decisions, evaluations of viable alternatives to selected solutions, and pipeline assessments
- Support ML R&D operations while preparing for eventual incorporation into product pipelines
Requirements
- Bachelor's degree in Computer Science, Electrical Engineering, or a related technical discipline
- 5+ years of industry experience in software engineering, large-scale data infrastructure, or systems ML
- Extensive proficiency in Python
- Familiarity with PyTorch
- Experience designing, building, and maintaining high-throughput data pipelines for large and diverse datasets
- Experience working with distributed-training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM, Ray)
- Experience building or optimizing ML training pipelines for transformers or other large neural-network models
- Demonstrated ability to partner closely with research and modeling teams to productionize workflows
- Excellent communication and collaboration skills to work effectively on cross-functional and interdisciplinary teams
- Experience with technical ownership of at least one successfully implemented collaborative project