Restorative Neurotechnologies

ML Infrastructure Engineer

full-time

Location Type: Hybrid

Location: San Francisco, California, United States

Salary

💰 $180,000 - $230,000 per year

About the role

  • Create flexible and performant ML infrastructure
      • Design and build systems ML cloud infrastructure to enable massive-scale modeling and analytics
      • Support diverse model exploration, hyperparameter optimization, pretraining, fine-tuning, and evaluation processes
      • Design and optimize scalable distributed training pipelines, with support for features such as model sharding, cross-GPU communication, and real-time training monitoring
      • Create, operate, and maintain robust ML platforms and services across the model lifecycle
      • Make informed architecture decisions that balance performance, cost, reliability, and scalability
  • Build diverse and scalable data platforms
      • Design, build, and optimize massive-scale databases and data pipelines for scalable, flexible, and reliable data access
      • Explore research-driven, tailored data solutions using existing and simulated data, comparing performance and efficiency across solutions for typical data-access patterns
      • Create infrastructure and pipelines for ingesting internal and external datasets with varied shapes, formats, and associated metadata
      • Design and assess custom data formats for efficient storage and slicing of high-dimensional time-series data
      • Enable efficient data movement, preprocessing, and artifact management for data lineage and modeling reproducibility
  • Meet company standards for delivered solutions
      • Establish best practices for reliability, observability, reproducibility, and operational excellence across the ML ecosystem
      • Make informed and collaborative decisions with domain experts across the software & ML teams
      • Foster visibility and reproducibility within the company by maintaining extensive documentation of design decisions, evaluations of viable alternatives for selected solutions, pipeline assessments, etc.
      • Support ML R&D operations while preparing for eventual incorporation into product pipelines

Requirements

  • Bachelor's degree in Computer Science, Electrical Engineering, or a related technical discipline
  • 5+ years of industry experience in software engineering, large-scale data infrastructure, or systems ML
  • Extensive proficiency in Python
  • Familiarity with PyTorch
  • Experience designing, building, and maintaining high-throughput data pipelines for large and diverse datasets
  • Experience working with distributed-training frameworks (e.g., FSDP, DeepSpeed, Megatron-LM, or Ray)
  • Experience building or optimizing ML training pipelines for transformers or other large neural-network models
  • Demonstrated ability to partner closely with research and modeling teams to productionize workflows
  • Excellent communication and collaboration skills to work effectively on cross-functional and interdisciplinary teams
  • Technical ownership of at least one successfully implemented collaborative project