Rad AI

Senior Machine Learning Engineer, Infrastructure

full-time

Origin: 🇺🇸 United States

Salary

💰 $170,000 - $200,000 per year

Job Level

Senior

Tech Stack

Airflow, Ansible, AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Grafana, JavaScript, Kubernetes, Postgres, Python, PyTorch, React, Terraform, TypeScript

About the role

  • Design, implement, and maintain the infrastructure that supports our machine learning applications, services, and workflows
  • Build, maintain, and improve our ML platform that supports continuous integration, continuous delivery, and continuous training for our machine learning models
  • Develop full-stack, cloud-native services and serverless architectures to build scalable and resilient systems
  • Plan, design, and develop components in the data pipeline to enable various machine learning models in production
  • Write code that meets internal standards for security, style, maintainability, and best practices for a high-scale HIPAA web environment
  • Design, deploy, and maintain the full ML platform stack including monitoring and observability, data analytics, backend integration with customer-facing products, and the full model R&D lifecycle
  • Work with Product Management, Research, and Engineering to iterate on new features and address inefficiencies across our AI/ML infrastructure
  • Connect language models to customer-facing products and serve those models to radiologists
  • Work in a backend-heavy role that includes full-stack development in Python and TypeScript

Requirements

  • 5+ years of industry experience in ML Engineering in cloud-native environments
  • In-depth knowledge of Python and JavaScript/TypeScript (preferred), or other modern languages in the ML domain
  • Strong experience with infrastructure and DevOps tools such as Kubernetes, Docker, and Ansible
  • Experience in distributed systems, storage systems, and databases
  • Strong knowledge of cloud computing platforms such as AWS (preferred), GCP, and Azure
  • Experience with infrastructure-as-code tools such as Terraform (preferred), Pulumi, CloudFormation, etc.
  • Experience with monitoring, tracing, and logging tools such as CloudWatch, New Relic, Grafana, etc.
  • Excellent communication skills, with a strong sense of ownership and a systematic approach to problem-solving
  • Proven ability to manage and lead active incidents, address root causes, and run blameless postmortems
  • Authorized to work lawfully in the US (application requires this)
  • Experience with React (nice to have)
  • Experience with PostgreSQL (nice to have)
  • Experience with orchestration tools like Airflow and Metaflow (nice to have)
  • Experience with data analytics tools like Hex, Amplitude, Retool (nice to have)
  • Experience working at a fast-growing startup (nice to have)
  • Experience in a HIPAA-compliant environment (nice to have)
  • Experience working with machine learning frameworks such as PyTorch and LangGraph (nice to have)
  • Experience productionizing or optimizing inference of LLMs or other NLP models (nice to have)