
Senior Machine Learning Engineer
Chalice AI
full-time
Location Type: Hybrid
Location: New York City • New York • 🇺🇸 United States
Salary
💰 $180,000 - $200,000 per year
Job Level
Senior
Tech Stack
AWS • Cloud • EC2 • Grafana • Prometheus • PySpark • Python • PyTorch • Ray • Unity
About the role
- Architect, train, and maintain scalable neural network systems for audience modeling and bid optimization using PyTorch and Ray distributed training (Ray Train, Ray Tune, DDP)
- Build and optimize multi-GPU training pipelines on Databricks, including hyperparameter search with ASHA scheduling and early stopping
- Develop feature engineering pipelines using PySpark, including embedding layers (EmbeddingBag, Embedding) for categorical and behavioral features
- Implement model comparison workflows with champion/challenger evaluation on holdout data
- Build resilient training and batch inference workflows with a focus on automation, reproducibility, and checkpoint recovery
- Implement robust model monitoring and observability solutions (MLflow, Prometheus, Grafana, Datadog) to track drift, performance metrics (AUC, AUPRC, F1), and system health
- Manage model versioning, experiment tracking, and artifact persistence using MLflow and Unity Catalog
- Work closely with engineering teams to integrate model outputs into production systems and optimize dataflows for fault-tolerance
- Partner with product stakeholders to align ML efforts with business impact, KPIs, and product strategy across AI Audiences, AI Allocator, CPA Algo, and Curate AI
- Lead technical design reviews, contribute to internal Python packages, and enforce engineering best practices (testing, CI/CD, modularity)
- Stay current on ML infrastructure advancements (distributed training, inference optimization, model serving patterns) and help guide adoption internally
- Document system architectures, create runbooks, and enable team members to adopt and extend the ML framework
Requirements
- Master's Degree or PhD in Computer Science, Statistics, Machine Learning, or related discipline with 5-10 years of industry experience
- Strong proficiency in PyTorch for neural network development, including custom architectures with embedding layers, MLP backbones, and binary classification heads
- Production experience with Databricks including Delta Lake, Unity Catalog, Asset Bundles, and cluster management
- Strong grasp of MLOps best practices: experiment tracking (MLflow), model versioning, model serving, monitoring, and reproducibility
- Expert-level Python and PySpark skills for data processing and feature engineering at scale
- Experience building and maintaining batch inference pipelines with schema versioning and artifact management
- Familiarity with cloud platforms (AWS: S3, EC2) and data warehousing (Snowflake)
- Experience with CI/CD workflows including build automation, testing, and packaging using GitHub Actions and Make
- Excellent collaboration and communication skills; ability to work effectively in a cross-functional environment with Data Science, Product, and Engineering teams
Benefits
- Medical, Dental, and Vision coverage
- 401(k) options
- Unlimited PTO
- 11 Company Holidays
- Office-wide closure between Christmas Eve and New Year's
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
neural network systems • PyTorch • Ray distributed training • multi-GPU training pipelines • feature engineering • PySpark • model monitoring • experiment tracking • model versioning • Python
Soft skills
collaboration • communication • leadership • cross-functional teamwork • technical design reviews
Certifications
Master's Degree • PhD