SPREEAI

Principal Engineer, AI Platform – Infrastructure

SPREEAI is hiring a Principal Engineer for AI Platform & Infrastructure to focus on multimodal AI systems and scalable infrastructure. The role collaborates with teams to deploy production-grade ML models for retail partners.

Posted 4/25/2026 • Full-time • San Francisco, California, 🇺🇸 United States • Lead

Tech Stack

Tools & technologies
Airflow, AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Python, PyTorch, Ray

About the role

Key responsibilities & impact
  • Build and operate SPREEAI’s end-to-end ML platform spanning training, evaluation, deployment, and monitoring.
  • Enable scalable and reliable training workflows through orchestration, infrastructure, and resource management systems.
  • Define platform standards for model packaging, model registry, dataset lineage, experiment tracking, checkpointing, and deployment automation.
  • Enable reliable and scalable inference deployments through standardized serving, orchestration, and monitoring frameworks.
  • Build and operate model deployment pipelines with versioning, reproducibility, rollback, approval gates, evaluation gates, and production observability.
  • Establish production SLOs for latency, availability, error rate, GPU saturation, cold-start time, cost per inference, and model quality drift.
  • Standardize and support serving infrastructure using modern inference runtimes such as vLLM, NVIDIA Triton, TensorRT-LLM, Ray Serve, TorchServe, ONNX Runtime, or equivalent systems.
  • Design and manage GPU allocation, scheduling, and resource utilization across training and inference workloads.
  • Improve GPU utilization, throughput, latency, reliability, and cost efficiency across model lifecycle systems.

Requirements

What you’ll need
  • 10+ years of software engineering / infrastructure experience, with 5+ years in ML infrastructure, MLOps, distributed systems, or AI platform engineering.
  • Deep experience with Python, PyTorch, Kubernetes, Docker, cloud infrastructure, and GPU-based workloads.
  • Strong understanding of distributed systems and large-scale ML infrastructure design.
  • Experience with ML workflow orchestration systems such as Ray, Kubeflow, Argo, Airflow, Flyte, or Metaflow.
  • Experience deploying and managing production inference systems using platforms like Triton, vLLM, TensorRT-LLM, Ray Serve, KServe, Seldon, BentoML, TorchServe, or custom services.
  • Strong understanding of inference optimization techniques such as batching, quantization, CUDA graphs, and memory-aware scheduling.
  • Experience with model registries, experiment tracking, CI/CD for ML, canary deployments, shadow traffic, rollback strategies, and production monitoring.
  • Strong cloud experience with AWS, GCP, Azure, or GPU-focused providers like CoreWeave, Lambda Labs, or RunPod.
  • Ability to debug performance bottlenecks across distributed systems, containers, networking, GPU memory, and storage layers.
  • Strong ownership mindset with the ability to define architecture, set platform standards, and drive execution across teams.

Benefits

Comp & perks
  • Health insurance
  • Remote work options
