Principal Engineer, AI Platform – Infrastructure

SPREEAI

SPREEAI is hiring a Principal Engineer for AI Platform & Infrastructure, focused on multimodal AI systems and scalable infrastructure. The role collaborates with teams to deploy production-grade ML models for retail partners.

Posted 4/25/2026 • Full-time • San Francisco • California • 🇺🇸 United States • Lead

Tech Stack

Tools & technologies

Airflow, AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Python, PyTorch, Ray

About the role

Key responsibilities & impact

Build and operate SPREEAI’s end-to-end ML platform spanning training, evaluation, deployment, and monitoring.
Enable scalable and reliable training workflows through orchestration, infrastructure, and resource management systems.
Define platform standards for model packaging, model registry, dataset lineage, experiment tracking, checkpointing, and deployment automation.
Enable reliable and scalable inference deployments through standardized serving, orchestration, and monitoring frameworks.
Build and operate model deployment pipelines with versioning, reproducibility, rollback, approval gates, evaluation gates, and production observability.
Establish production SLOs for latency, availability, error rate, GPU saturation, cold-start time, cost per inference, and model quality drift.
Standardize and support serving infrastructure using modern inference runtimes such as vLLM, NVIDIA Triton, TensorRT-LLM, Ray Serve, TorchServe, ONNX Runtime, or equivalent systems.
Design and manage GPU allocation, scheduling, and resource utilization across training and inference workloads.
Improve GPU utilization, throughput, latency, reliability, and cost efficiency across model lifecycle systems.

Requirements

What you’ll need

10+ years of software engineering / infrastructure experience, with 5+ years in ML infrastructure, MLOps, distributed systems, or AI platform engineering.
Deep experience with Python, PyTorch, Kubernetes, Docker, cloud infrastructure, and GPU-based workloads.
Strong understanding of distributed systems and large-scale ML infrastructure design.
Experience with ML workflow orchestration systems such as Ray, Kubeflow, Argo, Airflow, Flyte, or Metaflow.
Experience deploying and managing production inference systems using platforms like Triton, vLLM, TensorRT-LLM, Ray Serve, KServe, Seldon, BentoML, TorchServe, or custom services.
Strong understanding of inference optimization techniques such as batching, quantization, CUDA graphs, and memory-aware scheduling.
Experience with model registries, experiment tracking, CI/CD for ML, canary deployments, shadow traffic, rollback strategies, and production monitoring.
Strong cloud experience across AWS, GCP, Azure, or GPU-focused providers like CoreWeave, Lambda Labs, or RunPod.
Ability to debug performance bottlenecks across distributed systems, containers, networking, GPU memory, and storage layers.
Strong ownership mindset with the ability to define architecture, set platform standards, and drive execution across teams.

Benefits

Comp & perks

Health insurance
Remote work options

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Python, PyTorch, Kubernetes, Docker, ML infrastructure, MLOps, distributed systems, inference optimization, model registries, CI/CD for ML

Soft Skills

Strong ownership mindset, ability to define architecture, drive execution across teams