Principal Engineer, AI Platform – Infrastructure
SPREEAI
Principal Engineer for AI Platform & Infrastructure at SPREEAI, focusing on multimodal AI systems and scalable infrastructure. The role collaborates with teams to deploy production-grade ML models for retail partners.
Tech Stack
Tools & technologies
Airflow, AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Kubernetes, Python, PyTorch, Ray
About the role
Key responsibilities & impact
- Build and operate SPREEAI’s end-to-end ML platform spanning training, evaluation, deployment, and monitoring.
- Enable scalable and reliable training workflows through orchestration, infrastructure, and resource management systems.
- Define platform standards for model packaging, model registry, dataset lineage, experiment tracking, checkpointing, and deployment automation.
- Enable reliable and scalable inference deployments through standardized serving, orchestration, and monitoring frameworks.
- Build and operate model deployment pipelines with versioning, reproducibility, rollback, approval gates, evaluation gates, and production observability.
- Establish production SLOs for latency, availability, error rate, GPU saturation, cold-start time, cost per inference, and model quality drift.
- Standardize and support serving infrastructure using modern inference runtimes such as vLLM, NVIDIA Triton, TensorRT-LLM, Ray Serve, TorchServe, ONNX Runtime, or equivalent systems.
- Design and manage GPU allocation, scheduling, and resource utilization across training and inference workloads.
- Improve GPU utilization, throughput, latency, reliability, and cost efficiency across model lifecycle systems.
Requirements
What you’ll need
- 10+ years of software engineering / infrastructure experience, with 5+ years in ML infrastructure, MLOps, distributed systems, or AI platform engineering.
- Deep experience with Python, PyTorch, Kubernetes, Docker, cloud infrastructure, and GPU-based workloads.
- Strong understanding of distributed systems and large-scale ML infrastructure design.
- Experience with ML workflow orchestration systems such as Ray, Kubeflow, Argo, Airflow, Flyte, or Metaflow.
- Experience deploying and managing production inference systems using platforms like Triton, vLLM, TensorRT-LLM, Ray Serve, KServe, Seldon, BentoML, TorchServe, or custom services.
- Strong understanding of inference optimization techniques such as batching, quantization, CUDA graphs, and memory-aware scheduling.
- Experience with model registries, experiment tracking, CI/CD for ML, canary deployments, shadow traffic, rollback strategies, and production monitoring.
- Strong cloud experience across AWS, GCP, Azure, or GPU-focused providers like CoreWeave, Lambda Labs, or RunPod.
- Ability to debug performance bottlenecks across distributed systems, containers, networking, GPU memory, and storage layers.
- Strong ownership mindset with the ability to define architecture, set platform standards, and drive execution across teams.
Benefits
Comp & perks
- Health insurance
- Remote work options
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Python, PyTorch, Kubernetes, Docker, ML infrastructure, MLOps, distributed systems, inference optimization, model registries, CI/CD for ML
Soft Skills
strong ownership mindset, ability to define architecture, drive execution across teams