Tech Stack
Tools & technologies
- Airflow
- Cloud
- Docker
- Google Cloud Platform
- Grafana
- Kubernetes
- Prometheus
- Python
- PyTorch
- Terraform
About the role
Key responsibilities & impact
- Design, implement, and maintain scalable infrastructure for ML and GenAI applications
- Deploy, operate, and troubleshoot production ML/GenAI pipelines/services
- Build and optimize CI/CD pipelines for ML model deployment and serving
- Scale compute resources across CPU/GPU architectures to meet performance requirements
- Implement container orchestration with Kubernetes
- Architect and optimize cloud resources on GCP for ML training and inference
- Set up and maintain runtime frameworks and job management systems (Airflow, Kubeflow, MLflow, etc.)
- Establish monitoring, logging and alerting for systems observability
- Optimize system performance and resource utilization for cost efficiency
- Develop and enforce AIOps best practices across the organization
Requirements
What you’ll need
- Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent experience)
- 8+ years of overall software engineering experience
- 3+ years of focused experience in DevOps/AIOps or similar ML infrastructure roles
- Proficiency in infrastructure as code (IaC) using Terraform
- Strong experience with containerization and orchestration using Docker and Kubernetes
- Demonstrated expertise in cloud infrastructure management on GCP
- Proficiency with workflow management tools such as Airflow and Kubeflow
- Strong CI/CD expertise with experience implementing automated testing and deployment pipelines
- Experience scaling distributed compute architectures across CPU and GPU hardware
- Solid understanding of system performance optimization techniques
- Experience implementing comprehensive observability solutions for complex systems
- Knowledge of monitoring and logging tools (Prometheus, Grafana, ELK stack)
- Strong proficiency in Python
- Familiarity with ML frameworks such as PyTorch and ML platforms like Vertex AI
- Excellent problem-solving skills and ability to work independently
- Strong communication skills and ability to work effectively in cross-functional teams
Benefits
Comp & perks
- Health insurance
- 401(k) matching
- Flexible work arrangements
- Professional development
- Possible bonuses
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
- machine learning
- generative AI
- CI/CD
- container orchestration
- Kubernetes
- cloud infrastructure management
- GCP
- Terraform
- Python
- system performance optimization
Soft Skills
- problem-solving
- communication
- independent work
- cross-functional teamwork
