Tech Stack
AWS, Azure, Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, NoSQL, Prometheus, Python, SQL, Terraform
About the role
- Architect and maintain scalable, highly available infrastructure for our GenAI platform
- Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance
- Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency
- Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
- Participate in on-call rotations and provide rapid response to production incidents
- Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads
- Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives
- Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads
- Implement and enforce security best practices across all systems and environments
- Create and maintain comprehensive documentation, including runbooks and knowledge base articles
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- 5+ years of experience in DevOps, SRE, or similar roles
- Strong experience with cloud platforms (AWS, GCP, or Azure)
- Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
- Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
- Solid background in containerization and orchestration technologies (Docker, Kubernetes)
- Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
- Strong understanding of CI/CD pipelines and automation
- Exceptional problem-solving skills and the ability to troubleshoot complex systems
- Experience supporting AI/ML systems in production (preferred)
- Knowledge of GPU infrastructure management and optimization (preferred)
- Familiarity with distributed systems and high-performance computing (preferred)
- Experience with database systems (SQL and NoSQL) (preferred)
- Certifications in cloud platforms (AWS, GCP, Azure) (preferred)
- Experience with chaos engineering and resilience testing (preferred)
- Knowledge of security best practices and compliance requirements (preferred)