Tech Stack
Ansible, AWS, Azure, Cloud, Distributed Systems, Docker, Google Cloud Platform, Grafana, Kubernetes, Prometheus, Python, Terraform
About the role
- Design, Build, and Maintain Core Infrastructure: Architect and implement scalable, highly available, and secure infrastructure on cloud platforms (GCP, AWS, Azure) to support our AI-driven applications and services.
- Automate Everything: Develop and maintain automation tools and frameworks to eliminate manual effort in deployment, configuration, and management of our production environment.
- Ensure System Reliability and Performance: Establish and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for our production systems. Proactively identify and resolve performance bottlenecks and availability issues.
- Manage ML Infrastructure and Pipelines: Collaborate with ML engineers to build and maintain robust CI/CD pipelines for machine learning models, ensuring seamless training, deployment, and monitoring.
- Incident Response and Post-Mortems: Lead incident response efforts to minimize downtime and conduct thorough post-incident reviews to identify root causes and implement preventative measures.
- Implement and Enhance Observability: Deploy and manage comprehensive monitoring, logging, and tracing solutions (e.g., Prometheus, Grafana, ELK stack) to provide deep visibility into system health.
- Capacity Planning and Cost Optimization: Forecast infrastructure needs and optimize resource utilization to ensure our platform can scale efficiently and cost-effectively.
- Foster a Culture of Reliability: Champion SRE best practices across the engineering organization and mentor team members on reliability, performance, and scalability.
Requirements
- Proven SRE and DevOps Experience: Demonstrated experience in a Site Reliability Engineering or DevOps role, managing complex, large-scale production environments.
- Cloud Infrastructure Expertise: Hands-on experience with one or more major cloud platforms (GCP, AWS, Azure).
- Proficiency in Infrastructure as Code: Strong skills with IaC tools such as Terraform, Ansible, or CloudFormation.
- Containerization and Orchestration Mastery: Deep knowledge of Docker and Kubernetes, including experience deploying and managing containerized applications in production.
- Strong Programming and Scripting Skills: Proficiency in languages such as Python, with a focus on automation and building reliable software.
- Experience with Monitoring and Observability Tools: Expertise in setting up and using monitoring and logging systems like Prometheus, Grafana, or the ELK stack.
- CI/CD Pipeline Development: A strong background in building and managing CI/CD pipelines for both software applications and machine learning models.
- Excellent Problem-Solving and Communication Skills: The ability to troubleshoot complex issues across the stack and clearly communicate technical concepts to both technical and non-technical stakeholders.
- Educational Background: A Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field.