DevOps Engineer

• Architect and maintain scalable, highly available infrastructure for our GenAI platform
• Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance
• Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency
• Define Service Level Indicators (SLIs), set Service Level Objectives (SLOs) against them, and measure and improve performance against those targets
• Participate in on-call rotations and provide rapid response to production incidents
• Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads
• Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives
• Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads
• Implement and enforce security best practices across all systems and environments
• Create and maintain comprehensive documentation, including runbooks and knowledge base articles

Senior Site Reliability Engineer, SRE

Job Level

Tech Stack

About the role

Requirements