Articul8 AI

Senior Site Reliability Engineer (SRE)

full-time

Origin: 🇺🇸 United States • California

Job Level

Senior

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, NoSQL, Prometheus, Python, SQL, Terraform

About the role

  • Architect and maintain scalable, highly available infrastructure for our GenAI platform
  • Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance
  • Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency
  • Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
  • Participate in on-call rotations and provide rapid response to production incidents
  • Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads
  • Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives
  • Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads
  • Implement and enforce security best practices across all systems and environments
  • Create and maintain comprehensive documentation, including runbooks and knowledge base articles

Requirements

  • Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
  • 5+ years of experience in DevOps, SRE, or similar roles
  • Strong experience with cloud platforms (AWS, GCP, or Azure)
  • Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
  • Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
  • Solid background in containerization technologies (Docker, Kubernetes)
  • Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
  • Strong understanding of CI/CD pipelines and automation
  • Exceptional troubleshooting and problem-solving skills, particularly with complex systems
  • Experience supporting AI/ML systems in production (preferred)
  • Knowledge of GPU infrastructure management and optimization (preferred)
  • Familiarity with distributed systems and high-performance computing (preferred)
  • Experience with database systems (SQL and NoSQL) (preferred)
  • Certifications in cloud platforms (AWS, GCP, Azure) (preferred)
  • Experience with chaos engineering and resilience testing (preferred)
  • Knowledge of security best practices and compliance requirements (preferred)