Articul8 AI

Senior Site Reliability Engineer, SRE

full-time

Location: California • 🇺🇸 United States

Job Level

Senior

Tech Stack

AWS, Azure, Cloud, Distributed Systems, Docker, Go, Google Cloud Platform, Grafana, Kubernetes, NoSQL, Prometheus, Python, SQL, Terraform

About the role

  • Architect and maintain scalable, highly available infrastructure for our GenAI platform
  • Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance
  • Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency
  • Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
  • Participate in on-call rotations and provide rapid response to production incidents
  • Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads
  • Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives
  • Optimize infrastructure for performance, scalability, and cost-effectiveness—especially for high-demand AI workloads
  • Implement and enforce security best practices across all systems and environments
  • Create and maintain comprehensive documentation, including runbooks and knowledge base articles

Requirements

  • Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
  • 5+ years of experience in DevOps, SRE, or similar roles
  • Strong experience with cloud platforms (AWS, GCP, or Azure)
  • Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
  • Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
  • Solid background in containerization technologies (Docker, Kubernetes)
  • Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
  • Strong understanding of CI/CD pipelines and automation
  • Exceptional troubleshooting and problem-solving skills, including the ability to diagnose issues in complex systems
  • Experience supporting AI/ML systems in production (preferred)
  • Knowledge of GPU infrastructure management and optimization (preferred)
  • Familiarity with distributed systems and high-performance computing (preferred)
  • Experience with database systems (SQL and NoSQL) (preferred)
  • Certifications in cloud platforms (AWS, GCP, Azure) (preferred)
  • Experience with chaos engineering and resilience testing (preferred)
  • Knowledge of security best practices and compliance requirements (preferred)
Coates Group

Senior DevOps Engineer

Senior · full-time · $125k–$140k / year · Illinois · 🇺🇸 United States
Posted: 3 hours ago · Source: jobs.lever.co
AWS, Cloud, Docker, IoT, Linux, Microservices, Python
Eduphoria! Inc.

AWS DevOps Engineer

Mid · Senior · full-time · $110k–$125k / year · Florida, Illinois, Kansas, Maryland, North Carolina, Ohio, Tennessee, Texas, Virginia · 🇺🇸 United States
Posted: 17 hours ago · Source: eduphoria.applytojob.com
AWS, Azure, Cloud, EC2, Linux, MySQL, .NET, SQL, Terraform
GEICO

DevOps Engineer II – FinTech Commissions, Substantiation

Mid · Senior · full-time · $75k–$160k / year · District of Columbia, Maryland, Texas, Virginia · 🇺🇸 United States
Posted: 18 hours ago · Source: geico.wd1.myworkdayjobs.com
AWS, Azure, Cloud, Distributed Systems, Java, .NET, NoSQL, Python, SQL
ParentSquare

Site Reliability Engineer

Mid · Senior · full-time · $170k–$200k / year · 🇺🇸 United States
Posted: 18 hours ago · Source: ats.rippling.com
Ansible, AWS, Azure, Chef, Cloud, Distributed Systems, Docker, Google Cloud Platform, Grafana, Kubernetes, Linux, Prometheus, +4 more
Leidos

DevOps Technical Lead

Senior · full-time · $105k–$189k / year · 🇺🇸 United States
Posted: 19 hours ago · Source: leidos.wd5.myworkdayjobs.com
AWS, Cloud, Grafana, Jenkins, JMeter, Kafka, Linux, Maven, Selenium, Splunk, Zookeeper