Salary
💰 $204,000 - $247,000 per year
Tech Stack
Cloud, Distributed Systems, Go, Java, Kubernetes, Python
About the role
- Ensure the reliability and scalability of Crusoe’s AI-optimized cloud platform and managed AI services
- Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
- Build automation and reliability tooling to support distributed AI pipelines and inference services
- Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
- Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
- Automate observability by building telemetry pipelines and performance-tuning strategies for latency-sensitive AI services
- Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
- Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments
Requirements
- Strong software engineering background — experience building production-grade systems beyond scripting or Bash
- Demonstrated experience in distributed systems design and implementation
- Hands-on work with large language models (LLMs) or AI/ML infrastructure
- SRE mindset and experience (with or without the SRE title), including: defining and measuring SLIs/SLOs; building monitoring and observability systems; driving performance and reliability improvements; designing fault-tolerant systems and automated testing strategies
- Proficiency in at least one modern programming language (Python, Go, Java, C++)
- Familiarity with Kubernetes or container orchestration platforms
- Strong collaboration and communication skills
- Ability to thrive in a fast-paced, mission-driven environment
- Bonus: Experience scaling inference or training workloads for LLMs