Crusoe

Staff Site Reliability Engineer, Managed AI

full-time

Origin: 🇺🇸 United States • California

Salary

💰 $204,000 - $247,000 per year

Job Level

Lead

Tech Stack

Cloud, Distributed Systems, Go, Java, Kubernetes, Python

About the role

  • Ensure the reliability and scalability of Crusoe’s AI-optimized cloud platform and managed AI services
  • Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
  • Build automation and reliability tooling to support distributed AI pipelines and inference services
  • Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
  • Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
  • Automate observability by building telemetry and performance tuning strategies for latency-sensitive AI services
  • Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
  • Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments

Requirements

  • Strong software engineering background — experience building production-grade systems beyond scripting or Bash
  • Demonstrated experience in distributed systems design and implementation
  • Hands-on work with large language models (LLMs) or AI/ML infrastructure
  • SRE mindset and experience (whether or not under the SRE title), including:
      • Defining and measuring SLIs/SLOs
      • Building monitoring and observability systems
      • Driving performance and reliability improvements
      • Designing fault-tolerant systems and automated testing strategies
  • Proficiency in at least one modern programming language (Python, Go, Java, C++)
  • Familiarity with Kubernetes or container orchestration platforms
  • Strong collaboration and communication skills
  • Ability to thrive in a fast-paced, mission-driven environment
  • Bonus: Experience scaling inference or training workloads for LLMs

NVIDIA

Senior Software Engineer – Container and Cloud Infrastructure

Senior · full-time · $184k–$357k / year · California · 🇺🇸 United States
Posted: 5 days ago · Source: nvidia.wd5.myworkdayjobs.com
Cloud, Docker, Kubernetes, Python

Cross River

Engineering Manager

Mid · Senior · full-time · 🇮🇱 Israel
Posted: 1 day ago · Source: www.comeet.com
Cloud, Distributed Systems, Docker, Kubernetes

Twelve Labs

Applied AI Engineer – Field Engineering

Mid · Senior · full-time · 🇺🇸 United States
Posted: 21 days ago · Source: jobs.ashbyhq.com
AWS, Azure, Cloud, Docker, FFmpeg, Google Cloud Platform, Kubernetes, Python, Rust

Personify Health

Tech Lead Software Engineer – GenAI-Enabled Products

Senior · full-time · $150k–$185k / year · Colorado · 🇺🇸 United States
Posted: 1 day ago · Source: careers-personifyhealth.icims.com

Island

Senior C++ Developer

Senior · full-time · Florida · 🇺🇸 United States
Posted: 17 days ago · Source: www.comeet.com
Android, AWS, Cloud, Distributed Systems, Go, iOS, JavaScript, Linux, MacOS, React