Salary
💰 $204,000 - $247,000 per year
Tech Stack
Cloud, Distributed Systems, Go, Java, Kubernetes, Python
About the role
- Ensure the reliability and scalability of Crusoe’s AI-optimized cloud platform and managed AI services
- Design and operate reliable managed AI services with a focus on serving and scaling LLM workloads
- Build automation and reliability tooling to support distributed AI pipelines and inference services
- Define, measure, and improve SLIs/SLOs across AI workloads to ensure performance and reliability targets are met
- Collaborate with AI, platform, and infrastructure teams to optimize large-scale training and inference clusters
- Automate observability by building telemetry pipelines and performance-tuning strategies for latency-sensitive AI services
- Investigate and resolve reliability issues in distributed AI systems using telemetry, logs, and profiling
- Contribute to the architecture of next-generation distributed systems purpose-built for AI-first environments
Requirements
- Strong software engineering background — experience building production-grade systems beyond scripting or Bash
- Demonstrated experience in distributed systems design and implementation
- Hands-on work with large language models (LLMs) or AI/ML infrastructure
- SRE mindset and experience (with or without the SRE title), including: defining and measuring SLIs/SLOs; building monitoring and observability systems; driving performance and reliability improvements; designing fault-tolerant systems and automated testing strategies
- Proficiency in at least one modern programming language (Python, Go, Java, C++)
- Familiarity with Kubernetes or container orchestration platforms
- Strong collaboration and communication skills
- Ability to thrive in a fast-paced, mission-driven environment
- Bonus: Experience scaling inference or training workloads for LLMs