Anyscale

Software Engineer, Platform Infrastructure – Foundations

full-time

Location Type: Hybrid

Location: San Francisco, California, United States

About the role

  • Design, build, and scale services that orchestrate Ray clusters across cloud and on-prem environments, supporting both VM-based and Kubernetes-based deployments
  • Optimize control plane components for large-scale, distributed AI/ML workloads
  • Build intelligent scheduling and resource management systems for heterogeneous compute clusters
  • Develop features to enhance the reliability, performance, scalability, and observability of Anyscale-managed Ray workloads
  • Support and optimize accelerator integration (e.g., GPUs, TPUs)
  • Handle container image management and dependency resolution for distributed workloads
  • Participate in code reviews and in design and architecture discussions
  • Provide on-call support, working closely with customer and field teams to troubleshoot infrastructure issues
  • Collaborate with leading distributed systems and machine learning experts to push the boundaries of AI infrastructure

Requirements

  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 3+ years of experience writing high-quality production code
  • Hands-on experience building and maintaining highly available, scalable, and performant distributed systems
  • Expertise in cloud-native technologies (AWS, Azure, GCP) and Kubernetes-based deployments
  • Deep understanding of networking, security, and authentication mechanisms in cloud environments
  • Familiarity with observability stacks (Prometheus, Grafana, etc.)
  • Proficiency in Go and Python
  • Knowledge of low-level operating system foundations (Linux kernel, file systems, containers)

Benefits

  • Health insurance
  • Professional development opportunities

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Go, Python, Kubernetes, AWS, Azure, GCP, distributed systems, observability, Linux, container management

Soft Skills
collaboration, troubleshooting, code review, design discussions, architecture discussions, customer support