Staff Software Engineer, GPU Infrastructure – HPC

Cohere

Full-time

Location Type: Remote

Location: California, United States

About the role

  • Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads.
  • Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects.
  • Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows.
  • Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently.
  • Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions.
  • Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient.
  • Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence.

Requirements

  • Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments.
  • Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
  • Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for contributing to open-source solutions over reinventing them.
  • Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
  • Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges.
  • Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.

Benefits

  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, with offices in Toronto, New York, San Francisco, London, and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard skills
ML infrastructure, HPC, GPU clusters, TPU clusters, distributed training frameworks, Kubernetes, Python, Go, Linux internals, RDMA networking
Soft skills
problem-solving, collaboration, mentorship, communication, innovation