Cerebras Systems

Senior Deployment Engineer, AI Inference

full-time

Location Type: Remote

Location: Remote • 🇨🇦 Canada


Job Level

Senior

Tech Stack

AWS, Docker, Grafana, Kubernetes, Linux, Prometheus, Python

About the role

  • Deploy AI inference replicas and cluster software across multiple datacenters.
  • Operate across heterogeneous datacenter environments undergoing rapid 10x growth.
  • Maximize capacity allocation and optimize replica placement using constraint-solver algorithms.
  • Operate bare-metal inference infrastructure while supporting the transition to a Kubernetes-based platform.
  • Develop and extend telemetry, observability and alerting solutions to ensure deployment reliability at scale.
  • Develop and extend a fully automated deployment pipeline to support fast software updates and capacity reallocation at scale.
  • Translate technical and customer needs into actionable requirements for the Dev Infra, Cluster, Platform and Core teams.
  • Stay up to date with the latest advancements in AI compute infrastructure and related technologies.

Requirements

  • 5-7 years of experience operating on-prem compute infrastructure (ideally in Machine Learning or High-Performance Computing) or developing and managing complex AWS plane infrastructure for hybrid deployments.
  • Strong proficiency in Python for automation, orchestration, and deployment tooling.
  • Solid understanding of Linux-based systems and command-line tools.
  • Extensive knowledge of Docker containers and container orchestration platforms such as Kubernetes.
  • Familiarity with spine-leaf (Clos) networking architecture.
  • Proficiency with telemetry and observability stacks such as Prometheus, InfluxDB and Grafana.
  • Strong ownership mindset and accountability for complex deployments.
  • Ability to work effectively in a fast-paced environment.

Benefits

  • Build a breakthrough AI platform beyond the constraints of the GPU.
  • Publish and open-source cutting-edge AI research.
  • Work on one of the fastest AI supercomputers in the world.
  • Enjoy job stability with startup vitality.
  • Enjoy a simple, non-corporate work culture that respects individual beliefs.
