Kraken Digital Asset Exchange

Site Reliability Engineer – AI Agents

Site Reliability Engineer responsible for designing and operating AI infrastructure at Kraken, collaborating with multiple teams to ensure the reliability, scalability, and observability of its systems.

Posted 6/11/2026 • Full-time • Remote • 🇺🇸 United States • Mid-Level, Senior • 💰 $96,000 - $192,000 per year

Tech Stack

Tools & technologies
AWS, Cloud, Docker, Kubernetes, Python, Terraform

About the role

Key responsibilities & impact
  • Design, build, and operate the infrastructure layer supporting AI agent workflows in production
  • Ensure reliability, scalability, and observability of agentic systems across internal and external products
  • Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services
  • Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
  • Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
  • Utilize Infrastructure as Code (IaC) tools such as Terraform to provision and manage cloud (AWS) infrastructure components
  • Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
  • Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
  • Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
  • Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
  • Implement access controls and security best practices across AI infrastructure environments
  • Document architecture, runbooks, and best practices to support knowledge sharing across the team

Requirements

What you’ll need
  • 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
  • Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
  • Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
  • Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
  • Proficiency with Infrastructure as Code tools, particularly Terraform
  • Experience with containerization and orchestration, particularly Kubernetes and Docker
  • Solid understanding of cloud infrastructure, preferably AWS
  • Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
  • Experience designing and operating observability, monitoring, and alerting systems
  • Experience implementing incident response procedures and participating in on-call rotations
  • Strong collaboration skills working across data, AI, and engineering teams
  • High ownership mindset in a fast-moving, high-stakes production environment

Benefits

Comp & perks
  • Offers Equity
  • Offers Bonus
  • Wellness allowance
  • Health insurance (medical, dental, vision)
  • 401(k)
