Site Reliability Engineer – AI Agents
Kraken Digital Asset Exchange
Site Reliability Engineer responsible for designing and operating AI infrastructure at Kraken, collaborating with multiple teams to ensure the reliability, scalability, and observability of systems.
Posted 6/11/2026 • Full-time • Remote • 🇺🇸 United States • Mid-Level, Senior • 💰 $96,000 - $192,000 per year
Tech Stack
Tools & technologies: AWS, Cloud, Docker, Kubernetes, Python, Terraform
About the role
Key responsibilities & impact
- Design, build, and operate the infrastructure layer supporting AI agent workflows in production
- Ensure reliability, scalability, and observability of agentic systems across internal and external products
- Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services
- Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
- Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
- Utilize Infrastructure as Code (IaC) tools such as Terraform to provision and manage cloud (AWS) infrastructure components
- Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
- Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
- Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
- Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
- Implement access controls and security best practices across AI infrastructure environments
- Document architecture, runbooks, and best practices to support knowledge sharing across the team
Requirements
What you’ll need
- 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
- Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
- Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
- Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
- Proficiency with Infrastructure as Code tools, particularly Terraform
- Experience with containerization and orchestration, particularly Kubernetes and Docker
- Solid understanding of cloud infrastructure, preferably AWS
- Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
- Experience designing and operating observability, monitoring, and alerting systems
- Experience implementing incident response procedures and participating in on-call rotations
- Strong collaboration skills working across data, AI, and engineering teams
- High ownership mindset in a fast-moving, high-stakes production environment
Benefits
Comp & perks
- Equity
- Bonus
- Wellness allowance
- Health insurance (medical, dental, vision)
- 401(k)
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Infrastructure as Code, Terraform, Kubernetes, Docker, CI/CD, APIs, SDKs, monitoring, alerting, scripting
Soft Skills
collaboration, ownership mindset, communication, incident response, problem-solving, scalability, reliability, observability, adaptability, teamwork