Engineering Reliability: Designing and implementing self-healing infrastructure using Kubernetes to maintain high uptime and system integrity.
Scaling Cloud Ecosystems: Optimizing our cloud footprint (AWS/GCP/Azure) to ensure our platforms can handle rapid growth without breaking a sweat.
Innovating with AI: Proactively identifying opportunities to integrate AI tools into our observability stack to automate incident detection and root-cause analysis.
Eliminating Toil: Writing clean, efficient code to automate repetitive operational tasks, turning manual workflows into seamless "set and forget" processes.
Defining Observability: Building advanced monitoring and alerting frameworks that provide deep insights into system health and performance.

Requirements

Kubernetes Power User: Extensive experience managing production-grade K8s environments, including ingress, service mesh, and container security.
Cloud Infrastructure Expert: A deep understanding of cloud networking, storage, and compute services within a major provider (AWS, Azure, or GCP).
Proactive Mindset: You don't wait for a ticket; you naturally seek out system weaknesses and build solutions that eliminate them.
AI Curiosity: An active interest in the AI landscape and a desire to leverage LLMs or machine learning to improve SRE workflows.
Programming Literacy: Ideally, experience with at least one language (such as Java, Python, Go, or Ruby) to bridge the gap between software engineering and operations.

Benefits

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Kubernetes, AWS, GCP, Azure, AI tools, incident detection, root-cause analysis, programming (Java, Python, Go, Ruby), monitoring frameworks, alerting frameworks

Soft Skills

proactive mindset, problem-solving, curiosity