Tech Stack
AWS, Azure, Cloud, Go, Google Cloud Platform, Kubernetes, Python, Terraform
About the role
- Maintain and improve production stability across a large-scale infrastructure with thousands of Kubernetes nodes and instances
- Monitor, analyze, and optimize system performance to ensure a seamless user experience and SLA adherence
- Implement and drive FinOps practices to manage cloud cost efficiency and cost of goods sold (COGS) effectively
- Utilize ControlUp and other advanced monitoring/observability tools to proactively detect issues and ensure SLA compliance
- Collaborate with development and operations teams to automate deployments, scaling, and incident response
- Design and implement robust alerting, incident management, and post-mortem processes
- Continuously evaluate and adopt cutting-edge technologies to improve reliability, performance, and cost efficiency
- Provide technical guidance and best practices for infrastructure and application scalability
- Participate in on-call rotations to respond to critical incidents and minimize downtime
Requirements
- Proven experience as an SRE or in a similar role in large-scale environments with thousands of Kubernetes nodes and instances
- Strong expertise in Kubernetes, container orchestration, and cloud infrastructure (AWS, GCP, Azure, or similar)
- Solid understanding of performance tuning, monitoring, and observability tools (experience with ControlUp is a strong plus)
- Experience with FinOps principles and tools to manage cloud costs and optimize resource utilization
- Deep knowledge of production incident management, root cause analysis, and SLA management
- Proficiency in scripting and automation (Python, Go, Bash, etc.)
- Familiarity with CI/CD pipelines and infrastructure as code (Terraform, Helm, etc.)
- Excellent communication skills and ability to work collaboratively across teams
- Willingness to participate in on-call rotations