Site Reliability Engineer

ControlUp

full-time

Posted on: 9/15/2025

Origin: 🇺🇸 United States • Florida

Mid-Level, Senior

AWS, Azure, Cloud, Go, Google Cloud Platform, Kubernetes, Python, Terraform

About the role

Maintain and improve production stability across a large-scale infrastructure with thousands of Kubernetes nodes and instances
Monitor, analyze, and optimize system performance to ensure a seamless user experience and SLA adherence
Implement and drive FinOps practices to improve cloud cost efficiency and manage cost of goods sold (COGS)
Use ControlUp and other advanced monitoring/observability tools to proactively detect issues and ensure SLA compliance
Collaborate with development and operations teams to automate deployments, scaling, and incident response
Design and implement robust alerting, incident management, and post-mortem processes
Continuously evaluate and adopt cutting-edge technologies to improve reliability, performance, and cost efficiency
Provide technical guidance and best practices for infrastructure and application scalability
Participate in on-call rotations to respond to critical incidents and minimize downtime

Requirements

Proven experience as an SRE or in a similar role in large-scale environments with thousands of Kubernetes nodes and instances
Strong expertise in Kubernetes, container orchestration, and cloud infrastructure (AWS, GCP, Azure, or similar)
Solid understanding of performance tuning, monitoring, and observability tools (experience with ControlUp is a strong plus)
Experience with FinOps principles and tools to manage cloud costs and optimize resource utilization
Deep knowledge of production incident management, root cause analysis, and SLA management
Proficiency in scripting and automation (Python, Go, Bash, etc.)
Familiarity with CI/CD pipelines and infrastructure as code (Terraform, Helm, etc.)
Excellent communication skills and the ability to work collaboratively across teams
Willingness to participate in on-call rotations