ControlUp

Site Reliability Engineer

Full-time

Origin: 🇺🇸 United States • Florida

Job Level

Mid-Level, Senior

Tech Stack

AWS, Azure, Cloud, Go, Google Cloud Platform, Kubernetes, Python, Terraform

About the role

  • Maintain and improve production stability across a large-scale infrastructure with thousands of Kubernetes nodes and instances
  • Monitor, analyze, and optimize system performance to ensure seamless user experience and SLA adherence
  • Implement and drive FinOps practices to control cloud spend and cost of goods sold (COGS)
  • Utilize ControlUp and other advanced monitoring/observability tools to proactively detect issues and ensure SLA compliance
  • Collaborate with development and operations teams to automate deployments, scaling, and incident response
  • Design and implement robust alerting, incident management, and post-mortem processes
  • Continuously evaluate and adopt cutting-edge technologies to improve reliability, performance, and cost efficiency
  • Provide technical guidance and best practices for infrastructure and application scalability
  • Participate in on-call rotations to respond to critical incidents and minimize downtime

Requirements

  • Proven experience as an SRE or in a similar role in large-scale environments with thousands of Kubernetes nodes and instances
  • Strong expertise in Kubernetes, container orchestration, and cloud infrastructure (AWS, GCP, Azure, or similar)
  • Solid understanding of performance tuning, monitoring, and observability tools (experience with ControlUp is a strong plus)
  • Experience with FinOps principles and tools to manage cloud costs and optimize resource utilization
  • Deep knowledge of production incident management, root cause analysis, and SLA management
  • Proficiency in scripting and automation (Python, Go, Bash, etc.)
  • Familiarity with CI/CD pipelines and infrastructure as code (Terraform, Helm, etc.)
  • Excellent communication skills and ability to work collaboratively across teams
  • Willingness to participate in on-call rotations