SEON

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: Hungary

About the role

  • Ensure the reliability, availability, and performance of our systems by implementing SRE best practices
  • Develop and maintain comprehensive monitoring and alerting systems using tools such as Prometheus, Grafana, ELK stack, etc.
  • Manage incident response and root cause analysis for production issues
  • Conduct postmortems to learn from failures and drive continuous improvement in the system’s reliability
  • Continuously monitor and optimize the performance of cloud infrastructure to ensure efficient resource utilization and cost-effectiveness
  • Automate routine tasks and processes to reduce manual intervention and increase efficiency
  • Analyze current system capacity and plan for future growth to ensure the infrastructure can scale with increasing demands
  • Define, measure, and monitor SLOs and SLIs to ensure that services meet their reliability targets
  • Work closely with engineering and product teams to provide feedback and suggestions on new architectures, ensuring they meet reliability and performance standards
  • Develop and maintain comprehensive documentation for architecture, infrastructure, and troubleshooting processes
  • Provide on-call support to ensure the continuous availability of our applications and infrastructure
  • Ensure that systems meet security and compliance requirements, performing regular audits and assessments based on the internal security team’s guidelines
  • Stay current with new technologies and industry trends, evaluating their potential impact on our infrastructure and reliability practices

Requirements

  • 6+ years of experience as an SRE, DevOps engineer, or in a similar engineering role, with a focus on reliability principles and practices
  • Strong hands-on experience working with Kubernetes (AWS EKS preferred)
  • Strong hands-on expertise in Terraform
  • Extensive experience working in multi-region and multi-account AWS setups
  • Strong experience with monitoring and logging tools such as Prometheus, Grafana, Elasticsearch, and Kibana
  • Strong experience deploying, maintaining and troubleshooting scalable distributed components in microservice-based architecture
  • Experience researching, troubleshooting, and resolving customer-critical requests related to latency, availability, and performance issues
  • Ability to quickly troubleshoot complex issues related to infrastructure
  • Proficiency with incident management tools such as PagerDuty, Opsgenie, etc.
  • Familiarity with CI pipelines and tools (GitHub Actions preferred)
  • Experience working with GitOps practices and CD tools (ArgoCD preferred)
  • A proactive approach to identifying and resolving issues independently with a strong problem-solving attitude
  • Excellent communication and collaboration skills to work effectively with cross-functional teams

Benefits

  • Flexible work arrangements

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Site Reliability Engineering (SRE), Kubernetes, Terraform, AWS, Prometheus, Grafana, Elasticsearch, Kibana, microservice architecture, incident management
Soft Skills
problem-solving, communication, collaboration, proactive approach, troubleshooting