
Senior Site Reliability Engineer
SEON
full-time
Location Type: Remote
Location: Hungary
About the role
- Ensure the reliability, availability, and performance of our systems by implementing SRE best practices
- Develop and maintain comprehensive monitoring and alerting systems using tools such as Prometheus, Grafana, ELK stack, etc.
- Manage incident response and root cause analysis for production issues
- Conduct postmortems to learn from failures and drive continuous improvement in the system’s reliability
- Continuously monitor and optimize the performance of cloud infrastructure to ensure efficient resource utilization and cost-effectiveness
- Automate routine tasks and processes to reduce manual intervention and increase efficiency
- Analyze current system capacity and plan for future growth to ensure the infrastructure can scale with increasing demands
- Define, measure, and monitor SLOs and SLIs to ensure that services meet their reliability targets
- Work closely with engineering and product teams to provide feedback and suggestions on new architectures, ensuring they meet reliability and performance standards
- Develop and maintain comprehensive documentation for architecture, infrastructure, and troubleshooting processes
- Provide on-call support to ensure the continuous availability of our applications and infrastructure
- Ensure that systems meet security and compliance requirements, performing regular audits and assessments based on the internal security team’s guidelines
- Stay current with new technologies and industry trends, evaluating their potential impact on our infrastructure and reliability practices
Requirements
- 6+ years of experience as an SRE, DevOps engineer, or in a similar engineering role, with a focus on reliability principles and practices
- Strong hands-on experience working with Kubernetes (AWS EKS preferred)
- Strong hands-on expertise in Terraform
- Extensive experience working in multi-region and multi-account AWS setups
- Strong experience with monitoring and logging tools such as Prometheus, Grafana, Elasticsearch, and Kibana
- Strong experience deploying, maintaining, and troubleshooting scalable distributed components in a microservice-based architecture
- Experience researching, troubleshooting, and resolving customer-critical requests related to latency, availability, and performance issues
- Ability to quickly troubleshoot complex issues related to infrastructure
- Proficiency with incident management tools such as PagerDuty, Opsgenie, etc.
- Familiarity with CI pipelines and tools (GitHub Actions preferred)
- Experience working with GitOps practices and CD tools (ArgoCD preferred)
- A proactive approach to identifying and resolving issues independently with a strong problem-solving attitude
- Excellent communication and collaboration skills to work effectively with cross-functional teams
Benefits
- Flexible work arrangements
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability Engineering (SRE), Kubernetes, Terraform, AWS, Prometheus, Grafana, Elasticsearch, Kibana, microservice architecture, incident management
Soft Skills
problem-solving, communication, collaboration, proactive approach, troubleshooting