
Senior Site Reliability Engineer – SRE
Xenon Seven
full-time
Location Type: Remote
Location: Remote • 🇩🇪 Germany
Job Level
Senior
Tech Stack
Elasticsearch, Grafana, Kubernetes, Linux, Logstash, OpenShift, Prometheus, Unix
About the role
- Design and architect highly available and scalable OpenShift/Kubernetes infrastructure for banking applications hosted on on-premise servers
- Lead and implement comprehensive monitoring and observability strategy using Prometheus and Grafana
- Design and oversee centralized logging infrastructure using ELK Stack (Elasticsearch, Logstash, Kibana)
- Lead SRE best practices implementation and adoption of production support standards across teams
- Mentor and coach junior SRE and DevOps engineers on OpenShift, Kubernetes, monitoring, and production support
- Define and implement Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs) with measurable metrics
- Lead incident response strategy, post-incident reviews, and drive continuous improvement in production stability
- Architect and implement advanced alerting, monitoring dashboards, and visualization strategies using Prometheus and Grafana
- Design automation frameworks and tools to reduce operational toil and improve production efficiency
- Lead OpenShift/Kubernetes cluster upgrades, security patches, and infrastructure modernization on-premise
- Establish production support procedures, on-call rotation policies, and escalation frameworks
- Optimize system performance, cost, and resource utilization across containerized on-premise infrastructure
- Conduct capacity planning, performance optimization, and infrastructure scaling initiatives
- Lead technical architecture reviews and infrastructure design decisions for banking applications
- Manage on-premise data center resources and infrastructure planning
- Participate in 24/7 on-call rotation and escalation for critical production incidents
- Ensure compliance, security hardening, and disaster recovery procedures for financial systems
Requirements
- BSc in Computer Science, Information Technology, Software Engineering, or related field
- 5+ years of hands-on SRE, DevOps, or Production Engineering experience
- 3+ years of experience leading SRE teams or managing production support operations
- 3+ years of hands-on experience managing on-premise OpenShift and Kubernetes infrastructure
- Expert-level experience with Prometheus for monitoring and alerting in production
- Expert-level experience with Grafana for creating comprehensive monitoring dashboards
- Advanced experience with ELK Stack (Elasticsearch, Logstash, Kibana) for logging and log analysis
- Proven experience designing and scaling production systems for high-traffic banking applications
- Deep expertise in Linux/Unix system administration and container networking
- Advanced knowledge of CI/CD automation and deployment strategies
- Hands-on experience with database management, tuning, and optimization on-premises
- Strong experience with infrastructure automation and Infrastructure as Code
- Proven 24/7 production support experience in mission-critical environments
- Experience managing on-premise data center infrastructure
- Proven leadership skills and ability to mentor junior engineers
- Excellent communication skills and ability to present to executive stakeholders
- Experience in the financial services or banking sector is highly preferred
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
OpenShift, Kubernetes, Prometheus, Grafana, ELK Stack, Linux/Unix administration, CI/CD automation, Infrastructure as Code, database management, production support
Soft skills
leadership, mentoring, communication, incident response, continuous improvement, capacity planning, team collaboration, problem-solving, presentation skills, organizational skills