Senior Site Reliability Engineer, SRE

1GLOBAL

Senior Site Reliability Engineer ensuring stability and reliability for a global mobile connectivity provider. Collaborating with DevOps and Infrastructure teams to enhance system reliability and performance.

Posted 6/16/2026 • full-time • Berlin • 🇩🇪 Germany • Senior

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills

Linux systems engineering, distributed systems, networking, high-availability architecture, redundancy testing, disaster recovery, monitoring, observability, Python, Go

Soft Skills

analytical skills, problem-solving skills, communication skills, collaboration skills

Tools & Technologies

Prometheus, Grafana, Loki, Thanos, OpenTelemetry, Kubernetes, AWS, Terraform, Bash, service mesh

Industry Keywords

Site Reliability Engineering, infrastructure engineering, incident management, capacity planning, performance benchmarking, resilience audits, cloud cost-optimization, fault-injection testing, chaos testing, operational guidelines

Tech Stack

Tools & technologies

AWS, Cloud, Distributed Systems, DNS, EC2, Go, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform

About the role

Key responsibilities & impact

Act as a senior technical contributor within the SRE team, mentoring peers and setting the technical bar for reliability engineering.
Define, measure, and maintain SLIs and SLOs for core infrastructure and customer-facing services.
Plan and execute redundancy and resilience testing across service, infrastructure, and networking layers — validating failover, HA configurations, and disaster recovery readiness.
Design and implement automated recovery mechanisms, self-healing workflows, and intelligent alerting systems.
Drive incident response, root-cause analysis, and blameless post-mortems, and track the resulting corrective and preventive actions through to implementation for continuous improvement.
Develop and enhance observability (metrics, logs, traces) using Prometheus, Grafana, Loki, and OpenTelemetry.
Partner with Infrastructure and DevOps teams to ensure deployment safety, rollback policies, and configuration consistency.
Proactively identify weaknesses through fault-injection, load, and chaos testing.
Continuously reduce operational toil through automation and reliability tooling.
Contribute to on-call practices, improving alert quality, runbooks, escalation procedures, and incident management processes.
Perform capacity planning, performance benchmarking, and resilience audits across systems.
Ensure compliance with security, reliability, and availability standards.
Create and maintain internal documentation, playbooks, and operational guidelines for peers and users.
Contribute to cloud cost-optimization initiatives, including reserved capacity planning, autoscaling design, storage tiering, workload right-sizing, and continuous anomaly detection.

Requirements

What you’ll need

A minimum of 5 years of experience in Site Reliability, Systems, or Infrastructure Engineering (including 2+ years in a dedicated SRE role).
Strong expertise in Linux systems engineering, distributed systems, and networking.
Proven experience building and running high-availability, mission-critical production systems.
Hands-on experience with redundancy and failover testing, disaster recovery, and high-availability architecture validation.
Deep understanding of monitoring, observability, and incident management principles.
Experience with Prometheus, Grafana, Loki, Thanos, and OpenTelemetry or similar tools.
Proficiency in Python, Go, and Bash for automation and reliability tooling.
Strong knowledge of Kubernetes, container orchestration, and service mesh architectures.
Experience with AWS (EKS, EC2, VPC) and on-premises infrastructure integration.
Proficiency in Infrastructure as Code tools such as Terraform.
Understanding of networking fundamentals (routing, load balancing, BGP, DNS, VXLAN, etc.).
Excellent analytical and problem-solving skills, capable of operating under pressure.
Strong communication and collaboration skills across distributed and cross-functional teams.

Benefits

Comp & perks

Growth Opportunities: Advance your career in one of the fastest-growing telecommunications companies, expanding over 100% year-on-year under the leadership of successful tech entrepreneurs.
Major Transaction Exposure: Be in the driver’s seat for transactions that will shape the future of the telco industry.
Work with a Talented Team: From the Board and the Founders to the Senior Management Team, you will collaborate daily with the most capable and renowned external advisors and be constantly exposed to talented and driven individuals.
Dynamic Work Environment: Thrive in a collaborative, fast-paced workplace where innovation is encouraged, and every contribution counts.
Professional Development: Work alongside industry experts to enhance your skills and knowledge in a cutting-edge field.
International Experience: Gain opportunities to work in different 1GLOBAL offices around the world as you grow within the company.
Open Communication Culture: Join a team where your ideas are heard, and open dialogue is encouraged, fostering a supportive and transparent work environment.
Get Things Done Attitude: Be part of a results-driven team that values efficiency, creativity, and the drive to make a tangible impact in the industry.