
Site Reliability Engineer
The Leaflet
full-time
Location Type: Remote
Location: Remote • 🇵🇱 Poland
Job Level
Junior, Mid-Level
Tech Stack
Ansible, AWS, Azure, Cloud, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Terraform
About the role
- Maintain and improve the reliability, scalability, and performance of our Java-based application.
- Manage and monitor the applications and infrastructure.
- Use the Grafana stack (Grafana, Loki, Prometheus) to ensure a high level of observability.
- Implement robust monitoring, alerting, and logging solutions.
- Ensure the availability, reliability, and performance of a high-traffic Java-based application in a distributed environment.
- Troubleshoot and resolve complex issues in production and non-production environments.
- Participate in both pre- and post-deployment performance testing and monitoring efforts.
- Optimize Java application performance, ensuring efficient resource utilization and scaling.
- Deploy and manage the Grafana stack to provide real-time monitoring, logging, and alerting.
- Create and maintain dashboards, alerts, and logs for comprehensive monitoring of system health and performance.
- Support the operations team’s incident response efforts and participate in post-mortems.
- Document and share lessons learned from incidents.
Requirements
- Degree in computer science or a related field, or equivalent work experience
- 2-3 years of experience in SRE, DevOps, or similar infrastructure roles
- Experience managing large-scale, high-availability production systems
- Track record of incident response and post-mortem processes
- Experience with capacity planning and performance optimization
- 1+ years hands-on experience managing production Kubernetes clusters
- Deep understanding of k8s architecture, networking, storage, and security
- Experience with cluster scaling (Karpenter), upgrades, and multi-cluster management
- Proficiency with kubectl, Helm, and Kubernetes operators
- Container orchestration and troubleshooting knowledge
- Expertise with the Grafana stack for dashboards, alerting, and visualization
- Hands-on experience with Grafana Alloy for telemetry data collection
- Proficiency in PromQL
- Experience with Loki for log aggregation and analysis
- Experience building comprehensive monitoring and alerting strategies
- Hands-on experience managing Java-based applications in large-scale, distributed environments, with a focus on JVM tuning and application optimization.
- Cloud platform expertise (AWS, GCP, or Azure)
- Familiarity with infrastructure as code (IaC) tools such as Terraform/Terragrunt or Ansible.
- ArgoCD proficiency for GitOps workflows and continuous deployment
- Scripting abilities in Bash, Python, or Go
- Experience with CI/CD pipelines and automation tools
- Experience with configuration management and deployment automation
- Strong troubleshooting skills, with a proactive approach to diagnosing and resolving performance bottlenecks.
- Proven experience in on-call rotations, incident response, and root cause analysis.
- Strong communication skills (both written and verbal), positive attitude, and ability to receive constructive feedback.
Benefits
- Competitive pay and benefits.
- Start-up culture backed by a secure, global brand.
- Flexible vacation allowance.
- Internal growth and development.
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Java, Kubernetes, Grafana, Prometheus, Loki, Helm, Bash, Python, Go, CI/CD
Soft skills
troubleshooting, incident response, communication, proactive approach, team collaboration, documentation, performance optimization, capacity planning, root cause analysis, positive attitude