The Leaflet

Site Reliability Engineer

full-time

Location Type: Remote

Location: Remote • 🇵🇱 Poland

Job Level

Junior, Mid-Level

Tech Stack

Ansible, AWS, Azure, Cloud, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Terraform

About the role

  • Maintain and improve the reliability, scalability, and performance of our Java-based application.
  • Manage and monitor the applications and infrastructure.
  • Use the Grafana stack (Grafana, Loki, Prometheus) to ensure a high level of observability.
  • Implement robust monitoring, alerting, and logging solutions.
  • Ensure the availability, reliability, and performance of a high-traffic Java-based application in a distributed environment.
  • Troubleshoot and resolve complex issues in production and non-production environments.
  • Participate in both pre- and post-deployment performance testing and monitoring efforts.
  • Optimize Java application performance, ensuring efficient resource utilization and scaling.
  • Deploy and manage the Grafana stack to provide real-time monitoring, logging, and alerting.
  • Create and maintain dashboards, alerts, and logs for comprehensive monitoring of system health and performance.
  • Support the operations team’s incident response efforts and participate in post-mortems.
  • Document and share lessons learned from incidents.

Requirements

  • Degree in computer science or a related field, or equivalent work experience
  • 2-3 years of experience in SRE, DevOps, or similar infrastructure roles
  • Experience managing large-scale, high-availability production systems
  • Track record of incident response and post-mortem processes
  • Experience with capacity planning and performance optimization
  • 1+ years hands-on experience managing production Kubernetes clusters
  • Deep understanding of k8s architecture, networking, storage, and security
  • Experience with cluster scaling (Karpenter), upgrades, and multi-cluster management
  • Proficiency with kubectl, Helm, and Kubernetes operators
  • Container orchestration and troubleshooting knowledge
  • Expertise with the Grafana stack for dashboards, alerting, and visualization
  • Hands-on experience with Grafana Alloy for telemetry data collection
  • Proficiency in PromQL
  • Experience with Loki for log aggregation and analysis
  • Experience building comprehensive monitoring and alerting strategies
  • Hands-on experience managing Java-based applications in large-scale, distributed environments, with a focus on JVM tuning and application optimization.
  • Cloud Platform expertise (AWS, GCP, or Azure)
  • Familiarity with infrastructure as code (IaC) tools like Terraform/Terragrunt or Ansible.
  • ArgoCD proficiency for GitOps workflows and continuous deployment
  • Scripting abilities in Bash, Python, or Go
  • Experience with CI/CD pipelines and automation tools
  • Experience with configuration management and deployment automation
  • Strong troubleshooting skills, with a proactive approach to diagnosing and resolving performance bottlenecks.
  • Proven experience in on-call rotations, incident response, and root cause analysis.
  • Strong communication skills (both written and verbal), positive attitude, and ability to receive constructive feedback.

Benefits

  • Competitive pay and benefits.
  • Start-up culture backed by a secure, global brand.
  • Flexible vacation allowance.
  • Internal growth and development.
