The Leaflet

Senior Site Reliability Engineer

full-time

Location Type: Remote

Location: Florida, United States

About the role

  • Ensure the availability, reliability, and performance of a high-traffic Java-based application in a distributed environment.
  • Troubleshoot and resolve complex issues in production and non-production environments.
  • Participate in both pre- and post-deployment performance testing and monitoring efforts to improve application performance.
  • Optimize Java application performance, ensuring efficient resource utilization and scaling.
  • Deploy and manage the Grafana stack (Grafana, Prometheus, Loki) to provide real-time monitoring, logging, and alerting.
  • Implement and refine observability strategies to enhance application and infrastructure visibility.
  • Create and maintain dashboards, alerts, and logs for comprehensive monitoring of system health and performance.
  • Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes of issues to prevent recurrence.
  • Document and share lessons learned from incidents, contributing to a culture of continuous improvement.
  • Work closely with developers, architects, and other engineers to design and implement solutions that improve application reliability.
  • Collaborate closely with DevOps and NOC teams to support the application platform.
  • Communicate SRE practices and principles to technical and non-technical stakeholders.
  • Provide feedback and insights on application performance, potential improvements, and observability metrics.

Requirements

  • Degree in computer science or a related field, or equivalent work experience
  • 5+ years in SRE, DevOps, or similar infrastructure roles
  • Experience managing large-scale, high-availability production systems
  • Track record of incident response and post-mortem processes
  • Experience with capacity planning and performance optimization
  • 3+ years hands-on experience managing production Kubernetes clusters
  • Deep understanding of k8s architecture, networking, storage, and security
  • Experience with cluster scaling (Karpenter), upgrades, and multi-cluster management
  • Proficiency with kubectl, Helm, and Kubernetes operators
  • Container orchestration and troubleshooting expertise
  • Advanced expertise with the Grafana stack for dashboards, alerting, and visualization
  • Hands-on experience with Grafana Alloy for telemetry data collection
  • Proficiency in PromQL
  • Experience with Loki for log aggregation and analysis
  • Experience building comprehensive monitoring and alerting strategies
  • Hands-on experience managing Java-based applications in large-scale, distributed environments, with a focus on JVM tuning and application optimization
  • Cloud platform expertise (AWS, GCP, or Azure)
  • Familiarity with infrastructure as code (IaC) tools such as Terraform/Terragrunt or Ansible
  • ArgoCD proficiency for GitOps workflows and continuous deployment
  • Strong scripting abilities in Bash, Python, or Go
  • Experience with CI/CD pipelines and automation tools
  • Configuration management and deployment automation
  • Strong troubleshooting skills, with a proactive approach to diagnosing and resolving performance bottlenecks
  • Proven experience managing on-call rotations, incident response, and root cause analysis
  • Ability to mentor junior team members
  • Strong written and verbal communication skills, a positive attitude, and the ability to receive constructive feedback

Benefits

  • Competitive pay and benefits
  • Flexible vacation allowance
  • A hybrid / remote working environment
  • Startup culture backed by a secure, global brand