
Senior Site Reliability Engineer
The Leaflet
full-time
Location Type: Remote
Location: Florida • United States
About the role
- Ensure the availability, reliability, and performance of a high-traffic Java-based application in a distributed environment.
- Troubleshoot and resolve complex issues in production and non-production environments.
- Participate in both pre- and post-deployment performance testing and monitoring efforts to improve application performance.
- Optimize Java application performance, ensuring efficient resource utilization and scaling.
- Deploy and manage the Grafana stack (Grafana, Prometheus, Loki) to provide real-time monitoring, logging, and alerting.
- Implement and refine observability strategies to enhance application and infrastructure visibility.
- Create and maintain dashboards, alerts, and logs for comprehensive monitoring of system health and performance.
- Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes of issues to prevent recurrence.
- Document and share lessons learned from incidents, contributing to a culture of continuous improvement.
- Work closely with developers, architects, and other engineers to design and implement solutions that improve application reliability.
- Collaborate closely with DevOps and NOC teams to support the application platform.
- Communicate SRE practices and principles to technical and non-technical stakeholders.
- Provide feedback and insights on application performance, potential improvements, and observability metrics.
Requirements
- Degree in computer science or a related field, or equivalent work experience
- 5+ years in SRE, DevOps, or similar Infrastructure roles
- Experience managing large-scale, high-availability production systems
- Track record of incident response and post-mortem processes
- Experience with capacity planning and performance optimization
- 3+ years hands-on experience managing production Kubernetes clusters
- Deep understanding of k8s architecture, networking, storage, and security
- Experience with cluster scaling (Karpenter), upgrades, and multi-cluster management
- Proficiency with kubectl, Helm, and Kubernetes operators
- Container orchestration and troubleshooting expertise
- Advanced expertise with the Grafana stack for dashboards, alerting, and visualization
- Hands-on experience with Grafana Alloy for telemetry data collection
- Proficiency in PromQL
- Experience with Loki for log aggregation and analysis
- Experience building comprehensive monitoring and alerting strategies
- Hands-on experience managing Java-based applications in large-scale, distributed environments, with a focus on JVM tuning and application optimization.
- Cloud Platform expertise (AWS, GCP, or Azure)
- Familiarity with infrastructure as code (IaC) tools like Terraform/Terragrunt or Ansible.
- ArgoCD proficiency for GitOps workflows and continuous deployment
- Strong scripting abilities in Bash, Python, or Go
- Experience with CI/CD pipelines and automation tools
- Configuration Management and deployment automation
- Strong troubleshooting skills, with a proactive approach to diagnosing and resolving performance bottlenecks.
- Proven experience managing on-call rotations, incident response, and root cause analysis.
- Ability to mentor junior team members
- Strong communication skills (both written and verbal), positive attitude, and ability to receive constructive feedback.
Benefits
- Competitive pay and benefits
- Flexible vacation allowance
- A hybrid / remote working environment
- Startup culture backed by a secure, global brand
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Java, Kubernetes, Grafana, Prometheus, Loki, Helm, Bash, Python, Go, Terraform
Soft Skills
troubleshooting, communication, mentoring, incident response, root cause analysis, collaboration, continuous improvement, feedback, proactive approach, positive attitude