The Leaflet

Senior Site Reliability Engineer

Senior Site Reliability Engineer role focused on optimizing Java applications and pioneering AI-driven operations in high-traffic environments, collaborating with teams to improve reliability and performance across distributed systems.

Posted 6/10/2026 • Full-time • Remote • Florida • 🇺🇸 United States • Senior

Tech Stack

Tools & technologies
Ansible, AWS, Azure, Cloud, Go, Google Cloud Platform, Grafana, Java, Kubernetes, Prometheus, Python, Terraform

About the role

Key responsibilities & impact
  • Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment.
  • Troubleshoot and resolve complex issues across production and non-production environments.
  • Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance.
  • Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling.
  • Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting.
  • Implement and refine observability strategies that enhance visibility into application and infrastructure health.
  • Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring.
  • Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction.
  • Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization.
  • Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval.
  • Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents.
  • Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems.
  • Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving.
  • Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization.
  • Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence.
  • Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents.
  • Document and share lessons learned, contributing to a culture of continuous improvement.
  • Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements.
  • Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language.
  • Measure and report on toil reduction metrics to quantify the impact of automation initiatives.
  • Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities.
  • Collaborate with DevOps and NOC teams to support the application platform.
  • Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders.
  • Provide feedback on application performance, potential improvements, and observability metrics.

Requirements

What you’ll need
  • Degree in Computer Science or a related field, or equivalent professional experience.
  • 5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems.
  • 3+ years hands-on experience managing production Kubernetes clusters, including deep understanding of architecture, networking, storage, and security.
  • Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management.
  • Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting.
  • Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection.
  • Proficiency in PromQL and experience with Loki for log aggregation and analysis.
  • Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization.
  • Cloud platform expertise (AWS preferred; GCP or Azure also valued).
  • Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible.
  • ArgoCD proficiency for GitOps workflows and continuous deployment.
  • Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation.
  • Proven track record with on-call rotations, incident response, and root cause analysis.
  • 1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context.
  • Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks.
  • Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines.
  • Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent).
  • Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples.
  • Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows.
  • Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents.
  • Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems.

Benefits

Comp & perks
  • Competitive pay and benefits
  • Flexible vacation allowance
  • A hybrid / remote working environment
  • Startup culture backed by a secure, global brand

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools
Java, Kubernetes, Grafana, Prometheus, Loki, JVM tuning, Python, Bash, Go, Terraform
Soft Skills
incident response, root cause analysis, communication, collaboration, continuous improvement, problem-solving, automation, leadership, organizational skills, technical documentation