Site Reliability Engineer, AI & Agentic Systems
ServiceLink
Site Reliability Engineer ensuring reliability, scalability, performance, and operational excellence in Azure-hosted systems for ServiceLink, a mortgage services company.
Tech Stack
Tools & technologies: Azure, Distributed Systems, DNS, Go, Grafana, Java, Kubernetes, Linux, Microservices, Postgres, Prometheus, Python, TCP/IP, Terraform
About the role
Key responsibilities & impact
- Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
- Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
- Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
- Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
- Create automation to eliminate manual operational tasks and reduce Mean Time to Recovery (MTTR)
- Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
- Integrate AI agents with Azure monitoring stack, CI/CD tooling, and incident management platforms
Requirements
What you’ll need
- 5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles
- Strong hands-on experience in production troubleshooting of distributed systems at scale
- Solid understanding of Linux internals, networking (TCP/IP, DNS, HTTP, TLS), and system performance tuning
- Deep hands-on experience with Microsoft Azure (compute, networking, storage, managed services, AKS)
- Strong knowledge of Kubernetes, container orchestration, Helm charts, and microservices architectures
- Proficiency in one or more programming languages: Python, Go, Java, or equivalent
- Experience with CI/CD pipelines (Azure DevOps, GitHub Actions) and Infrastructure as Code (Terraform, ARM Templates, Bicep)
- Hands-on experience building and operating observability stacks using Prometheus, Grafana, Loki, and Tempo
- Proven experience designing and executing performance and load testing for large-scale distributed applications
- Strong PostgreSQL administration skills, including query optimization, replication, backup/recovery, and performance tuning
- Hands-on experience building or integrating AI-powered automation in production environments
Benefits
Comp & perks
- Health insurance
- Vision insurance
- Dental insurance
- Life insurance
- 401(k) plans
- Employee Stock Purchase Plan
- Paid vacation
- Paid sick time
ATS Keywords
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Site Reliability Engineering, DevOps, Production Engineering, Linux internals, Networking, Kubernetes, Python, Go, Java, PostgreSQL
Soft Skills
incident troubleshooting, root cause analysis, performance testing, automation, scalability, high availability, fault tolerance, communication, leadership, problem-solving