Site Reliability Engineer, AI & Agentic Systems

ServiceLink

ServiceLink, a mortgage services company, is hiring a Site Reliability Engineer to ensure the reliability, scalability, performance, and operational excellence of its Azure-hosted systems.

Posted 6/19/2026 • Full-time • Plano • Texas • 🇺🇸 United States • Mid-Level • Senior • 💰 $40 - $45 per hour

Tech Stack

Tools & technologies

Azure, Distributed Systems, DNS, Go, Grafana, Java, Kubernetes, Linux, Microservices, Postgres, Prometheus, Python, TCP/IP, Terraform

About the role

Key responsibilities & impact

Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
Create automation to eliminate manual operational tasks and reduce Mean Time to Recovery (MTTR)
Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
Integrate AI agents with the Azure monitoring stack, CI/CD tooling, and incident management platforms

Requirements

What you’ll need

5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles
Strong hands-on experience in production troubleshooting of distributed systems at scale
Solid understanding of Linux internals, networking (TCP/IP, DNS, HTTP, TLS), and system performance tuning
Deep hands-on experience with Microsoft Azure (compute, networking, storage, managed services, AKS)
Strong knowledge of Kubernetes, container orchestration, Helm charts, and microservices architectures
Proficiency in one or more programming languages: Python, Go, Java, or equivalent
Experience with CI/CD pipelines (Azure DevOps, GitHub Actions) and Infrastructure as Code (Terraform, ARM Templates, Bicep)
Hands-on experience building and operating observability stacks using Prometheus, Grafana, Loki, and Tempo
Proven experience designing and executing performance and load testing for large-scale distributed applications
Strong PostgreSQL administration skills, including query optimization, replication, backup/recovery, and performance tuning
Hands-on experience building or integrating AI-powered automation in production environments

Benefits

Comp & perks

Health insurance
Vision insurance
Dental insurance
Life insurance
401(k) plans
Employee Stock Purchase Plan
Paid vacation
Paid sick time

ATS Keywords

Applicant Tracking System Keywords

Tip: use these terms in your resume and cover letter to boost ATS matches.

Hard Skills & Tools

Site Reliability Engineering, DevOps, Production Engineering, Linux internals, Networking, Kubernetes, Python, Go, Java, PostgreSQL

Soft Skills

incident troubleshooting, root cause analysis, performance testing, automation, scalability, high availability, fault tolerance, communication, leadership, problem-solving