ServiceLink

Site Reliability Engineer, AI & Agentic Systems

ServiceLink, a mortgage services company, is hiring a Site Reliability Engineer to ensure the reliability, scalability, performance, and operational excellence of its Azure-hosted systems.

  • Posted: 6/19/2026
  • Employment type: Full-time
  • Location: Plano, Texas, United States
  • Level: Mid-Level, Senior
  • Pay: $40 - $45 per hour

Tech Stack

Tools & technologies
Azure, Distributed Systems, DNS, Go, Grafana, Java, Kubernetes, Linux, Microservices, Postgres, Prometheus, Python, TCP/IP, Terraform

About the role

Key responsibilities & impact
  • Own end-to-end reliability of large-scale, Azure-hosted production systems, ensuring high availability, fault tolerance, and graceful degradation
  • Lead hands-on incident troubleshooting, root cause analysis (RCA), and post-incident reviews with actionable follow-ups
  • Build and operate resilient, scalable services on Microsoft Azure (AKS, App Services, Functions, Event Hubs, etc.)
  • Design and maintain comprehensive observability platforms using Prometheus for metrics, Loki for log aggregation, Tempo for distributed tracing, and Grafana for dashboarding and alerting
  • Create automation to eliminate manual operational tasks and reduce Mean Time to Recovery (MTTR)
  • Design, develop, and execute performance testing strategies for distributed systems and microservices, including load testing, stress testing, soak testing, and capacity planning
  • Integrate AI agents with the Azure monitoring stack, CI/CD tooling, and incident management platforms

Requirements

What you’ll need
  • 5+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering roles
  • Strong hands-on experience in production troubleshooting of distributed systems at scale
  • Solid understanding of Linux internals, networking (TCP/IP, DNS, HTTP, TLS), and system performance tuning
  • Deep hands-on experience with Microsoft Azure (compute, networking, storage, managed services, AKS)
  • Strong knowledge of Kubernetes, container orchestration, Helm charts, and microservices architectures
  • Proficiency in one or more programming languages: Python, Go, Java, or equivalent
  • Experience with CI/CD pipelines (Azure DevOps, GitHub Actions) and Infrastructure as Code (Terraform, ARM Templates, Bicep)
  • Hands-on experience building and operating observability stacks using Prometheus, Grafana, Loki, and Tempo
  • Proven experience designing and executing performance and load testing for large-scale distributed applications
  • Strong PostgreSQL administration skills, including query optimization, replication, backup/recovery, and performance tuning
  • Hands-on experience building or integrating AI-powered automation in production environments

Benefits

Comp & perks
  • Health insurance
  • Vision insurance
  • Dental insurance
  • Life insurance
  • 401(k) plans
  • Employee Stock Purchase Plan
  • Paid vacation
  • Paid sick time
