Dimensional Fund Advisors

Senior Site Reliability Engineer – Observability

full-time

Location Type: Hybrid

Location: Austin, North Carolina, Texas, United States

About the role

  • Serve as a primary escalation point for production support involving the ELK Stack, Grafana, and New Relic
  • Own platform health, capacity planning, and performance tuning for on-premises observability infrastructure – including Elasticsearch cluster management, index lifecycle policies, and retention strategies
  • Monitor and maintain SLOs for the observability platforms, ensuring the tools engineers depend on are highly available and performant
  • Support engineering teams in onboarding to observability platforms – helping teams instrument their applications, build dashboards, and define meaningful alerts
  • Manage patching, upgrades, and configuration management across the observability stack
  • Collaborate with security to harden platform configurations and manage software vulnerabilities
  • Contribute to on-call rotations and maintain runbooks and escalation procedures
  • Design and build tooling/automation to reduce toil and improve the experience for teams using observability platforms
  • Lead or contribute to platform modernization initiatives – e.g., improving ingestion pipelines, scaling platform capacity, standardizing Grafana dashboard and alerting patterns, or evaluating new capabilities within the existing stack
  • Develop and maintain infrastructure-as-code (Terraform, Helm, Ansible, etc.) for platform components
  • Build and enforce standards around logging, metrics, and alerting that help engineering teams adopt observability best practices at scale
  • Participate in design reviews and contribute to the overall platform roadmap

Requirements

  • Bachelor’s degree in a technical field or equivalent practical experience
  • 5+ years of experience in SRE, DevOps, or platform engineering roles
  • Deep hands-on experience with the ELK Stack – Elasticsearch cluster operations, Logstash pipeline development, Kibana, and index lifecycle management
  • Strong experience with Grafana, including data source integrations, dashboard design, and alerting
  • Solid understanding of observability principles
  • Experience operating on-premises infrastructure, including capacity planning, server management, and the operational tradeoffs compared with managed cloud services
  • Proficiency in Python for automation and tooling; familiarity with shell scripting
  • Strong Linux systems knowledge and comfort working with configuration management tools (e.g., Ansible, Chef, Puppet)
  • Demonstrated ability to drive incidents to resolution and communicate clearly under pressure
  • A bias toward automation and a low tolerance for repetitive manual work

Benefits

  • Comprehensive benefits
  • Educational initiatives
  • Special celebrations of our history, culture, and growth