
Senior Site Reliability Engineer – Observability
Dimensional Fund Advisors
Full-time
Location Type: Hybrid
Location: Austin • North Carolina • Texas • United States
About the role
- Serve as a primary escalation point for production support involving the ELK Stack, Grafana, and New Relic
- Own platform health, capacity planning, and performance tuning for on-premises observability infrastructure – including Elasticsearch cluster management, index lifecycle policies, and retention strategies
- Monitor and maintain SLOs for the observability platforms, ensuring the tools engineers depend on are highly available and performant
- Support engineering teams in onboarding to observability platforms – helping teams instrument their applications, build dashboards, and define meaningful alerts
- Manage patching, upgrades, and configuration management across the observability stack
- Collaborate with security to harden platform configurations and manage software vulnerabilities
- Contribute to on-call rotations and maintain runbooks and escalation procedures
- Design and build tooling/automation to reduce toil and improve the experience for teams using observability platforms
- Lead or contribute to platform modernization initiatives – e.g., improving ingestion pipelines, scaling platform capacity, standardizing Grafana dashboard and alerting patterns, or evaluating new capabilities within the existing stack
- Develop and maintain infrastructure-as-code (Terraform, Helm, Ansible, etc.) for platform components
- Build and enforce standards around logging, metrics, and alerting that help engineering teams adopt observability best practices at scale
- Participate in design reviews and contribute to the overall platform roadmap
Requirements
- Bachelor’s degree in a technical field or equivalent practical experience
- 5+ years of experience in SRE, DevOps, or platform engineering roles
- Deep hands-on experience with the ELK Stack – Elasticsearch cluster operations, Logstash pipeline development, Kibana, and index lifecycle management
- Strong experience with Grafana, including data source integrations, dashboard design, and alerting
- Solid understanding of observability principles
- Experience operating on-premises infrastructure, including capacity planning, server management, and the operational tradeoffs with managed cloud services
- Proficiency in Python for automation and tooling; familiarity with shell scripting
- Strong Linux systems knowledge and comfort working with configuration management tools (e.g., Ansible, Chef, Puppet)
- Demonstrated ability to drive incidents to resolution and communicate clearly under pressure
- A bias toward automation and a low tolerance for repetitive manual work
Benefits
- comprehensive benefits
- educational initiatives
- special celebrations of our history, culture, and growth
Applicant Tracking System Keywords
Hard Skills & Tools
ELK Stack, Elasticsearch, Logstash, Kibana, Grafana, Terraform, Helm, Ansible, Python, Linux
Soft Skills
communication, incident resolution, collaboration, leadership, capacity planning, performance tuning, automation, problem-solving, organizational skills, adaptability
Certifications
Bachelor’s degree in a technical field