
Senior Site Reliability Engineer – Workflow Automation
Dimensional Fund Advisors
Full-time
Location Type: Hybrid
Location: Austin • North Carolina • Texas • United States
About the role
- Serve as a primary escalation point for production support involving Airflow and UC4: answering end-user inquiries, performing incident root cause analysis, and implementing go-forward solutions
- Own and continuously improve SLOs, SLIs, and error budgets for orchestration platforms
- Monitor platform health, capacity, and performance; proactively identify and remediate issues before they impact users
- Partner with data engineering and application teams to troubleshoot DAG failures, job dependencies, and scheduling issues
- Manage patching, upgrades, and configuration management for Airflow and UC4 environments
- Collaborate with security to harden platform configurations and manage software vulnerabilities
- Contribute to on-call rotations and maintain runbooks and escalation procedures
- Design and build tooling and automation to reduce toil and improve developer experience for teams that depend on Airflow and UC4
- Lead or contribute to platform modernization initiatives (e.g., migrating workloads, improving deployment pipelines, containerizing components, or adopting managed service offerings)
- Develop and maintain infrastructure-as-code (Terraform, Helm, Ansible, etc.) for platform components
- Build observability solutions (e.g., dashboards, alerting, log aggregation) that give teams better visibility into their workflows
- Build and enforce standards around platform use that help engineering teams adopt best practices at scale
- Participate in design reviews and contribute to the overall platform roadmap
Requirements
- Bachelor’s degree in a technical field or equivalent practical experience
- 5+ years of experience in SRE, DevOps, or platform engineering roles
- Deep hands-on experience with Apache Airflow – ideally including distributed executor configurations (Celery or Kubernetes), DAG authoring best practices, and multi-environment deployments
- Experience operating enterprise job scheduling platforms (e.g., Automic/UC4 or Control-M)
- Strong Linux and Windows systems knowledge and comfort working in cloud environments (AWS preferred)
- Proficiency in Python for automation and tooling; familiarity with shell scripting
- Experience with container orchestration (Kubernetes, Docker) and CI/CD pipelines
- Solid understanding of observability principles – metrics, logging, tracing – and tools like ELK, Grafana, and Prometheus
- Demonstrated ability to drive incidents to resolution and communicate clearly under pressure
- A bias toward automation and a low tolerance for repetitive manual work
Benefits
- Comprehensive benefits
- Educational initiatives
- Special celebrations of our history, culture, and growth
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard Skills & Tools
Apache Airflow, UC4, Terraform, Helm, Ansible, Python, Kubernetes, Docker, CI/CD, observability
Soft Skills
incident resolution, communication under pressure, collaboration, problem-solving, leadership