
Senior Site Reliability Engineer
Analytic Partners
full-time
Location Type: Hybrid
Location: Dallas • Texas • United States
About the role
- Own the Internal Developer Platform (IDP) as a product, treating engineering teams as customers and optimizing for reliability, usability, and delivery velocity.
- Define and execute a platform roadmap aligned with business priorities, developer needs, and long-term scalability.
- Design, build, and evolve paved roads for application delivery, including CI/CD pipelines, infrastructure templates, service scaffolding, and standardized deployment patterns.
- Build self-service capabilities that enable teams to provision, deploy, observe, and operate services with minimal friction.
- Create and maintain reusable platform abstractions across AWS and Azure that standardize security, reliability, networking, and observability.
- Reduce developer cognitive load by abstracting unnecessary complexity while enforcing clear guardrails for security, cost, and compliance.
- Partner closely with application, product, and security teams to embed reliability, scalability, and security by design.
- Establish and evolve platform standards for logging, monitoring, alerting, tracing, and incident response workflows.
- Define, measure, and manage SLIs, SLOs, and error budgets for shared platform services.
- Drive the reduction of operational toil through automation, standardization, and platform-first solutions.
- Ensure shared platform services meet high standards for availability, performance, resilience, and scalability.
- Own system-to-system integration and messaging patterns used across the platform.
- Lead capacity planning, demand forecasting, and performance tuning for platform services.
- Plan and execute zero-downtime upgrades, migrations, and releases of platform components.
- Lead platform-level incident response workflows and post-incident reviews, and drive systemic improvements rather than one-off fixes.
- Evaluate incoming platform requests and translate them into scalable, productized capabilities.
- Mentor engineers and drive platform adoption through documentation, enablement, and technical evangelism.
- Participate in a 24x7 on-call rotation as an escalation point for platform reliability and availability issues.
- Operate effectively in ambiguous problem spaces, making sound architectural and product decisions with limited guidance.
Requirements
- Bachelor’s degree in Computer Science or equivalent practical experience.
- 6+ years of experience in Platform Engineering, Site Reliability Engineering, DevOps, or Systems Engineering roles.
- Strong expertise in Linux and Windows operating systems.
- Advanced automation and scripting skills using Python, Bash, and/or PowerShell.
- Deep, hands-on experience designing and operating AWS and Azure platforms at scale.
- Strong experience building and operating CI/CD platforms (Jenkins, GitHub Actions or equivalent).
- Strong experience with Infrastructure as Code and configuration management (Terraform, CloudFormation, ARM, or similar).
- Production experience with container and orchestration platforms such as Docker and Kubernetes.
- In-depth experience with the HashiCorp ecosystem (Nomad, Consul, Vault).
- Strong understanding of distributed systems, cloud-native architectures, and reliability patterns.
- Experience designing and operating observability platforms (e.g., Splunk, Sumo Logic, or similar).
- Familiarity with security and compliance practices, including vulnerability scanning and enterprise security tooling.
- Strong understanding of the software delivery lifecycle, release engineering, and platform lifecycle management.
- Experience working in Agile / DevOps environments with a strong product mindset.
- Demonstrated ability to influence without authority, set standards, and drive adoption across teams.
- Excellent communication skills, able to translate platform capabilities into clear developer value.
- Strong problem-solving skills with a bias toward durable, scalable solutions over short-term fixes.
- A mindset of continuous improvement, curiosity, and learning.
- Comfortable supporting a global, follow-the-sun operation when needed.
Benefits
- Regular Employee
- Flexibility in career paths for self-development
- Opportunities for diversity, equity, and inclusion
Applicant Tracking System Keywords
Tip: use these terms in your resume and cover letter to boost ATS matches.
Hard skills
Platform Engineering, Site Reliability Engineering, DevOps, Systems Engineering, Linux, Windows, Python, Bash, PowerShell, CI/CD
Soft skills
communication, problem-solving, influence without authority, continuous improvement, curiosity, learning, mentoring, technical evangelism, operating in ambiguity, collaboration
Certifications
Bachelor’s degree in Computer Science